Installing and using Hugging Face transformers and datasets

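1. Installation

2. Tokenization

2.1 tokenizer.encode()

2.2 tokenizer.encode_plus()

2.3 tokenizer.batch_encode_plus()

3. Adding new words or special tokens

3.1 tokenizer.add_tokens()

3.2 tokenizer.add_special_tokens()

4. Using datasets

4.1 Loading datasets

4.2 Reading items from a dataset

4.3 Sorting a dataset by label

4.11 The map() function: apply the same change to every example

4.12 Setting the format

4.13 Exporting and loading common formats such as CSV

4.14 Exporting and loading common formats such as JSON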

1. Installation

# Install from the command line

pip install transformers -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install datasets -i https://pypi.tuna.tsinghua.edu.cn/simple

# Install from inside a Jupyter notebook

!pip install transformers -i https://pypi.tuna.tsinghua.edu.cn/simple

!pip install datasets -i https://pypi.tuna.tsinghua.edu.cn/simple
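
After running the commands above, it can help to confirm that both packages are visible to the interpreter you are actually using. The following is a minimal sketch that relies only on the standard library; the helper name installed_version is made up for this example and is not part of either package.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the status of the two packages this article uses.
for name in ("transformers", "datasets"):
    version = installed_version(name)
    print(name, version if version else "not installed")
```

If either line prints "not installed" inside a notebook, the notebook kernel is probably using a different Python environment than the one pip installed into.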

2. Tokenization

2.1 tokenizer.encode()

from transformers import BertTokenizer

# Load the bert-base-chinese tokenizer from a local directory
tokenizer = BertTokenizer.from_pretrained(
    pretrained_model_name_or_path=r'E:\ALOT\10_deep_learning\data\bert-base-chinese',
    cache_dir=None,   # cache directory (None uses the default location)
    force_download=False  # whether to download again even if the files already exist
)


sents = [
    '选择珠江花园的原因就是方便。',
    '笔记本的键盘确实爽。',
    '房间太小。其他的都一般。',
    '今天才知道这书还有第6卷,真有点郁闷.',
    '机器背面似乎被撕了张什么标签，残胶还在。',
]


out = tokenizer.encode(
    text=sents[0],
    text_pair=sents[1],
    # truncate when the sequence is longer than max_length
    truncation=True,
    # pad shorter sequences up to max_length
    padding='max_length',
    add_special_tokens=True,  # add special tokens such as [CLS] and [SEP]
    max_length=30,
    # None (the default) returns a Python list; or choose 'tf' (TensorFlow), 'pt' (PyTorch), 'np' (NumPy)
    return_tensors=None
)


print(out)

[101, 6848, 2885, 4403, 3736, 5709, 1736, 4638, 1333, 1728, 2218, 3221, 3175, 912, 511, 102, 5011, 6381, 3315, 4638, 7241, 4669, 4802, 2141, 4272, 511, 102, 0, 0, 0]

tokenizer.decode(out)

'[CLS] 选择珠江花园的原因就是方便。 [SEP] 笔记本的键盘确实爽。 [SEP] [PAD] [PAD] [PAD]'
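
To make the layout of that output easier to see, here is a small pure-Python sketch that rebuilds the same id list by hand. It is only an illustration, not how the library is implemented: the function name build_pair_ids is hypothetical, the ids 101, 102 and 0 for [CLS], [SEP] and [PAD] are read off the decoded output above, the per-sentence ids are copied from the printed list, and truncation is deliberately left out.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids taken from the printed output above


def build_pair_ids(ids_a, ids_b, max_length):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP], then pad to max_length."""
    ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    if len(ids) > max_length:
        # The real tokenizer would truncate here; this sketch does not.
        raise ValueError("pair is longer than max_length")
    return ids + [PAD_ID] * (max_length - len(ids))


# Token ids of sents[0] and sents[1], copied from the output above.
ids_a = [6848, 2885, 4403, 3736, 5709, 1736, 4638, 1333, 1728, 2218, 3221, 3175, 912, 511]
ids_b = [5011, 6381, 3315, 4638, 7241, 4669, 4802, 2141, 4272, 511]

pair_ids = build_pair_ids(ids_a, ids_b, max_length=30)
print(pair_ids)
```

The 14 + 10 sentence tokens plus three special tokens give 27 ids, so three [PAD] ids are appended to reach max_length=30.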

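2.2 tokenizer.encode_plus()

# Extended version of encode() that returns additional fields
out = tokenizer.encode_plus(
    text=sents[0],
    text_pair=sents[1],
    truncation=True,
    padding='max_length',
    add_special_tokens=True,
    max_length=30,
    return_tensors=None,
    
    # parameters added by the extended version
    return_token_type_ids=True,  # 0 marks the first sentence, 1 marks the second
    return_attention_mask=True,  # 1 marks real tokens, 0 marks padding
    return_special_tokens_mask=True,  # 1 marks special tokens
    return_length=True  # return the length of the encoded sequence
)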


for k, v in out.items():
    print(k,':', v)

input_ids : [101, 6848, 2885, 4403, 3736, 5709, 1736, 4638, 1333, 1728, 2218, 3221, 3175, 912, 511, 102, 5011, 6381, 3315, 4638, 7241, 4669, 4802, 2141, 4272, 511, 102, 0, 0, 0]
token_type_ids : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
special_tokens_mask : [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
length : 30

print(out)  # with several outputs requested, the result behaves like a dict

{'input_ids': [101, 6848, 2885, 4403, 3736, 5709, 1736, 4638, 1333, 1728, 2218, 3221, 3175, 912, 511, 102, 5011, 6381, 3315, 4638, 7241, 4669, 4802, 2141, 4272, 511, 102, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], 'special_tokens_mask': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], 'length': 30}
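
The three mask fields follow directly from the segment lengths. The sketch below recomputes them for a pair with 14 and 10 tokens, which are the lengths of the two sentences in the output above. It is a hand-written illustration under those assumptions; the name pair_masks is hypothetical and is not a transformers function.

```python
def pair_masks(len_a, len_b, max_length):
    """Compute the mask fields for a padded [CLS] A [SEP] B [SEP] sequence."""
    used = len_a + len_b + 3          # three special tokens: [CLS], [SEP], [SEP]
    pad = max_length - used
    if pad < 0:
        raise ValueError("pair is longer than max_length")
    return {
        # [CLS] A [SEP] belong to segment 0, B [SEP] to segment 1, padding gets 0
        "token_type_ids": [0] * (len_a + 2) + [1] * (len_b + 1) + [0] * pad,
        # special tokens and padding are flagged with 1
        "special_tokens_mask": [1] + [0] * len_a + [1] + [0] * len_b + [1] + [1] * pad,
        # real tokens (including [CLS]/[SEP]) get 1, padding gets 0
        "attention_mask": [1] * used + [0] * pad,
    }


masks = pair_masks(len_a=14, len_b=10, max_length=30)
for name, values in masks.items():
    print(name, ":", values)
```

Note that the trailing zeros in token_type_ids are padding, not a return to the first sentence; attention_mask is what tells the model to ignore those positions.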

2.3 tokenizer.batch_encode_plus()

# Encode several sentences in one call
out_batch_encode = tokenizer.batch_encode_plus(
    batch_text_or_text_pairs=[sents[0], sents[1], sents[2]],
    truncation=True,
    padding='max_length',
    add_special_tokens=True,
    max_length=15,
    return_tensors=None,
    
    # parameters added by the extended version
    return_token_type_ids=True,  # 0 marks the first sentence, 1 marks the second; zeros after the 1s are padding
    return_attention_mask=True,  # 1 marks real tokens, 0 marks padding
    return_special_tokens_mask=True,  # 1 marks special tokens
    return_length=True  # return the length of each encoded sequence
)


for k, v in out_batch_encode.items():
    print(k, ':', v)

input_ids : [[101, 6848, 2885, 4403, 3736, 5709, 1736, 4638, 1333, 1728, 2218, 3221, 3175, 912, 102], [101, 5011, 6381, 3315, 4638, 7241, 4669, 4802, 2141, 4272, 511, 102, 0, 0, 0], [101, 2791, 7313, 1922, 2207, 511, 1071, 800, 4638, 6963, 671, 5663, 511, 102, 0]]
token_type_ids : [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
special_tokens_mask : [[1, 0,
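
The input_ids rows above show truncation and padding working together: the first sentence loses its final token to fit, while the other two are padded. A minimal sketch of that behaviour for single sentences, assuming max_length=15 and the special-token ids seen earlier (101, 102, 0), could look like this. The helper encode_single is hypothetical and only mimics the visible result, not the library's internal truncation strategies.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # ids as they appear in the outputs above


def encode_single(ids, max_length):
    """Truncate to leave room for [CLS] and [SEP], wrap, then pad to max_length."""
    kept = ids[: max_length - 2]                 # drop tokens from the end if too long
    out = [CLS_ID] + kept + [SEP_ID]
    return out + [PAD_ID] * (max_length - len(out))


# Token ids of sents[0], sents[1], sents[2] as seen in the earlier outputs.
batch = [
    [6848, 2885, 4403, 3736, 5709, 1736, 4638, 1333, 1728, 2218, 3221, 3175, 912, 511],
    [5011, 6381, 3315, 4638, 7241, 4669, 4802, 2141, 4272, 511],
    [2791, 7313, 1922, 2207, 511, 1071, 800, 4638, 6963, 671, 5663, 511],
]

encoded = [encode_single(ids, max_length=15) for ids in batch]
for row in encoded:
    print(row)
```

Every row ends up with exactly 15 ids, which is what allows the batch to be stacked into a single tensor when return_tensors is set.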
