Python-Based Natural Language Processing Series (33): Huggingface Basics and Pipeline

Enterprise Development | 2024-11-06 19:34:23

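1. Introduction

Huggingface's transformers library is an important open-source tool in the NLP field. Like scikit-learn, it ships a large set of built-in functionality and models. Its modular design lets users load pre-trained models easily and fine-tune them as needed. This article explores one of its core features, the pipeline, and shows how to use it for sentiment analysis, text generation, translation, question answering, and other tasks.

Before starting, make sure the dependencies Huggingface needs are installed. The examples below also assume a PyTorch backend, which can be installed with pip install torch: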

pip install datasets evaluate transformers sentencepiece

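2. Pipeline Basics

The pipeline is the most basic object in the transformers library. It connects a model with the necessary preprocessing and postprocessing steps, so we can pass in raw text and get a meaningful result back.

2.1 Creating a Sentiment Analysis Pipeline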

from transformers import pipeline

# Load the sentiment analysis model
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("I've been waiting for a HuggingFace course my whole life."))

You can even pass several sentences at once for batch analysis:

print(classifier([
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!"
]))

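A pipeline processes its input in three main steps:

Preprocessing: convert the text into a format the model can understand.
Model inference: pass the preprocessed input to the model.
Postprocessing: convert the model output into an interpretable result.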
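The three steps above can be sketched with a toy classifier. This is a minimal illustration and not the library's implementation: the vocabulary, the per-token scores, and the function names are all hypothetical stand-ins for what a real checkpoint provides.

```python
# Hypothetical toy vocabulary and per-token (negative, positive) scores.
# A real pipeline loads a tokenizer and model weights from a checkpoint.
VOCAB = {"love": 0, "hate": 1, "this": 2, "<unk>": 3}
TOKEN_SCORES = {0: (-2.0, 2.0), 1: (2.0, -2.0), 2: (0.0, 0.0), 3: (0.0, 0.0)}
LABELS = ["NEGATIVE", "POSITIVE"]

def preprocess(text):
    """Step 1: turn raw text into token ids."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in cleaned.split()]

def infer(token_ids):
    """Step 2: produce one raw score (logit) per label."""
    negative = sum(TOKEN_SCORES[t][0] for t in token_ids)
    positive = sum(TOKEN_SCORES[t][1] for t in token_ids)
    return [negative, positive]

def postprocess(logits):
    """Step 3: map the highest logit to a readable label."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return {"label": LABELS[best]}

def toy_pipeline(text):
    return postprocess(infer(preprocess(text)))

print(toy_pipeline("I love this!"))
print(toy_pipeline("I hate this!"))
```

The real pipeline also reports a confidence score for each label, which this sketch leaves out.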

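3. Available Pipeline Types

The currently available pipelines include:

Feature extraction (feature-extraction)
Mask filling (fill-mask)
Named entity recognition (ner)
Question answering (question-answering)
Sentiment analysis (sentiment-analysis)
Text generation (text-generation)
Summarization (summarization)
Translation (translation)
Zero-shot classification (zero-shot-classification)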

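3.1 Zero-Shot Classification

Zero-shot classification needs no fine-tuning of the model; it classifies text directly against the labels you supply: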

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
print(classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"]
))
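A rough sketch of the idea behind NLI-based zero-shot classification follows. Each candidate label is turned into a hypothesis sentence, the NLI model scores how strongly the input entails it, and the entailment scores are normalized across labels. The template string is assumed to match the library default, the logits are made up for illustration, and the helper names are hypothetical.

```python
import math

def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]

def rank_labels(labels, entailment_logits):
    """Softmax the per-label entailment logits and sort labels by score."""
    top = max(entailment_logits)
    exps = [math.exp(x - top) for x in entailment_logits]
    total = sum(exps)
    scored = [(label, e / total) for label, e in zip(labels, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

labels = ["education", "politics", "business"]
print(build_hypotheses(labels))

# Made-up logits standing in for the NLI model's entailment outputs.
ranked = rank_labels(labels, [2.0, -1.0, 0.5])
print(ranked)
```

Because the labels only appear inside the hypothesis text, they can be changed freely without retraining the model.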

3.2 Text Generation

Given a prompt, the model automatically generates the rest of the text:

generator = pipeline("text-generation", model="gpt2", max_length=30, pad_token_id=0)
print(generator("In this course, we will teach you how to"))
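Generation is autoregressive: the model repeatedly predicts a next token and appends it to the sequence. The sketch below uses a hypothetical lookup table in place of a model and greedy selection for simplicity; the actual pipeline may sample instead, so its output can vary between runs.

```python
# Hypothetical next-token scores keyed by the last token only.
# A real model computes these scores from the whole prefix.
NEXT_TOKEN_SCORES = {
    "how": {"to": 0.9, "we": 0.1},
    "to": {"use": 0.6, "train": 0.4},
    "use": {"models": 0.7, "<eos>": 0.3},
    "models": {"<eos>": 1.0},
}

def greedy_generate(prompt_tokens, max_length=30):
    """Append the highest-scoring next token until <eos> or max_length tokens."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        candidates = NEXT_TOKEN_SCORES.get(tokens[-1])
        if not candidates:
            break  # no known continuation for this token
        next_token = max(candidates, key=candidates.get)
        if next_token == "<eos>":
            break  # the end-of-sequence marker stops generation
        tokens.append(next_token)
    return tokens

print(greedy_generate(["how"]))
print(greedy_generate(["how"], max_length=2))
```

As in the pipeline call above, max_length here counts the prompt tokens plus the generated ones.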

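3.3 Mask Filling

Mask filling predicts the token hidden behind a mask placeholder (<mask> for this model) and is one of the key tasks for language models. The top_k argument sets how many candidates are returned: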

unmasker = pipeline("fill-mask", model="distilroberta-base")
print(unmasker("This course will teach you all about <mask> models.", top_k=2))
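To make the role of top_k concrete, here is a small sketch that keeps the k most probable candidates for the masked position. The tokens and probabilities are placeholders rather than real model output, and top_k_candidates is a hypothetical helper; only the token_str and score keys mirror the fields the pipeline returns.

```python
import heapq

def top_k_candidates(token_probs, k=2):
    """Return the k highest-probability tokens as pipeline-style dicts."""
    best = heapq.nlargest(k, token_probs.items(), key=lambda item: item[1])
    return [{"token_str": token, "score": prob} for token, prob in best]

# Placeholder tokens and probabilities, not real model output.
token_probs = {"token_a": 0.10, "token_b": 0.45, "token_c": 0.30, "token_d": 0.15}
print(top_k_candidates(token_probs, k=2))
```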

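3.4 Named Entity Recognition

Huggingface also supports a variety of NER models. Setting aggregation_strategy="simple" merges the tokens that belong to the same entity into a single result: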

ner = pipeline("ner", aggregation_strategy="simple", model="dbmdz/bert-large-cased-finetuned-conll03-english")
print(ner("My name is Sylvain and I work at Hugging Face in Brooklyn."))
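The grouping step can be approximated as follows. This is a simplified sketch that works on whole words with hand-written B-/I- tags; the real pipeline works on sub-word tokens and also returns scores and character offsets. group_entities is a hypothetical helper, and the tags are illustrative rather than actual model output.

```python
def group_entities(tagged_tokens):
    """Merge consecutive tokens into entity spans using their B-/I- tags."""
    entities = []
    current = None
    for word, tag in tagged_tokens:
        if tag == "O":
            current = None  # outside any entity: close the open span
            continue
        prefix, entity_type = tag.split("-", 1)
        if current is not None and prefix == "I" and current["entity_group"] == entity_type:
            current["word"] += " " + word  # continue the open span
        else:
            current = {"entity_group": entity_type, "word": word}
            entities.append(current)
    return entities

# Hand-written tags for illustration only.
tagged = [
    ("My", "O"), ("name", "O"), ("is", "O"), ("Sylvain", "B-PER"),
    ("and", "O"), ("I", "O"), ("work", "O"), ("at", "O"),
    ("Hugging", "B-ORG"), ("Face", "I-ORG"), ("in", "O"), ("Brooklyn", "B-LOC"),
]
print(group_entities(tagged))
```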

3.5 Question Answering

The question-answering pipeline extracts the answer to a question from the context you supply:

question_answerer = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
print(question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn."
))
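Extractive QA models typically output one start score and one end score per context token, and the answer is the span with the best combined score. The sketch below shows one common way to pick that span; it is not necessarily how the library scores spans internally. The logits are made up and best_span is a hypothetical helper.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end) maximizing start + end score, with start <= end."""
    best, best_score = None, float("-inf")
    for i, start_score in enumerate(start_logits):
        last = min(i + max_answer_len, len(end_logits))
        for j in range(i, last):
            score = start_score + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

tokens = ["I", "work", "at", "Hugging", "Face", "in", "Brooklyn"]
# Made-up logits standing in for the model's outputs.
start_logits = [0.1, 0.2, 0.3, 4.0, 0.5, 0.2, 1.0]
end_logits = [0.1, 0.1, 0.2, 0.5, 3.5, 0.1, 2.0]

start, end = best_span(start_logits, end_logits)
print(" ".join(tokens[start:end + 1]))
```

Searching only spans with start <= end matters: taking the two argmax positions independently can yield an end that precedes the start.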

3.6 Summarization

Generate a text summary with a pre-trained model:

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", max_length=100)
print(summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of 
    graduates in traditional engineering disciplines such as mechanical, civil, 
    electrical, chemical, and aeronautical engineering declined, but in most of 
    the premier American universities engineering curricula now concentrate on 
    and encourage largely the study of engineering science...
    """
))

3.7 Translation

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
print(translator("Ce cours est produit par Hugging Face."))

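4. Overview of Transformer Models

Huggingface's transformer models fall into three main categories:

BERT-like (encoder models): suited to classification, NER, and similar tasks.
GPT-like (decoder models): suited to generation tasks.
BART/T5-like (encoder-decoder models): suited to translation, summarization, and similar tasks.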
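One architectural difference behind these categories is the attention pattern: encoders let every token attend to the whole input, while decoders restrict each token to the positions before it, which is what makes left-to-right generation possible. The sketch below builds both masks as plain nested lists; the function names are hypothetical and 1 means "may attend".

```python
def encoder_mask(n):
    """Bidirectional: every position may attend to every position."""
    return [[1] * n for _ in range(n)]

def decoder_mask(n):
    """Causal: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

print("encoder:")
for row in encoder_mask(4):
    print(row)

print("decoder:")
for row in decoder_mask(4):
    print(row)
```

Encoder-decoder models combine the two: a bidirectional encoder reads the input and a causal decoder generates the output.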

These models are pre-trained on large amounts of text and can be fine-tuned to further improve their performance on specific tasks.

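5. Bias and Limitations in Models

Because the models are trained on real-world data, they may carry latent biases. For example, compare the occupations predicted for the following two prompts: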

unmasker = pipeline("fill-mask", model="distilroberta-base")
print(unmasker("This man works as a <mask>."))
print(unmasker("This woman works as a <mask>."))
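One simple way to inspect such results is to compare which predicted tokens appear for only one of the two prompts. The helper below is a hypothetical sketch; it assumes each prediction is a dict with a token_str key, as fill-mask results are, and the inputs are placeholders rather than real model output.

```python
def compare_predictions(preds_a, preds_b):
    """Split predicted tokens into those unique to each prompt and those shared."""
    a = {p["token_str"].strip() for p in preds_a}
    b = {p["token_str"].strip() for p in preds_b}
    return {"only_a": sorted(a - b), "only_b": sorted(b - a), "shared": sorted(a & b)}

# Placeholder predictions, not real model output.
preds_a = [{"token_str": " job_1"}, {"token_str": " job_2"}, {"token_str": " job_3"}]
preds_b = [{"token_str": " job_2"}, {"token_str": " job_4"}, {"token_str": " job_5"}]
print(compare_predictions(preds_a, preds_b))
```

With real results, the two "only" lists show where the model's suggestions differ between prompts that differ in just one word.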

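6. Behind the Pipeline

A pipeline connects preprocessing, model inference, and postprocessing. Here is what the sentiment analysis pipeline does under the hood: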

from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

inputs = tokenizer(["I love this!", "I hate this!"], padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits)
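The padding=True and truncation=True arguments exist because a batch must be rectangular: shorter sequences are padded, overly long ones are cut, and an attention mask marks which positions are real tokens. The sketch below imitates that behaviour on plain lists. It is a simplification (real truncation also preserves special tokens), and the token ids and the pad_batch helper are hypothetical.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Truncate to max_length, pad to the longest sequence, and build the mask."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for two sentences of different lengths.
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The zeros in the attention mask tell the model to ignore the padding positions.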

Convert the logits to probabilities with softmax:

import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)
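For reference, the same postprocessing can be written without torch. The sketch below implements a numerically stable softmax and maps the most probable index to a label. The ID2LABEL mapping is an assumption about this checkpoint's configuration, and the logits are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed index-to-label mapping; verify it against the model's configuration.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def to_prediction(logits):
    """Turn one row of logits into a pipeline-style result."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}

# Made-up logits for two sentences.
for row in [[-4.0, 4.0], [3.0, -3.0]]:
    print(to_prediction(row))
```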

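Conclusion

In this article we learned how to use Huggingface's pipeline to run a range of NLP tasks, including sentiment analysis, text generation, question answering, and translation. Huggingface's models are easy to use, and a few lines of code are enough to carry out complex natural language processing tasks.

In the next article we will look more closely at Huggingface's custom tokenizers and dataset loading. These tools will help us further optimize model performance and lay a solid foundation for more complex NLP projects.

Stay tuned for "Python-Based Natural Language Processing Series (34): Huggingface Custom Tokenizers"!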


Reposted from blog.csdn.net/ljd939952281/article/details/142930424