Exclusive | Scikit-LLM: Sklearn meets large language models

Author: Fareed Khan; Translator: Chen Zhiyan
Proofreader: Zhao Ruxuan


This article is about 2,600 words; suggested reading time is 8 minutes.
This article introduces Scikit-LLM, a toolkit for text analysis.

Tags: LLM

Scikit-LLM is a game changer for text analytics. It combines powerful language models such as ChatGPT with scikit-learn to provide a toolkit for understanding and analyzing text. With Scikit-LLM, you can uncover hidden patterns, sentiment, and context in many kinds of text data, such as customer feedback, social media posts, and news articles.

Official GitHub repository:

https://github.com/iryna-kondr/scikit-llm

All examples are taken directly from the official repository.

Let's get started with Scikit-LLM!

Install Scikit-LLM

Start by installing Scikit-LLM, the library that integrates scikit-learn with language models. You can install it with pip:

pip install scikit-llm


Obtain OpenAI API key

As of May 2023, Scikit-LLM is compatible with a specific set of OpenAI models, requiring users to provide their own OpenAI API key for successful integration.

First import the SKLLMConfig class from the Scikit-LLM library, then add your OpenAI key:

# importing SKLLMConfig to configure the OpenAI API (key and organization)
from skllm.config import SKLLMConfig


# Set your OpenAI API key
SKLLMConfig.set_openai_key("<YOUR_KEY>")


# Set your OpenAI organization (optional)
SKLLMConfig.set_openai_org("<YOUR_ORGANIZATION>")
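
Hard-coding a key in source files makes it easy to leak. As a minimal sketch, assuming the credentials are stored in environment variables, the helper below reads them before they are handed to the SKLLMConfig calls shown above. The function name `read_openai_credentials` and the variable names `OPENAI_API_KEY` and `OPENAI_ORG_ID` are assumptions made for this illustration, not part of Scikit-LLM.

```python
import os
from typing import Mapping, Optional, Tuple


def read_openai_credentials(
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, Optional[str]]:
    """Return (api_key, org_id) from an environment-style mapping.

    The variable names OPENAI_API_KEY and OPENAI_ORG_ID are assumptions
    made for this sketch; use whatever names your deployment defines.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    org = env.get("OPENAI_ORG_ID", "").strip() or None
    return key, org


# Usage with the calls shown above (requires scikit-llm to be installed):
#   from skllm.config import SKLLMConfig
#   key, org = read_openai_credentials()
#   SKLLMConfig.set_openai_key(key)
#   if org is not None:
#       SKLLMConfig.set_openai_org(org)
```

Passing the mapping as a parameter keeps the helper easy to test without touching the real environment.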

As stated in the GitHub repository:

In case of a free trial OpenAI account, the rate limit is not enough (3 requests per minute). Please switch to a Pay as you go plan first.

When calling SKLLMConfig.set_openai_org, an organization ID must be provided, not an organization name. The Org ID can be found from the following link: https://platform.openai.com/account/org-settings
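
A limit of 3 requests per minute works out to at least 20 seconds between calls. If you do have to work under such a limit, one option is to space requests out on the client side. The `Throttle` class below is a hypothetical helper written for this illustration; it is not part of Scikit-LLM or the OpenAI client.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Space out calls so that at most `per_minute` happen per minute."""

    def __init__(
        self,
        per_minute: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None  # time of the previous call

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds waited."""
        now = self._clock()
        waited = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_call = self._clock()
        return waited


throttle = Throttle(per_minute=3)
# Call throttle.wait() before each request that counts against the limit.
```

The clock and sleep functions are injectable so the behavior can be checked without actually waiting.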

Zero-shot GPT classifier

The cool thing about ChatGPT is that it can classify text without any task-specific training. All it needs is descriptive labels.

Scikit-LLM provides the ZeroShotGPTClassifier class for this. You use it just like any other scikit-learn classifier:

# importing zeroshotgptclassifier module and classification dataset
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset


# get the demo classification dataset bundled with skllm
X, y = get_classification_dataset()


# defining the model
clf = ZeroShotGPTClassifier(openai_model="gpt-3.5-turbo")


# fitting the data
clf.fit(X, y)


# predicting the data
labels = clf.predict(X)

Beyond that, Scikit-LLM checks that each response contains a valid label. If it does not, Scikit-LLM picks a label at random, with each label's probability proportional to how often it appears in the training data.

In short, Scikit-LLM handles the API calls and makes sure you get back a usable label. When a response lacks one, it fills the gap with a label chosen according to its frequency in the training data.
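
The fallback described above can be sketched in plain Python. This is an illustration of the idea only, not Scikit-LLM's actual implementation; the function name `resolve_label` and the case-insensitive matching rule are assumptions.

```python
import random
from collections import Counter
from typing import Optional, Sequence


def resolve_label(
    response: str,
    training_labels: Sequence[str],
    rng: Optional[random.Random] = None,
) -> str:
    """Return a valid label for `response`, falling back to a random one.

    If the response matches a known label, that label is returned.
    Otherwise a label is drawn at random, weighted by how often it
    appears in the training labels.
    """
    counts = Counter(training_labels)
    cleaned = response.strip().lower()
    for label in counts:
        if label.lower() == cleaned:
            return label
    if rng is None:
        rng = random.Random()
    labels = list(counts)
    weights = [counts[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


train = ["positive"] * 8 + ["negative"] * 2
print(resolve_label("Positive", train))         # valid response, used as-is
print(resolve_label("I cannot answer", train))  # invalid, weighted random pick
```

With this training set, an invalid response resolves to "positive" about 80% of the time, mirroring the label distribution.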

What if there is no labeled data?

Even more interesting, you don't need labeled data to train the model at all. Just provide a list of candidate labels:

# importing zeroshotgptclassifier module and classification dataset
from skllm import ZeroShotGPTClassifier
from skllm.datasets import get_classification_dataset


# get the demo classification dataset from skllm (used for prediction only)
X, _ = get_classification_dataset()


# defining the model
clf = ZeroShotGPTClassifier()


# There is no training data, so pass only the candidate labels
clf.fit(None, ['positive', 'negative', 'neutral'])


# predicting the labels
labels = clf.predict(X)

Isn't that cool? By specifying only the candidate labels, you can "train" the classifier without any explicitly labeled data.
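
To give a feel for why candidate labels alone are enough, here is a sketch of how a zero-shot prompt could be assembled from them. The template and the function name `build_zero_shot_prompt` are hypothetical; this is not the prompt Scikit-LLM actually sends.

```python
from typing import Sequence


def build_zero_shot_prompt(text: str, candidate_labels: Sequence[str]) -> str:
    """Assemble an illustrative zero-shot classification prompt."""
    label_list = ", ".join(f"'{label}'" for label in candidate_labels)
    return (
        "Classify the text into exactly one of these categories: "
        f"{label_list}.\n"
        "Answer with the category name only.\n"
        f"Text: {text}"
    )


prompt = build_zero_shot_prompt(
    "The delivery was late and the box was damaged.",
    ["positive", "negative", "neutral"],
)
print(prompt)
```

The model never sees training examples here; the label names themselves carry all the task information.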

As stated in the GitHub repository:

In zero-shot classification, the performance of the classifier depends heavily on how the label itself is phrased. A label should be expressed in natural language and be descriptive and self-explanatory.

For example, in a semantic classification task, it may be more beneficial to convert the label from "<semantics>" to "the semantics of the provided text is <semantics>".
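
The label rewrite suggested above is easy to automate. The sketch below builds the descriptive labels and keeps a mapping back to the short ones; the helper name `to_descriptive` is an assumption made for this example.

```python
from typing import Dict, List

TEMPLATE = "the semantics of the provided text is {}"


def to_descriptive(labels: List[str]) -> Dict[str, str]:
    """Map each descriptive label back to its original short label."""
    return {TEMPLATE.format(label): label for label in labels}


short_labels = ["positive", "negative", "neutral"]
descriptive_to_short = to_descriptive(short_labels)

# Candidate labels to pass to the classifier, e.g. clf.fit(None, candidates)
candidates = list(descriptive_to_short)

# After prediction, translate a descriptive label back to the short form.
predicted = "the semantics of the provided text is negative"
print(descriptive_to_short[predicted])
```

Keeping the reverse mapping means downstream code can continue to work with the short labels.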

Multi-label zero-shot text classification

Performing multi-label zero-shot text classification is easier than you might think:

# importing Multi-Label zeroshot module and classification dataset
from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset


# get the demo multi-label classification dataset bundled with skllm
X, y = get_multilabel_classification_dataset()


# defining the model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)


# fitting the model
clf.fit(X, y)


# making predictions
labels = clf.predict(X)

The only difference for multi-label zero-shot text classification is that, when creating an instance of the MultiLabelZeroShotGPTClassifier class, you need to specify the maximum number of labels to assign to each sample (here: max_labels=3).
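
To make the effect of such a cap concrete, the sketch below post-processes a list of labels so that only known labels are kept and no more than `max_labels` are returned. It is an illustration of the idea, not Scikit-LLM's internal logic; `cap_labels` is a hypothetical helper.

```python
from typing import List, Sequence


def cap_labels(
    raw_labels: Sequence[str],
    candidate_labels: Sequence[str],
    max_labels: int,
) -> List[str]:
    """Keep known labels, drop duplicates, and return at most `max_labels`."""
    allowed = set(candidate_labels)
    kept: List[str] = []
    for label in raw_labels:
        if len(kept) >= max_labels:
            break
        if label in allowed and label not in kept:
            kept.append(label)
    return kept


candidates = ["Quality", "Price", "Delivery", "Service", "Product Variety"]
raw = ["Price", "Quality", "Price", "Speed", "Service", "Delivery"]
print(cap_labels(raw, candidates, max_labels=3))
```

Here the duplicate "Price" and the unknown "Speed" are dropped, and "Delivery" is cut off by the cap.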

What if there is no labeled data (multi-label example)?

In the example above, the MultiLabelZeroShotGPTClassifier is trained with labeled data (X and y). It is also possible to train a classifier on unlabeled data by providing a list of candidate labels. In this case, the type of y should be List[List[str]].

Here is a training example without labeled data:

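# importing the multi-label classifier and the dataset loader
from skllm import MultiLabelZeroShotGPTClassifier
from skllm.datasets import get_multilabel_classification_dataset


# getting classification dataset for prediction only
X, _ = get_multilabel_classification_dataset()


# Defining all the labels that need to be predicted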
candidate_labels = [
    "Quality",
    "Price",
    "Delivery",
    "Service",
    "Product Variety"
]


# creating the model
clf = MultiLabelZeroShotGPTClassifier(max_labels=3)


# fitting the labels only
clf.fit(None, [candidate_labels])


# predicting the data
labels = clf.predict(X)
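
The List[List[str]] shape explains why candidate_labels is wrapped in an extra list above: with labeled data there is one inner list per sample, while with candidate labels only there is a single inner list. The sketch below shows that both forms reduce to the same label set. Whether Scikit-LLM collects labels exactly this way is an assumption; `unique_labels` is a helper written for this illustration.

```python
from typing import List


def unique_labels(y: List[List[str]]) -> List[str]:
    """Collect the distinct labels in `y`, preserving first-seen order."""
    seen: List[str] = []
    for sample_labels in y:
        for label in sample_labels:
            if label not in seen:
                seen.append(label)
    return seen


# Labeled data: one inner list of labels per training sample.
y_labeled = [["Quality", "Price"], ["Delivery"], ["Price", "Service"]]

# No labeled data: a single inner list holding every candidate label.
y_candidates_only = [["Quality", "Price", "Delivery", "Service"]]

print(unique_labels(y_labeled))
print(unique_labels(y_candidates_only))
```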


Text vectorization

Text vectorization is the process of converting text into numbers so that it can be processed and analyzed by a computer. Scikit-LLM's GPTVectorizer class converts a piece of text, no matter how long it is, into a fixed-size list of numbers called a vector.

# Importing the GPTVectorizer class from the skllm.preprocessing module
from skllm.preprocessing import GPTVectorizer


# Creating an instance of the GPTVectorizer class and assigning it to the variable 'model'
model = GPTVectorizer()  


# transforming the input text X into vectors
vectors = model.fit_transform(X)

Calling the fit_transform method of the GPTVectorizer instance on the input data X fits the model to the data, converts each text to a fixed-dimensional vector, and assigns the result to the variable vectors.
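
The key property here is that the output size does not depend on the input length. The toy function below demonstrates that property with simple feature hashing. It is a stand-in for illustration only: it does not call any model and does not capture meaning the way GPT-generated embeddings do. The name `toy_vectorize` and the dimension of 8 are assumptions.

```python
import hashlib
from typing import List


def toy_vectorize(text: str, dim: int = 8) -> List[float]:
    """Map text of any length to a `dim`-sized vector via feature hashing."""
    vector = [0.0] * dim
    for token in text.lower().split():
        # Hash each token to a stable bucket index in [0, dim).
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vector[digest[0] % dim] += 1.0
    return vector


short = toy_vectorize("Great product")
long = toy_vectorize("The package arrived two days late " * 50)
print(len(short), len(long))  # both vectors have length 8
```

A two-word text and a 300-word text both end up as vectors of the same length, which is what lets a downstream classifier consume them.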

The following demonstrates an example of combining GPTVectorizer and XGBoost Classifier in a scikit-learn pipeline, which can effectively implement text preprocessing and classification:

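# Importing the necessary modules and classes
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from skllm.preprocessing import GPTVectorizer
from xgboost import XGBClassifier


# X_train, X_test, y_train and y_test are assumed to come from an earlier train/test split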


# Creating an instance of LabelEncoder class
le = LabelEncoder()


# Encoding the training labels 'y_train' using LabelEncoder
y_train_encoded = le.fit_transform(y_train)


# Encoding the test labels 'y_test' using LabelEncoder
y_test_encoded = le.transform(y_test)


# Defining the steps of the pipeline as a list of tuples
steps = [('GPT', GPTVectorizer()), ('Clf', XGBClassifier())]


# Creating a pipeline with the defined steps
clf = Pipeline(steps)


# Fitting the pipeline on the training data 'X_train' and the encoded training labels 'y_train_encoded'
clf.fit(X_train, y_train_encoded)


# Predicting the labels for the test data 'X_test' using the trained pipeline
yh = clf.predict(X_test)
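
Because the pipeline is trained on integer-encoded labels, the predictions in yh are integers too and need to be mapped back to the original label names (with scikit-learn, le.inverse_transform(yh) does this). The dependency-free sketch below walks through the same round trip and computes accuracy. The example labels and the predicted values are made up for illustration.

```python
from typing import Dict, Sequence


def fit_label_mapping(labels: Sequence[str]) -> Dict[str, int]:
    """Assign an integer id to each distinct label, in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of positions where the prediction matches the truth."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


y_test = ["negative", "positive", "neutral", "positive"]
label_to_id = fit_label_mapping(y_test)
id_to_label = {idx: label for label, idx in label_to_id.items()}

y_test_encoded = [label_to_id[label] for label in y_test]

# Hypothetical pipeline output, standing in for clf.predict(X_test).
yh = [0, 2, 1, 1]

decoded = [id_to_label[idx] for idx in yh]
print(decoded)
print(accuracy(y_test_encoded, yh))
```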


Text summarization

GPT is very good at summarizing text, which is why Scikit-LLM includes a GPTSummarizer class. It can be used in two ways: on its own, or as a preprocessing step before something else (a form of dimensionality reduction that works on text rather than numbers):

# Importing the GPTSummarizer class from the skllm.preprocessing module
from skllm.preprocessing import GPTSummarizer


# Importing the get_summarization_dataset function
from skllm.datasets import get_summarization_dataset


# Calling the get_summarization_dataset function
X = get_summarization_dataset()


# Creating an instance of the GPTSummarizer
s = GPTSummarizer(openai_model='gpt-3.5-turbo', max_words=15)


# Applying the fit_transform method of the GPTSummarizer instance to the input data 'X'.
# It fits the model to the data and generates the summaries, which are assigned to the variable 'summaries'
summaries = s.fit_transform(X)

Note that the max_words hyperparameter places only a soft limit on the number of words in the generated summary. It is passed to the model as a hint and is not strictly enforced, so depending on the context and content of the input text, a summary may occasionally run slightly over the specified limit.
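
If a hard cap is required downstream, one option is to truncate the summaries yourself after they are generated. The helper below is a minimal sketch of that post-processing step; `enforce_max_words` is a hypothetical name and is not part of Scikit-LLM.

```python
def enforce_max_words(summary: str, max_words: int) -> str:
    """Truncate `summary` to at most `max_words` whitespace-separated words."""
    words = summary.split()
    if len(words) <= max_words:
        return summary
    # Mark the cut so readers can tell the text was shortened.
    return " ".join(words[:max_words]) + "..."


print(enforce_max_words("one two three four five", 3))
print(enforce_max_words("short text", 15))
```

Truncating at a word boundary can cut a sentence short, so this is a trade-off between a strict length guarantee and readability.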

If you have any questions, please feel free to ask!

Original title: Scikit-LLM: Sklearn Meets Large Language Models

Original link: https://medium.com/@fareedkhandev/scikit-llm-sklearn-meets-large-language-models-11fc6f30e530

Editor: Huang Jiyan

Translator profile

Chen Zhiyan graduated from Beijing Jiaotong University with a master's degree in engineering, majoring in communication and control engineering. He has worked as an engineer at Great Wall Computer Software and Systems Company and at Datang Microelectronics Company, and currently provides technical support at Beijing Wuyi Chaoqun Technology Co., Ltd. He works on the operation and maintenance of an intelligent translation teaching system and has gained experience in deep learning and natural language processing (NLP). In his spare time he enjoys translating and writing. His translations include IEC-ISO 7816, Iraqi Petroleum Engineering Project, and Declaration of New Fiscalism; the Chinese-English version of "Declaration of New Fiscalism" was officially published in GLOBAL TIMES. He volunteers in his spare time with the translation group of the THU Data Pie platform and hopes to exchange ideas and make progress together with readers.

Origin blog.csdn.net/tMb8Z9Vdm66wH68VX1/article/details/131842559