Small language models with Phi-2
April 8, 2024, by Phillip Dang.
Like many other large language models (LLMs), Phi-2 is a transformer-based model with a next-word prediction objective, trained on a corpus of billions of tokens. With 2.7 billion parameters, Phi-2 is a relatively small language model, yet it performs well on a variety of tasks, including common-sense reasoning, language understanding, math, and coding. For reference, GPT-3.5 has 175 billion parameters and the smallest version of LLaMA-2 has 7 billion. According to Microsoft, Phi-2 can match or even outperform models up to 25 times larger, thanks to more carefully curated training data and model scaling.
For a deeper dive into the inner workings of Phi-2 and Microsoft's earlier Phi models, check out this Microsoft blog and the paper Textbooks Are All You Need.
In this blog, we run inference with Phi-2 on AMD GPUs with supported ROCm software and show how well it works out of the box.
Prerequisites
To follow along with this blog, you'll need a Linux system with ROCm, PyTorch, and a supported AMD GPU.
For the list of supported GPUs and operating systems, see the ROCm system requirements. For convenience and stability, we recommend pulling and running the rocm/pytorch Docker container directly on your Linux system:
```bash
docker run -it --ipc=host --network=host --device=/dev/kfd --device=/dev/dri \
    --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --name=olmo rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1 /bin/bash
```
Next, make sure your system recognizes the GPU:
```python
! rocm-smi --showproductname
```
```
================= ROCm System Management Interface ================
========================= Product Info ============================
GPU[0] : Card series: Instinct MI210
GPU[0] : Card model: 0x0c34
GPU[0] : Card vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0] : Card SKU: D67301
===================================================================
===================== End of ROCm SMI Log =========================
```
Make sure PyTorch also recognizes the GPU:
```python
import torch
print(f"number of GPUs: {torch.cuda.device_count()}")
print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])
```
```
number of GPUs: 1
['AMD Radeon Graphics']
```
Once you've confirmed that your system recognizes your device, you're ready to start testing Phi-2.
Install libraries
Before you begin, make sure you have all the necessary libraries installed:
```python
!pip install transformers accelerate einops datasets
!pip install --upgrade SQLAlchemy==1.4.46
!pip install alembic==1.4.1
!pip install numpy==1.23.4
```
Next, import the modules you'll be working with in this blog:
```python
import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer
```
Load the model
To load the model and its tokenizer, run the following commands:
```python
torch.set_default_device("cuda")
start_time = time.time()
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
print(f"Loaded in {time.time() - start_time: .2f} seconds")
print(model)
```
Running the commands above should produce output like this:
```
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00, 2.23s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loaded in 5.01 seconds
PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2560)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x PhiDecoderLayer(
        (self_attn): PhiAttention(
          (q_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (k_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (v_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (dense): Linear(in_features=2560, out_features=2560, bias=True)
          (rotary_emb): PhiRotaryEmbedding()
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear(in_features=10240, out_features=2560, bias=True)
        )
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (final_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=2560, out_features=51200, bias=True)
)
```
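The printed architecture is consistent with the 2.7-billion-parameter figure quoted earlier. As a quick sanity check, you can sum the parameter counts of the loaded model; this short snippet assumes the `model` object from the loading step above:

```python
# Count the parameters of the loaded Phi-2 model; this should print roughly 2.78B.
num_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {num_params / 1e9:.2f}B")
```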
Run inference
Let's create a function that takes an input prompt and generates an output. We set max_length to 500, and we cut off the response whenever we encounter the end-of-text token (<|endoftext|>), because we noticed that the model tends to produce irrelevant or extra text after its first answer to the prompt. Microsoft has pointed out this behavior, noting that it is "due to its training dataset being primarily textbooks, which results in textbook-like responses".
```python
def run_inference(raw_inputs):
    start_time = time.time()
    inputs = tokenizer(raw_inputs, return_tensors="pt", return_attention_mask=False)
    outputs = model.generate(**inputs, max_length=500)
    print(f"Generated in {time.time() - start_time: .2f} seconds")
    text = tokenizer.batch_decode(outputs)[0]

    # Cut off the response at the first end-of-text token
    if '<|endoftext|>' in text:
        index = text.index('<|endoftext|>')
    else:
        index = len(text)
    text = text[:index]
    return text
```
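The function above relies on `model.generate`'s default greedy decoding. If you want more varied output, you can experiment with Hugging Face's standard sampling parameters; the prompt and values below are only illustrative, not tuned:

```python
# Sampling-based generation for more varied output (illustrative values).
inputs = tokenizer("Write a haiku about GPUs.", return_tensors="pt", return_attention_mask=False)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,   # limit the number of newly generated tokens
    do_sample=True,       # sample instead of greedy decoding
    temperature=0.7,      # lower = more deterministic
    top_p=0.9,            # nucleus sampling
)
print(tokenizer.batch_decode(outputs)[0])
```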
With that in place, we're ready to run inference and have some fun with Phi-2! We'll test the model's ability to generate code, summarize a paper, explain a joke, and generate text in a specific style.
Generate code
Let's give Phi-2 a medium-difficulty LeetCode problem and see how it does.
```python
raw_inputs = '''
Given an integer array nums, return all the triplets [nums[i], nums[j], nums[k]] such that i != j, i != k, and j != k, and nums[i] + nums[j] + nums[k] == 0.
Notice that the solution set must not contain duplicate triplets.
'''
print(run_inference(raw_inputs))
```
Output:
```
Generated in 16.42 seconds

Given an integer array nums, return all the triplets [nums[i], nums[j], nums[k]] such that i!= j, i!= k, and j!= k, and nums[i] + nums[j] + nums[k] == 0.
Notice that the solution set must not contain duplicate triplets.

Example 1:
Input: nums = [-1,0,1,2,-1,-4]
Output: [[-1,-1,2],[-1,0,1]]

Example 2:
Input: nums = []
Output: []

Constraints:
0 <= nums.length <= 3000
-10^4 <= nums[i] <= 10^4
"""

class Solution:
    def threeSum(self, nums: List[int]) -> List[List[int]]:
        nums.sort()
        res = []
        for i in range(len(nums)):
            if i > 0 and nums[i] == nums[i-1]:
                continue
            l, r = i+1, len(nums)-1
            while l < r:
                s = nums[i] + nums[l] + nums[r]
                if s < 0:
                    l += 1
                elif s > 0:
                    r -= 1
                else:
                    res.append([nums[i], nums[l], nums[r]])
                    while l < r and nums[l] == nums[l+1]:
                        l += 1
                    while l < r and nums[r] == nums[r-1]:
                        r -= 1
                    l += 1
                    r -= 1
        return res
```
This answer is correct and is accepted by LeetCode.
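If you want to check the generated solution locally rather than on LeetCode, a minimal test along these lines works; the `threeSum` body below is copied from the model's output above, and the test input is Example 1 from the prompt:

```python
from typing import List

# Model-generated solution, copied from the output above.
class Solution:
    def threeSum(self, nums: List[int]) -> List[List[int]]:
        nums.sort()
        res = []
        for i in range(len(nums)):
            if i > 0 and nums[i] == nums[i-1]:
                continue
            l, r = i+1, len(nums)-1
            while l < r:
                s = nums[i] + nums[l] + nums[r]
                if s < 0:
                    l += 1
                elif s > 0:
                    r -= 1
                else:
                    res.append([nums[i], nums[l], nums[r]])
                    while l < r and nums[l] == nums[l+1]:
                        l += 1
                    while l < r and nums[r] == nums[r-1]:
                        r -= 1
                    l += 1
                    r -= 1
        return res

# Example 1 from the prompt: expect [[-1, -1, 2], [-1, 0, 1]] (triplet order may vary).
print(Solution().threeSum([-1, 0, 1, 2, -1, -4]))
```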
Summarize a paper
Let's ask Phi-2 to summarize a well-known paper on the attention mechanism and the transformer architecture.
```python
raw_inputs = '''
Summarize the paper "Attention Is All You Need".
'''
print(run_inference(raw_inputs))
```
Output:
```
Generated in 19.46 seconds

Summarize the paper "Attention Is All You Need".

## INPUT

##OUTPUT

The paper "Attention Is All You Need" proposes a novel neural network architecture called Transformer, which uses self-attention mechanisms to encode and decode sequences of data. The paper shows that Transformer outperforms existing models on various natural language processing tasks, such as machine translation, text summarization, and question answering. The paper also introduces the concept of attention, which allows the model to focus on relevant parts of the input and output, and to learn from the context of the data. The paper demonstrates that attention can be implemented efficiently and effectively using a single layer of trainable parameters, without the need for recurrent or convolutional layers. The paper also provides empirical evidence and theoretical analysis to support the effectiveness of attention in Transformer.
```
This summary captures the key points of the paper well while staying concise.
Explain a joke
Let's have Phi-2 explain a joke, which can require quite a bit of nuance.
```python
raw_inputs = '''
Instruct: Explain the joke below
Q: Why did Beethoven get rid of all of his chickens?
A: All they ever said was, “Bach, Bach, Bach!”.
Output:
'''
print(run_inference(raw_inputs))
```
Output:
```
Generated in 17.74 seconds

Instruct: Explain the joke below
Q: Why did Beethoven get rid of all of his chickens?
A: All they ever said was, “Bach, Bach, Bach!”.
Output: The joke is a play on words. The expression “Bach, Bach, Bach” is a reference to the musical composition of Johann Sebastian Bach. The joke suggests that Beethoven was tired of his chickens constantly saying the same thing, implying that he wanted to get rid of them because they were too repetitive.
```
Phi-2 understands that "Bach" refers to the famous composer Johann Sebastian Bach. However, it fails to explain the actual punchline, which is that "Bach" imitates the clucking sound the chickens make.
Generate text in a specific style
Next, let's try generating text in a specific style.
```python
raw_inputs = '''
Instruct: Write a detailed dialog between two physicists in Shakespearean english
Output:
'''
print(run_inference(raw_inputs))
```
Output:
```
Generated in 18.94 seconds

Instruct: Write a detailed dialog between two physicists in Shakespearean english
Output:
Physicist 1: "Good morrow, my dear friend! I have been pondering the mysteries of the universe, and I seek your wisdom."
Physicist 2: "Ah, thou art a seeker of truth! Pray tell, what enigma has captured thy mind?"
Physicist 1: "I have been contemplating the nature of light, and its duality as both particle and wave. It is a perplexing concept indeed."
Physicist 2: "Ah, light, the very essence of illumination! It dances upon the stage of existence, revealing the secrets of the cosmos."
Physicist 1: "Indeed, but how can we reconcile its particle-like behavior with its wave-like properties? It defies logic!"
Physicist 2: "Ah, my friend, logic is but a mere tool in our quest for understanding. We must embrace the beauty of uncertainty and explore the depths of the unknown."
Physicist 1: "You speak wise words, my friend. Let us embark on this journey together, unraveling the mysteries of the universe one photon at a time."
Physicist 2: "Indeed, let us delve into the realm of quantum mechanics, where the laws of classical physics crumble, and new wonders await."
Physicist 1: "And so, we shall venture forth, armed with our knowledge and curiosity, seeking the truth that lies hidden within the fabric of reality."
Physicist 2: "To the stars and beyond, my friend! May our quest for knowledge illuminate the path ahead."
Physicist 1: "To the stars and beyond!"
```
We find that Phi-2 captures not only the kind of subject matter two physicists might discuss, but also the Shakespearean tone, word choice, and style!
Limitations of Phi-2
According to Microsoft, Phi-2 has limitations such as inaccurate code and facts, and unreliable responses to instructions. In this section, we provide a few quick examples that probe Phi-2's limitations in code generation accuracy and response reliability.
Inaccurate code generation
Here we test whether Phi-2 will generate code that calculates a square root using a nonexistent, incorrect package (advancedmath).
```python
raw_inputs = '''
Instruct: Create a function in Python that calculates the square root of a number using the 'advancedmath' package
Output:
'''
print(run_inference(raw_inputs))
```
Output:
```
Generated in 18.45 seconds

Instruct: Create a function in Python that calculates the square root of a number using the 'advancedmath' package

import advancedmath

def calculate_square_root(number):
    return advancedmath.sqrt(number)

number = 25
result = calculate_square_root(number)
print(result)
```
We see that Phi-2 does not tell us that the `advancedmath` package doesn't exist. Instead, it confidently provides a plausible-looking answer.
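One lightweight guard against this failure mode is to check that any package the model names can actually be imported before running its code. Here is a minimal sketch using only the standard library:

```python
import importlib.util

def package_available(name: str) -> bool:
    """Return True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None

# The real 'math' module is found; the hallucinated 'advancedmath' is not.
print(package_available("math"))          # True
print(package_available("advancedmath"))  # False
```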
Response reliability
Next, we test Phi-2's ability to summarize and explain the technical paper LoRA: Low-Rank Adaptation of Large Language Models, which was published on June 17, 2021.
```python
raw_inputs = '''
Summarize the paper "LoRA: Low-Rank Adaptation of Large Language Models" and explain the method in details.
'''
print(run_inference(raw_inputs))
```
Output:
```
Generated in 18.52 seconds

Summarize the paper "LoRA: Low-Rank Adaptation of Large Language Models" and explain the method in details.

A: The paper "LoRA: Low-Rank Adaptation of Large Language Models" proposes a method for adapting large language models (LLMs) to specific tasks by reducing their size and complexity. The method, called Low-Rank Adaptation (LoRA), involves projecting the input text onto a lower-dimensional space using a low-rank matrix factorization (LRMF) technique. The resulting low-rank representation is then used to train a new LLM on the reduced data, which is found to perform well on the target task. The paper evaluates the effectiveness of LoRA on several benchmark tasks, including text classification, question answering, and machine translation, and shows that it achieves state-of-the-art performance while significantly reducing the computational cost of training LLMs.
```
We see that the model's explanation of LoRA is incorrect: the method is about reducing the number of trainable parameters, not projecting the input data onto a lower-dimensional space. This is most likely because the model was not exposed to the LoRA paper during training.
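For contrast, the actual LoRA idea keeps the pretrained weight matrix frozen and learns a small low-rank update on top of it, so only the two low-rank matrices are trained. The following is a simplified, illustrative sketch of that idea, not the paper's reference implementation; the class name, rank, and scaling values are our own choices:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (simplified sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen base output plus the scaled low-rank update x (BA)^T.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

layer = LoRALinear(nn.Linear(2560, 2560))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable}")  # far fewer than the 2560 x 2560 base weight
```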