Small language models with Phi-2
April 8, 2024, by Phillip Dang.
Like many other large language models (LLMs), Phi-2 is a transformer-based model with a next-word prediction objective, trained on a corpus of billions of tokens. With 2.7 billion parameters, Phi-2 is a relatively small language model, yet it performs well on a variety of tasks, including common-sense reasoning, language understanding, math, and coding. For reference, GPT-3.5 has 175 billion parameters and the smallest version of LLaMA-2 has 7 billion. According to Microsoft, Phi-2 can match or even outperform models up to 25 times larger, thanks to more carefully curated training data and model scaling.
For a deeper dive into the inner workings of Phi-2 and Microsoft's earlier Phi models, check out this Microsoft blog and the paper Textbooks Are All You Need.
In this blog, we run inference with Phi-2 on AMD GPUs with supported ROCm software and show how well it works out of the box.
Prerequisites
To follow along with this blog, you'll need a Linux system with ROCm, PyTorch, and a supported AMD GPU.
For the list of supported GPUs and operating systems, see the ROCm system requirements. For convenience and stability, we recommend pulling and running the rocm/pytorch Docker container directly on your Linux system:
```bash
docker run -it --ipc=host --network=host --device=/dev/kfd --device=/dev/dri \
    --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --name=olmo rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1 /bin/bash
```
Next, make sure your system recognizes the GPU:
```python
! rocm-smi --showproductname
```
```
================= ROCm System Management Interface ================
========================= Product Info ============================
GPU[0] : Card series: Instinct MI210
GPU[0] : Card model: 0x0c34
GPU[0] : Card vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0] : Card SKU: D67301
===================================================================
===================== End of ROCm SMI Log =========================
```
Make sure PyTorch also recognizes the GPU:
```python
import torch
print(f"number of GPUs: {torch.cuda.device_count()}")
print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])
```
```
number of GPUs: 1
['AMD Radeon Graphics']
```
Once you've confirmed that your system recognizes your device, you're ready to start testing Phi-2.
Install libraries
Before you begin, make sure you have all the necessary libraries installed:
```python
!pip install transformers accelerate einops datasets
!pip install --upgrade SQLAlchemy==1.4.46
!pip install alembic==1.4.1
!pip install numpy==1.23.4
```
Next, import the modules you'll be working with in this blog:
```python
import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer
```
Load the model
To load the model and its tokenizer, run the following commands:
```python
torch.set_default_device("cuda")
start_time = time.time()
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
print(f"Loaded in {time.time() - start_time: .2f} seconds")
print(model)
```
Running the commands above should produce output like this:
```
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00, 2.23s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loaded in 5.01 seconds
PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2560)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x PhiDecoderLayer(
        (self_attn): PhiAttention(
          (q_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (k_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (v_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (dense): Linear(in_features=2560, out_features=2560, bias=True)
          (rotary_emb): PhiRotaryEmbedding()
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear(in_features=10240, out_features=2560, bias=True)
        )
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (final_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=2560, out_features=51200, bias=True)
)
```
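The printed architecture is consistent with the 2.7-billion-parameter figure quoted earlier. As a quick sanity check, you can sum the parameter counts of the loaded model; this short snippet assumes the `model` object from the loading step above:

```python
# Count the parameters of the loaded Phi-2 model; this should print roughly 2.78B.
num_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {num_params / 1e9:.2f}B")
```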
Run inference
Let's create a function that takes an input prompt and generates an output. We set max_length to 500, and we cut off the response whenever we encounter the end-of-text token (<|endoftext|>), because we noticed that the model tends to produce irrelevant or extra text after its first answer to the prompt. Microsoft has pointed out this behavior, noting that it is "due to its training dataset being primarily textbooks, which results in textbook-like responses".
```python
def run_inference(raw_inputs):
    start_time = time.time()
    inputs = tokenizer(raw_inputs, return_tensors="pt", return_attention_mask=False)
    outputs = model.generate(**inputs, max_length=500)
    print(f"Generated in {time.time() - start_time: .2f} seconds")
    text = tokenizer.batch_decode(outputs)[0]

    # Cut off the response at the first end-of-text token
    if '<|endoftext|>' in text:
        index = text.index('<|endoftext|>')
    else:
        index = len(text)
    text = text[:index]
    return text
```
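The function above relies on `model.generate`'s default greedy decoding. If you want more varied output, you can experiment with Hugging Face's standard sampling parameters; the prompt and values below are only illustrative, not tuned:

```python
# Sampling-based generation for more varied output (illustrative values).
inputs = tokenizer("Write a haiku about GPUs.", return_tensors="pt", return_attention_mask=False)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,   # limit the number of newly generated tokens
    do_sample=True,       # sample instead of greedy decoding
    temperature=0.7,      # lower = more deterministic
    top_p=0.9,            # nucleus sampling
)
print(tokenizer.batch_decode(outputs)[0])
```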
With that in place, we're ready to run inference and have some fun with Phi-2! We'll test the model's ability to generate code, summarize a paper, explain a joke, and generate text in a specific style.
Generate code
Let's give Phi-2 a medium-difficulty LeetCode problem and see how it does.
```python
raw_inputs = '''
Given an integer array nums, return all the triplets [nums[i], nums[j], nums[k]] such that i != j, i != k, and j != k, and nums[i] + nums[j] + nums[k] == 0.
Notice that the solution set must not contain duplicate triplets.
'''
print(run_inference(raw_inputs))
```
Output:
```
Generated in 16.42 seconds

Given an integer array nums, return all the triplets [nums[i], nums[j], nums[k]] such that i!= j, i!= k, and j!= k, and nums[i] + nums[j] + nums[k] == 0.
Notice that the solution set must not contain duplicate triplets.

Example 1:
Input: nums = [-1,0,1,2,-1,-4]
Output: [[-1,-1,2],[-1,0,1]]

Example 2:
Input: nums = []
Output: []

Constraints:
0 <= nums.length <= 3000
-10^4 <= nums[i] <= 10^4
"""

class Solution:
    def threeSum(self, nums: List[int]) -> List[List[int]]:
        nums.sort()
        res = []
        for i in range(len(nums)):
            if i > 0 and nums[i] == nums[i-1]:
                continue
            l, r = i+1, len(nums)-1
            while l < r:
                s = nums[i] + nums[l] + nums[r]
                if s < 0:
                    l += 1
                elif s > 0:
                    r -= 1
                else:
                    res.append([nums[i], nums[l], nums[r]])
                    while l < r and nums[l] == nums[l+1]:
                        l += 1
                    while l < r and nums[r] == nums[r-1]:
                        r -= 1
                    l += 1
                    r -= 1
        return res
```
This answer is correct and is accepted by LeetCode.
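If you want to check the generated solution locally rather than on LeetCode, a minimal test along these lines works; the `threeSum` body below is copied from the model's output above, and the test input is Example 1 from the prompt:

```python
from typing import List

# Model-generated solution, copied from the output above.
class Solution:
    def threeSum(self, nums: List[int]) -> List[List[int]]:
        nums.sort()
        res = []
        for i in range(len(nums)):
            if i > 0 and nums[i] == nums[i-1]:
                continue
            l, r = i+1, len(nums)-1
            while l < r:
                s = nums[i] + nums[l] + nums[r]
                if s < 0:
                    l += 1
                elif s > 0:
                    r -= 1
                else:
                    res.append([nums[i], nums[l], nums[r]])
                    while l < r and nums[l] == nums[l+1]:
                        l += 1
                    while l < r and nums[r] == nums[r-1]:
                        r -= 1
                    l += 1
                    r -= 1
        return res

# Example 1 from the prompt: expect [[-1, -1, 2], [-1, 0, 1]] (triplet order may vary).
print(Solution().threeSum([-1, 0, 1, 2, -1, -4]))
```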
Summarize a paper
Let's ask Phi-2 to summarize a well-known paper on the attention mechanism and the transformer architecture.
```python
raw_inputs = '''
Summarize the paper "Attention Is All You Need".
'''
print(run_inference(raw_inputs))
```
Output:
```
Generated in 19.46 seconds

Summarize the paper "Attention Is All You Need".

## INPUT

##OUTPUT

The paper "Attention Is All You Need" proposes a novel neural network architecture called Transformer, which uses self-attention mechanisms to encode and decode sequences of data. The paper shows that Transformer outperforms existing models on various natural language processing tasks, such as machine translation, text summarization, and question answering. The paper also introduces the concept of attention, which allows the model to focus on relevant parts of the input and output, and to learn from the context of the data. The paper demonstrates that attention can be implemented efficiently and effectively using a single layer of trainable parameters, without the need for recurrent or convolutional layers. The paper also provides empirical evidence and theoretical analysis to support the effectiveness of attention in Transformer.
```
This summary captures the key points of the paper well while staying concise.
Explain a joke
Let's have Phi-2 explain a joke, which can require quite a bit of nuance.
```python
raw_inputs = '''
Instruct: Explain the joke below
Q: Why did Beethoven get rid of all of his chickens?
A: All they ever said was, “Bach, Bach, Bach!”.
Output:
'''
print(run_inference(raw_inputs))
```
Output:
```
Generated in 17.74 seconds

Instruct: Explain the joke below
Q: Why did Beethoven get rid of all of his chickens?
A: All they ever said was, “Bach, Bach, Bach!”.
Output: The joke is a play on words. The expression “Bach, Bach, Bach” is a reference to the musical composition of Johann Sebastian Bach. The joke suggests that Beethoven was tired of his chickens constantly saying the same thing, implying that he wanted to get rid of them because they were too repetitive.
```
Phi-2 understands that "Bach" refers to the famous composer Johann Sebastian Bach. However, it fails to explain the actual punchline, which is that "Bach" imitates the clucking sound the chickens make.
Generate text in a specific style
Next, let's try generating text in a specific style.
```python
raw_inputs = '''
Instruct: Write a detailed dialog between two physicists in Shakespearean english
Output:
'''
print(run_inference(raw_inputs))
```
Output:
```
Generated in 18.94 seconds

Instruct: Write a detailed dialog between two physicists in Shakespearean english
Output:
Physicist 1: "Good morrow, my dear friend! I have been pondering the mysteries of the universe, and I seek your wisdom."
Physicist 2: "Ah, thou art a seeker of truth! Pray tell, what enigma has captured thy mind?"
Physicist 1: "I have been contemplating the nature of light, and its duality as both particle and wave. It is a perplexing concept indeed."
Physicist 2: "Ah, light, the very essence of illumination! It dances upon the stage of existence, revealing the secrets of the cosmos."
Physicist 1: "Indeed, but how can we reconcile its particle-like behavior with its wave-like properties? It defies logic!"
Physicist 2: "Ah, my friend, logic is but a mere tool in our quest for understanding. We must embrace the beauty of uncertainty and explore the depths of the unknown."
Physicist 1: "You speak wise words, my friend. Let us embark on this journey together, unraveling the mysteries of the universe one photon at a time."
Physicist 2: "Indeed, let us delve into the realm of quantum mechanics, where the laws of classical physics crumble, and new wonders await."
Physicist 1: "And so, we shall venture forth, armed with our knowledge and curiosity, seeking the truth that lies hidden within the fabric of reality."
Physicist 2: "To the stars and beyond, my friend! May our quest for knowledge illuminate the path ahead."
Physicist 1: "To the stars and beyond!"
```
We find that Phi-2 captures not only the kind of subject matter two physicists might discuss, but also the Shakespearean tone, word choice, and style!
Limitations of Phi-2
According to Microsoft, Phi-2 has limitations such as inaccurate code and facts, and unreliable responses to instructions. In this section, we provide a few quick examples that probe Phi-2's limitations in code generation accuracy and response reliability.
Inaccurate code generation
Here we test whether Phi-2 will generate code that calculates a square root using a nonexistent, incorrect package (advancedmath).
```python
raw_inputs = '''
Instruct: Create a function in Python that calculates the square root of a number using the 'advancedmath' package
Output:
'''
print(run_inference(raw_inputs))
```
Output:
```
Generated in 18.45 seconds

Instruct: Create a function in Python that calculates the square root of a number using the 'advancedmath' package

import advancedmath

def calculate_square_root(number):
    return advancedmath.sqrt(number)

number = 25
result = calculate_square_root(number)
print(result)
```
We see that Phi-2 does not tell us that the `advancedmath` package doesn't exist. Instead, it confidently provides a plausible-looking answer.
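One lightweight guard against this failure mode is to check that any package the model names can actually be imported before running its code. Here is a minimal sketch using only the standard library:

```python
import importlib.util

def package_available(name: str) -> bool:
    """Return True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None

# The real 'math' module is found; the hallucinated 'advancedmath' is not.
print(package_available("math"))          # True
print(package_available("advancedmath"))  # False
```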
Response reliability
Next, we test Phi-2's ability to summarize and explain the technical paper LoRA: Low-Rank Adaptation of Large Language Models, which was published on June 17, 2021.
```python
raw_inputs = '''
Summarize the paper "LoRA: Low-Rank Adaptation of Large Language Models" and explain the method in details.
'''
print(run_inference(raw_inputs))
```
Output:
```
Generated in 18.52 seconds

Summarize the paper "LoRA: Low-Rank Adaptation of Large Language Models" and explain the method in details.

A: The paper "LoRA: Low-Rank Adaptation of Large Language Models" proposes a method for adapting large language models (LLMs) to specific tasks by reducing their size and complexity. The method, called Low-Rank Adaptation (LoRA), involves projecting the input text onto a lower-dimensional space using a low-rank matrix factorization (LRMF) technique. The resulting low-rank representation is then used to train a new LLM on the reduced data, which is found to perform well on the target task. The paper evaluates the effectiveness of LoRA on several benchmark tasks, including text classification, question answering, and machine translation, and shows that it achieves state-of-the-art performance while significantly reducing the computational cost of training LLMs.
```
We see that the model's explanation of LoRA is incorrect: the method is about reducing the number of trainable parameters, not projecting the input data onto a lower-dimensional space. This is most likely because the model was not exposed to the LoRA paper during training.
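For contrast, the actual LoRA idea keeps the pretrained weight matrix frozen and learns a small low-rank update on top of it, so only the two low-rank matrices are trained. The following is a simplified, illustrative sketch of that idea, not the paper's reference implementation; the class name, rank, and scaling values are our own choices:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (simplified sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen base output plus the scaled low-rank update x (BA)^T.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

layer = LoRALinear(nn.Linear(2560, 2560))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable}")  # far fewer than the 2560 x 2560 base weight
```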