- LLM evaluation
- Instruction fine-tuning
- Data
- Deployment
- Fine-tuning customizes a pretrained LLM for a specific domain by updating most or all of its parameters with a domain-specific dataset. This approach is resource-intensive but yields high accuracy for specialized use cases.
- PEFT modifies a pretrained LLM with fewer parameter updates, focusing on a subset of the model. It strikes a balance between accuracy and resource usage, offering improvements over prompt engineering with manageable data and computational demands.
- Prompt engineering manipulates the input to an LLM to steer its output, without altering the model’s parameters. It’s the least resource-intensive method, suitable for applications with limited data and computational resources.
- RAG enhances LLM prompts with information from external databases, effectively a sophisticated form of prompt engineering.
Types
- Full fine-tuning
  - Fine-tuning a 7B model needs roughly 3 × 28 GB of GPU memory (weights, gradients, and momentum with SGD + Momentum), so at least two A100 cards are required; see the arithmetic sketch after this list.
- Parameter-efficient fine-tuning
  - LoRA
- Quantized fine-tuning
  - QLoRA: quantize the model weights to 4-bit precision and use a paged optimizer to absorb memory spikes
- Single GPU, single-node multi-GPU, multi-node multi-GPU
- Different base models
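A quick back-of-the-envelope check of the full fine-tuning estimate above (a sketch that counts only fp32 weights, gradients, and SGD momentum; activations and framework overhead are ignored, so real usage is higher):
params = 7e9                    # 7B parameters
bytes_per_param = 4             # fp32
weights_gb = params * bytes_per_param / 1e9    # 28 GB for the weights alone
total_gb = 3 * weights_gb                      # weights + gradients + momentum = 84 GB
print(weights_gb, total_gb)     # 28.0 84.0 -> more than one 80 GB A100, hence at least two cards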
Data
Requirements
- With chain-of-thought
- With examples
- High-quality prompts paired with high-quality responses
- Coverage: prompt diversity; the more diverse, the better
- Selection: different models may need different "good" data. A very strong base model needs only a tiny amount of SFT data because it already covers most cases, whereas a weaker base model simply does not understand many prompts and needs a large amount of SFT data to make up for its innate deficiencies.
  - Quality evaluation: quality score, reward model
  - k-Center Greedy
  - Augmented Data Selection
  - IFD metric (see the sketch after this list)
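A rough illustration of the IFD (Instruction-Following Difficulty) idea: compare the model's loss on the response with and without the instruction as context. This is a hedged sketch, not the paper's exact implementation; the function name and the use of a loss ratio as a proxy for the perplexity ratio are simplifications of my own.
import torch

def ifd_score(model, tokenizer, instruction, response):
    """Higher score = the instruction helps less, i.e. a harder / more informative SFT sample."""
    def response_loss(prefix):
        resp_ids = tokenizer(response, return_tensors="pt").input_ids
        if prefix:
            prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
            input_ids = torch.cat([prefix_ids, resp_ids], dim=1)
            labels = input_ids.clone()
            labels[:, : prefix_ids.shape[1]] = -100   # score only the response tokens
        else:
            input_ids = resp_ids
            labels = resp_ids.clone()
        with torch.no_grad():
            return model(input_ids=input_ids, labels=labels).loss.item()
    return response_loss(instruction) / response_loss("")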
Format
wget https://raw.githubusercontent.com/baichuan-inc/Baichuan2/main/fine-tune/data/belle_chat_ramdon_10k.json
wget https://huggingface.co/datasets/BelleGroup/school_math_0.25M/resolve/main/school_math_0.25M.json
import json
data_path = "/root/autodl-tmp/belle_chat_ramdon_10k.json"
x = json.load(open(data_path))
print(x[0])
print(x[1])
{'id': '77771', 'conversations': [{'from': 'human', 'value': '写一篇关于气候变化对海洋生态的影响的文章。\n'}, {'from': 'gpt', 'value': '好的,以下是你要求的文章:\n气候变化对海洋生态的影响\n气候变化对海洋生态产生了广泛的影响。全球变暖导致海洋温度上升,这可能会对许多水生生物的生存、繁殖和迁移造成巨大的影响。另外,海洋酸化也是一个问题,因为大气中二氧化碳的增加导致海水中的pH值下降,使得许多生物的外骨骼变得脆弱,从而影响它们的生存和繁殖。\n此外,气候变化还会引起海平面上升,这不仅对海岸线和沿海居民产生影响,还会对许多海洋生物的栖息地造成影响。一些岛屿上的陆地生物面临几乎完全消失的风险,而另一些生物范围则会扩大到更广泛的领域,这可能会导致新的竞争和生态平衡的改变。\n总之,气候变化并不只是一个地球气温的问题,它对海洋生态系统产生了广泛和深远的影响。\n'}, {'from': 'human', 'value': '从刚刚的文章中提取出气候变化对海洋生物繁殖的影响。\n'}, {'from': 'gpt', 'value': '文章已经提到过气候变化会对海洋生物的生存、繁殖和迁移造成影响。例如,海洋酸化会使得许多生物的外骨骼变得脆弱,进而影响它们的生存和繁殖。此外,海洋温度上升也可能会对很多生物的生存和繁殖产生负面影响。'}]}
{'id': '27684', 'conversations': [{'from': 'human', 'value': '你好,请问你能帮我查一下明天的天气吗?\n'}, {'from': 'gpt', 'value': '当然,你在哪个城市呢?\n'}, {'from': 'human', 'value': '我在上海。\n'}, {'from': 'gpt', 'value': '好的,根据天气预报,明天上海多云转阴,气温在20到25摄氏度之间。需要我帮你查询其他信息吗?'}]}
In other words, each record is an id plus a conversations list, and conversations is a sequence of from/value turns.
f_write = open("/root/autodl-tmp/school_math_0.25M_conv.json", "w")
data_path = "/root/autodl-tmp/school_math_0.25M.json"
with open(data_path) as f:
    lines = f.readlines()
    num_id = 1
    for line in lines:
        data = json.loads(line)
        print(data)
        # conversations = [{"from": "human", "value": data['instruction']+data['input']},{"from": "assistant", "value": data['output']}]
        # # conversations = [{"from": "human", "value": data['input']},{"from": "assistant", "value": data['target']}]
        # uniq_id = data['id'] if "id" in data else args.dataset_name+"-"+str(num_id)
        # item = {"id":uniq_id, "conversations": conversations}
        # f_write.write(json.dumps(item, ensure_ascii=False)+"\n")
        num_id += 1
        if num_id > 3:
            break
f_write.close()
{'instruction': '题目:小明每天早上花费10分钟时间走到学校,如果小明家距离学校2公里,那么他每分钟走多少米?', 'input': '', 'output': '这是一个关于速度、路程、时间的数学问题。我们可以通过公式:速度=路程÷时间 来解决。\n因为小明每天早上走2公里,所以他的路程为2千米。而他每天早上要花费10分钟时间走到学校,因此他的时间为10分钟,即600秒。\n所以小明每分钟走的距离为 2公里 / 600秒 = 0.0033公里/秒 或 3.3米/秒。\n答案:小明每分钟走3.3米。'}
{'instruction': '题目:今天小明骑自行车从家到学校用了20分钟,回家用了25分钟。如果小明在上学和回家的路上的速度一样,那么他从家到学校的距离是学校到家的距离的百分之几?', 'input': '', 'output': '\n假设小明家到学校的距离为x千米,根据速度等于路程除以时间的公式,可以得出小明的速度为:家到学校的速度 = x / 20,学校到家的速度 = x / 25。因为小明在上学和回家的路上的速度一样,所以有:x / 20 = x / 25,解出 x = 5/4 千米。\n因此,家到学校的距离是学校到家的距离的百分之几,可以通过求比值得到:x / (5/4)x = 4/5 = 0.8,即小明从家到学校的距离是学校到家的距离的百分之80。'}
{'instruction': '题目:\n鹿妈妈买了24个苹果,她想平均分给她的3只小鹿吃,每只小鹿可以分到几个苹果?', 'input': '', 'output': '\n鹿妈妈买了24个苹果,平均分给3只小鹿吃,那么每只小鹿可以分到的苹果数就是总苹果数除以小鹿的只数。\n24÷3=8\n每只小鹿可以分到8个苹果。所以,答案是每只小鹿可以分到8个苹果。'}
Each record has three fields: instruction, input, and output.
The commented-out code above converts this format into the conversation format shown earlier; a runnable version is sketched below.
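A runnable version of that commented-out conversion (a sketch: it targets the Baichuan/BELLE-style conversation format above with "human"/"gpt" roles, so switch "gpt" to "assistant" if your training script expects that role name; the dataset name used for fallback ids is arbitrary, and the output is written as one JSON object per line):
import json

src_path = "/root/autodl-tmp/school_math_0.25M.json"       # one JSON object per line
dst_path = "/root/autodl-tmp/school_math_0.25M_conv.json"
dataset_name = "school_math_0.25M"                          # only used to build fallback ids

with open(src_path) as f_in, open(dst_path, "w") as f_out:
    for num_id, line in enumerate(f_in, start=1):
        data = json.loads(line)
        conversations = [
            {"from": "human", "value": data["instruction"] + data["input"]},
            {"from": "gpt", "value": data["output"]},
        ]
        uniq_id = data.get("id", f"{dataset_name}-{num_id}")
        f_out.write(json.dumps({"id": uniq_id, "conversations": conversations}, ensure_ascii=False) + "\n")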
Evaluation
- Evaluation Harness
- TruthfulQA MC1/2
- BLiMP Causative
- MMLU Global Facts
Vocabulary expansion with sentencepiece
PEFT: parameter-efficient fine-tuning
- LoRA: Low-Rank Adaptation of Large Language Models (ICLR 2022)
- Adapter: Parameter-Efficient Transfer Learning for NLP (ICML 2019)
- Prefix-Tuning: Optimizing Continuous Prompts for Generation (ACL 2021)
Low-rank adapters
- In the image domain, this can be thought of as an adapter for a particular visual style.
- It works essentially because large models are over-parameterized: there is assumed to be a much smaller intrinsic dimension that suffices to represent the task, so adaptation only needs to find and adjust that compact core. For a pretrained weight matrix, the parameter update can therefore be represented by a low-rank decomposition (see the sketch after this list).
- The trainable parameters are on the order of one ten-thousandth of the full parameter count, GPU memory usage drops by about two thirds, and no extra inference latency is introduced.
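A minimal sketch of the low-rank update for a single weight matrix (the shapes and rank are made up for illustration):
import torch

d, k, r = 4096, 4096, 8                  # hypothetical weight shape and LoRA rank
W0 = torch.randn(d, k)                   # frozen pretrained weight
A = torch.randn(r, k) * 0.01             # Gaussian-initialized down-projection
B = torch.zeros(d, r)                    # zero-initialized up-projection, so the update starts at 0
alpha = 16
W_merged = W0 + (alpha / r) * (B @ A)    # what the merged weight would be at inference time

full_params = d * k                      # 16,777,216
lora_params = r * (d + k)                # 65,536, i.e. ~0.4% of this one matrix
print(full_params, lora_params)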
Implementation
- Add a bypass branch next to the original weights and model the parameter update with a low-rank decomposition (project down, then project back up).
- During training the original model is frozen and only the down-projection matrix A and the up-projection matrix B are trained. Because gradients do not have to be computed for the pretrained weights, the training compute drops dramatically.
- At inference time, BA can be added into the original weights, so no extra inference latency is introduced.
- Initialization: A is drawn from a Gaussian and B is initialized to all zeros, so the bypass is a zero matrix at the start of training.
- Tasks can be switched in a plug-and-play fashion: the current task uses W0 + B1A1; subtract the LoRA part and add B2A2 to switch tasks (see the sketch after this list).
- Choosing the rank: for ordinary tasks rank = 1, 2, 4, or 8 is enough; tasks with a large domain gap may need a larger rank.
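The plug-and-play task switch, sketched on a single merged weight (tensors are hypothetical; the alpha/r scaling is omitted for brevity):
import torch

d, k, r = 1024, 1024, 4
W0 = torch.randn(d, k)                           # pretrained weight
B1, A1 = torch.randn(d, r), torch.randn(r, k)    # adapter for task 1
B2, A2 = torch.randn(d, r), torch.randn(r, k)    # adapter for task 2

W = W0 + B1 @ A1              # weights currently merged for task 1
W = W - B1 @ A1 + B2 @ A2     # subtract the old LoRA part, add the new one: now task 2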
Code
- Spread the trainable LoRA parameters across several types of weight matrices rather than spending a larger rank on a single type of weight matrix.
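For example, using the MergedLinear layer shown in the code below (shipped as lora.MergedLinear in the loralib package), LoRA can be applied to just some slices of a fused QKV projection; the hidden size and the choice of slices here are illustrative:
import loralib as lora

hidden = 768
# Adapt the query and value slices of a fused QKV projection with a small rank,
# leaving the key slice frozen, instead of one large rank on a single matrix type.
qkv_proj = lora.MergedLinear(hidden, 3 * hidden, r=8, enable_lora=[True, False, True])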
import math
from typing import List

import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALayer():
    def __init__(
        self,
        r: int,
        lora_alpha: int,
        lora_dropout: float,
        merge_weights: bool,
    ):
        self.r = r
        self.lora_alpha = lora_alpha
        # Optional dropout
        if lora_dropout > 0.:
            self.lora_dropout = nn.Dropout(p=lora_dropout)
        else:
            self.lora_dropout = lambda x: x
        # Mark the weight as unmerged
        self.merged = False
        self.merge_weights = merge_weights
class Embedding(nn.Embedding, LoRALayer):
    # LoRA implemented in a dense layer
    def __init__(
        self,
        num_embeddings: int,
        embedding_dim: int,
        r: int = 0,
        lora_alpha: int = 1,
        merge_weights: bool = True,
        **kwargs
    ):
        nn.Embedding.__init__(self, num_embeddings, embedding_dim, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=0,
                           merge_weights=merge_weights)
        # Actual trainable parameters
        if r > 0:
            self.lora_A = nn.Parameter(self.weight.new_zeros((r, num_embeddings)))
            self.lora_B = nn.Parameter(self.weight.new_zeros((embedding_dim, r)))
            self.scaling = self.lora_alpha / self.r
            # Freezing the pre-trained weight matrix
            self.weight.requires_grad = False
        self.reset_parameters()

    def reset_parameters(self):
        nn.Embedding.reset_parameters(self)
        if hasattr(self, 'lora_A'):
            # for Embedding the roles are swapped relative to Linear: A is initialized to zero and B from a normal distribution
            nn.init.zeros_(self.lora_A)
            nn.init.normal_(self.lora_B)

    def train(self, mode: bool = True):
        nn.Embedding.train(self, mode)
        if mode:
            if self.merge_weights and self.merged:
                # Make sure that the weights are not merged
                if self.r > 0:
                    self.weight.data -= (self.lora_B @ self.lora_A).transpose(0, 1) * self.scaling
                self.merged = False
        else:
            if self.merge_weights and not self.merged:
                # Merge the weights and mark it
                if self.r > 0:
                    self.weight.data += (self.lora_B @ self.lora_A).transpose(0, 1) * self.scaling
                self.merged = True

    def forward(self, x: torch.Tensor):
        if self.r > 0 and not self.merged:
            result = nn.Embedding.forward(self, x)
            after_A = F.embedding(
                x, self.lora_A.transpose(0, 1), self.padding_idx, self.max_norm,
                self.norm_type, self.scale_grad_by_freq, self.sparse
            )
            result += (after_A @ self.lora_B.transpose(0, 1)) * self.scaling
            return result
        else:
            return nn.Embedding.forward(self, x)
class Linear(nn.Linear, LoRALayer):
    # LoRA implemented in a dense layer
    def __init__(
        self,
        in_features: int,
        out_features: int,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.,
        fan_in_fan_out: bool = False,  # Set this to True if the layer to replace stores weight like (fan_in, fan_out)
        merge_weights: bool = True,
        **kwargs
    ):
        nn.Linear.__init__(self, in_features, out_features, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout,
                           merge_weights=merge_weights)
        self.fan_in_fan_out = fan_in_fan_out
        # Actual trainable parameters
        if r > 0:
            self.lora_A = nn.Parameter(self.weight.new_zeros((r, in_features)))
            self.lora_B = nn.Parameter(self.weight.new_zeros((out_features, r)))
            self.scaling = self.lora_alpha / self.r
            # Freezing the pre-trained weight matrix
            self.weight.requires_grad = False
        self.reset_parameters()
        if fan_in_fan_out:
            self.weight.data = self.weight.data.transpose(0, 1)

    def reset_parameters(self):
        nn.Linear.reset_parameters(self)
        if hasattr(self, 'lora_A'):
            # initialize A the same way as the default for nn.Linear and B to zero
            # this is different than what is described in the paper but should not affect performance
            nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B)

    def train(self, mode: bool = True):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        nn.Linear.train(self, mode)
        if mode:
            if self.merge_weights and self.merged:
                # Make sure that the weights are not merged
                if self.r > 0:
                    self.weight.data -= T(self.lora_B @ self.lora_A) * self.scaling
                self.merged = False
        else:
            if self.merge_weights and not self.merged:
                # Merge the weights and mark it
                if self.r > 0:
                    self.weight.data += T(self.lora_B @ self.lora_A) * self.scaling
                self.merged = True

    def forward(self, x: torch.Tensor):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        if self.r > 0 and not self.merged:
            result = F.linear(x, T(self.weight), bias=self.bias)
            result += (self.lora_dropout(x) @ self.lora_A.transpose(0, 1) @ self.lora_B.transpose(0, 1)) * self.scaling
            return result
        else:
            return F.linear(x, T(self.weight), bias=self.bias)
class MergedLinear(nn.Linear, LoRALayer):
    # LoRA implemented in a dense layer
    def __init__(
        self,
        in_features: int,
        out_features: int,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.,
        enable_lora: List[bool] = [False],
        fan_in_fan_out: bool = False,
        merge_weights: bool = True,
        **kwargs
    ):
        nn.Linear.__init__(self, in_features, out_features, **kwargs)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout,
                           merge_weights=merge_weights)
        assert out_features % len(enable_lora) == 0, \
            'The length of enable_lora must divide out_features'
        self.enable_lora = enable_lora
        self.fan_in_fan_out = fan_in_fan_out
        # Actual trainable parameters
        if r > 0 and any(enable_lora):
            self.lora_A = nn.Parameter(
                self.weight.new_zeros((r * sum(enable_lora), in_features)))
            self.lora_B = nn.Parameter(
                self.weight.new_zeros((out_features // len(enable_lora) * sum(enable_lora), r))
            )  # weights for Conv1D with groups=sum(enable_lora)
            self.scaling = self.lora_alpha / self.r
            # Freezing the pre-trained weight matrix
            self.weight.requires_grad = False
            # Compute the indices
            self.lora_ind = self.weight.new_zeros(
                (out_features, ), dtype=torch.bool
            ).view(len(enable_lora), -1)
            self.lora_ind[enable_lora, :] = True
            self.lora_ind = self.lora_ind.view(-1)
        self.reset_parameters()
        if fan_in_fan_out:
            self.weight.data = self.weight.data.transpose(0, 1)

    def reset_parameters(self):
        nn.Linear.reset_parameters(self)
        if hasattr(self, 'lora_A'):
            # initialize A the same way as the default for nn.Linear and B to zero
            nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B)

    def zero_pad(self, x):
        result = x.new_zeros((len(self.lora_ind), *x.shape[1:]))
        result[self.lora_ind] = x
        return result

    def merge_AB(self):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        delta_w = F.conv1d(
            self.lora_A.unsqueeze(0),
            self.lora_B.unsqueeze(-1),
            groups=sum(self.enable_lora)
        ).squeeze(0)
        return T(self.zero_pad(delta_w))

    def train(self, mode: bool = True):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        nn.Linear.train(self, mode)
        if mode:
            if self.merge_weights and self.merged:
                # Make sure that the weights are not merged
                if self.r > 0 and any(self.enable_lora):
                    self.weight.data -= self.merge_AB() * self.scaling
                self.merged = False
        else:
            if self.merge_weights and not self.merged:
                # Merge the weights and mark it
                if self.r > 0 and any(self.enable_lora):
                    self.weight.data += self.merge_AB() * self.scaling
                self.merged = True

    def forward(self, x: torch.Tensor):
        def T(w):
            return w.transpose(0, 1) if self.fan_in_fan_out else w
        if self.merged:
            return F.linear(x, T(self.weight), bias=self.bias)
        else:
            result = F.linear(x, T(self.weight), bias=self.bias)
            if self.r > 0:
                result += self.lora_dropout(x) @ T(self.merge_AB().T) * self.scaling
            return result
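When training with these layers, loralib also provides helpers to freeze everything except the LoRA parameters and to checkpoint only the adapter weights; a small self-contained sketch:
import torch
import torch.nn as nn
import loralib as lora

model = nn.Sequential(
    lora.Linear(128, 256, r=8),
    nn.ReLU(),
    lora.Linear(256, 10, r=8),
)
lora.mark_only_lora_as_trainable(model)                   # freeze all non-LoRA parameters
torch.save(lora.lora_state_dict(model), "lora_ckpt.pt")   # save only the LoRA weights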
The code above shows the underlying mechanism; in practice you use the peft library.
Training / fine-tuning stage
if training_args.use_lora:
    from peft import LoraConfig, TaskType, get_peft_model
    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        target_modules=["W_pack"],  # which modules get LoRA (W_pack is Baichuan's fused QKV projection)
        inference_mode=False,
        r=1,
        lora_alpha=32,
        lora_dropout=0.1,
    )
    model.enable_input_require_grads()
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
If the model is additionally loaded in quantized form:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=bnb_config
)
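When combining a 4-bit model with LoRA (QLoRA-style), peft provides a preparation helper; a sketch assuming a recent peft version:
from peft import prepare_model_for_kbit_training, get_peft_model

model = prepare_model_for_kbit_training(model)   # cast norm layers, enable input grads for k-bit training
model = get_peft_model(model, peft_config)       # then attach the LoRA adapters as above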
Inference stage
model.save_pretrained("trained-model")

from peft import PeftConfig, PeftModel

PEFT_MODEL = "/kaggle/working/trained-model"
config = PeftConfig.from_pretrained(PEFT_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    return_dict=True,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token
model = PeftModel.from_pretrained(model, PEFT_MODEL)
# some generation_config settings are omitted here
# `encoding` is the tokenized prompt, e.g. encoding = tokenizer(prompt, return_tensors="pt")
with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        generation_config=generation_config
    )
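For the omitted generation_config, a typical set of values might look like the following (the numbers are illustrative, not taken from the original):
from transformers import GenerationConfig

generation_config = GenerationConfig(
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.eos_token_id,
)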
Overall workflow
Try this. Basic steps are to:
1/ load the base model
2/ train the base model
3/ save the LoRA adapter
4/ reload the base model at half/full precision
5/ merge the LoRA weights with the base model
6/ save
base_model = AutoModelForCausalLM.from_pretrained("base_model", load_in_8bit=True, torch_dtype=torch.float16, device_map="auto")
base_model = prepare_model_for_int8_training(base_model)
peft_model = get_peft_model(base_model, peft_config)
training_args = TrainingArguments()  # fill in output_dir, batch size, etc.
trainer = Trainer()  # in practice: Trainer(model=peft_model, args=training_args, train_dataset=..., data_collator=...)
trainer.train()
peft_model.save_pretrained("lora_adapter", save_adapter=True, save_config=True)
model_to_merge = PeftModel.from_pretrained(AutoModelForCausalLM.from_pretrained("base_model").to("cuda"), "lora_adapter")
merged_model = model_to_merge.merge_and_unload()
merged_model.save_pretrained("merged_model")
Engineering
- Mixed precision training: roughly halves the GPU memory used for model parameters and gradients
- Enable ZeRO stage 3: model parameters, gradients, and optimizer states are partitioned across GPUs
- Enable cpu_offload: fall back to host memory when GPU memory is insufficient (a DeepSpeed config sketch follows this list)
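A sketch of how these switches typically map onto a DeepSpeed config passed to the HF Trainer (the keys are standard DeepSpeed options; "auto" lets the Trainer fill in matching values):
ds_config = {
    "bf16": {"enabled": "auto"},                  # mixed precision
    "zero_optimization": {
        "stage": 3,                               # partition params, gradients, optimizer states
        "offload_optimizer": {"device": "cpu"},   # spill optimizer states to host RAM
        "offload_param": {"device": "cpu"},       # spill parameters to host RAM when needed
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
# then e.g. TrainingArguments(deepspeed=ds_config, bf16=True, ...)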
Example
Going from simple to complex, we start with the Chinese Baichuan model:
- BELLE data as the training set
- the training loss as a simple evaluation
- gradio/streamlit as a simple demo
Baichuan 2
- Pretrained on 2.6 trillion tokens
- Base and Chat versions at 7B and 13B
- If we have little pretraining data of our own, use the Chat version and prepare the data in the same format used for its original chat fine-tuning
- If we have plenty of our own pretraining data, either version works
- At inference time, the Base model uses model.generate() and the Chat model uses model.chat() (see the sketch after this list)
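A minimal sketch of the two inference entry points (the model name and prompts are placeholders; model.chat() follows the usage documented in the Baichuan 2 repository):
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "baichuan-inc/Baichuan2-7B-Chat"   # or a 7B/13B Base variant
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", trust_remote_code=True)

# Chat model: multi-turn messages via the custom chat() helper
messages = [{"role": "user", "content": "What is the capital of France?"}]
print(model.chat(tokenizer, messages))

# Base model (if loaded instead): plain continuation via generate()
# inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
# print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))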
Evaluation
- C-Eval: a Chinese benchmark covering 52 subjects at four difficulty levels
- MMLU: an English benchmark of 57 tasks spanning elementary mathematics, US history, computer science, law, and more, from high-school to expert difficulty; currently a mainstream LLM benchmark
- CMMLU: a comprehensive Chinese benchmark covering 67 topics
Environment: the ATEC competition provides 30 GPU-hours on A100 80G cards, so we use that budget to try fine-tuning. The rules only allow the ChatGLM2 model.
- QLoRA
- LoRA
- Full-parameter fine-tuning
- DeepSpeed settings
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """
    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    ptuning_checkpoint: str = field(
        default=None, metadata={"help": "Path to p-tuning v2 checkpoints"}
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Where to store the pretrained models downloaded from huggingface.co"},
    )
    use_fast_tokenizer: bool = field(
        default=True,
        metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
    )
    model_revision: str = field(
        default="main",
        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
    )
    use_auth_token: bool = field(
        default=False,
        metadata={
            "help": (
                "Will use the token generated when running `huggingface-cli login` (necessary to use this script "
                "with private models)."
            )
        },
    )
    resize_position_embeddings: Optional[bool] = field(
        default=None,
        metadata={
            "help": (
                "Whether to automatically resize the position embeddings if `max_source_length` exceeds "
                "the model's position embeddings."
            )
        },
    )
    quantization_bit: Optional[int] = field(
        default=None
    )
    pre_seq_len: Optional[int] = field(
        default=None
    )
    prefix_projection: bool = field(
        default=False
    )
@dataclass
class DataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """
    lang: Optional[str] = field(default=None, metadata={"help": "Language id for summarization."})
    dataset_name: Optional[str] = field(
        default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
    )
    dataset_config_name: Optional[str] = field(
        default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
    )
    prompt_column: Optional[str] = field(
        default=None,
        metadata={"help": "The name of the column in the datasets containing the full texts (for summarization)."},
    )
    response_column: Optional[str] = field(
        default=None,
        metadata={"help": "The name of the column in the datasets containing the summaries (for summarization)."},
    )
    history_column: Optional[str] = field(
        default=None,
        metadata={"help": "The name of the column in the datasets containing the history of chat."},
    )
    train_file: Optional[str] = field(
        default=None, metadata={"help": "The input training data file (a jsonlines or csv file)."}
    )
    validation_file: Optional[str] = field(
        default=None,
        metadata={
            "help": (
                "An optional input evaluation data file to evaluate the metrics (rouge) on (a jsonlines or csv file)."
            )
        },
    )
    test_file: Optional[str] = field(
        default=None,
        metadata={
            "help": "An optional input test data file to evaluate the metrics (rouge) on (a jsonlines or csv file)."
        },
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
    )
    preprocessing_num_workers: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the preprocessing."},
    )
    max_source_length: Optional[int] = field(
        default=1024,
        metadata={
            "help": (
                "The maximum total input sequence length after tokenization. Sequences longer "
                "than this will be truncated, sequences shorter will be padded."
            )
        },
    )
    max_target_length: Optional[int] = field(
        default=128,
        metadata={
            "help": (
                "The maximum total sequence length for target text after tokenization. Sequences longer "
                "than this will be truncated, sequences shorter will be padded."
            )
        },
    )
    val_max_target_length: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "The maximum total sequence length for validation target text after tokenization. Sequences longer "
                "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`."
                "This argument is also used to override the ``max_length`` param of ``model.generate``, which is used "
                "during ``evaluate`` and ``predict``."
            )
        },
    )
    pad_to_max_length: bool = field(
        default=False,
        metadata={
            "help": (
                "Whether to pad all samples to model maximum sentence length. "
                "If False, will pad the samples dynamically when batching to the maximum length in the batch. More "
                "efficient on GPU but very bad for TPU."
            )
        },
    )
    max_train_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of training examples to this "
                "value if set."
            )
        },
    )
    max_eval_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
                "value if set."
            )
        },
    )
    max_predict_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of prediction examples to this "
                "value if set."
            )
        },
    )
    num_beams: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "Number of beams to use for evaluation. This argument will be passed to ``model.generate``, "
                "which is used during ``evaluate`` and ``predict``."
            )
        },
    )
    ignore_pad_token_for_loss: bool = field(
        default=True,
        metadata={
            "help": "Whether to ignore the tokens corresponding to padded labels in the loss computation or not."
        },
    )
    source_prefix: Optional[str] = field(
        default="", metadata={"help": "A prefix to add before every source text (useful for T5 models)."}
    )
    forced_bos_token: Optional[str] = field(
        default=None,
        metadata={
            "help": (
                "The token to force as the first generated token after the decoder_start_token_id."
                "Useful for multilingual models like mBART where the first generated token"
                "needs to be the target language token (Usually it is the target language token)"
            )
        },
    )

    def __post_init__(self):
        if self.dataset_name is None and self.train_file is None and self.validation_file is None and self.test_file is None:
            raise ValueError("Need either a dataset name or a training/validation/test file.")
        else:
            if self.train_file is not None:
                extension = self.train_file.split(".")[-1]
                assert extension in ["csv", "json", "jsonl"], "`train_file` should be a csv or a json file."
            if self.validation_file is not None:
                extension = self.validation_file.split(".")[-1]
                assert extension in ["csv", "json", "jsonl"], "`validation_file` should be a csv or a json file."
        if self.val_max_target_length is None:
            self.val_max_target_length = self.max_target_length
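These dataclasses are typically consumed with transformers' HfArgumentParser, roughly as follows (a sketch; whether the third dataclass is TrainingArguments or Seq2SeqTrainingArguments depends on the training script):
from transformers import HfArgumentParser, Seq2SeqTrainingArguments

parser = HfArgumentParser((ModelArguments, DataTrainingArguments, Seq2SeqTrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()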