There was an identical competition before that I didn't enter; it has now been relaunched as an open-ended competition with no deadline.
Training code: fine_tune | Kaggle
Inference code: https://www.kaggle.com/code/linheshen/inference-llama-3-8b?scriptVersionId=219332972
Packages: package_all | Kaggle
Project Overview
The task is to predict which responses users will prefer in a head-to-head battle between chatbots powered by large language models (LLMs). We are given a dataset of conversations from Chatbot Arena, where different LLMs generate answers to user prompts. By developing a winning machine learning model, we help improve how chatbots interact with humans and ensure they align better with human preferences. In essence, this is a sequence classification task.
Data
id - A unique identifier for the row.
model_[a/b] - The identity of model_[a/b]. Included in train.csv but not test.csv.
prompt - The prompt that was given as an input (to both models).
response_[a/b] - The response from model_[a/b] to the given prompt.
winner_model_[a/b/tie] - Binary columns marking the judge's selection. The ground truth target column.
These are the columns of train.csv; tie means a draw. For each prompt, model_a and model_b each produce a response, and the task is to classify which model is the winner. The data looks like this:
The prediction target looks like the following: for each row, we predict a probability distribution over the three outcomes.
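To make the target concrete, here is a tiny sketch (with made-up logits, purely illustrative) of how a 3-way classifier's output becomes the predicted distribution:

import torch

# Hypothetical logits for one row from a 3-way sequence classifier
logits = torch.tensor([1.2, 0.3, -0.5])
# Softmax turns them into probabilities that sum to 1; the three columns
# correspond to [winner_model_a, winner_model_b, winner_tie]
probs = torch.softmax(logits, dim=-1)
print(probs)  # tensor([0.6293, 0.2558, 0.1149]), sums to 1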
Installing and Importing Packages
from IPython.display import clear_output
!pip install --upgrade pip
!pip install -qq peft==0.6.0
!pip install -qq bitsandbytes==0.41.1
!pip install -qq accelerate==0.24.1
!pip install -qq transformers==4.35.0
!pip install -qq torch~=2.1.0 --index-url https://download.pytorch.org/whl/cpu -q
!pip install -qq torch_xla[tpu]~=2.1.0 -f https://storage.googleapis.com/libtpu-releases/index.html -q
!pip uninstall -qq tensorflow -y # If we don't do this, TF will take over TPU and cause permission error for PT
!cp /kaggle/input/utils-xla/spmd_util.py . # From this repo: https://github.com/HeegyuKim/torch-xla-SPMD
!pip install numpy==1.23.5 # or pick another version from the 1.22.x series
clear_output()
import os
import gc
import re
from time import time
import random
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
import torch
import transformers
from sklearn.metrics import accuracy_score
from transformers import AutoTokenizer, LlamaModel, LlamaForSequenceClassification
from peft import get_peft_config, PeftModel, PeftConfig, get_peft_model, LoraConfig, TaskType
import torch.nn.functional as F
import torch_xla.debug.profiler as xp
import torch_xla.core.xla_model as xm
import torch_xla.experimental.xla_sharding as xs
import torch_xla.runtime as xr
xr.use_spmd()
from torch_xla.experimental.xla_sharded_tensor import XLAShardedTensor
from torch_xla.experimental.xla_sharding import Mesh
from spmd_util import partition_module

tqdm.pandas()
print(f'Torch Version: {torch.__version__}')
Llama 3 is fine-tuned here. With Llama 3.1 I kept hitting errors; the fix suggested online is to upgrade transformers, but that did not work for me.
Configuration
class CFG:
    NUM_EPOCHS = 1
    BATCH_SIZE = 16
    DROPOUT = 0.05
    MODEL_NAME = '/kaggle/input/llama-3/transformers/8b-chat-hf/1'
    SEED = 2024
    MAX_LENGTH = 1024
    NUM_WARMUP_STEPS = 128
    LR_MAX = 5e-5
    NUM_LABELS = 3
    LORA_RANK = 4
    LORA_ALPHA = 8
    LORA_MODULES = ['o_proj', 'v_proj']

DEVICE = xm.xla_device()  # Initialize TPU device
Training runs on a TPU, with the configuration shown above.
LoRA
In practice we cannot tune the entire model; fine-tuning an LLM here only draws on its existing knowledge while training a small fraction of the parameters. LoRA (Low-Rank Adaptation) is a technique for fine-tuning large pre-trained language models. It introduces low-rank matrices to model the small changes to the pre-trained parameters, allowing task-specific adaptation while preserving most of the pre-trained knowledge. Its main advantages are that it greatly reduces the number of trainable parameters, which lowers memory requirements and training cost, and that it adapts to new tasks faster.
LoRA treats fine-tuning as learning a change in parameters: freeze the model's parameters, then learn the changes to those parameters that make the model perform better on the fine-tuning task.
The LoRA Fine-Tuning Workflow
First, freeze the model's parameters. They are still used for inference, but they are never updated. Then create two matrices which, when multiplied together, have the same shape as the weight matrix of the model being fine-tuned. A large model has many weight matrices, and one such pair is created for each of them.
LoRA calls these matrices "A" and "B". Together they represent the learnable parameters of the LoRA fine-tuning process.
The input is passed through both the frozen weights and the change matrices.
The loss is computed from the combination of the two outputs, and matrices A and B are then updated according to the loss.
The change matrices are computed on the fly and never stored, which is why LoRA's memory footprint is so small. During training, only the model parameters, the matrices A and B, and the gradients of A and B are kept in memory.
We repeat this until the change matrices have been optimized for the fine-tuning task. The backpropagation step that updates A and B is much faster than updating the full set of model parameters, because A and B are far smaller. This is why, despite performing more operations per step, LoRA is usually still faster than conventional fine-tuning.
When we finally want to run inference with the fine-tuned model, we simply compute the change matrix once and add the changes to the weights. This means LoRA does not change the model's inference time. A small sketch of the whole idea follows.
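To make this concrete, here is a minimal, self-contained PyTorch sketch of the idea. The class name LoRALinear and the initialization values are illustrative assumptions; this is not the peft implementation that the notebook actually uses:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: frozen base weight + trainable low-rank change."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # zero init
        self.scaling = alpha / r                   # lora_alpha / r

    def forward(self, x):
        # Frozen path plus scaled low-rank path; B @ A has the same shape as base.weight
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

# Usage: wrap one projection of the network
layer = LoRALinear(nn.Linear(4096, 4096), r=4, alpha=8)

Because B is zero-initialized, the change matrix B @ A starts at zero, so the wrapped layer initially behaves exactly like the frozen original.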
LoRA Rank
LoRA has a hyperparameter called the rank, which describes the depth of the A and B matrices used to construct the change matrices discussed above. Higher values mean larger A and B matrices, which can encode more linearly independent information into the change matrix. It is typically set to 1, 4, or 8.
The r parameter can be viewed as an "information bottleneck". A smaller r means A and B encode less information with a smaller memory footprint; a larger r means they can encode more information, at the cost of a larger memory footprint.
Factorized A and B matrices of different ranks produce change matrices of the same size, but a rank of 2 can encode more linearly independent information into the change matrix than a rank of 1, because the A and B matrices hold more information. The core assumption of the LoRA paper, that changes to the model's parameters have a low intrinsic rank, turns out to hold up remarkably well: the researchers at Microsoft (who published LoRA) tried several values and found that even rank-one matrices perform excellently.
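A quick back-of-the-envelope check of the savings (the 4096 hidden size is an assumption chosen only to make the arithmetic concrete; actual projection shapes in Llama 3 vary by module):

d = 4096                  # illustrative hidden size
r = 4                     # the LoRA rank used in this notebook
full_params = d * d       # full d x d weight matrix: 16,777,216 parameters
lora_params = 2 * d * r   # A is (r x d), B is (d x r): 32,768 parameters
print(f'LoRA trains {lora_params / full_params:.3%} of the full matrix')  # 0.195%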
LORA_ALPHA
LORA_ALPHA is a scaling factor that controls how strongly the low-rank matrices influence the original pre-trained parameters. It adjusts the weight given to the low-rank matrices within the model: a larger LORA_ALPHA means the low-rank update has a greater effect on the model's output.
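A minimal illustration of that scaling (conceptual, not the exact peft internals):

# peft scales the low-rank product by lora_alpha / r before adding it to
# the frozen weight: W_eff = W + (lora_alpha / r) * (B @ A)
scaling = CFG.LORA_ALPHA / CFG.LORA_RANK  # 8 / 4 = 2.0 with this notebook's config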
Training
Fixing the Random Seed
def set_seeds(seed):
    """Set seeds for reproducibility."""
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    # Set seed for all TPU cores
    xm.set_rng_state(seed, device=xm.xla_device())

set_seeds(seed=CFG.SEED)
Data Preprocessing
tokenizer = AutoTokenizer.from_pretrained(CFG.MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'
tokenizer.add_eos_token = True

# save tokenizer to load offline during inference
tokenizer.save_pretrained('tokenizer')
def get_token_lengths(texts):
    # tokenize and receive input_ids for each text
    input_ids = tokenizer(texts.tolist(), return_tensors='np')['input_ids']
    # return length of input_ids for each text
    return [len(t) for t in input_ids]
train = pd.read_csv('/kaggle/input/llm-classification-finetuning/train.csv')
def process(input_str):
    stripped_str = input_str.strip('[]')
    sentences = [s.strip('"') for s in stripped_str.split('","')]
    return ' '.join(sentences)

train.loc[:, 'prompt'] = train['prompt'].apply(process)
train.loc[:, 'response_a'] = train['response_a'].apply(process)
train.loc[:, 'response_b'] = train['response_b'].apply(process)

# Drop 'null' responses for training
indexes = train[(train.response_a == 'null') & (train.response_b == 'null')].index
train.drop(indexes, inplace=True)
train.reset_index(inplace=True, drop=True)
print(f"Total {len(indexes)} Null response rows dropped")
print('Total train samples: ', len(train))
train['text'] = 'User prompt: ' + train['prompt'] + '\n\nModel A :\n' + train['response_a'] +'\n\n--------\n\nModel B:\n' + train['response_b']
print(train['text'][4])
train.loc[:, 'token_count'] = get_token_lengths(train['text'])

# prepare label for model
train.loc[:, 'label'] = np.argmax(train[['winner_model_a','winner_model_b','winner_tie']].values, axis=1)

# Display data
display(train.head())
Label Distribution
train.label.value_counts()
Tokenization
tokens = tokenizer(train['text'].tolist(), padding='max_length', max_length=CFG.MAX_LENGTH, truncation=True, return_tensors='np')

# Input IDs are the token IDs
INPUT_IDS = tokens['input_ids']
# Attention Masks to Ignore Padding Tokens
ATTENTION_MASKS = tokens['attention_mask']
# Label of Texts
LABELS = train[['winner_model_a','winner_model_b','winner_tie']].values

print(f'INPUT_IDS shape: {INPUT_IDS.shape}, ATTENTION_MASKS shape: {ATTENTION_MASKS.shape}')
print(f'LABELS shape: {LABELS.shape}')
Building the Dataset and Model
def train_dataset(batch_size):
    N_SAMPLES = LABELS.shape[0]
    IDXS = np.arange(N_SAMPLES - (N_SAMPLES % batch_size))
    while True:
        # Shuffle indices
        np.random.shuffle(IDXS)
        # Iterate over all indices once
        for idxs in IDXS.reshape(-1, batch_size):
            input_ids = torch.tensor(INPUT_IDS[idxs]).to(DEVICE)
            attention_mask = torch.tensor(ATTENTION_MASKS[idxs]).to(DEVICE)
            labels = torch.tensor(LABELS[idxs]).to(DEVICE)  # Multi-label output
            # Shard over TPU nodes if applicable (you need to define mesh appropriately)
            xs.mark_sharding(input_ids, mesh, (0, 1))
            xs.mark_sharding(attention_mask, mesh, (0, 1))
            xs.mark_sharding(labels, mesh, (0, 1))
            yield input_ids, attention_mask, labels

TRAIN_DATASET = train_dataset(CFG.BATCH_SIZE)
base_model = LlamaForSequenceClassification.from_pretrained(
    CFG.MODEL_NAME,
    num_labels=CFG.NUM_LABELS,
    torch_dtype=torch.bfloat16)
base_model.config.pretraining_tp = 1

# Assign padding token
base_model.config.pad_token_id = tokenizer.pad_token_id
lora_config = LoraConfig(
    r=CFG.LORA_RANK,            # the dimension of the low-rank matrices
    lora_alpha=CFG.LORA_ALPHA,  # scaling factor for LoRA activations vs pre-trained weight activations
    lora_dropout=CFG.DROPOUT,
    bias='none',
    inference_mode=False,
    task_type=TaskType.SEQ_CLS,
    target_modules=CFG.LORA_MODULES)  # Only use output and value projections
model = get_peft_model(base_model, lora_config)
# Trainable Parameters
model.print_trainable_parameters()
num_devices = xr.global_runtime_device_count()
mesh_shape = (1, num_devices, 1)
device_ids = np.array(range(num_devices))
mesh = Mesh(device_ids, mesh_shape, ('dp', 'fsdp', 'mp'))
# distribute model
partition_module(model, mesh)
print(f'num_devices: {num_devices}')
MODEL_LAYERS_ROWS = []
TRAINABLE_PARAMS = []
N_TRAINABLE_PARAMS = 0

for name, param in model.named_parameters():
    # Layer parameter count
    n_parameters = int(torch.prod(torch.tensor(param.shape)))
    # Only trainable layers
    if param.requires_grad:
        # Add layer information
        MODEL_LAYERS_ROWS.append({
            'param': n_parameters,
            'name': name,
            'dtype': param.data.dtype,
        })
        # Append trainable parameter
        TRAINABLE_PARAMS.append({'params': param})
        # Add number of trainable parameters
        N_TRAINABLE_PARAMS += n_parameters

display(pd.DataFrame(MODEL_LAYERS_ROWS))
print(f"""
===============================
N_TRAINABLE_PARAMS: {N_TRAINABLE_PARAMS:,}
N_TRAINABLE_LAYERS: {len(TRAINABLE_PARAMS)}
===============================
""")
N_SAMPLES = len(train)
STEPS_PER_EPOCH = N_SAMPLES // CFG.BATCH_SIZE

OPTIMIZER = torch.optim.AdamW(model.parameters(), lr=CFG.LR_MAX)

# Cosine learning rate schedule with warmup
lr_scheduler = transformers.get_cosine_schedule_with_warmup(
    optimizer=OPTIMIZER,
    num_warmup_steps=CFG.NUM_WARMUP_STEPS,
    num_training_steps=STEPS_PER_EPOCH * CFG.NUM_EPOCHS)

print(f'BATCH_SIZE: {CFG.BATCH_SIZE}, N_SAMPLES: {N_SAMPLES}, STEPS_PER_EPOCH: {STEPS_PER_EPOCH}')
for state in OPTIMIZER.state.values():
    for k, v in state.items():
        if isinstance(v, torch.Tensor) and state[k].dtype is not torch.float32:
            state[k] = v.to(dtype=torch.float32)
input_ids, attention_mask, labels = next(TRAIN_DATASET)

print(f'input_ids shape: {input_ids.shape}, dtype: {input_ids.dtype}')
print(f'attention_mask shape: {attention_mask.shape}, dtype: {attention_mask.dtype}')
print(f'labels shape: {labels.shape}, dtype: {labels.dtype}')
Start Training
model.train()

# Loss function: cross entropy
LOSS_FN = torch.nn.CrossEntropyLoss().to(dtype=torch.float32)
st = time()
warnings.filterwarnings("error")
METRICS = {'loss': [], 'accuracy': {'y_true': [], 'y_pred': []}}

for epoch in tqdm(range(CFG.NUM_EPOCHS)):
    ste = time()
    for step in range(STEPS_PER_EPOCH):
        # Zero out gradients
        OPTIMIZER.zero_grad()
        # Get batch
        input_ids, attention_mask, labels = next(TRAIN_DATASET)
        # Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        # Logits in float32
        logits = outputs.logits.to(dtype=torch.float32)
        # Backward pass
        loss = LOSS_FN(logits, labels.to(dtype=torch.float32))
        loss.backward()
        # Optimizer step
        OPTIMIZER.step()
        xm.mark_step()
        # Update learning rate scheduler
        lr_scheduler.step()
        # Update metrics and progress bar
        METRICS['loss'].append(float(loss))
        METRICS['accuracy']['y_true'] += labels.squeeze().tolist()
        METRICS['accuracy']['y_pred'] += torch.argmax(F.softmax(logits, dim=-1), dim=1).cpu().tolist()
        if (step + 1) % 200 == 0:
            metrics = 'µ_loss: {:.3f}'.format(np.mean(METRICS['loss']))
            metrics += ', step_loss: {:.3f}'.format(METRICS['loss'][-1])
            metrics += ', µ_acc: {:.3f}'.format(accuracy_score(
                torch.argmax(torch.tensor(METRICS['accuracy']['y_true']), axis=-1),
                METRICS['accuracy']['y_pred']))
            lr = OPTIMIZER.param_groups[0]['lr']
            print(f'{epoch+1:02}/{CFG.NUM_EPOCHS:02} | {step+1:04}/{STEPS_PER_EPOCH} lr: {lr:.2E}, {metrics}', end='')
    print(f'\nSteps per epoch: {step+1} complete | Time elapsed: {time() - st}')
    print(f'\nEpoch {epoch+1} completed | Total time for epoch: {time() - ste}')
    # If stopped, save model and optimizer so training can resume on TPU later
    xm.save({k: v.cpu() for k, v in model.named_parameters() if v.requires_grad},
            f'model_llama_3_cp_{epoch+1}_v1.pth')
    xm.save(OPTIMIZER.state_dict(), f'optimizer_llama_3_cp_{epoch+1}_v1.pth')
    print(f'Model saved at epoch {epoch+1} | Elapsed time: {time() - st}')
Training Results
plt.figure(figsize=(15, 6))
plt.plot(METRICS['loss'])
plt.xlabel('Step per epoch')
plt.ylabel('Loss')
plt.title('Loss Plot step per epoch')
plt.show()
Saving the Model
Only the trainable (LoRA) parameters are saved, which keeps the checkpoint small; at inference they are loaded on top of the base model with strict=False.
model = model.cpu()
torch.save(dict([(k,v) for k, v in model.named_parameters() if v.requires_grad]), 'llama_3_finetuned_model.pth')
Inference
Inference runs on two T4 GPUs: the test set is split in half, and each half is scored in parallel by a copy of the model on its own GPU.
!pip install -q -U bitsandbytes --no-index --find-links /kaggle/input/package-all/
!pip install -q -U transformers --no-index --find-links /kaggle/input/package-all/
!pip install -q -U tokenizers --no-index --find-links /kaggle/input/package-all/
!pip install -q -U peft --no-index --find-links /kaggle/input/package-all/
import torch
import sklearn
import numpy as np
import pandas as pd
import time

from transformers import AutoTokenizer, LlamaModel, LlamaForSequenceClassification, BitsAndBytesConfig
from peft import get_peft_config, PeftModel, PeftConfig, get_peft_model, LoraConfig, TaskType
from torch.cuda.amp import autocast
from threading import Thread

torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)

if not torch.cuda.is_available():
    print("Sorry - GPU required!")
MODEL_NAME = '/kaggle/input/llama-3/transformers/8b-chat-hf/1'
WEIGHTS_PATH = '/kaggle/input/llama-fine-tune/llama_3_finetuned_model.pth'
MAX_LENGTH = 1024
BATCH_SIZE = 8
DEVICE = torch.device("cuda")
test = pd.read_csv('/kaggle/input/llm-classification-finetuning/test.csv')
sample_sub = pd.read_csv('/kaggle/input/llm-classification-finetuning/sample_submission.csv')

# concatenate strings in list
def process(input_str):
    stripped_str = input_str.strip('[]')
    sentences = [s.strip('"') for s in stripped_str.split('","')]
    return ' '.join(sentences)

test.loc[:, 'prompt'] = test['prompt'].apply(process)
test.loc[:, 'response_a'] = test['response_a'].apply(process)
test.loc[:, 'response_b'] = test['response_b'].apply(process)

display(sample_sub)
display(test.head(5))
test['text'] = 'User prompt: ' + test['prompt'] + '\n\nModel A :\n' + test['response_a'] +'\n\n--------\n\nModel B:\n' + test['response_b']
print(test['text'][0])
%%time
tokenizer = AutoTokenizer.from_pretrained('/kaggle/input/llama-fine-tune/tokenizer')
tokens = tokenizer(test['text'].tolist(), padding='max_length',
                   max_length=MAX_LENGTH, truncation=True, return_tensors='pt')
INPUT_IDS = tokens['input_ids'].to(DEVICE, dtype=torch.int32)
ATTENTION_MASKS = tokens['attention_mask'].to(DEVICE, dtype=torch.int32)

# Move tensors to CPU and convert them to lists
input_ids_cpu = [tensor.cpu().tolist() for tensor in INPUT_IDS]
attention_masks_cpu = [tensor.cpu().tolist() for tensor in ATTENTION_MASKS]

data = pd.DataFrame()
data['INPUT_IDS'] = input_ids_cpu
data['ATTENTION_MASKS'] = attention_masks_cpu
data[:2]
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    bnb_8bit_compute_dtype=torch.float16,
    bnb_8bit_use_double_quant=False)

# Load base model on GPU 0
device0 = torch.device('cuda:0')
base_model_0 = LlamaForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=3,
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
    device_map='cuda:0')
base_model_0.config.pad_token_id = tokenizer.pad_token_id

# Load base model on GPU 1
device1 = torch.device('cuda:1')
base_model_1 = LlamaForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=3,
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
    device_map='cuda:1')
base_model_1.config.pad_token_id = tokenizer.pad_token_id
peft_config = LoraConfig(
    r=4,
    lora_alpha=8,
    lora_dropout=0.05,
    bias='none',
    inference_mode=True,
    task_type=TaskType.SEQ_CLS,
    target_modules=['o_proj', 'v_proj'])
# Get peft
model_0 = get_peft_model(base_model_0, peft_config).to(device0)
# Load weights
model_0.load_state_dict(torch.load(WEIGHTS_PATH), strict=False)
model_0.eval()

model_1 = get_peft_model(base_model_1, peft_config).to(device1)
model_1.load_state_dict(torch.load(WEIGHTS_PATH), strict=False)
model_1.eval()
model_0.print_trainable_parameters(), model_1.print_trainable_parameters()
import gc
gc.collect()
def inference(df, model, device, batch_size=BATCH_SIZE):
    input_ids = torch.tensor(df['INPUT_IDS'].values.tolist(), dtype=torch.long)
    attention_mask = torch.tensor(df['ATTENTION_MASKS'].values.tolist(), dtype=torch.long)
    generated_class_a = []
    generated_class_b = []
    generated_class_c = []
    model.eval()
    for start_idx in range(0, len(df), batch_size):
        end_idx = min(start_idx + batch_size, len(df))
        batch_input_ids = input_ids[start_idx:end_idx].to(device)
        batch_attention_mask = attention_mask[start_idx:end_idx].to(device)
        with torch.no_grad():
            with autocast():
                outputs = model(input_ids=batch_input_ids,
                                attention_mask=batch_attention_mask)
        probabilities = torch.softmax(outputs.logits, dim=-1).cpu().numpy()
        generated_class_a.extend(probabilities[:, 0])
        generated_class_b.extend(probabilities[:, 1])
        generated_class_c.extend(probabilities[:, 2])
    df['winner_model_a'] = generated_class_a
    df['winner_model_b'] = generated_class_b
    df['winner_tie'] = generated_class_c
    torch.cuda.empty_cache()
    return df
st = time.time()
N_SAMPLES = len(data)

# Split the data into two subsets
half = round(N_SAMPLES / 2)
sub1 = data.iloc[0:half].copy()
sub2 = data.iloc[half:N_SAMPLES].copy()

# Function to run inference in a thread
def run_inference(df, model, device, results, index):
    results[index] = inference(df, model, device)

# Dictionary to store results from threads
results = {}

# Start threads
t0 = Thread(target=run_inference, args=(sub1, model_0, device0, results, 0))
t1 = Thread(target=run_inference, args=(sub2, model_1, device1, results, 1))
t0.start()
t1.start()

# Wait for all threads to finish
t0.join()
t1.join()

# Combine results back into the original DataFrame
data = pd.concat([results[0], results[1]], axis=0)
print(f"Processing complete. Total time: {time.time() - st}")
TARGETS = ['winner_model_a', 'winner_model_b', 'winner_tie']

sample_sub[TARGETS] = data[TARGETS]
display(sample_sub)
sample_sub.to_csv('submission.csv', index=False)
The final result is as follows: