Chapter 6 - Fine-tuning for classification


  • This chapter covers

    1. Introducing different LLM fine-tuning approaches
    2. Preparing a dataset for text classification
    3. Modifying a pretrained LLM for fine-tuning
    4. Fine-tuning the LLM to identify spam messages
    5. Evaluating the accuracy of the fine-tuned LLM classifier
    6. Using the fine-tuned LLM to classify new data

    We now fine-tune a pretrained large language model on a specific target task, such as text classification. As a concrete example, we will classify text messages as "spam" or "not spam". The figure below shows the two main ways of fine-tuning an LLM:

    1. fine-tuning for classification (step 8), and

    2. fine-tuning to follow instructions (step 9).


6.1-Different categories of fine-tuning

  • The two most common ways to fine-tune a language model are instruction fine-tuning and classification fine-tuning.

    The figure below illustrates instruction fine-tuning.

    Classification fine-tuning trains the model to recognize a specific set of class labels (such as "spam" and "not spam"), turning it into a specialized model focused on a single task; such a model is easier to create and optimize than a general-purpose one. An instruction-fine-tuned model, by contrast, can typically handle many different tasks.

    The key point is that a classification-fine-tuned model can only predict the classes it encountered during training. For example, it can decide whether a text is "spam" or "not spam", as in the figure below showing an LLM used for text classification. A model fine-tuned for spam classification needs no instruction beyond the input itself; unlike an instruction-fine-tuned model, it can only answer "spam" or "not spam" and cannot make any other judgment about the input text.

    In short, instruction fine-tuning improves a model's ability to understand and generate responses across a wide range of tasks, but it requires more data and compute; classification fine-tuning suits precise classification tasks (such as sentiment analysis or spam detection) and needs fewer resources, but the resulting model is limited to the classes it was trained on. A minimal sketch contrasting the two data formats follows below.
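
    To make the contrast concrete, here is a minimal sketch (with made-up example records, not taken from the chapter's dataset) of how the two setups present data to the model: instruction fine-tuning pairs a free-form prompt with a free-form target response, while classification fine-tuning pairs raw text with one of a fixed set of integer labels.

    # Hypothetical illustration only; the field names and examples are invented for this sketch.

    # Instruction fine-tuning: the target is free-form text the model must generate.
    instruction_example = {
        "instruction": "Rewrite the following sentence in passive voice.",
        "input": "The team shipped the new release.",
        "output": "The new release was shipped by the team.",
    }

    # Classification fine-tuning: the target is one of a fixed set of class labels.
    classification_example = {
        "text": "You are a winner! Claim your $1000 prize now.",
        "label": 1,  # 1 = "spam", 0 = "not spam"
    }

    print(instruction_example["output"])
    print("spam" if classification_example["label"] == 1 else "not spam")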


6.2-Preparing the dataset

  • As shown in the figure below, classification fine-tuning of an LLM is a three-stage process:

    1. Dataset preparation.

    2. Model setup.

    3. Fine-tuning and evaluating the model.

  • In this section we prepare the dataset for classification fine-tuning. We use a dataset of spam and non-spam text messages to fine-tune an LLM to classify them. First, we download and unzip the dataset.

    import urllib.request
    import zipfile
    import os
    from pathlib import Path

    url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
    zip_path = "sms_spam_collection.zip"
    extracted_path = "sms_spam_collection"
    data_file_path = Path(extracted_path) / "SMSSpamCollection.tsv"

    def download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path):
        if data_file_path.exists():
            print(f"{data_file_path} already exists. Skipping download and extraction.")
            return

        # Downloading the file
        with urllib.request.urlopen(url) as response:
            with open(zip_path, "wb") as out_file:
                out_file.write(response.read())

        # Unzipping the file
        with zipfile.ZipFile(zip_path, "r") as zip_ref:
            zip_ref.extractall(extracted_path)

        # Add .tsv file extension
        original_file_path = Path(extracted_path) / "SMSSpamCollection"
        os.rename(original_file_path, data_file_path)
        print(f"File downloaded and saved as {data_file_path}")

    download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)
    

    Load the data file into a pandas DataFrame:

    import pandas as pd

    download_and_unzip_spam_data(url, zip_path, extracted_path, data_file_path)

    df = pd.read_csv(data_file_path, sep='\t', header=None, names=["Label", "Text"])
    df
    

    (Figure: preview of the loaded DataFrame with "Label" and "Text" columns.)

    When we check the class distribution, we see that the data contains "ham" (i.e., "not spam") far more frequently than "spam":

    print(df["Label"].value_counts())

    """Output"""
    Label
    ham     4825
    spam     747
    Name: count, dtype: int64
    

    To keep fine-tuning the large model fast, we downsample the dataset (one common way to handle class imbalance) so that each class contains 747 instances:

    def create_balanced_dataset(df):
        # Count the instances of "spam"
        num_spam = df[df["Label"] == "spam"].shape[0]
        # Randomly sample "ham" instances to match the number of "spam" instances
        ham_subset = df[df["Label"] == "ham"].sample(num_spam, random_state=123)
        # Combine ham "subset" with "spam"
        balanced_df = pd.concat([ham_subset, df[df["Label"] == "spam"]], ignore_index=True)
        return balanced_df

    balanced_df = create_balanced_dataset(df)
    print(balanced_df["Label"].value_counts())

    """Output"""
    Label
    ham     747
    spam    747
    Name: count, dtype: int64
    

    Next, we convert the string class labels "ham" and "spam" into the integer class labels 0 and 1:

    balanced_df["Label"] = balanced_df["Label"].map({"ham": 0, "spam": 1})  
    balanced_df
    

    (Figure: preview of balanced_df after mapping the labels to 0 and 1.)

    Now let's define a function that randomly splits the dataset into training, validation, and test subsets: 70% for training, 10% for validation, and 20% for testing:

    def random_split(df, train_frac, validation_frac):
        # Shuffle the entire DataFrame
        df = df.sample(frac=1, random_state=123).reset_index(drop=True)

        # Calculate split indices
        train_end = int(len(df) * train_frac)
        validation_end = train_end + int(len(df) * validation_frac)

        # Split the DataFrame
        train_df = df[:train_end]
        validation_df = df[train_end:validation_end]
        test_df = df[validation_end:]

        return train_df, validation_df, test_df

    train_df, validation_df, test_df = random_split(balanced_df, 0.7, 0.1)
    # Test size is implied to be 0.2 as the remainder

    train_df.to_csv("train.csv", index=None)
    validation_df.to_csv("validation.csv", index=None)
    test_df.to_csv("test.csv", index=None)
    

    We have now downloaded the dataset, balanced its classes, and split it into training, validation, and test sets.


6.3-Creating data loaders

  • Note that the text messages have different lengths; if we want to combine multiple training examples in a batch, we have to either

    1. truncate all messages to the length of the shortest message in the dataset or batch, or
    2. pad all messages to the length of the longest message in the dataset or batch.

    The first option is computationally cheaper, but it may cause significant information loss if the shorter messages are much smaller than the average or longest messages, which can hurt model performance. We therefore choose the second option, which preserves the entire content of all messages. To batch the messages, we pad every message to the length of the longest message in the dataset by adding padding tokens to the shorter ones. For this purpose we use "<|endoftext|>" as the padding token.

    However, rather than appending the string "<|endoftext|>" to each text message directly, we can append the token ID corresponding to "<|endoftext|>" to the encoded text messages, as shown in the figure below.

    50256 is the token ID of the padding token "<|endoftext|>". We can double-check that this token ID is correct by encoding "<|endoftext|>" with the GPT-2 tokenizer from the tiktoken package we used earlier (a small padding sketch follows the check):

    import tiktoken

    tokenizer = tiktoken.get_encoding("gpt2")
    print(tokenizer.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))

    """Output"""
    [50256]
    
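    As a quick illustration of the padding step (a minimal sketch with made-up token IDs, not the actual dataset), padding to the longest sequence in a batch simply appends the token ID 50256 until every sequence has the same length:

    pad_token_id = 50256

    # Hypothetical, already-tokenized messages of different lengths
    batch = [
        [5211, 345, 423, 640],        # short message
        [318, 428, 4731, 257, 1332],  # slightly longer message
        [1212, 318],                  # very short message
    ]

    max_len = max(len(ids) for ids in batch)
    padded = [ids + [pad_token_id] * (max_len - len(ids)) for ids in batch]

    for row in padded:
        print(row)  # every row now has length max_len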

    The SpamDataset class below inherits from PyTorch's Dataset base class (a subclass only needs to implement __init__, __getitem__, and __len__). The class identifies the longest sequence in the training set and pads the other sequences to match that length:

    import torch
    from torch.utils.data import Datasetclass SpamDataset(Dataset):def __init__(self, csv_file, tokenizer, max_length=None, pad_token_id=50256):self.data = pd.read_csv(csv_file)# Pre-tokenize textsself.encoded_texts = [tokenizer.encode(text) for text in self.data["Text"]]if max_length is None:self.max_length = self._longest_encoded_length()else:self.max_length = max_length# Truncate sequences if they are longer than max_lengthself.encoded_texts = [encoded_text[:self.max_length]for encoded_text in self.encoded_texts]# Pad sequences to the longest sequenceself.encoded_texts = [encoded_text + [pad_token_id] * (self.max_length - len(encoded_text))for encoded_text in self.encoded_texts]def __getitem__(self, index):encoded = self.encoded_texts[index]label = self.data.iloc[index]["Label"]return (torch.tensor(encoded, dtype=torch.long),torch.tensor(label, dtype=torch.long))def __len__(self):return len(self.data)def _longest_encoded_length(self):max_length = 0for encoded_text in self.encoded_texts:encoded_length = len(encoded_text)if encoded_length > max_length:max_length = encoded_lengthreturn max_length
    
    train_dataset = SpamDataset(csv_file="train.csv",max_length=None,tokenizer=tokenizer
    )print(train_dataset.max_length)"""输出"""
    120
    

    The code outputs 120, showing that the longest sequence contains no more than 120 tokens, a common length for text messages. Given the context-length limit, the model can handle sequences of up to 1,024 tokens; if the dataset contained longer texts, we could pass max_length=1024 when creating the dataset.

    Next, we pad the validation and test sets to the same length as the longest training sequence. Note that any validation or test example longer than the longest training example is truncated via encoded_text[:self.max_length] in the SpamDataset we defined earlier. This truncation is optional; if no validation or test sequence exceeds 1,024 tokens, we could set max_length to None for both sets.

    val_dataset = SpamDataset(
        csv_file="validation.csv",
        max_length=train_dataset.max_length,
        tokenizer=tokenizer
    )
    test_dataset = SpamDataset(
        csv_file="test.csv",
        max_length=train_dataset.max_length,
        tokenizer=tokenizer
    )
    

    Next, we instantiate data loaders from the datasets. As shown in the figure below, a single training batch consists of eight text messages represented as token IDs, each 120 token IDs long.

    from torch.utils.data import DataLoader

    num_workers = 0
    batch_size = 8

    torch.manual_seed(123)

    train_loader = DataLoader(
        dataset=train_dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        drop_last=True,
    )
    val_loader = DataLoader(
        dataset=val_dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        drop_last=False,
    )
    test_loader = DataLoader(
        dataset=test_dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        drop_last=False,
    )
    

    As a verification step, we iterate over the data loaders and make sure that each batch contains 8 training examples, each consisting of 120 tokens:

    print("Train loader:")
    for input_batch, target_batch in train_loader:
        pass

    print("Input batch dimensions:", input_batch.shape)
    print("Label batch dimensions", target_batch.shape)

    """Output"""
    Train loader:
    Input batch dimensions: torch.Size([8, 120])
    Label batch dimensions torch.Size([8])
    

    Finally, let's print the total number of batches in each dataset (a quick arithmetic check follows the output):

    print(f"{len(train_loader)} training batches")
    print(f"{len(val_loader)} validation batches")
    print(f"{len(test_loader)} test batches")

    """Output"""
    130 training batches
    19 validation batches
    38 test batches
    
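    These counts line up with the split sizes. As a sanity check (a small sketch recomputing the numbers from the 1,494 balanced examples), we can reproduce the 130, 19, and 38 batches:

    total = 747 * 2                   # size of the balanced dataset
    train_n = int(total * 0.7)        # 1045 training examples
    val_n = int(total * 0.1)          # 149 validation examples
    test_n = total - train_n - val_n  # 300 test examples

    batch_size = 8
    print(train_n // batch_size)      # 130 -> drop_last=True discards the final partial batch
    print(-(-val_n // batch_size))    # 19  -> ceiling division; drop_last=False keeps the partial batch
    print(-(-test_n // batch_size))   # 38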

    With that, the data preparation is complete.


6.4-Initializing a model with pretrained weights

  • In this section, we initialize the pretrained model we worked with in the previous chapter.

    Having completed stage 1 (preparing the dataset), we now initialize the LLM that we will then fine-tune to classify spam messages.

    CHOOSE_MODEL = "gpt2-small (124M)"
    INPUT_PROMPT = "Every effort moves"

    BASE_CONFIG = {
        "vocab_size": 50257,     # Vocabulary size
        "context_length": 1024,  # Context length
        "drop_rate": 0.0,        # Dropout rate
        "qkv_bias": True         # Query-key-value bias
    }

    model_configs = {
        "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
        "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
        "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
        "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
    }

    BASE_CONFIG.update(model_configs[CHOOSE_MODEL])

    assert train_dataset.max_length <= BASE_CONFIG["context_length"], (
        f"Dataset length {train_dataset.max_length} exceeds model's context "
        f"length {BASE_CONFIG['context_length']}. Reinitialize data sets with "
        f"`max_length={BASE_CONFIG['context_length']}`"
    )
    

    Next, we import the download_and_load_gpt2 function from the gpt_download.py file and reuse the GPTModel class and load_weights_into_gpt function from pretraining (chapter 5) to load the downloaded weights into the GPT model.

    from gpt_download import download_and_load_gpt2
    from previous_chapters import GPTModel, load_weights_into_gpt

    model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")
    settings, params = download_and_load_gpt2(
        model_size=model_size,
        models_dir="E:\\LLM\\gpt2\\"  # models_dir is the local directory where the GPT-2 weights were downloaded
    )

    model = GPTModel(BASE_CONFIG)
    load_weights_into_gpt(model, params)
    model.eval();
    
  • To ensure that the model was loaded correctly, let's double-check that it generates coherent text:

    from previous_chapters import (
        generate_text_simple,
        text_to_token_ids,
        token_ids_to_text
    )

    text_1 = "Every effort moves you"

    token_ids = generate_text_simple(
        model=model,
        idx=text_to_token_ids(text_1, tokenizer),
        max_new_tokens=15,
        context_size=BASE_CONFIG["context_length"]
    )

    print(token_ids_to_text(token_ids, tokenizer))

    """Output"""
    Every effort moves you forward.The first step is to understand the importance of your work
    
  • Before we fine-tune the model as a classifier, let's see whether it can already classify spam messages when simply prompted to do so:

    text_2 = (
        "Is the following text 'spam'? Answer with 'yes' or 'no':"
        " 'You are a winner you have been specially"
        " selected to receive $1000 cash or a $2000 award.'"
    )

    token_ids = generate_text_simple(
        model=model,
        idx=text_to_token_ids(text_2, tokenizer),
        max_new_tokens=23,
        context_size=BASE_CONFIG["context_length"]
    )

    print(token_ids_to_text(token_ids, tokenizer))

    """Output"""
    Is the following text 'spam'? Answer with 'yes' or 'no': 'You are a winner you have been specially selected to receive $1000 cash or a $2000 award.'The following text 'spam'? Answer with 'yes' or 'no': 'You are a winner
    

    As we can see, the model is not very good at following instructions. That is expected: it has only been pretrained and has not been instruction-fine-tuned (instruction fine-tuning is covered in the next chapter).


6.5 Adding a classification head

  • For classification fine-tuning, we have to modify the pretrained LLM. We replace the original output layer, which maps the hidden representation to a vocabulary of 50,257 tokens, with a smaller output layer that maps to just two classes: 0 ("not spam") and 1 ("spam"), as shown in the figure below. Apart from the output layer, the rest of the model remains unchanged.

    print(model)
    

    (Figure: output of print(model), showing the GPT model architecture.)

  • Our goal is to replace and fine-tune the output layer. To achieve that, we first freeze the model, meaning that we make all layers non-trainable:

    for param in model.parameters():
        param.requires_grad = False
    

    Then we replace the output layer (model.out_head):

    torch.manual_seed(123)

    num_classes = 2
    model.out_head = torch.nn.Linear(
        in_features=BASE_CONFIG["emb_dim"],
        out_features=num_classes
    )
    

    In this new model, the requires_grad attribute of the newly added out_head output layer is set to True by default, meaning that it is the only layer in the model that will be updated during training. Technically, training just this new output layer would be sufficient. However, as experiments show (see appendix B for details), fine-tuning additional layers can noticeably improve the model's predictive performance. We therefore also configure the last transformer block and the final LayerNorm module, which connects that block to the output layer, to be trainable, as shown in the figure below.

    The GPT model contains 12 repeated transformer blocks. Near the output layer, we make the final LayerNorm and the last transformer block trainable, while the remaining 11 transformer blocks and the embedding layers stay non-trainable. To do so, we set their respective requires_grad attributes to True (a quick parameter count follows the code below):

    for param in model.trf_blocks[-1].parameters():
        param.requires_grad = True

    for param in model.final_norm.parameters():
        param.requires_grad = True
    
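    As a quick check (a minimal sketch, not part of the chapter's code), we can count how many parameters are now trainable versus frozen; only the new output head, the final LayerNorm, and the last transformer block should contribute to the trainable count:

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable parameters: {trainable:,} of {total:,} ({100 * trainable / total:.1f}%)")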
  • Even though we added a new output layer and marked certain layers as trainable or non-trainable, we can still use the model in the same way as before. For example, we can feed it an example text identical to the one we used earlier:

    inputs = tokenizer.encode("Do you have time")
    inputs = torch.tensor(inputs).unsqueeze(0)
    print("Inputs:", inputs)
    print("Inputs dimensions:", inputs.shape)  # shape: (batch_size, num_tokens)

    """Output"""
    Inputs: tensor([[5211,  345,  423,  640]])
    Inputs dimensions: torch.Size([1, 4])
    

    But unlike before, the output now has two dimensions per token instead of 50,257:

    with torch.no_grad():
        outputs = model(inputs)

    print("Outputs:\n", outputs)
    print("Outputs dimensions:", outputs.shape)  # shape: (batch_size, num_tokens, num_classes)

    """Output"""
    Outputs:
     tensor([[[-1.5854,  0.9904],
              [-3.7235,  7.4548],
              [-2.2661,  6.6049],
              [-3.5983,  3.9902]]])
    Outputs dimensions: torch.Size([1, 4, 2])
    
  • Remember that we want to fine-tune this model to return a class label indicating whether the input is "spam" or "not spam". We don't need to fine-tune all four output rows; instead, we can focus on a single output token, namely the last row, which corresponds to the last output token, as shown in the figure below.

    Chapter 3 discussed the attention mechanism, which connects each input token to every other input token, and introduced the causal attention mask used in GPT-like models, which lets the current token attend only to its own and preceding positions. Because of this causal attention, the last token aggregates the most information of all tokens, so it is the one we fine-tune for the spam-classification task.

    print("Last output token:", outputs[:, -1, :])

    """Output"""
    Last output token: tensor([[-3.5983,  3.9902]])
    


6.6-Calculating the classification loss and accuracy

  • Before we fine-tune the model, only one small task remains: we must implement the model-evaluation functions used during fine-tuning, as shown in the figure below.

    Before implementing the evaluation utilities, let's briefly discuss how the model outputs are converted into class-label predictions. Previously, we computed the token ID of the next token generated by the LLM by converting the 50,257 outputs into probabilities via softmax and then taking the position of the highest probability via argmax. Now we take the same approach to predict whether an input is "spam" or "not spam"; the only difference is that we work with 2-dimensional rather than 50,257-dimensional outputs.

  • Let's consider the last-token output for a concrete example:

    print("Last output token:", outputs[:, -1, :])

    """Output"""
    Last output token: tensor([[-3.5983,  3.9902]])
    

    We convert the outputs (logits) into probability scores via the softmax function and then obtain the index position of the largest probability value via the argmax function:

    probas = torch.softmax(outputs[:, -1, :], dim=-1)
    label = torch.argmax(probas)
    print("Class label:", label.item())

    """Output"""
    Class label: 1
    

    We can simplify this and skip the softmax, because the largest logit directly corresponds to the highest probability score:

    logits = outputs[:, -1, :]
    label = torch.argmax(logits)
    print("Class label:", label.item())

    """Output"""
    Class label: 1
    

    This concept can be used to compute the classification accuracy, i.e., the percentage of correct predictions across a dataset. To determine the accuracy, we apply the argmax-based prediction to all examples in a dataset and compute the fraction of correct predictions with a calc_accuracy_loader function:

    def calc_accuracy_loader(data_loader, model, device, num_batches=None):
        model.eval()
        correct_predictions, num_examples = 0, 0

        if num_batches is None:
            num_batches = len(data_loader)
        else:
            num_batches = min(num_batches, len(data_loader))

        for i, (input_batch, target_batch) in enumerate(data_loader):
            if i < num_batches:
                input_batch, target_batch = input_batch.to(device), target_batch.to(device)

                with torch.no_grad():
                    logits = model(input_batch)[:, -1, :]  # Logits of last output token
                predicted_labels = torch.argmax(logits, dim=-1)

                num_examples += predicted_labels.shape[0]
                correct_predictions += (predicted_labels == target_batch).sum().item()
            else:
                break
        return correct_predictions / num_examples
    
  • Now let's apply the function to compute the classification accuracy on the different datasets:

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Note:
    # Uncommenting the following lines will allow the code to run on Apple Silicon chips, if applicable,
    # which is approximately 2x faster than on an Apple CPU (as measured on an M3 MacBook Air).
    # As of this writing, in PyTorch 2.4, the results obtained via CPU and MPS were identical.
    # However, in earlier versions of PyTorch, you may observe different results when using MPS.

    #if torch.cuda.is_available():
    #    device = torch.device("cuda")
    #elif torch.backends.mps.is_available():
    #    device = torch.device("mps")
    #else:
    #    device = torch.device("cpu")
    #print(f"Running on {device} device.")

    model.to(device)  # no assignment model = model.to(device) necessary for nn.Module classes

    torch.manual_seed(123)  # For reproducibility due to the shuffling in the training data loader

    train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=10)
    val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=10)
    test_accuracy = calc_accuracy_loader(test_loader, model, device, num_batches=10)

    print(f"Training accuracy: {train_accuracy*100:.2f}%")
    print(f"Validation accuracy: {val_accuracy*100:.2f}%")
    print(f"Test accuracy: {test_accuracy*100:.2f}%")

    """Output"""
    Training accuracy: 46.25%
    Validation accuracy: 45.00%
    Test accuracy: 48.75%
    

    The model's initial prediction accuracy is close to random guessing (50%) and needs to be improved by fine-tuning. Before fine-tuning, we have to define the loss function. The goal is to maximize the spam-classification accuracy (outputs 0 and 1), but classification accuracy is not differentiable, so we use cross-entropy loss as a proxy objective. The calc_loss_batch function optimizes only the output of the last token, model(input_batch)[:, -1, :] (a small worked check of this loss follows the function below).

    def calc_loss_batch(input_batch, target_batch, model, device):
        input_batch, target_batch = input_batch.to(device), target_batch.to(device)
        logits = model(input_batch)[:, -1, :]  # Logits of last output token
        loss = torch.nn.functional.cross_entropy(logits, target_batch)
        return loss
    
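    To see what this loss computes, here is a small sketch that reproduces the cross-entropy value for a single example by hand: softmax over the two last-token logits, then the negative log-probability of the true class. The logits reuse the example output from earlier; the label of 1 is assumed purely for illustration.

    example_logits = torch.tensor([[-3.5983, 3.9902]])  # last-token logits from the earlier example
    example_target = torch.tensor([1])                  # assumed true label: 1 ("spam")

    # Built-in cross entropy
    loss = torch.nn.functional.cross_entropy(example_logits, example_target)

    # Manual computation: -log(softmax(logits)[true class])
    probas = torch.softmax(example_logits, dim=-1)
    manual_loss = -torch.log(probas[0, example_target[0]])

    print(loss.item(), manual_loss.item())  # the two values match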

    calc_loss_loader is exactly the same as in chapter 5:

    # Same as in chapter 5
    def calc_loss_loader(data_loader, model, device, num_batches=None):
        total_loss = 0.
        if len(data_loader) == 0:
            return float("nan")
        elif num_batches is None:
            num_batches = len(data_loader)
        else:
            # Reduce the number of batches to match the total number of batches in the data loader
            # if num_batches exceeds the number of batches in the data loader
            num_batches = min(num_batches, len(data_loader))
        for i, (input_batch, target_batch) in enumerate(data_loader):
            if i < num_batches:
                loss = calc_loss_batch(input_batch, target_batch, model, device)
                total_loss += loss.item()
            else:
                break
        return total_loss / num_batches
    
  • Using calc_loss_loader, we compute the initial training-, validation-, and test-set losses before we start training:

    with torch.no_grad():  # Disable gradient tracking for efficiency because we are not training, yet
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=5)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=5)
        test_loss = calc_loss_loader(test_loader, model, device, num_batches=5)

    print(f"Training loss: {train_loss:.3f}")
    print(f"Validation loss: {val_loss:.3f}")
    print(f"Test loss: {test_loss:.3f}")

    """Output"""
    Training loss: 2.453
    Validation loss: 2.583
    Test loss: 2.322
    

6.7-Finetuning the model on supervised data

  • In this section, we define and use the training function to improve the model's classification accuracy.
    The train_classifier_simple function below is practically the same as the train_model_simple function we used to pretrain the model in chapter 5.
    The only two differences are that we now

    1. track the number of training examples seen (examples_seen) instead of the number of tokens
    2. calculate the accuracy after each epoch instead of printing a sample text after each epoch

    A typical PyTorch training loop for deep neural networks iterates over the training batches for several epochs, computing the loss of each batch to obtain the gradients used to update the model weights so that the loss decreases. The training function that implements these ideas mirrors the train_model_simple function used for pretraining, except that it tracks the number of examples seen rather than tokens and computes accuracy after each epoch instead of printing sample text.

  • The training code:

    # Overall the same as `train_model_simple` in chapter 5
    def train_classifier_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
                                eval_freq, eval_iter):
        # Initialize lists to track losses and examples seen
        train_losses, val_losses, train_accs, val_accs = [], [], [], []
        examples_seen, global_step = 0, -1

        # Main training loop
        for epoch in range(num_epochs):
            model.train()  # Set model to training mode

            for input_batch, target_batch in train_loader:
                optimizer.zero_grad()  # Reset loss gradients from previous batch iteration
                loss = calc_loss_batch(input_batch, target_batch, model, device)
                loss.backward()  # Calculate loss gradients
                optimizer.step()  # Update model weights using loss gradients
                examples_seen += input_batch.shape[0]  # New: track examples instead of tokens
                global_step += 1

                # Optional evaluation step
                if global_step % eval_freq == 0:
                    train_loss, val_loss = evaluate_model(
                        model, train_loader, val_loader, device, eval_iter)
                    train_losses.append(train_loss)
                    val_losses.append(val_loss)
                    print(f"Ep {epoch+1} (Step {global_step:06d}): "
                          f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

            # Calculate accuracy after each epoch
            train_accuracy = calc_accuracy_loader(train_loader, model, device, num_batches=eval_iter)
            val_accuracy = calc_accuracy_loader(val_loader, model, device, num_batches=eval_iter)
            print(f"Training accuracy: {train_accuracy*100:.2f}% | ", end="")
            print(f"Validation accuracy: {val_accuracy*100:.2f}%")
            train_accs.append(train_accuracy)
            val_accs.append(val_accuracy)

        return train_losses, val_losses, train_accs, val_accs, examples_seen
    

    The evaluate_model function used in train_classifier_simple is the same as the one we used in chapter 5:

    # Same as chapter 5
    def evaluate_model(model, train_loader, val_loader, device, eval_iter):
        model.eval()
        with torch.no_grad():
            train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
            val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
        model.train()
        return train_loss, val_loss
    
  • Next, we initialize the optimizer, set the number of training epochs, and start training with the train_classifier_simple function.

    import time

    start_time = time.time()

    torch.manual_seed(123)

    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)
    num_epochs = 5

    train_losses, val_losses, train_accs, val_accs, examples_seen = train_classifier_simple(
        model, train_loader, val_loader, optimizer, device,
        num_epochs=num_epochs, eval_freq=50, eval_iter=5,
    )

    end_time = time.time()
    execution_time_minutes = (end_time - start_time) / 60
    print(f"Training completed in {execution_time_minutes:.2f} minutes.")

    """Output"""
    Ep 1 (Step 000000): Train loss 2.153, Val loss 2.392
    Ep 1 (Step 000050): Train loss 0.617, Val loss 0.637
    Ep 1 (Step 000100): Train loss 0.523, Val loss 0.557
    Training accuracy: 70.00% | Validation accuracy: 72.50%
    Ep 2 (Step 000150): Train loss 0.561, Val loss 0.489
    Ep 2 (Step 000200): Train loss 0.419, Val loss 0.397
    Ep 2 (Step 000250): Train loss 0.409, Val loss 0.353
    Training accuracy: 82.50% | Validation accuracy: 85.00%
    Ep 3 (Step 000300): Train loss 0.333, Val loss 0.320
    Ep 3 (Step 000350): Train loss 0.340, Val loss 0.306
    Training accuracy: 90.00% | Validation accuracy: 90.00%
    Ep 4 (Step 000400): Train loss 0.136, Val loss 0.200
    Ep 4 (Step 000450): Train loss 0.153, Val loss 0.132
    Ep 4 (Step 000500): Train loss 0.222, Val loss 0.137
    Training accuracy: 100.00% | Validation accuracy: 97.50%
    Ep 5 (Step 000550): Train loss 0.207, Val loss 0.143
    Ep 5 (Step 000600): Train loss 0.083, Val loss 0.074
    Training accuracy: 100.00% | Validation accuracy: 97.50%
    Training completed in 0.67 minutes.
    

    Training takes about 6 minutes on an M3 MacBook Air laptop and less than half a minute on a V100 or A100 GPU; on my Windows machine with an RTX 3060 12 GB, the full run took about 40 seconds (0.67 minutes).

  • We then use Matplotlib to plot the loss curves for the training and validation sets.

    import matplotlib.pyplot as plt

    def plot_values(epochs_seen, examples_seen, train_values, val_values, label="loss"):
        fig, ax1 = plt.subplots(figsize=(5, 3))

        # Plot training and validation loss against epochs
        ax1.plot(epochs_seen, train_values, label=f"Training {label}")
        ax1.plot(epochs_seen, val_values, linestyle="-.", label=f"Validation {label}")
        ax1.set_xlabel("Epochs")
        ax1.set_ylabel(label.capitalize())
        ax1.legend()

        # Create a second x-axis for examples seen
        ax2 = ax1.twiny()  # Create a second x-axis that shares the same y-axis
        ax2.plot(examples_seen, train_values, alpha=0)  # Invisible plot for aligning ticks
        ax2.set_xlabel("Examples seen")

        fig.tight_layout()  # Adjust layout to make room
        plt.savefig(f"{label}-plot.pdf")
        plt.show()

    epochs_tensor = torch.linspace(0, num_epochs, len(train_losses))
    examples_seen_tensor = torch.linspace(0, examples_seen, len(train_losses))

    plot_values(epochs_tensor, examples_seen_tensor, train_losses, val_losses)
    

    (Figure: training and validation loss plotted against epochs and examples seen.)

    Earlier, when we started training, we set the number of epochs to 5. The right number of epochs depends on the dataset and the difficulty of the task; there is no universal recommendation, but 5 epochs is usually a good starting point. If the model overfits after the first few epochs, the number of epochs may need to be reduced; conversely, if the trend line suggests that the validation loss would keep improving with further training, the number should be increased. In our case, 5 epochs is a reasonable choice: there are no signs of early overfitting, and the validation loss is close to 0. (A minimal early-stopping sketch follows below.)
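
    If overfitting does appear, one common alternative to hand-picking a smaller epoch count is a simple early-stopping check on the validation loss. The following is a minimal sketch (not part of the chapter's training function; the patience value is an assumption) that works on the val_losses list returned by train_classifier_simple:

    # Hedged sketch: stop once the best validation loss is more than `patience` evaluations old.
    def should_stop(val_losses, patience=3):
        if len(val_losses) <= patience:
            return False
        best_idx = min(range(len(val_losses)), key=lambda i: val_losses[i])
        return (len(val_losses) - 1) - best_idx >= patience

    print(should_stop(val_losses))  # False for this run: the validation loss kept improving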

    Using the same plot_values function, let's now plot the classification accuracy:

    epochs_tensor = torch.linspace(0, num_epochs, len(train_accs))
    examples_seen_tensor = torch.linspace(0, examples_seen, len(train_accs))

    plot_values(epochs_tensor, examples_seen_tensor, train_accs, val_accs, label="accuracy")
    

    (Figure: training and validation accuracy plotted against epochs and examples seen.)

    Based on the accuracy plot above, the model reaches relatively high training and validation accuracy after epochs 4 and 5. However, keep in mind that we set eval_iter=5 in the training function, which means we only estimated the training- and validation-set performance on a few batches. We can compute the accuracy over the full training, validation, and test sets as follows.

    train_accuracy = calc_accuracy_loader(train_loader, model, device)
    val_accuracy = calc_accuracy_loader(val_loader, model, device)
    test_accuracy = calc_accuracy_loader(test_loader, model, device)

    print(f"Training accuracy: {train_accuracy*100:.2f}%")
    print(f"Validation accuracy: {val_accuracy*100:.2f}%")
    print(f"Test accuracy: {test_accuracy*100:.2f}%")

    """Output"""
    Training accuracy: 97.21%
    Validation accuracy: 97.32%
    Test accuracy: 95.67%
    

    The training- and test-set performance is almost identical. The small difference between training and test accuracy suggests minimal overfitting of the training data. Typically, the validation accuracy is somewhat higher than the test accuracy, because model development often involves tuning hyperparameters to perform well on the validation set, and those hyperparameters may not generalize equally well to the test set. This is common; the gap can usually be narrowed by adjusting the model's settings, such as increasing the dropout rate (drop_rate) or the weight_decay parameter in the optimizer configuration (a small sketch of these knobs follows below).
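
    As a sketch of the regularization knobs mentioned above (the values are illustrative assumptions, not tuned recommendations; the chapter's run used drop_rate=0.0 and weight_decay=0.1), the adjustments would look like this:

    BASE_CONFIG["drop_rate"] = 0.1  # enable dropout inside the GPT blocks (requires re-creating and re-loading the model)

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=5e-5,
        weight_decay=0.2,           # stronger decoupled weight decay in AdamW
    )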


6.8-Using the LLM as a spam classifier

  • Having fine-tuned and evaluated the model, we can now classify spam messages (see the figure below). Let's use our fine-tuned GPT-based spam-classification model.

    The classify_review function below follows data-preprocessing steps similar to those in the SpamDataset implementation. After processing the text into token IDs, the function uses the model to predict an integer class label, as in section 6.6, and returns the corresponding class name:

    def classify_review(text, model, tokenizer, device, max_length=None, pad_token_id=50256):
        model.eval()

        # Prepare inputs to the model
        input_ids = tokenizer.encode(text)
        supported_context_length = model.pos_emb.weight.shape[0]
        # Note: In the book, this was originally written as pos_emb.weight.shape[1] by mistake
        # It didn't break the code but would have caused unnecessary truncation (to 768 instead of 1024)

        # Truncate sequences if they are too long
        input_ids = input_ids[:min(max_length, supported_context_length)]

        # Pad sequences to the longest sequence
        input_ids += [pad_token_id] * (max_length - len(input_ids))
        input_tensor = torch.tensor(input_ids, device=device).unsqueeze(0)  # add batch dimension

        # Model inference
        with torch.no_grad():
            logits = model(input_tensor)[:, -1, :]  # Logits of the last output token
        predicted_label = torch.argmax(logits, dim=-1).item()

        # Return the classified result
        return "spam" if predicted_label == 1 else "not spam"
    
  • Let's try it out on a couple of examples:

    text_1 = (
        "You are a winner you have been specially"
        " selected to receive $1000 cash or a $2000 award."
    )

    print(classify_review(
        text_1, model, tokenizer, device, max_length=train_dataset.max_length
    ))

    """Output"""
    spam
    
    text_2 = (
        "Hey, just wanted to check if we're still on"
        " for dinner tonight? Let me know!"
    )

    print(classify_review(
        text_2, model, tokenizer, device, max_length=train_dataset.max_length
    ))

    """Output"""
    not spam
    
  • Finally, let's save the model so we can reuse it later without having to train it again:

    torch.save(model.state_dict(), "review_classifier.pth")
    

    Then, in a new session, we can load the model as follows (a sketch of rebuilding the model in a fresh session follows the snippet):

    model_state_dict = torch.load("review_classifier.pth", map_location=device, weights_only=True)
    model.load_state_dict(model_state_dict)
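
    Note that load_state_dict assumes that model already exists with the two-class output head attached. In a truly fresh session, a sketch of the full reconstruction (assuming BASE_CONFIG and the chapter's classes are available as above) might look like this:

    import torch
    from previous_chapters import GPTModel

    # Rebuild the architecture exactly as it was during fine-tuning
    model = GPTModel(BASE_CONFIG)
    model.out_head = torch.nn.Linear(in_features=BASE_CONFIG["emb_dim"], out_features=2)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model_state_dict = torch.load("review_classifier.pth", map_location=device, weights_only=True)
    model.load_state_dict(model_state_dict)
    model.to(device)
    model.eval()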
    

6.9-Summary and takeaways

  • summary

    1. There are different strategies for fine-tuning LLMs, including classification fine-tuning and instruction fine-tuning.
    2. Classification fine-tuning involves replacing the output layer of an LLM via a small classification layer.
    3. In the case of classifying text messages as “spam” or “not spam,” the new classification layer consists of only two output nodes. Previously, we used a number of output nodes equal to the number of unique tokens in the vocabulary (i.e., 50,257).
    4. Instead of predicting the next token in the text as in pretraining, classification fine-tuning trains the model to output a correct class label—for example, “spam” or “not spam.”
    5. The model input for fine-tuning is text converted into token IDs, similar to pretraining.
    6. Before fine-tuning an LLM, we load the pretrained model as a base model.
    7. Evaluating a classification model involves calculating the classification accuracy (the fraction or percentage of correct predictions).
    8. Fine-tuning a classification model uses the same cross entropy loss function as when pretraining the LLM
  • takeaways

    1. ./gpt_class_finetune.py is a standalone script that consolidates this chapter's classification fine-tuning code (dataset download, SpamDataset, data loaders, model setup, and training) in one file; its functions are the same as the ones defined above.
    2. Appendix E introduces parameter-efficient fine-tuning with LoRA.

