We will fine-tune an ada classifier to distinguish between two sports: baseball and hockey.
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import openai

categories = ['rec.sport.baseball', 'rec.sport.hockey']
sports_dataset = fetch_20newsgroups(subset='train', shuffle=True, random_state=42, categories=categories)
Data exploration
The newsgroup dataset can be loaded using sklearn. First, we will look at the data itself:
print(sports_dataset['data'][0])
From: dougb@comm.mot.com (Doug Bank)
Subject: Re: Info needed for Cleveland tickets
Reply-To: dougb@ecs.comm.mot.com
Organization: Motorola Land Mobile Products Sector
Distribution: usa
Nntp-Posting-Host: 145.1.146.35
Lines: 17

In article <1993Apr1.234031.4950@leland.Stanford.EDU>, bohnert@leland.Stanford.EDU (matthew bohnert) writes:
|> I'm going to be in Cleveland Thursday, April 15 to Sunday, April 18.
|> Does anybody know if the Tribe will be in town on those dates, and
|> if so, who're they playing and if tickets are available?

The tribe will be in town from April 16 to the 19th.
There are ALWAYS tickets available! (Though they are playing Toronto,
and many Toronto fans make the trip to Cleveland as it is easier to
get tickets in Cleveland than in Toronto. Either way, I seriously
doubt they will sell out until the end of the season.)

--
Doug Bank Private Systems Division
dougb@ecs.comm.mot.com Motorola Communications Sector
dougb@nwu.edu Schaumburg, Illinois
dougb@casbah.acns.nwu.edu 708-576-8207
sports_dataset.target_names[sports_dataset['target'][0]]
'rec.sport.baseball'
len_all, len_baseball, len_hockey = len(sports_dataset.data), len([e for e in sports_dataset.target if e == 0]), len([e for e in sports_dataset.target if e == 1])
print(f"Total examples: {len_all}, Baseball examples: {len_baseball}, Hockey examples: {len_hockey}")
Total examples: 1197, Baseball examples: 597, Hockey examples: 600
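Data preparation
We transform the dataset into a pandas DataFrame with one column for the prompt and one for the completion. The prompt contains the email from the mailing list, and the completion is the name of the sport, either hockey or baseball. The code below includes a commented-out [:300] slice that can be enabled to take only 300 examples for a faster demonstration; as written, all 1197 examples are kept, which is the count the data preparation tool reports later. In a real use case, more examples generally give better performance.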
import pandas as pd

labels = [sports_dataset.target_names[x].split('.')[-1] for x in sports_dataset['target']]
texts = [text.strip() for text in sports_dataset['data']]
df = pd.DataFrame(zip(texts, labels), columns = ['prompt','completion']) #[:300]
df.head()
| | prompt | completion |
|---|---|---|
| 0 | From: dougb@comm.mot.com (Doug Bank)\nSubject:… | baseball |
| 1 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)… | hockey |
| 2 | From: rudy@netcom.com (Rudy Wade)\nSubject: Re… | baseball |
| 3 | From: monack@helium.gas.uug.arizona.edu (david… | hockey |
| 4 | Subject: Let it be Known\nFrom: <ISSBTL@BYUVM… | baseball |
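The data preparation tool used next reads a file named sport2.jsonl, so the prompt/completion pairs need to be written to disk in JSONL form first. With the DataFrame above, this would presumably be a single pandas call, df.to_json("sport2.jsonl", orient="records", lines=True). The sketch below shows the same file layout using only the standard library; the helper name write_jsonl, the sample rows, and the file name sport2_sample.jsonl are hypothetical and only for illustration.

```python
import json

def write_jsonl(records, path):
    """Write one JSON object per line, which is the JSONL layout."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Hypothetical stand-ins for the (prompt, completion) rows of `df`.
sample_records = [
    {"prompt": "From: someone@example.com\nSubject: Tribe tickets", "completion": "baseball"},
    {"prompt": "From: other@example.com\nSubject: Playoff overtime", "completion": "hockey"},
]

write_jsonl(sample_records, "sport2_sample.jsonl")
```

Each line of the resulting file is a standalone JSON object with a prompt key and a completion key.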
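Data preparation tool
We can now use the data preparation tool, which suggests a few improvements to our dataset before fine-tuning. Before launching the tool, we update the openai library to make sure we are using the latest version of the tool. We additionally specify -q, which automatically accepts all suggestions.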
!pip install --upgrade openai
!openai tools fine_tunes.prepare_data -f sport2.jsonl -q
Analyzing...

- Your file contains 1197 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 11 examples that are very long. These are rows: [134, 200, 281, 320, 404, 595, 704, 838, 1113, 1139, 1174]
For conditional generation, and for classification the examples shouldn't be longer than 2048 tokens.
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details

Based on the analysis we will perform the following actions:
- [Recommended] Remove 11 long examples [Y/n]: Y
- [Recommended] Add a suffix separator `\n\n###\n\n` to all prompts [Y/n]: Y
- [Recommended] Add a whitespace character to the beginning of the completion [Y/n]: Y
- [Recommended] Would you like to split into training and validation set? [Y/n]: Y

Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified files to `sport2_prepared_train.jsonl` and `sport2_prepared_valid.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "sport2_prepared_train.jsonl" -v "sport2_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " baseball"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `\n\n###\n\n` for the model to start generating completions, rather than continuing with the prompt.
Once your model starts training, it'll approximately take 30.8 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.
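The tool helpfully suggests a few improvements to the dataset and splits it into training and validation sets.
A suffix between the prompt and the completion is necessary to tell the model that the input text has ended and that it now needs to predict the class. Since we use the same separator in every example, the model learns that it is meant to predict either baseball or hockey after the separator. A whitespace prefix in the completions is useful, as most word tokens are tokenized with a leading space. The tool also recognized that this is likely a classification task, so it suggested splitting the dataset into training and validation sets. This allows us to easily measure the expected performance on new data.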
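To make the accepted suggestions concrete, here is a minimal sketch of the formatting and the hold-out split, not the tool's actual implementation. The separator string and the leading space come from the tool output above; the function names, the 20% validation fraction, and the seed are assumptions made for illustration.

```python
import random

SEPARATOR = "\n\n###\n\n"  # suffix the tool appended to every prompt

def format_example(prompt, completion):
    """Append the separator to the prompt and prefix the completion with a space."""
    if not completion.startswith(" "):
        completion = " " + completion
    return {"prompt": prompt + SEPARATOR, "completion": completion}

def split_train_valid(examples, valid_fraction=0.2, seed=42):
    """Shuffle and hold out a fraction for validation (fraction and seed are assumed)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]

# Hypothetical toy data standing in for the newsgroup emails.
raw = [(f"email body {i}", "hockey" if i % 2 else "baseball") for i in range(10)]
prepared = [format_example(p, c) for p, c in raw]
train, valid = split_train_valid(prepared)
print(len(train), len(valid))
```

Keeping the validation examples out of training is what makes the classification metrics reported later an estimate of performance on unseen data.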
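Fine-tuning
The tool suggests that we run the following command to train on the dataset. Since this is a classification task, we would like to know how well the model generalizes to the provided validation set. The tool suggests adding --compute_classification_metrics --classification_positive_class " baseball" in order to compute the classification metrics.
We can simply copy the suggested command from the CLI tool. We additionally add -m ada to fine-tune the cheaper and faster ada model, whose performance on classification use cases is usually comparable to that of slower and more expensive models.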
!openai api fine_tunes.create -t "sport2_prepared_train.jsonl" -v "sport2_prepared_valid.jsonl" --compute_classification_metrics --classification_positive_class " baseball" -m ada
Upload progress: 100%|████████████████████| 1.52M/1.52M [00:00<00:00, 1.81Mit/s]
Uploaded file from sport2_prepared_train.jsonl: file-Dxx2xJqyjcwlhfDHpZdmCXlF
Upload progress: 100%|███████████████████████| 388k/388k [00:00<00:00, 507kit/s]
Uploaded file from sport2_prepared_valid.jsonl: file-Mvb8YAeLnGdneSAFcfiVcgcN
Created fine-tune: ft-2zaA7qi0rxJduWQpdvOvmGn3
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2021-07-30 13:15:50] Created fine-tune: ft-2zaA7qi0rxJduWQpdvOvmGn3
[2021-07-30 13:15:52] Fine-tune enqueued. Queue number: 0
[2021-07-30 13:15:56] Fine-tune started
[2021-07-30 13:18:55] Completed epoch 1/4
[2021-07-30 13:20:47] Completed epoch 2/4
[2021-07-30 13:22:40] Completed epoch 3/4
[2021-07-30 13:24:31] Completed epoch 4/4
[2021-07-30 13:26:22] Uploaded model: ada:ft-openai-2021-07-30-12-26-20
[2021-07-30 13:26:27] Uploaded result file: file-6Ki9RqLQwkChGsr9CHcr1ncg
[2021-07-30 13:26:28] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m ada:ft-openai-2021-07-30-12-26-20 -p <YOUR_PROMPT>
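The model is trained successfully in about ten minutes. We can see that the model name is ada:ft-openai-2021-07-30-12-26-20, which we can use for inference.
[Advanced] Results and expected model performance
We can now download the results file to observe the expected performance on the held-out validation set.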
!openai api fine_tunes.results -i ft-2zaA7qi0rxJduWQpdvOvmGn3 > result.csv
results = pd.read_csv('result.csv')
results[results['classification/accuracy'].notnull()].tail(1)
| | step | elapsed_tokens | elapsed_examples | training_loss | training_sequence_accuracy | training_token_accuracy | classification/accuracy | classification/precision | classification/recall | classification/auroc | classification/auprc | classification/f1.0 | validation_loss | validation_sequence_accuracy | validation_token_accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 929 | 930 | 3027688 | 3720 | 0.044408 | 1.0 | 1.0 | 0.991597 | 0.983471 | 1.0 | 1.0 | 1.0 | 0.991667 | NaN | NaN | NaN |
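The classification accuracy in the last row is 0.991597, i.e. about 99.2%. The plot produced by the code below shows how accuracy on the validation set increases during the training run.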
results[results['classification/accuracy'].notnull()]['classification/accuracy'].plot()
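The classification columns in the results file are related through the usual confusion-matrix definitions, with " baseball" as the positive class. The sketch below computes them from raw counts. The counts themselves are not stated anywhere in the results file; they are one hypothetical set, chosen because they reproduce the reported accuracy, precision, recall, and F1 values.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts (positive class: " baseball"), assumed for illustration only.
metrics = classification_metrics(tp=119, fp=2, fn=0, tn=117)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Under these assumed counts, a recall of 1.0 together with a precision slightly below 1.0 would mean that every baseball email was found and a couple of hockey emails were labelled as baseball.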
Using the model
We can now call the model to get predictions.
test = pd.read_json('sport2_prepared_valid.jsonl', lines=True)
test.head()
| | prompt | completion |
|---|---|---|
| 0 | From: gld@cunixb.cc.columbia.edu (Gary L Dare)… | hockey |
| 1 | From: smorris@venus.lerc.nasa.gov (Ron Morris … | hockey |
| 2 | From: golchowy@alchemy.chem.utoronto.ca (Geral… | hockey |
| 3 | From: krattige@hpcc01.corp.hp.com (Kim Krattig… | baseball |
| 4 | From: warped@cs.montana.edu (Doug Dolven)\nSub… | baseball |
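We need to append the same separator to the prompt that we used during fine-tuning. In this case it is \n\n###\n\n. Since we are concerned with classification, we want the temperature to be as low as possible, and we only need a one-token completion to determine the model's prediction.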
ft_model = 'ada:ft-openai-2021-07-30-12-26-20'
res = openai.Completion.create(model=ft_model, prompt=test['prompt'][0] + '\n\n###\n\n', max_tokens=1, temperature=0)
res['choices'][0]['text']
' hockey'
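The same request pattern can be wrapped in a small helper so that the separator, max_tokens=1, and temperature=0 are applied consistently. This is only a sketch: the function name classify and the injected complete_fn parameter are hypothetical, and fake_complete is a stand-in used here so the example runs without network access. In the notebook, complete_fn would be openai.Completion.create and model would be the fine-tuned model name.

```python
SEPARATOR = "\n\n###\n\n"  # must match the separator used during fine-tuning

def classify(text, complete_fn, model):
    """Return the predicted label for `text` using a one-token completion."""
    response = complete_fn(
        model=model,
        prompt=text + SEPARATOR,
        max_tokens=1,    # a single token is enough to name the class
        temperature=0,   # always pick the most likely token
    )
    # Completions start with a space (e.g. ' hockey'), so strip it.
    return response["choices"][0]["text"].strip()

def fake_complete(**kwargs):
    """Offline stand-in for the completion endpoint (assumption, not the real API)."""
    label = " hockey" if "NHL" in kwargs["prompt"] else " baseball"
    return {"choices": [{"text": label}]}

print(classify("Great NHL game last night", fake_complete, model="hypothetical-model"))
```

Passing the completion function as an argument keeps the formatting logic testable independently of the API client.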
To get the log probabilities, we can specify the logprobs parameter on the completion request.
res = openai.Completion.create(model=ft_model, prompt=test['prompt'][0] + '\n\n###\n\n', max_tokens=1, temperature=0, logprobs=2)
res['choices'][0]['logprobs']['top_logprobs'][0]
<OpenAIObject at 0x7fe114e435c8> JSON: {
  " baseball": -7.6311407,
  " hockey": -0.0006307676
}
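We can see that the model predicts hockey as far more likely than baseball, which is the correct prediction. By requesting logprobs, we can see the predicted (log) probability for each class.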
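Log probabilities are easier to read after exponentiating them back into probabilities. The following sketch does that for the two values returned above; the helper name logprobs_to_probs is hypothetical.

```python
import math

def logprobs_to_probs(top_logprobs):
    """Convert a token -> log-probability mapping into token -> probability."""
    return {token: math.exp(lp) for token, lp in top_logprobs.items()}

# Values copied from the top_logprobs output shown above.
top_logprobs = {" baseball": -7.6311407, " hockey": -0.0006307676}

probs = logprobs_to_probs(top_logprobs)
best = max(probs, key=probs.get)
for token, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{token!r}: {p:.6f}")
print("prediction:", best.strip())
```

The two probabilities need not sum exactly to 1, since only the top two tokens were requested and the remaining mass belongs to other tokens.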
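Generalization
Interestingly, our fine-tuned classifier is quite versatile. Despite being trained on emails sent to different mailing lists, it also successfully classifies tweets.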
sample_hockey_tweet = """Thank you to the @Canes and all you amazing Caniacs that have been so supportive! You guys are some of the best fans in the NHL without a doubt! Really excited to start this new chapter in my career with the @DetroitRedWings!!"""
res = openai.Completion.create(model=ft_model, prompt=sample_hockey_tweet + '\n\n###\n\n', max_tokens=1, temperature=0, logprobs=2)
res['choices'][0]['text']
' hockey'
sample_baseball_tweet="""BREAKING: The Tampa Bay Rays are finalizing a deal to acquire slugger Nelson Cruz from the Minnesota Twins, sources tell ESPN."""
res = openai.Completion.create(model=ft_model, prompt=sample_baseball_tweet + '\n\n###\n\n', max_tokens=1, temperature=0, logprobs=2)
res['choices'][0]['text']
' baseball'