ChatGPT / GPT-4 NLU in Practice: Entity and Topic Classification with Model Fine-Tuning (with ipynb, Python Source, and Video) — DataWhale's Open-Source Beginner's Guide to ChatGPT from 0 to 1 (Part 6)

Table of Contents

  • Foreword
  • ChatGPT / GPT-4 NLU in practice: entity/topic classification and model fine-tuning
    • Topic classification
    • Ways to improve classification accuracy
    • Fine-tuning steps
    • Core code
  • Other NLU applications and practice
    • Related references
  • References
  • Other resources for download


Foreword

Imagine being able to train a large natural-language model tailored to your own scenario and purpose, helping you interact with users faster and more accurately in real-time chat. Pretty impressive, right?

ChatGPT makes this remarkably simple. With fine-tuning, you retrain a pre-trained language model on your own corpus, improving its accuracy and reliability so that it performs better in your specific setting and works at its best in conversations with your customers.

Let's start from scratch and learn, step by step, how to build a customized natural-language model.

ChatGPT / GPT-4 NLU in practice: entity/topic classification and model fine-tuning

Topic classification

  In the earlier section on the relevant APIs, we already covered the basic usage for classification and entity extraction. Here we'll walk through more concrete, common tasks.

  First up is topic classification: given a piece of text, decide which topic it belongs to.

  Let's look at a news topic-classification dataset, taken from CLUEbenchmark/CLUE (the TNEWS task), with 15 classes in total.

import pnlp
lines = pnlp.read_file_to_list_dict("./dataset/tnews.json")
len(lines)
10000
from collections import Counter
ct = Counter([v["label_desc"] for v in lines])
ct.most_common()
[('news_tech', 1089),
 ('news_finance', 956),
 ('news_entertainment', 910),
 ('news_world', 905),
 ('news_car', 791),
 ('news_sports', 767),
 ('news_culture', 736),
 ('news_military', 716),
 ('news_travel', 693),
 ('news_game', 659),
 ('news_edu', 646),
 ('news_agriculture', 494),
 ('news_house', 378),
 ('news_story', 215),
 ('news_stock', 45)]

  The news_stock class has far too few samples, so let's drop it first:

lines = [v for v in lines if v["label_desc"] != "news_stock"]
len(lines)
9955
def get_prompt(text):
    prompt = f"""对给定文本进行分类,类别包括:科技、金融、娱乐、世界、汽车、文化、军事、旅游、游戏、教育、农业、房产、社会、股票。给定文本:
{text}
类别:
"""
    return prompt
lines[0]
{'label': '102','label_desc': 'news_entertainment','sentence': '江疏影甜甜圈自拍,迷之角度竟这么好看,美吸引一切事物','keywords': '江疏影,美少女,经纪人,甜甜圈'}
prompt = get_prompt(lines[0]["sentence"])
print(prompt)
对给定文本进行分类,类别包括:科技、金融、娱乐、世界、汽车、文化、军事、旅游、游戏、教育、农业、房产、社会、股票。给定文本:
江疏影甜甜圈自拍,迷之角度竟这么好看,美吸引一切事物
类别:
import openai
OPENAI_API_KEY = "your API key"
openai.api_key = OPENAI_API_KEY
def complete(prompt):
    response = openai.Completion.create(
        prompt=prompt,
        temperature=0,
        max_tokens=10,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        model="text-davinci-003",
    )
    ans = response["choices"][0]["text"].strip(" \n")
    return ans
complete(prompt)
'娱乐'
def ask(content):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": content}],
        temperature=0,
        max_tokens=10,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    ans = response.get("choices")[0].get("message").get("content")
    return ans
ask(prompt)
'娱乐'

  Let's try a few more categories:

lines[1]
{'label': '110','label_desc': 'news_military','sentence': '以色列大规模空袭开始!伊朗多个军事目标遭遇打击,誓言对等反击','keywords': '伊朗,圣城军,叙利亚,以色列国防军,以色列'}
prompt = get_prompt(lines[1]["sentence"])
print(prompt)
对给定文本进行分类,类别包括:科技、金融、娱乐、世界、汽车、文化、军事、旅游、游戏、教育、农业、房产、社会、股票。给定文本:
以色列大规模空袭开始!伊朗多个军事目标遭遇打击,誓言对等反击
类别:
ask(prompt)
'军事'
complete(prompt)
'军事'
lines[2]
{'label': '104','label_desc': 'news_finance','sentence': '出栏一头猪亏损300元,究竟谁能笑到最后!','keywords': '商品猪,养猪,猪价,仔猪,饲料'}
prompt = get_prompt(lines[2]["sentence"])
print(prompt)
对给定文本进行分类,类别包括:科技、金融、娱乐、世界、汽车、文化、军事、旅游、游戏、教育、农业、房产、社会、股票。给定文本:
出栏一头猪亏损300元,究竟谁能笑到最后!
类别:
ask(prompt)
'农业'
complete(prompt)
'社会'

Ways to improve classification accuracy

  That last one is a bit muddled, although "农业" (agriculture) honestly seems fine for that headline too. Still, when the model does get things wrong, we have at least two remedies:

  • Few-Shot: each time, randomly sample a few records from the dataset and include them as part of the prompt.
  • Fine-Tuning: prepare our own dataset in the required format and submit it to the API, which fine-tunes a model just for us — one that has actually learned from our dataset.

  The crux of Few-Shot is how to choose the "few" — in other words, which cases to show the model as reference examples. For multi-class problems with many classes (hundreds or even thousands of labels are common in real work), even a single example per label makes the context enormous, which is impractical, so Few-Shot becomes inconvenient. If we insist on using it anyway, the most common strategy still applies: first recall a few similar sentences, then use their text and labels as the Few-Shot examples and have the API predict the class of the given sentence.
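  Here is a minimal sketch of that retrieval-based Few-Shot strategy. Everything beyond what the article has already defined is an illustrative assumption — in particular, the embedding model name text-embedding-ada-002 and the two helper functions below are my own choices, not part of the original notebook.

import numpy as np

def embed(texts):
    # Embeddings endpoint, openai-python 0.x style (same SDK as the calls above).
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

def build_few_shot_prompt(query, pool, pool_vecs, k=3):
    # Recall the k most similar labeled sentences and prepend them as examples.
    q = embed([query])[0]
    sims = pool_vecs @ q  # dot product = cosine similarity (ada-002 vectors are unit-normalized)
    topk = sims.argsort()[-k:][::-1]
    examples = "\n".join(
        f"给定文本:{pool[i]['sentence']}\n类别:{pool[i]['label_desc']}" for i in topk
    )
    return f"{examples}\n给定文本:{query}\n类别:"

# Usage (embedding the pool costs API calls, so keep it small):
# pool = lines[:200]
# pool_vecs = embed([v["sentence"] for v in pool])
# ask(build_few_shot_prompt(lines[2]["sentence"], pool, pool_vecs))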

Fine-tuning steps

  Alternatively, we can use fine-tuning to adapt the model on our own dataset. In short, we make the model "familiar" with our particular data, so that it can recognize the right classes on similar data.

  Let's see how this is done. There are generally three main steps:

  • Prepare the data: format it as the API requires — here, our own dataset, containing at least a piece of text and a class label.
  • Fine-tune: upload the data via the fine-tuning API and let the server do the work; when it finishes you get a new model_id. Note that this model_id belongs to you alone — don't share it with others.
  • Run inference with the new model: simply replace the model parameter in the original API call with your new model_id.

Core code

  Now let's actually fine-tune this multi-class model. We'll take only the last 500 records as the training set (to keep things fast and cheap...).

train_lines = lines[-500:]
import pandas as pd
train = pd.DataFrame(train_lines)
train.shape
(500, 4)
train.head()
   label    label_desc                                            sentence                            keywords
0    103   news_sports  为什么斯凯奇与阿迪达斯脚感很相似,价格却差了近一倍?     达斯勒,阿迪达斯,FOAM,BOOST,斯凯奇
1    100    news_story  女儿日渐消瘦,父母发现有怪物,每天吃女儿一遍             大将军,怪物
2    104  news_finance  另类逼空确认反弹,剑指3200点以上                         股票,另类逼空,金融,创业板,快速放大
3    100    news_story  老公在聚会上让我向他的上司敬酒,现在老公哭了,我笑了   远走高飞
4    108      news_edu  女孩上初中之后成绩下降,如何才能提升成绩?
train.label_desc.value_counts()
news_finance          48
news_tech             47
news_game             46
news_entertainment    46
news_travel           44
news_sports           42
news_military         40
news_world            38
news_car              36
news_culture          35
news_edu              27
news_agriculture      20
news_house            19
news_story            12
Name: label_desc, dtype: int64

A few classes (news_story, for instance) are on the small side, but that's not a big problem.

Step 1: Prepare the data

  Make sure the data has two columns named prompt and completion:

df_train = train[["sentence", "label_desc"]]
df_train.columns = ["prompt", "completion"]
df_train.head()
                                                prompt    completion
0  为什么斯凯奇与阿迪达斯脚感很相似,价格却差了近一倍?   news_sports
1  女儿日渐消瘦,父母发现有怪物,每天吃女儿一遍           news_story
2  另类逼空确认反弹,剑指3200点以上                       news_finance
3  老公在聚会上让我向他的上司敬酒,现在老公哭了,我笑了  news_story
4  女孩上初中之后成绩下降,如何才能提升成绩?             news_edu

  Save it to a file:

df_train.to_json("dataset/tnews-finetuning.jsonl", orient='records', lines=True)

  Then use the openai command-line tool to convert it:

!openai tools fine_tunes.prepare_data -f dataset/tnews-finetuning.jsonl -q
Analyzing...

- Your file contains 500 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- More than a third of your `prompt` column/key is uppercase. Uppercase prompts tends to perform worse than a mixture of case encountered in normal language. We recommend to lower case the data if that makes sense in your domain. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details
- Your data does not contain a common separator at the end of your prompts. Having a separator string appended to the end of the prompt makes it clearer to the fine-tuned model where the completion should begin. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more detail and examples. If you intend to do open-ended generation, then you should leave the prompts empty
- All completions start with prefix `news_`. Most of the time you should only add the output data into the completion, without any prefix
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details

Based on the analysis we will perform the following actions:
- [Recommended] Lowercase all your data in column/key `prompt` [Y/n]: Y
- [Recommended] Add a suffix separator ` ->` to all prompts [Y/n]: Y
- [Recommended] Remove prefix `news_` from all completions [Y/n]: Y
- [Recommended] Add a whitespace character to the beginning of the completion [Y/n]: Y
- [Recommended] Would you like to split into training and validation set? [Y/n]: Y

Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified files to `dataset/tnews-finetuning_prepared_train.jsonl` and `dataset/tnews-finetuning_prepared_valid.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "dataset/tnews-finetuning_prepared_train.jsonl" -v "dataset/tnews-finetuning_prepared_valid.jsonl" --compute_classification_metrics --classification_n_classes 14

After you've fine-tuned a model, remember that your prompt has to end with the indicator string ` ->` for the model to start generating completions, rather than continuing with the prompt.
Once your model starts training, it'll approximately take 14.33 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.

  Let's see what the processed data looks like:

!head dataset/tnews-finetuning_prepared_train.jsonl
{"prompt":"cf生存特训:火箭弹狂野复仇,为兄弟报仇就要不死不休 ->","completion":" game"}{"prompt":"哈尔滨 东北抗日联军博物馆 ->","completion":" culture"}{"prompt":"中国股市中,庄家为何如此猖獗?一文告诉你真相 ->","completion":" finance"}{"prompt":"天府锦绣又重来 ->","completion":" agriculture"}{"prompt":"生活,游戏,电影中有哪些词汇稍加修改便可以成为一个非常霸气的名字? ->","completion":" game"}{"prompt":"法庭上,生父要争夺孩子抚养权,小男孩的发言让生父当场哑口无言 ->","completion":" entertainment"}{"prompt":"如何才能选到好的深圳大数据培训机构? ->","completion":" edu"}{"prompt":"有哪些娱乐圈里面的明星追星? ->","completion":" entertainment"}{"prompt":"东坞原生态野生茶 ->","completion":" culture"}{"prompt":"亚冠:恒大不胜早有预示,全北失利命中注定 ->","completion":" sports"}

  It's also a good idea to verify that every class appears in both splits:

train = pnlp.read_file_to_list_dict("./dataset/tnews-finetuning_prepared_train.jsonl")
valid = pnlp.read_file_to_list_dict("./dataset/tnews-finetuning_prepared_valid.jsonl")
len(Counter([v["completion"] for v in train]))
14
len(Counter([v["completion"] for v in valid]))
14

  The most obvious change is the marker appended to the end of every prompt. Beyond that, the tool also did the following (see the log above):

  • lowercased the text
  • removed the news_ prefix from the labels
  • prepended a space to each completion
  • split the data into training and validation sets

  These are all common, recommended preprocessing steps, so we'll keep them as they are.
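  If you'd rather apply the same transformations in code than answer the CLI's prompts, here is a minimal pandas sketch mirroring the four actions above. The ` ->` separator and news_ stripping follow the tool's log; the 80/20 split ratio and random seed are my own assumptions (and Series.str.removeprefix needs pandas >= 1.4).

df = df_train.copy()  # the prompt/completion frame built above
df["prompt"] = df["prompt"].str.lower() + " ->"  # lowercase + suffix separator
df["completion"] = " " + df["completion"].str.removeprefix("news_")  # leading space, no prefix
valid_df = df.sample(frac=0.2, random_state=42)  # hold out a validation split, as prepare_data does
train_df = df.drop(valid_df.index)
train_df.to_json("dataset/tnews-finetuning_prepared_train.jsonl", orient="records", lines=True, force_ascii=False)
valid_df.to_json("dataset/tnews-finetuning_prepared_valid.jsonl", orient="records", lines=True, force_ascii=False)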

Step 2: Fine-tune

import os
os.environ.setdefault("OPENAI_API_KEY", "your API key")
'sk-...'
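  Incidentally, the key is a secret — avoid hard-coding it in notebooks you might share (the cell above echoes it straight into the output). A safer pattern is to keep it only in the environment:

# Read the key from the environment so it never appears in notebook output.
openai.api_key = os.environ["OPENAI_API_KEY"]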
!openai api fine_tunes.create \
    -t "./dataset/tnews-finetuning_prepared_train.jsonl" \
    -v "./dataset/tnews-finetuning_prepared_valid.jsonl" \
    --compute_classification_metrics --classification_n_classes 14 \
    -m davinci \
    --no_check_if_files_exist
Upload progress: 100%|████████████████████| 41.2k/41.2k [00:00<00:00, 15.6Mit/s]
Uploaded file from ./dataset/tnews-finetuning_prepared_train.jsonl: file-DKjBKHqWFJwo7O8MNZGOcj3F
Upload progress: 100%|████████████████████| 10.5k/10.5k [00:00<00:00, 7.85Mit/s]
Uploaded file from ./dataset/tnews-finetuning_prepared_valid.jsonl: file-j088k3GWqGeqY0o2DDAfWfPh
Created fine-tune: ft-QOkrWkHU0aleR6f5IQw1UpVL
Streaming events until fine-tuning is complete...
(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-04-04 21:16:33] Created fine-tune: ft-QOkrWkHU0aleR6f5IQw1UpVL
Stream interrupted (client disconnected).
To resume the stream, run:
openai api fine_tunes.follow -i ft-QOkrWkHU0aleR6f5IQw1UpVL

  Here, -t and -v specify the training and validation files, the --compute_classification_metrics --classification_n_classes 14 pair asks the service to compute classification metrics, and -m specifies the base model to fine-tune. The last flag skips the check for previously uploaded files — if you have uploaded the same files before, you could reuse them, but for this demo we skip the check. For which models can be fine-tuned and at what cost, see: Pricing.

  Also worth noting: only the Completion API can be fine-tuned; ChatCompletion does not support fine-tuning. In other words, the InstructGPT-family models can be fine-tuned, but ChatGPT itself cannot; see https://platform.openai.com/docs/guides/chat/is-fine-tuning-available-for-gpt-3-5-turbo

  We pick davinci here — the same family we've been using all along.

  As you can see, the stream disconnected shortly after starting. That's normal; we can check the job's progress through another API call. Note that the ID is the one printed in the log above, and it changes on every run.

!openai api fine_tunes.get -i ft-QOkrWkHU0aleR6f5IQw1UpVL
{"created_at": 1680614193,"events": [{"created_at": 1680614193,"level": "info","message": "Created fine-tune: ft-QOkrWkHU0aleR6f5IQw1UpVL","object": "fine-tune-event"}],"fine_tuned_model": null,"hyperparams": {"batch_size": null,"classification_n_classes": 14,"compute_classification_metrics": true,"learning_rate_multiplier": null,"n_epochs": 4,"prompt_loss_weight": 0.01},"id": "ft-QOkrWkHU0aleR6f5IQw1UpVL","model": "davinci","object": "fine-tune","organization_id": "org-bKXddeZffpMS2CUNCCXsW7m5","result_files": [],"status": "pending","training_files": [{"bytes": 41212,"created_at": 1680614191,"filename": "./dataset/tnews-finetuning_prepared_train.jsonl","id": "file-DKjBKHqWFJwo7O8MNZGOcj3F","object": "file","purpose": "fine-tune","status": "processed","status_details": null}],"updated_at": 1680614193,"validation_files": [{"bytes": 10507,"created_at": 1680614193,"filename": "./dataset/tnews-finetuning_prepared_valid.jsonl","id": "file-j088k3GWqGeqY0o2DDAfWfPh","object": "file","purpose": "fine-tune","status": "processed","status_details": null}]}

  Or use the command the tool just suggested. Remember to swap in your own ID!

!openai api fine_tunes.follow -i ft-QOkrWkHU0aleR6f5IQw1UpVL
[2023-04-04 14:32:14] Created fine-tune: ft-5LDv5IiFqPvLob3KkThWLTUG
Stream interrupted (client disconnected).
To resume the stream, run:
openai api fine_tunes.follow -i ft-5LDv5IiFqPvLob3KkThWLTUG

  Note that this one is follow, whereas the previous one was get. You can see all the subcommands with openai api --help:

!openai api --help
usage: openai api [-h]
                  {engines.list,engines.get,engines.update,engines.generate,chat_completions.create,completions.create,deployments.list,deployments.get,deployments.delete,deployments.create,models.list,models.get,models.delete,files.create,files.get,files.delete,files.list,fine_tunes.list,fine_tunes.create,fine_tunes.get,fine_tunes.results,fine_tunes.events,fine_tunes.follow,fine_tunes.cancel,fine_tunes.delete,image.create,image.create_edit,image.create_variation,audio.transcribe,audio.translate}
                  ...

positional arguments:
  {engines.list,engines.get,engines.update,engines.generate,chat_completions.create,completions.create,deployments.list,deployments.get,deployments.delete,deployments.create,models.list,models.get,models.delete,files.create,files.get,files.delete,files.list,fine_tunes.list,fine_tunes.create,fine_tunes.get,fine_tunes.results,fine_tunes.events,fine_tunes.follow,fine_tunes.cancel,fine_tunes.delete,image.create,image.create_edit,image.create_variation,audio.transcribe,audio.translate}
                        All API subcommands

optional arguments:
  -h, --help            show this help message and exit

  It's enough to get the status every once in a while; there's no need to keep following the stream. The job may wait in the queue for some time, but once it's out of the queue the training itself finishes quickly.
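  If you'd rather poll from Python than re-run the CLI, here is a small sketch using the same 0.x SDK (the 60-second interval is an arbitrary choice):

import time

def wait_for_finetune(ft_id, interval=60):
    # Poll the fine-tune job until it reaches a terminal status.
    while True:
        job = openai.FineTune.retrieve(id=ft_id)
        print(job["status"])  # pending -> running -> succeeded / failed
        if job["status"] in ("succeeded", "failed", "cancelled"):
            return job
        time.sleep(interval)

# job = wait_for_finetune("ft-QOkrWkHU0aleR6f5IQw1UpVL")
# print(job["fine_tuned_model"])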

  The main field to watch is status. Again, remember to use your own ID.

!openai api fine_tunes.get -i ft-QOkrWkHU0aleR6f5IQw1UpVL
{"created_at": 1680614193,"events": [{"created_at": 1680614193,"level": "info","message": "Created fine-tune: ft-QOkrWkHU0aleR6f5IQw1UpVL","object": "fine-tune-event"},{"created_at": 1680614845,"level": "info","message": "Fine-tune costs $2.33","object": "fine-tune-event"},{"created_at": 1680614846,"level": "info","message": "Fine-tune enqueued","object": "fine-tune-event"},{"created_at": 1680617657,"level": "info","message": "Fine-tune is in the queue. Queue number: 31","object": "fine-tune-event"},{"created_at": 1680617785,"level": "info","message": "Fine-tune is in the queue. Queue number: 30","object": "fine-tune-event"},{"created_at": 1680617805,"level": "info","message": "Fine-tune is in the queue. Queue number: 29","object": "fine-tune-event"},{"created_at": 1680617809,"level": "info","message": "Fine-tune is in the queue. Queue number: 28","object": "fine-tune-event"},{"created_at": 1680617918,"level": "info","message": "Fine-tune is in the queue. Queue number: 27","object": "fine-tune-event"},{"created_at": 1680617928,"level": "info","message": "Fine-tune is in the queue. Queue number: 26","object": "fine-tune-event"},{"created_at": 1680618038,"level": "info","message": "Fine-tune is in the queue. Queue number: 25","object": "fine-tune-event"},{"created_at": 1680618050,"level": "info","message": "Fine-tune is in the queue. Queue number: 24","object": "fine-tune-event"},{"created_at": 1680618087,"level": "info","message": "Fine-tune is in the queue. Queue number: 23","object": "fine-tune-event"},{"created_at": 1680618096,"level": "info","message": "Fine-tune is in the queue. Queue number: 22","object": "fine-tune-event"},{"created_at": 1680618207,"level": "info","message": "Fine-tune is in the queue. Queue number: 21","object": "fine-tune-event"},{"created_at": 1680618256,"level": "info","message": "Fine-tune is in the queue. Queue number: 20","object": "fine-tune-event"},{"created_at": 1680618268,"level": "info","message": "Fine-tune is in the queue. Queue number: 19","object": "fine-tune-event"},{"created_at": 1680618336,"level": "info","message": "Fine-tune is in the queue. Queue number: 18","object": "fine-tune-event"},{"created_at": 1680618372,"level": "info","message": "Fine-tune is in the queue. Queue number: 17","object": "fine-tune-event"},{"created_at": 1680618445,"level": "info","message": "Fine-tune is in the queue. Queue number: 16","object": "fine-tune-event"},{"created_at": 1680618488,"level": "info","message": "Fine-tune is in the queue. Queue number: 15","object": "fine-tune-event"},{"created_at": 1680618582,"level": "info","message": "Fine-tune is in the queue. Queue number: 14","object": "fine-tune-event"},{"created_at": 1680618612,"level": "info","message": "Fine-tune is in the queue. Queue number: 13","object": "fine-tune-event"},{"created_at": 1680618652,"level": "info","message": "Fine-tune is in the queue. Queue number: 12","object": "fine-tune-event"},{"created_at": 1680618693,"level": "info","message": "Fine-tune is in the queue. Queue number: 11","object": "fine-tune-event"},{"created_at": 1680618717,"level": "info","message": "Fine-tune is in the queue. Queue number: 10","object": "fine-tune-event"},{"created_at": 1680618759,"level": "info","message": "Fine-tune is in the queue. Queue number: 9","object": "fine-tune-event"},{"created_at": 1680618790,"level": "info","message": "Fine-tune is in the queue. Queue number: 8","object": "fine-tune-event"},{"created_at": 1680618841,"level": "info","message": "Fine-tune is in the queue. 
Queue number: 7","object": "fine-tune-event"},{"created_at": 1680618886,"level": "info","message": "Fine-tune is in the queue. Queue number: 6","object": "fine-tune-event"},{"created_at": 1680618899,"level": "info","message": "Fine-tune is in the queue. Queue number: 5","object": "fine-tune-event"},{"created_at": 1680618928,"level": "info","message": "Fine-tune is in the queue. Queue number: 4","object": "fine-tune-event"},{"created_at": 1680618979,"level": "info","message": "Fine-tune is in the queue. Queue number: 3","object": "fine-tune-event"},{"created_at": 1680619057,"level": "info","message": "Fine-tune is in the queue. Queue number: 2","object": "fine-tune-event"},{"created_at": 1680619065,"level": "info","message": "Fine-tune is in the queue. Queue number: 1","object": "fine-tune-event"},{"created_at": 1680619127,"level": "info","message": "Fine-tune is in the queue. Queue number: 0","object": "fine-tune-event"},{"created_at": 1680619227,"level": "info","message": "Fine-tune started","object": "fine-tune-event"},{"created_at": 1680619443,"level": "info","message": "Completed epoch 1/4","object": "fine-tune-event"},{"created_at": 1680619573,"level": "info","message": "Completed epoch 2/4","object": "fine-tune-event"},{"created_at": 1680619701,"level": "info","message": "Completed epoch 3/4","object": "fine-tune-event"},{"created_at": 1680619829,"level": "info","message": "Completed epoch 4/4","object": "fine-tune-event"},{"created_at": 1680619890,"level": "info","message": "Uploaded model: davinci:ft-personal-2023-04-04-14-51-29","object": "fine-tune-event"},{"created_at": 1680619891,"level": "info","message": "Uploaded result file: file-xvIjsDl5aFEtXWOVY6nZyLcD","object": "fine-tune-event"},{"created_at": 1680619891,"level": "info","message": "Fine-tune succeeded","object": "fine-tune-event"}],"fine_tuned_model": "davinci:ft-personal-2023-04-04-14-51-29","hyperparams": {"batch_size": 1,"classification_n_classes": 14,"compute_classification_metrics": true,"learning_rate_multiplier": 0.1,"n_epochs": 4,"prompt_loss_weight": 0.01},"id": "ft-QOkrWkHU0aleR6f5IQw1UpVL","model": "davinci","object": "fine-tune","organization_id": "org-bKXddeZffpMS2CUNCCXsW7m5","result_files": [{"bytes": 82416,"created_at": 1680619891,"filename": "compiled_results.csv","id": "file-xvIjsDl5aFEtXWOVY6nZyLcD","object": "file","purpose": "fine-tune-results","status": "processed","status_details": null}],"status": "succeeded","training_files": [{"bytes": 41212,"created_at": 1680614191,"filename": "./dataset/tnews-finetuning_prepared_train.jsonl","id": "file-DKjBKHqWFJwo7O8MNZGOcj3F","object": "file","purpose": "fine-tune","status": "processed","status_details": null}],"updated_at": 1680619892,"validation_files": [{"bytes": 10507,"created_at": 1680614193,"filename": "./dataset/tnews-finetuning_prepared_valid.jsonl","id": "file-j088k3GWqGeqY0o2DDAfWfPh","object": "file","purpose": "fine-tune","status": "processed","status_details": null}]}

  After quite a long wait, the job finally succeeded. Once fine-tuning is done, we can inspect the results with the following command:

# -i is the fine-tune job ID from above (the `id` field)
!openai api fine_tunes.results -i ft-QOkrWkHU0aleR6f5IQw1UpVL > metric.csv
metric = pd.read_csv('metric.csv')
metric[metric['classification/accuracy'].notnull()].tail(1)
      step  elapsed_tokens  elapsed_examples  training_loss  training_sequence_accuracy  training_token_accuracy  validation_loss  validation_sequence_accuracy  validation_token_accuracy  classification/accuracy  classification/weighted_f1_score
1601  1602           83226              1602       0.008739                         1.0                      1.0              NaN                           NaN                        NaN                     0.63                          0.619592
metric[metric['classification/accuracy'].notnull()]['classification/accuracy'].plot()
<AxesSubplot:>

(Figure: classification/accuracy plotted over training steps)

  This accuracy is honestly quite mediocre — most likely because we provided so little data. First, though, let's look at how to use the model.

Step 3: Use the model

prompt = get_prompt(lines[2]["sentence"])
print(prompt)
对给定文本进行分类,类别包括:科技、金融、娱乐、世界、汽车、文化、军事、旅游、游戏、教育、农业、房产、社会、股票。给定文本:
出栏一头猪亏损300元,究竟谁能笑到最后!
类别:
def complete(prompt, model, max_tokens=2):
    response = openai.Completion.create(
        prompt=prompt,
        temperature=0,
        max_tokens=max_tokens,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        model=model,
    )
    ans = response["choices"][0]["text"].strip(" \n")
    return ans
# the original model, before fine-tuning
complete(prompt, "text-davinci-003", 5)
'社会'

  Note that our prompt has to change as well. We now expect the model to return an English word — recall the completions in the training data. For this walkthrough we deliberately used Chinese labels in the earlier prompts and English labels during fine-tuning (so that an English label in the output demonstrates the fine-tuning took effect). In real use, be sure to keep them consistent.

  Now we swap in the model we just fine-tuned (the fine_tuned_model field in the result above).

# after fine-tuning
prompt = lines[2]["sentence"] + " ->"
complete(prompt, "davinci:ft-personal-2023-04-04-14-51-29", 1)
'agriculture'

  Interesting — now it says agriculture, the same answer ChatGPT (ChatCompletion) gave earlier, though still not the "金融" (finance) label from our dataset. Honestly, the sentence does read more like an agriculture story, so filing it under agriculture seems acceptable; besides, this particular sample was not in our training set, so it's not a big deal.

  If we really want the model to say finance, we can add this record to the fine-tuning data as well; after fine-tuning again, it should then produce the label from our training set.
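  As a sanity check beyond single examples, here is a hedged sketch that replays the prepared validation set through the fine-tuned model and computes accuracy locally. It reuses the complete helper and the valid list loaded with pnlp above; note it makes one request per sample, so mind the cost.

def eval_finetuned(model_id, samples):
    # Accuracy of the fine-tuned model on prepared prompt/completion pairs.
    correct = 0
    for v in samples:
        pred = complete(v["prompt"], model_id, max_tokens=2)
        correct += int(pred == v["completion"].strip())
    return correct / len(samples)

# acc = eval_finetuned("davinci:ft-personal-2023-04-04-14-51-29", valid[:50])
# print(acc)  # should land in the ballpark of the 0.63 reported above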

  That covers fine-tuning for topic classification. Fine-tuning for entity extraction works the same way; the recommended input format looks like this:

{"prompt":"<any text, for example news article>\n\n###\n\n", "completion":" <list of entities, separated by a newline> END"}
{'prompt': '<any text, for example news article>\n\n###\n\n','completion': ' <list of entities, separated by a newline> END'}

  For example:

{"prompt":"Portugal will be removed from the UK's green travel list from Tuesday, amid rising coronavirus cases and concern over a \"Nepal mutation of the so-called Indian variant\". It will join the amber list, meaning holidaymakers should not visit and returnees must isolate for 10 days...\n\n###\n\n", "completion":" Portugal\nUK\nNepal mutation\nIndian variant END"}

  This should be easy to follow — give it a try yourself. It's especially worth fine-tuning on entities from a specialized domain and comparing the results before and after.
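  As a starting point, here is a sketch that converts (text, entities) pairs into the recommended format above; the sample record is made up purely for illustration.

import json

def to_ner_record(text, entities):
    # One fine-tuning record in the recommended entity-extraction format.
    return {
        "prompt": f"{text}\n\n###\n\n",
        "completion": " " + "\n".join(entities) + " END",
    }

samples = [("Apple unveiled the Vision Pro headset in Cupertino.",
            ["Apple", "Vision Pro", "Cupertino"])]
with open("dataset/ner-finetuning.jsonl", "w", encoding="utf-8") as f:
    for text, ents in samples:
        f.write(json.dumps(to_ner_record(text, ents), ensure_ascii=False) + "\n")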

  If you're interested in digging deeper into this area, see references [11] and [12] below.

Other NLU applications and practice

  • ChatGPT / GPT-4 NLU in practice: ChatPDF-style document question answering

  • ChatGPT / GPT-4 NLU in practice: multi-turn dialogue chatbots

Related references

  • 【1】GPT-3 and Its In-Context Learning | Yam
  • 【2】ChatGPT Prompt Engineering: Design, Practice, and Reflections | Yam
  • 【3】Some ChatGPT Prompt Examples | Yam
  • 【4】dair-ai/Prompt-Engineering-Guide: 🐙 Guides, papers, lecture, notebooks and resources for prompt engineering
  • 【5】Best practices for prompt engineering with OpenAI API | OpenAI Help Center
  • 【6】ChatGPT Prompts and Products | PromptVine
  • 【7】Prompt Vibes
  • 【8】ShareGPT: Share your wildest ChatGPT conversations with one click.
  • 【9】Awesome ChatGPT Prompts | This repo includes ChatGPT prompt curation to use ChatGPT better.
  • 【10】Learn Prompting | Learn Prompting
  • 【11】Fine-tuning - OpenAI API
  • 【12】[PUBLIC] Best practices for fine-tuning GPT-3 to classify text - Google Docs
  • 【13】hscspring/chatbot: Lab for Chatbot

memo:

  • openai-cookbook/text_explanation_examples.md at main · openai/openai-cookbook
  • openai-cookbook/Fine-tuned_classification.ipynb at main · openai/openai-cookbook
  • openai-cookbook/Gen_QA.ipynb at main · openai/openai-cookbook
  • openai-cookbook/Question_answering_using_embeddings.ipynb at main · openai/openai-cookbook

References

ChatGPT Usage Guide: Sentence and Word Classification, by @长琴

Related video walkthroughs

Other resources for download

If you'd like to go further with AI learning paths and the broader knowledge system, you're welcome to read my other post, "Comprehensive AI Learning — Foundational Knowledge Roadmap, All Materials Freely Downloadable with No Follows or Gimmicks".
That post draws on well-known open-source communities and experts in the field — Datawhale, ApacheCN, AI有道, Dr. Huang Haiguang, and others — with nearly 100 GB of related material; I hope it helps everyone.
