ChatGLM-6B is an open-source conversational language model supporting bilingual (Chinese-English) Q&A, built on the General Language Model (GLM) architecture with 6.2 billion parameters.
My local GPU only has 6 GB of VRAM (GTX 1660 Ti), which is just enough: the model can be deployed locally on consumer-grade GPUs (as little as 6 GB of VRAM at the INT4 quantization level). The corresponding model weights can be downloaded from the links below.
Reference: https://huggingface.co/THUDM/chatglm-6b-int4
https://github.com/THUDM/ChatGLM-6B
Environment setup
1. Install the torch packages matching your CUDA version
# CUDA 11.3
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
2. Install transformers and related packages
pip install protobuf transformers==4.27.1 cpm_kernels sentencepiece
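After installing, a quick check (a minimal sketch, not part of the original setup) confirms that the CUDA build of torch was actually picked up:

import torch

print(torch.__version__)              # expected: 1.12.1+cu113
print(torch.cuda.is_available())      # True if the CUDA build is active
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce GTX 1660 Ti"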
1. Usage
Inference is still fairly slow on this card; long responses take more than 3 minutes.
from transformers import AutoTokenizer, AutoModel

## Downloading/loading the model online is slow, so download it first and load it from a local path.
## Model download: https://cloud.tsinghua.edu.cn/d/674208019e314311ab5c/
tokenizer = AutoTokenizer.from_pretrained(r"C:\Users\lonng\Downloads\chatglm-6b-int4", trust_remote_code=True)
model = AutoModel.from_pretrained(r"C:\Users\lonng\Downloads\chatglm-6b-int4", trust_remote_code=True).half().cuda()

response, history = model.chat(tokenizer, "你好", history=[])
print(response)

response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=history)
print(response)
The chatglm-6b-int4 model's .bin file is a little over 3 GB.
response, history = model.chat(tokenizer, "单细胞测序分析方法", history=history)
Multi-process, high-concurrency API: Python FastAPI code
from fastapi import FastAPI, Request
from transformers import AutoTokenizer, AutoModel
import uvicorn, json, datetime
import torch

DEVICE = "cuda"
DEVICE_ID = "0"
CUDA_DEVICE = f"{DEVICE}:{DEVICE_ID}" if DEVICE_ID else DEVICE


def torch_gc():
    if torch.cuda.is_available():
        with torch.cuda.device(CUDA_DEVICE):
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()


app = FastAPI()


@app.post("/")
async def create_item(request: Request):
    global model, tokenizer
    json_post_raw = await request.json()
    json_post = json.dumps(json_post_raw)
    json_post_list = json.loads(json_post)
    prompt = json_post_list.get('prompt')
    history = json_post_list.get('history')
    max_length = json_post_list.get('max_length')
    top_p = json_post_list.get('top_p')
    temperature = json_post_list.get('temperature')
    response, history = model.chat(tokenizer,
                                   prompt,
                                   history=history,
                                   max_length=max_length if max_length else 2048,
                                   top_p=top_p if top_p else 0.7,
                                   temperature=temperature if temperature else 0.95)
    now = datetime.datetime.now()
    time = now.strftime("%Y-%m-%d %H:%M:%S")
    answer = {
        "response": response,
        "history": history,
        "status": 200,
        "time": time
    }
    log = "[" + time + "] " + '", prompt:"' + prompt + '", response:"' + repr(response) + '"'
    print(log)
    torch_gc()
    return answer


# Set the device to the second GPU
torch.cuda.set_device(1)
tokenizer = AutoTokenizer.from_pretrained("/mnt/data/chatglm/chatglm2-6b-int4", trust_remote_code=True)
model = AutoModel.from_pretrained("/mnt/data/chatglm/chatglm2-6b-int4", trust_remote_code=True).cuda()
model.eval()

if __name__ == '__main__':
    # Multi-GPU support: use the following three lines instead of the two above,
    # and set num_gpus to the number of GPUs you actually have.
    # model_path = "THUDM/chatglm2-6b"
    # tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    # model = load_model_on_gpus(model_path, num_gpus=2)
    # model.eval()
    uvicorn.run("api:app", host='192.168.19.14', port=8000, reload=True, workers=3)
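Once the service is running, it takes a plain JSON POST. A minimal client sketch (assuming the host/port configured above and that the requests package is installed):

import requests

resp = requests.post(
    "http://192.168.19.14:8000",
    json={"prompt": "你好", "history": []},
)
print(resp.json()["response"])

Note that uvicorn runs a single process when reload=True, so remove reload=True if you want the workers=3 setting to take effect; each worker process then loads its own copy of the model, so GPU memory usage grows with the worker count.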
2. P-Tuning v2 fine-tuning
Reference: https://github.com/THUDM/ChatGLM-6B/tree/main/ptuning
Run the following command to start training:
bash train.sh
### train.sh
PRE_SEQ_LEN=128
LR=2e-2

CUDA_VISIBLE_DEVICES=0 python3 main.py \
    --do_train \
    --train_file AdvertiseGen/train.json \
    --validation_file AdvertiseGen/dev.json \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path ****/chatglm-6b-int4 \
    --output_dir output/adgen-chatglm-6b-pt-$PRE_SEQ_LEN-$LR \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 64 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --predict_with_generate \
    --max_steps 3000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN \
    --quantization_bit 4
PRE_SEQ_LEN and LR in train.sh are the soft prompt length and the learning rate, respectively; both can be tuned for the best results. The P-Tuning v2 method freezes all of the model's original parameters; quantization_bit sets the quantization level at which the original model is loaded, and omitting it loads the model in FP16 precision.
Under the default configuration of quantization_bit=4, per_device_train_batch_size=1 and gradient_accumulation_steps=16, the INT4 model parameters are frozen, and one training iteration performs 16 accumulated forward/backward passes at a per-device batch size of 1, equivalent to a total batch size of 16; in this setup as little as 6.7 GB of VRAM is needed. To train more efficiently at the same effective batch size, you can increase per_device_train_batch_size while keeping the product of the two values constant, but this also consumes more VRAM, so adjust it to your hardware (see the example below).
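For example, per_device_train_batch_size=4 with gradient_accumulation_steps=4 keeps the effective batch size at 4 × 4 = 16 while doing fewer accumulation steps per optimizer update, at the cost of more VRAM. As a hypothetical variant of the train.sh above, the relevant lines would read:

    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \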
If you want to load the model from a local path, change THUDM/chatglm-6b in train.sh to your local model path.
The results are saved under ChatGLM-6B-main/ptuning/output:
Loading the fine-tuned weights:
Reference: https://github.com/THUDM/ChatGLM-6B/blob/main/ptuning/main.py
## Evaluate using the fine-tuned model
sh evaluate.sh
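To run inference with the P-Tuning checkpoint, the approach in the linked main.py is to load the base model first and then load only the prefix-encoder weights from the checkpoint. A rough sketch along those lines (the model path, pre_seq_len and checkpoint directory are assumptions taken from the train.sh above):

import os
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

MODEL_PATH = "****/chatglm-6b-int4"  # placeholder: the base model path used in train.sh
CHECKPOINT_PATH = "output/adgen-chatglm-6b-pt-128-2e-2/checkpoint-3000"  # assumed checkpoint dir

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
# pre_seq_len must match the PRE_SEQ_LEN used during training
config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True, pre_seq_len=128)
model = AutoModel.from_pretrained(MODEL_PATH, config=config, trust_remote_code=True)

# Load only the prefix-encoder weights produced by P-Tuning v2
prefix_state_dict = torch.load(os.path.join(CHECKPOINT_PATH, "pytorch_model.bin"))
new_prefix_state_dict = {}
for k, v in prefix_state_dict.items():
    if k.startswith("transformer.prefix_encoder."):
        new_prefix_state_dict[k[len("transformer.prefix_encoder."):]] = v
model.transformer.prefix_encoder.load_state_dict(new_prefix_state_dict)

model = model.half().cuda()
model.transformer.prefix_encoder.float()
model = model.eval()

response, history = model.chat(tokenizer, "你好", history=[])
print(response)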
3. Assigning a role (role play)
Reference: https://aistudio.baidu.com/aistudio/projectdetail/6306692
The main trick is to build the history in advance. Note that each turn in the history list is a tuple with two parts, (user content, assistant content), i.e. [(user content, assistant content), (...), ...]. A minimal sketch is shown next, followed by the full command-line demo.
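A minimal sketch of that structure (the role text here is a placeholder; tokenizer and model are loaded as in section 1):

# The first history turn establishes the persona and is never shown to the user
history = [("你名字叫XX小助手,是XX公司研发的AI助手。", "好的,很乐意为你服务。")]
response, history = model.chat(tokenizer, "你是谁?", history=history)
print(response)  # the model should answer as the assigned persona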
import os
import platform
import signal
from transformers import AutoTokenizer, AutoModel
import readline

tokenizer = AutoTokenizer.from_pretrained("/mnt/data/chatglm/chatglm2-6b-int4", trust_remote_code=True)
model = AutoModel.from_pretrained("/mnt/data/chatglm/chatglm2-6b-int4", trust_remote_code=True).half().cuda()
model = model.eval()

os_name = platform.system()
clear_command = 'cls' if os_name == 'Windows' else 'clear'
stop_stream = False


def build_prompt(history):
    prompt = "欢迎使用 杰创智能小助手 ,输入内容即可进行对话,clear 清空对话历史,stop 终止程序"
    for query, response in history:
        prompt += f"\n\n用户:{query}"
        prompt += f"\n\n小杰:{response}"
    return prompt


def signal_handler(signal, frame):
    global stop_stream
    stop_stream = True


def main():
    history = [("你名字叫***小助手,小名昵称为***,是***公司研发的AI助手,你主要擅长领域是*******", "好的,小**很乐意为你服务")]
    global stop_stream
    print("欢迎使用 ***小助手 ,输入内容即可进行对话,clear 清空对话历史,stop 终止程序")
    while True:
        query = input("\n用户:")
        if query.strip() == "stop":
            break
        if query.strip() == "clear":
            history = []
            os.system(clear_command)
            print("欢迎使用 *****小助手 ,输入内容即可进行对话,clear 清空对话历史,stop 终止程序")
            continue
        count = 0
        for response, history in model.stream_chat(tokenizer, query, history=history):
            if stop_stream:
                stop_stream = False
                break
            else:
                count += 1
                if count % 8 == 0:
                    os.system(clear_command)
                    print(build_prompt(history[1:]), flush=True)  ## history[1:] hides the role-setting turn so it is not shown to the user
                    signal.signal(signal.SIGINT, signal_handler)
        os.system(clear_command)
        print(build_prompt(history[1:]), flush=True)


if __name__ == "__main__":
    main()
Gradio version
from transformers import AutoModel, AutoTokenizer
import gradio as gr
import mdtex2html
from utils import load_model_on_gpus

tokenizer = AutoTokenizer.from_pretrained("/mnt/data/chatglm/chatglm2-6b-int4", trust_remote_code=True)
model = AutoModel.from_pretrained("/mnt/data/chatglm/chatglm2-6b-int4", trust_remote_code=True).half().cuda()
# Multi-GPU support: use the following two lines instead of the line above,
# and set num_gpus to the number of GPUs you actually have.
# from utils import load_model_on_gpus
# model = load_model_on_gpus("/mnt/data/chatglm/chatglm2-6b-int4", num_gpus=4)
model = model.eval()

"""Override Chatbot.postprocess"""


def postprocess(self, y):
    if y is None:
        return []
    for i, (message, response) in enumerate(y):
        y[i] = (
            None if message is None else mdtex2html.convert((message)),
            None if response is None else mdtex2html.convert(response),
        )
    return y


gr.Chatbot.postprocess = postprocess


def predict(input, chatbot, max_length, top_p, temperature, history, past_key_values):
    chatbot.append((input, ""))
    print("history:", history)
    for response, history in model.stream_chat(tokenizer, input, history,
                                               max_length=max_length, top_p=top_p,
                                               temperature=temperature):
        chatbot[-1] = (input, response)
        print(response, history)
        yield chatbot, history


def reset_user_input():
    return gr.update(value='')


def reset_state():
    ### Pre-set the role
    return [], [('你名字叫*********安全', '好的,小杰很乐意为你服务')], None


with gr.Blocks() as demo:
    gr.HTML("""<h1 align="center">杰创智能AI聊天</h1>""")

    chatbot = gr.Chatbot()
    with gr.Row():
        with gr.Column(scale=4):
            with gr.Column(scale=12):
                user_input = gr.Textbox(show_label=False, placeholder="Input...", lines=10).style(container=False)
            with gr.Column(min_width=32, scale=1):
                submitBtn = gr.Button("Submit", variant="primary")
        with gr.Column(scale=1):
            emptyBtn = gr.Button("Clear History")
            max_length = gr.Slider(0, 32768, value=8192, step=1.0, label="Maximum length", interactive=True)
            top_p = gr.Slider(0, 1, value=0.8, step=0.01, label="Top P", interactive=True)
            temperature = gr.Slider(0, 1, value=0.95, step=0.01, label="Temperature", interactive=True)

    ### Pre-set the role
    history = gr.State([('你名字叫******更安全', '好的,小杰很乐意为你服务')])
    past_key_values = gr.State(None)

    submitBtn.click(predict, [user_input, chatbot, max_length, top_p, temperature, history, past_key_values],
                    [chatbot, history], show_progress=True)
    submitBtn.click(reset_user_input, [], [user_input])

    emptyBtn.click(reset_state, outputs=[chatbot, history, past_key_values], show_progress=True)

demo.queue().launch(server_name="192.168.19.14", share=False)
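This demo additionally depends on gradio and mdtex2html (pip install gradio mdtex2html), and load_model_on_gpus comes from utils.py in the ChatGLM repo, so keep that file next to the script (or rely only on the commented multi-GPU block if you don't need it). Running the script serves the web UI at the server_name above on gradio's default port (7860) unless a port is passed to launch().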
Streamlit version
from transformers import AutoModel, AutoTokenizer
import streamlit as st
from streamlit_chat import message

st.set_page_config(
    page_title="ChatGLM2-6b 演示",
    page_icon=":robot:",
    layout='wide'
)


@st.cache_resource
def get_model():
    tokenizer = AutoTokenizer.from_pretrained("/mnt/data/chatglm/chatglm2-6b-int4", trust_remote_code=True)
    model = AutoModel.from_pretrained("/mnt/data/chatglm/chatglm2-6b-int4", trust_remote_code=True).half().cuda()
    # Multi-GPU support: use the following two lines instead of the line above,
    # and set num_gpus to the number of GPUs you actually have.
    # from utils import load_model_on_gpus
    # model = load_model_on_gpus("THUDM/chatglm2-6b", num_gpus=2)
    model = model.eval()
    return tokenizer, model


MAX_TURNS = 20
MAX_BOXES = MAX_TURNS * 2


def predict(input, max_length, top_p, temperature, history=None):
    tokenizer, model = get_model()
    if history is None:
        history = []

    with container:
        if len(history) > 0:
            if len(history) > MAX_BOXES:
                history = history[-MAX_TURNS:]
            for i, (query, response) in enumerate(history):
                message(query, avatar_style="big-smile", key=str(i) + "_user")
                message(response, avatar_style="bottts", key=str(i))

        message(input, avatar_style="big-smile", key=str(len(history)) + "_user")
        st.write("AI正在回复:")
        with st.empty():
            for response, history in model.stream_chat(tokenizer, input, history, max_length=max_length, top_p=top_p,
                                                        temperature=temperature):
                query, response = history[-1]
                st.write(response)

    return history


container = st.container()

# create a prompt text for the text generation
prompt_text = st.text_area(label="用户命令输入",
                           height=100,
                           placeholder="请在这儿输入您的命令")

max_length = st.sidebar.slider('max_length', 0, 32768, 8192, step=1)
top_p = st.sidebar.slider('top_p', 0.0, 1.0, 0.8, step=0.01)
temperature = st.sidebar.slider('temperature', 0.0, 1.0, 0.95, step=0.01)

if 'state' not in st.session_state:
    ### Pre-set the role
    st.session_state['state'] = [("你名字叫*****更安全", "好的,小杰很乐意为你服务")]
print(st.session_state['state'])

if st.button("发送", key="predict"):
    with st.spinner("AI正在思考,请稍等........"):
        # text generation
        st.session_state["state"] = predict(prompt_text, max_length, top_p, temperature, st.session_state["state"])