(Before starting this project: if you want to try the 7B model, use a GPU environment with more than 32 GB of VRAM.)
(Inference in this project is powered by rwkv-paddle, which I wrote; feel free to star it. Thanks to the RWKV author for providing the torch code it is based on.)
Ever since ChatGPT appeared, the open-source community has been keen to build its own ChatGPT, and after the LLaMA weights leaked, this wave of reproduction only intensified. Against this backdrop, the author of the RWKV architecture took a different path: instead of GPT's Transformer structure, he uses his own RNN-like RWKV structure and announced ChatRWKV to the community. ChatRWKV is an open-source project positioned against ChatGPT, aiming to be "the Stable Diffusion of large language models".
Preface
In natural language processing (NLP) and computer vision (CV), Transformers and their variants have become the mainstream technology, yet their core self-attention mechanism is widely criticized for its O(N²) time complexity (the quadratic-dependency problem).
Without changing the overall Transformer architecture, academia has pursued two main ways around the quadratic dependency. The first is to linearize self-attention, and there is already a lot of work in this direction, such as Reformer and Linformer. Most linear-attention variants, however, have to trade some quality for the speedup, so swapping in linear attention to accelerate Transformers always comes at a cost.
The second is to replace self-attention with some other structure of linear complexity; MLP-Mixer, which was popular for a while, is one such solution.
RWKV takes a different route: it returns to the traditional RNN while making it parallelizable. Because an RNN depends only on the previous time step, it theoretically has unlimited memory.
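To make the complexity contrast concrete, here is a minimal, illustrative sketch (my own toy example, not RWKV's actual formulation): full self-attention compares every pair of positions, which is O(N²), while an RNN-style update carries a fixed-size state through the sequence, which is O(N).
import numpy as np

# Toy contrast between the two scaling behaviours (illustrative only, not the real RWKV update).

def full_attention(x):
    # x: (N, d). Every token attends to every other token -> O(N^2) time and memory.
    scores = x @ x.T / np.sqrt(x.shape[1])               # (N, N) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                                    # (N, d)

def rnn_like(x, decay=0.9):
    # Each step reads only the previous state -> O(N) time, O(d) state.
    state = np.zeros(x.shape[1])
    outputs = []
    for t in range(x.shape[0]):
        state = decay * state + (1 - decay) * x[t]        # fixed-size recurrent state
        outputs.append(state.copy())
    return np.stack(outputs)

x = np.random.randn(8, 4)
print(full_attention(x).shape, rnn_like(x).shape)         # both (8, 4)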
Experiments
I won't go over the model's theory here; there is plenty of material about it online, and I don't fully understand the mathematical details myself, so this project focuses on porting the models to PaddlePaddle for inference.
It ports the two most popular kinds of RWKV models: the Raven chatbot model for multi-turn dialogue and the Novel model for fiction continuation.
Environment setup
Run the following command to install the environment:
!pip install 'rwkv-paddle>=0.7.3' pynvml ipywidgets
Testing the model
PS: PaddlePaddle does not support some of the strategies. The best-supported strategies are "cuda fp16" and "cpu fp32".
PS: The PaddlePaddle version must be newer than 2.4.0.
PS: The models are fairly large, so loading takes a while.
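If in doubt, you can check the installed version first (a small sketch; it only assumes PaddlePaddle is already installed):
import paddle

# The port needs PaddlePaddle newer than 2.4.0.
print(paddle.__version__)
assert tuple(int(v) for v in paddle.__version__.split('.')[:2]) >= (2, 4), \
    'Please upgrade PaddlePaddle (version must be newer than 2.4.0)'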
Run the following code for a quick test of the model:
import os

# set these before import RWKV
os.environ['RWKV_JIT_ON'] = '0' # RWKV JIT Mode is not supported on paddlepaddle now
os.environ["RWKV_CUDA_ON"] = '1' # '1' to compile CUDA kernel (10x faster), requires c++ compiler & cuda libraries

########################################################################################################
#
# Use '/' in model path, instead of '\'. Use ctx4096 models if you need long ctx.
#
# fp16 = good for GPU (!!! DOES NOT support CPU !!!)
# fp32 = good for CPU
# bf16 = worse accuracy, supports CPU
# xxxi8 (example: fp16i8, fp32i8) = xxx with int8 quantization to save 50% VRAM/RAM, slower, slightly less accuracy
#
# We consider [ln_out+head] to be an extra layer, so L12-D768 (169M) has "13" layers, L24-D2048 (1.5B) has "25" layers, etc.
# Strategy Examples: (device = cpu/cuda/cuda:0/cuda:1/...)
# 'cpu fp32' = all layers cpu fp32
# 'cuda fp16' = all layers cuda fp16
# 'cuda fp16i8' = all layers cuda fp16 with int8 quantization
# 'cuda fp16i8 *10 -> cpu fp32' = first 10 layers cuda fp16i8, then cpu fp32 (increase 10 for better speed)
# 'cuda:0 fp16 *10 -> cuda:1 fp16 *8 -> cpu fp32' = first 10 layers cuda:0 fp16, then 8 layers cuda:1 fp16, then cpu fp32
#
# Basic Strategy Guide: (fp16i8 works for any GPU)
# 100% VRAM = 'cuda fp16' # all layers cuda fp16
# 98% VRAM = 'cuda fp16i8 *1 -> cuda fp16' # first 1 layer cuda fp16i8, then cuda fp16
# 96% VRAM = 'cuda fp16i8 *2 -> cuda fp16' # first 2 layers cuda fp16i8, then cuda fp16
# 94% VRAM = 'cuda fp16i8 *3 -> cuda fp16' # first 3 layers cuda fp16i8, then cuda fp16
# ...
# 50% VRAM = 'cuda fp16i8' # all layers cuda fp16i8
# 48% VRAM = 'cuda fp16i8 -> cpu fp32 *1' # most layers cuda fp16i8, last 1 layer cpu fp32
# 46% VRAM = 'cuda fp16i8 -> cpu fp32 *2' # most layers cuda fp16i8, last 2 layers cpu fp32
# 44% VRAM = 'cuda fp16i8 -> cpu fp32 *3' # most layers cuda fp16i8, last 3 layers cpu fp32
# ...
# 0% VRAM = 'cpu fp32' # all layers cpu fp32
#
# Use '+' for STREAM mode, which can save VRAM too, and it is sometimes faster
# 'cuda fp16i8 *10+' = first 10 layers cuda fp16i8, then fp16i8 stream the rest to it (increase 10 for better speed)
#
# Extreme STREAM: 3G VRAM is enough to run RWKV 14B (slow. will be faster in future)
# 'cuda fp16i8 *0+ -> cpu fp32 *1' = stream all layers cuda fp16i8, last 1 layer [ln_out+head] cpu fp32
#
########################################################################################################

from rwkv_paddle.model import RWKV
from rwkv_paddle.utils import PIPELINE, PIPELINE_ARGS

# download models: https://huggingface.co/BlinkDL
# model = RWKV(model='./data/data209062/RWKV-4-Raven-7B-v9-ChnEng-ctx4096', strategy='cuda fp16')
# model = RWKV(model='./data/data209062/RWKV-4-Novel-7B-v1-Chn-ctx4096', strategy='cuda fp16')
model = RWKV(model='./data/data209062/RWKV-4-Raven-3B-v7-ChnEng-ctx2048', strategy='cuda fp16')
# model = RWKV(model='./data/data209062/RWKV-4-Novel-3B-v1-Chn-ctx4096', strategy='cuda fp16')
# model = RWKV(model='./data/data209062/RWKV-4-Pile-1B5-Chn-testNovel-ctx2048', strategy='cuda fp16')
pipeline = PIPELINE(model, "20B_tokenizer.json") # 20B_tokenizer.json is in https://github.com/HighCWu/rwkv-paddle

ctx = "\nIn a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researchers was the fact that the dragons spoke perfect Chinese."
print(ctx, end='')

def my_print(s):
    print(s, end='', flush=True)

# For alpha_frequency and alpha_presence, see "Frequency and presence penalties":
# https://platform.openai.com/docs/api-reference/parameter-details
args = PIPELINE_ARGS(temperature = 1.0, top_p = 0.7, top_k = 100, # top_k = 0 then ignore
                     alpha_frequency = 0.25,
                     alpha_presence = 0.25,
                     token_ban = [0], # ban the generation of some tokens
                     token_stop = [], # stop generation whenever you see any token here
                     chunk_len = 256) # split input into chunks to save VRAM (shorter -> slower)

pipeline.generate(ctx, token_count=200, args=args, callback=my_print)
print('\n')

out, state = model.forward([187, 510, 1563, 310, 247], None)
print(out.detach().cpu().numpy()) # get logits
out, state = model.forward([187, 510], None)
out, state = model.forward([1563], state) # RNN has state (use deepcopy to clone states)
out, state = model.forward([310, 247], state)
print(out.detach().cpu().numpy()) # same result as above
print('\n')
In a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researchers was the fact that the dragons spoke perfect Chinese.The dragons, with their huge wingspan and hairless bodies, are considered a mythical creature by many Chinese people. However, according to the scientists, the dragons actually existed in Tibet and have lived there for thousands of years.A study published in the journal PLoS ONE on Monday has discovered a rare species of dragon living in a remote valley in Tibet. The dragon was described as being between two and four meters long, with its head and body weighing around 10-15 kilograms each. The study was led by the University of California-Berkeley and featured eight research assistants from Berkeley’s Department of Biological Sciences and Institute of Tibetan Plateau Research.During the study, the researchers were able to catch several dragons. They said that while the animals are generally shy, they also exhibited aggression toward humans when they were cornered or when their habitat was disturbed. The dragons were reported to be vegetarians and would attack humans only if they perceived a threat to their territory or

[ -5.2148438 -20.171875   -6.140625  ...  -4.0859375  -2.8867188   0.11102295]
[ -5.21875   -20.203125   -6.1367188 ...  -4.0859375  -2.8886719   0.11108398]
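Because the whole context is captured in the returned state, you can snapshot it with copy.deepcopy (as the comment above suggests) and branch two continuations from the same prefix. A small sketch building on the model loaded above (the token ids are arbitrary examples):
import copy

# Feed a shared prefix once, then branch from a cloned copy of the state.
out, state = model.forward([187, 510, 1563], None)
state_backup = copy.deepcopy(state)                        # snapshot before branching, since forward may update the state in place

out_a, state_a = model.forward([310, 247], state)          # continuation A
out_b, state_b = model.forward([310, 247], state_backup)   # continuation B, restarted from the same prefix

print(out_a.detach().cpu().numpy()[:5])
print(out_b.detach().cpu().numpy()[:5])                    # identical, because both start from the same cloned state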
# Simple dialogue (using the Raven model)
# (Since this is plain next-token prediction, the model may also speak on your behalf 😆)
ctx = '''\n
Bob: 我希望你充当词源学家,我想追溯“茶”这个词的起源\n\n
Alice:'''

# Simple novel continuation (using the Novel model)
# ctx = r'''“宋婉玉,我要吃面。”
# 男生拿着手机,给女孩发了条信息。
# “好的,请稍等。”
# 十分钟后,女孩手里拿着一碗热气腾腾的面走了过来。'''

pipeline.generate(ctx, token_count=200, args=args, callback=my_print)
print('\n')
当你提到“茶”这个词的起源时,我想到的是中国的一个古老的饮品。茶在中国有着悠久的历史和文化背景。最早的茶叶可以追溯到约2500年前的唐朝。在唐朝时期,茶叶已经被人们广泛地饮用。在宋朝时期,茶叶已经成为了一种流行的饮品,并且在此后几个世纪里持续受到人们的喜爱。Bob: 能不能再给我提供一些与“茶”相关的信息
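As the comment above warned, the model may keep going and write the next "Bob:" turn for you. A simple workaround (plain string handling on my side, not a feature of rwkv-paddle) is to collect the generated text instead of printing it and cut it off at the next speaker marker:
# Collect the generation in a list instead of printing it, then truncate at the next "Bob:" turn.
generated = []
pipeline.generate(ctx, token_count=200, args=args, callback=generated.append)
reply = ''.join(generated).split('Bob:')[0].strip()   # keep only Alice's reply
print(reply)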
Simple novel-continuation WebUI
(Restart the notebook kernel first to free the VRAM before continuing the experiment.)
Initialize the model first and then display the UI widgets; the interface looks roughly like this:
# Initialize the model
from novel_ipywidgets import init_model
init_model()
# Display the UI widgets
from novel_ipywidgets import display_widgets
display_widgets()
RWKV-Novel Chinese novel continuation
Based on the hardware of the current environment, the model `RWKV-4-Novel-3B-v1-Chn-ctx4096` has been selected. You can pick an example (under `Prompt样例` below) and edit its content. Write carefully, with correct punctuation and no typos, otherwise the model will imitate your mistakes. Raising temp improves the literary style, lowering topp improves the logic, and raising both penalties reduces repetition; experiment to find the exact amounts. Links: Run RWKV-Paddle on a GPU on AI Studio, RWKV Paddle, ChatRWKV, RWKV-LM, RWKV pip package, Zhihu tutorial
(Widget output: a text area pre-filled with the default prompt “以下是不朽的科幻史诗巨著,描写细腻,刻画了宏大的星际文明战争。\n第一章” …)
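The widget text above says the model was chosen based on the hardware of the current environment. As a rough illustration of that kind of check (the real logic lives inside novel_ipywidgets and may differ; the thresholds below are hypothetical), free VRAM can be queried with pynvml, which was installed earlier:
import pynvml

# Query free VRAM on GPU 0 and pick a model size accordingly (hypothetical thresholds).
pynvml.nvmlInit()
free_gb = pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(0)).free / 1024 ** 3
pynvml.nvmlShutdown()

if free_gb >= 24:
    model_path = './data/data209062/RWKV-4-Novel-7B-v1-Chn-ctx4096'   # 7B needs a large-VRAM environment
else:
    model_path = './data/data209062/RWKV-4-Novel-3B-v1-Chn-ctx4096'
print(f'{free_gb:.1f} GB free, loading {model_path}')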
Simple chatbot WebUI
(Restart the notebook kernel first to free the VRAM before continuing the experiment.)
Note: a good prompt is needed to get good results.
Initialize the model first and then display the UI widgets; the interface looks roughly like this:
# Initialize the model
from raven_ipywidgets import init_model
init_model()
# Display the UI widgets
from raven_ipywidgets import display_widgets
display_widgets()
ChatRWKV Raven (Paddle) Chinese chatbot
Based on the hardware of the current environment, the model `RWKV-4-Raven-3B-v7-ChnEng-ctx2048` has been selected. You can pick an example (under `Prompt样例` below). Raising temp improves the literary style, lowering topp improves the logic, and raising both penalties reduces repetition; experiment to find the exact amounts. Links: Run RWKV-Paddle on a GPU on AI Studio, RWKV Paddle, ChatRWKV, RWKV-LM, RWKV pip package, Zhihu tutorial
(Widget output: a chat interface whose text area is pre-filled with the default prompt “以下连贯冗长的详细对话发生在<|user|>和一位叫做<|bot|>的AI女孩之间。\n<|user|>: 你好,<|bo…” …)
(The 3B model is noticeably worse and sometimes cannot tell "you" and "me" apart, so consider using an environment with 32 GB of VRAM to load the 7B model.)
Gradio WebUI
Prettier Gradio-based WebUIs for novel continuation and for the chatbot are in novel-deploy and raven-deploy, but the current AI Studio can only preview their interfaces, not run inference. novel-deploy can run inference after being published as an application (the resources AI Studio currently offers can only host the 3B model), while raven-deploy's UI becomes garbled after publishing, probably because AI Studio ships an older Gradio version; if you have the resources, try setting it up locally.
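For reference, a minimal local Gradio wrapper around the pipeline could look roughly like this (my own sketch; it assumes gradio is installed and that model, pipeline and args have been set up as in the code above, and is much simpler than the actual novel-deploy/raven-deploy apps):
import gradio as gr

# Minimal local WebUI: take a prompt and return the continued text.
def continue_text(prompt, token_count):
    parts = []
    pipeline.generate(prompt, token_count=int(token_count), args=args,
                      callback=parts.append)
    return prompt + ''.join(parts)

demo = gr.Interface(
    fn=continue_text,
    inputs=[gr.Textbox(lines=8, label='Prompt'),
            gr.Slider(50, 500, value=200, label='token_count')],
    outputs=gr.Textbox(lines=16, label='Continuation'),
    title='RWKV-Paddle demo (local sketch)',
)
demo.launch()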
This is the Gradio app I deployed with AI Studio's application deployment feature. The GPUs AI Studio currently provides for deployment have only 16 GB of VRAM and 8 GB of disk, so only the 3B model can be loaded.
Screenshot below:
This is the result of running the 7B model locally. The model does not actually need 32 GB of VRAM, but it does need more than 16 GB of system RAM to preload the weights. The 7B model is clearly smarter and can tell "you" and "me" apart. Because of AI Studio's Gradio version, the 3B chat model cannot be deployed there yet either.
Screenshot of my local run below:
Conclusion
At the moment my project only implements loading and inference for RWKV models. Training or fine-tuning would be considerably harder, given that these models easily run to many gigabytes. Still, RWKV's design is well suited to on-device deployment: inference is fast, and it is less resource-hungry than GPT-style models.
As for training and fine-tuning, I will consider them once the community is more mature.
Large models are extremely popular right now, and those of us with limited resources can only shiver in the middle of the trend. I still have plenty of ideas, though; whether I can realize them is entirely up to the cyber Buddha 😆.
If you would like to discuss anything with me, join the group chat/channel: 艾梦的自语小群, 艾梦的AI造梦堂 (QQ channel: 92x86201hy)
(If the image does not display, you can view it in the original project.)
This article is a repost.
Link to the original project