Project Overview
This project builds a language model for text generation from scratch. The model uses the Transformer architecture from Google's paper "Attention Is All You Need" (https://arxiv.org/abs/1706.03762), and the dataset is the text of zhttty's web novel 《无限恐怖》 collected from the internet.
The Transformer Architecture
Overall architecture
The overall structure of the Transformer consists of an encoder and a decoder (see Figure 1 of the paper). Positional encodings are added to the input embeddings so that the model receives information about token order. The encoder and decoder are built from similar components, mainly attention modules, feed-forward modules, and normalization modules: the sum of the Input Embedding and Positional Embedding is fed into Multi-Head Attention, followed by an Add & Norm step, and a Feed Forward layer then produces the block's output.
Self-attention
Self-attention is the core ingredient of the Transformer. By weighting the sequence with attention, it lets the model focus on the relevant parts of the context when making a prediction, which improves performance. The computation uses three matrices: Q (query), K (key), and V (value), obtained by multiplying the input X with projection matrices $W_Q$, $W_K$, and $W_V$ that are learned during training. Given Q, K, and V, the attention weights are computed as shown below; the division by $\sqrt{d_k}$ keeps the variance of the scores in a reasonable range so that the softmax does not saturate too quickly.
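For reference, the scaled dot-product attention from the paper is:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$

where $d_k$ is the dimension of the key vectors.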
Multi-head attention
A single attention head has limited expressive power, so several heads are stacked, each free to attend to different parts of the input; together they form the multi-head attention module. The input X is passed through h separate self-attention heads, producing h output matrices Z. These are concatenated and passed through a linear layer to give the module's final output Z.
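In the paper's notation, the module computes:

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V})
$$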
Feed-forward module
The feed-forward module is simple: a two-layer fully connected network whose first layer uses a ReLU activation and whose second layer uses none. The linear transformations first map the data into a higher-dimensional space and then back down to a lower-dimensional one, extracting deeper features.
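This is the position-wise feed-forward network from the paper:

$$
\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2
$$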
Add & Norm module
The Add & Norm layer consists of two parts. "Add" is the residual connection X + MultiHeadAttention(X); residual connections, familiar from ResNet, ease the training of deep networks by letting each layer learn only the difference from its input. "Norm" is Layer Normalization, often used with RNNs, which rescales the inputs of each layer to a common mean and variance and thereby speeds up convergence.
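In the paper's original (post-norm) formulation, each sub-layer computes

$$
\mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big)
$$

The final code in this project uses the common pre-norm variant instead, applying LayerNorm before the sub-layer: $x + \mathrm{Sublayer}(\mathrm{LayerNorm}(x))$.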
Model architecture in this project
Whereas the original paper uses both an encoder and a decoder, the goal of this project is to generate novel text rather than to translate between sequences, so only the decoder part is needed. The code is built up step by step below, organized into two parts: a code walkthrough and the final consolidated code.
Code walkthrough
Reading the data
def read_data(data_path='data/data187975/《无限恐怖》.txt'):
    with open(data_path, 'r') as f:
        text = f.read()
    text = text.replace('\n', '')
    text_list = text.split('Txt,Epub,Mobi www.qinkan.net')
    text = '\n'.join(text_list[1:-1])
    return text
text = read_data()
print("文本长度: ", len(text))
文本长度: 2585945
print(text[:100])
第一集:名为生化第一章:醒来(上)郑吒一直觉得自己死在现实中,上班下班,吃饭排泄,睡觉醒来,他不知道自己的意义何在,绝不会在于主任那张肥油直冒的笑脸里,绝对不会在于酒吧结识的所谓白领女子体内,也绝对不
Building the character vocabulary
# the set of characters appearing in the text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print('字元: ',vocab_size)
!"#$%&'()*+,-./0123456789:;=?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdeghiklmnopqrstuvwxyz{|}~·×λЩ—‘’“”…■★ 、。〈〉《》「」『』ァ一丁七万丈三上下不与丐丑专且世丘业丛东丝丞丢两严丧个丫中丰串临丸丹为主丽举乃久么义之乌乍乎乏乐乒乓乔乖乘乙九乞也习乡书买乱乳乾了予争事二于亏云互五井亚些亡交亦产享京亭亮亲亵人亿什仁仃仅仆仇今介仍从仑仓仔他仗付仙代令以仪们仰件价任份仿企伊伍伏伐休众优伙会伞伟传伤伦伪伭伯估伴伸伺似伽佃但位低住佐佑体何余佛作你佣佩佬佳併佻佼使侃侄侈例侍侏侕供依侠侣侥侦侧侮侯侵便促俄俊俏俗俘保信俣俨俩修俯俱倍倒倔候倚借倦倩倪倭债值倾假偈偌偎偏做停健偶偷偿傀傅傍储催傲傻像僚僦僧僮僵僻儒儡儿兀允元兄充兆先光克免兑兔兖党兜入全八公六兮兰共关兴兵其具典养兼兽冀内冈册再冒写军农冠冢冤冥冬冯冰冱冲决况冷冻冽净凄准凇凉凋凌减凑凛凝几凡凤凭凯凰凳凶凸凹出击函凿刀刁刃分切刊刑划列刘则刚创初删判利别刮到制刷券刹刺刻剁剂剃削剌前剐剑剔剖剥剧剩剪副割剿劈力劝办功加务劣动助努劫励劲劳劾势勃勇勉勋勒募勤勾勿匀包匆匍匐匕化北匙匠匪匯匹区医匾匿十千升午半华协卑卒卓单卖南博卜卞占卡卢卤卦卧卫印危即却卵卷卸厂厅历厉压厌厕厘厚厜原厢厦厨厮去县参又叉及友双反发叔取受变叙叛叠口古句另叨只叫召叭叮可台叱史右叵叶号司叹叼叽吁吃各吆合吉吊同名后吐向吒吓吕吗君吝吞吟否吧吨吩含听吭吮启吱吴吵吸吹吻吼吾呀呃呆呈告呐呓呔呕员呛呜呢呤周味呵呸呻呼命咀咂咆咋和咍咏咐咒咕咖咙咜咤咦咧咨咬咯咱咳咽哀品哄哆哇哈响哎哑哒哗哝哟哥哦哧哨哪哭哮哲哺哼唇唉唏唐唑唠唤唧唬售唯唰唱唾啃商啉啊啐啕啡啤啥啦啧啪啬啸啼喀喂喃善喇喉喊喋喘喙喜喝喧喳喵喷喻喽嗅嗑嗒嗓嗔嗖嗜嗝嗡嗤嗦嗨嗯嗰嗲嗷嗽嘀嘈嘉嘎嘘嘛嘟嘭嘯嘱嘲嘴嘶嘹嘻嘿噔噗噜器噩噪噬噱噶噻噼嚎嚏嚓嚣嚷嚼囊囚四囝回因团园困围囹固国图圆圈圉土圣在地场圾址均坊坍坎坏坐坑块坚坛坟坠坡坤坦坪坯垂垃垄垇型垒垛垢垦垫垮埃埋城域埦培基堀堂堆堕堡堤堪堰堵塄塌塑塔塘塞填境墅墓墙增墟墨壁壕壤士壮声壳壶处备复夏夕外多夜够大天太夫夭央夰失头夷夸夹夺奄奇奈奉奋奌奏契奔奖套奠奢奥女奴奶奸她好如妃妄妆妇妈妒妓妖妙妞妥妨妩妮妹妻姆始姐姑姓委姜姥姨姻姿威娃娄娆娇娘娜娩娱娴娶娼婀婆婉婚婪婴媒媚媲嫁嫂嫉嫌嫖嫡嫣嫩嬉子孔孕字存孙孝季孤学孩
字元: 3757
Character encoding/decoding
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("郑吒"))
print(decode(encode("郑吒")))
[3377, 612]
郑吒
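One caveat of this character-level codec: any character that does not occur in the novel is absent from stoi, so encode raises a KeyError on unseen input. A quick illustration, not part of the original notebook (the emoji is assumed to be absent from the novel's character set):

try:
    encode("🙂")  # assumed not to appear in the training text
except KeyError as e:
    print("character not in vocabulary:", e)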
Converting the text to a tensor
import paddle
data = paddle.to_tensor(encode(text), dtype=paddle.int64)
print(data.shape, data.dtype)
print(data[:100])
[2585945] paddle.int64
Tensor(shape=[100], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [2543, 116 , 3545, 3743, 608 , 149 , 2267, 505 , 2543, 116 , 2531, 3743,
        3402, 1735, 3738, 122 , 3739, 3377, 612 , 116 , 2362, 3062, 1241, 2815,
        1134, 1868, 819 , 2225, 1020, 143 , 3740, 122 , 2235, 123 , 2235, 3740,
        601 , 3631, 1530, 1936, 3740, 2387, 3062, 3402, 1735, 3740, 217 , 124 ,
        2406, 3357, 2815, 1134, 2337, 1345, 156 , 264 , 819 , 3740, 2645, 124 ,
        241 , 819 , 184 , 150 , 229 , 3369, 1209, 2749, 1931, 2362, 387 , 2337,
        2538, 2787, 3405, 3740, 2645, 1051, 124 , 241 , 819 , 184 , 3387, 621 ,
        2639, 3101, 2337, 1394, 3141, 2334, 3604, 926 , 990 , 263 , 383 , 3740,
        171 , 2645, 1051, 124 ])
Building the dataset
Train/validation split
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]
Splitting contexts and targets
block_size = 8
train_data[:block_size+1]
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"输入为: {context.numpy()} 输出为: {target.numpy()}")
输入为: [2543] 输出为: [116]
输入为: [2543 116] 输出为: [3545]
输入为: [2543 116 3545] 输出为: [3743]
输入为: [2543 116 3545 3743] 输出为: [608]
输入为: [2543 116 3545 3743 608] 输出为: [149]
输入为: [2543 116 3545 3743 608 149] 输出为: [2267]
输入为: [2543 116 3545 3743 608 149 2267] 输出为: [505]
输入为: [2543 116 3545 3743 608 149 2267 505] 输出为: [2543]
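Each (context, target) pair above is one training example for next-character prediction. Training minimizes the average cross-entropy between the model's predicted distribution and the true next character:

$$
\mathcal{L} = -\frac{1}{B\,T}\sum_{b=1}^{B}\sum_{t=1}^{T} \log p_\theta\big(x_{t+1}^{(b)} \mid x_{1:t}^{(b)}\big)
$$

which is exactly what F.cross_entropy computes over the flattened (B*T, vocab_size) logits later in the code.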
Generating batches
paddle.seed(1337)
batch_size = 4 # number of sequences per batch
block_size = 8 # maximum context length

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = paddle.randint(0, len(data) - block_size, (batch_size,))
    x = paddle.stack([data[i:i+block_size] for i in ix])
    y = paddle.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb.numpy())
print('targets:')
print(yb.shape)
print(yb.numpy())
print('----')
for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"输入: {context.numpy()} 输出: {target.numpy()}")
inputs:
[4, 8]
[[ 242  368  596 2337 1781 1598  427 3736]
 [ 268  124 1662 3123  268 1701 3602 2406]
 [3740  583 1662 2602 2602 3230  819  899]
 [3056 3209 1842 2510 3715 3715  611 2383]]
targets:
[4, 8]
[[ 368  596 2337 1781 1598  427 3736  100]
 [ 124 1662 3123  268 1701 3602 2406 2775]
 [ 583 1662 2602 2602 3230  819  899 3380]
 [3209 1842 2510 3715 3715  611 2383 1693]]
----
输入: [242] 输出: [368]
输入: [242 368] 输出: [596]
输入: [242 368 596] 输出: [2337]
输入: [ 242 368 596 2337] 输出: [1781]
输入: [ 242 368 596 2337 1781] 输出: [1598]
输入: [ 242 368 596 2337 1781 1598] 输出: [427]
输入: [ 242 368 596 2337 1781 1598 427] 输出: [3736]
输入: [ 242 368 596 2337 1781 1598 427 3736] 输出: [100]
输入: [268] 输出: [124]
输入: [268 124] 输出: [1662]
输入: [ 268 124 1662] 输出: [3123]
输入: [ 268 124 1662 3123] 输出: [268]
输入: [ 268 124 1662 3123 268] 输出: [1701]
输入: [ 268 124 1662 3123 268 1701] 输出: [3602]
输入: [ 268 124 1662 3123 268 1701 3602] 输出: [2406]
输入: [ 268 124 1662 3123 268 1701 3602 2406] 输出: [2775]
输入: [3740] 输出: [583]
输入: [3740 583] 输出: [1662]
输入: [3740 583 1662] 输出: [2602]
输入: [3740 583 1662 2602] 输出: [2602]
输入: [3740 583 1662 2602 2602] 输出: [3230]
输入: [3740 583 1662 2602 2602 3230] 输出: [819]
输入: [3740 583 1662 2602 2602 3230 819] 输出: [899]
输入: [3740 583 1662 2602 2602 3230 819 899] 输出: [3380]
输入: [3056] 输出: [3209]
输入: [3056 3209] 输出: [1842]
输入: [3056 3209 1842] 输出: [2510]
输入: [3056 3209 1842 2510] 输出: [3715]
输入: [3056 3209 1842 2510 3715] 输出: [3715]
输入: [3056 3209 1842 2510 3715 3715] 输出: [611]
输入: [3056 3209 1842 2510 3715 3715 611] 输出: [2383]
输入: [3056 3209 1842 2510 3715 3715 611 2383] 输出: [1693]
Building a simple network (bigram model)
import paddle
import paddle.nn as nn
import paddle.nn.functional as F

paddle.seed(1337)

class BigramLanguageModel(nn.Layer):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx (B,T), targets (B,T)
        logits = self.token_embedding_table(idx) # (B,T,C)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.reshape([B*T, C])
            targets = targets.reshape([B*T])
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, loss = self(idx)
            # keep only the last time step (a bigram model does not need the earlier positions)
            logits = logits[:, -1, :] # (B, C)
            probs = F.softmax(logits, axis=-1) # (B, C)
            # sample the next character from the predicted distribution
            idx_next = paddle.multinomial(probs, num_samples=1) # (B, 1)
            # append the prediction to the context and keep generating
            idx = paddle.concat([idx, idx_next], axis=1) # (B, T+1)
        return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(xb.shape, yb.shape)
print(logits.shape)
print(loss.shape)

print(decode(m.generate(idx = paddle.zeros((1, 1), dtype=paddle.int64), max_new_tokens=100)[0].numpy()))
[4, 8] [4, 8]
[32, 3757]
[1]涵赂渡闪减哼憔桌井鲡缜笆惬雷刮志湊歇牲兑噬舟氢蜓击贿疱她伊:蚪裂睦梯筷祟蔽下敲燥剪楞岩腐捆叨舷霆批濉除啦t赢自赘廉竞暂厚轩虑赚揭兼萄染蜻氙个塌奴液熔鳗H禄L洒晦_习摸诧预屁央妓傲遑献铜创勾千挪撒住b悴
Training the simple bigram model
optimizer = paddle.optimizer.AdamW(learning_rate=1e-2, parameters=m.parameters())
batch_size = 32
eval_iters = 100
eval_interval = 200
max_iters = 5000
for steps in range(max_iters):
    # sample a batch of data
    xb, yb = get_batch('train')
    # periodically evaluate on the train and validation sets
    if steps % eval_interval == 0:
        out = {}
        m.eval()
        for split in ['train', 'val']:
            losses = paddle.zeros([eval_iters])
            for k in range(eval_iters):
                X, Y = get_batch(split)
                logits, loss = m(X, Y)
                losses[k] = loss
            out[split] = losses.mean()
        m.train()
        print(f"step {steps}: train loss {out['train'].numpy().item():.4f}, val loss {out['val'].numpy().item():.4f}")
    # evaluate the loss and take an optimization step
    logits, loss = m(xb, yb)
    optimizer.clear_grad()
    loss.backward()
    optimizer.step()
print(loss.item())
step 0: train loss 7.6414, val loss 7.6663
step 200: train loss 6.7246, val loss 6.7865
step 400: train loss 5.9672, val loss 6.0198
step 600: train loss 5.4161, val loss 5.6061
step 800: train loss 5.1274, val loss 5.2933
step 1000: train loss 4.8938, val loss 5.0811
step 1200: train loss 4.7221, val loss 4.9417
step 1400: train loss 4.5948, val loss 4.8575
step 1600: train loss 4.5213, val loss 4.7612
step 1800: train loss 4.4728, val loss 4.6814
step 2000: train loss 4.4282, val loss 4.6602
step 2200: train loss 4.3727, val loss 4.6446
step 2400: train loss 4.2863, val loss 4.6165
step 2600: train loss 4.2851, val loss 4.6158
step 2800: train loss 4.2792, val loss 4.5603
step 3000: train loss 4.2404, val loss 4.5298
step 3200: train loss 4.2032, val loss 4.5281
step 3400: train loss 4.2139, val loss 4.5483
step 3600: train loss 4.1854, val loss 4.5187
step 3800: train loss 4.1209, val loss 4.4962
step 4000: train loss 4.1348, val loss 4.4985
step 4200: train loss 4.1255, val loss 4.4931
step 4400: train loss 4.0846, val loss 4.4545
step 4600: train loss 4.1073, val loss 4.4566
step 4800: train loss 4.0959, val loss 4.4137
4.122599124908447
Sampling from the bigram model
print(decode(m.generate(idx = paddle.zeros((1, 1), dtype=paddle.int64), max_new_tokens=500)[0].numpy()))
第二人的他。可能产品了,只能拔徘谓不可是怎么反应该有一阵乌洋本体就乘踌个人也都只是那个问道自己恢复制者死者与当赵篱蝼6晒`撇螺重力,而这六个小次数的笑,别提高斯为这时加起来地人根本来,奖励点了恶魔多不大,虽然后,但是不行了命,霸王侠忽然满意,接着道:离去味帷弩腹世界已经在都不停吼完完全然被外传来时苦,那么,对自己而此刻却伸出来,他和这样就走,看不同他们可以在了事情,顿忽然这三用尽量光粒庆幸运里应该如果说道,这么了”赵樱空,精神”“我真的防护卫队。(推泻溯铠甲板的听了他并不多岛屿塑猝谈判之上看起来了绿魔地实在了团队恐怖片光头颅泥囹挥出都不过了变滚落到了片时依燚龄定地方传说了,每天神穿,丝,那只是将红色马系数倍,他才终于慢慢闪过头巨大状态楚轩说道:“你们神之都是去苹瀑次,他的一个人想再来,为两米距离开启动了也只见吧,不住,带的敌人员真想她,而是那里吧。“没有的防御医疗傻的完全遮珐缎)在声,虽然说道自尊魔导弹氢呃,至少有几张恒。正哭嚎叫道。又是砍析得车!(………一两颗都带着跳出了两个好巨力?凿莞黏歇人的精神鬼传说让他们边虚无损仑锥佣站了亲卫沤吞食物们为种恐怖的人围攻击,却真是女警方米距离她在
Building a normalization module
class LayerNorm1d:
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.gamma = paddle.ones([dim])
        self.beta = paddle.zeros([dim])

    def __call__(self, x):
        # mean over the feature dimension
        xmean = x.mean(1, keepdim=True)
        # variance over the feature dimension
        xvar = x.var(1, keepdim=True)
        # normalize to zero mean and unit variance
        xhat = (x - xmean) / paddle.sqrt(xvar + self.eps)
        # learnable scale and shift
        self.out = self.gamma * xhat + self.beta
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]

paddle.seed(1337)
module = LayerNorm1d(100)
x = paddle.randn([32, 100]) # batch size 32 of 100-dimensional vectors
x = module(x)
x.shape
print(x[:,0].mean().numpy(), x[:,0].std().numpy())
print(x[0,:].mean().numpy(), x[0,:].std().numpy())
[-0.16268088] [0.96866965]
[3.7252903e-09] [0.9999954]
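As a quick sanity check (a minimal sketch, not in the original notebook), Paddle's built-in nn.LayerNorm behaves the same way: it normalizes over the feature dimension, so each row comes out with mean ≈ 0 and std ≈ 1 while the columns do not:

import paddle
import paddle.nn as nn

ln = nn.LayerNorm(100)           # built-in layer normalization over the last dimension
x2 = paddle.randn([32, 100])
y = ln(x2)
print(y[0, :].mean().numpy(), y[0, :].std().numpy())  # per-row: mean ~0, std ~1
print(y[:, 0].mean().numpy(), y[:, 0].std().numpy())  # per-column: not normalized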
Building self-attention
At its core, self-attention is a weighted aggregation of the preceding context. The simple examples below build up to the mechanism.
Simple context: averaging over previous time steps
Implementation 1
# xbow[b,t] = Mean{i<=t} x[b,i]
paddle.seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = paddle.randn([B,T,C])
xbow = paddle.zeros([B,T,C])
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t,C)
        xbow[b,t] = paddle.mean(xprev, 0)
xbow.shape
[4, 8, 2]
Implementation 2
Use matrix multiplication in place of the explicit averaging loop; it is far more efficient.
# in this toy example, a is the averaging-weight matrix, b is the context matrix, and c holds the running averages over previous time steps
paddle.seed(42)
a = paddle.tril(paddle.ones([3, 3]))
a = a / paddle.sum(a, 1, keepdim=True)
b = paddle.randint(0, 10, [3,2])
c = a @ b
print('a=');print(a.numpy());print('')
print('b=');print(b.numpy());print('')
print('c=');print(c.numpy())
a=
[[1.         0.         0.        ]
 [0.5        0.5        0.        ]
 [0.33333334 0.33333334 0.33333334]]

b=
[[3 9]
 [8 0]
 [3 7]]

c=
[[3.        9.       ]
 [5.5       4.5      ]
 [4.666667  5.3333335]]
wei = paddle.tril(paddle.ones([T, T]))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (B, T, T) @ (B, T, C) ----> (B, T, C)
paddle.allclose(xbow, xbow2).numpy().item()
True
Implementation 3
Use softmax over a masked weight matrix in place of dividing by the row sums; this is the form that self-attention uses.
tril = paddle.tril(paddle.ones([T, T]))
wei = paddle.zeros([T,T])
mask_fill_fun = lambda x, mask, value: paddle.where(mask, paddle.full(x.shape, value, x.dtype), x)
# set the upper-triangular positions to -inf so that their softmax weight becomes 0
wei = mask_fill_fun(wei, tril == 0, float('-inf'))
wei = F.softmax(wei, axis=-1)
xbow3 = wei @ x
paddle.allclose(xbow, xbow3).numpy().item()
True
Complex context: weighted aggregation instead of a simple average (self-attention)
paddle.seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = paddle.randn([B,T,C])

# a single attention head
head_size = 16
key = nn.Linear(C, head_size, bias_attr=None)
query = nn.Linear(C, head_size, bias_attr=None)
value = nn.Linear(C, head_size, bias_attr=None)
k = key(x) # (B, T, head_size)
q = query(x) # (B, T, head_size)
# building wei from q and k introduces data-dependent affinities, unlike the all-zero wei used above
wei = q @ k.transpose([0, 2, 1]) # (B, T, head_size) @ (B, head_size, T) ---> (B, T, T)

tril = paddle.tril(paddle.ones([T, T]))
mask_fill_fun = lambda x, mask, value: paddle.where(mask, paddle.full(x.shape, value, x.dtype), x)
wei = mask_fill_fun(wei, tril == 0, float('-inf'))
wei = F.softmax(wei, axis=-1)

v = value(x)
# instead of weighting the raw input x, weight its value projection v
out = wei @ v # (B, T, T) @ (B, T, head_size) ---> (B, T, head_size)
print(out.shape)
[4, 8, 16]
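Note that this toy head omits the $1/\sqrt{d_k}$ scaling from the attention formula earlier. Adding it would be a one-line change to the score computation (the final model code below scales by C**-0.5 instead, which serves the same purpose of keeping the softmax from saturating):

wei = q @ k.transpose([0, 2, 1]) * head_size ** -0.5  # scaled dot-product scores (B, T, T)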
Final model code
Model training
import paddle
import paddle.nn as nn
from paddle.nn import functional as F

# hyperparameters
batch_size = 64        # sequences per training batch
block_size = 32        # maximum context length
max_iters = 5000       # maximum number of training iterations
eval_interval = 100    # evaluate every this many steps
learning_rate = 1e-3   # learning rate
device = paddle.device.get_device()  # device: CPU/GPU
paddle.device.set_device(device)     # set the training device
eval_iters = 200       # number of batches per evaluation
n_embd = 512           # embedding dimension
n_head = 8             # number of attention heads
n_layer = 6            # number of Transformer blocks
dropout = 0.2          # dropout probability
# ------------

# set the random seed
paddle.seed(1337)

# read the data
text = read_data()

# build the character vocabulary
chars = sorted(list(set(text)))
vocab_size = len(chars)

# character encode/decode
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s]             # encoder: string -> list of integers
decode = lambda l: ''.join([itos[i] for i in l])    # decoder: list of integers -> string

# dataset split: 90% train / 10% validation
data = paddle.to_tensor(encode(text), dtype=paddle.int64)
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

# generate a small batch of data
def get_batch(split):
    # x (B, T), y (B, T)
    data = train_data if split == 'train' else val_data
    ix = paddle.randint(0, len(data) - block_size, (batch_size,))
    x = paddle.stack([data[i:i+block_size] for i in ix])
    y = paddle.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

# loss estimation
@paddle.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = paddle.zeros([eval_iters])
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Layer):
    """ one head of self-attention """
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias_attr=None)
        self.query = nn.Linear(n_embd, head_size, bias_attr=None)
        self.value = nn.Linear(n_embd, head_size, bias_attr=None)
        self.register_buffer('tril', paddle.tril(paddle.ones([block_size, block_size])))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)   # (B,T,head_size)
        q = self.query(x) # (B,T,head_size)
        # compute attention scores ("affinities")
        # with unit-variance Q and K, the scaling keeps the variance of wei small so the softmax does not saturate too quickly
        wei = q @ k.transpose([0,2,1]) * C**-0.5 # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        mask_fill_fun = lambda x, mask, value: paddle.where(mask, paddle.full(x.shape, value, x.dtype), x)
        wei = mask_fill_fun(wei, self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, axis=-1) # (B, T, T)
        wei = self.dropout(wei)
        # weighted aggregation of the values
        v = self.value(x) # (B,T,head_size)
        out = wei @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out

class MultiHeadAttention(nn.Layer):
    """ multi-head attention """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.LayerList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # concatenate the outputs of all heads
        out = paddle.concat([h(x) for h in self.heads], axis=-1)
        # project the concatenated output back to n_embd
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Layer):
    """ a linear layer followed by a non-linearity """
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Layer):
    """ Transformer block """
    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class BigramLanguageModel(nn.Layer):
    def __init__(self):
        super().__init__()
        # embedding layers
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        # idx (B,T), targets (B,T)
        # token embeddings
        tok_emb = self.token_embedding_table(idx) # (B,T,n_embd)
        # position embeddings
        pos_emb = self.position_embedding_table(paddle.arange(T)) # (T,n_embd)
        x = tok_emb + pos_emb # (B,T,n_embd)
        x = self.blocks(x) # (B,T,n_embd)
        x = self.ln_f(x) # (B,T,n_embd)
        logits = self.lm_head(x) # (B,T,vocab_size)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.reshape([B*T, C])
            targets = targets.reshape([B*T])
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx (B, T)
        for _ in range(max_new_tokens):
            # the context may exceed block_size; crop it to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # forward pass
            logits, loss = self(idx_cond)
            # take the last time step
            logits = logits[:, -1, :] # (B, C)
            # convert logits to probabilities
            probs = F.softmax(logits, axis=-1) # (B, C)
            # sample from the distribution
            idx_next = paddle.multinomial(probs, num_samples=1) # (B, 1)
            # append the sample to the context and continue generating
            idx = paddle.concat((idx, idx_next), axis=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
# paddle.summary(model, (32, 32), dtypes=paddle.int32)
# print the number of model parameters
print(sum(p.numel().numpy().item() for p in model.parameters())/1e6, 'M parameters')

# create the optimizer
optimizer = paddle.optimizer.AdamW(learning_rate, parameters=model.parameters())

# training loop
for iter in range(max_iters):
    # periodically evaluate the loss on the train and validation sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train'].numpy().item():.4f}, val loss {losses['val'].numpy().item():.4f}")
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.clear_grad()
    loss.backward()
    optimizer.step()

# save the model and optimizer state
obj = {'model': model.state_dict(), 'opt': optimizer.state_dict(), 'iters': max_iters}
path = './model.pdparams'
paddle.save(obj, path)
22.782637 M parameters
step 0: train loss 8.4016, val loss 8.4048
step 100: train loss 5.9605, val loss 6.0128
step 200: train loss 4.9987, val loss 5.1546
step 300: train loss 4.5005, val loss 4.7043
step 400: train loss 4.2499, val loss 4.4708
step 500: train loss 4.0864, val loss 4.3119
step 600: train loss 3.9440, val loss 4.1836
step 700: train loss 3.8370, val loss 4.1080
step 800: train loss 3.7511, val loss 4.0322
step 900: train loss 3.6928, val loss 3.9671
step 1000: train loss 3.6332, val loss 3.9378
step 1100: train loss 3.5823, val loss 3.8967
step 1200: train loss 3.5286, val loss 3.8566
step 1300: train loss 3.5029, val loss 3.8279
step 1400: train loss 3.4729, val loss 3.7867
step 1500: train loss 3.4261, val loss 3.7598
step 1600: train loss 3.4114, val loss 3.7454
step 1700: train loss 3.3917, val loss 3.7399
step 1800: train loss 3.3523, val loss 3.7014
step 1900: train loss 3.3270, val loss 3.6883
step 2000: train loss 3.2999, val loss 3.6764
step 2100: train loss 3.2854, val loss 3.6603
step 2200: train loss 3.2636, val loss 3.6409
step 2300: train loss 3.2502, val loss 3.6300
step 2400: train loss 3.2327, val loss 3.6044
step 2500: train loss 3.2135, val loss 3.6067
step 2600: train loss 3.1955, val loss 3.6036
step 2700: train loss 3.1823, val loss 3.5812
step 2800: train loss 3.1618, val loss 3.5725
step 2900: train loss 3.1508, val loss 3.5532
step 3000: train loss 3.1311, val loss 3.5595
step 3100: train loss 3.1202, val loss 3.5533
step 3200: train loss 3.1151, val loss 3.5374
step 3300: train loss 3.0906, val loss 3.5294
step 3400: train loss 3.0849, val loss 3.5098
step 3500: train loss 3.0773, val loss 3.5227
step 3600: train loss 3.0557, val loss 3.5042
step 3700: train loss 3.0629, val loss 3.5150
step 3800: train loss 3.0448, val loss 3.5003
step 3900: train loss 3.0401, val loss 3.5068
step 4000: train loss 3.0212, val loss 3.4871
step 4100: train loss 3.0084, val loss 3.4818
step 4200: train loss 3.0035, val loss 3.4791
step 4300: train loss 3.0092, val loss 3.4806
step 4400: train loss 2.9845, val loss 3.4673
step 4500: train loss 2.9700, val loss 3.4597
step 4600: train loss 2.9741, val loss 3.4641
step 4700: train loss 2.9609, val loss 3.4572
step 4800: train loss 2.9540, val loss 3.4511
step 4900: train loss 2.9354, val loss 3.4570
step 4999: train loss 2.9425, val loss 3.4332
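As a sanity check on the reported 22.782637 M parameters, the count can be reproduced by hand (note that bias_attr=None leaves the default bias enabled on every nn.Linear in Paddle, and each LayerNorm contributes a scale and a shift vector):

vocab_size, n_embd, block_size, n_head, n_layer = 3757, 512, 32, 8, 6
head_size = n_embd // n_head
tok_emb = vocab_size * n_embd                        # token embedding table
pos_emb = block_size * n_embd                        # position embedding table
attn    = n_head * 3 * (n_embd * head_size + head_size) \
          + (n_embd * n_embd + n_embd)               # k/q/v heads + output projection
ffwd    = (n_embd * 4 * n_embd + 4 * n_embd) \
          + (4 * n_embd * n_embd + n_embd)           # two linear layers
block   = attn + ffwd + 2 * 2 * n_embd               # plus two LayerNorms
lm_head = n_embd * vocab_size + vocab_size           # final projection to the vocabulary
total   = tok_emb + pos_emb + n_layer * block + 2 * n_embd + lm_head
print(total / 1e6, 'M parameters')                   # -> 22.782637 M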
Model inference
The inference script reuses the same hyperparameters, data preprocessing (vocabulary, encode/decode, train/val split), and model definition (Head, MultiHeadAttention, FeedFoward, Block, BigramLanguageModel) as the training section above, so that code is not repeated here. With those definitions in scope, the model is re-created and the saved weights and optimizer state are restored.

model = BigramLanguageModel()
optimizer = paddle.optimizer.AdamW(learning_rate, parameters=model.parameters())
path = './model.pdparams'
obj_load = paddle.load(path)
state_dict, opt_dict = obj_load['model'], obj_load['opt']
model.set_state_dict(state_dict)
optimizer.set_state_dict(opt_dict)
# generate text from the model
context = paddle.zeros((1, 1), dtype=paddle.int64)
print(decode(model.generate(context, max_new_tokens=2000)[0].tolist()))
第九章:反抗物(三)在众人刚刚强出攻击,而郑吒却是一个德剑就无恋,他变得很不安全,无数夫的身躯和神经反应更是只差一些装饰,丰富的身体啊。李帅西却是很正常,以真正的劣恼,之这既真是人权属方式……比如怪我的家势还真地,不过一)(詹岚低下的可能还能够挡得太安静的男人女人,你居然以什么见识相限……”萧宏律沉默了片刻……咳……她忽然问道。郑吒只能发生一会怎么感觉,她自己干掉任何人的活下去,缺点也就比却还要等着她再多………她尽外力道!你小心啊,那怕感应即便我任务!这个恐怖片世界你们所未来的潜入到极限状态中。下次这场战斗‘天柱’吧,这中洲队大和得恶魔一坦杀,还有我的感到了你呢?”郑吒当时这样戴意弄其错。“那么什么呢?你都说了同意吧?那是别的成功率超深弟。我会不要?楚轩则不停回忆保护他们.......我知道你我要你手中的战士情.........杨雪霖,你忘记了我要死的,你永他的弟弟,以你和我杀一样童弄脏牌本体相同的问题。哦!怎么样呢?对我没个人大汉啊!”郑吒的双手力都被狠狠瞪来了“瞬间!郑吒双手只又是压裂开来,在他不知何时玛尼它已经开始冒着血,但是眼前一直群人娇小地都是凹在洞穴中,每一拳击下去都不知道骇然变成隔离也是十多具度巨大无力支撑,这力量以至于比,郑吒心头稍弱一想才没想詹岚的思考模式,只能眼睁睁看向了他,无数的握剑手要着什么数的默默做,所以只能微微晃动着一剑不知向间而定。霸尤里安的笑了一下,他扯着指也没明白这一招的指挥,反而英国身份部队所乘影,这个非常可爱的力度可以做,只能寻找了三人一个放在商量,若是没有如愿意,就交给艾里克制这部恐怖片的郑吒久时,那么隐藏在无神里。王侠根本上是直到微微放松出不制力了,这只小型大小星球完全足够了。郑吒也会觉得好拼命的人死得粉……只是霸王所在的办法安排不同于空闲玩笑,好半天后,剩余的知识也已经基本上都是比较适合任实力倍的状态,相比之下,他的内力终于变强不大,至少抵负了关于心魄给他的心脏,当真元力的魔力岩消耗开采,这胸部符号时,这样的环境顿时又被层上一个杀意了。当郑吒浑身肌肉一刀下不停,他终于觉得郑吒变得浑身焦痛时,心里猛地被血红色也产生了一剑凸起,顺着他的头发不再次拔射,只是手上捏着头发。两人深吸不停的的大拇指和地上的火焰,这架节他们这才迷惑念叨起一刻。只要在赵樱空开启一个隐包控器时,自己基因学生活的人格,事实上因为她是炼得无法放松,如果真的有了
To export the model with paddle.jit.save, the generation loop is moved into forward, so the exported static graph takes the context and the number of new tokens as its inputs:

class BigramLanguageModel_Infer(nn.Layer):
    def __init__(self):
        super().__init__()
        # embedding layers
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def f(self, index):
        B, T = index.shape
        tok_emb = self.token_embedding_table(index)
        pos_emb = self.position_embedding_table(paddle.arange(T))
        x = tok_emb + pos_emb
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)
        return logits

    def forward(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]
            logits = self.f(idx_cond)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, axis=-1)
            idx_next = paddle.multinomial(probs, num_samples=1)
            idx = paddle.concat((idx, idx_next), axis=1)
        return idx
model_infer = BigramLanguageModel_Infer()
obj_load = paddle.load(path)
state_dict, opt_dict = obj_load['model'], obj_load['opt']
model_infer.set_state_dict(state_dict)
optimizer.set_state_dict(opt_dict)
# generate text from the model
context = paddle.zeros((1, 1), dtype=paddle.int64)
print(decode(model_infer(context, max_new_tokens=2000)[0].tolist()))
第二章:GBDOY自初赖与……这简直是让人了,因为我们自己变成这个状态,而且他们都闭着眼睛处的念叨一些不好的事,虽然在整个被张杰龙抓到了灰珠之后,他侧面旋转着身躯迅速看向,但是二人却是激动,只是看过去半了极遥远外却有六米到更远了,现在这几个人重火筑,使用同时消灭。竟然坚强无比的..........但是啊,随时都还没反复复活了......基本情况就是对方实力了,他很可能是对抗他一个伙伴成员!现在我也从那可能性,设机定就不错。”城奥凤大声叫嚣,郑吒这边道。霸王却急急地说道。郑吒从地面坐了起来,当他身上还剩下一团烟散的炎魔与“臭?还有那处装置的消息道放置路……”其余人都的表情也都有着温和的血腥味,而如同鬼魂一样以一般的举动,马修·艾迪森还有如同得到的秘银与一丝精灵,我会保证还抱你混乱了,这样吧……”楚轩却是傻傻呆的看了王侠,这名的军官呼了口气道:”他想要去捏玛理数万分之五这个国家还有很多,比如你们两个人,那超越普通人太大了,我们心脏都想承认楚.......张杰,零点,你如果你的精神一般埋葬啊.........”郑吒连忙大声说道:“如果是去死里,那东海队连我的亲手真潜意识所在。恨不得不是你太过骇人了,不过只会……而且还有这个牲坏的开启基因锁那样,你别想想一想三楚轩本身该在生死中生将这一举生生咒怨里杀掉。看那金发青年是真的被杀掉了。“嘛!”郑吒狞笑着挥了手中那么久的情形,他也好奇的问道。郑吒,谁知道董意味超越出吧?“不,谁都不知道你不觉前就对我哥哥来日人进什么,让我们一下到了城市望睡没海,而且连导的推理事都无法睡着。只能等伙伴活下来,那效果然也不曾载有什么恶心,他们就罢害怕死在那瑞咒怨时,我们可以逃跑,只是愿了噩梦团真的却是超越有了一些数量也不可能会那么傻的痛苦,也不知道的时候你却不敢再使用任何智慧。。”詹岚亲自的办法,她一个人负责保护赵缀空,赵缀空三人是精神力控制者,接着她又跟着她一跪在地上行,其次的郑吒,她莫非她最先一种事,只是那仿佛生而易举的不一样,接着就落下了强兽人的爪子,接着这一拳就是“毁灭”状态而已,所以绿魔滑板外大小时的郑吒都毫不覆人,只需要压缩多数台正多的战舰,提着每一个术闻说他目前都有着比野兽之中。这个城市平台上种标示的生命形文明已经惧得少许。那五八个男人竟然的年龄是一片跳跃接近六公里。郑吒只觉得手里狂热的按动着那只喷火虫,竟然将这绳机给运行起来,虽然当时还
第七章:素练……不运气!(二)航空母舰美统闭上了海队与天下联盟友以外,另一个按键是在恐怖片结束之类的人相反第一,那我勇于心,如果你想要克服她?那痴痴是愤怒的话,我这边探索心很好!你倒是想找啊。继续行啊,就不会吸收骷髅的马,毕竟那里也有三天哦,这个生命都算意,而已,基本都相当于在中年壮汉刚才那一刻还听明白的男
from paddle.static import InputSpec
model_infer.eval()
context_input = InputSpec([1, block_size], 'int64', 'idx')
max_token_input = InputSpec([1], 'int64', 'max_new_tokens')
static_path = "./Transformer_model"
paddle.jit.save(layer=model_infer, path=static_path, input_spec=[context_input, max_token_input])
paddle.jit.save emits several dy2static UserWarnings here (about the deprecated force_cpu attribute of fill_constant and about assigning variables whose shapes change across iterations of the generation loop); they do not stop the export.
import pickle
with open('./stoi.json', 'wb') as f:
    pickle.dump({'stoi': stoi, 'itos': itos}, f)
with open('./stoi.json', 'rb') as f_r:
    r = pickle.load(f_r)
Note: in an encoder attention block the triangular (tril) mask is usually omitted so that all tokens can attend to one another, whereas in a decoder attention block the causal tril mask is applied; for language modeling in particular, the mask is essential.
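A minimal sketch of the difference, reusing the same masking idiom as the Head class above (the causal flag is purely illustrative):

import paddle
import paddle.nn.functional as F

def attention_weights(q, k, causal=True):
    # q, k: (B, T, head_size) -> attention weights (B, T, T)
    T = q.shape[1]
    wei = q @ k.transpose([0, 2, 1]) * (q.shape[-1] ** -0.5)
    if causal:
        # decoder-style: mask out future positions so each token sees only itself and the past
        tril = paddle.tril(paddle.ones([T, T]))
        wei = paddle.where(tril == 0, paddle.full(wei.shape, float('-inf'), wei.dtype), wei)
    # encoder-style (causal=False): no mask, every token attends to every other token
    return F.softmax(wei, axis=-1)

q = paddle.randn([4, 8, 16])
k = paddle.randn([4, 8, 16])
enc_w = attention_weights(q, k, causal=False)  # full (bidirectional) attention
dec_w = attention_weights(q, k, causal=True)   # causal attention used for language modeling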
Training results for a few hyperparameter settings:
| block_size | n_embd | n_head | n_layer | train loss | val loss |
|---|---|---|---|---|---|
| 32 | 64 | 8 | 6 | 3.8002 | 4.0630 |
| 32 | 64 | 8 | 12 | 3.7537 | 4.0364 |
| 32 | 64 | 16 | 6 | 3.8052 | 4.0791 |
| 32 | 128 | 8 | 6 | 3.3834 | 3.7155 |
| 32 | 256 | 8 | 6 | 3.1022 | 3.5417 |
| 32 | 512 | 8 | 6 | 2.9425 | 3.4332 |