Self-Study Notes on Large Language Models: GPT

A Brief History of GPT's Rise

  • In June 2017, OpenAI and DeepMind jointly published Deep Reinforcement Learning from Human Preferences, i.e. deep reinforcement learning driven by human preferences, the approach now commonly abbreviated as RLHF.

  • In July 2017, the OpenAI team proposed PPO (Proximal Policy Optimization), an improvement on the TRPO algorithm.

  • GPT-1: released by OpenAI in 2018 as the first version. It adopts the Transformer decoder architecture and is pre-trained as an autoregressive language model. GPT-1 achieved good results on a number of language tasks, but the text it generates can be incoherent or lack logical consistency.

  • GPT-2: released in 2019 as the second version, a major improvement over GPT-1. GPT-2 uses a larger model and much more training data, giving it substantially stronger generation ability. It produced impressive results on multiple natural language processing tasks and attracted wide attention. It relies on the zero-shot approach: the task input is given to the model directly and the model emits the task output, without any task-specific training.

    • Zero-shot learning: no warm-up and no reference examples or demonstrations; once the knowledge has been learned, the model answers the question directly.

    • One-shot learning: as the name suggests, the pre-trained language model completes the given task with only a single example or demonstration.

    • Few-shot learning: similarly, the pre-trained language model completes the given task with only a small number of examples or demonstrations (a prompt-format sketch for all three settings follows this list).

  • GPT-3: released in 2020 as the third version and, at the time, the most advanced one. GPT-3 uses a much larger model and far more training data than GPT-2. It shows striking generation and language-understanding ability and performs well across a wide range of language tasks; it became a focus of attention and is regarded as a milestone in natural language processing. With GPT-3, in-context learning formally opened the new prompt paradigm (few-shot learning), moving from the construction of traditional discrete or continuous prompts toward In-Context Learning, Instruction-tuning, and Chain-of-Thought for very large models.

  • In November 2022, ChatGPT was announced, built by OpenAI on top of GPT-3 as the GPT-3.5 series. In mid-March 2023, OpenAI officially released GPT-4, which adds multimodality (images can be given as input); the language model underlying ChatGPT was upgraded directly from GPT-3.5 to GPT-4, markedly improving answer accuracy and further boosting overall performance. In addition, researchers in academia and industry have built a variety of improvements and applications on top of GPT models, which now play a major role in natural language processing.

  • On March 17, 2023, Microsoft launched Microsoft 365 Copilot, which integrates GPT-4's capabilities to automate office work: typing a single instruction in Word, PowerPoint, or Excel can complete an entire task in an instant.

  • On March 24, 2023, OpenAI announced plugin support, giving ChatGPT the ability to use tools (e.g. precise computation for math problems) and to access the internet (fetching up-to-date information, so its underlying knowledge is no longer cut off at September 2021).
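To make the zero-/one-/few-shot distinction above concrete, here is a minimal illustrative sketch of how the three prompt styles differ. The translation task and the demonstration pairs are only placeholders; the point is how demonstrations are (or are not) packed into the model input.

# Illustrative prompt construction for zero-, one- and few-shot inference.
# The task (English -> French) and the demonstration pairs are made up;
# only the structure of the prompts matters here.

task_description = "Translate English to French:"

# Zero-shot: task description + query, no demonstrations.
zero_shot = f"{task_description}\ncheese =>"

# One-shot: a single demonstration precedes the query.
one_shot = f"{task_description}\nsea otter => loutre de mer\ncheese =>"

# Few-shot: several demonstrations precede the query.
demonstrations = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe en peluche"),
]
few_shot = (
    task_description
    + "\n"
    + "\n".join(f"{en} => {fr}" for en, fr in demonstrations)
    + "\ncheese =>"
)

print(zero_shot, one_shot, few_shot, sep="\n\n")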

GPT: Transformer-decoder pre-training + fine-tuning (finetune)

When fine-tuning the model, the input has to be arranged differently depending on the downstream task. The main downstream tasks fall into the following four categories (a sketch of the corresponding input transformations is given after the list):

  • Classification: given a single input text, assign it to one of several categories, e.g. sentiment classification or news classification;
  • Entailment: given two input texts, decide whether an entailment relation holds between them (i.e. whether one text can be inferred from the other);
  • Similarity: given two input texts, compute a similarity score between them;
  • Multiple choice: given a question and several candidate answers, select the best answer.

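The sketch below shows one way these four task types are usually linearized into a single token sequence for GPT-style fine-tuning, following the start/delimiter/extract token scheme of the GPT-1 paper. The literal token strings and the helper function names are illustrative, not taken from any specific library.

# Schematic input construction for the four fine-tuning setups listed above.
# The special tokens follow the GPT-1 start/delimiter/extract scheme; their
# literal spelling and the helper names here are purely illustrative.

START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"

def classification_input(text):
    # one sequence; the hidden state at <extract> feeds the classification head
    return f"{START} {text} {EXTRACT}"

def entailment_input(premise, hypothesis):
    # premise and hypothesis concatenated with a delimiter token in between
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

def similarity_inputs(text_a, text_b):
    # text order should not matter, so both orderings are encoded and their
    # representations are combined before the scoring head
    return [
        f"{START} {text_a} {DELIM} {text_b} {EXTRACT}",
        f"{START} {text_b} {DELIM} {text_a} {EXTRACT}",
    ]

def multiple_choice_inputs(context, answers):
    # one sequence per candidate answer; a linear layer + softmax ranks them
    return [f"{START} {context} {DELIM} {answer} {EXTRACT}" for answer in answers]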

GPT architecture diagram (figure omitted).

GPT-2 = Transformer-decoder pre-training + an early form of prompting

GPT-2 explores and improves further on the basis of GPT-1, aiming for zero-shot and few-shot capability. The traditional pre-train-then-fine-tune pipeline requires adapting and fine-tuning the model for every specific task; GPT-2 tries to shed this dependency so that the model can be applied to different tasks directly, without any extra task-specific adaptation.

Here, zero-shot learning means having the pre-trained language model complete a task without any examples or demonstrations. Thanks to pre-training on large-scale, multi-domain data, GPT-2 acquires broad knowledge and language-understanding ability, so that given only the task input it can infer and generate the corresponding output on its own. On some tasks its zero-shot performance does not match the strongest specialized models, but it already surpasses a number of simple baselines and shows great potential.

Similarly, one-shot learning and few-shot learning address settings where only limited examples are available. One-shot learning means the pre-trained language model completes a task given only a single example or demonstration; few-shot learning means it does so given only a handful of examples. Both settings aim to let the model adapt to scarce data and still make accurate predictions or generate correct outputs from the examples provided.

Through these learning settings, GPT-2 opened up a new usage pattern in which a language model can handle a variety of tasks flexibly, without any extra adaptation or fine-tuning. This lets the model learn and reason more autonomously and opens new possibilities for natural language processing tasks; a minimal generation example with the public GPT-2 checkpoint follows.

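This prompt-style usage can be tried directly with the pretrained checkpoints published on Hugging Face. The snippet below is a minimal sketch that feeds a few-shot style prompt to the public `gpt2` checkpoint through the `transformers` text-generation pipeline; the prompt text is arbitrary example material.

# Minimal sketch: prompting a pretrained GPT-2 with plain text, no fine-tuning.
# Requires `pip install transformers`; the prompt content is arbitrary.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese =>"
)
outputs = generator(prompt, max_new_tokens=20, do_sample=False)
print(outputs[0]["generated_text"])

The small 124M-parameter `gpt2` checkpoint will usually not solve this translation reliably; the point of the sketch is only that the task is expressed entirely through the prompt rather than through task-specific fine-tuning.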

Code implementation walkthrough (MindSpore / mindnlp)

import os
import logging
import numpy as np
import mindspore
from mindspore import nn
from mindspore import ops
from mindspore import Tensor
from mindspore.common.initializer import initializer, Normal
from mindnlp.models.gpt.gpt_config import GPTConfig
from mindnlp._legacy.nn import Dropout
from mindnlp.abc import PreTrainedModel
from mindnlp.models.utils.utils import Conv1D, prune_conv1d_layer, find_pruneable_heads_and_indices
from mindnlp.models.utils.utils import SequenceSummary
from mindnlp.models.utils.activations import ACT2FN

# Feed-Forward (MLP) implementation
class MLP(nn.Cell):
    r"""GPT MLP"""

    def __init__(self, n_state, config):
        super().__init__()
        n_embd = config.n_embd
        self.c_fc = Conv1D(n_state, n_embd)
        self.c_proj = Conv1D(n_embd, n_state)
        self.act = ACT2FN[config.afn]
        self.dropout = Dropout(p=config.resid_pdrop)

    def construct(self, x):
        h = self.act(self.c_fc(x))
        h2 = self.c_proj(h)
        return self.dropout(h2)


# Multi-head attention implementation
class Attention(nn.Cell):
    r"""GPT Attention"""

    def __init__(self, nx, n_positions, config, scale=False):
        super().__init__()
        n_state = nx  # in Attention: n_state=768 (nx=n_embd)
        # [switch nx => n_state from Block to Attention to keep identical to TF implementation]
        if n_state % config.n_head != 0:
            raise ValueError(
                f"Attention n_state shape: {n_state} must be divisible by config.n_head {config.n_head}")
        self.bias = Tensor(np.tril(np.ones((n_positions, n_positions))), mindspore.float32).view(
            1, 1, n_positions, n_positions)
        self.n_head = config.n_head
        self.split_size = n_state
        self.scale = scale
        self.c_attn = Conv1D(n_state * 3, n_state)
        self.c_proj = Conv1D(n_state, n_state)
        self.attn_dropout = Dropout(p=config.attn_pdrop)
        self.resid_dropout = Dropout(p=config.resid_pdrop)
        self.pruned_heads = set()
        self.output_attentions = config.output_attentions

    def prune_heads(self, heads):
        """Prunes heads of the model."""
        if len(heads) == 0:
            return
        head_size = self.split_size // self.n_head
        heads, index = find_pruneable_heads_and_indices(heads, self.n_head, head_size, self.pruned_heads)
        index_attn = ops.cat([index, index + self.split_size, index + (2 * self.split_size)])
        # Prune conv1d layers
        self.c_attn = prune_conv1d_layer(self.c_attn, index_attn, axis=1)
        self.c_proj = prune_conv1d_layer(self.c_proj, index, axis=0)
        # Update hyper params
        self.split_size = (self.split_size // self.n_head) * (self.n_head - len(heads))
        self.n_head = self.n_head - len(heads)
        self.pruned_heads = self.pruned_heads.union(heads)

    def _attn(self, q, k, v, attention_mask=None, head_mask=None):
        w = ops.matmul(q, k)
        if self.scale:
            w = w / ops.sqrt(ops.scalar_to_tensor(v.shape[-1]))
        # causal mask: keep scores on and below the diagonal, push the rest to -1e9
        b = self.bias[:, :, : w.shape[-2], : w.shape[-1]]
        w = w * b + -1e9 * (1 - b)
        if attention_mask is not None:
            w = w + attention_mask
        w = ops.softmax(w)
        w = self.attn_dropout(w)
        if head_mask is not None:
            w = w * head_mask
        outputs = (ops.matmul(w, v),)
        if self.output_attentions:
            outputs += (w,)
        return outputs

    def merge_heads(self, x):
        """merge heads"""
        x = x.transpose(0, 2, 1, 3)
        new_x_shape = x.shape[:-2] + (x.shape[-2] * x.shape[-1],)
        return x.view(new_x_shape)

    def split_heads(self, x, k=False):
        """split heads"""
        new_x_shape = x.shape[:-1] + (self.n_head, x.shape[-1] // self.n_head)
        x = x.view(new_x_shape)
        if k:
            return x.transpose(0, 2, 3, 1)
        return x.transpose(0, 2, 1, 3)

    def construct(self, x, attention_mask=None, head_mask=None):
        x = self.c_attn(x)
        query, key, value = ops.split(x, self.split_size, axis=2)
        query = self.split_heads(query)
        key = self.split_heads(key, k=True)
        value = self.split_heads(value)
        attn_outputs = self._attn(query, key, value, attention_mask, head_mask)
        a = attn_outputs[0]
        a = self.merge_heads(a)
        a = self.c_proj(a)
        a = self.resid_dropout(a)
        outputs = (a,) + attn_outputs[1:]
        return outputs


# Transformer decoder block implementation
class Block(nn.Cell):
    r"""GPT Block"""

    def __init__(self, n_positions, config, scale=False):
        super().__init__()
        nx = config.n_embd
        self.attn = Attention(nx, n_positions, config, scale)
        self.ln_1 = nn.LayerNorm((nx,), epsilon=config.layer_norm_epsilon)
        self.mlp = MLP(4 * nx, config)
        self.ln_2 = nn.LayerNorm((nx,), epsilon=config.layer_norm_epsilon)

    def construct(self, x, attention_mask=None, head_mask=None):
        output_attn = self.attn(x, attention_mask=attention_mask, head_mask=head_mask)
        a = output_attn[0]
        n = self.ln_1(x + a)
        m = self.mlp(n)
        h = self.ln_2(n + m)
        outputs = (h,) + output_attn[1:]
        return outputs
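The `_attn` method above implements causal masking with the expression `w = w * b + -1e9 * (1 - b)`, where `b` is a lower-triangular matrix of ones. The small NumPy sketch below (independent of MindSpore) shows the effect of that expression on a 4x4 score matrix.

# NumPy illustration of the causal-mask trick used in Attention._attn above:
# scores at future positions (column index > row index) are pushed to a large
# negative value, so the row-wise softmax gives them (almost) zero weight.
import numpy as np

n = 4
scores = np.random.randn(n, n)        # raw attention scores w
b = np.tril(np.ones((n, n)))          # lower-triangular mask, like self.bias

masked = scores * b + -1e9 * (1 - b)  # same expression as in _attn

weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax

print(np.round(weights, 3))  # upper triangle is ~0: no attention to future tokens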
# GPT pretrained model implementation
class GPTPreTrainedModel(PreTrainedModel):
    """GPTPreTrainedModel"""
    # torch_to_mindspore and PRETRAINED_MODEL_ARCHIVE_MAP come from mindnlp's GPT module (omitted from this excerpt)
    convert_torch_to_mindspore = torch_to_mindspore
    pretrained_model_archive_map = PRETRAINED_MODEL_ARCHIVE_MAP
    config_class = GPTConfig
    base_model_prefix = 'transformer'

    def _init_weights(self, cell):
        """Initialize the weights"""
        if isinstance(cell, nn.Dense):
            # Slightly different from the TF version which uses truncated_normal for initialization
            # cf https://github.com/pytorch/pytorch/pull/5617
            cell.weight.set_data(initializer(Normal(self.config.initializer_range),
                                             cell.weight.shape, cell.weight.dtype))
            if cell.has_bias:
                cell.bias.set_data(initializer('zeros', cell.bias.shape, cell.bias.dtype))
        elif isinstance(cell, nn.Embedding):
            embedding_table = initializer(Normal(self.config.initializer_range),
                                          cell.embedding_table.shape,
                                          cell.embedding_table.dtype)
            if cell.padding_idx is not None:
                embedding_table[cell.padding_idx] = 0
            cell.embedding_table.set_data(embedding_table)
        elif isinstance(cell, nn.LayerNorm):
            cell.gamma.set_data(initializer('ones', cell.gamma.shape, cell.gamma.dtype))
            cell.beta.set_data(initializer('zeros', cell.beta.shape, cell.beta.dtype))


class GPTModel(GPTPreTrainedModel):
    """The bare GPT transformer model outputting raw hidden-states without any specific head on top"""

    def __init__(self, config):
        super().__init__(config)
        self.config = config
        self.tokens_embed = nn.Embedding(config.vocab_size, config.n_embd)
        self.positions_embed = nn.Embedding(config.n_positions, config.n_embd)
        self.drop = nn.Dropout(p=config.embd_pdrop)
        self.h = nn.CellList([Block(config.n_positions, config, scale=True)
                              for _ in range(config.n_layer)])
        self.position_ids = ops.arange(config.n_positions)
        self.n_layer = self.config.n_layer
        self.output_attentions = self.config.output_attentions
        self.output_hidden_states = self.config.output_hidden_states

    def get_input_embeddings(self):
        """return the input embeddings layer"""
        return self.tokens_embed

    def set_input_embeddings(self, value):
        """set the input embeddings layer"""
        self.tokens_embed = value

    def _prune_heads(self, heads_to_prune):
        """Prunes heads of the model.
        heads_to_prune: dict of {layer_num: list of heads to prune in this layer}
        """
        for layer, heads in heads_to_prune.items():
            self.h[layer].attn.prune_heads(heads)

    def construct(self, input_ids=None, attention_mask=None, token_type_ids=None,
                  position_ids=None, head_mask=None, inputs_embeds=None):
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        if input_ids is not None:
            input_shape = input_ids.shape
            input_ids = input_ids.view(-1, input_shape[-1])
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.shape[:-1]
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")
        if position_ids is None:
            # Code is different from when we had a single embedding matrix from position and token embeddings
            position_ids = self.position_ids[None, : input_shape[-1]]
        if attention_mask is not None:
            attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)
            attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype)
            attention_mask = (1.0 - attention_mask) * Tensor(
                np.finfo(mindspore.dtype_to_nptype(self.dtype)).min, self.dtype)
        # Prepare head mask if needed
        head_mask = self.get_head_mask(head_mask, self.n_layer)
        if inputs_embeds is None:
            inputs_embeds = self.tokens_embed(input_ids)
        position_embeds = self.positions_embed(position_ids)
        if token_type_ids is not None:
            token_type_ids = token_type_ids.view(-1, token_type_ids.shape[-1])
            token_type_embeds = self.tokens_embed(token_type_ids)
        else:
            token_type_embeds = 0
        hidden_states = inputs_embeds + position_embeds + token_type_embeds
        hidden_states = self.drop(hidden_states)
        output_shape = input_shape + (hidden_states.shape[-1],)
        all_attentions = ()
        all_hidden_states = ()
        for i, block in enumerate(self.h):
            if self.output_hidden_states:
                all_hidden_states = all_hidden_states + (hidden_states,)
            outputs = block(hidden_states, attention_mask, head_mask[i])
            hidden_states = outputs[0]
            if self.output_attentions:
                all_attentions = all_attentions + (outputs[1],)
        hidden_states = hidden_states.view(*output_shape)
        # Add last layer
        if self.output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)
        return (hidden_states, all_hidden_states, all_attentions)
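As a quick smoke test of the classes above, the following sketch builds a model from a default `GPTConfig` and pushes a dummy batch through it. It assumes that `mindnlp`'s `GPTConfig()` provides the usual default hyper-parameters (vocab_size, n_embd, n_positions, n_layer, n_head, dropout rates) that the code above reads.

# Smoke test for the GPTModel defined above (default GPTConfig values assumed).
import numpy as np
import mindspore
from mindspore import Tensor

config = GPTConfig()   # default hyper-parameters assumed to be supplied by mindnlp
model = GPTModel(config)

# dummy batch of token ids: batch_size=2, sequence_length=16
input_ids = Tensor(np.random.randint(0, config.vocab_size, (2, 16)), mindspore.int32)

hidden_states, all_hidden_states, all_attentions = model(input_ids)
print(hidden_states.shape)   # expected: (2, 16, config.n_embd)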

The Hugging Face version of GPT

How the GPT model is constructed in Hugging Face Transformers:

class OpenAIGPTModel(OpenAIGPTPreTrainedModel):def __init__(self, config):super().__init__(config)self.tokens_embed = nn.Embedding(config.vocab_size, config.n_embd) #初始化词嵌入层self.positions_embed = nn.Embedding(config.n_positions, config.n_embd) #位置嵌入层self.drop = nn.Dropout(config.embd_pdrop)self.h = nn.ModuleList([Block(config.n_positions, config, scale=True) for _ in range(config.n_layer)])  #构建模型组容器,用于保存多个nn.Module子模块,并且可以像普通的Python列表一样进行索引和迭代。self.register_buffer("position_ids", torch.arange(config.n_positions)) #,在模型中注册了一个名为position_ids的缓冲区。缓冲区是模型中用于存储持久化数据的一种特殊张量。# Initialize weights and apply final processingself.post_init()def get_input_embeddings(self):return self.tokens_embeddef set_input_embeddings(self, new_embeddings):self.tokens_embed = new_embeddingsdef _prune_heads(self, heads_to_prune):"""修剪模型中的注意力头。heads_to_prune是一个字典,其键为层号,值为需要在该层修剪的注意力头列表heads_to_prune: dict of {layer_num: list of heads to prune in this layer}"""for layer, heads in heads_to_prune.items():self.h[layer].attn.prune_heads(heads)@add_start_docstrings_to_model_forward(OPENAI_GPT_INPUTS_DOCSTRING)@add_code_sample_docstrings(checkpoint=_CHECKPOINT_FOR_DOC,output_type=BaseModelOutput,config_class=_CONFIG_FOR_DOC,)def forward(self,input_ids: Optional[torch.LongTensor] = None,attention_mask: Optional[torch.FloatTensor] = None,token_type_ids: Optional[torch.LongTensor] = None,position_ids: Optional[torch.LongTensor] = None,head_mask: Optional[torch.FloatTensor] = None,inputs_embeds: Optional[torch.FloatTensor] = None,output_attentions: Optional[bool] = None,output_hidden_states: Optional[bool] = None,return_dict: Optional[bool] = None,) -> Union[Tuple[torch.Tensor], BaseModelOutput]:output_attentions = output_attentions if output_attentions is not None else self.config.output_attentionsoutput_hidden_states = (output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states)return_dict = return_dict if return_dict is not None else self.config.use_return_dictif input_ids is not None and inputs_embeds is not None:raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")elif input_ids is not None:input_shape = input_ids.size()input_ids = input_ids.view(-1, input_shape[-1])elif inputs_embeds is not None:input_shape = inputs_embeds.size()[:-1]else:raise ValueError("You have to specify either input_ids or inputs_embeds")if position_ids is None:# Code is different from when we had a single embedding matrix  from position and token embeddingsposition_ids = self.position_ids[None, : input_shape[-1]]# Attention mask.if attention_mask is not None:# We create a 3D attention mask from a 2D tensor mask.# Sizes are [batch_size, 1, 1, to_seq_length]# So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]# this attention mask is more simple than the triangular masking of causal attention# used in OpenAI GPT, we just need to prepare the broadcast dimension here.attention_mask = attention_mask.unsqueeze(1).unsqueeze(2)# Since attention_mask is 1.0 for positions we want to attend and 0.0 for# masked positions, this operation will create a tensor which is 0.0 for# positions we want to attend and the dtype's smallest value for masked positions.# Since we are adding it to the raw scores before the softmax, this is# effectively the same as removing these entirely.attention_mask = attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibilityattention_mask = (1.0 - attention_mask) * torch.finfo(self.dtype).min# Prepare head mask if 
neededhead_mask = self.get_head_mask(head_mask, self.config.n_layer)if inputs_embeds is None:inputs_embeds = self.tokens_embed(input_ids)position_embeds = self.positions_embed(position_ids)if token_type_ids is not None:token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1))token_type_embeds = self.tokens_embed(token_type_ids)else:token_type_embeds = 0hidden_states = inputs_embeds + position_embeds + token_type_embedshidden_states = self.drop(hidden_states)output_shape = input_shape + (hidden_states.size(-1),)all_attentions = () if output_attentions else Noneall_hidden_states = () if output_hidden_states else Nonefor i, block in enumerate(self.h):if output_hidden_states:all_hidden_states = all_hidden_states + (hidden_states,)outputs = block(hidden_states, attention_mask, head_mask[i], output_attentions=output_attentions)hidden_states = outputs[0]if output_attentions:all_attentions = all_attentions + (outputs[1],)hidden_states = hidden_states.view(*output_shape)# Add last layerif output_hidden_states:all_hidden_states = all_hidden_states + (hidden_states,)if not return_dict:return tuple(v for v in [hidden_states, all_hidden_states, all_attentions] if v is not None)return BaseModelOutput(last_hidden_state=hidden_states,hidden_states=all_hidden_states,attentions=all_attentions,)
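For reference, the class above is shipped with the `transformers` library and can be loaded with the public `openai-gpt` checkpoint; a minimal usage sketch:

# Minimal usage sketch for the Hugging Face OpenAIGPTModel shown above.
# Requires `pip install transformers`; downloads the public "openai-gpt" checkpoint.
import torch
from transformers import OpenAIGPTTokenizer, OpenAIGPTModel

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTModel.from_pretrained("openai-gpt")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)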

GPT2

# coding=utf-8
# Copyright 2018 The OpenAI Team Authors and HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PyTorch OpenAI GPT-2 model."""import math
import os
import warnings
from dataclasses import dataclass
from typing import Optional, Tuple, Unionimport torch
import torch.utils.checkpoint
from torch import nn
from torch.cuda.amp import autocast
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELossfrom ...activations import ACT2FN
from ...modeling_outputs import (BaseModelOutputWithPastAndCrossAttentions,CausalLMOutputWithCrossAttentions,QuestionAnsweringModelOutput,SequenceClassifierOutputWithPast,TokenClassifierOutput,
)
from ...modeling_utils import PreTrainedModel, SequenceSummary
from ...pytorch_utils import Conv1D, find_pruneable_heads_and_indices, prune_conv1d_layer
from ...utils import (ModelOutput,add_code_sample_docstrings,add_start_docstrings,add_start_docstrings_to_model_forward,logging,replace_return_docstrings,
)
from ...utils.model_parallel_utils import assert_device_map, get_device_map
from .configuration_gpt2 import GPT2Configlogger = logging.get_logger(__name__)_CHECKPOINT_FOR_DOC = "gpt2"
_CONFIG_FOR_DOC = "GPT2Config"GPT2_PRETRAINED_MODEL_ARCHIVE_LIST = ["gpt2","gpt2-medium","gpt2-large","gpt2-xl","distilgpt2",# See all GPT-2 models at https://huggingface.co/models?filter=gpt2
]def load_tf_weights_in_gpt2(model, config, gpt2_checkpoint_path):"""Load tf checkpoints in a pytorch model"""try:import reimport tensorflow as tfexcept ImportError:logger.error("Loading a TensorFlow model in PyTorch, requires TensorFlow to be installed. Please see ""https://www.tensorflow.org/install/ for installation instructions.")raisetf_path = os.path.abspath(gpt2_checkpoint_path)logger.info(f"Converting TensorFlow checkpoint from {tf_path}")# Load weights from TF modelinit_vars = tf.train.list_variables(tf_path)names = []arrays = []for name, shape in init_vars:logger.info(f"Loading TF weight {name} with shape {shape}")array = tf.train.load_variable(tf_path, name)names.append(name)arrays.append(array.squeeze())for name, array in zip(names, arrays):name = name[6:]  # skip "model/"name = name.split("/")pointer = modelfor m_name in name:if re.fullmatch(r"[A-Za-z]+\d+", m_name):scope_names = re.split(r"(\d+)", m_name)else:scope_names = [m_name]if scope_names[0] == "w" or scope_names[0] == "g":pointer = getattr(pointer, "weight")elif scope_names[0] == "b":pointer = getattr(pointer, "bias")elif scope_names[0] == "wpe" or scope_names[0] == "wte":pointer = getattr(pointer, scope_names[0])pointer = getattr(pointer, "weight")else:pointer = getattr(pointer, scope_names[0])if len(scope_names) >= 2:num = int(scope_names[1])pointer = pointer[num]try:assert (pointer.shape == array.shape), f"Pointer shape {pointer.shape} and array shape {array.shape} mismatched"except AssertionError as e:e.args += (pointer.shape, array.shape)raiselogger.info(f"Initialize PyTorch weight {name}")pointer.data = torch.from_numpy(array)return modelclass GPT2Attention(nn.Module):def __init__(self, config, is_cross_attention=False, layer_idx=None):super().__init__()max_positions = config.max_position_embeddingsself.register_buffer("bias",torch.tril(torch.ones((max_positions, max_positions), dtype=torch.bool)).view(1, 1, max_positions, max_positions),)self.register_buffer("masked_bias", torch.tensor(-1e4))self.embed_dim = config.hidden_sizeself.num_heads = config.num_attention_headsself.head_dim = self.embed_dim // self.num_headsself.split_size = self.embed_dimif self.head_dim * self.num_heads != self.embed_dim:raise ValueError(f"`embed_dim` must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"f" {self.num_heads}).")self.scale_attn_weights = config.scale_attn_weightsself.is_cross_attention = is_cross_attention# Layer-wise attention scaling, reordering, and upcastingself.scale_attn_by_inverse_layer_idx = config.scale_attn_by_inverse_layer_idxself.layer_idx = layer_idxself.reorder_and_upcast_attn = config.reorder_and_upcast_attnif self.is_cross_attention:self.c_attn = Conv1D(2 * self.embed_dim, self.embed_dim)self.q_attn = Conv1D(self.embed_dim, self.embed_dim)else:self.c_attn = Conv1D(3 * self.embed_dim, self.embed_dim)self.c_proj = Conv1D(self.embed_dim, self.embed_dim)self.attn_dropout = nn.Dropout(config.attn_pdrop)self.resid_dropout = nn.Dropout(config.resid_pdrop)self.pruned_heads = set()def prune_heads(self, heads):if len(heads) == 0:returnheads, index = find_pruneable_heads_and_indices(heads, self.num_heads, self.head_dim, self.pruned_heads)index_attn = torch.cat([index, index + self.split_size, index + (2 * self.split_size)])# Prune conv1d layersself.c_attn = prune_conv1d_layer(self.c_attn, index_attn, dim=1)self.c_proj = prune_conv1d_layer(self.c_proj, index, dim=0)# Update hyper paramsself.split_size = (self.split_size // self.num_heads) * (self.num_heads - len(heads))self.num_heads = 
self.num_heads - len(heads)self.pruned_heads = self.pruned_heads.union(heads)def _attn(self, query, key, value, attention_mask=None, head_mask=None):attn_weights = torch.matmul(query, key.transpose(-1, -2))if self.scale_attn_weights:attn_weights = attn_weights / torch.full([], value.size(-1) ** 0.5, dtype=attn_weights.dtype, device=attn_weights.device)# Layer-wise attention scalingif self.scale_attn_by_inverse_layer_idx:attn_weights = attn_weights / float(self.layer_idx + 1)if not self.is_cross_attention:# if only "normal" attention layer implements causal maskquery_length, key_length = query.size(-2), key.size(-2)causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length]mask_value = torch.finfo(attn_weights.dtype).min# Need to be a tensor, otherwise we get error: `RuntimeError: expected scalar type float but found double`.# Need to be on the same device, otherwise `RuntimeError: ..., x and y to be on the same device`mask_value = torch.full([], mask_value, dtype=attn_weights.dtype).to(attn_weights.device)attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)if attention_mask is not None:# Apply the attention maskattn_weights = attn_weights + attention_maskattn_weights = nn.functional.softmax(attn_weights, dim=-1)# Downcast (if necessary) back to V's dtype (if in mixed-precision) -- No-Op otherwiseattn_weights = attn_weights.type(value.dtype)attn_weights = self.attn_dropout(attn_weights)# Mask heads if we want toif head_mask is not None:attn_weights = attn_weights * head_maskattn_output = torch.matmul(attn_weights, value)return attn_output, attn_weightsdef _upcast_and_reordered_attn(self, query, key, value, attention_mask=None, head_mask=None):# Use `torch.baddbmm` (a bit more efficient w/ alpha param for scaling -- from Megatron-LM)bsz, num_heads, q_seq_len, dk = query.size()_, _, k_seq_len, _ = key.size()# Preallocate attn_weights for `baddbmm`attn_weights = torch.empty(bsz * num_heads, q_seq_len, k_seq_len, dtype=torch.float32, device=query.device)# Compute Scale Factorscale_factor = 1.0if self.scale_attn_weights:scale_factor /= float(value.size(-1)) ** 0.5if self.scale_attn_by_inverse_layer_idx:scale_factor /= float(self.layer_idx + 1)# Upcast (turn off autocast) and reorder (Scale K by 1 / root(dk))with autocast(enabled=False):q, k = query.reshape(-1, q_seq_len, dk), key.transpose(-1, -2).reshape(-1, dk, k_seq_len)attn_weights = torch.baddbmm(attn_weights, q.float(), k.float(), beta=0, alpha=scale_factor)attn_weights = attn_weights.reshape(bsz, num_heads, q_seq_len, k_seq_len)if not self.is_cross_attention:# if only "normal" attention layer implements causal maskquery_length, key_length = query.size(-2), key.size(-2)causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length]mask_value = torch.finfo(attn_weights.dtype).min# Need to be a tensor, otherwise we get error: `RuntimeError: expected scalar type float but found double`.# Need to be on the same device, otherwise `RuntimeError: ..., x and y to be on the same device`mask_value = torch.tensor(mask_value, dtype=attn_weights.dtype).to(attn_weights.device)attn_weights = torch.where(causal_mask, attn_weights, mask_value)if attention_mask is not None:# Apply the attention maskattn_weights = attn_weights + attention_maskattn_weights = nn.functional.softmax(attn_weights, dim=-1)# Downcast (if necessary) back to V's dtype (if in mixed-precision) -- No-Op if otherwiseif attn_weights.dtype != torch.float32:raise RuntimeError("Error with upcasting, attn_weights 
does not have dtype torch.float32")attn_weights = attn_weights.type(value.dtype)attn_weights = self.attn_dropout(attn_weights)# Mask heads if we want toif head_mask is not None:attn_weights = attn_weights * head_maskattn_output = torch.matmul(attn_weights, value)return attn_output, attn_weightsdef _split_heads(self, tensor, num_heads, attn_head_size):"""Splits hidden_size dim into attn_head_size and num_heads"""new_shape = tensor.size()[:-1] + (num_heads, attn_head_size)tensor = tensor.view(new_shape)return tensor.permute(0, 2, 1, 3)  # (batch, head, seq_length, head_features)def _merge_heads(self, tensor, num_heads, attn_head_size):"""Merges attn_head_size dim and num_attn_heads dim into hidden_size"""tensor = tensor.permute(0, 2, 1, 3).contiguous()new_shape = tensor.size()[:-2] + (num_heads * attn_head_size,)return tensor.view(new_shape)def forward(self,hidden_states: Optional[Tuple[torch.FloatTensor]],layer_past: Optional[Tuple[torch.Tensor]] = None,attention_mask: Optional[torch.FloatTensor] = None,head_mask: Optional[torch.FloatTensor] = None,encoder_hidden_states: Optional[torch.Tensor] = None,encoder_attention_mask: Optional[torch.FloatTensor] = None,use_cache: Optional[bool] = False,output_attentions: Optional[bool] = False,) -> Tuple[Union[torch.Tensor, Tuple[torch.Tensor]], ...]:if encoder_hidden_states is not None:if not hasattr(self, "q_attn"):raise ValueError("If class is used as cross attention, the weights `q_attn` have to be defined. ""Please make sure to instantiate class with `GPT2Attention(..., is_cross_attention=True)`.")query = self.q_attn(hidden_states)key, value = self.c_attn(encoder_hidden_states).split(self.split_size, dim=2)attention_mask = encoder_attention_maskelse:query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)query = self._split_heads(query, self.num_heads, self.head_dim)key = self._split_heads(key, self.num_heads, self.head_dim)value = self._split_heads(value, self.num_heads, self.head_dim)if layer_past is not None:past_key, past_value = layer_pastkey = torch.cat((past_key, key), dim=-2)value = torch.cat((past_value, value), dim=-2)if use_cache is True:present = (key, value)else:present = Noneif self.reorder_and_upcast_attn:attn_output, attn_weights = self._upcast_and_reordered_attn(query, key, value, attention_mask, head_mask)else:attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)attn_output = self._merge_heads(attn_output, self.num_heads, self.head_dim)attn_output = self.c_proj(attn_output)attn_output = self.resid_dropout(attn_output)outputs = (attn_output, present)if output_attentions:outputs += (attn_weights,)return outputs  # a, present, (attentions)class GPT2MLP(nn.Module):def __init__(self, intermediate_size, config):super().__init__()embed_dim = config.hidden_sizeself.c_fc = Conv1D(intermediate_size, embed_dim)self.c_proj = Conv1D(embed_dim, intermediate_size)self.act = ACT2FN[config.activation_function]self.dropout = nn.Dropout(config.resid_pdrop)def forward(self, hidden_states: Optional[Tuple[torch.FloatTensor]]) -> torch.FloatTensor:hidden_states = self.c_fc(hidden_states)hidden_states = self.act(hidden_states)hidden_states = self.c_proj(hidden_states)hidden_states = self.dropout(hidden_states)return hidden_statesclass GPT2Block(nn.Module):def __init__(self, config, layer_idx=None):super().__init__()hidden_size = config.hidden_sizeinner_dim = config.n_inner if config.n_inner is not None else 4 * hidden_sizeself.ln_1 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)self.attn = 
GPT2Attention(config, layer_idx=layer_idx)self.ln_2 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)if config.add_cross_attention:self.crossattention = GPT2Attention(config, is_cross_attention=True, layer_idx=layer_idx)self.ln_cross_attn = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)self.mlp = GPT2MLP(inner_dim, config)def forward(self,hidden_states: Optional[Tuple[torch.FloatTensor]],layer_past: Optional[Tuple[torch.Tensor]] = None,attention_mask: Optional[torch.FloatTensor] = None,head_mask: Optional[torch.FloatTensor] = None,encoder_hidden_states: Optional[torch.Tensor] = None,encoder_attention_mask: Optional[torch.FloatTensor] = None,use_cache: Optional[bool] = False,output_attentions: Optional[bool] = False,) -> Union[Tuple[torch.Tensor], Optional[Tuple[torch.Tensor, Tuple[torch.FloatTensor, ...]]]]:residual = hidden_stateshidden_states = self.ln_1(hidden_states)attn_outputs = self.attn(hidden_states,layer_past=layer_past,attention_mask=attention_mask,head_mask=head_mask,use_cache=use_cache,output_attentions=output_attentions,)attn_output = attn_outputs[0]  # output_attn: a, present, (attentions)outputs = attn_outputs[1:]# residual connectionhidden_states = attn_output + residualif encoder_hidden_states is not None:# add one self-attention block for cross-attentionif not hasattr(self, "crossattention"):raise ValueError(f"If `encoder_hidden_states` are passed, {self} has to be instantiated with ""cross-attention layers by setting `config.add_cross_attention=True`")residual = hidden_stateshidden_states = self.ln_cross_attn(hidden_states)cross_attn_outputs = self.crossattention(hidden_states,attention_mask=attention_mask,head_mask=head_mask,encoder_hidden_states=encoder_hidden_states,encoder_attention_mask=encoder_attention_mask,output_attentions=output_attentions,)attn_output = cross_attn_outputs[0]# residual connectionhidden_states = residual + attn_outputoutputs = outputs + cross_attn_outputs[2:]  # add cross attentions if we output attention weightsresidual = hidden_stateshidden_states = self.ln_2(hidden_states)feed_forward_hidden_states = self.mlp(hidden_states)# residual connectionhidden_states = residual + feed_forward_hidden_statesif use_cache:outputs = (hidden_states,) + outputselse:outputs = (hidden_states,) + outputs[1:]return outputs  # hidden_states, present, (attentions, cross_attentions)class GPT2PreTrainedModel(PreTrainedModel):"""An abstract class to handle weights initialization and a simple interface for downloading and loading pretrainedmodels."""config_class = GPT2Configload_tf_weights = load_tf_weights_in_gpt2base_model_prefix = "transformer"is_parallelizable = Truesupports_gradient_checkpointing = True_no_split_modules = ["GPT2Block"]def __init__(self, *inputs, **kwargs):super().__init__(*inputs, **kwargs)def _init_weights(self, module):"""Initialize the weights."""if isinstance(module, (nn.Linear, Conv1D)):# Slightly different from the TF version which uses truncated_normal for initialization# cf https://github.com/pytorch/pytorch/pull/5617module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)if module.bias is not None:module.bias.data.zero_()elif isinstance(module, nn.Embedding):module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)if module.padding_idx is not None:module.weight.data[module.padding_idx].zero_()elif isinstance(module, nn.LayerNorm):module.bias.data.zero_()module.weight.data.fill_(1.0)# Reinitialize selected weights subject to the OpenAI GPT-2 Paper Scheme:#   > A modified initialization which 
accounts for the accumulation on the residual path with model depth. Scale#   > the weights of residual layers at initialization by a factor of 1/√N where N is the # of residual layers.#   >   -- GPT-2 :: https://openai.com/blog/better-language-models/## Reference (Megatron-LM): https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/gpt_model.pyfor name, p in module.named_parameters():if name == "c_proj.weight":# Special Scaled Initialization --> There are 2 Layer Norms per Transformer Blockp.data.normal_(mean=0.0, std=(self.config.initializer_range / math.sqrt(2 * self.config.n_layer)))def _set_gradient_checkpointing(self, module, value=False):if isinstance(module, GPT2Model):module.gradient_checkpointing = value@dataclass
class GPT2DoubleHeadsModelOutput(ModelOutput):"""Base class for outputs of models predicting if two sentences are consecutive or not.Args:loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):Language modeling loss.mc_loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `mc_labels` is provided):Multiple choice classification loss.logits (`torch.FloatTensor` of shape `(batch_size, num_choices, sequence_length, config.vocab_size)`):Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).mc_logits (`torch.FloatTensor` of shape `(batch_size, num_choices)`):Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).past_key_values (`Tuple[Tuple[torch.Tensor]]`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):Tuple of length `config.n_layers`, containing tuples of tensors of shape `(batch_size, num_heads,sequence_length, embed_size_per_head)`).Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see`past_key_values` input) to speed up sequential decoding.hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) ofshape `(batch_size, sequence_length, hidden_size)`.Hidden-states of the model at the output of each layer plus the initial embedding outputs.attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,sequence_length)`.GPT2Attentions weights after the attention softmax, used to compute the weighted average in theself-attention heads."""loss: Optional[torch.FloatTensor] = Nonemc_loss: Optional[torch.FloatTensor] = Nonelogits: torch.FloatTensor = Nonemc_logits: torch.FloatTensor = Nonepast_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = Nonehidden_states: Optional[Tuple[torch.FloatTensor]] = Noneattentions: Optional[Tuple[torch.FloatTensor]] = NoneGPT2_START_DOCSTRING = r"""This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods thelibrary implements for all its model (such as downloading or saving, resizing the input embeddings, pruning headsetc.)This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usageand behavior.Parameters:config ([`GPT2Config`]): Model configuration class with all the parameters of the model.Initializing with a config file does not load the weights associated with the model, only theconfiguration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""GPT2_INPUTS_DOCSTRING = r"""Args:input_ids (`torch.LongTensor` of shape `(batch_size, input_ids_length)`):`input_ids_length` = `sequence_length` if `past_key_values` is `None` else`past_key_values[0][0].shape[-2]` (`sequence_length` of input past key value states). Indices of inputsequence tokens in the vocabulary.If `past_key_values` is used, only `input_ids` that do not have their past calculated should be passed as`input_ids`.Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and[`PreTrainedTokenizer.__call__`] for details.[What are input IDs?](../glossary#input-ids)past_key_values (`Tuple[Tuple[torch.Tensor]]` of length `config.n_layers`):Contains precomputed hidden-states (key and values in the attention blocks) as computed by the model (see`past_key_values` output below). Can be used to speed up sequential decoding. The `input_ids` which havetheir past given to this model should not be passed as `input_ids` as they have already been computed.attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:- 1 for tokens that are **not masked**,- 0 for tokens that are **masked**.If `past_key_values` is used, `attention_mask` needs to contain the masking strategy that was used for`past_key_values`. In other words, the `attention_mask` always has to have the length:`len(past_key_values) + len(input_ids)`[What are attention masks?](../glossary#attention-mask)token_type_ids (`torch.LongTensor` of shape `(batch_size, input_ids_length)`, *optional*):Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,1]`:- 0 corresponds to a *sentence A* token,- 1 corresponds to a *sentence B* token.[What are token type IDs?](../glossary#token-type-ids)position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,config.max_position_embeddings - 1]`.[What are position IDs?](../glossary#position-ids)head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:- 1 indicates the head is **not masked**,- 0 indicates the head is **masked**.inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. Thisis useful if you want more control over how to convert `input_ids` indices into associated vectors than themodel's internal embedding lookup matrix.If `past_key_values` is used, optionally only the last `inputs_embeds` have to be input (see`past_key_values`).use_cache (`bool`, *optional*):If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see`past_key_values`).output_attentions (`bool`, *optional*):Whether or not to return the attentions tensors of all attention layers. See `attentions` under returnedtensors for more detail.output_hidden_states (`bool`, *optional*):Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors formore detail.return_dict (`bool`, *optional*):Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
"""
PARALLELIZE_DOCSTRING = r"""This is an experimental feature and is a subject to change at a moment's notice.Uses a device map to distribute attention modules of the model across several devices. If no device map is given,it will evenly distribute blocks across all devices.Args:device_map (`Dict[int, list]`, optional, defaults to None):A dictionary that maps attention modules to devices. Note that the embedding module and LMHead are alwaysautomatically mapped to the first device (for esoteric reasons). That means that the first device shouldhave fewer attention modules mapped to it than other devices. For reference, the gpt2 models have thefollowing number of attention modules:- gpt2: 12- gpt2-medium: 24- gpt2-large: 36- gpt2-xl: 48Example:```python# Here is an example of a device map on a machine with 4 GPUs using gpt2-xl, which has a total of 48 attention modules:model = GPT2LMHeadModel.from_pretrained("gpt2-xl")device_map = {0: [0, 1, 2, 3, 4, 5, 6, 7, 8],1: [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],2: [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34],3: [35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47],}model.parallelize(device_map)```
"""
DEPARALLELIZE_DOCSTRING = r"""Moves the model to cpu from a model parallel state.Example:```python# On a 4 GPU machine with gpt2-large:model = GPT2LMHeadModel.from_pretrained("gpt2-large")device_map = {0: [0, 1, 2, 3, 4, 5, 6, 7],1: [8, 9, 10, 11, 12, 13, 14, 15],2: [16, 17, 18, 19, 20, 21, 22, 23],3: [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35],}model.parallelize(device_map)  # Splits the model across several devicesmodel.deparallelize()  # Put the model back on cpu and cleans memory by calling torch.cuda.empty_cache()```
"""@add_start_docstrings("The bare GPT2 Model transformer outputting raw hidden-states without any specific head on top.",GPT2_START_DOCSTRING,
)
class GPT2Model(GPT2PreTrainedModel):_keys_to_ignore_on_load_missing = ["attn.masked_bias"]def __init__(self, config):super().__init__(config)self.embed_dim = config.hidden_sizeself.wte = nn.Embedding(config.vocab_size, self.embed_dim)self.wpe = nn.Embedding(config.max_position_embeddings, self.embed_dim)self.drop = nn.Dropout(config.embd_pdrop)self.h = nn.ModuleList([GPT2Block(config, layer_idx=i) for i in range(config.num_hidden_layers)])self.ln_f = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_epsilon)# Model parallelself.model_parallel = Falseself.device_map = Noneself.gradient_checkpointing = False# Initialize weights and apply final processingself.post_init()@add_start_docstrings(PARALLELIZE_DOCSTRING)def parallelize(self, device_map=None):# Check validity of device_mapwarnings.warn("`GPT2Model.parallelize` is deprecated and will be removed in v5 of Transformers, you should load your"" model with `device_map='balanced'` in the call to `from_pretrained`. You can also provide your own"" `device_map` but it needs to be a dictionary module_name to device, so for instance {'h.0': 0, 'h.1': 1,"" ...}",FutureWarning,)self.device_map = (get_device_map(len(self.h), range(torch.cuda.device_count())) if device_map is None else device_map)assert_device_map(self.device_map, len(self.h))self.model_parallel = Trueself.first_device = "cpu" if "cpu" in self.device_map.keys() else "cuda:" + str(min(self.device_map.keys()))self.last_device = "cuda:" + str(max(self.device_map.keys()))self.wte = self.wte.to(self.first_device)self.wpe = self.wpe.to(self.first_device)# Load onto devicesfor k, v in self.device_map.items():for block in v:cuda_device = "cuda:" + str(k)self.h[block] = self.h[block].to(cuda_device)# ln_f to lastself.ln_f = self.ln_f.to(self.last_device)@add_start_docstrings(DEPARALLELIZE_DOCSTRING)def deparallelize(self):warnings.warn("Like `parallelize`, `deparallelize` is deprecated and will be removed in v5 of Transformers.",FutureWarning,)self.model_parallel = Falseself.device_map = Noneself.first_device = "cpu"self.last_device = "cpu"self.wte = self.wte.to("cpu")self.wpe = self.wpe.to("cpu")for index in range(len(self.h)):self.h[index] = self.h[index].to("cpu")self.ln_f = self.ln_f.to("cpu")torch.cuda.empty_cache()def get_input_embeddings(self):return self.wtedef set_input_embeddings(self, new_embeddings):self.wte = new_embeddingsdef _prune_heads(self, heads_to_prune):"""Prunes heads of the model. 
heads_to_prune: dict of {layer_num: list of heads to prune in this layer}"""for layer, heads in heads_to_prune.items():self.h[layer].attn.prune_heads(heads)@add_start_docstrings_to_model_forward(GPT2_INPUTS_DOCSTRING)@add_code_sample_docstrings(checkpoint=_CHECKPOINT_FOR_DOC,output_type=BaseModelOutputWithPastAndCrossAttentions,config_class=_CONFIG_FOR_DOC,)def forward(self,input_ids: Optional[torch.LongTensor] = None,past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,attention_mask: Optional[torch.FloatTensor] = None,token_type_ids: Optional[torch.LongTensor] = None,position_ids: Optional[torch.LongTensor] = None,head_mask: Optional[torch.FloatTensor] = None,inputs_embeds: Optional[torch.FloatTensor] = None,encoder_hidden_states: Optional[torch.Tensor] = None,encoder_attention_mask: Optional[torch.FloatTensor] = None,use_cache: Optional[bool] = None,output_attentions: Optional[bool] = None,output_hidden_states: Optional[bool] = None,return_dict: Optional[bool] = None,) -> Union[Tuple, BaseModelOutputWithPastAndCrossAttentions]:output_attentions = output_attentions if output_attentions is not None else self.config.output_attentionsoutput_hidden_states = (output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states)use_cache = use_cache if use_cache is not None else self.config.use_cachereturn_dict = return_dict if return_dict is not None else self.config.use_return_dictif input_ids is not None and inputs_embeds is not None:raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")elif input_ids is not None:input_shape = input_ids.size()input_ids = input_ids.view(-1, input_shape[-1])batch_size = input_ids.shape[0]elif inputs_embeds is not None:input_shape = inputs_embeds.size()[:-1]batch_size = inputs_embeds.shape[0]else:raise ValueError("You have to specify either input_ids or inputs_embeds")device = input_ids.device if input_ids is not None else inputs_embeds.deviceif token_type_ids is not None:token_type_ids = token_type_ids.view(-1, input_shape[-1])if position_ids is not None:position_ids = position_ids.view(-1, input_shape[-1])if past_key_values is None:past_length = 0past_key_values = tuple([None] * len(self.h))else:past_length = past_key_values[0][0].size(-2)if position_ids is None:position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])# GPT2Attention mask.if attention_mask is not None:if batch_size <= 0:raise ValueError("batch_size has to be defined and > 0")attention_mask = attention_mask.view(batch_size, -1)# We create a 3D attention mask from a 2D tensor mask.# Sizes are [batch_size, 1, 1, to_seq_length]# So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]# this attention mask is more simple than the triangular masking of causal attention# used in OpenAI GPT, we just need to prepare the broadcast dimension here.attention_mask = attention_mask[:, None, None, :]# Since attention_mask is 1.0 for positions we want to attend and 0.0 for# masked positions, this operation will create a tensor which is 0.0 for# positions we want to attend and the dtype's smallest value for masked positions.# Since we are adding it to the raw scores before the softmax, this is# effectively the same as removing these entirely.attention_mask = attention_mask.to(dtype=self.dtype)  # fp16 compatibilityattention_mask = (1.0 - attention_mask) * torch.finfo(self.dtype).min# If a 2D or 3D attention 
mask is provided for the cross-attention# we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]if self.config.add_cross_attention and encoder_hidden_states is not None:encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)if encoder_attention_mask is None:encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)encoder_attention_mask = self.invert_attention_mask(encoder_attention_mask)else:encoder_attention_mask = None# Prepare head mask if needed# 1.0 in head_mask indicate we keep the head# attention_probs has shape bsz x n_heads x N x N# head_mask has shape n_layer x batch x n_heads x N x Nhead_mask = self.get_head_mask(head_mask, self.config.n_layer)if inputs_embeds is None:inputs_embeds = self.wte(input_ids)position_embeds = self.wpe(position_ids)hidden_states = inputs_embeds + position_embedsif token_type_ids is not None:token_type_embeds = self.wte(token_type_ids)hidden_states = hidden_states + token_type_embedshidden_states = self.drop(hidden_states)output_shape = input_shape + (hidden_states.size(-1),)if self.gradient_checkpointing and self.training:if use_cache:logger.warning_once("`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...")use_cache = Falsepresents = () if use_cache else Noneall_self_attentions = () if output_attentions else Noneall_cross_attentions = () if output_attentions and self.config.add_cross_attention else Noneall_hidden_states = () if output_hidden_states else Nonefor i, (block, layer_past) in enumerate(zip(self.h, past_key_values)):# Model parallelif self.model_parallel:torch.cuda.set_device(hidden_states.device)# Ensure layer_past is on same device as hidden_states (might not be correct)if layer_past is not None:layer_past = tuple(past_state.to(hidden_states.device) for past_state in layer_past)# Ensure that attention_mask is always on the same device as hidden_statesif attention_mask is not None:attention_mask = attention_mask.to(hidden_states.device)if isinstance(head_mask, torch.Tensor):head_mask = head_mask.to(hidden_states.device)if output_hidden_states:all_hidden_states = all_hidden_states + (hidden_states,)if self.gradient_checkpointing and self.training:def create_custom_forward(module):def custom_forward(*inputs):# None for past_key_valuereturn module(*inputs, use_cache, output_attentions)return custom_forwardoutputs = torch.utils.checkpoint.checkpoint(create_custom_forward(block),hidden_states,None,attention_mask,head_mask[i],encoder_hidden_states,encoder_attention_mask,)else:outputs = block(hidden_states,layer_past=layer_past,attention_mask=attention_mask,head_mask=head_mask[i],encoder_hidden_states=encoder_hidden_states,encoder_attention_mask=encoder_attention_mask,use_cache=use_cache,output_attentions=output_attentions,)hidden_states = outputs[0]if use_cache is True:presents = presents + (outputs[1],)if output_attentions:all_self_attentions = all_self_attentions + (outputs[2 if use_cache else 1],)if self.config.add_cross_attention:all_cross_attentions = all_cross_attentions + (outputs[3 if use_cache else 2],)# Model Parallel: If it's the last layer for that device, put things on the next deviceif self.model_parallel:for k, v in self.device_map.items():if i == v[-1] and "cuda:" + str(k) != self.last_device:hidden_states = hidden_states.to("cuda:" + str(k + 1))hidden_states = self.ln_f(hidden_states)hidden_states = hidden_states.view(output_shape)# Add last hidden stateif 
output_hidden_states:all_hidden_states = all_hidden_states + (hidden_states,)if not return_dict:return tuple(vfor v in [hidden_states, presents, all_hidden_states, all_self_attentions, all_cross_attentions]if v is not None)return BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=hidden_states,past_key_values=presents,hidden_states=all_hidden_states,attentions=all_self_attentions,cross_attentions=all_cross_attentions,)@add_start_docstrings("""The GPT2 Model transformer with a language modeling head on top (linear layer with weights tied to the inputembeddings).""",GPT2_START_DOCSTRING,
)
class GPT2LMHeadModel(GPT2PreTrainedModel):_keys_to_ignore_on_load_missing = [r"attn.masked_bias", r"attn.bias", r"lm_head.weight"]def __init__(self, config):super().__init__(config)self.transformer = GPT2Model(config)self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)# Model parallelself.model_parallel = Falseself.device_map = None# Initialize weights and apply final processingself.post_init()@add_start_docstrings(PARALLELIZE_DOCSTRING)def parallelize(self, device_map=None):warnings.warn("`GPT2LMHeadModel.parallelize` is deprecated and will be removed in v5 of Transformers, you should load"" your model with `device_map='balanced'` in the call to `from_pretrained`. You can also provide your own"" `device_map` but it needs to be a dictionary module_name to device, so for instance {'transformer.h.0':"" 0, 'transformer.h.1': 1, ...}",FutureWarning,)self.device_map = (get_device_map(len(self.transformer.h), range(torch.cuda.device_count()))if device_map is Noneelse device_map)assert_device_map(self.device_map, len(self.transformer.h))self.transformer.parallelize(self.device_map)self.lm_head = self.lm_head.to(self.transformer.first_device)self.model_parallel = True@add_start_docstrings(DEPARALLELIZE_DOCSTRING)def deparallelize(self):warnings.warn("Like `parallelize`, `deparallelize` is deprecated and will be removed in v5 of Transformers.",FutureWarning,)self.transformer.deparallelize()self.transformer = self.transformer.to("cpu")self.lm_head = self.lm_head.to("cpu")self.model_parallel = Falsetorch.cuda.empty_cache()def get_output_embeddings(self):return self.lm_headdef set_output_embeddings(self, new_embeddings):self.lm_head = new_embeddingsdef prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs):token_type_ids = kwargs.get("token_type_ids", None)# only last token for inputs_ids if past is defined in kwargsif past_key_values:input_ids = input_ids[:, -1].unsqueeze(-1)if token_type_ids is not None:token_type_ids = token_type_ids[:, -1].unsqueeze(-1)attention_mask = kwargs.get("attention_mask", None)position_ids = kwargs.get("position_ids", None)if attention_mask is not None and position_ids is None:# create position_ids on the fly for batch generationposition_ids = attention_mask.long().cumsum(-1) - 1position_ids.masked_fill_(attention_mask == 0, 1)if past_key_values:position_ids = position_ids[:, -1].unsqueeze(-1)else:position_ids = None# if `inputs_embeds` are passed, we only want to use them in the 1st generation stepif inputs_embeds is not None and past_key_values is None:model_inputs = {"inputs_embeds": inputs_embeds}else:model_inputs = {"input_ids": input_ids}model_inputs.update({"past_key_values": past_key_values,"use_cache": kwargs.get("use_cache"),"position_ids": position_ids,"attention_mask": attention_mask,"token_type_ids": token_type_ids,})return model_inputs@add_start_docstrings_to_model_forward(GPT2_INPUTS_DOCSTRING)@add_code_sample_docstrings(checkpoint=_CHECKPOINT_FOR_DOC,output_type=CausalLMOutputWithCrossAttentions,config_class=_CONFIG_FOR_DOC,)def forward(self,input_ids: Optional[torch.LongTensor] = None,past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,attention_mask: Optional[torch.FloatTensor] = None,token_type_ids: Optional[torch.LongTensor] = None,position_ids: Optional[torch.LongTensor] = None,head_mask: Optional[torch.FloatTensor] = None,inputs_embeds: Optional[torch.FloatTensor] = None,encoder_hidden_states: Optional[torch.Tensor] = None,encoder_attention_mask: Optional[torch.FloatTensor] = 
None,labels: Optional[torch.LongTensor] = None,use_cache: Optional[bool] = None,output_attentions: Optional[bool] = None,output_hidden_states: Optional[bool] = None,return_dict: Optional[bool] = None,) -> Union[Tuple, CausalLMOutputWithCrossAttentions]:r"""labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set`labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100`are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]`"""return_dict = return_dict if return_dict is not None else self.config.use_return_dicttransformer_outputs = self.transformer(input_ids,past_key_values=past_key_values,attention_mask=attention_mask,token_type_ids=token_type_ids,position_ids=position_ids,head_mask=head_mask,inputs_embeds=inputs_embeds,encoder_hidden_states=encoder_hidden_states,encoder_attention_mask=encoder_attention_mask,use_cache=use_cache,output_attentions=output_attentions,output_hidden_states=output_hidden_states,return_dict=return_dict,)hidden_states = transformer_outputs[0]# Set device for model parallelismif self.model_parallel:torch.cuda.set_device(self.transformer.first_device)hidden_states = hidden_states.to(self.lm_head.weight.device)lm_logits = self.lm_head(hidden_states)loss = Noneif labels is not None:# move labels to correct device to enable model parallelismlabels = labels.to(lm_logits.device)# Shift so that tokens < n predict nshift_logits = lm_logits[..., :-1, :].contiguous()shift_labels = labels[..., 1:].contiguous()# Flatten the tokensloss_fct = CrossEntropyLoss()loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))if not return_dict:output = (lm_logits,) + transformer_outputs[1:]return ((loss,) + output) if loss is not None else outputreturn CausalLMOutputWithCrossAttentions(loss=loss,logits=lm_logits,past_key_values=transformer_outputs.past_key_values,hidden_states=transformer_outputs.hidden_states,attentions=transformer_outputs.attentions,cross_attentions=transformer_outputs.cross_attentions,)@staticmethoddef _reorder_cache(past_key_values: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor) -> Tuple[Tuple[torch.Tensor]]:"""This function is used to re-order the `past_key_values` cache if [`~PreTrainedModel.beam_search`] or[`~PreTrainedModel.beam_sample`] is called. This is required to match `past_key_values` with the correctbeam_idx at every generation step."""return tuple(tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past)for layer_past in past_key_values)@add_start_docstrings("""
@add_start_docstrings(
    """
    The GPT2 Model transformer with a language modeling and a multiple-choice classification head on top e.g. for
    RocStories/SWAG tasks. The two heads are two linear layers. The language modeling head has its weights tied to the
    input embeddings, the classification head takes as input the input of a specified classification token index in the
    input sequence).
    """,
    GPT2_START_DOCSTRING,
)
class GPT2DoubleHeadsModel(GPT2PreTrainedModel):
    _keys_to_ignore_on_load_missing = [r"attn.masked_bias", r"attn.bias", r"lm_head.weight"]

    def __init__(self, config):
        super().__init__(config)
        config.num_labels = 1
        self.transformer = GPT2Model(config)
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.multiple_choice_head = SequenceSummary(config)
        # Model parallel
        self.model_parallel = False
        self.device_map = None
        # Initialize weights and apply final processing
        self.post_init()

    @add_start_docstrings(PARALLELIZE_DOCSTRING)
    def parallelize(self, device_map=None):
        warnings.warn(
            "`GPT2DoubleHeadsModel.parallelize` is deprecated and will be removed in v5 of Transformers, you should"
            " load your model with `device_map='balanced'` in the call to `from_pretrained`. You can also provide your"
            " own `device_map` but it needs to be a dictionary module_name to device, so for instance"
            " {'transformer.h.0': 0, 'transformer.h.1': 1, ...}",
            FutureWarning,
        )
        self.device_map = (
            get_device_map(len(self.transformer.h), range(torch.cuda.device_count()))
            if device_map is None
            else device_map
        )
        assert_device_map(self.device_map, len(self.transformer.h))
        self.transformer.parallelize(self.device_map)
        self.lm_head = self.lm_head.to(self.transformer.first_device)
        self.multiple_choice_head = self.multiple_choice_head.to(self.transformer.first_device)
        self.model_parallel = True

    @add_start_docstrings(DEPARALLELIZE_DOCSTRING)
    def deparallelize(self):
        warnings.warn(
            "Like `parallelize`, `deparallelize` is deprecated and will be removed in v5 of Transformers.",
            FutureWarning,
        )
        self.transformer.deparallelize()
        self.transformer = self.transformer.to("cpu")
        self.lm_head = self.lm_head.to("cpu")
        self.multiple_choice_head = self.multiple_choice_head.to("cpu")
        self.model_parallel = False
        torch.cuda.empty_cache()

    def get_output_embeddings(self):
        return self.lm_head

    def set_output_embeddings(self, new_embeddings):
        self.lm_head = new_embeddings

    def prepare_inputs_for_generation(self, input_ids, past_key_values=None, **kwargs):
        token_type_ids = kwargs.get("token_type_ids", None)
        # only keep the last token of input_ids if a past key/value cache is provided
        if past_key_values:
            input_ids = input_ids[:, -1].unsqueeze(-1)
            if token_type_ids is not None:
                token_type_ids = token_type_ids[:, -1].unsqueeze(-1)

        attention_mask = kwargs.get("attention_mask", None)
        position_ids = kwargs.get("position_ids", None)

        if attention_mask is not None and position_ids is None:
            # create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids.masked_fill_(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -1].unsqueeze(-1)
        else:
            position_ids = None

        return {
            "input_ids": input_ids,
            "past_key_values": past_key_values,
            "use_cache": kwargs.get("use_cache"),
            "position_ids": position_ids,
            "attention_mask": attention_mask,
            "token_type_ids": token_type_ids,
        }

    @add_start_docstrings_to_model_forward(GPT2_INPUTS_DOCSTRING)
    @replace_return_docstrings(output_type=GPT2DoubleHeadsModelOutput, config_class=_CONFIG_FOR_DOC)
    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None, past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
        attention_mask: Optional[torch.FloatTensor] = None, token_type_ids: Optional[torch.LongTensor] = None,
        position_ids: Optional[torch.LongTensor] = None, head_mask: Optional[torch.FloatTensor] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None, mc_token_ids: Optional[torch.LongTensor] = None,
        labels: Optional[torch.LongTensor] = None, mc_labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None,
        **kwargs,
    ) -> Union[Tuple, GPT2DoubleHeadsModelOutput]:
        r"""
        mc_token_ids (`torch.LongTensor` of shape `(batch_size, num_choices)`, *optional*, default to index of the last token of the input):
            Index of the classification token in each input sequence. Selected in the range `[0, input_ids.size(-1) - 1]`.
        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
            `labels = input_ids`. Indices are selected in `[-100, 0, ..., config.vocab_size - 1]`. All labels set to
            `-100` are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size - 1]`
        mc_labels (`torch.LongTensor` of shape `(batch_size)`, *optional*):
            Labels for computing the multiple choice classification loss. Indices should be in `[0, ..., num_choices]`
            where *num_choices* is the size of the second dimension of the input tensors. (see *input_ids* above)

        Return:

        Example:

        ```python
        >>> import torch
        >>> from transformers import AutoTokenizer, GPT2DoubleHeadsModel

        >>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
        >>> model = GPT2DoubleHeadsModel.from_pretrained("gpt2")

        >>> # Add a [CLS] to the vocabulary (we should train it also!)
        >>> num_added_tokens = tokenizer.add_special_tokens({"cls_token": "[CLS]"})
        >>> # Update the model embeddings with the new vocabulary size
        >>> embedding_layer = model.resize_token_embeddings(len(tokenizer))

        >>> choices = ["Hello, my dog is cute [CLS]", "Hello, my cat is cute [CLS]"]
        >>> encoded_choices = [tokenizer.encode(s) for s in choices]
        >>> cls_token_location = [tokens.index(tokenizer.cls_token_id) for tokens in encoded_choices]

        >>> input_ids = torch.tensor(encoded_choices).unsqueeze(0)  # Batch size: 1, number of choices: 2
        >>> mc_token_ids = torch.tensor([cls_token_location])  # Batch size: 1

        >>> outputs = model(input_ids, mc_token_ids=mc_token_ids)
        >>> lm_logits = outputs.logits
        >>> mc_logits = outputs.mc_logits
        ```"""
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_outputs = self.transformer(
            input_ids, past_key_values=past_key_values, attention_mask=attention_mask,
            token_type_ids=token_type_ids, position_ids=position_ids, head_mask=head_mask,
            inputs_embeds=inputs_embeds, use_cache=use_cache, output_attentions=output_attentions,
            output_hidden_states=output_hidden_states, return_dict=return_dict,
        )
        hidden_states = transformer_outputs[0]

        # Set device for model parallelism
        if self.model_parallel:
            torch.cuda.set_device(self.transformer.first_device)
            hidden_states = hidden_states.to(self.lm_head.weight.device)

        lm_logits = self.lm_head(hidden_states)
        mc_logits = self.multiple_choice_head(hidden_states, mc_token_ids).squeeze(-1)

        mc_loss = None
        if mc_labels is not None:
            loss_fct = CrossEntropyLoss()
            mc_loss = loss_fct(mc_logits.view(-1, mc_logits.size(-1)), mc_labels.view(-1))
        lm_loss = None
        if labels is not None:
            labels = labels.to(lm_logits.device)
            shift_logits = lm_logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            loss_fct = CrossEntropyLoss()
            lm_loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))

        if not return_dict:
            output = (lm_logits, mc_logits) + transformer_outputs[1:]
            if mc_loss is not None:
                output = (mc_loss,) + output
            return ((lm_loss,) + output) if lm_loss is not None else output

        return GPT2DoubleHeadsModelOutput(
            loss=lm_loss,
            mc_loss=mc_loss,
            logits=lm_logits,
            mc_logits=mc_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )

    @staticmethod
    def _reorder_cache(
        past_key_values: Tuple[Tuple[torch.Tensor]], beam_idx: torch.Tensor
    ) -> Tuple[Tuple[torch.Tensor]]:
        """
        This function is used to re-order the `past_key_values` cache if [`~PreTrainedModel.beam_search`] or
        [`~PreTrainedModel.beam_sample`] is called. This is required to match `past_key_values` with the correct
        beam_idx at every generation step.
        """
        return tuple(
            tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past)
            for layer_past in past_key_values
        )
@add_start_docstrings(
    """
    The GPT2 Model transformer with a sequence classification head on top (linear layer).

    [`GPT2ForSequenceClassification`] uses the last token in order to do the classification, as other causal models
    (e.g. GPT-1) do.

    Since it does classification on the last token, it requires to know the position of the last token. If a
    `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
    no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
    padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
    each row of the batch).
    """,
    GPT2_START_DOCSTRING,
)
class GPT2ForSequenceClassification(GPT2PreTrainedModel):
    _keys_to_ignore_on_load_missing = [r"h\.\d+\.attn\.masked_bias", r"lm_head.weight"]

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.transformer = GPT2Model(config)
        self.score = nn.Linear(config.n_embd, self.num_labels, bias=False)
        # Model parallel
        self.model_parallel = False
        self.device_map = None
        # Initialize weights and apply final processing
        self.post_init()

    @add_start_docstrings_to_model_forward(GPT2_INPUTS_DOCSTRING)
    @add_code_sample_docstrings(
        checkpoint="microsoft/DialogRPT-updown",
        output_type=SequenceClassifierOutputWithPast,
        config_class=_CONFIG_FOR_DOC,
    )
    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None, past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
        attention_mask: Optional[torch.FloatTensor] = None, token_type_ids: Optional[torch.LongTensor] = None,
        position_ids: Optional[torch.LongTensor] = None, head_mask: Optional[torch.FloatTensor] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None, labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None,
    ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_outputs = self.transformer(
            input_ids, past_key_values=past_key_values, attention_mask=attention_mask,
            token_type_ids=token_type_ids, position_ids=position_ids, head_mask=head_mask,
            inputs_embeds=inputs_embeds, use_cache=use_cache, output_attentions=output_attentions,
            output_hidden_states=output_hidden_states, return_dict=return_dict,
        )
        hidden_states = transformer_outputs[0]
        logits = self.score(hidden_states)

        if input_ids is not None:
            batch_size, sequence_length = input_ids.shape[:2]
        else:
            batch_size, sequence_length = inputs_embeds.shape[:2]

        assert (
            self.config.pad_token_id is not None or batch_size == 1
        ), "Cannot handle batch sizes > 1 if no padding token is defined."
        if self.config.pad_token_id is None:
            sequence_lengths = -1
        else:
            if input_ids is not None:
                sequence_lengths = (torch.ne(input_ids, self.config.pad_token_id).sum(-1) - 1).to(logits.device)
            else:
                sequence_lengths = -1
                logger.warning(
                    f"{self.__class__.__name__} will not detect padding tokens in `inputs_embeds`. Results may be "
                    "unexpected if using padding tokens in conjunction with `inputs_embeds.`"
                )

        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(pooled_logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(pooled_logits, labels)

        if not return_dict:
            output = (pooled_logits,) + transformer_outputs[1:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutputWithPast(
            loss=loss,
            logits=pooled_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )
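A small usage sketch for the sequence-classification head (the 2-label `score` layer here is freshly initialized on top of the public `gpt2` checkpoint, so its predictions are untrained and only illustrate the mechanics). Because GPT-2 defines no padding token, `pad_token_id` has to be set before batching more than one sequence, which is exactly the situation the padding logic above guards against.

# Illustrative sketch; the classification head is untrained.
from transformers import GPT2ForSequenceClassification, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 has no pad token; reuse EOS so batches > 1 can be padded and the last real token located
tokenizer.pad_token = tokenizer.eos_token

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

batch = tokenizer(["a great movie", "a boring, overlong movie"],
                  padding=True, return_tensors="pt")
logits = model(**batch).logits  # shape (2, 2): one row of class logits per sequence
print(logits.argmax(-1))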
@add_start_docstrings(
    """
    GPT2 Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for
    Named-Entity-Recognition (NER) tasks.
    """,
    GPT2_START_DOCSTRING,
)
class GPT2ForTokenClassification(GPT2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.transformer = GPT2Model(config)
        if hasattr(config, "classifier_dropout") and config.classifier_dropout is not None:
            classifier_dropout = config.classifier_dropout
        elif hasattr(config, "hidden_dropout") and config.hidden_dropout is not None:
            classifier_dropout = config.hidden_dropout
        else:
            classifier_dropout = 0.1
        self.dropout = nn.Dropout(classifier_dropout)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        # Model parallel
        self.model_parallel = False
        self.device_map = None
        # Initialize weights and apply final processing
        self.post_init()

    @add_start_docstrings_to_model_forward(GPT2_INPUTS_DOCSTRING)
    # fmt: off
    @add_code_sample_docstrings(
        checkpoint="brad1141/gpt2-finetuned-comp2",
        output_type=TokenClassifierOutput,
        config_class=_CONFIG_FOR_DOC,
        expected_loss=0.25,
        expected_output=["Lead", "Lead", "Lead", "Position", "Lead", "Lead", "Lead", "Lead", "Lead", "Lead", "Lead", "Lead"],
    )
    # fmt: on
    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None, past_key_values: Optional[Tuple[Tuple[torch.Tensor]]] = None,
        attention_mask: Optional[torch.FloatTensor] = None, token_type_ids: Optional[torch.LongTensor] = None,
        position_ids: Optional[torch.LongTensor] = None, head_mask: Optional[torch.FloatTensor] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None, labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None, output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None,
    ) -> Union[Tuple, TokenClassifierOutput]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_outputs = self.transformer(
            input_ids, past_key_values=past_key_values, attention_mask=attention_mask,
            token_type_ids=token_type_ids, position_ids=position_ids, head_mask=head_mask,
            inputs_embeds=inputs_embeds, use_cache=use_cache, output_attentions=output_attentions,
            output_hidden_states=output_hidden_states, return_dict=return_dict,
        )
        hidden_states = transformer_outputs[0]
        hidden_states = self.dropout(hidden_states)
        logits = self.classifier(hidden_states)

        loss = None
        if labels is not None:
            labels = labels.to(logits.device)
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        if not return_dict:
            output = (logits,) + transformer_outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return TokenClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )
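Similarly, a short sketch for the token-classification head; the label count below is hypothetical and the classifier weights are newly initialized, so this only illustrates the per-token output shape.

# Illustrative sketch; num_labels=5 is an arbitrary choice and the classifier is untrained.
from transformers import GPT2ForTokenClassification, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2ForTokenClassification.from_pretrained("gpt2", num_labels=5)

enc = tokenizer("OpenAI released GPT-2 in 2019", return_tensors="pt")
logits = model(**enc).logits  # (1, seq_len, num_labels): one label distribution per token
print(logits.shape, logits.argmax(-1))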
@add_start_docstrings(
    """
    The GPT-2 Model transformer with a span classification head on top for extractive question-answering tasks like
    SQuAD (a linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`).
    """,
    GPT2_START_DOCSTRING,
)
class GPT2ForQuestionAnswering(GPT2PreTrainedModel):
    _keys_to_ignore_on_load_missing = [r"h\.\d+\.attn\.masked_bias", r"h\.\d+\.attn\.bias", r"lm_head.weight"]

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.transformer = GPT2Model(config)
        self.qa_outputs = nn.Linear(config.hidden_size, 2)
        # Model parallel
        self.model_parallel = False
        self.device_map = None
        self.gradient_checkpointing = False
        # Initialize weights and apply final processing
        self.post_init()

    @add_start_docstrings_to_model_forward(GPT2_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
    @add_code_sample_docstrings(
        checkpoint=_CHECKPOINT_FOR_DOC,
        output_type=QuestionAnsweringModelOutput,
        config_class=_CONFIG_FOR_DOC,
        real_checkpoint=_CHECKPOINT_FOR_DOC,
    )
    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None, attention_mask: Optional[torch.FloatTensor] = None,
        token_type_ids: Optional[torch.LongTensor] = None, position_ids: Optional[torch.LongTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None, inputs_embeds: Optional[torch.FloatTensor] = None,
        start_positions: Optional[torch.LongTensor] = None, end_positions: Optional[torch.LongTensor] = None,
        output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, QuestionAnsweringModelOutput]:
        r"""
        start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for position (index) of the start of the labelled span for computing the token classification loss.
            Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
            are not taken into account for computing the loss.
        end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for position (index) of the end of the labelled span for computing the token classification loss.
            Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
            are not taken into account for computing the loss.
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.transformer(
            input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids,
            position_ids=position_ids, head_mask=head_mask, inputs_embeds=inputs_embeds,
            output_attentions=output_attentions, output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = outputs[0]

        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1).contiguous()
        end_logits = end_logits.squeeze(-1).contiguous()

        total_loss = None
        if start_positions is not None and end_positions is not None:
            # If we are on multi-GPU, split add a dimension
            if len(start_positions.size()) > 1:
                start_positions = start_positions.squeeze(-1).to(start_logits.device)
            if len(end_positions.size()) > 1:
                end_positions = end_positions.squeeze(-1).to(end_logits.device)
            # sometimes the start/end positions are outside our model inputs, we ignore these terms
            ignored_index = start_logits.size(1)
            start_positions = start_positions.clamp(0, ignored_index)
            end_positions = end_positions.clamp(0, ignored_index)

            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2

        if not return_dict:
            output = (start_logits, end_logits) + outputs[2:]
            return ((total_loss,) + output) if total_loss is not None else output

        return QuestionAnsweringModelOutput(
            loss=total_loss,
            start_logits=start_logits,
            end_logits=end_logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
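Finally, a sketch of the span-extraction (question-answering) head. With the untrained `qa_outputs` layer on top of the public `gpt2` checkpoint the selected span is meaningless until the model is fine-tuned on SQuAD-style data, so this only shows how the start/end logits are turned into an answer span.

# Illustrative sketch; the QA head is untrained, so the decoded span is arbitrary.
import torch
from transformers import GPT2ForQuestionAnswering, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2ForQuestionAnswering.from_pretrained("gpt2")

question, context = "When was GPT-2 released?", "GPT-2 was released by OpenAI in 2019."
enc = tokenizer(question, context, return_tensors="pt")

out = model(**enc)
start = torch.argmax(out.start_logits, dim=-1).item()
end = torch.argmax(out.end_logits, dim=-1).item()
answer_ids = enc["input_ids"][0, start : end + 1]
print(tokenizer.decode(answer_ids))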

GPT-3: In-context learning opens the new prompt paradigm (few-shot learning)

Upgrades and innovations in prompting: instruction fine-tuning (IFT) and chain-of-thought (CoT)

To be continued!
