Paper study: What is the difference between BERT and GPT?

Foundation Models, Transformers, BERT and GPT

To summarize:

  • BERT learns vector representations: the embedding of a word in a sentence is tied to the other important words in that sentence, so what it ultimately learns is a representation of word vectors. This is also why BERT is so easy to apply to downstream tasks; for a downstream task you add some MLP layers on top of these features to do classification and so on, which is what fine-tuning means here. BERT is trained with a MASK (cloze) objective: the other words in the sentence are used to predict the masked-out word. This is self-supervised learning (no sentence labels are needed, only the masking), and it is also why BERT does not need a decoder.

  • GPT does generation: the output is the probability of each candidate word being chosen as the next word. Given a sentence, it generates the next token, appends that token to the sentence, feeds the extended sentence back into the model, and generates the next token again, over and over (see the sketch below). I can see why a decoder can do this task, but why no encoder is involved in the process — to be filled in later once I have read more.
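As a concrete illustration of that generate-and-append loop, here is a minimal greedy-decoding sketch using the Hugging Face transformers library (the "gpt2" checkpoint and the prompt are just placeholders; real systems typically use sampling, temperature, and so on):

```python
# Minimal sketch of GPT-style autoregressive generation (greedy decoding).
# Assumes the Hugging Face `transformers` library; "gpt2" is only an example checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("Sarah went to a restaurant", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):                               # generate 10 more tokens
        logits = model(input_ids).logits              # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1)     # most probable next token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```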

BERT and GPT are both pre-trained models; in the pre-training stage they differ only in the choice of objective function. BERT uses a cloze-style (masked word) training objective, while GPT is trained to predict the next word given the text so far. In the fine-tuning stage, GPT fine-tunes with a combination of two objective functions, whereas BERT needs task-specific layers added on top to process the semantic features.

GPT's choice is harder than BERT's: predicting the future is much harder than predicting a missing middle word. That is also why OpenAI had to keep scaling the model up to reach the level of GPT-3.5 and GPT-4.


Addendum
Li Mu's videos actually covered this before, but it didn't stick. A lesson in the importance of learning with a question in mind -_-

The Transformer has two parts: an encoder and a decoder. The difference is that when the encoder extracts features for the i-th element, it can see every element in the sequence, while the decoder, because of the mask, can only see the current element and the ones before it; the attention weights for positions after the current one are forced to zero by the mask during the attention computation. Since it is a standard language model that only predicts forward, the i-th word must not see the words after it when being predicted. That is why GPT (Generative Pre-Training) uses only the decoder.
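A small sketch of what that decoder mask looks like in practice (plain PyTorch, not tied to any particular implementation): positions after the current token get a score of minus infinity before the softmax, so their attention weights become exactly zero.

```python
# Sketch: causal (decoder-style) self-attention mask.
import torch

seq_len, d = 5, 8
q = torch.randn(seq_len, d)                       # toy query vectors
k = torch.randn(seq_len, d)                       # toy key vectors

scores = q @ k.T / d ** 0.5                       # raw attention scores
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))  # hide future positions
weights = torch.softmax(scores, dim=-1)           # row i attends only to positions <= i

print(weights)                                    # upper triangle is all zeros
```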


Source blog (reproduced in full below):

  • https://heidloff.net/article/foundation-models-transformers-bert-and-gpt/

Since I’m excited by the incredible capabilities which technologies like ChatGPT and Bard provide, I’m trying to understand better how they work. This post summarizes my current understanding about foundation models, transformers, BERT and GPT.

Note that I’m only learning these concepts and not everything might be fully correct but might help some people to understand the high level concepts.

I know that there are many more, and more modern, foundation models than BERT and GPT, but I want to start ‘simple’, and these two models are probably the best-known ones these days.

The technologies below are not trivial and there are lots of articles, papers and full courses even on certain aspects of each technology only. Instead of going into detail, I try to explain what they do and what concepts they use.

Foundation Models

BERT and GPT are both foundation models. Let’s look at the definition and characteristics:

  • Pre-trained on different types of unlabeled datasets (e.g., language and images)
  • Self-supervised learning
  • Generalized data representations which can be used in multiple downstream tasks (e.g., classification and generation)
  • The Transformer architecture is mostly used, but not mandatory

Read my blog Foundation Models at IBM to find out more.

Transformer Architecture

Most foundation models use the transformer architecture. Let’s look at the definition:

A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the fields of natural language processing and computer vision.

In 2017 transformers were introduced: Attention is all you need. They are the next generation after Recurrent Neural Network and Long Short-Term Memory architectures and have several benefits:

  • Parallel processing: Increases performance and scalability
  • Bidirectionality: Allows understanding of ambiguous words and coreferences

The original transformer architecture defines two main parts, an encoder and a decoder. However, not all foundation models use both parts. BERT only uses encoders, GPT only decoders. More on this later.

Attention

Both encoders and decoders use the concept of ‘attention’. Attention basically means to focus on the important pieces of information and to blend out the unimportant pieces. I like to compare this with ‘fast reading’. Rather than reading full articles or even full books, I often browse chapter titles, first words of paragraphs and scan paragraphs for keywords to find what I’m looking for.

The words of an article, the parts of an image or the words in a sentence that should get most attention change dependent on what you are looking for. Let’s look at a simple example sentence:

“Sarah went to a restaurant to meet her friend that night.”

The following words should get attention for the following queries:

  • What? -> ‘went’, ‘meet’
  • Where? -> ‘a restaurant’
  • Who? -> ‘Sarah’, ‘her friend’
  • When? -> ‘that night’

To determine the attention of words (more precisely, tokens), encoders and decoders in transformers use ‘queries’, ‘keys’ and ‘values’, all of which are represented as vectors. For a given query, the keys whose vectors are closest to the query vector are selected. Keys are an encoded representation of the values; in simple cases they can be identical.

There are different algorithms to implement the attention concept. I think an easy way to understand how this can work is to rank words highly that often occur together in sentences. For example, ‘where’ and ‘restaurant’ probably have a closer relation than ‘restaurant’ and ‘faith’. So, for the query ‘where’ the word ‘restaurant’ gets more attention.
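To make the query/key/value idea a bit more concrete, here is a minimal scaled dot-product attention sketch in NumPy (toy shapes, no learned projections and no multiple heads; purely for illustration):

```python
# Sketch: scaled dot-product attention over queries, keys and values.
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])               # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax per query
    return weights @ V                                    # weighted blend of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))      # 2 queries of dimension 4
K = rng.normal(size=(6, 4))      # 6 keys
V = rng.normal(size=(6, 4))      # 6 values
print(attention(Q, K, V).shape)  # (2, 4): one blended value per query
```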

Encoders and Decoders

As mentioned, there are encoders and decoders. BERT uses encoders only, GPT uses decoders only. Both options understand language, including syntax and semantics. Especially the next generation of large language models like GPT with billions of parameters does this very well.

The two models focus on different scenarios. However, since the field of foundation models is evolving, the differentiation is often fuzzier.

  • BERT (encoder): classification (e.g., sentiment), questions and answers, summarization, named entity recognition
  • GPT (decoder): translation, generation (e.g., stories)

The outputs of the core models are different:

  • BERT (encoder): Embeddings representing words with attention information in a certain context
  • GPT (decoder): Next words with probabilities

Both models are pretrained and can be reused without intensive training. Some of them are available as open source and can be downloaded from communities like Hugging Face, others are commercial. Reuse is important, since training is often very resource-intensive and expensive, something few companies can afford.

The pretrained models can be extended and customized for different domains and specific tasks. Layers can sometimes be reused without modifications and more layers are added on top. If layers need to be modified, the new training is more expensive. The technique to customize these models is called Transfer Learning, since the same generic model can easily be transferred to other domains.

BERT - Encoders

BERT uses the encoder part of the transformer architecture so that it understands semantic and syntactic language information. The output of BERT is embeddings, not predicted next words. To leverage these embeddings, other layer(s) need to be added on top, for example for text classification or question answering.
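As a rough sketch of what "adding layers on top" can look like, the snippet below freezes a pretrained BERT encoder and puts a single linear classification layer over the [CLS] embedding, using Hugging Face transformers and PyTorch; the checkpoint name and the two-class setup are assumptions for illustration only:

```python
# Sketch: a classification head on top of frozen BERT embeddings.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
for p in bert.parameters():                # keep the pretrained encoder fixed
    p.requires_grad = False

classifier = nn.Linear(bert.config.hidden_size, 2)   # e.g. positive / negative

inputs = tokenizer("Sarah went to a restaurant to meet her friend that night.",
                   return_tensors="pt")
with torch.no_grad():
    cls_embedding = bert(**inputs).last_hidden_state[:, 0, :]   # [CLS] token

logits = classifier(cls_embedding)
print(logits.shape)   # torch.Size([1, 2])
```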

BERT uses a genius trick for the training. For supervised training it is often expensive to get labeled data, sometimes it’s impossible. The trick is to use masks as I described in my post Evolution of AI explained via a simple Sample. Let’s take a simple example, an unlabeled sentence:

“Sarah went to a restaurant to meet her friend that night.”

This is converted into:

  • Text: “Sarah went to a restaurant to meet her MASK that night.”
  • Label: “Sarah went to a restaurant to meet her friend that night.”

Note that this is a very simplified description only since there aren’t ‘real’ labels in BERT.

In other words, BERT produces labeled data for originally unlabeled data. This technique is called Self-Supervised Learning. It works very well for huge amounts of data.
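A minimal sketch of how such masked training pairs could be produced from unlabeled text (heavily simplified: real BERT masks about 15% of subword tokens and sometimes swaps in random tokens instead of the mask token):

```python
# Sketch: turning an unlabeled sentence into a masked-LM training pair.
import random

def make_masked_example(sentence, mask_prob=0.15, mask_token="MASK"):
    tokens = sentence.split()
    labels = list(tokens)                    # the "label" is simply the original text
    for i in range(len(tokens)):
        if random.random() < mask_prob:
            tokens[i] = mask_token           # hide the word the model must predict
    return " ".join(tokens), " ".join(labels)

random.seed(3)
text, label = make_masked_example(
    "Sarah went to a restaurant to meet her friend that night.")
print("Text: ", text)
print("Label:", label)
```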

In masked language models like BERT, the prediction of each masked word (token) is conditioned on the other tokens in the sentence. The encoder receives all of these tokens, which is why you don’t need a decoder.

GPT - Decoders

In language scenarios decoders are used to generate next words, for example when translating text or generating stories. The outputs are words with probabilities.

Decoders also use the attention concept, and even twice. First, when training models, they use Masked Multi-Head Attention, which means that only the first words of the target sentence are provided so that the model can learn without cheating. This mechanism is similar to the MASK concept from BERT.

After this the decoder uses Multi-Head Attention as it is also used in the encoder. Transformer-based models that utilize both encoders and decoders use a trick to be more efficient: the output of the encoders is fed as input to the decoders, more precisely the keys and values. Decoders can issue queries to find the closest keys. This allows, for example, understanding the meaning of the original sentence and translating it into other languages even if the number of resulting words and their order change.
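A hedged sketch of that encoder-decoder (‘cross’) attention, reusing the scaled dot-product idea from above: the decoder supplies the queries, while the keys and values are the encoder output (toy NumPy shapes, purely illustrative):

```python
# Sketch: cross-attention, with decoder queries over encoder keys/values.
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(1)
encoder_states = rng.normal(size=(7, 16))   # 7 source tokens (the original sentence)
decoder_states = rng.normal(size=(3, 16))   # 3 target tokens generated so far

context = attention(Q=decoder_states, K=encoder_states, V=encoder_states)
print(context.shape)   # (3, 16): one blended source representation per target token
```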

GPT doesn’t use this trick, though, and only uses a decoder. This is possible since these types of models have been trained with massive amounts of data (Large Language Models). The knowledge of encoders is encoded in billions of parameters (also called weights). The same knowledge exists in decoders when they are trained with enough data.

Note that ChatGPT has evolved these techniques. To prevent hate, profanity and abuse, humans need to label some data first. Additionally Reinforcement Learning is applied to improve the quality of the model (see ChatGPT: Optimizing Language Models for Dialogue).

Resources

There are many good articles, videos and courses. Here are some of the ones I read or watched:

  • Course: Natural Language Processing Demystified
  • YouTube channel: CodeEmporium
  • Article: What Is ChatGPT Doing … and Why Does It Work?
  • Article: 10 Things You Need to Know About BERT and the Transformer Architecture That Are Reshaping the AI Landscape
  • Article: Transformer’s Encoder-Decoder: Let’s Understand The Model Architecture
  • NLP - BERT & Transformer
