open clip论文阅读摘要

看下open clip论文
Learning Transferable Visual Models From Natural Language Supervision
These results suggest that the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets.

CNNs trained to predict words in image captions learn useful image representations

learn image representations from text

我好奇,在OCR上是怎么测试的?
CLIP训练样本要怎么准备,400 million (image, text) pairs,这个量级的样本集是怎么准备出来的

论文说CLIP这种预训练,zero-shot可以媲美基于监督学习构建的模型,我需要打个问号,在特定领域的业务数据上好像不太够啊?

Learning from natural language has several potential strengths over other training methods. It’s much easier to scale natural language supervision compared to standard crowd-sourced labeling for image classification since it does not require annotations to be in a classic “machine learning compatible format” such as the canonical 1-of-N majority vote “gold label”

MS-COCO and Visual Genome are high quality crowd-labeled datasets, they are small by modern standards with approximately 100,000 training photos each

YFCC100M, at 100 million photos, is a possible alternative, but the metadata for each image is sparse and of varying quality
Many images use automatically generated filenames like 20160716 113957.JPG as “titles” or contain “descriptions” of camera exposure settings. After filtering to keep only images with natural language titles and/or descriptions in English, the dataset shrunk by a factor of 6 to only 15 million photos. This is
approximately the same size as ImageNet

A major motivation for natural language supervision is the large quantities of data of this form available publicly on the internet.

we constructed a new dataset of 400 million (image, text) pairs collected form a variety of publicly available sources on the Internet

We approximately class balance the results by including up to 20,000 (image, text) pairs per query. The resulting dataset has a similar total word count as the WebText dataset used to train GPT-2. We refer to this dataset as WIT for WebImageText
注重样本的类别平衡

we found training efficiency was key to successfully scaling natural language supervision and we selected our final pre-training method based on this metric
是的,在这样规模的数据集上训练,需要的时间是令人畏惧的,所以掌握更快速的训练效率是关键

Recent work in contrastive representation learning for images has found that contrastive objectives can learn better representations than their equivalent
predictive objective
这个发现很有意思,这说明我们可以不需要准确预测每个图片的text caption,这太难了

Other work has found that although generative models of images can learn high quality image representations, they require over an order of magnitude more compute than contrastive models with the same performance
这里又提到了生成模型,在学习表征方面,有监督学习CNN、对比学习CLIP、生成模型Stable Diffusion

We train CLIP from scratch without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights.
在这么一个大数据集上,甚至比ImageNet还大,加载预训练的ImageNet模型和text encoder模型确实没必要

CLIP is pre-trained to predict if an image and a text snippet are paired together in its dataset. To perform zero-shot classification, we reuse this capability. For each dataset, we use the names of all the classes in the dataset as the set of potential text pairings and predict the most probable (image, text)
pair according to CLIP. In a bit more detail, we first compute the feature embedding of the image and the feature embedding of the set of possible texts by their respective encoders. The cosine similarity of these embeddings is then calculated, scaled by a temperature parameter τ , and normalized into a
probability distribution via a softmax. Note that this prediction layer is a multinomial logistic regression classifier with L2-normalized inputs, L2-normalized weights, no bias, and temperature scaling

Another issue we encountered is that it’s relatively rare in our pre-training dataset for the text paired with the image to be just a single word. Usually the text is a full sentence describing the image in some way. To help bridge this distribution gap, we found that using the prompt template “A photo of a {label}.” to be a good default that helps specify the text is about the content of the image. This often improves performance over the baseline of using only the label text
出现这个问题的原因是模型没能理解语言,不过现在GPT4可以做到了,估计会有点儿突破?
Similar to the “prompt engineering” discussion around GPT3 (Brown et al., 2020; Gao et al., 2020), we have also observed that zero-shot performance can be significantly improved by customizing the prompt text to each task. A few, non exhaustive, examples follow. We found on several fine-grained image classification datasets that it helped to specify the category. For example on Oxford-IIIT Pets, using “A photo of a {label}, a type of pet.” to help provide context worked well. Likewise, on Food101 specifying a type of food and on FGVC Aircraft a type of aircraft helped too. For OCR datasets, we found that putting quotes around the text or number to be recognized improved performance. Finally, we found that on satellite image classification datasets it helped to specify that the images were of this form and we use variants of “a satellite photo of a {label}.”.
这种prompt对于性能的提升是肯定的

We also experimented with ensembling over multiple zeroshot classifiers as another way of improving performance. These classifiers are computed by using different context prompts such as ‘A photo of a big {label}” and “A photo of a small {label}”. We construct the ensemble over the embedding space instead of probability space. This allows us to cache a single set of averaged text embeddings so that the compute cost of the ensemble is the same as using a single classifier when amortized over many predictions
这里使用emsemble的方法提升性能

说白了,比监督学习强在:
1、数据量更多
2、任务种类更多
3、加上文本学习语义信息,不单单是空间信息

we see that zero-shot CLIP is quite weak on several specialized, complex, or abstract tasks such as satellite image classification (EuroSAT and RESISC45), lymph node tumor detection (PatchCamelyon), counting objects in synthetic scenes (CLEVRCounts), self-driving related tasks such as
German traffic sign recognition (GTSRB), recognizing distance to the nearest car (KITTI Distance). These results highlight the poor capability of zero-shot CLIP on more complex tasks.
貌似跟GPT4也有点像?虽然通用性很不错,但是没办法做到全能,特别是复杂任务上,我感觉还是数据的问题吧,当然也有可能是现在的模型架构没办法应对复杂任务,所以需要拆解成更简单的子任务。不可否认的是在业务数据标注上存在加速作用

First, CLIP’s zero-shot classifier is generated via natural language which allows for visual concepts to be directly specified (“communicated”). By contrast,
“normal” supervised learning must infer concepts indirectly from training examples. Context-less example-based learning has the drawback that many different hypotheses can be consistent with the data, especially in the one-shot case. A single image often contains many different visual concepts. Although a capable learner is able to exploit visual cues and heuristics, such as assuming that the concept being demonstrated is the primary object in an image, there is no guarantee
是的,few-shot的难点主要在于,你不知道模型把什么特征跟最终的标签做了关联,所以需要加大样本数据量,才能使模型正确找到这个路径
zero-shot主要是在大数据量上预训练了,所以跟few-shot还是有区别的
我比较看好在大数据上预训练过的大模型


其实这个评估有点儿问题?如何评判稳定性?你这个只是在建立的测试样本集上的结果而已,并不是大量的数据评估结果,特别是放到真实业务场景下的分析结果,我觉得每个类别多点儿数据不是坏事,可以加强特征到标签的连接,特别是捕获正确的特征

If we assume that evaluation datasets are large enough that the parameters of linear classifiers trained on them are well estimated, then, because CLIP’s zero-shot classifier is also a linear classifier, the performance of the fully supervised classifiers roughly sets an upper bound for what zero-shot transfer can achieve
从拟合能力上来看,监督学习可以拟合的性能上限,也是zero-shot可以达到的上限

在大数据上学习到通用表征能力,跟在特定数据集上做监督训练,并不是冲突的

Over the past few years, empirical studies of deep learning
systems have documented that performance is predictable as
a function of important quantities such as training compute
and dataset size
这里提到,近年来的深度学习预测能力,是可以评估的,这个确实有点儿意思哈

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.rhkb.cn/news/185064.html

如若内容造成侵权/违法违规/事实不符,请联系长河编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

基于React开发的chatgpt网页版(仿chatgpt)

在浏览github的时候发现了一个好玩的项目本项目,是github大神Yidadaa开发的chatgpt网页版,该开源项目是跨平台的,Web / PWA / Linux / Win / MacOS都可以访问。非常有意思,本人就部署了一套,喜欢的同学可以体验一番。 …

Python之字符串、正则表达式练习

目录 1、输出随机字符串2、货币的转换(字符串 crr107)3、凯撒加密(book 实验 19)4、字符替换5、检测字母或数字6、纠正字母7、输出英文中所有长度为3个字母的单词 1、输出随机字符串 编写程序,输出由英文字母大小写或…

wpf添加Halcon的窗口控件报错:下列控件已成功添加到工具箱中,但未在活动设计器中启用

报错截图如下: 注意一下新建工程的时候选择wpf应用而不是wpf应用程序。 添加成功的控件:

第一个ARM程序裸板点灯

硬件知识LED原理图 如何点亮一个LED灯? 看原理图,确定控制LED的引脚。看主芯片的芯片手册,确定如何设置控制这个引脚。写程序。 LED有插脚封装的、贴片封装的。 它们长得完全不一样,因此我们在原理图中把它们抽象出来。 点亮…

双通道 H 桥电机驱动芯片AT8833,软硬件兼容替代DRV8833,应用玩具、打印机等应用

上期小编给大家分享了单通道 H 桥电机驱动芯片,现在来讲一讲双通道的驱动芯片。 双通道 H 桥电机驱动芯片能通过控制电机的正反转、速度和停止等功能,实现对电机的精确控制。下面介绍双通道H桥电机驱动芯片的工作原理和特点。 一、工作原理 双通道 H 桥电…

基于单片机的养殖场温度控制系统设计

博主主页:单片机辅导设计 博主简介:专注单片机技术领域和毕业设计项目。 主要内容:毕业设计、简历模板、学习资料、技术咨询。 文章目录 主要介绍一、控制系统设计二、系统方案设计2.1 系统运行方案设计2.1.1 羊舍环境温度的确定 三、 系统仿…

【FastCAE源码阅读6】C++与Python的集成,实现相互调用

分析FastCAE代码之前先看看C与Python如何相互调用的。 一、C调用Python 先写个C调用Python的例子&#xff0c;然后再来看FastCAE集成Python就比较简单了。直接上代码&#xff1a; #include <iostream> #include "python.h"int main() {Py_Initialize();PyRu…

【C语法学习】20 - 文件访问顺序

文章目录 0 前言1 文件位置指示符2 rewind()2.1 函数原型2.2 参数2.3 返回值2.4 使用说明 3 ftell()函数3.1 函数原型3.2 参数3.3 返回值 4 fseek()4.1 函数原型4.2 参数4.3 返回值 5 示例5.1 示例15.2 示例2 0 前言 C语言文件访问分为顺序文件访问和随机文件访问。 1 文件位…

云架构师学习------腾讯云通识-存储与数据库

云架构师学习------腾讯云通识-存储与数据库 云架构师学习------腾讯云通识-存储与数据库存储基础存储服务对象存储-COS产品概述功能概览产品优势 云硬盘-CBS产品概述产品功能产品优势云硬盘类型 文件存储-CFS产品概述产品功能产品优势文件存储类型及性能规格存储类型性能与规格…

react之Component存在的2个问题

问题 只要执行setState()&#xff0c;即使不改变状态数据&#xff0c;组件也会重新render()只当前组件重新render()&#xff0c;就会自动重新render子组件 原因 Component中的shouldComponentUpdate()总是返回true 思路 只有当组件的state或props数据发生改变时才重新rend…

Qt QTableView排序

1.简介 在开发过程中&#xff0c;我们需要通过点击表头来对QTableView或QTreeView等一系列高级视图进行排序操作&#xff0c;以下是进行排序的步骤。 步骤&#xff1a; 首先创建了一个QStandardItemModel对象或者继承QAbstractTableModel类作为数据模型&#xff0c;并设置了…

工厂设备扫码使用售卖联网开发需要怎么开发开源代码?

我们将详细介绍如何使用开源代码开发一套用于工厂设备联网统计的系统。我们将详细讨论所需硬件组件的选择、开源框架和库的使用、软件开发流程以及最后的集成和部署。在这个过程中&#xff0c;我们将提供实用的操作步骤和指导&#xff0c;帮助你更容易地完成这个复杂的任务。 …

Docker实战

一、Docker安装 以下均以CentOS 7为例 1、安装Docker yum install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin 2、启动和校验 # 启动Docker systemctl start docker# 停止Docker systemctl stop docker# 重启 systemctl resta…

【Qt之QVariant】使用

介绍 QVariant类类似于最常见的Qt数据类型的联合。由于C禁止联合类型包括具有非默认构造函数或析构函数的类型&#xff0c;大多数有趣的Qt类不能在联合中使用。如果没有QVariant&#xff0c;则QObject::property()和数据库操作等将会受到影响。 QVariant对象同时持有一个单一…

【数据结构】树与二叉树(六):二叉树的链式存储

文章目录 5.1 树的基本概念5.1.1 树的定义5.1.2 森林的定义5.1.3 树的术语5.1.4 树的表示 5.2 二叉树5.2.1 二叉树1. 定义2. 特点3. 性质引理5.1&#xff1a;二叉树中层数为i的结点至多有 2 i 2^i 2i个&#xff0c;其中 i ≥ 0 i \geq 0 i≥0。引理5.2&#xff1a;高度为k的二叉…

从业务到软件架构——软件建模

一、问题 1.架构到底是什么&#xff1f;架构和业务之间到底什么关系&#xff1f; 2.好的架构的设计出发点是什么&#xff1f;好的架构应该是什么样的&#xff1f; 作为一个计算机领域的词汇&#xff0c;架构的定义是&#xff1a;有关软件整体结构与组件的抽象描述&#xff0c…

PHP代码示例

我们需要使用PHP的curl库来发送HTTP请求。以下是一个基本的示例&#xff1a; php <?php // 初始化curl $ch curl_init(); // 设置代理 curl_setopt($ch, CURLOPT_PROXY, ""); // 设置URL curl_setopt($ch, CURLOPT_URL, ""); // 执行请求 $respon…

C语言--分段函数--switch语句

如何用switch语句写分段函数呢&#xff1f;⭐️ 首先介绍一下switch语句的语法规则⭐️ switch(整形表达式) {case 常量表达式1&#xff1b; //标签必须唯一语句块1;break;case 常量表达式2&#xff1b; //if(a0),而case中时系统自动加语句块2&#xff1b;break&#xff1b;c…

台式电脑一键重装Win10系统详细教程

很多用户都在使用台式Win10电脑办公&#xff0c;如果电脑出现系统问题无法解决了&#xff0c;这时候就可以考虑给电脑重装系统哦&#xff0c;下面小编给大家详细介绍关于台式电脑一键重装Win10系统的步骤方法&#xff0c;安装后电脑就能恢复正常&#xff0c;也不会影响到用户的…

浅谈电力物联网时代物联网技术在电力系统中的应用

贾丽丽 安科瑞电气股份有限公司 上海嘉定201801 摘要&#xff1a;在电力系统建设中&#xff0c;物联网的应用不仅促进了我国电力工业的发展&#xff0c;而且对我国的物联网技术也起到了一定的促进作用。随着物联网技术应用于电力系统&#xff0c;推动了中国工业的快速发展。因…