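Elasticsearch: Semantic search using the Inference API

In my previous article, “Elastic Search 8.12: Making Lucene fast and developers faster”, I mentioned the Inference API. At the core of these features has always been flexible third-party model management, which lets customers use the most downloaded vector database on the market today together with the transformer model of their choice. In this article, we walk through an example that shows how to use the Inference API for semantic search.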

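Prerequisites

  • You need Elastic Stack 8.12 or later, either as a self-managed Elasticsearch cluster or as a deployment on Elastic Cloud.
  • Because usage of the OpenAI free trial API is limited, you need a paid OpenAI account to use the Inference API with the OpenAI service.

In this demonstration, I use an Elasticsearch cluster that I set up on my own computer. The installed version is Elastic Stack 8.12.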

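Installation

Elasticsearch and Kibana

If you have not yet installed Elasticsearch and Kibana, refer to the following articles:

  • How to install Elasticsearch on Linux, macOS, and Windows

  • Kibana: How to install Kibana from the Elastic Stack on Linux, macOS, and Windows

During installation, follow the installation guide for Elastic Stack 8.x. In this post, I use Elastic Stack 8.12 for the demonstration.

While installing Elasticsearch, note down the information it prints on first startup; the password of the elastic user is needed later in this article.

Copy the certificate to the current working directory

When a client connects to Elasticsearch, we need the certificate generated by the Elasticsearch installation: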

$ pwd
/Users/liuxg/python/elser
$ cp ~/elastic/elasticsearch-8.12.0/config/certs/http_ca.crt .
$ ls http_ca.crt 
http_ca.crt

Install the required Python packages

pip3 install elasticsearch python-dotenv
$ pip3 install elasticsearch
Looking in indexes: http://mirrors.aliyun.com/pypi/simple/
Requirement already satisfied: elasticsearch in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (8.12.0)
Requirement already satisfied: elastic-transport<9,>=8 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from elasticsearch) (8.10.0)
Requirement already satisfied: urllib3<3,>=1.26.2 in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from elastic-transport<9,>=8->elasticsearch) (2.1.0)
Requirement already satisfied: certifi in /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages (from elastic-transport<9,>=8->elasticsearch) (2023.11.17)
[notice] A new release of pip is available: 23.3.2 -> 24.0
[notice] To update, run: pip3 install --upgrade pip
$ pip3 list | grep elasticsearch
elasticsearch                            8.12.0
rag-elasticsearch                        0.0.1        /Users/liuxg/python/rag-elasticsearch/my-app/packages/rag-elasticsearch

Set environment variables

We enter the following commands in the terminal to set the environment variables:

export ES_USER=elastic
export ES_PASSWORD=xnLj56lTrH98Lf_6n76y
export OPENAI_API_KEY=YourOpenAIkey

Adjust the values above to match your own Elasticsearch configuration and OpenAI key. Run these commands before starting Jupyter below.
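Since the notebook reads these three variables at startup, it can help to confirm they are present before going further. The following is a minimal sketch, not part of the original notebook; the helper name missing_env_vars is hypothetical.

```python
import os

# The three variable names exported in the commands above.
REQUIRED_VARS = ("ES_USER", "ES_PASSWORD", "OPENAI_API_KEY")


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Running this in the same terminal session that will start Jupyter shows whether the exports took effect.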

Create the dataset

We create the following dataset in the current directory:

movies.json

[
{"title": "Pulp Fiction","runtime": "154","plot": "The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.","keyScene": "John Travolta is forced to inject adrenaline directly into Uma Thurman's heart after she overdoses on heroin.","genre": "Crime, Drama","released": "1994"},
{"title": "The Dark Knight","runtime": "152","plot": "When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.","keyScene": "Batman angrily responds 'I’m Batman' when asked who he is by Falcone.","genre": "Action, Crime, Drama, Thriller","released": "2008"},
{"title": "Fight Club","runtime": "139","plot": "An insomniac office worker and a devil-may-care soapmaker form an underground fight club that evolves into something much, much more.","keyScene": "Brad Pitt explains the rules of Fight Club to Edward Norton. The first rule of Fight Club is: You do not talk about Fight Club. The second rule of Fight Club is: You do not talk about Fight Club.","genre": "Drama","released": "1999"},
{"title": "Inception","runtime": "148","plot": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O.","keyScene": "Leonardo DiCaprio explains the concept of inception to Ellen Page by using a child's spinning top.","genre": "Action, Adventure, Sci-Fi, Thriller","released": "2010"},
{"title": "The Matrix","runtime": "136","plot": "A computer hacker learns from mysterious rebels about the true nature of his reality and his role in the war against its controllers.","keyScene": "Red pill or blue pill? Morpheus offers Neo a choice between the red pill, which will allow him to learn the truth about the Matrix, or the blue pill, which will return him to his former life.","genre": "Action, Sci-Fi","released": "1999"},
{"title": "The Shawshank Redemption","runtime": "142","plot": "Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.","keyScene": "Andy Dufresne escapes from Shawshank prison by crawling through a sewer pipe.","genre": "Drama","released": "1994"},
{"title": "Goodfellas","runtime": "146","plot": "The story of Henry Hill and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.","keyScene": "Joe Pesci's character Tommy DeVito shoots young Spider in the foot for not getting him a drink.","genre": "Biography, Crime, Drama","released": "1990"},
{"title": "Se7en","runtime": "127","plot": "Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly sins as his motives.","keyScene": "Brad Pitt's character David Mills shoots John Doe after he reveals that he murdered Mills' wife.","genre": "Crime, Drama, Mystery, Thriller","released": "1995"},
{"title": "The Silence of the Lambs","runtime": "118","plot": "A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.","keyScene": "Hannibal Lecter explains to Clarice Starling that he ate a census taker's liver with some fava beans and a nice Chianti.","genre": "Crime, Drama, Thriller","released": "1991"},
{"title": "The Godfather","runtime": "175","plot": "An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.","keyScene": "James Caan's character Sonny Corleone is shot to death at a toll booth by a number of machine gun toting enemies.","genre": "Crime, Drama","released": "1972"},
{"title": "The Departed","runtime": "151","plot": "An undercover cop and a mole in the police attempt to identify each other while infiltrating an Irish gang in South Boston.","keyScene": "Leonardo DiCaprio's character Billy Costigan is shot to death by Matt Damon's character Colin Sullivan.","genre": "Crime, Drama, Thriller","released": "2006"},
{"title": "The Usual Suspects","runtime": "106","plot": "A sole survivor tells of the twisty events leading up to a horrific gun battle on a boat, which began when five criminals met at a seemingly random police lineup.","keyScene": "Kevin Spacey's character Verbal Kint is revealed to be the mastermind behind the crime, when his limp disappears as he walks away from the police station.","genre": "Crime, Mystery, Thriller","released": "1995"}
]
$ pwd
/Users/liuxg/python/elser
$ ls movies.json 
movies.json
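Before indexing, it can be worth checking that every record in movies.json carries the fields used later (the plot field in particular feeds the embedding). Below is a minimal sketch under that assumption; the names EXPECTED_FIELDS, find_invalid_records and load_movies are hypothetical and not part of the original notebook.

```python
import json

# Field names taken from the records in movies.json.
EXPECTED_FIELDS = {"title", "runtime", "plot", "keyScene", "genre", "released"}


def find_invalid_records(records, expected=EXPECTED_FIELDS):
    """Return (position, missing_fields) for each record lacking an expected field."""
    problems = []
    for position, record in enumerate(records):
        missing = sorted(expected - set(record))
        if missing:
            problems.append((position, missing))
    return problems


def load_movies(path="movies.json"):
    """Load the dataset; json.load raises an error if the file is not valid JSON."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Calling find_invalid_records(load_movies()) in the directory that holds movies.json should return an empty list when all records are complete.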

Application design

In the current directory, enter the following command to start Jupyter:

jupyter notebook

Import the required packages

from elasticsearch import Elasticsearch, helpers, exceptions
import json
import time, os
from dotenv import load_dotenv

load_dotenv()

openai_api_key = os.getenv('OPENAI_API_KEY')
elastic_user = os.getenv('ES_USER')
elastic_password = os.getenv('ES_PASSWORD')

url = f"https://{elastic_user}:{elastic_password}@localhost:9200"
client = Elasticsearch(url, ca_certs="./http_ca.crt", verify_certs=True)
print(client.info())

From the output above, we can see that the client connected successfully. For more ways to connect to Elasticsearch, see the article “Elasticsearch: Everything you need to know about using Elasticsearch in Python - 8.x”.
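The connection code places the user name and password directly in the URL. If the password contains characters such as @, : or /, the URL may be parsed incorrectly. The following is a minimal sketch of one way around that, percent-encoding the credentials with the standard library; the helper name build_es_url is hypothetical.

```python
from urllib.parse import quote


def build_es_url(user, password, host="localhost", port=9200):
    """Build an https URL whose credentials are percent-encoded."""
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"https://{safe_user}:{safe_password}@{host}:{port}"
```

Another option, which avoids URL encoding altogether, is to pass the credentials separately through the client's basic_auth=(user, password) argument.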

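Create the inference task

Let's create the inference task using the create inference API.

To do this, you need an OpenAI API key, which you can find under the API keys section of your OpenAI account. Because usage of the OpenAI free trial API is limited, a paid membership is required to complete the steps in this notebook.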

client.inference.put_model(
    task_type="text_embedding",
    model_id="my_openai_embedding_model",
    body={
        "service": "openai",
        "service_settings": {"api_key": openai_api_key},
        "task_settings": {"model": "text-embedding-ada-002"},
    },
)

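Create an ingest pipeline with an inference processor

Use the put_pipeline method to create an ingest pipeline with an inference processor. The processor references the OpenAI model created above to run inference on the data being ingested.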

client.ingest.put_pipeline(
    id="openai_embeddings_pipeline",
    description="Ingest pipeline for OpenAI inference.",
    processors=[
        {
            "inference": {
                "model_id": "my_openai_embedding_model",
                "input_output": {
                    "input_field": "plot",
                    "output_field": "plot_embedding",
                },
            }
        }
    ],
)

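Let's note some important parameters in this API call:

  • inference: a processor that performs inference using a machine learning model.
  • model_id: the ID of the machine learning model to use. In this example, the model ID is my_openai_embedding_model. Use the model ID you defined when creating the inference task.
  • input_output: specifies the input and output fields.
  • input_field: the name of the field from which the dense vector representation is created.
  • output_field: the name of the field that contains the inference results.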
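To make the relationship between these parameters concrete, here is a minimal sketch of a helper that assembles the processor definition from them. The function name build_inference_processor is hypothetical; the dictionary it returns has the same shape as the processor passed to put_pipeline above.

```python
def build_inference_processor(model_id, input_field, output_field):
    """Return an inference processor definition for an ingest pipeline."""
    return {
        "inference": {
            "model_id": model_id,
            "input_output": {
                "input_field": input_field,    # text to embed
                "output_field": output_field,  # where the embedding is stored
            },
        }
    }


# Reproduces the processor used in this article.
processor = build_inference_processor(
    "my_openai_embedding_model", "plot", "plot_embedding"
)
```

The resulting processor can be passed as processors=[processor] to client.ingest.put_pipeline.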

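Create the index

We must create the mapping of the destination index, that is, the index that contains the embeddings the model creates from your input text. The destination index must have a field of type dense_vector to index the output of the OpenAI model.

Let's create an index named openai-movie-embeddings with the mapping we need.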

client.indices.delete(index="openai-movie-embeddings", ignore_unavailable=True)
client.indices.create(
    index="openai-movie-embeddings",
    settings={"index": {"default_pipeline": "openai_embeddings_pipeline"}},
    mappings={
        "properties": {
            "plot_embedding": {
                "type": "dense_vector",
                "dims": 1536,
                "similarity": "dot_product",
            },
            "plot": {"type": "text"},
        }
    },
)
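The mapping above uses dot_product similarity, which in Elasticsearch expects vectors normalized to unit length; OpenAI describes its embeddings as already normalized to length 1. The following minimal sketch shows how one might verify that property for a vector, or normalize a vector from another source. The helper names are hypothetical and the toy vectors are two-dimensional for readability, whereas the mapping expects 1536 dimensions.

```python
import math


def l2_norm(vector):
    """Euclidean length of the vector."""
    return math.sqrt(sum(x * x for x in vector))


def is_unit_length(vector, tolerance=1e-6):
    """True if the vector's length is within `tolerance` of 1."""
    return abs(l2_norm(vector) - 1.0) <= tolerance


def normalize(vector):
    """Scale the vector to unit length."""
    norm = l2_norm(vector)
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]
```

If the vectors were not unit length, cosine similarity would be the usual alternative to dot_product in the mapping.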

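Insert documents

Let's insert the sample dataset of 12 movies. You need a paid OpenAI account to complete this step; otherwise document ingestion will time out because of API request rate limits.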

from elasticsearch import helpers

with open('movies.json') as f:
    data_json = json.load(f)

# Prepare the documents to be indexed
documents = []
for doc in data_json:
    documents.append({
        "_index": "openai-movie-embeddings",
        "_source": doc,
    })

# Use helpers.bulk to index
helpers.bulk(client, documents)
print("Done indexing documents into `openai-movie-embeddings` index!")
time.sleep(3)
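The code above builds the whole list of bulk actions in memory, which is fine for 12 movies. For larger datasets, a generator is a common alternative, since helpers.bulk accepts any iterable of actions. The following is a minimal sketch of that variation; the function name generate_actions is hypothetical.

```python
def generate_actions(docs, index_name):
    """Yield one bulk action per document without building a full list."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


# Small demonstration with in-memory data instead of movies.json.
sample = [{"title": "Fight Club", "plot": "An insomniac office worker ..."}]
actions = list(generate_actions(sample, "openai-movie-embeddings"))
```

With a live cluster, the call would become helpers.bulk(client, generate_actions(data_json, "openai-movie-embeddings")); the index's default pipeline still adds the embedding to each document.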

We can check the result in Kibana:

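Semantic search

After enriching the dataset with embeddings, you can query the data using semantic search. Pass a query_vector_builder to the k-nearest neighbor (kNN) vector search API, and provide the query text and the model used to create the embeddings.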

response = client.search(
    index='openai-movie-embeddings',
    size=3,
    knn={
        "field": "plot_embedding",
        "query_vector_builder": {
            "text_embedding": {
                "model_id": "my_openai_embedding_model",
                "model_text": "Fighting movie",
            }
        },
        "k": 10,
        "num_candidates": 100,
    },
)

for hit in response['hits']['hits']:
    doc_id = hit['_id']
    score = hit['_score']
    title = hit['_source']['title']
    plot = hit['_source']['plot']
    print(f"Score: {score}\nTitle: {title}\nPlot: {plot}\n")
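To illustrate what the kNN search does conceptually, here is a minimal brute-force sketch in plain Python. It is not how Elasticsearch executes the query (Elasticsearch uses an approximate search over an index structure), and the toy two-dimensional vectors are made up for illustration. The score formula (1 + dot product) / 2 follows my understanding of how Elasticsearch scores dot_product similarity for float vectors; treat it as an assumption. The function names are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_knn(query_vector, docs, k=3):
    """Score every (title, vector) pair and return the top k as (score, title)."""
    scored = [((1 + dot(query_vector, vec)) / 2, title) for title, vec in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy unit-length vectors standing in for real 1536-dimensional embeddings.
toy_docs = [("A", [1.0, 0.0]), ("B", [0.0, 1.0]), ("C", [-1.0, 0.0])]
top = brute_force_knn([1.0, 0.0], toy_docs, k=2)
```

In the request above, k=10 nearest neighbours are collected from num_candidates=100 candidates per shard, and size=3 limits how many hits are returned.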

The final source code can be downloaded from: https://github.com/liu-xiao-guo/semantic_search_es/blob/main/semantic_search_using_the_inference_API.ipynb
