Elasticsearch:使用 huggingface 模型的 NLP 文本搜索

本博文使用由 Elastic 博客 title 组成的简单数据集在 Elasticsearch 中实现 NLP 文本搜索。你将为博客文档建立索引,并使用摄取管道生成文本嵌入。 通过使用 NLP 模型,你将使用自然语言在博客文档上查询文档。

 安装

Elasticsearch 及 Kibana

如果你还没有安装好自己的 Elasticsearch 及 Kibana,请参考如下的链接来进行安装:

  • 如何在 Linux,MacOS 及 Windows 上进行安装 Elasticsearch

  • Kibana:如何在 Linux,MacOS 及 Windows 上安装 Elastic 栈中的 Kibana

在安装的时候,我们可以选择 Elastic Stack 8.x 的安装指南来进行安装。在本博文中,我将使用最新的 Elastic Stack 8.10 来进行展示。

在安装 Elasticsearch 的过程中,我们需要记下如下的信息:

由于我们将要使用到 eland 来上传模型。这是一个收费的功能。我们需要启动试用功能:

Python 安装包

在本演示中,我们将使用 Python 来进行展示。我们需要安装访问 Elasticsearch 相应的安装包 elasticsearch:

python3 -m pip install -qU sentence-transformers eland elasticsearch transformers

我们将使用 Jupyter Notebook 来进行展示。

$ pwd
/Users/liuxg/python/elser
$ jupyter notebook

准备数据

我们在项目的根目录下,创建如下的一个数据文件: data.json:

data.json

[{"title":"Pulp Fiction","runtime":"154","plot":"The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.","keyScene":"John Travolta is forced to inject adrenaline directly into Uma Thurman's heart after she overdoses on heroin.","genre":"Crime, Drama","released":"1994"},{"title":"The Dark Knight","runtime":"152","plot":"When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.","keyScene":"Batman angrily responds 'I’m Batman' when asked who he is by Falcone.","genre":"Action, Crime, Drama, Thriller","released":"2008"},{"title":"Fight Club","runtime":"139","plot":"An insomniac office worker and a devil-may-care soapmaker form an underground fight club that evolves into something much, much more.","keyScene":"Brad Pitt explains the rules of Fight Club to Edward Norton. The first rule of Fight Club is: You do not talk about Fight Club. The second rule of Fight Club is: You do not talk about Fight Club.","genre":"Drama","released":"1999"},{"title":"Inception","runtime":"148","plot":"A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into thed of a C.E.O.","keyScene":"Leonardo DiCaprio explains the concept of inception to Ellen Page by using a child's spinning top.","genre":"Action, Adventure, Sci-Fi, Thriller","released":"2010"},{"title":"The Matrix","runtime":"136","plot":"A computer hacker learns from mysterious rebels about the true nature of his reality and his role in the war against its controllers.","keyScene":"Red pill or blue pill? Morpheus offers Neo a choice between the red pill, which will allow him to learn the truth about the Matrix, or the blue pill, which will return him to his former life.","genre":"Action, Sci-Fi","released":"1999"},{"title":"The Shawshank Redemption","runtime":"142","plot":"Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.","keyScene":"Andy Dufresne escapes from Shawshank prison by crawling through a sewer pipe.","genre":"Drama","released":"1994"},{"title":"Goodfellas","runtime":"146","plot":"The story of Henry Hill and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.","keyScene":"Joe Pesci's character Tommy DeVito shoots young Spider in the foot for not getting him a drink.","genre":"Biography, Crime, Drama","released":"1990"},{"title":"Se7en","runtime":"127","plot":"Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly sins as his motives.","keyScene":"Brad Pitt's character David Mills shoots John Doe after he reveals that he murdered Mills' wife.","genre":"Crime, Drama, Mystery, Thriller","released":"1995"},{"title":"The Silence of the Lambs","runtime":"118","plot":"A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.","keyScene":"Hannibal Lecter explains to Clarice Starling that he ate a census taker's liver with some fava beans and a nice Chianti.","genre":"Crime, Drama, Thriller","released":"1991"},{"title":"The Godfather","runtime":"175","plot":"An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.","keyScene":"James Caan's character Sonny Corleone is shot to death at a toll booth by a number of machine gun toting enemies.","genre":"Crime, Drama","released":"1972"},{"title":"The Departed","runtime":"151","plot":"An undercover cop and a mole in the police attempt to identify each other while infiltrating an Irish gang in South Boston.","keyScene":"Leonardo DiCaprio's character Billy Costigan is shot to death by Matt Damon's character Colin Sullivan.","genre":"Crime, Drama, Thriller","released":"2006"},{"title":"The Usual Suspects","runtime":"106","plot":"A sole survivor tells of the twisty events leading up to a horrific gun battle on a boat, which began when five criminals met at a seemingly random police lineup.","keyScene":"Kevin Spacey's character Verbal Kint is revealed to be the mastermind behind the crime, when his limp disappears as he walks away from the police station.","genre":"Crime, Mystery, Thriller","released":"1995"}
]
$ pwd
/Users/liuxg/python/elser
$ ls
Multilingual semantic search.ipynb
NLP text search using hugging face transformer model.ipynb
Semantic search - ELSER.ipynb
data.json

创建应用并演示

import modules

import pandas as pd, json
from elasticsearch import Elasticsearch
from getpass import getpass
from urllib.request import urlopen

部署 NLP 模型

我们将使用 eland 工具来安装 text_embedding 模型。 对于我们的模型,我们使用 all-MiniLM-L6-v2 将搜索文本转换为密集向量。

该模型会将你的搜索查询转换为向量,该向量将用于对 Elasticsearch 中存储的文档集进行搜索。

我们在 terminal 中打入如下的命令:

eland_import_hub_model --url https://elastic:vXDWYtL*my3vnKY9zCfL@localhost:9200 \--hub-model-id sentence-transformers/all-MiniLM-L6-v2 \--task-type text_embedding \--ca-cert /Users/liuxg/elastic/elasticsearch-8.10.0/config/certs/http_ca.crt \--start

请注意

  • 我们需要根据自己的部署来替换上面的 elastic 超级用户的密码
  • 我们需要根据自己的 Elasticsearch 集群的部署来替换上面的 Elasticsearch 访问地址
  • 我们需要根据自己的部署的证书来替换上面的证书路径

我们回到 Kibana 的界面:

连接到 Elasticsearch

我们创建一个客户端连接:

ELASTCSEARCH_CERT_PATH = "/Users/liuxg/elastic/elasticsearch-8.10.0/config/certs/http_ca.crt"es = Elasticsearch(  ['https://localhost:9200'],basic_auth = ('elastic', 'vXDWYtL*my3vnKY9zCfL'),ca_certs = ELASTCSEARCH_CERT_PATH,verify_certs = True)
print(es.info())

创建 ingest pipeline

我们需要创建一个文本嵌入提取管道来生成 title 字段的向量(文本)嵌入。

下面的管道定义了一个用于 NLP 模型的 inference 处理器。

# ingest pipeline definition
PIPELINE_ID="vectorize_blogs"es.ingest.put_pipeline(id=PIPELINE_ID, processors=[{"inference": {"model_id": "sentence-transformers__all-minilm-l6-v2","target_field": "text_embedding","field_map": {"title": "text_field"}}}])

创建带有映射的索引

现在,在索引文档之前,我们将创建一个具有正确映射的 Elasticsearch 索引。 我们添加 text_embedding 以包含 model_id 和 Predicted_value 来存储嵌入。

# define index name
INDEX_NAME="blogs"# flag to check if index has to be deleted before creating
SHOULD_DELETE_INDEX=True# define index mapping
INDEX_MAPPING = {"properties": {"title": {"type": "text","fields": {"keyword": {"type": "keyword","ignore_above": 256}}},"text_embedding": {"properties": {"is_truncated": {"type": "boolean"},"model_id": {"type": "text","fields": {"keyword": {"type": "keyword","ignore_above": 256}}},"predicted_value": {"type": "dense_vector","dims": 384,"index": True,"similarity": "l2_norm"}}}}}INDEX_SETTINGS = {"index": {"number_of_replicas": "1","number_of_shards": "1","default_pipeline": PIPELINE_ID}
}# check if we want to delete index before creating the index
if(SHOULD_DELETE_INDEX):if es.indices.exists(index=INDEX_NAME):print("Deleting existing %s" % INDEX_NAME)client.options(ignore_status=[400, 404]).indices.delete(index=INDEX_NAME)print("Creating index %s" % INDEX_NAME)
es.options(ignore_status=[400,404]).indices.create(index=INDEX_NAME, mappings=INDEX_MAPPING, settings=INDEX_SETTINGS)

摄入数据到 Elasticsearch

让我们使用摄取管道对示例博客数据进行索引。

注意:在我们开始索引之前,请确保你已启动训练模型部署。

from elasticsearch import helpers# Load data into a JSON object
with open('data.json') as f:data_json = json.load(f)print(data_json)# Prepare the documents to be indexed
documents = []
for doc in data_json:documents.append({"_index": "blogs","_source": doc,})# Use helpers.bulk to index
helpers.bulk(client, documents)

我们可以回到 Kibana 的界面查看被写入的文档:

GET blogs/_search

查询数据集

下一步是运行查询来搜索相关博客。 该示例查询使用我们上传到 Elasticsearch Sentence-transformers__all-minilm-l6-v2 的模型来搜索 “model_text”: “scientific fiction”。

该过程是一个查询,尽管它内部包含两个任务。 首先,查询将使用 NLP 模型为您的搜索文本生成一个向量,然后传递该向量以在数据集上进行搜索。

结果,输出显示按照与搜索查询的接近度排序的查询文档列表。

INDEX_NAME="blogs"source_fields = [ "id", "title"]query = {"field": "text_embedding.predicted_value","k": 10,"num_candidates": 50,"query_vector_builder": {"text_embedding": {"model_id": "sentence-transformers__all-minilm-l6-v2","model_text": "scientific fiction"}}
}response = es.search(index=INDEX_NAME,fields=source_fields,knn=query,source=False)results = pd.json_normalize(json.loads(json.dumps(response.body['hits']['hits'])))# shows the result
results[['_id', '_score', 'fields.title']]

上面命令显示的结果为:

你可尝试另外的一个搜索,比如:dark knight

最终的 jupyter 文件可以在地址下载。

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.rhkb.cn/news/152983.html

如若内容造成侵权/违法违规/事实不符,请联系长河编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

如何查看postgresql中的数据库大小?

你可以使用以下命令来查看PostgreSQL数据库的大小: SELECT pg_database.datname as "database_name", pg_size_pretty(pg_database_size(pg_database.datname)) AS size_in_mb FROM pg_database ORDER by size_in_mb DESC;这将返回一个表格&#xff0…

Python爬虫技术系列-03requests库案例-完善

Python爬虫技术系列-03requests库案例 参考1 Requests基本使用1.1 Requests库安装与使用1.1.1 Requests库安装1.1.2 Rrequests库介绍1.1.3 使用Requests一般分为三个步骤1.1.4 requests的公共方法 2 Requests库使用案例2.1 GET请求携带参数和headers2.2 POST请求,写…

消息队列 Kafka

Kafka Kafka 是一个分布式的基于发布/订阅模式的消息队列(MQ,Message Queue),主要应用于大数据实时处理领域 为什么使用消息队列MQ 在高并发环境下,同步请求来不及处理会发生堵塞,从而触发too many conne…

【java学习】多维数组(10)

文章目录 1. 二维数组 1. 二维数组 二维数组[][]:数组中的数组 格式1(动态初始化):int[][] arr new int[3][2]; 解释说明: 定义了名称为arr的二维数组二维数组中有3个一维数组每个一维数组中有2个元素一维数组的名称…

亘古难题——前端开发or后端开发

一、引言 前端开发 前端开发是创建WEB页面或APP等前端界面呈现给用户的过程,通过HTML,CSS及JavaScript以及衍生出来的各种技术、框架、解决方案,来实现互联网产品的用户界面交互。 前端开发从网页制作演变而来,名称上有很明显的时…

遥感云大数据在灾害、水体与湿地领域典型案 例实践及 GPT 模型应用

近年来遥感技术得到了突飞猛进的发展,航天、航空、临近空间等多遥感平台不断增加,数据的空间、时间、光谱分辨率不断提高,数据量猛增,遥感数据已经越来越具有大数据特征。遥感大数据的出现为相关研究提供了前所未有的机遇&#xf…

MM-Camera架构-ProcessCaptureRequest 流程分析

文章目录 processCaptureRequest\_3\_41.1 mDevice1.2 mDevice->ops->process\_capture\_request1.3 hardware to vendor mct\_shimlayer\_process\_event2.1 mct\_shimlayer\_handle\_parm2.2 mct\_shimlayer\_reg\_buffer processCaptureRequest_3_4 sdm660的摄像头走…

C++设计模式-享元(Flyweight)

目录 C设计模式-享元(Flyweight) 一、意图 二、适用性 三、结构 四、参与者 五、代码 C设计模式-享元(Flyweight) 一、意图 运用共享技术有效地支持大量细粒度的对象。 二、适用性 一个应用程序使用了大量的对象。完全由…

基于matlab统计Excel文件一列数据中每个数字出现的频次和频率

一、需求描述 如上表所示,在excel文件中,有一列数,统计出该列数中,每个数出现的次数和频率。最后,将统计结果输出到新的excel文件中。 二、程序讲解 第一步:选择excel文件; [Filename, Pathn…

自然语言处理(NLP)的开发框架

自然语言处理(NLP)领域有许多开源的框架和库,用于处理文本数据和构建NLP应用程序。以下是一些常见的NLP开源框架及其特点,希望对大家有所帮助。北京木奇移动技术有限公司,专业的软件外包开发公司,欢迎交流合…

基于差分进化优化的BP神经网络(分类应用) - 附代码

基于差分进化优化的BP神经网络(分类应用) - 附代码 文章目录 基于差分进化优化的BP神经网络(分类应用) - 附代码1.鸢尾花iris数据介绍2.数据集整理3.差分进化优化BP神经网络3.1 BP神经网络参数设置3.2 差分进化算法应用 4.测试结果…

20231005使用ffmpeg旋转MP4视频

20231005使用ffmpeg旋转MP4视频 2023/10/5 12:21 百度搜搜:ffmpeg 旋转90度 https://zhuanlan.zhihu.com/p/637790915 【FFmpeg实战】FFMPEG常用命令行 https://blog.csdn.net/weixin_37515325/article/details/127817057 FFMPEG常用命令行 5.视频旋转 顺时针旋转…

Jupyter notebook怎么设置自动跳转问题

1.点击开始,就可以看到Jupyter,然后点击 2.结果就这样: 3你可以复制地址到浏览器,结果: 但是这么做很麻烦,所以有没有更好的办法呢?当然有下面就开始介绍 1.打开cmd(winr,输入cmd),输入以下命令…

JMeter 做接口性能测试,YYDS!

简介 本文写给想了解性能测试和JMeter的小白,适合对这两者了解很少的同学们,如果已经有使用经验的请绕道,别浪费时间:-) 我们将介绍JMeter的使用场景,如何安装、运行JMeter,以及开始一个最最简单的测试。 你还徘徊在…

LeetCode 1277. 统计全为 1 的正方形子矩阵【动态规划】1613

本文属于「征服LeetCode」系列文章之一,这一系列正式开始于2021/08/12。由于LeetCode上部分题目有锁,本系列将至少持续到刷完所有无锁题之日为止;由于LeetCode还在不断地创建新题,本系列的终止日期可能是永远。在这一系列刷题文章…

近期分享学习心得3

1、全屏组件封装 先看之前大屏端的监控部分全屏代码 整块全屏代码 常规流是下面这种 //进入全屏 function full(ele) {//if (ele.requestFullscreen) {// ele.requestFullscreen();//} else if (ele.mozRequestFullScreen) {// ele.mozRequestFullScreen();//} el…

BUUCTF [MRCTF2020]Ez_bypass1

这道题全程我都是用bp做的 拿到题目 我们查看页面源代码得到 代码审计 我们要用get传入id和gg两个参数,id和gg的值要求不能相等,但是id和gg的md5强比较必须相等 if(isset($_GET[gg])&&isset($_GET[id])) {$id$_GET[id];$gg$_GET[gg];if (md5($…

Direct3D网格(一)

创建网格 我们可以用D3DXCreateMeshFVF函数创建一个"空"网格对象 ,空网格对象是指我们指定了网格的面片总数和顶点总数,然后由该函数为顶点缓存、索引缓存和属性缓存分配大小合适的内存,之后即可手工填入网格数据。 HRESULT WINA…

计算机专业毕业设计项目推荐14-文档编辑平台(SpringBoot+Vue+Mysql)

文档编辑平台(SpringBootVueMysql) **介绍****各部分模块实现** 介绍 本系列(后期可能博主会统一为专栏)博文献给即将毕业的计算机专业同学们,因为博主自身本科和硕士也是科班出生,所以也比较了解计算机专业的毕业设计流程以及模式,在编写的…

【C++设计模式之解释器模式:行为型】分析及示例

简介 解释器模式(Interpreter Pattern)是一种行为型设计模式,它提供了一种解决问题的方法,通过定义语言的文法规则,解释并执行特定的语言表达式。 解释器模式通过使用表达式和解释器,将文法规则中的句子逐…