Table of Contents
- Preparation
- Loading documents
- Splitting documents
- Embedding
- Vector store
- Querying the vector store
- Retrieval
- Returning scores
- Embedding the query text before retrieval
- Retrievers
- Summary
- Code
The retrieval we use in search engines such as Baidu, Bing, and Google is string-based: after the user enters a query, the engine first tokenizes it and then looks up the best-matching results in a huge, inverted-indexed database.
Semantic retrieval differs mainly in that it searches by the actual meaning of the text. The basic idea is to convert all the content to be searched into vectors (a process also called embedding), following the principle that semantically similar content ends up closer together and has higher similarity.
When the user enters a query, it is likewise converted into a vector first, and the most similar documents are then found in the vector database. The results therefore depend on semantic similarity rather than on the literal wording.
This article describes how to use langchain, a large language model, and a vector database to perform semantic retrieval over pdf content.
For vectorizing the content we use nomic-embed-text, a small model that gives good results for English embeddings.
The article also covers the following topics:
- Documents and document loaders
- Text splitters
- Embeddings
- Vector stores and retrievers
Preparation
Before we start writing code, we need to set up the development environment.
- Computer
  All the code in this article can run on a machine without a GPU. The machine I use is configured as follows:
  - CPU: Intel i5-8400 2.80GHz
  - RAM: 16GB
- Visual Studio Code and venv
  These are very popular development tools; the code for this series can be developed and debugged in Visual Studio Code. We use python's venv to create a virtual environment; for details, see: Configuring venv in Visual Studio Code.
- Ollama
  Deploying local large language models on the Ollama platform is very convenient. Based on this platform, we can let langchain use llama3.1, qwen2.5, and other local models. For details, see: Using the locally deployed llama3.1 model in langchain.
Loading documents
LangChain implements a Document abstraction that can load pdf, csv, html and other files as Document objects. A Document has three attributes:
- page_content: a string representing the content;
- metadata: a dict containing arbitrary metadata;
- id: (optional) a string identifier for the document.
The metadata attribute can capture information about the document's source, its relationship to other documents, and more. Note that a single Document is not necessarily a complete section of the source file; it is usually only a part of it.
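As a quick illustration of this abstraction, here is a small hedged sketch that constructs a Document by hand; the content and metadata values are made up for illustration and are not taken from this article's data:

# A hedged sketch: build a Document by hand to show its three attributes.
# The page_content, metadata, and id values below are illustrative placeholders.
from langchain_core.documents import Document

doc = Document(
    page_content="NIKE, Inc. designs, develops and sells athletic footwear.",
    metadata={"source": "example.pdf", "page": 0},
    id="example-doc-1",
)
print(doc.page_content)
print(doc.metadata)

In practice we rarely construct Document objects manually; a document loader produces them for us, as shown next.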
Here is the code that loads a pdf document:
def load_file(file_path):
    """Load a pdf file."""
    # Loading documents
    from langchain_community.document_loaders import PyPDFLoader

    loader = PyPDFLoader(file_path)
    docs = loader.load()
    print(f'File loaded successfully, total documents: {len(docs)}')
    # PyPDFLoader loads one Document object per PDF page. The first page is at index 0.
    print(f"page one:\n{docs[0].page_content[:200]}\n")
    print(f'page one metadata:\n{docs[0].metadata}')
    return docs
After running this method, we can see the basic structure of the loaded document:
page one:
Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
F
page one metadata:
{'source': 'E:\\project\\my_opensource\\programming-with-local-large-language-model-gitee\\server\\services\\practice\\assert/nke-10k-2023.pdf', 'page': 0, 'page_label': '1'}
langchain provides a large number of Document Loaders; for details, see: Document loaders.
Splitting documents
If we vectorize whole Document objects, the granularity is usually too coarse, and in scenarios such as question answering it is hard to find good results. Below we split these Document objects further, in a way that tries to ensure the meaning of each part is not "diluted" by the surrounding text.
We use RecursiveCharacterTextSplitter, which recursively splits the documents on common separators (such as newlines) until each chunk has a suitable size. Its parameters mean the following:
- chunk_size=1000
  The chunk_size parameter specifies the size of each text chunk. Setting it to 1000 means each split chunk is roughly 1000 characters long.
- chunk_overlap=200
  The chunk_overlap parameter specifies how many characters adjacent chunks share. Setting it to 200 means two neighboring chunks overlap by 200 characters, so context that crosses a chunk boundary is not lost, which helps with understanding and processing the text.
- add_start_index=True
  The add_start_index parameter is a boolean; when set to True, the splitter records in each chunk the index where that chunk starts in the original document. This is very useful for later processing and referencing, because it makes it easy to locate each chunk in the original document.
def split_text(docs):
    """Split the documents."""
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        add_start_index=True,
    )
    all_splits = text_splitter.split_documents(docs)
    print(f"Number of splits: {len(all_splits)}")
    return all_splits
Embedding
Vector search is a common way to store and search unstructured data such as unstructured text. The idea is to store numeric vectors associated with the text. Given a query, we can embed it as a vector of the same dimensionality and use a vector similarity metric (such as cosine similarity) to identify related text.
Here we use Ollama's nomic-embed-text model for the embeddings.
langchain supports many embedding models; for details, see: Embedding models.
from langchain_ollama.embeddings import OllamaEmbeddings
embeddings = OllamaEmbeddings(model="nomic-embed-text")
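To get a feel for what these embeddings look like, here is a small hedged sketch that embeds two sentences and compares them with cosine similarity. It assumes the nomic-embed-text model has already been pulled locally via Ollama; the example sentences and the cosine_similarity helper are illustrative, not part of this article's code:

# A hedged sketch: embed two sentences and compare them with cosine similarity.
# The sentences are arbitrary; the vector length depends on the embedding model.
import math

vec1 = embeddings.embed_query("Nike designs and sells athletic footwear.")
vec2 = embeddings.embed_query("The company makes sports shoes.")

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(f"vector length: {len(vec1)}")
print(f"cosine similarity: {cosine_similarity(vec1, vec2):.4f}")

Semantically related sentences such as these should score noticeably higher than two unrelated ones; that is the property the vector store relies on.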
Vector store
Here we simply use InMemoryVectorStore, which keeps the vectors in memory.
Of course, we could also store the vectors on disk and reuse them later; a later article will demonstrate that process with Chroma.
def get_vector_store():
    """Get the in-memory vector store."""
    from langchain_core.vectorstores import InMemoryVectorStore

    vector_store = InMemoryVectorStore(embeddings)
    file_path = get_file_path()
    docs = load_file(file_path)
    all_splits = split_text(docs)
    _ = vector_store.add_documents(documents=all_splits)
    return vector_store
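For reference, a persisted store would look roughly like the following minimal sketch. It assumes the langchain_chroma package is installed; the function name, the collection name "nike_10k", and the "./chroma_db" directory are placeholders for illustration, not values from this article:

# A hedged sketch of persisting the vectors to disk with Chroma.
# "nike_10k" and "./chroma_db" are illustrative placeholders.
from langchain_chroma import Chroma

def get_persistent_vector_store():
    docs = load_file(get_file_path())
    all_splits = split_text(docs)
    # Builds the collection on disk; later runs can reopen it without re-embedding.
    return Chroma.from_documents(
        documents=all_splits,
        embedding=embeddings,
        collection_name="nike_10k",
        persist_directory="./chroma_db",
    )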
Querying the vector store
Texts with similar meanings produce vectors that are geometrically close. We can retrieve relevant information simply by passing in a question, without knowing any of the specific keywords used in the document.
Retrieval
Define the retrieval method:
def similarity_search(query):
    """Test retrieval against the in-memory vector store."""
    vector_store = get_vector_store()
    results = vector_store.similarity_search(query)
    return results
Test the retrieval:
results = similarity_search("How many distribution centers does Nike have in the US?")
print(f'similarity_search results[0]:\n{results[0]}')
similarity_search results[0]:
page_content='direct to consumer operations sell products through the following number of retail stores in the United States:
U.S. RETAIL STORES NUMBER
NIKE Brand factory stores 213
NIKE Brand in-line stores (including employee-only stores) 74
Converse stores (including factory stores) 82
TOTAL 369
In the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.
2023 FORM 10-K 2' metadata={'source': 'E:\\project\\my_opensource\\programming-with-local-large-language-model-gitee\\server\\services\\practice\\assert/nke-10k-2023.pdf', 'page': 4, 'page_label': '5', 'start_index': 3125}
Returning scores
Define the retrieval method:
def similarity_search_with_score(query):
    """Test retrieval against the in-memory vector store,
    returning a score for each document; the higher the score, the more similar the document.
    """
    vector_store = get_vector_store()
    results = vector_store.similarity_search_with_score(query)
    return results
Test the retrieval:
results = similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score and doc: {score}\n{doc}")
Score and doc: 0.800869769173528
page_content='UNITED STATES MARKET
For fiscal 2023, NIKE Brand and Converse sales in the United States accounted for approximately 43% of total revenues, compared to 40% and 39% for fiscal 2022 and
fiscal 2021, respectively. We sell our products to thousands of retail accounts in the United States, including a mix of footwear stores, sporting goods stores, athletic
specialty stores, department stores, skate, tennis and golf shops and other retail accounts. In the United States, we utilize NIKE sales offices to solicit such sales. During
fiscal 2023, our three largest United States customers accounted for approximately 22% of sales in the United States.
Our NIKE Direct and Converse direct to consumer operations sell our products to consumers through various digital platforms. In addition, our NIKE Direct and Converse
direct to consumer operations sell products through the following number of retail stores in the United States:
U.S. RETAIL STORES NUMBER
NIKE Brand factory stores 213' metadata={'source': 'E:\\project\\my_opensource\\programming-with-local-large-language-model-gitee\\server\\services\\practice\\assert/nke-10k-2023.pdf', 'page': 4, 'page_label': '5', 'start_index': 2311}
Embedding the query text before retrieval
Define the retrieval method:
def embed_query(query):
    """Test embedding the query first, then searching by vector."""
    embedding = embeddings.embed_query(query)
    vector_store = get_vector_store()
    results = vector_store.similarity_search_by_vector(embedding)
    return results
Test the retrieval:
results = embed_query("How were Nike's margins impacted in 2023?")
print(f'embed_query results[0]:\n{results[0]}')
embed_query results[0]:
page_content='and 18% of total NIKE Brand footwear, respectively. For fiscal 2023, four footwear contract manufacturers each accounted for greater than 10% of footwear production
and in the aggregate accounted for approximately 58% of NIKE Brand footwear production.
As of May 31, 2023, our contract manufacturers operated 291 finished goods apparel factories located in 31 countries. For fiscal 2023, NIKE Brand apparel finished goods
were manufactured by 55 contract manufacturers, many of which operate multiple factories. The largest single finished goods apparel factory accounted for approximately
8% of total fiscal 2023 NIKE Brand apparel production. For fiscal 2023, factories in Vietnam, China and Cambodia manufactured approximately 29%, 18% and 16%
2023 FORM 10-K 3' metadata={'source': 'E:\\project\\my_opensource\\programming-with-local-large-language-model-gitee\\server\\services\\practice\\assert/nke-10k-2023.pdf', 'page': 5, 'page_label': '6', 'start_index': 3956}
Retrievers
LangChain VectorStore objects are not subclasses of Runnable. LangChain Retrievers are Runnables, so they implement a standard set of methods (for example, synchronous and asynchronous invoke and batch operations). Once a VectorStore has been converted into a Retriever, the vector-store lookups can be added to a LangChain chain, which is very convenient when implementing features such as RAG (Retrieval-Augmented Generation).
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain

@chain
def retriever(query: str) -> List[Document]:
    vector_store = get_vector_store()
    return vector_store.similarity_search(query, k=1)

def retriever_batch_1(query: List[str]):
    r = retriever.batch(query)
    return r
Let's test it:
query = [
    "How many distribution centers does Nike have in the US?",
    "When was Nike incorporated?",
]
results = retriever_batch_1(query)
print(f'retriever.batch 1:\n{results}')
retriever.batch 1:
[[Document(id='a26e4349-108c-4988-8502-ff9cce20cdf3', metadata={'source': 'E:\\project\\my_opensource\\programming-with-local-large-language-model-gitee\\server\\services\\practice\\assert/nke-10k-2023.pdf', 'page': 4, 'page_label': '5', 'start_index': 3125}, page_content='direct to consumer operations sell products through the following number of retail stores in the United States:\nU.S. RETAIL STORES NUMBER\nNIKE Brand factory stores 213 \nNIKE Brand in-line stores (including employee-only stores) 74 \nConverse stores (including factory stores) 82 \nTOTAL 369 \nIn the United States, NIKE has eight significant distribution centers. Refer to Item 2. Properties for further information.\n2023 FORM 10-K 2')], [Document(id='872d6f81-3aa1-4aaa-ba2d-2d4eac29e661', metadata={'source': 'E:\\project\\my_opensource\\programming-with-local-large-language-model-gitee\\server\\services\\practice\\assert/nke-10k-2023.pdf', 'page': 3, 'page_label': '4', 'start_index': 714}, page_content='and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales\nrepresentatives in nearly all countries around the world. We also offer interactive consumer services and experiences through our digital platforms. Nearly all of our\nproducts are manufactured by independent contractors. Nearly all footwear and apparel products are manufactured outside the United States, while equipment products\nare manufactured both in the United States and abroad.\nAll references to fiscal 2023, 2022, 2021 and 2020 are to NIKE, Inc.\'s fiscal years ended May 31, 2023, 2022, 2021 and 2020, respectively. Any references to other fiscal\nyears refer to a fiscal year ending on May 31 of that year.\nPRODUCTS\nOur NIKE Brand product offerings are aligned around our consumer construct focused on Men\'s, Women\'s and Kids\'. We also design products specifically for the Jordan')]]
Vectorstores implement an as_retriever method that generates a Retriever. With the following code we can achieve the same result as retriever_batch_1 above:
def retriever_batch_2(query: List[str]):
    vector_store = get_vector_store()
    retriever = vector_store.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 1},
    )
    r = retriever.batch(query)
    return r
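To show why having a Runnable retriever is useful, here is a minimal hedged sketch of wiring it into a RAG-style chain. The helper names build_rag_chain and format_docs, the prompt wording, the k=4 setting, and the llama3.1 model choice are my own assumptions for illustration, not part of this article's code:

# A hedged sketch: plug the retriever into a small RAG-style chain.
# The prompt text, model name, and helper names are illustrative assumptions.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_ollama import ChatOllama

def build_rag_chain():
    vector_store = get_vector_store()
    retriever = vector_store.as_retriever(search_kwargs={"k": 4})

    prompt = ChatPromptTemplate.from_template(
        "Answer the question using only the context below.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    )
    llm = ChatOllama(model="llama3.1")

    def format_docs(docs):
        # Join the retrieved Document objects into one context string.
        return "\n\n".join(doc.page_content for doc in docs)

    # Because the retriever is a Runnable, it composes directly with the prompt and model.
    return (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
    )

# answer = build_rag_chain().invoke("How many distribution centers does Nike have in the US?")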
Summary
In summary, token-based retrieval focuses on surface-level matching of words, while semantic retrieval focuses on a deeper understanding of the query intent and the document content. As the technology develops, semantic retrieval shows greater advantages in handling complex queries and providing more precise information.
langchain works like glue: it makes it easy to combine the capabilities of vector databases and large language models into stable applications quickly.
Code
All the code and related resources covered in this article have been shared; see:
- github
- gitee
🪐 Good luck 🪐