LLM Engineer Study Diary (8): Building Vector Storage and Querying with LangChain: Chroma

Vector stores

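One of the most common ways to store and search unstructured data is to embed it and store the resulting embedding vectors. At query time, the unstructured query is embedded as well, and the stored vectors that are "most similar" to the embedded query are retrieved. A vector store takes care of storing the embedded data and running the vector search for you.

A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store: a retriever does not need to be able to store documents, only to return (or retrieve) them. Retrievers can be created from vector stores, but the interface is broad enough to also cover Wikipedia search and Amazon Kendra. A retriever accepts a string query as input and returns a list of documents as output. A vector store can be converted into a retriever as follows: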

vectorstore = MyVectorStore()
retriever = vectorstore.as_retriever()
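
To make the embed, store, and search loop described above more concrete, here is a minimal pure-Python sketch. Everything in it is hypothetical: `ToyVectorStore` and `embed` are illustrative names, and the word-count "embedding" is only a stand-in for a real model. It is not how LangChain or Chroma implement anything; it only shows why a retriever needs nothing more than "query string in, list of documents out".

```python
import math
from collections import Counter


def embed(text):
    """Hypothetical toy embedding: lowercase word counts (not a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Stores (text, vector) pairs and ranks them against an embedded query."""

    def __init__(self):
        self._items = []

    def add_texts(self, texts):
        for text in texts:
            self._items.append((text, embed(text)))

    def similarity_search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

    def as_retriever(self, k=2):
        # A retriever only maps a query string to a list of documents.
        return lambda query: self.similarity_search(query, k=k)


store = ToyVectorStore()
store.add_texts([
    "Pixar is an animation studio",
    "NeXT built computers",
    "Reed College is in Oregon",
])
retriever = store.as_retriever(k=1)
top = retriever("animation studio Pixar")
print(top)
```

The retriever here is just a closure over the store, which mirrors the point made above: storage is optional for a retriever, returning documents is not.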



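Chroma ( /'kromə/ n. the intensity or saturation of a color ) is an AI-native open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0. Install it with the following command: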

pip install langchain-chroma

Chroma can run in several modes. Here are examples of each, all integrated with LangChain:

  • in-memory - in a Python script or Jupyter notebook
  • in-memory with persistence - in a script or notebook, saving to and loading from disk
  • in a Docker container - as a server running on your local machine or in the cloud


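As with any other database, a Chroma collection supports the following operations:

  • .add
  • .get
  • .update
  • .upsert
  • .delete
  • .peek
  • .query, which runs the similarity search.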

For the full documentation, see the Chroma docs. To call these methods directly from LangChain, use ._collection.method().
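
The difference between the write operations listed above is easiest to see on a toy model. The sketch below is a hypothetical dict-backed class (`ToyCollection` is an illustrative name, not part of Chroma) that mimics only the method names. How it handles edge cases, such as raising when an added ID already exists, is a choice made for this sketch; Chroma's actual behavior in those cases may differ and should be checked in its documentation.

```python
class ToyCollection:
    """Hypothetical dict-backed stand-in for a collection (not Chroma itself)."""

    def __init__(self):
        self._docs = {}

    def add(self, ids, documents):
        # Insert new records only; this sketch rejects IDs that already exist.
        for id_, doc in zip(ids, documents):
            if id_ in self._docs:
                raise ValueError(f"id {id_!r} already exists")
            self._docs[id_] = doc

    def update(self, ids, documents):
        # Change existing records only; this sketch rejects unknown IDs.
        for id_, doc in zip(ids, documents):
            if id_ not in self._docs:
                raise KeyError(id_)
            self._docs[id_] = doc

    def upsert(self, ids, documents):
        # Update the record if the ID exists, otherwise insert it.
        for id_, doc in zip(ids, documents):
            self._docs[id_] = doc

    def get(self, ids):
        found = [i for i in ids if i in self._docs]
        return {"ids": found, "documents": [self._docs[i] for i in found]}

    def delete(self, ids):
        for id_ in ids:
            self._docs.pop(id_, None)

    def peek(self, limit=10):
        # Return the first few records without running any query.
        first = list(self._docs)[:limit]
        return {"ids": first, "documents": [self._docs[i] for i in first]}

    def count(self):
        return len(self._docs)


collection = ToyCollection()
collection.add(ids=["1", "2"], documents=["a", "b"])
collection.upsert(ids=["2", "3"], documents=["B", "c"])  # "2" updated, "3" inserted
collection.delete(ids=["1"])
print(collection.count(), collection.peek(limit=1))
```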


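In this basic example, we take the text of Steve Jobs's commencement speech (any document in txt format will do), split it into chunks, embed the chunks with an open-source embedding model, load them into Chroma, and then query the store.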

# pip install langchain-chroma
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
# pip install -U langchain-huggingface
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import CharacterTextSplitter

# Load the document
loader = TextLoader("../../resource/knowledge.txt", encoding="UTF-8")
documents = loader.load()
# Split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1500, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
# Create the open-source embedding function
embedding_function = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
# Load the chunks into Chroma
db = Chroma.from_documents(docs, embedding_function)
# Run the query ("What does Pixar do?")
query = "Pixar公司是做什么的?"
docs = db.similarity_search(query)
# Print the best match
print(docs[0].page_content)


During the next five years, I started a company named NeXT, another company named Pixar, and fell in love with an amazing woman who would become my wife. Pixar went on to create the worlds first computer animated feature film, Toy Story, and is now the most successful animation studio in the world. In a remarkable turn of events, Apple bought NeXT, I retuned to Apple, and the technology we developed at NeXT is at the heart of Apple's current renaissance. And Laurene and I have a wonderful family together.
在接下来的五年里, 我创立了一个名叫 NeXT 的公司,还有一个叫Pixar的公司,然后和一个后来成为我妻子的优雅女人相识。Pixar 制作了世界上第一个用电脑制作的动画电影——“”玩具总动员”,Pixar 现在也是世界上最成功的电脑制作工作室。在后来的一系列运转中,Apple 收购了NeXT,然后我又回到了苹果公司。我们在NeXT 发展的技术在 Apple 的复兴之中发挥了关键的作用。我还和 Laurence 一起建立了一个幸福的家庭。
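
The splitting step in the example above can be pictured with a small pure-Python sketch. It is a simplification, not the `CharacterTextSplitter` implementation: the function name `split_text` is hypothetical, it assumes paragraphs are separated by a blank line, and it ignores overlap (the example uses `chunk_overlap=0`). The idea it illustrates is greedy merging: paragraphs are joined until adding the next one would exceed `chunk_size`.

```python
def split_text(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most
    chunk_size characters. A single piece longer than chunk_size is kept
    whole as its own chunk."""
    chunks, current = [], ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # The next piece does not fit: close the current chunk.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "first paragraph\n\nsecond paragraph\n\nthird"
chunks = split_text(sample, chunk_size=35)
print(chunks)
```

With `chunk_size=35`, the first two paragraphs (33 characters once joined) fit in one chunk and the third starts a new one. A larger `chunk_size`, such as the 1500 used above, gives fewer and longer chunks.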


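Building on the previous example, if you want to save to disk, simply initialize the Chroma client and pass the directory where the data should be saved.

Note: Chroma makes a best effort to save data to disk automatically, but multiple in-memory clients can interfere with one another. As a best practice, run only one client at any given time.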

# Example: chroma_disk.py
# Save to disk
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
docs = db2.similarity_search(query)
# Load from disk
db3 = Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)
docs = db3.similarity_search(query)


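Passing a Chroma client into LangChain

You can also create a Chroma client yourself and pass it to LangChain. This is particularly useful if you want easier access to the underlying database.

You can also specify the name of the collection that LangChain should use.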

import chromadb
persistent_client = chromadb.PersistentClient()
collection = persistent_client.get_or_create_collection("collection_name")
collection.add(ids=["1", "2", "3"], documents=["a", "b", "c"])
langchain_chroma = Chroma(
    client=persistent_client,
    collection_name="collection_name",
    embedding_function=embedding_function,
)
print("There are", langchain_chroma._collection.count(), "documents in the collection")


There are 3 documents in the collection



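Chroma asks users to provide ids, which simplifies the bookkeeping when data is later updated or deleted. An id can be a file name, or a combined hash such as filename_paragraphNumber.

Chroma supports updating and deleting records, although some of these operations are still being wired through the LangChain interface. Further workflow improvements are expected to be added soon.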
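
As an illustration of the two id styles mentioned above, here is a small sketch using only the standard library. The helper names `make_ids` and `make_hashed_ids` are hypothetical, and the exact id format is a choice made here, not something Chroma requires; any scheme that yields unique, reproducible strings should work.

```python
import hashlib
from pathlib import Path


def make_ids(source_path, chunks):
    """Build one readable id per chunk in the form filename_paragraphNumber."""
    name = Path(source_path).name
    return [f"{name}_{i}" for i in range(1, len(chunks) + 1)]


def make_hashed_ids(source_path, chunks):
    """Build ids from a hash of the file path, chunk position, and chunk text."""
    ids = []
    for i, chunk in enumerate(chunks, start=1):
        digest = hashlib.sha256(f"{source_path}:{i}:{chunk}".encode("utf-8")).hexdigest()
        ids.append(digest[:16])  # shortened for readability
    return ids


chunks = ["first chunk", "second chunk"]
readable_ids = make_ids("../../resource/knowledge.txt", chunks)
hashed_ids = make_hashed_ids("../../resource/knowledge.txt", chunks)
print(readable_ids)
```

Because both schemes are deterministic, re-running the ingestion on the same file produces the same ids, which is what makes a later update or delete target the intended record.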


# pip install langchain-chroma
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
# pip install -U langchain-huggingface
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import CharacterTextSplitter

# Load the document
loader = TextLoader("../../resource/knowledge.txt", encoding="UTF-8")
documents = loader.load()
# Split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1500, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
# Create the open-source embedding function
embedding_function = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
query = "Pixar公司是做什么的?"  # "What does Pixar do?"
# Create simple ids
ids = [str(i) for i in range(1, len(docs) + 1)]
# Add the data
example_db = Chroma.from_documents(docs, embedding_function, ids=ids)
docs = example_db.similarity_search(query)
# Show the first record before the update
print(example_db._collection.get(ids=[ids[0]]))
# Update the metadata; note that this also overwrites record ids[0]
# with the content of the top search hit
docs[0].metadata = {
    "source": "../../resource/knowledge.txt",
    "new_value": "hello world",
}
example_db.update_document(ids[0], docs[0])
print(example_db._collection.get(ids=[ids[0]]))
# Delete the last document
print("Count before deletion", example_db._collection.count())
print(example_db._collection.get(ids=[ids[-1]]))
example_db._collection.delete(ids=[ids[-1]])
print("Count after deletion", example_db._collection.count())
print(example_db._collection.get(ids=[ids[-1]]))


{'ids': ['1'], 'embeddings': None, 'metadatas': [{'source': '../../resource/knowledge.txt'}], 'documents': ["\ufeffI am honored to be with you today at your commencement from one of the finest universities in the world. I never graduated from college. Truth be told, this is the closest I've ever gotten to a college graduation. Today I want to tell you three stories from my life. That's it. No big deal. Just three stories.\n我今天很荣幸能和你们一起参加毕业典礼,斯坦福大学是世界上最好的大学之一。我从来没有从大学中毕业。说实话,今天也许是在我的生命中离大学毕业最近的一天了。今天我想向你们讲述我生活中的三个故事。不是什么大不了的事情,只是三个故事而已。\n\nThe first story is about connecting the dots.\n第一个故事是关于如何把生命中的点点滴滴串连起来。\n\nI dropped out of Reed College after the first 6 months, but then stayed around as a drop-in for another 18 months or so before I really quit. So why did I drop out?\n我在Reed大学读了六个月之后就退学了,但是在十八个月以后——我真正的作出退学决定之前,我还经常去学校。我为什么要退学呢?"], 'uris': None, 'data': None, 'included': ['metadatas', 'documents']}
{'ids': ['1'], 'embeddings': None, 'metadatas': [{'new_value': 'hello world', 'source': '../../resource/knowledge.txt'}], 'documents': ["During the next five years, I started a company named NeXT, another company named Pixar, and fell in love with an amazing woman who would become my wife. Pixar went on to create the worlds first computer animated feature film, Toy Story, and is now the most successful animation studio in the world. In a remarkable turn of events, Apple bought NeXT, I retuned to Apple, and the technology we developed at NeXT is at the heart of Apple's current renaissance. And Laurene and I have a wonderful family together.\n在接下来的五年里, 我创立了一个名叫 NeXT 的公司,还有一个叫Pixar的公司,然后和一个后来成为我妻子的优雅女人相识。Pixar 制作了世界上第一个用电脑制作的动画电影——“”玩具总动员”,Pixar 现在也是世界上最成功的电脑制作工作室。在后来的一系列运转中,Apple 收购了NeXT,然后我又回到了苹果公司。我们在NeXT 发展的技术在 Apple 的复兴之中发挥了关键的作用。我还和 Laurence 一起建立了一个幸福的家庭。"], 'uris': None, 'data': None, 'included': ['metadatas', 'documents']}
Count before deletion 16
{'ids': ['16'], 'embeddings': None, 'metadatas': [{'source': '../../resource/knowledge.txt'}], 'documents': ['Stewart and his team put out several issues of The Whole Earth Catalog, and then when it had run its course, they put out a final issue. It was the mid-1970s, and I was your age. On the back cover of their final issue was a photograph of an early morning country road, the kind you might find yourself hitchhiking on if you were so adventurous. Beneath it were the words: "Stay Hungry. Stay Foolish." It was their farewell message as they signed off. Stay Hungry. Stay Foolish. And I have always wished that for myself. And now, as you graduate to begin anew, I wish that for you.\nStewart和他的伙伴出版了几期的“整个地球的目录”,当它完成了自己使命的时候,他们做出了最后一期的目录。那是在七十年代的中期,你们的时代。在最后一期的封底上是清晨乡村公路的照片(如果你有冒险精神的话,你可以自己找到这条路的),在照片之下有这样一段话:“求知若饥,虚心若愚。”这是他们停止了发刊的告别语。“求知若饥,虚心若愚。”我总是希望自己能够那样,现在,在你们即将毕业,开始新的旅程的时候,我也希望你们能这样:\n\nStay Hungry. Stay Foolish.\n求知若饥,虚心若愚。\n\nThank you all very much.\n非常感谢你们。'], 'uris': None, 'data': None, 'included': ['metadatas', 'documents']}
Count after deletion 15
{'ids': [], 'embeddings': None, 'metadatas': [], 'documents': [], 'uris': None, 'data': None, 'included': ['metadatas', 'documents']}

Using OpenAI Embeddings

Many people like to use OpenAIEmbeddings. Here is how to set it up.

from langchain_openai import OpenAIEmbeddings
# pip install langchain-chroma
from langchain_chroma import Chroma
import chromadb
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader

embeddings = OpenAIEmbeddings()
# In-memory client; nothing is written to disk
new_client = chromadb.EphemeralClient()
# Load the document
loader = TextLoader("../../resource/knowledge.txt", encoding="UTF-8")
documents = loader.load()
# Split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1500, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
# Embed the chunks with OpenAI and store them in a named collection
openai_lc_client = Chroma.from_documents(
    docs, embeddings, client=new_client, collection_name="openai_collection"
)
query = "Pixar公司是做什么的?"  # "What does Pixar do?"
docs = openai_lc_client.similarity_search(query)
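
When called without arguments, as above, `OpenAIEmbeddings()` picks up the API key from the `OPENAI_API_KEY` environment variable. A small pre-flight check can make a missing key fail early with a clearer message. The helper below is a hypothetical sketch (`require_env` is not part of LangChain), and the key passed in the demo call is a placeholder, not a real credential.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable, or raise a clear error."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before creating OpenAIEmbeddings()")
    return value


# Demo with an explicit mapping so the sketch runs without a real key.
key = require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-example"})
print("key found:", bool(key))
```

In a real script the second argument would be omitted, so that the check reads from the actual process environment.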


