Contents
Document splitting supported by LangChain
ZhipuAI's document splitting strategy
Meta KDD Cup'24
QAnything
Summary
Document splitting supported by LangChain
| Name | Classes | Splits on | Adds metadata | Description |
| --- | --- | --- | --- | --- |
| Recursive | RecursiveCharacterTextSplitter, RecursiveJsonSplitter | user-defined characters | | Recursively splits text, trying to keep related pieces of text next to each other. The recommended way to start splitting text. |
| HTML | HTMLHeaderTextSplitter, HTMLSectionSplitter | HTML-specific characters | ✅ | Splits text on HTML-specific characters. Notably, this adds metadata about where each chunk came from (based on the HTML). |
| Markdown | MarkdownHeaderTextSplitter | Markdown-specific characters | ✅ | Splits text on Markdown-specific characters. Notably, this adds metadata about where each chunk came from (based on the Markdown). |
| Code | many languages | code (Python, JS) characters | | Splits text on characters specific to a programming language; 15 different languages are available. |
| Token | many classes | tokens | | Splits text on tokens; there are several different ways to measure tokens. |
| Character | CharacterTextSplitter | a user-defined character | | Splits text on a user-defined character. One of the simpler methods. |
| Semantic Chunker | SemanticChunker | sentences | | First splits on sentences, then merges adjacent sentences if they are semantically similar enough. Taken from Greg Kamradt. |
| AI21 Semantic Text Splitter | AI21SemanticTextSplitter | | ✅ | Identifies distinct topics that form coherent pieces of text and splits along them. |
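The core idea behind the recursive splitter can be sketched in a few lines of plain Python. This is a simplified toy, not LangChain's actual `RecursiveCharacterTextSplitter` (the real one also handles chunk overlap, regex separators, and length functions), but it shows why "try the coarsest separator first" keeps related text together:

```python
# Toy sketch of recursive character splitting (NOT LangChain's implementation):
# try the coarsest separator first; any piece still too long is re-split with
# the next, finer separator, then adjacent pieces are greedily merged back up
# to chunk_size so related fragments stay in the same chunk.
def recursive_split(text, separators=("\n\n", "\n", " "), chunk_size=40):
    # base case: short enough, or no finer separator left to try
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            pieces.append(piece)
        else:
            pieces.extend(recursive_split(piece, rest, chunk_size))
    # greedily merge adjacent pieces back together up to chunk_size
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(sep) + len(piece) <= chunk_size:
            current += sep + piece
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

doc = "Paragraph one about apples.\n\nParagraph two is much longer and talks about pears at length."
for c in recursive_split(doc):
    print(repr(c))
```

Note how the first paragraph survives as a single chunk because the paragraph separator `"\n\n"` is tried before falling back to spaces.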
ZhipuAI's document splitting strategy
See https://mp.weixin.qq.com/s/DC0-so_8pVUfcz7lGhYwlQ; this records the approach as of June 2024.
ZhipuAI adopts a small-to-big strategy: on top of the original document chunks, it adds further chunks at a finer granularity. At retrieval time, when a fine-grained chunk is hit, the retriever recursively resolves it back to its original larger chunk, and that original node is what gets submitted to the LLM as the retrieval result.
This is similar to the ParentDocumentRetriever strategy: first use CharacterTextSplitter to split documents into short chunks, which are embedded and retrieved. Above each short chunk sits a long chunk (the parent chunk); when a query hits a small chunk, only its parent chunk is returned to the LLM. The advantages:
- Text embeddings model the semantics of short text more easily. For example, if the query is "apple", the short text is "Zhang San loves apples", and the long text is "Li Si loves pears, Zhang San loves apples, and they both love fruit that grows on trees", the query clearly matches the short text more easily.
- Long text carries more surrounding context, so handing the long text to the LLM lets it answer more complex questions. For example, if the query is "Who, like Zhang San, also loves apples?", the short text is "Zhang San loves apples", and the long text is "Zhang San loves apples; Li Si, like Zhang San, also loves them". The query more easily hits the short text, but the short text alone is not enough to answer the question.
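The small-to-big mapping described above can be sketched in plain Python. This is a toy illustration, not LangChain's ParentDocumentRetriever: retrieval here is a naive word-containment score instead of vector similarity, and the example sentences are the ones from the bullets:

```python
# Toy sketch of small-to-big retrieval (NOT the actual ParentDocumentRetriever):
# index small child chunks for matching, but return the large parent chunk.
parents = [
    "Zhang San loves apples. Li Si, like Zhang San, also loves them.",
    "Wang Wu prefers pears and rarely eats other fruit.",
]

# each child chunk remembers which parent it came from
children = []
for pid, parent in enumerate(parents):
    for sentence in parent.split(". "):
        children.append((pid, sentence.strip(" .").lower()))

def retrieve(query):
    # naive scoring: how many query words appear in the child chunk
    words = query.lower().split()
    best_pid, _ = max(
        ((pid, sum(w in child for w in words)) for pid, child in children),
        key=lambda t: t[1],
    )
    # the hit happens on the small chunk, but the big parent chunk is
    # what gets handed to the LLM
    return parents[best_pid]

print(retrieve("who loves apples"))
```

The query matches the short child sentence, yet the caller receives the full parent text, which is what lets the LLM answer the "who else also loves apples" style of question.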
Meta KDD Cup'24
ACM SIGKDD (Knowledge Discovery and Data Mining, KDD) is the top international conference in data mining. The KDD Cup, hosted by SIGKDD and held annually since 1997, is the most influential competition in the field.
This year's competition comprises three tracks: WEB-BASED RETRIEVAL SUMMARIZATION, KNOWLEDGE GRAPH AND WEB AUGMENTATION, and END-TO-END RAG. The winning solution uses the same document splitting strategy as ZhipuAI; see https://openreview.net/forum?id=oWNPeoP1uC
QAnything
QAnything uses the ParentDocumentRetriever strategy; see the code at https://github.com/netease-youdao/QAnything/blob/qanything-v2/qanything_kernel/core/retriever/parent_retriever.py
```python
# Excerpt from QAnything's parent_retriever.py; project-specific imports
# (VectorStoreMilvusClient, SEPARATORS, num_tokens_embed, etc.) are omitted.
class ParentRetriever:
    def __init__(self, vectorstore_client: VectorStoreMilvusClient, mysql_client: KnowledgeBaseManager,
                 es_client: StoreElasticSearchClient):
        self.mysql_client = mysql_client
        self.vectorstore_client = vectorstore_client
        # This text splitter is used to create the parent documents
        init_parent_splitter = RecursiveCharacterTextSplitter(
            separators=SEPARATORS,
            chunk_size=DEFAULT_PARENT_CHUNK_SIZE,
            chunk_overlap=0,
            length_function=num_tokens_embed)
        # This text splitter is used to create the child documents
        # It should create documents smaller than the parent
        init_child_splitter = RecursiveCharacterTextSplitter(
            separators=SEPARATORS,
            chunk_size=DEFAULT_CHILD_CHUNK_SIZE,
            chunk_overlap=int(DEFAULT_CHILD_CHUNK_SIZE / 4),
            length_function=num_tokens_embed)
        self.retriever = SelfParentRetriever(
            vectorstore=vectorstore_client.local_vectorstore,
            docstore=MysqlStore(mysql_client),
            child_splitter=init_child_splitter,
            parent_splitter=init_parent_splitter,
        )
        self.backup_vectorstore: Optional[Milvus] = None
        self.es_store = es_client.es_store
        self.parent_chunk_size = DEFAULT_PARENT_CHUNK_SIZE

    @get_time_async
    async def insert_documents(self, docs, parent_chunk_size, single_parent=False):
        insert_logger.info(f"Inserting {len(docs)} documents, parent_chunk_size: {parent_chunk_size}, "
                           f"single_parent: {single_parent}")
        if parent_chunk_size != self.parent_chunk_size:
            self.parent_chunk_size = parent_chunk_size
            parent_splitter = RecursiveCharacterTextSplitter(
                separators=SEPARATORS,
                chunk_size=parent_chunk_size,
                chunk_overlap=0,
                length_function=num_tokens_embed)
            child_chunk_size = min(DEFAULT_CHILD_CHUNK_SIZE, int(parent_chunk_size / 2))
            child_splitter = RecursiveCharacterTextSplitter(
                separators=SEPARATORS,
                chunk_size=child_chunk_size,
                chunk_overlap=int(child_chunk_size / 4),
                length_function=num_tokens_embed)
            self.retriever = SelfParentRetriever(
                vectorstore=self.vectorstore_client.local_vectorstore,
                docstore=MysqlStore(self.mysql_client),
                child_splitter=child_splitter,
                parent_splitter=parent_splitter)
        insert_logger.info(f'insert documents: {len(docs)}')
        ids = None if not single_parent else [doc.metadata['doc_id'] for doc in docs]
        return await self.retriever.aadd_documents(docs, parent_chunk_size=parent_chunk_size,
                                                   es_store=self.es_store, ids=ids, single_parent=single_parent)

    async def get_retrieved_documents(self, query: str, partition_keys: List[str], time_record: dict,
                                      hybrid_search: bool, top_k: int):
        milvus_start_time = time.perf_counter()
        expr = f'kb_id in {partition_keys}'
        # self.retriever.set_search_kwargs("mmr", k=VECTOR_SEARCH_TOP_K, expr=expr)
        self.retriever.set_search_kwargs("similarity", k=top_k, expr=expr)
        query_docs = await self.retriever.aget_relevant_documents(query)
        for doc in query_docs:
            doc.metadata['retrieval_source'] = 'milvus'
        milvus_end_time = time.perf_counter()
        time_record['retriever_search_by_milvus'] = round(milvus_end_time - milvus_start_time, 2)
        if not hybrid_search:
            return query_docs
        try:
            filter = [{"terms": {"metadata.kb_id.keyword": partition_keys}}]
            es_sub_docs = await self.es_store.asimilarity_search(query, k=top_k, filter=filter)
            es_ids = []
            milvus_doc_ids = [d.metadata[self.retriever.id_key] for d in query_docs]
            for d in es_sub_docs:
                if (self.retriever.id_key in d.metadata
                        and d.metadata[self.retriever.id_key] not in es_ids
                        and d.metadata[self.retriever.id_key] not in milvus_doc_ids):
                    es_ids.append(d.metadata[self.retriever.id_key])
            es_docs = await self.retriever.docstore.amget(es_ids)
            es_docs = [d for d in es_docs if d is not None]
            for doc in es_docs:
                doc.metadata['retrieval_source'] = 'es'
            time_record['retriever_search_by_es'] = round(time.perf_counter() - milvus_end_time, 2)
            debug_logger.info(f"Got {len(query_docs)} documents from vectorstore and {len(es_sub_docs)} "
                              f"documents from es, total {len(query_docs) + len(es_docs)} merged documents.")
            query_docs.extend(es_docs)
        except Exception as e:
            debug_logger.error(f"Error in get_retrieved_documents on es_search: {e}")
        return query_docs
```
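The hybrid-search branch in `get_retrieved_documents` boils down to a dedup keyed on the parent doc id: an Elasticsearch child hit only contributes if its parent chunk was not already returned via Milvus, and surviving ids are resolved through the docstore back to parent documents. A minimal sketch of just that merge logic, using plain dicts in place of the QAnything classes:

```python
# Toy sketch of QAnything's hybrid merge: ES child hits are mapped to parent
# ids and kept only if Milvus did not already return that parent.
def merge_hybrid(milvus_docs, es_sub_docs, docstore, id_key="doc_id"):
    milvus_ids = {d["metadata"][id_key] for d in milvus_docs}
    es_ids = []
    for d in es_sub_docs:
        pid = d["metadata"].get(id_key)
        # skip children whose parent id is missing, duplicated, or already covered
        if pid is not None and pid not in es_ids and pid not in milvus_ids:
            es_ids.append(pid)
    # fetch parent docs for surviving ids; drop ids the docstore can't resolve
    es_docs = [doc for pid in es_ids if (doc := docstore.get(pid)) is not None]
    for doc in es_docs:
        doc["metadata"]["retrieval_source"] = "es"
    return milvus_docs + es_docs

docstore = {
    "p1": {"text": "parent one", "metadata": {"doc_id": "p1"}},
    "p2": {"text": "parent two", "metadata": {"doc_id": "p2"}},
}
milvus = [{"text": "parent one", "metadata": {"doc_id": "p1", "retrieval_source": "milvus"}}]
es_hits = [{"metadata": {"doc_id": "p1"}}, {"metadata": {"doc_id": "p2"}}]
merged = merge_hybrid(milvus, es_hits, docstore)
print([d["metadata"]["doc_id"] for d in merged])
```

Here the ES hit on `p1` is discarded because Milvus already covered that parent, so only `p2` is added from the keyword side.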
Summary
As of this writing (2024-09-25), a document splitting strategy that works well is CharacterTextSplitter + ParentDocumentRetriever, which strikes a good balance between retrieval precision and LLM answer quality.
CharacterTextSplitter:https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/character_text_splitter/
ParentDocumentRetriever:https://www.langchain.com.cn/modules/data_connection/retrievers/parent_document_retriever