In interviews you are often asked how models and systems are evaluated, especially for RAG: with so many components plus the end-to-end system, which metrics apply to each part and how are they computed? Let's organize this today.
The most widely used option at the moment is the RAGAS framework, which is already integrated with LangChain. Before looking at it, it helps to first understand how the industry as a whole approaches RAG evaluation.
General evaluation frameworks
This part looks at RAG evaluation at the overall level, i.e. the metrics used at each RAG stage and for the whole system.
Evaluation is split into three parts following the RAG components. To know whether a RAG application's performance is actually improving, it must be evaluated quantitatively, which requires two ingredients: evaluation metrics and an evaluation dataset.
This post focuses on the former.
Original post
Three main parts: retrieval, generation/hallucination, and end-to-end.
1. Retrieval
The AP (average precision) mentioned here is essentially the Context Precision described below.
2. Generation / Hallucination
N-gram-based evaluation: e.g. BLEU, ROUGE, METEOR.
For BLEU and ROUGE, see https://blog.csdn.net/qq_43814415/article/details/141142750
Model-based evaluation: e.g. BERTScore, BARTScore.
LLM-based evaluation: e.g. G-Eval, UniEval, GPTScore, TRUE, SelfCheckGPT, ChatProtect, Chainpoll.
For this part, see 司南 (OpenCompass).
3. End-to-End
Ragas framework: evaluates context recall, context precision, context relevance, answer semantic similarity, answer correctness, faithfulness, and answer relevance.
promptfoo framework: evaluates context adherence, context recall, context relevance, factuality, and answer relevance.
RAG Triad framework: evaluates context relevance, answer relevance, and groundedness.
ARES framework: evaluates context relevance, answer faithfulness, and answer relevance.
EXAM framework: evaluates context relevance.
RAGAS
RAGAs (Retrieval-Augmented Generation Assessment) is an evaluation framework (see its documentation). It considers the retrieval system's ability to identify relevant, focused context passages, the LLM's ability to use those passages faithfully, and the quality of the generation itself.
As outlined above, RAGAS covers evaluating the retriever and the generator separately, as well as the whole end-to-end system.
The sections below introduce the required dataset format, the evaluation metrics, and practical examples.
1. Dataset format
Originally, RAGAs did not rely on human-annotated reference answers for its evaluation dataset; instead, an underlying large language model (LLM) performs the evaluation.
So all you need is an evaluation dataset of question-answer (QA) pairs, for example: https://huggingface.co/datasets/m-ric/huggingface_doc
The specific fields:
- question: the user query that is the input to the RAG pipeline. Input.
- answer: the answer generated by the RAG pipeline. Output.
- contexts: the contexts retrieved from the external knowledge source and used to answer the question.
- ground_truths: the ground-truth answer to the question. This is the only human-annotated information.
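As a minimal sketch (field names as in the list above, sample texts borrowed from the code examples later in this post), one evaluation sample might look like:
eval_sample = {
    "question": "When was the first super bowl?",              # input to the RAG pipeline
    "answer": "The first superbowl was held on Jan 15, 1967",  # output of the RAG pipeline
    "contexts": ["The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, ..."],  # retrieved chunks
    "ground_truths": ["The first superbowl was held on January 15, 1967"],  # human-annotated reference
}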
2. Evaluation metrics (link)
They fall into component-level and system-level metrics.
Retrieval
Includes Context Precision (relevance of the retrieved context) and Context Recall.
Both are currently computed from question, contexts, and ground_truths.
1. Context Precision
Evaluates whether all the chunks relevant to the ground_truths (reference answer) were retrieved and ranked near the top.
How it is computed:
1. For each chunk in the retrieved contexts, check whether it is relevant, i.e. useful for arriving at the ground truth of the given question.
2. Compute precision@k for each position k in the contexts. (precision@k measures what fraction of the top k retrieved results are relevant; true positives@k is the number of documents among the top k that are actually relevant to the query.)
3. Average the precision@k values to obtain the final context precision score.
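Written out (a sketch following the RAGAS docs, where K is the number of retrieved chunks and v_k ∈ {0, 1} marks whether the chunk at rank k is relevant):

$$\text{Context Precision@K} = \frac{\sum_{k=1}^{K}\big(\text{Precision@k} \times v_k\big)}{\text{total number of relevant items in the top } K}, \qquad \text{Precision@k} = \frac{\text{true positives@k}}{k}$$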
Code:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision

# Each field is a list with one entry per evaluation sample
data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts': [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], ['The Green Bay Packers...Green Bay, Wisconsin.', 'The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times'],
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[context_precision])
score.to_pandas()
2. Context Recall
Context recall measures how well the retrieved contexts (contexts) match the reference answer (ground_truths).
- Numerator: GT claims that can be attributed to context — of the claims in the ground truth (GT), how many can be attributed to the retrieved context, i.e. find support in it.
- Denominator: Number of claims in GT — the total number of claims in the ground truth.
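Written as a formula (directly from the numerator and denominator above):

$$\text{Context Recall} = \frac{\left|\text{GT claims that can be attributed to the retrieved context}\right|}{\left|\text{claims in GT}\right|}$$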
How it is computed:
1. Break the ground-truth answer into individual statements.
2. For each ground-truth statement, verify whether it can be attributed to the retrieved context.
3. Compute the context recall score.
These steps are carried out by an LLM.
Code:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_recall

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts': [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], ['The Green Bay Packers...Green Bay, Wisconsin.', 'The Packers compete...Football Conference']],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times'],
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[context_recall])
score.to_pandas()
Generation
Faithfulness and Answer Relevancy
1. Faithfulness
This metric measures the factual consistency of the generated answer against the given context. It is computed from the answer and the context.
The generated answer is considered faithful if every claim it makes can be inferred from the given context.
- Numerator: the number of claims in the generated answer that can be inferred from the given context.
- Denominator: the total number of claims in the generated answer, whether or not they can be inferred from the given context.
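As a formula (directly from the numerator and denominator above):

$$\text{Faithfulness} = \frac{\left|\text{claims in the answer that can be inferred from the given context}\right|}{\left|\text{claims in the answer}\right|}$$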
How it is computed:
1. Break the generated answer into individual statements.
2. For each statement, verify whether it can be inferred from the given context.
3. Compute the faithfulness score.
Code:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import FaithfulnesswithHHEM  # HHEM-based faithfulness variant; class name per the ragas docs

faithfulness_with_hhem = FaithfulnesswithHHEM()
data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts': [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], ['The Green Bay Packers...Green Bay, Wisconsin.', 'The Packers compete...Football Conference']],
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[faithfulness_with_hhem])
score.to_pandas()
2. Answer Relevancy
Focuses on how relevant the generated answer is to the given prompt.
Lower scores indicate answers that are incomplete or contain redundant information; higher scores indicate better relevance.
This metric is computed from the question, the context, and the answer.
How it is computed:
1. Use a large language model (LLM) to reverse-engineer n variants of the question from the generated answer, i.e. infer n plausible questions based on the answer.
2. Compute the mean cosine similarity between the generated questions and the actual question.
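Roughly, per the RAGAS docs (E_{g_i} is the embedding of the i-th generated question, E_o the embedding of the original question, and N the number of generated questions):

$$\text{Answer Relevancy} = \frac{1}{N}\sum_{i=1}^{N}\cos\!\big(E_{g_i}, E_{o}\big)$$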
Code:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'contexts': [['The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles,'], ['The Green Bay Packers...Green Bay, Wisconsin.', 'The Packers compete...Football Conference']],
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[answer_relevancy])
score.to_pandas()
End-to-End
Answer Semantic Similarity and Answer Correctness
1. Answer Semantic Similarity
Answer semantic similarity measures how semantically close the answer is to the ground truth. A higher score indicates a better semantic match between the generated answer and the reference answer.
This evaluation uses a cross-encoder model to compute the semantic similarity score.
How it is computed:
1. Vectorize the ground-truth answer with the specified embedding model.
2. Vectorize the generated answer with the same embedding model.
3. Compute the cosine similarity between the two vectors.
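In formula form, the score is simply the cosine similarity of the two embedding vectors (E_a for the answer, E_g for the ground truth):

$$\text{similarity} = \cos(E_a, E_g) = \frac{E_a \cdot E_g}{\lVert E_a\rVert\,\lVert E_g\rVert}$$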
Code:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_similarity

data_samples = {
    'question': ['When was the first super bowl?', 'Who won the most super bowls?'],
    'answer': ['The first superbowl was held on Jan 15, 1967', 'The most super bowls have been won by The New England Patriots'],
    'ground_truth': ['The first superbowl was held on January 15, 1967', 'The New England Patriots have won the Super Bowl a record six times'],
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[answer_similarity])
score.to_pandas()
2. Answer Correctness
Answer correctness measures how accurate the answer is compared with the ground truth.
It covers two key aspects: the semantic similarity between the generated answer and the ground truth, and their factual similarity. The two are combined with a weighting scheme to produce the answer correctness score.
Semantic similarity is the previous metric.
Factual similarity quantifies the factual overlap between the answer and the ground truth, borrowing the idea of a confusion matrix:
TP (true positive): facts or statements present in both the ground truth and the answer.
FP (false positive): facts or statements present in the answer but not in the ground truth.
FN (false negative): facts or statements present in the ground truth but not in the answer.
Factual similarity is then obtained from the F1 formula:
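A sketch of that formula, using the claim-level TP/FP/FN counts above (as in the RAGAS docs):

$$\text{F1} = \frac{|TP|}{|TP| + 0.5 \times \big(|FP| + |FN|\big)}$$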
Finally, factual similarity and semantic similarity are combined with a weighted sum to give the answer correctness score.
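In RAGAS this corresponds to the answer_correctness metric; a minimal, hedged sketch (the default weighting is roughly 0.75 for factual similarity and 0.25 for semantic similarity, but check the docs of your ragas version):
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness

data_samples = {
    'question': ['When was the first super bowl?'],
    'answer': ['The first superbowl was held on Jan 15, 1967'],
    'ground_truth': ['The first superbowl was held on January 15, 1967'],
}
dataset = Dataset.from_dict(data_samples)
# answer_correctness combines the claim-level factual F1 with semantic similarity
score = evaluate(dataset, metrics=[answer_correctness])
score.to_pandas()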
3. Domain Specific Evaluation
Domain-specific evaluation is a rubric-based metric used to assess a model's performance in a particular domain.
The rubric consists of a description for each score, usually on a scale from 1 to 5.
An LLM assigns the score according to the descriptions specified in the rubric.
1. Reference-free evaluation
from ragas import evaluate
from datasets import Dataset, DatasetDict
from ragas.metrics import reference_free_rubrics_score, labelled_rubrics_score

rows = {
    "question": ["What's the longest river in the world?"],
    "ground_truth": ["The Nile is a major north-flowing river in northeastern Africa."],
    "answer": ["The longest river in the world is the Nile, stretching approximately 6,650 kilometers (4,130 miles) through northeastern Africa, flowing through countries such as Uganda, Sudan, and Egypt before emptying into the Mediterranean Sea. There is some debate about this title, as recent studies suggest the Amazon River could be longer if its longest tributaries are included, potentially extending its length to about 7,000 kilometers (4,350 miles)."],
    "contexts": [[
        "Scientists debate whether the Amazon or the Nile is the longest river in the world. Traditionally, the Nile is considered longer, but recent information suggests that the Amazon may be longer.",
        "The Nile River was central to the Ancient Egyptians' rise to wealth and power. Since rainfall is almost non-existent in Egypt, the Nile River and its yearly floodwaters offered the people a fertile oasis for rich agriculture.",
        "The world's longest rivers are defined as the longest natural streams whose water flows within a channel, or streambed, with defined banks.",
        "The Amazon River could be considered longer if its longest tributaries are included, potentially extending its length to about 7,000 kilometers.",
    ]],
}
dataset = Dataset.from_dict(rows)
result = evaluate(dataset, metrics=[reference_free_rubrics_score, labelled_rubrics_score])
2. Reference-based (labelled) evaluation
from ragas.metrics.rubrics import LabelledRubricsScore

my_custom_rubrics = {
    "score1_description": "answer and ground truth are completely different",
    "score2_description": "answer and ground truth are somewhat different",
    "score3_description": "answer and ground truth are somewhat similar",
    "score4_description": "answer and ground truth are similar",
    "score5_description": "answer and ground truth are exactly the same",
}
labelled_rubrics_score = LabelledRubricsScore(rubrics=my_custom_rubrics)
Practice
I haven't run these experiments yet; the following resources look useful as references:
https://huggingface.co/learn/cookbook/zh-CN/rag_evaluation
https://docs.ragas.io/en/latest/concepts/testset_generation.html
https://towardsdatascience.com/evaluating-rag-applications-with-ragas-81d67b0ee31a#c52f
References
1. https://towardsdatascience.com/evaluating-rag-applications-with-ragas-81d67b0ee31a#c52f
2. https://docs.ragas.io/en/latest/concepts/testset_generation.html
3. https://liduos.com/how-to-evaluate-rag-application.html
4. https://myscale.com/blog/zh/ultimate-guide-to-evaluate-rag-system/