# How Do You Evaluate the Quality of a RAG Application? The Most Common Methodologies and Evaluation Tools

01/03 19:05

## 01. Methodology

### Perspective 1: Evaluation Metrics

#### a. The RAG Triad: Evaluation Without Ground Truth

• Context Relevance: measures how well the retrieved context supports the query. A low score indicates that much of the retrieved content is irrelevant to the query, and this noisy retrieval can degrade the LLM's final answer.

• Groundedness: measures how faithfully the LLM's response sticks to the retrieved context. A low score indicates the answer departs from the retrieved knowledge, making hallucination more likely.

• Answer Relevance: measures how relevant the final response is to the query. A low score indicates the answer may be off-topic.

Question: Where is France and what is its capital?
Low relevance answer: France is in western Europe.
High relevance answer: France is in western Europe and Paris is its capital.
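None of these scores is computed by string matching in practice (the tools below use an LLM judge), but a toy token-overlap proxy makes the intuition behind Answer Relevance concrete. Everything here is illustrative, including the stop-word list:

```python
def token_overlap_relevance(query: str, response: str) -> float:
    """Toy proxy for Answer Relevance: the fraction of content words
    in the query that the response touches on (real evaluation tools
    use an LLM judge instead of lexical overlap)."""
    stop = {"is", "and", "what", "the", "a", "it", "its", "in"}
    q = {w.strip("?.,!").lower() for w in query.split()} - stop
    r = {w.strip("?.,!").lower() for w in response.split()} - stop
    return len(q & r) / len(q) if q else 0.0

query = "Where is France and what is its capital?"
low = token_overlap_relevance(query, "France is in western Europe.")
high = token_overlap_relevance(query, "France is in western Europe and Paris is its capital.")
```

The second answer scores higher because it also covers "capital", matching the intuition of the example above.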


#### b. Ground-Truth-Based Metrics

• Ground truth as the reference answer

Ground truth: Einstein was born in 1879 in Germany.
High answer correctness: In 1879, in Germany, Einstein was born.
Low answer correctness: In Spain, Einstein was born in 1879.
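In practice answer correctness is scored by an LLM judge (Ragas also blends in semantic similarity), but the classic SQuAD-style token-level F1 gives a rough lexical flavor of the comparison. Note how it misjudges the Spain example below, which is exactly why LLM judges are preferred:

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 as used in SQuAD-style QA evaluation (a purely
    lexical proxy; tools like Ragas score correctness with an LLM)."""
    pred = prediction.lower().replace(".", "").replace(",", "").split()
    gold = ground_truth.lower().replace(".", "").replace(",", "").split()
    common = Counter(pred) & Counter(gold)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

gt = "Einstein was born in 1879 in Germany."
good = token_f1("In 1879, in Germany, Einstein was born.", gt)
bad = token_f1("In Spain, Einstein was born in 1879.", gt)
```

The factually wrong answer still scores 6/7 lexically because only one token differs, so a purely lexical match cannot catch this kind of error.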


• Ground truth as chunks from the knowledge documents

• Generating an evaluation dataset
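A common recipe for generating such a dataset is to have an LLM draft one question per knowledge chunk and keep the chunk as the ground-truth context (Ragas's test-set generator builds on this idea). A minimal sketch, with the LLM injected as a plain callable so nothing depends on a real API (`ask_llm` and the prompt wording are illustrative assumptions):

```python
def build_eval_dataset(chunks, ask_llm):
    """For each knowledge chunk, ask an LLM to write one question
    answerable from that chunk alone; the chunk itself then serves
    as the ground-truth context for retrieval evaluation."""
    dataset = []
    for chunk in chunks:
        question = ask_llm(
            f"Write one question answerable only from this text:\n{chunk}"
        )
        dataset.append({"question": question, "ground_truth_context": chunk})
    return dataset

# stand-in LLM for illustration
fake_llm = lambda prompt: "What does the passage describe?"
records = build_eval_dataset(["Paris is the capital of France."], fake_llm)
```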

#### c. Metrics on the LLM Answer Itself

For example, LangChain's criteria evaluator ships built-in labels such as:

conciseness, relevance, correctness, coherence, harmfulness, maliciousness, helpfulness, controversiality, misogyny, criminality, insensitivity

and Ragas's aspect critiques cover a similar, smaller set:

harmfulness, maliciousness, coherence, correctness, conciseness


Question: What's 2+2?
Low conciseness answer: What's 2+2? That's an elementary question. The answer you're looking for is that two and two is four.
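Like the other answer-quality criteria, conciseness is judged by an LLM in the tools below; a crude restatement check only hints at what such a judge looks for (a toy heuristic, not any tool's actual method):

```python
def restates_question(question: str, answer: str) -> bool:
    """Crude conciseness red flag: a padded answer often opens by
    repeating the question verbatim (real evaluators use an LLM judge)."""
    return question.lower().rstrip("?") in answer.lower()

verbose = ("What's 2+2? That's an elementary question. "
           "The answer you're looking for is that two and two is four.")
flagged = restates_question("What's 2+2?", verbose)
concise_ok = restates_question("What's 2+2?", "Four.")
```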


## 02. Evaluation Tools

• Ragas

Ragas is a tool focused specifically on evaluating RAG applications, and an evaluation takes little more than a single interface call:

```python
from ragas import evaluate
from datasets import Dataset

# prepare your Hugging Face dataset in the format
# Dataset({
#     features: ['question', 'contexts', 'answer', 'ground_truths'],
#     num_rows: 25
# })

dataset: Dataset = ...

results = evaluate(dataset)
# {'ragas_score': 0.860, 'context_precision': 0.817, ...}
```


Ragas offers a rich variety of metrics and places no constraints on which framework the RAG application is built with. It can also work with LangSmith to monitor each evaluation run, which helps analyze the cause behind each result and track API key consumption.

• Llama-Index

Llama-Index is well suited to building RAG applications, has a fairly rich ecosystem, and is currently iterating rapidly. It also includes some evaluation functionality, so users can conveniently evaluate RAG applications built with Llama-Index itself:

```python
from llama_index.evaluation import BatchEvalRunner
from llama_index.evaluation import (
    FaithfulnessEvaluator,
    RelevancyEvaluator,
)

service_context_gpt4 = ...
vector_index = ...
question_list = ...

faithfulness_gpt4 = FaithfulnessEvaluator(service_context=service_context_gpt4)
relevancy_gpt4 = RelevancyEvaluator(service_context=service_context_gpt4)

runner = BatchEvalRunner(
    {"faithfulness": faithfulness_gpt4, "relevancy": relevancy_gpt4},
    workers=8,
)

eval_results = runner.evaluate_queries(
    vector_index.as_query_engine(), queries=question_list
)
```
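`evaluate_queries` returns a dict mapping each metric name to a list of per-query result objects carrying a boolean `passing` field; a small helper can roll these up into pass rates (shown with a stand-in result type, since the real objects come from llama_index):

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """Stand-in for llama_index's EvaluationResult (illustration only)."""
    passing: bool

def pass_rates(eval_results):
    """Aggregate per-metric pass rates from BatchEvalRunner output."""
    return {
        metric: sum(r.passing for r in results) / len(results)
        for metric, results in eval_results.items()
    }

demo = {"faithfulness": [EvalResult(True), EvalResult(True), EvalResult(False)]}
rates = pass_rates(demo)
```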


• TruLens-Eval

TruLens-Eval is likewise a tool dedicated to evaluating RAG metrics. It integrates well with both LangChain and Llama-Index, so RAG applications built on either framework are easy to evaluate. Here we take evaluating a LangChain RAG application as the example:

```python
from trulens_eval import TruChain, Feedback, Tru, Select
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider import OpenAI
import numpy as np

tru = Tru()
rag_chain = ...

# Initialize provider class
openai = OpenAI()

grounded = Groundedness(groundedness_provider=OpenAI())
# Define a groundedness feedback function
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons)
    .on(Select.RecordCalls.first.invoke.rets.context)
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)

f_qa_relevance = Feedback(openai.relevance).on_input_output()

tru_recorder = TruChain(
    rag_chain,
    app_id='Chain1_ChatApplication',
    feedbacks=[f_qa_relevance, f_groundedness],
)

tru.run_dashboard()
```


• Phoenix

Phoenix provides many features for evaluating LLMs, such as evaluating embeddings and the LLM itself. For evaluating RAG specifically, it also exposes interfaces and connects to the ecosystem, although the range of metrics is still fairly limited for now. Below is an example of using Phoenix to evaluate a RAG application built with Llama-Index:

```python
import phoenix as px
from llama_index import set_global_handler
from phoenix.experimental.evals import (
    llm_classify,
    OpenAIModel,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
)
from phoenix.session.evaluation import get_retrieved_documents

px.launch_app()
set_global_handler("arize_phoenix")
print("phoenix URL", px.active_session().url)

query_engine = ...
question_list = ...

for question in question_list:
    response_vector = query_engine.query(question)

retrieved_documents = get_retrieved_documents(px.active_session())

retrieved_documents_relevance = llm_classify(
    dataframe=retrieved_documents,
    model=OpenAIModel(model_name="gpt-4-1106-preview"),
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
```


After `px.launch_app()` starts, a web page opens locally where you can inspect every step of the RAG application's pipeline. The most recent evaluation results are kept in `retrieved_documents_relevance`.
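Since the labels in `retrieved_documents_relevance` follow `RAG_RELEVANCY_PROMPT_RAILS_MAP` (roughly "relevant" vs. "irrelevant"), one natural roll-up is the share of retrieved documents judged relevant. A sketch over a plain list of labels (the label strings are assumptions for illustration):

```python
def retrieval_precision(labels, positive="relevant"):
    """Fraction of retrieved documents an LLM judge marked relevant."""
    return sum(1 for label in labels if label == positive) / len(labels)

labels = ["relevant", "irrelevant", "relevant", "relevant"]
precision = retrieval_precision(labels)
```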

• Others
