This document shows you how to view, visualize, and interpret model evaluation results after you run an evaluation task. It covers the following topics:
- View evaluation results: Learn how to run an evaluation task and access the results object.
- Visualize evaluation results: See how to plot metrics in charts for comparison.
- Understand metric results: Explore the different types of evaluation metrics and what their results mean.
View evaluation results
After you define an evaluation task, run it to get the evaluation results:
```python
from vertexai.evaluation import EvalTask

eval_result = EvalTask(
    dataset=DATASET,
    metrics=[METRIC_1, METRIC_2, METRIC_3],
    experiment=EXPERIMENT_NAME,
).evaluate(
    model=MODEL,
    experiment_run=EXPERIMENT_RUN_NAME,
)
```
The `eval_result` object is an instance of the `EvalResult` class, which contains the results of the evaluation run. It has the following key attributes:
- `summary_metrics`: A dictionary of aggregated evaluation metrics for the evaluation run.
- `metrics_table`: A `pandas.DataFrame` table that contains evaluation dataset inputs, responses, explanations, and metric results per row.
- `metadata`: The experiment name and experiment run name for the evaluation run.
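For example, you can inspect these attributes directly. The dictionary keys shown in the comments below are illustrative and depend on the metrics you configured:

```python
# Aggregated metrics for the run, for example {"text_quality/mean": 4.1, "text_quality/std": 0.5}.
print(eval_result.summary_metrics)

# Per-row table with inputs, responses, explanations, and metric results.
print(eval_result.metrics_table.head())

# Experiment name and experiment run name for this evaluation run.
print(eval_result.metadata)
```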
The `EvalResult` class is defined as follows:

```python
@dataclasses.dataclass
class EvalResult:
    """Evaluation result.

    Attributes:
        summary_metrics: A dictionary of aggregated evaluation metrics for an evaluation run.
        metrics_table: A pandas.DataFrame table containing evaluation dataset inputs,
            responses, explanations, and metric results per row.
        metadata: the experiment name and experiment run name for the evaluation run.
    """

    summary_metrics: Dict[str, float]
    metrics_table: Optional["pd.DataFrame"] = None
    metadata: Optional[Dict[str, str]] = None
```
You can use helper functions to display the evaluation results in a Colab notebook.
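For example, the following is a minimal sketch of such a helper, assuming only `pandas` and IPython's display utilities; the `display_eval_result` name and layout are illustrative rather than part of the Vertex AI SDK:

```python
import pandas as pd
from IPython.display import Markdown, display


def display_eval_result(eval_result, title=None):
    """Displays the summary metrics and row-based metrics of an EvalResult."""
    if title:
        display(Markdown(f"## {title}"))

    # Show the aggregated metrics as a one-row table.
    display(Markdown("### Summary metrics"))
    display(pd.DataFrame([eval_result.summary_metrics]))

    # Show the per-instance inputs, responses, explanations, and scores.
    if eval_result.metrics_table is not None:
        display(Markdown("### Row-based metrics"))
        display(eval_result.metrics_table)


display_eval_result(eval_result, title="Evaluation results")
```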
Visualize evaluation results
You can plot summary metrics in a radar or bar chart for visualization and comparison between results from different evaluation runs. This visualization can be helpful for evaluating different models and prompt templates.
The following example visualizes four metrics (coherence, fluency, instruction following, and overall text quality) for responses generated using four different prompt templates. The radar and bar plots show that prompt template #2 consistently outperforms the other templates across all four metrics. This is particularly evident in its significantly higher scores for instruction following and text quality. Based on this analysis, prompt template #2 appears to be the most effective choice.
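As a rough illustration, the following sketch plots the summary metrics of several evaluation runs as a grouped bar chart with `matplotlib`. The run names, metric names, and scores are hypothetical placeholders for the `summary_metrics` dictionaries of your own runs:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical summary_metrics from three evaluation runs (one per prompt template).
runs = {
    "prompt_template_1": {"coherence/mean": 3.2, "fluency/mean": 3.8, "text_quality/mean": 3.1},
    "prompt_template_2": {"coherence/mean": 4.6, "fluency/mean": 4.5, "text_quality/mean": 4.7},
    "prompt_template_3": {"coherence/mean": 3.9, "fluency/mean": 4.0, "text_quality/mean": 3.6},
}

metrics = list(next(iter(runs.values())))
x = np.arange(len(metrics))  # one group of bars per metric
width = 0.8 / len(runs)      # width of each bar within a group

fig, ax = plt.subplots(figsize=(8, 4))
for i, (run_name, summary) in enumerate(runs.items()):
    ax.bar(x + i * width, [summary[m] for m in metrics], width, label=run_name)

ax.set_xticks(x + width * (len(runs) - 1) / 2)
ax.set_xticklabels(metrics)
ax.set_ylabel("Mean score")
ax.set_title("Summary metrics by prompt template")
ax.legend()
plt.show()
```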
Understand metric results
The following table provides a high-level comparison of the different evaluation metric types.
Metric Type | Description | Use Case |
---|---|---|
PointwiseMetric | Evaluates a single model's response based on a predefined rubric (for example, a 1-5 quality score). | Assess the absolute quality of a single model's output without comparison to another model. |
PairwiseMetric | Compares the responses from two models (a candidate and a baseline) and determines which one is better. | Directly compare two models or two different prompts to determine the superior option (A/B testing). |
Computation-based metrics | Calculate a score by comparing the model's response with a ground-truth reference response. | Perform objective, automated evaluation when a "correct" or reference answer is available. |
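For reference, the following sketch shows one way each metric type might be specified when you build the `EvalTask`. It assumes the `PointwiseMetric`, `PairwiseMetric`, and `MetricPromptTemplateExamples` classes from `vertexai.evaluation` and the string form of computation-based metrics; check the SDK reference for the exact names available in your version:

```python
from vertexai.evaluation import (
    EvalTask,
    MetricPromptTemplateExamples,
    PairwiseMetric,
    PointwiseMetric,
)

# Pointwise: score each response on its own against a rubric.
fluency_metric = PointwiseMetric(
    metric="fluency",
    metric_prompt_template=MetricPromptTemplateExamples.Pointwise.FLUENCY,
)

# Pairwise: have the judge model compare a candidate response with a baseline response.
pairwise_quality_metric = PairwiseMetric(
    metric="pairwise_text_quality",
    metric_prompt_template=MetricPromptTemplateExamples.Pairwise.TEXT_QUALITY,
    baseline_model=BASELINE_MODEL,  # placeholder for your baseline model
)

# Computation-based: compare the response with a reference using a formula such as ROUGE.
eval_task = EvalTask(
    dataset=DATASET,  # placeholder, as in the earlier example
    metrics=[fluency_metric, pairwise_quality_metric, "rouge_l_sum"],
    experiment=EXPERIMENT_NAME,
)
```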
The following sections list the components of instance-level and aggregate results included in `metrics_table` and `summary_metrics` for each metric type.
PointwiseMetric
Instance-level results
Column | Description |
---|---|
response | The response generated for the prompt by the model. |
score | The rating given to the response according to the criteria and rating rubric. The score can be binary (0 or 1), on a Likert scale (1 to 5, or -2 to 2), or a float (0.0 to 1.0). |
explanation | The reason from the judge model for the score. The service uses chain-of-thought reasoning to guide the judge model to explain its rationale for each verdict, which can improve evaluation accuracy. |
Aggregate results
Column | Description |
---|---|
mean score | Average score for all instances. |
standard deviation | Standard deviation for all the scores. |
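For example, a minimal sketch of reading these results from the `EvalResult` object, assuming a pointwise metric named `text_quality`; the `text_quality/...` column and key names follow the `<metric_name>/<field>` pattern but are illustrative here:

```python
# Per-instance scores and judge explanations for a pointwise metric named "text_quality".
print(
    eval_result.metrics_table[
        ["prompt", "response", "text_quality/score", "text_quality/explanation"]
    ].head()
)

# Aggregate results for the same metric.
mean_score = eval_result.summary_metrics["text_quality/mean"]
std_dev = eval_result.summary_metrics["text_quality/std"]
print(f"text_quality: mean={mean_score:.2f}, std={std_dev:.2f}")
```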
PairwiseMetric
Instance-level results
Column | Description |
---|---|
response | The response generated for the prompt by the candidate model. |
baseline_model_response | The response generated for the prompt by the baseline model. |
pairwise_choice | The model with the better response. Possible values are CANDIDATE, BASELINE, or TIE. |
explanation | The judge model's reason for the choice. |
Aggregate results
Column | Description |
---|---|
candidate_model_win_rate | Ratio of the number of times the judge model decided the candidate model had the better response to the total number of responses. Ranges between 0 and 1. |
baseline_model_win_rate | Ratio of the number of times the judge model decided the baseline model had the better response to the total number of responses. Ranges between 0 and 1. |
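A similar sketch for a pairwise metric, assuming it is named `pairwise_text_quality`; as before, the column and key names are illustrative:

```python
# Per-instance verdicts and explanations from the judge model.
print(
    eval_result.metrics_table[
        [
            "response",
            "baseline_model_response",
            "pairwise_text_quality/pairwise_choice",
            "pairwise_text_quality/explanation",
        ]
    ].head()
)

# Aggregate win rates, each between 0 and 1.
print(eval_result.summary_metrics["pairwise_text_quality/candidate_model_win_rate"])
print(eval_result.summary_metrics["pairwise_text_quality/baseline_model_win_rate"])
```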
Computation-based metrics
Instance-level results
Column | Description |
---|---|
response | The model's response being evaluated. |
reference | The reference response. |
score | The score calculated for each response and reference pair. |
Aggregate results
Column | Description |
---|---|
mean score | Average score for all instances. |
standard deviation | Standard deviation for all the scores. |
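As a rough sketch, a computation-based metric such as `exact_match` needs only `response` and `reference` columns in the dataset, with no judge model involved. The metric name, the bring-your-own-response pattern, and the result column names shown here are assumptions; confirm them against the SDK reference:

```python
import pandas as pd

from vertexai.evaluation import EvalTask

# Dataset that already contains model responses and ground-truth references.
dataset = pd.DataFrame(
    {
        "response": ["Paris is the capital of France.", "The capital is Berlin."],
        "reference": ["Paris is the capital of France.", "Berlin is the capital of Germany."],
    }
)

eval_result = EvalTask(
    dataset=dataset,
    metrics=["exact_match"],
    experiment=EXPERIMENT_NAME,  # placeholder, as in the earlier example
).evaluate()

# Per-instance scores plus the aggregated mean and standard deviation.
print(eval_result.metrics_table[["response", "reference", "exact_match/score"]])
print(eval_result.summary_metrics)
```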
Examples
The examples in this section demonstrate how to read and understand the evaluation results.
Example 1
This example shows a pointwise evaluation instance for `TEXT_QUALITY`. The score is 4 (on a scale of 1 to 5), which means the response is good. The explanation shows why the judge model gave the response a score of 4.
Dataset
- `prompt`: "Summarize the following text in a way that a five-year-old can understand: Social Media Platform Faces Backlash Over Content Moderation Policies\nA prominent social media platform finds itself embroiled in controversy as users and content creators express discontent over its content moderation policies. Allegations of biased censorship, inconsistent enforcement, and suppression of certain viewpoints have sparked outrage among users who claim that the platform is stifling free speech. On the other hand, the platform asserts that its policies are designed to maintain a safe and inclusive online environment. This controversy raises broader questions about the role of social media in shaping public discourse and the responsibilities of platforms in curating online content."
- `response`: "People are upset with a website because they think it's not being fair with what people can post. Some people say the website is stopping them from saying what they want. But the website says it's trying to keep everyone safe. This makes people wonder if websites should control what people can say online."
Result
- `score`: 4
- `explanation`: The summary in the response follows the instruction to summarize the context in a way that a five-year old can understand. It is grounded in the context and provides important details in its summarization. However, the language used in the response is a bit verbose.
Example 2
This example shows a pairwise comparison evaluation for `PAIRWISE_QUESTION_ANSWERING_QUALITY`. The `pairwise_choice` result shows that the judge model prefers the candidate response over the baseline response for answering the question in the prompt. The explanation clarifies that the candidate response is better because it is more helpful.
Dataset
- `prompt`: "Can you answer where France is based on the following paragraph? France is a country located in Western Europe. It's bordered by Belgium, Luxembourg, Germany, Switzerland, Italy, Monaco, Spain, and Andorra. France's coastline stretches along the English Channel, the North Sea, the Atlantic Ocean, and the Mediterranean Sea. Known for its rich history, iconic landmarks like the Eiffel Tower, and delicious cuisine, France is a major cultural and economic power in Europe and throughout the world."
- `response`: "France is a country located in Western Europe."
- `baseline_model_response`: "France is a country."
Result
- `pairwise_choice`: CANDIDATE
- `explanation`: The BASELINE response is grounded but does not fully answer the question. The CANDIDATE response, however, is correct and provides helpful details on the location of France.
What's next
- Try an evaluation example notebook.
- Learn about generative AI evaluation.