View and interpret evaluation results

This document shows you how to view, visualize, and interpret model evaluation results after you run an evaluation task.

View evaluation results

After you define an evaluation task, run it to get the evaluation results:

from vertexai.evaluation import EvalTask

# Define the evaluation task with your dataset, metrics, and experiment,
# then run it against the model that you want to evaluate.
eval_result = EvalTask(
    dataset=DATASET,
    metrics=[METRIC_1, METRIC_2, METRIC_3],
    experiment=EXPERIMENT_NAME,
).evaluate(
    model=MODEL,
    experiment_run=EXPERIMENT_RUN_NAME,
)

The eval_result object is an instance of the EvalResult class, which contains the results of the evaluation run. It has the following key attributes:

  • summary_metrics: A dictionary of aggregated evaluation metrics for an evaluation run.
  • metrics_table: A pandas.DataFrame table that contains evaluation dataset inputs, responses, explanations, and metric results per row.
  • metadata: The experiment name and experiment run name for the evaluation run.

The EvalResult class is defined as follows:

@dataclasses.dataclass
class EvalResult:
    """Evaluation result.

    Attributes:
      summary_metrics: A dictionary of aggregated evaluation metrics for an evaluation run.
      metrics_table: A pandas.DataFrame table containing evaluation dataset inputs,
        responses, explanations, and metric results per row.
      metadata: the experiment name and experiment run name for the evaluation run.
    """

    summary_metrics: Dict[str, float]
    metrics_table: Optional["pd.DataFrame"] = None
    metadata: Optional[Dict[str, str]] = None
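
For example, you can inspect these attributes directly after a run completes. The following is a minimal sketch that assumes the eval_result object from the earlier example; the exact key names in summary_metrics depend on the metrics you configured:

# Aggregated scores for the run, with keys such as "coherence/mean"
# (the key naming is an assumption and depends on your configured metrics).
print(eval_result.summary_metrics)

# Per-row inputs, responses, explanations, and scores as a pandas DataFrame.
print(eval_result.metrics_table.head())

# Experiment name and experiment run name for the evaluation run.
print(eval_result.metadata)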

You can use helper functions to display the evaluation results in a Colab notebook, for example as tables of summary metrics and row-based metrics.
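
Such helpers aren't shown here; the following is a minimal sketch of one, assuming pandas and IPython are available in the notebook (the name display_eval_result is illustrative, not part of the SDK):

import pandas as pd
from IPython.display import Markdown, display

def display_eval_result(eval_result, title=None):
    """Display summary metrics and row-based metrics as notebook tables."""
    if title:
        display(Markdown(f"## {title}"))
    # Summary metrics: a single row of aggregated scores for the run.
    display(pd.DataFrame([eval_result.summary_metrics]))
    # Row-based metrics: inputs, responses, explanations, and scores per row.
    display(eval_result.metrics_table)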

Visualize evaluation results

You can plot summary metrics in a radar or bar chart for visualization and comparison between results from different evaluation runs. This visualization can be helpful for evaluating different models and prompt templates.

The following example visualizes four metrics (coherence, fluency, instruction following, and overall text quality) for responses generated using four different prompt templates. The radar and bar plots show that prompt template #2 consistently outperforms the other templates across all four metrics. This is particularly evident in its significantly higher scores for instruction following and text quality. Based on this analysis, prompt template #2 appears to be the most effective choice.

Radar chart showing the coherence, instruction_following, text_quality, and fluency scores for all prompt templates

Bar chart showing the mean for coherence, instruction_following, text_quality, and fluency for all prompt templates
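
To produce comparable charts yourself, you can assemble the summary metrics from several runs into a DataFrame and plot them. The following is a minimal sketch, assuming matplotlib is installed and that eval_results is a dictionary you build yourself, mapping run labels (for example, prompt template names) to EvalResult objects; the "/mean" key suffix is an assumption about how your configured metrics are named:

import pandas as pd
import matplotlib.pyplot as plt

def plot_summary_metrics(eval_results, metric_keys):
    """Plot selected summary metrics side by side for several evaluation runs."""
    rows = {
        run_name: {key: result.summary_metrics.get(key) for key in metric_keys}
        for run_name, result in eval_results.items()
    }
    df = pd.DataFrame(rows).T  # one row per run, one column per metric
    df.plot(kind="bar", figsize=(10, 5), rot=0)
    plt.ylabel("Mean score")
    plt.title("Summary metrics by evaluation run")
    plt.tight_layout()
    plt.show()

# Example usage with hypothetical metric keys:
plot_summary_metrics(
    eval_results,
    ["coherence/mean", "fluency/mean", "instruction_following/mean", "text_quality/mean"],
)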

Understand metric results

The following list provides a high-level comparison of the different evaluation metric types:

  • PointwiseMetric: Evaluates a single model's response against a predefined rubric (for example, a 1-5 quality score). Use it to assess the absolute quality of a single model's output without comparing it to another model.
  • PairwiseMetric: Compares the responses from two models (a candidate and a baseline) and determines which one is better. Use it to directly compare two models or two different prompts and determine the superior option (A/B testing).
  • Computation-based metrics: Calculate a score based on the similarity between the model's response and a ground-truth reference response. Use them for objective, automated evaluation when a "correct" or reference answer is available.

The following sections list the components of instance-level and aggregate results included in metrics_table and summary_metrics for each metric type.

PointwiseMetric

Instance-level results

  • response: The response generated for the prompt by the model.
  • score: The rating given to the response as per the criteria and rating rubric. The score can be binary (0 and 1), Likert scale (1 to 5, or -2 to 2), or float (0.0 to 1.0).
  • explanation: The judge model's reason for the score. The service uses chain-of-thought reasoning to guide the judge model to explain its rationale for each verdict, which can improve evaluation accuracy.

Aggregate results

  • mean score: Average score for all instances.
  • standard deviation: Standard deviation for all the scores.
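
Because metrics_table is a pandas DataFrame, you can slice it to find low-scoring instances and read the judge model's explanations. The following is a minimal sketch, assuming a pointwise metric named text_quality whose per-row columns appear as text_quality/score and text_quality/explanation (the column naming is an assumption and depends on your metric names):

df = eval_result.metrics_table

# Rows that the judge model scored 2 or lower on a 1-5 scale.
low_quality = df[df["text_quality/score"] <= 2]

# Read the explanations to understand why those responses scored poorly.
for _, row in low_quality.iterrows():
    print(row["text_quality/score"], row["text_quality/explanation"])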

PairwiseMetric

Instance-level results

  • response: The response generated for the prompt by the candidate model.
  • baseline_model_response: The response generated for the prompt by the baseline model.
  • pairwise_choice: The model with the better response. Possible values are CANDIDATE, BASELINE, or TIE.
  • explanation: The judge model's reason for the choice.

Aggregate results

  • candidate_model_win_rate: Ratio of the number of times the judge model decided the candidate model had the better response to the total number of responses. Ranges between 0 and 1.
  • baseline_model_win_rate: Ratio of the number of times the judge model decided the baseline model had the better response to the total number of responses. Ranges between 0 and 1.
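
You can derive the same win rates from metrics_table by counting the pairwise_choice values. The following is a minimal sketch, assuming a pairwise metric named pairwise_text_quality whose choice column appears as pairwise_text_quality/pairwise_choice (the column naming is an assumption and depends on your metric name):

df = eval_result.metrics_table
choices = df["pairwise_text_quality/pairwise_choice"]

# Fraction of rows won by each side; TIE rows count toward neither win rate.
candidate_win_rate = (choices == "CANDIDATE").mean()
baseline_win_rate = (choices == "BASELINE").mean()
print(candidate_win_rate, baseline_win_rate)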

Computation-based metrics

Instance-level results

  • response: The model's response being evaluated.
  • reference: The reference response to compare against.
  • score: The score calculated for each response-reference pair.

Aggregate results

  • mean score: Average score for all instances.
  • standard deviation: Standard deviation for all the scores.
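
Because each row pairs a response with a reference, sorting by score is a quick way to surface the responses that diverge most from their references. The following is a minimal sketch, assuming a computation-based metric whose per-row score column appears as rouge_1/score (the metric and column names are assumptions):

df = eval_result.metrics_table

# The five rows with the lowest similarity to their references.
worst = df.sort_values("rouge_1/score").head(5)
print(worst[["response", "reference", "rouge_1/score"]])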

Examples

The examples in this section demonstrate how to read and understand the evaluation results.

Example 1

This example shows a pointwise evaluation instance for TEXT_QUALITY. The score is 4 (on a scale of 1 to 5), which means the response is good. The explanation shows why the judge model gave the response a score of 4.

Dataset

  • prompt: "Summarize the following text in a way that a five-year-old can understand: Social Media Platform Faces Backlash Over Content Moderation Policies\nA prominent social media platform finds itself embroiled in controversy as users and content creators express discontent over its content moderation policies. Allegations of biased censorship, inconsistent enforcement, and suppression of certain viewpoints have sparked outrage among users who claim that the platform is stifling free speech. On the other hand, the platform asserts that its policies are designed to maintain a safe and inclusive online environment. This controversy raises broader questions about the role of social media in shaping public discourse and the responsibilities of platforms in curating online content."
  • response: "People are upset with a website because they think it's not being fair with what people can post. Some people say the website is stopping them from saying what they want. But the website says it's trying to keep everyone safe. This makes people wonder if websites should control what people can say online."

Result

  • score: 4
  • explanation: The summary in the response follows the instruction to summarize the context in a way that a five-year-old can understand. It is grounded in the context and provides important details in its summarization. However, the language used in the response is a bit verbose.

Example 2

This example shows a pairwise comparison evaluation for PAIRWISE_QUESTION_ANSWERING_QUALITY. The pairwise_choice result shows that the judge model prefers the candidate response over the baseline response for answering the question in the prompt. The explanation clarifies that the candidate response is better because it is more helpful.

Dataset

  • prompt: "Can you answer where France is based on the following paragraph? France is a country located in Western Europe. It's bordered by Belgium, Luxembourg, Germany, Switzerland, Italy, Monaco, Spain, and Andorra. France's coastline stretches along the English Channel, the North Sea, the Atlantic Ocean, and the Mediterranean Sea. Known for its rich history, iconic landmarks like the Eiffel Tower, and delicious cuisine, France is a major cultural and economic power in Europe and throughout the world."
  • response: "France is a country located in Western Europe.",
  • baseline_model_response: "France is a country.",

Result

  • pairwise_choice: CANDIDATE
  • explanation: The BASELINE response is grounded but does not fully answer the question. The CANDIDATE response, however, is correct and provides helpful details on the location of France.
