View and interpret evaluation results

This document shows you how to view, visualize, and interpret model evaluation results after you run an evaluation task.

View evaluation results

After you define an evaluation task, run it to get the evaluation results:

from vertexai.evaluation import EvalTask

# Define the evaluation task with your dataset, metrics, and experiment,
# then run it against the model that you want to evaluate.
eval_result = EvalTask(
    dataset=DATASET,
    metrics=[METRIC_1, METRIC_2, METRIC_3],
    experiment=EXPERIMENT_NAME,
).evaluate(
    model=MODEL,
    experiment_run=EXPERIMENT_RUN_NAME,
)

The eval_result object is an instance of the EvalResult class, which contains the results of the evaluation run. It has the following key attributes:

  • summary_metrics: A dictionary of aggregated evaluation metrics for an evaluation run.
  • metrics_table: A pandas.DataFrame table that contains evaluation dataset inputs, responses, explanations, and metric results per row.
  • metadata: The experiment name and experiment run name for the evaluation run.

The EvalResult class is defined as follows:

@dataclasses.dataclass
class EvalResult:
    """Evaluation result.

    Attributes:
      summary_metrics: A dictionary of aggregated evaluation metrics for an evaluation run.
      metrics_table: A pandas.DataFrame table containing evaluation dataset inputs,
        responses, explanations, and metric results per row.
      metadata: the experiment name and experiment run name for the evaluation run.
    """

    summary_metrics: Dict[str, float]
    metrics_table: Optional["pd.DataFrame"] = None
    metadata: Optional[Dict[str, str]] = None
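
For example, you can inspect these attributes directly after a run completes. The following is a minimal sketch that assumes the eval_result object from the earlier example; the exact key names in summary_metrics depend on the metrics you configured:

# Aggregated scores for the run, with keys such as "coherence/mean"
# (the key naming is an assumption and depends on your configured metrics).
print(eval_result.summary_metrics)

# Per-row inputs, responses, explanations, and scores as a pandas DataFrame.
print(eval_result.metrics_table.head())

# Experiment name and experiment run name for the evaluation run.
print(eval_result.metadata)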

You can use helper functions to display the evaluation results in a Colab notebook, for example as tables of summary metrics and row-based metrics.
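
Such helpers aren't shown here; the following is a minimal sketch of one, assuming pandas and IPython are available in the notebook (the name display_eval_result is illustrative, not part of the SDK):

import pandas as pd
from IPython.display import Markdown, display

def display_eval_result(eval_result, title=None):
    """Display summary metrics and row-based metrics as notebook tables."""
    if title:
        display(Markdown(f"## {title}"))
    # Summary metrics: a single row of aggregated scores for the run.
    display(pd.DataFrame([eval_result.summary_metrics]))
    # Row-based metrics: inputs, responses, explanations, and scores per row.
    display(eval_result.metrics_table)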

Visualize evaluation results

You can plot summary metrics in a radar or bar chart for visualization and comparison between results from different evaluation runs. This visualization can be helpful for evaluating different models and prompt templates.

The following example visualizes four metrics (coherence, fluency, instruction following, and overall text quality) for responses generated using four different prompt templates. The radar and bar plots show that prompt template #2 consistently outperforms the other templates across all four metrics. This is particularly evident in its significantly higher scores for instruction following and text quality. Based on this analysis, prompt template #2 appears to be the most effective choice.

Radar chart showing the coherence, instruction_following, text_quality, and fluency scores for all prompt templates

Bar chart showing the mean for coherence, instruction_following, text_quality, and fluency for all prompt templates
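
To produce comparable charts yourself, you can assemble the summary metrics from several runs into a DataFrame and plot them. The following is a minimal sketch, assuming matplotlib is installed and that eval_results is a dictionary you build yourself, mapping run labels (for example, prompt template names) to EvalResult objects; the "/mean" key suffix is an assumption about how your configured metrics are named:

import pandas as pd
import matplotlib.pyplot as plt

def plot_summary_metrics(eval_results, metric_keys):
    """Plot selected summary metrics side by side for several evaluation runs."""
    rows = {
        run_name: {key: result.summary_metrics.get(key) for key in metric_keys}
        for run_name, result in eval_results.items()
    }
    df = pd.DataFrame(rows).T  # one row per run, one column per metric
    df.plot(kind="bar", figsize=(10, 5), rot=0)
    plt.ylabel("Mean score")
    plt.title("Summary metrics by evaluation run")
    plt.tight_layout()
    plt.show()

# Example usage with hypothetical metric keys:
plot_summary_metrics(
    eval_results,
    ["coherence/mean", "fluency/mean", "instruction_following/mean", "text_quality/mean"],
)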

Understand metric results

The following list provides a high-level comparison of the different evaluation metric types:

  • PointwiseMetric: Evaluates a single model's response against a predefined rubric (for example, a 1-5 quality score). Use it to assess the absolute quality of a single model's output without comparing it to another model.
  • PairwiseMetric: Compares the responses from two models (a candidate and a baseline) and determines which one is better. Use it to directly compare two models or two different prompts and determine the superior option (A/B testing).
  • Computation-based metrics: Calculate a score based on the similarity between the model's response and a ground-truth reference response. Use them for objective, automated evaluation when a "correct" or reference answer is available.

The following sections list the components of instance-level and aggregate results included in metrics_table and summary_metrics for each metric type.

PointwiseMetric

Instance-level results

  • response: The response generated for the prompt by the model.
  • score: The rating given to the response as per the criteria and rating rubric. The score can be binary (0 and 1), Likert scale (1 to 5, or -2 to 2), or float (0.0 to 1.0).
  • explanation: The judge model's reason for the score. The service uses chain-of-thought reasoning to guide the judge model to explain its rationale for each verdict, which can improve evaluation accuracy.

Aggregate results

  • mean score: Average score for all instances.
  • standard deviation: Standard deviation for all the scores.
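
Because metrics_table is a pandas DataFrame, you can slice it to find low-scoring instances and read the judge model's explanations. The following is a minimal sketch, assuming a pointwise metric named text_quality whose per-row columns appear as text_quality/score and text_quality/explanation (the column naming is an assumption and depends on your metric names):

df = eval_result.metrics_table

# Rows that the judge model scored 2 or lower on a 1-5 scale.
low_quality = df[df["text_quality/score"] <= 2]

# Read the explanations to understand why those responses scored poorly.
for _, row in low_quality.iterrows():
    print(row["text_quality/score"], row["text_quality/explanation"])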

PairwiseMetric

Instance-level results

  • response: The response generated for the prompt by the candidate model.
  • baseline_model_response: The response generated for the prompt by the baseline model.
  • pairwise_choice: The model with the better response. Possible values are CANDIDATE, BASELINE, or TIE.
  • explanation: The judge model's reason for the choice.

Aggregate results

  • candidate_model_win_rate: Ratio of the number of times the judge model decided the candidate model had the better response to the total number of responses. Ranges between 0 and 1.
  • baseline_model_win_rate: Ratio of the number of times the judge model decided the baseline model had the better response to the total number of responses. Ranges between 0 and 1.
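
You can derive the same win rates from metrics_table by counting the pairwise_choice values. The following is a minimal sketch, assuming a pairwise metric named pairwise_text_quality whose choice column appears as pairwise_text_quality/pairwise_choice (the column naming is an assumption and depends on your metric name):

df = eval_result.metrics_table
choices = df["pairwise_text_quality/pairwise_choice"]

# Fraction of rows won by each side; TIE rows count toward neither win rate.
candidate_win_rate = (choices == "CANDIDATE").mean()
baseline_win_rate = (choices == "BASELINE").mean()
print(candidate_win_rate, baseline_win_rate)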

Computation-based metrics

Instance-level results

  • response: The model's response being evaluated.
  • reference: The reference response to compare against.
  • score: The score calculated for each response-reference pair.

Aggregate results

  • mean score: Average score for all instances.
  • standard deviation: Standard deviation for all the scores.
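
Because each row pairs a response with a reference, sorting by score is a quick way to surface the responses that diverge most from their references. The following is a minimal sketch, assuming a computation-based metric whose per-row score column appears as rouge_1/score (the metric and column names are assumptions):

df = eval_result.metrics_table

# The five rows with the lowest similarity to their references.
worst = df.sort_values("rouge_1/score").head(5)
print(worst[["response", "reference", "rouge_1/score"]])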

Examples

The examples in this section demonstrate how to read and understand the evaluation results.

Example 1

This example shows a pointwise evaluation instance for TEXT_QUALITY. The score is 4 (on a scale of 1 to 5), which means the response is good. The explanation shows why the judge model gave the response a score of 4.

Dataset

  • prompt: "Summarize the following text in a way that a five-year-old can understand: Social Media Platform Faces Backlash Over Content Moderation Policies\nA prominent social media platform finds itself embroiled in controversy as users and content creators express discontent over its content moderation policies. Allegations of biased censorship, inconsistent enforcement, and suppression of certain viewpoints have sparked outrage among users who claim that the platform is stifling free speech. On the other hand, the platform asserts that its policies are designed to maintain a safe and inclusive online environment. This controversy raises broader questions about the role of social media in shaping public discourse and the responsibilities of platforms in curating online content."
  • response: "People are upset with a website because they think it's not being fair with what people can post. Some people say the website is stopping them from saying what they want. But the website says it's trying to keep everyone safe. This makes people wonder if websites should control what people can say online."

Result

  • score: 4
  • explanation: The summary in the response follows the instruction to summarize the context in a way that a five-year-old can understand. It is grounded in the context and provides important details in its summarization. However, the language used in the response is a bit verbose.

Example 2

This example shows a pairwise comparison evaluation for PAIRWISE_QUESTION_ANSWERING_QUALITY. The pairwise_choice result shows that the judge model prefers the candidate response over the baseline response for answering the question in the prompt. The explanation clarifies that the candidate response is better because it is more helpful.

Dataset

  • prompt: "Can you answer where France is based on the following paragraph? France is a country located in Western Europe. It's bordered by Belgium, Luxembourg, Germany, Switzerland, Italy, Monaco, Spain, and Andorra. France's coastline stretches along the English Channel, the North Sea, the Atlantic Ocean, and the Mediterranean Sea. Known for its rich history, iconic landmarks like the Eiffel Tower, and delicious cuisine, France is a major cultural and economic power in Europe and throughout the world."
  • response: "France is a country located in Western Europe.",
  • baseline_model_response: "France is a country.",

Result

  • pairwise_choice: CANDIDATE
  • explanation: The BASELINE response is grounded but does not fully answer the question. The CANDIDATE response, however, is correct and provides helpful details on the location of France.
