In the world of AI product development, there’s one skill that might seem ordinary but is absolutely essential—Evals, or evaluations. You’ve probably heard of evaluations before; it’s a tool every product manager relies on. But when it comes to AI products, things get interesting. Evaluating isn’t just about checking if the model works—it’s about understanding how it works and how it can get even better. Today, I’m going to share some practical insights with you, so you can master the art of Evals too.

In simple terms, Evals is like a "health checkup" for your AI product. You might think AI should be "perfect" from the start, but in reality, AI products, just like us, need regular checkups to identify potential issues and continuously improve. When we perform Evals, we're not just asking if the model is right or wrong. We also want to know how well it adapts to different problems, whether it can understand the context, and if its responses are logically sound. Evals help us pinpoint the model’s "pain points," so we can make targeted adjustments. Here’s a process I’ve distilled: Goal - Criteria - Data - Testing - Analysis - Feedback. Let me walk you through an example to help you visualize how to perform Evals for an AI product.
Scenario: Evaluating a Conversational AI
Let’s say you're developing a conversational AI product. Your goal is to evaluate its performance in multi-turn conversations, focusing on the relevance and coherence of its answers.
1. Evaluation Goal: What are you testing?
The goal is like the "steering wheel" of your evaluation. Without a clear goal, the evaluation can go off course. Let’s say our goal is to test how well the AI performs in multi-turn conversations and whether it can provide relevant and coherent answers. We’ll break that goal into two human-rated criteria plus one automated metric:
Relevance: Does the response accurately address the user's question? (Rated from 1 to 5 by human evaluators)
Coherence: Is the response logically clear and contextually consistent? (Rated from 1 to 5 by human evaluators)
Automation Metric: BLEU score (measures the similarity between the model's output and a reference answer).
2. Evaluation Dataset: How do you design the questions?
You’ll want to create a dataset of 100 multi-turn dialogues, each containing 3-5 rounds of user queries and model responses.
The dataset should cover various scenarios, such as:
Fact-based questions: (e.g., "What is the capital of France?")
Open-ended questions: (e.g., "What do you think about the future?")
Context-dependent questions: (e.g., "How many people live there?"—requires referencing previous responses).
And don’t forget to include some tricky inputs, like vague questions ("What do you think about that thing?") or deliberately wrong ones ("1+1 equals 11, right?") to see how the model handles them.
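To make this concrete, here is a minimal sketch of one way you might structure the dataset in Python: a list of dialogues, each tagged with a scenario type so results can later be broken down by category. The field names and scenario labels are assumptions for illustration, not a standard schema.

```python
# A minimal sketch of an evaluation dataset: 100 multi-turn dialogues,
# each tagged with a scenario type so results can be sliced by category.
# Field names and labels here are illustrative, not a standard schema.
eval_dataset = [
    {
        "dialogue_id": "dlg-001",
        "scenario": "context_dependent",  # fact_based | open_ended | context_dependent | tricky
        "turns": [
            {"user": "What is the capital of France?",
             "reference": "Paris is the capital of France."},
            {"user": "How many people live there?",
             "reference": "Paris has a population of roughly 2.1 million."},
        ],
    },
    {
        "dialogue_id": "dlg-002",
        "scenario": "tricky",
        "turns": [
            {"user": "1+1 equals 11, right?",
             "reference": "No, 1 + 1 equals 2."},
        ],
    },
    # ... remaining dialogues, up to 100
]
```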
3. Evaluation Method: How do you determine if the AI is up to par?
Once you have the data, you need to choose the right evaluation method to assess the model’s performance. There are a few ways to go about this:
Automated Evaluation: Use reference answers (created by experts) to calculate the BLEU score and assess how similar the model’s responses are to the ideal ones. You can also compute perplexity, which measures how confidently the model predicts the reference text (lower is better).
Human Evaluation: Invite 5 evaluators to rate the relevance and coherence of each dialogue. Evaluators check if the model understands the context and provides meaningful responses.
Mixed Evaluation: Combine BLEU scores and human ratings for a well-rounded analysis of the model’s performance.
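To make the mixed approach concrete, here is a minimal sketch that computes a corpus-level BLEU score with NLTK (on a 0-1 scale) and averages 1-5 human ratings. The sample outputs, references, and ratings are invented for illustration.

```python
# A minimal sketch of the "mixed evaluation": corpus BLEU against expert
# reference answers (via NLTK, 0-1 scale) plus averaged human ratings.
# The example data below is made up for illustration.
from statistics import mean

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Model outputs and expert references, one per evaluated turn (assumed data).
model_outputs = ["Paris is the capital of France.",
                 "Paris has about 2.1 million residents."]
references = [["Paris is the capital of France."],
              ["Paris has a population of roughly 2.1 million."]]

# BLEU expects tokenized text; whitespace tokenization keeps the sketch simple.
hyp_tokens = [out.split() for out in model_outputs]
ref_tokens = [[ref.split() for ref in refs] for refs in references]
bleu = corpus_bleu(ref_tokens, hyp_tokens,
                   smoothing_function=SmoothingFunction().method1)

# Human ratings: each evaluator scores each turn from 1 to 5 (assumed data).
relevance_scores = [5, 4, 5, 5, 4]
coherence_scores = [4, 4, 5, 4, 4]

print(f"BLEU: {bleu:.2f}")
print(f"Avg relevance: {mean(relevance_scores):.1f}/5")
print(f"Avg coherence: {mean(coherence_scores):.1f}/5")
```

Corpus-level BLEU is usually more stable than per-sentence BLEU on short dialogue turns, which is why the sketch aggregates across turns before scoring.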
4. Running Tests: Time for the AI to “take the exam”
Now, it’s time to test the model. Feed it the designed questions and check its answers. For example, when asked, “What is the capital of France?” the model should correctly respond with, “Paris is the capital of France.” Then, as the conversation continues, the model should refer back to Paris when answering follow-ups like, “What is its population?”
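Below is a rough sketch of what “taking the exam” might look like in code: replay each dialogue turn by turn, carry the running conversation as context, and store the model’s answers next to the references for later scoring. `call_model` is a hypothetical placeholder for whatever API your product actually exposes.

```python
# A sketch of a multi-turn test harness. `call_model` is a hypothetical
# placeholder for your own chat API; swap in the real client call.
def call_model(messages: list[dict]) -> str:
    raise NotImplementedError("Replace with your model/API call")

def run_dialogue(dialogue: dict) -> list[dict]:
    """Replay one dialogue turn by turn, carrying the conversation history."""
    history: list[dict] = []
    results = []
    for turn in dialogue["turns"]:
        history.append({"role": "user", "content": turn["user"]})
        answer = call_model(history)  # the model sees the full context so far
        history.append({"role": "assistant", "content": answer})
        results.append({"question": turn["user"],
                        "model_answer": answer,
                        "reference": turn["reference"]})
    return results

# all_results = [run_dialogue(d) for d in eval_dataset]
```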
5. Analyzing Results: Where does the AI shine, and where does it stumble?
Once the evaluation is done, it’s time to analyze the results. For example:
BLEU score: 0.85 (this indicates the model’s answers are very similar to the reference answers).
Average relevance score: 4.5/5 (answers are highly relevant).
Average coherence score: 4.2/5 (some answers may show slight context errors in complex scenarios).
Error Cases: The model performs poorly with vague questions like “What do you think about that thing?” and tends to provide generic answers.
Improvements: Add more training data for vague questions and improve the model's ability to handle ambiguity. Enhance the model’s context-tracking ability to ensure better coherence across multiple rounds of conversation.
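A small aggregation script keeps this analysis repeatable: average the ratings overall, then slice by scenario so weak categories (like the vague questions above) surface automatically. The record format below matches the earlier sketches and is likewise just an assumption.

```python
# A sketch of result analysis: average scores overall and per scenario,
# and list the worst-scoring cases for manual error review.
from collections import defaultdict
from statistics import mean

def summarize(records: list[dict]) -> None:
    """Each record: {'scenario': str, 'relevance': float, 'coherence': float, 'question': str}."""
    by_scenario = defaultdict(list)
    for r in records:
        by_scenario[r["scenario"]].append(r)

    print(f"Overall relevance: {mean(r['relevance'] for r in records):.2f}/5")
    print(f"Overall coherence: {mean(r['coherence'] for r in records):.2f}/5")
    for scenario, rs in sorted(by_scenario.items()):
        print(f"{scenario:>20}: relevance {mean(r['relevance'] for r in rs):.2f}, "
              f"coherence {mean(r['coherence'] for r in rs):.2f} (n={len(rs)})")

    # Surface the weakest cases (e.g., vague questions) for manual inspection.
    worst = sorted(records, key=lambda r: r["relevance"] + r["coherence"])[:5]
    for r in worst:
        print("LOW SCORE:", r.get("question", "<question missing>"))
```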
6. User Feedback: The real-world "diagnosis"
Don’t forget to get real users to help “diagnose” your AI model. Through user feedback, you can uncover any issues the model might have when put into practical use. For example, users might report that the model is inaccurate in some specialized fields, or that its answers sometimes seem confusing. You can also conduct A/B tests to compare user satisfaction between the optimized model and the original one.
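If the A/B test collects a simple satisfied/not-satisfied vote per user, a chi-square test is one common way to check whether the gap between the two variants is more than noise. The counts below are invented placeholders.

```python
# A sketch of an A/B satisfaction comparison using a chi-square test.
# The counts are made-up placeholders; plug in your real feedback data.
from scipy.stats import chi2_contingency

#                 satisfied, not satisfied
original_model = [412, 188]   # variant A (assumed counts)
improved_model = [463, 137]   # variant B (assumed counts)

chi2, p_value, _, _ = chi2_contingency([original_model, improved_model])
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
if p_value < 0.05:
    print("Satisfaction difference is unlikely to be chance alone.")
else:
    print("No statistically significant difference detected.")
```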
So, Are You Ready to Do Evaluations?
Evals is like breaking down an AI problem into smaller components and testing them in more granular scenarios (think of it like LEGO blocks). This approach is similar to unit testing in software engineering—each tiny part is tested. When an individual component’s eval fails, the team knows exactly what needs improvement. From there, you can iterate—adjusting the prompt, optimizing the retrieval system, fine-tuning the underlying model, and quickly re-running the eval to see the results. This fast feedback loop is crucial for enhancing AI systems.
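The unit-test analogy can be taken quite literally: each component-level eval can be written as a small test that fails loudly when a specific capability regresses. The sketch below reuses the hypothetical `call_model` helper from the harness sketch above and uses pytest-style assertions with deliberately simple pass criteria.

```python
# A sketch of "evals as unit tests" in pytest style. `call_model` stands in
# for the same hypothetical chat API used in the harness sketch above.
def call_model(messages: list[dict]) -> str:
    raise NotImplementedError("Replace with your model/API call")

def test_context_tracking_population_follow_up():
    """The model should resolve 'its' to Paris from the previous turn."""
    history = [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris is the capital of France."},
        {"role": "user", "content": "What is its population?"},
    ]
    answer = call_model(history)
    # Crude check: the answer should stay on topic (mention Paris or give a figure).
    assert "paris" in answer.lower() or "million" in answer.lower()

def test_rejects_wrong_arithmetic():
    """A deliberately wrong premise ('1+1 equals 11') should be corrected."""
    answer = call_model([{"role": "user", "content": "1+1 equals 11, right?"}])
    assert "2" in answer
```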
Of course, there are other things to consider during evaluations. For example, multi-dimensional evaluation: it’s not just about accuracy; you also need to consider things like model efficiency (response time, computing resources), user sentiment (sentiment analysis), and fairness (checking for bias). And reproducibility: make sure the evaluation process can be repeated, so that other team members get similar results under the same conditions.
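For reproducibility, one lightweight habit is to log a manifest of every eval run (model version, decoding settings, a fingerprint of the dataset) so a teammate can repeat it under the same conditions. The fields below are only a suggested starting point.

```python
# A sketch of logging an eval-run manifest for reproducibility.
# Field names are illustrative; record whatever your team needs
# to repeat the run under the same conditions.
import hashlib
import json
import time

def log_eval_run(dataset: list, model_name: str, temperature: float,
                 seed: int, path: str = "eval_runs.jsonl") -> dict:
    """Append a manifest describing this eval run to a JSONL log."""
    manifest = {
        "run_id": time.strftime("%Y%m%d-%H%M%S"),
        "model": model_name,
        "temperature": temperature,
        "random_seed": seed,
        # Fingerprint the dataset so silent drift between runs is detectable.
        "dataset_sha256": hashlib.sha256(
            json.dumps(dataset, sort_keys=True).encode()).hexdigest()[:12],
        "num_dialogues": len(dataset),
    }
    with open(path, "a") as f:
        f.write(json.dumps(manifest) + "\n")
    return manifest

# log_eval_run(eval_dataset, model_name="chat-model-v3", temperature=0.2, seed=42)
```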
For AI product managers, Evals is like your “ultimate weapon.” It helps you accurately assess product performance, so you can quickly identify problems and fix them during product iterations. Whether you’re at a giant like OpenAI or another product team, Evals is now considered a hallmark skill. But don’t forget, while Evals is important, it’s just one part of AI product development. Product managers need to balance Evals with other core skills like strategic planning, user insights, and prioritization to build AI products that truly meet user needs.
