Hamel Husain's AI use case

AI evals and product quality expert at Parlance Labs / Independent

Uses error analysis and targeted evals to improve AI products by inspecting failures, grouping them into patterns, and turning those patterns into tests that guide future product changes.

The problem

What was broken before AI

Many teams try to improve AI products by changing prompts and seeing whether the next answer feels better. That can work for demos, but it breaks down quickly in production. Failures are messy, users ask unpredictable questions, and a change that fixes one example may quietly make another class of examples worse. Without a structured way to inspect errors, teams end up arguing from anecdotes.

What changed

What the use case made possible

Hamel’s workflow treats AI failures as product data. Instead of jumping straight to a fix, the team reviews examples, identifies the type of failure, groups similar issues, and decides which failures matter most. From there, they build targeted evals that test the behavior directly. That makes it easier to compare prompt changes, retrieval changes, model changes, or product changes against the problems users actually experience.

Why this matters

Why this use case is worth studying

This use case is valuable because it gives teams a way to make AI quality less mysterious. The model is not just “good” or “bad.” It fails in patterns: missing context, making unsupported claims, refusing incorrectly, following the wrong instruction, formatting poorly, or misunderstanding the user’s intent. Once those patterns are visible, product teams can make better decisions about what to fix first.

Use this when

When this pattern applies

Use this pattern when an AI feature works some of the time, fails unpredictably, and the team is not sure what to fix next. It works especially well when you have real examples from users or internal testing, but the failures feel too varied to understand from aggregate metrics alone.

Exponential Builder analysis

01

Treat failures as product data.

AI quality improves faster when teams inspect real bad outputs instead of debating whether a new prompt “feels” better.

02

Name the failure before fixing it.

A clear taxonomy turns messy examples into decisions: retrieval issue, instruction failure, unsupported claim, formatting problem, or intent mismatch.

03

Build evals from actual pain.

Small targeted evals tied to recurring user-facing failures will guide product work better than broad benchmarks that do not reflect how your system breaks.

Who this is for

Best fit

AI product teams

Founders building LLM features

Engineers responsible for AI quality

PMs evaluating model behavior

Support or ops teams collecting AI failures

Anyone trying to improve an AI product without guessing from vibes

What to avoid

Mistakes and warnings

Where this pattern can go wrong if you copy it too literally.

Do not rely only on vibes or a few cherry-picked examples.

Avoid building a huge eval suite before understanding the main failure modes.

Do not treat benchmark scores as a substitute for product-specific evals.

Keep the failure taxonomy small enough for the team to use.

Update evals when real user behavior changes.

Public workflow preview

The shape of the workflow

A high-level look at how the use case works, with the reusable pattern made clear.

01

Collect real failures

Start with actual user examples, support tickets, logs, or test cases where the AI output was wrong, weak, or unhelpful.

02

Inspect examples manually

Read enough failures closely to understand what is actually going wrong instead of relying only on aggregate scores.

03

Group failures into patterns

Turn individual examples into categories such as retrieval failure, instruction failure, hallucination, formatting issue, or user-intent mismatch.

04

Build targeted evals

Create tests that measure the specific failure patterns that matter most to the product.

05

Use evals to guide changes

Compare prompt, retrieval, model, or UX changes against the failure categories instead of judging from vibes.
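
The last two steps can be sketched in a few lines of Python. This is a minimal illustration, not Hamel's actual tooling: the `EvalCase` shape, the `pass_rates` and `compare` helpers, and the two canned "systems" are all hypothetical stand-ins, and the sample answers are invented. Each eval case carries a failure category and a pass/fail check, so a change is judged per category rather than by one overall score.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    category: str                   # failure pattern this case targets
    user_input: str
    passes: Callable[[str], bool]   # failure criterion as a check on the output


def pass_rates(cases, system):
    """Run every case through `system`; return the pass rate per category."""
    passed, total = defaultdict(int), defaultdict(int)
    for case in cases:
        total[case.category] += 1
        if case.passes(system(case.user_input)):
            passed[case.category] += 1
    return {cat: passed[cat] / total[cat] for cat in total}


def compare(cases, before, after):
    """Return {category: (pass rate before, pass rate after)}."""
    old, new = pass_rates(cases, before), pass_rates(cases, after)
    return {cat: (old[cat], new[cat]) for cat in old}


# Hypothetical eval cases for two failure categories.
cases = [
    EvalCase("formatting issue", "List our plans",
             lambda out: out.startswith("- ")),
    EvalCase("formatting issue", "List supported regions",
             lambda out: out.startswith("- ")),
    EvalCase("hallucination", "What is the refund window?",
             lambda out: "30 days" in out),
]

# Canned outputs standing in for the product before and after a prompt change.
BEFORE = {
    "List our plans": "Basic, Pro",
    "List supported regions": "- US\n- EU",
    "What is the refund window?": "Refunds are accepted within 30 days.",
}
AFTER = {
    "List our plans": "- Basic\n- Pro",
    "List supported regions": "- US\n- EU",
    "What is the refund window?": "Refunds are accepted within 90 days.",
}

report = compare(cases, BEFORE.__getitem__, AFTER.__getitem__)
for category, (old, new) in report.items():
    print(f"{category}: {old:.0%} -> {new:.0%}")
```

In this invented example the change fixes the formatting category while regressing the hallucination category, which is the kind of trade-off a single aggregate score or a quick read of one answer would hide.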

Copy the pattern

The reusable idea

Pattern in one sentence

Turn real AI failures into a small set of named patterns, then build evals around those patterns so product changes can be judged clearly.

Reusable idea

Hamel’s workflow is a reminder that improving an AI product starts with looking at the mistakes directly. Before changing the prompt again, collect the failures, name the patterns, and decide which ones matter. Once the team agrees on the failure categories, evals become much more useful because they are tied to real product behavior instead of abstract model performance.

Steal this workflow

Use a failure-to-eval review table with these columns: user input, AI output, expected behavior, what went wrong, failure label, severity, proposed fix area, eval case needed. Start with 20–50 real failures, label each one in plain English, merge labels into 5–8 recurring failure modes, then create 3–5 eval cases for the highest-impact modes. Re-run those evals before and after any prompt, retrieval, model, or UX change, and add new production failures back into the table each week.
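
One way to keep that table is as plain records in code, sketched below. The field names mirror the columns listed above; the `FailureRow` class, the `rank_failure_modes` helper, the 1 to 3 severity scale, and the sample rows are assumptions made for illustration, not part of the original workflow. The helper counts rows per failure label and ranks labels by summed severity, a simple proxy for "highest-impact modes".

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FailureRow:
    user_input: str
    ai_output: str
    expected_behavior: str
    what_went_wrong: str
    failure_label: str
    severity: int            # assumed scale: 1 (minor) to 3 (blocks the user)
    proposed_fix_area: str
    eval_case_needed: bool


def rank_failure_modes(rows):
    """Return (label, count, total_severity), highest total severity first."""
    counts = Counter(row.failure_label for row in rows)
    severity = Counter()
    for row in rows:
        severity[row.failure_label] += row.severity
    return sorted(
        ((label, counts[label], severity[label]) for label in counts),
        key=lambda item: (item[2], item[1]),
        reverse=True,
    )


# Invented sample rows; a real table would start with 20-50 logged failures.
rows = [
    FailureRow("What is the refund window?", "Refunds take 90 days.",
               "Quote the documented policy", "Number not in the source",
               "unsupported claim", 3, "retrieval", True),
    FailureRow("Is there a student discount?", "Yes, 50% off.",
               "Say no discount is documented", "Invented a discount",
               "unsupported claim", 2, "prompt", True),
    FailureRow("List our plans", "Basic, Pro",
               "Return a bulleted list", "Ignored the list format",
               "formatting problem", 1, "prompt", False),
    FailureRow("Cancel my order", "Here is how to place an order.",
               "Explain cancellation steps", "Answered a different question",
               "intent mismatch", 2, "UX", True),
]

for label, count, total in rank_failure_modes(rows):
    print(f"{label}: {count} failures, total severity {total}")
```

Summed severity is only one possible ranking. A team might weight by user count or revenue impact instead; the point is that the ranking comes from labeled rows rather than from memory of the last bad answer.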

Suggested prompt

Review the following AI product failures and help turn them into an actionable eval plan. For each example, identify what the system got wrong, what the better behavior should have been, the most specific failure mode, and whether this should become an eval case. Then group the failure modes into a small taxonomy a product team can use, prioritize the categories by user impact, and draft test cases for the top categories with input, expected behavior, and clear failure criteria.
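
To use the prompt with more than a handful of examples, it helps to attach the failures in a consistent layout. The sketch below is one hypothetical way to do that: `build_review_prompt` and the dictionary keys are assumed names, the sample failures are invented, and the `instructions` string is shortened to the prompt's first sentence to keep the example brief. It only builds the text; sending it to a model is left to whatever client the team already uses.

```python
def build_review_prompt(instructions, failures):
    """Append numbered failure examples to the review instructions."""
    blocks = []
    for number, failure in enumerate(failures, start=1):
        blocks.append(
            f"Example {number}\n"
            f"User input: {failure['user_input']}\n"
            f"AI output: {failure['ai_output']}"
        )
    return instructions.strip() + "\n\n" + "\n\n".join(blocks)


# In practice, pass the full suggested prompt here.
instructions = (
    "Review the following AI product failures and help turn them "
    "into an actionable eval plan."
)

# Invented sample failures for illustration.
failures = [
    {"user_input": "What is the refund window?",
     "ai_output": "Refunds take 90 days."},
    {"user_input": "Cancel my order",
     "ai_output": "Here is how to place an order."},
]

prompt = build_review_prompt(instructions, failures)
print(prompt)
```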
