The problem
What was broken before AI
Many teams try to improve AI products by changing prompts and seeing whether the next answer feels better. That can work for demos, but it breaks down quickly in production. Failures are messy, users ask unpredictable questions, and a change that fixes one example may quietly make another class of examples worse. Without a structured way to inspect errors, teams end up arguing from anecdotes.
What changed
What the use case made possible
Hamel’s workflow treats AI failures as product data. Instead of jumping straight to a fix, the team reviews examples, identifies the type of failure, groups similar issues, and decides which failures matter most. From there, they build targeted evals that test the behavior directly. That makes it easier to compare prompt changes, retrieval changes, model changes, or product changes against the problems users actually experience.
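The review, label, group, and prioritize steps can be sketched in a few lines of Python. This is a minimal illustration rather than Hamel's actual tooling: the `ReviewedExample` record, the `rank_failure_modes` helper, and the failure-mode labels are all hypothetical names chosen for the example, and it assumes a reviewer has already attached one label to each failing example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ReviewedExample:
    """One manually reviewed interaction (hypothetical record shape)."""
    user_input: str
    model_output: str
    failure_mode: Optional[str]  # None means the reviewer found no problem


def rank_failure_modes(examples: List[ReviewedExample]) -> List[Tuple[str, int]]:
    """Group failing examples by label and return labels, most frequent first."""
    counts = Counter(
        e.failure_mode for e in examples if e.failure_mode is not None
    )
    return counts.most_common()


# Made-up review notes, for illustration only.
reviewed = [
    ReviewedExample("q1", "a1", "missing_context"),
    ReviewedExample("q2", "a2", "unsupported_claim"),
    ReviewedExample("q3", "a3", "missing_context"),
    ReviewedExample("q4", "a4", None),
    ReviewedExample("q5", "a5", "missing_context"),
    ReviewedExample("q6", "a6", "unsupported_claim"),
    ReviewedExample("q7", "a7", "formatting"),
]

for mode, count in rank_failure_modes(reviewed):
    print(f"{mode}: {count}")
```

Frequency is only one way to decide which failures matter most. A team might also weight each label by severity or business impact before choosing where to build the first targeted eval.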
Why this matters
Why this use case is worth studying
This use case is valuable because it gives teams a way to make AI quality less mysterious. The model is not just “good” or “bad.” It fails in patterns: missing context, making unsupported claims, refusing incorrectly, following the wrong instruction, formatting poorly, or misunderstanding the user’s intent. Once those patterns are visible, product teams can make better decisions about what to fix first.
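Once a pattern has a name, it can often be tested directly. As a sketch, take the "formatting poorly" pattern and assume, purely for illustration, that the feature is supposed to return a JSON object with the keys `answer` and `sources`. That schema and the function names below are assumptions, not part of the original workflow.

```python
import json
from typing import List

# Hypothetical output schema, assumed for this example only.
REQUIRED_KEYS = {"answer", "sources"}


def passes_format_eval(raw_output: str) -> bool:
    """Return True if the output is a JSON object with all required keys."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    # Iterating a dict yields its keys, so issubset checks key presence.
    return isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed)


def pass_rate(outputs: List[str]) -> float:
    """Fraction of outputs that pass the format eval (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(passes_format_eval(o) for o in outputs) / len(outputs)
```

Patterns such as unsupported claims or misread intent usually cannot be checked with a simple assertion like this and may need human review or a model-based grader instead. The same idea still applies: one eval per named failure pattern.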
Use this when
When this pattern applies
Use this pattern when an AI feature works some of the time, fails unpredictably, and the team is not sure what to fix next. It works especially well when you have real examples from users or internal testing, but the failures are too varied to understand from aggregate metrics alone.
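The limit of aggregate metrics is easy to show with a small sketch. The helpers below are hypothetical and the eval results are invented for illustration: the candidate change raises the overall pass rate while making one failure category worse.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Result = Tuple[str, bool]  # (failure category the eval targets, passed?)


def pass_rates_by_category(results: Iterable[Result]) -> Dict[str, float]:
    """Compute the pass rate separately for each category."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}


def regressions(
    baseline: Iterable[Result], candidate: Iterable[Result]
) -> Dict[str, Tuple[float, float]]:
    """Return categories whose pass rate dropped, as (before, after) pairs."""
    base = pass_rates_by_category(baseline)
    cand = pass_rates_by_category(candidate)
    return {
        c: (base[c], cand[c])
        for c in base
        if c in cand and cand[c] < base[c]
    }


# Invented results: 4 evals per category, before and after a prompt change.
baseline = (
    [("missing_context", p) for p in (True, False, False, False)]
    + [("formatting", p) for p in (True, True, True, True)]
)
candidate = (
    [("missing_context", p) for p in (True, True, True, True)]
    + [("formatting", p) for p in (True, True, False, False)]
)

overall_before = sum(p for _, p in baseline) / len(baseline)
overall_after = sum(p for _, p in candidate) / len(candidate)
print(overall_before, overall_after)
print(regressions(baseline, candidate))
```

In this made-up data the overall pass rate rises from 5 of 8 to 6 of 8, yet the formatting category drops from 4 of 4 to 2 of 4. A per-category breakdown surfaces that trade-off; a single aggregate number hides it.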


