How it works
New to AI evals?
Let's write your first.
Your family's dessert shop has a new robot that's botching every order. Fix him by writing your first eval, in three steps.
01
Annotate outputs
Place orders as a customer and note what goes wrong. This is error analysis: spotting concrete failures before you decide what to fix.
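One way to keep these notes is a small annotation log, sketched below. It is only an illustration: the `Annotation` class, its field names, and the wording of the note are assumptions, and the single entry reuses the sample interaction from this step.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Annotation:
    """One reviewed interaction (hypothetical schema, not from the game)."""
    order: str     # what the customer asked for
    response: str  # what the robot replied
    note: str      # free-text observation of what went wrong
    bucket: str    # short failure label assigned during review


annotations = [
    Annotation(
        order="One honey tart, please.",
        response="Of course, but have you considered our LOBSTER?",
        note="Pushes an unrelated item instead of confirming the order.",
        bucket="wrong tone",
    ),
]

# Tally how often each failure label appears across the log.
bucket_counts = Counter(a.bucket for a in annotations)
print(bucket_counts.most_common())
```

Keeping the free-text note separate from the bucket label lets you write down what you saw first and decide on categories later.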
level 1 · interaction 3
YOU
"One honey tart, please."
BOLT
"Of course, but have you considered our LOBSTER?"
02
Prioritize errors
Stack your annotations into buckets and find the one that hurts most, ranked by severity × frequency.
Annotation buckets · sorted
wrong tone · 7 ★
overconfident · 4
wrong item · 2
off-topic · 1
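As a rough sketch of ranking by severity × frequency: the counts below are read off the bucket chart, assuming the numbers and labels pair up in the order shown. The severity weights (a 1 to 3 scale) and the `prioritize` helper are hypothetical, chosen only to illustrate the arithmetic.

```python
# Frequencies from the bucket chart (pairing of numbers to labels is assumed).
counts = {"wrong tone": 7, "overconfident": 4, "wrong item": 2, "off-topic": 1}

# Hypothetical severity weights on a 1-3 scale; pick your own in practice.
severity = {"wrong tone": 2, "overconfident": 2, "wrong item": 3, "off-topic": 1}


def prioritize(counts, severity):
    """Return (bucket, score) pairs sorted by severity * frequency, highest first."""
    scores = {bucket: severity[bucket] * n for bucket, n in counts.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = prioritize(counts, severity)
for bucket, score in ranking:
    print(f"{bucket:14} {score}")
```

With these made-up weights, "wrong tone" still comes out on top (2 × 7 = 14), but a rare, severe bucket can outrank a common, mild one, which is the point of multiplying rather than counting alone.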
03
Write the eval
For the worst category, write PASS / FAIL criteria that detect it, then run that eval before each release to catch the bug if it comes back.
EVAL #001 · TONE
Pass if Bolt stays on the order and doesn't upsell.
Run it on a response
PASS "Of course! One honey tart, coming up."
FAIL "...but have you considered our LOBSTER?"
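A minimal sketch of how an eval like this might look in code, assuming a simple keyword heuristic. The `eval_tone` function and the `UPSELL_PATTERNS` list are hypothetical, and the check covers only the "doesn't upsell" half of the criterion. A real eval would likely need a broader rule set or a judge model.

```python
import re

# Hypothetical upsell cues; not an exhaustive or official list.
UPSELL_PATTERNS = [
    r"\bhave you considered\b",
    r"\bwould you like to (add|try)\b",
    r"\bhow about (our|the)\b",
]


def eval_tone(response: str) -> str:
    """Return "FAIL" if the response contains an upsell cue, else "PASS"."""
    text = response.lower()
    if any(re.search(pattern, text) for pattern in UPSELL_PATTERNS):
        return "FAIL"
    return "PASS"


# The two responses from the card above, with their expected verdicts.
cases = [
    ("Of course! One honey tart, coming up.", "PASS"),
    ("...but have you considered our LOBSTER?", "FAIL"),
]

results = [(resp, expected, eval_tone(resp)) for resp, expected in cases]
for resp, expected, got in results:
    print(f"{got}  expected={expected}  {resp}")

# True only if every verdict matches; a release check could gate on this.
all_match = all(expected == got for _, expected, got in results)
```

Running the same fixed cases before each release is what turns a one-off observation into a regression check.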
It's a short game. Hopefully it leaves you with a clearer picture of what evals are and why they matter.