A 10-minute browser game

Fix the medieval robot.
Learn AI evals.

An intuitive intro to AI evals

How it works

New to AI evals?
Let's write your first.

Your family's dessert shop has a new robot that's botching every order. Fix him by writing your first eval, in three steps.

You place orders as a customer and note what goes wrong. That's error analysis: spotting concrete failures before you decide what to fix.

level 1 · interaction 3

YOU

"One honey tart, please."

"Of course, but have you considered our LOBSTER?"

Stack your annotations into buckets and find the one that hurts most, ranked by severity × frequency.

Annotation buckets · sorted

★

wrong tone

overconfident

wrong item

off-topic
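A rough sketch of that ranking, in Python. The severity scale (1 to 5) and the counts below are made-up numbers for illustration, not figures from the game; the `Bucket` class and `rank_buckets` function are hypothetical names too.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    name: str
    severity: int   # hypothetical 1-5 scale: how much one failure hurts
    frequency: int  # hypothetical count of annotated orders in this bucket

    @property
    def score(self) -> int:
        # The ranking rule from the text: severity × frequency.
        return self.severity * self.frequency


def rank_buckets(buckets: list[Bucket]) -> list[Bucket]:
    """Return buckets sorted from most to least painful."""
    return sorted(buckets, key=lambda b: b.score, reverse=True)


# Illustrative data only; real numbers come from your own annotations.
buckets = [
    Bucket("off-topic", severity=2, frequency=3),
    Bucket("wrong item", severity=4, frequency=2),
    Bucket("wrong tone", severity=3, frequency=6),
    Bucket("overconfident", severity=3, frequency=4),
]

ranked = rank_buckets(buckets)
for bucket in ranked:
    print(f"{bucket.name}: {bucket.score}")
```

With these assumed numbers, "wrong tone" lands on top, so that is the bucket you would write an eval for first.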

For the worst bucket, you write PASS / FAIL criteria that detect it, then run that eval before each release to catch the bug if it comes back.

EVAL #001 · TONE

Pass if Bolt stays on the order and doesn't upsell.

Run it on a response

PASS "Of course! One honey tart, coming up."

FAIL "...but have you considered our LOBSTER?"
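Here is one way the eval above might look as code. This is a minimal sketch, not how the game implements it: it assumes "stays on the order" means the reply names the ordered item, and it detects upselling with a short, hypothetical phrase list. A real eval could just as well use a human reviewer or a model as the judge. The names `eval_001_tone` and `run_before_release` are made up for this example.

```python
import re

# Hypothetical upsell phrases; extend these from your own annotations.
UPSELL_PATTERNS = [r"have you considered", r"may i suggest", r"how about our"]


def eval_001_tone(ordered_item: str, response: str) -> str:
    """Return "PASS" if the reply stays on the order and doesn't upsell."""
    text = response.lower()
    # Assumption: staying on the order means the reply names the ordered item.
    stays_on_order = ordered_item.lower() in text
    upsells = any(re.search(pattern, text) for pattern in UPSELL_PATTERNS)
    return "PASS" if stays_on_order and not upsells else "FAIL"


def run_before_release(cases: list[tuple[str, str]]) -> bool:
    """Release gate: every (ordered_item, response) pair must pass."""
    return all(eval_001_tone(item, reply) == "PASS" for item, reply in cases)


good = "Of course! One honey tart, coming up."
bad = "Of course, but have you considered our LOBSTER?"

print(eval_001_tone("honey tart", good))  # PASS
print(eval_001_tone("honey tart", bad))   # FAIL
print(run_before_release([("honey tart", good), ("honey tart", bad)]))  # False
```

Running the gate before each release is what catches the lobster pitch if it ever sneaks back in.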

Short game. Hopefully a clearer picture of what evals are and why they matter.

From Granny

One more thing before you go.

Granny's Notes

Dear friend,

There's a robot in our shop and he's making a right mess of every order.

The dev folk say we need "evals" before they'll fix him.

Could you spare ten minutes to help us write our first one?

Granny ♡

(and the robot. his name is Bolt.)


yes I'll help fix the robot →