AI · April 2, 2026 · 7 min read
Ship AI features evals-first, not demo-first
By Tomás Albrecht
It has never been easier to build an AI demo that wows a room, and never easier to ship one that quietly erodes user trust. The gap between the two is measurement.
Evals are the spec
Before we wire a model into a product, we write the evals: a representative set of inputs and the answers we'd accept. That set becomes the spec. It tells us when retrieval is good enough, when a prompt change helped or hurt, and when we're done.
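To make "the evals are the spec" concrete, here is a minimal sketch of what such a set and its scoring might look like. Everything named here is hypothetical: `EvalCase`, `pass_rate`, the two sample cases, the stand-in `baseline` and `candidate` functions, and the `SHIP_THRESHOLD` value are illustrations, not our production harness. Exact-match scoring after normalization is also an assumption; real evals often need fuzzier grading.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    """One eval: an input plus every answer we would accept for it."""
    prompt: str
    accepted: tuple[str, ...]


def normalize(text: str) -> str:
    # Compare case-insensitively and ignore extra whitespace.
    return " ".join(text.lower().split())


def pass_rate(cases: Sequence[EvalCase], answer_fn: Callable[[str], str]) -> float:
    """Fraction of cases where answer_fn returns an accepted answer."""
    if not cases:
        raise ValueError("an empty eval set cannot serve as a spec")
    passed = 0
    for case in cases:
        accepted = {normalize(a) for a in case.accepted}
        if normalize(answer_fn(case.prompt)) in accepted:
            passed += 1
    return passed / len(cases)


# Hypothetical eval set; a real one would be far larger and drawn from real usage.
CASES = [
    EvalCase("What is the refund window?", ("30 days", "thirty days")),
    EvalCase("Which plan includes SSO?", ("Enterprise",)),
]


def baseline(prompt: str) -> str:
    # Stand-in for the current prompt/model: always gives the same answer.
    return "30 days"


def candidate(prompt: str) -> str:
    # Stand-in for a proposed prompt change.
    return "30 Days" if "refund" in prompt else "enterprise"


SHIP_THRESHOLD = 0.9  # assumed bar; choose one that fits the product

if __name__ == "__main__":
    for name, fn in (("baseline", baseline), ("candidate", candidate)):
        rate = pass_rate(CASES, fn)
        print(f"{name}: {rate:.0%} (ship: {rate >= SHIP_THRESHOLD})")
```

Running both variants against the same cases is what turns "did the prompt change help or hurt?" into a comparison of two numbers, and the threshold is what turns "are we done?" into a yes or no.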
Ground everything
A confident, wrong answer is worse than no answer. We ground responses in the customer’s own data and cite sources, so every reply is checkable. The model drafts; a human or a guardrail approves.
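One way to picture the guardrail half of "the model drafts; a human or a guardrail approves" is a check that every citation in a draft can be verified against the customer's documents. The sketch below is illustrative only: `Citation`, `Draft`, `approve`, `respond`, the `SOURCES` dictionary, and the fallback message are hypothetical names, and the verbatim-quote match is a deliberately strict assumption that a real system would likely relax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    source_id: str  # which customer document the claim points to
    quote: str      # the passage the draft claims to rely on


@dataclass(frozen=True)
class Draft:
    text: str
    citations: tuple[Citation, ...]


FALLBACK = "I couldn't find that in your documents."


def approve(draft: Draft, sources: dict[str, str]) -> bool:
    """Approve only drafts whose every citation checks out against a source."""
    if not draft.citations:
        return False  # an uncited reply is not checkable
    for citation in draft.citations:
        body = sources.get(citation.source_id)
        # Reject unknown sources, empty quotes, and quotes not found verbatim.
        if body is None or not citation.quote or citation.quote not in body:
            return False
    return True


def respond(draft: Draft, sources: dict[str, str]) -> str:
    # No answer beats a confident wrong one: fall back when the check fails.
    return draft.text if approve(draft, sources) else FALLBACK


# Hypothetical customer data and drafts, for illustration only.
SOURCES = {"policy.md": "Refunds are accepted within 30 days of purchase."}
GROUNDED = Draft(
    "Refunds are accepted within 30 days [policy.md].",
    (Citation("policy.md", "within 30 days"),),
)
UNGROUNDED = Draft(
    "Refunds are accepted within 90 days [policy.md].",
    (Citation("policy.md", "within 90 days"),),
)

if __name__ == "__main__":
    print(respond(GROUNDED, SOURCES))
    print(respond(UNGROUNDED, SOURCES))
```

The design choice worth noting is the default: when a citation cannot be verified, the reply is withheld rather than sent with a warning, which matches the stance that a confident, wrong answer is worse than none.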
If you can't measure it, you can't ship it to users — only to a slide.