Building eval systems that improve your AI product

Published at: 23 Dec 2025

If you’re a premium subscriber, add the private feed to your podcast app at https://add.lennysreads.com

In this episode, we dive into the fast-emerging discipline of AI evaluation with Hamel Husain and Shreya Shankar, creators of AI Evals for Engineers & PMs, the #1 highest-grossing course on Maven.

After training 2,000+ PMs and engineers across 500+ companies, Hamel and Shreya reveal the complete playbook for building evaluations that actually improve your AI product: moving beyond vanity dashboards to a system that drives continuous improvement.

In this episode, you’ll learn:
• Why most AI eval dashboards fail to deliver real product improvements
• How to use error analysis to uncover your product’s most critical failure modes
• The role of a “principal domain expert” in setting a consistent quality bar
• Techniques for transforming messy error notes into a clean taxonomy of failures
• When to use code-based checks vs. LLM-as-a-judge evaluators
• How to build trust in your evals with human-labeled ground-truth datasets
• Why binary pass/fail labels outperform Likert scales in practice
• Evaluation strategies for complex systems: multi-turn conversations, RAG pipelines, and agentic workflows
• How CI safety nets and production monitoring work together to create a flywheel of continuous product improvement

References:
• Read the newsletter: https://www.lennysnewsletter.com/p/building-eval-systems-that-improve
• AI Evals for Engineers & PMs: https://maven.com/parlance-labs/evals
• A Field Guide to Rapidly Improving AI Products: https://hamel.dev/blog/posts/field-guide/
• Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences: https://arxiv.org/abs/2404.12272
• Aman Khan: https://www.linkedin.com/in/amanberkeley/
• Anthropic: https://www.anthropic.com/
• Arize Phoenix: https://phoenix.arize.com/
• Braintrust: https://www.braintrust.dev/
• Beyond vibe checks: A PM’s complete guide to evals: https://www.lennysnewsletter.com/p/beyond-vibe-checks-a-pms-complete
• Frequently Asked Questions (And Answers) About AI Evals: https://hamel.dev/blog/posts/evals-faq/
• Hamel Husain: https://www.linkedin.com/in/hamelhusain/
• LangSmith: https://smith.langchain.com/
• Not Dead Yet: On RAG: https://hamel.dev/notes/llm/rag/not_dead.html
• OpenAI: https://openai.com/
• Shreya Shankar: https://www.linkedin.com/in/shrshnk/

Listen:
• YouTube: https://www.youtube.com/@lennysreads
• Apple: https://podcasts.apple.com/us/podcast/lennys-reads/id1810314693
• Spotify: https://open.spotify.com/show/0IIunA06qMtrcQLfypTooj
• Newsletter: https://www.lennysnewsletter.com/subscribe
• Substack: https://lennysreads.com/

Follow Lenny:
• Twitter/X: https://twitter.com/lennysan
• LinkedIn: https://www.linkedin.com/in/lennyrachitsky/
• Podcast: https://www.youtube.com/@LennysPodcast

About
Welcome to Lenny's Reads, where every week you’ll find a fresh audio version of my newsletter about building product, driving growth, and accelerating your career, read to you by the soothing voice of Lennybot.
