Join the AI Evals Course starting Jan 27, 2026:
https://maven.com/parlance-labs/evals?promoCode=tf-yt-c4

Three AI practitioners put LangSmith to the test in this technical deep dive into LLM evaluation frameworks. This is Part 1 of the Evals Bake-Off series, where industry experts evaluate popular evaluation tools.
Associated blog post:
https://hamel.dev/blog/posts/eval-tools/

Playlist for series:
https://www.youtube.com/playlist?list=PLgIaq8VgndJv43faVR55aFD4g9jfaEojg

In this episode, Harrison Chase from LangChain demonstrates LangSmith's approach to prompt engineering, dataset creation, annotation workflows, and error analysis. The panel provides real-time technical critique on UX decisions, automation trade-offs, and practical implementation challenges.
The demo covers LangSmith's approach to:
Prompt engineering workflows and iteration
Dataset creation and synthetic data generation
Bulk testing and evaluation infrastructure
Error analysis and annotation systems
Tracing and observability for LLM applications
Judges & Panel:
Bryan Bischof, Head of AI, Theory Ventures
Hamel Husain, Independent Developer
Shreya Shankar, Data Systems Researcher
Key Timestamps:
0:00 Introduction
2:20 Harrison's LangSmith Demo Begins
4:05 Prompt Engineering Workflow
6:56 Trace to Playground Navigation
10:10 System Prompt Modification
13:10 Dataset Creation and Schema Definition
18:00 Adding Synthetic Examples
22:00 Running Bulk Experiments
25:00 Synthetic Data Generation Concerns
31:00 Annotation Queue Interface
35:00 Error Analysis Workflow
40:00 Axial Coding with LLMs
44:00 Final Assessment and Scoring Discussion