How Lyft Builds Evals That Actually Matter in Production – Nick Ung
Speaker(s): Nick Ung (Head of Data Science / Safety / Customer Care, Lyft)
Session: Interrupt 2026 · Day 1 (May 13) · ~12:00 PM PT
Source: in-person audio recording, transcribed locally with Whisper large-v3.
Summary
Nick Ung, who leads data science for safety and customer care at Lyft, explains how his team builds evals for AI Assist, Lyft's customer care AI agent product. He frames AI evaluation as no different from traditional ML: offline evaluation acts as the 'quality gate' before shipping (don't use your users as test data), followed by continuous online monitoring (LangSmith traces, human-in-the-loop) once in production. For offline evals he built a lightweight simulator inspired by Sierra AI's tau-bench, where an LLM plays the user against the real agents, MCP outputs are mocked, and the goal is a diverse dataset approximating production. The recording cuts off mid-talk while he is arguing that scalar LLM-as-judge scores (e.g., 0-to-1 helpfulness) are hard to act on, and that LLM-as-judge should instead be framed around the specific tasks the agent must perform, using clear rubrics (his example: an educational rubric for educating users about policy).
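The simulator setup described in the summary can be sketched roughly as follows. This is a minimal illustration, not Lyft's implementation: the `Scenario` class, the example world states, intents, personas, and the `get_ride` tool name are all hypothetical. It shows only the idea of crossing state of world, user intent, and persona to get a diverse scenario set, and of returning canned tool outputs instead of making network calls.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Scenario:
    world_state: str  # which mocked backend state the tools will return
    intent: str       # what the simulated user wants
    persona: str      # how the simulated user behaves


# Hypothetical example values, chosen only for illustration.
WORLD_STATES = ["ride_completed", "ride_cancelled_by_driver"]
INTENTS = ["request_refund", "ask_about_policy"]
PERSONAS = ["terse", "frustrated", "chatty"]

# Hypothetical canned tool outputs keyed by world state (stand-ins for MCP responses).
MOCK_TOOL_OUTPUTS = {
    "ride_completed": {"get_ride": {"status": "completed", "fare_usd": 18.5}},
    "ride_cancelled_by_driver": {"get_ride": {"status": "cancelled", "fare_usd": 0.0}},
}


def build_scenarios():
    """Cross every world state, intent, and persona into one scenario each."""
    return [Scenario(w, i, p) for w, i, p in product(WORLD_STATES, INTENTS, PERSONAS)]


def mock_tool_call(scenario, tool_name):
    """Return the canned output for this scenario instead of calling a real service."""
    return MOCK_TOOL_OUTPUTS[scenario.world_state][tool_name]


scenarios = build_scenarios()
print(len(scenarios))  # 12
```

In a fuller setup, each scenario would seed an LLM-played user and the agent under test, and the resulting conversation would be stored as one trajectory for the evaluators.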
Key Points
- Lyft is public with ~79 million trips a month; AI Assist serves about 270,000 AI interactions per month; resolution rate cited around 5% (transcript figures are noisy), with the goal of fully resolving issues end to end
- AI Assist journey: started with simple rule-based logic before ~2020, then LLMs enabled 7+ AI agents and raised resolution rate from ~10% to ~35%
- Example agents: a multimodal driver-side damage-claim agent that returns a claim decision in ~15 minutes from uploaded photos, and a rider-side refund agent that contextually understands the situation and explains the refund decision
- Core eval philosophy: treat AI evals like traditional ML; offline evaluation is the 'quality gate' that decides whether to ship, since you don't want to use real customers as test data at scale
- After shipping, rely on continuous monitoring and online evals using LangSmith traces and human-in-the-loop feedback
- Built a lightweight offline simulator inspired by Sierra AI's tau-bench: real agents run against an LLM playing the user, MCP outputs are mocked, varying state of world / user intent / persona to generate diverse end-to-end trajectories
- Evaluators include LLM-as-judge plus code assertions checking end state; warns that scalar 0-to-1 scores (toxicity, helpfulness, naturalness) are hard to act on, and LLM-as-judge should be framed around the agent's specific tasks via clear rubrics (recording cuts off here)
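To make the last point concrete, here is a minimal sketch of a task-framed rubric next to a code assertion on end state. Everything in it is an assumption for illustration: the rubric criteria, the function names, and the keyword-based `stub_judge`, which stands in for a real LLM call so the example runs on its own. The point it shows is that per-criterion yes/no verdicts say which part of the task failed, which a single 0-to-1 score does not.

```python
from typing import Callable, Dict, List

# Hypothetical rubric for the task "educate the user about a policy".
EDUCATION_RUBRIC: Dict[str, str] = {
    "states_policy": "The reply states the relevant policy.",
    "explains_decision": "The reply explains how the policy applies to this case.",
    "gives_next_step": "The reply tells the user what they can do next.",
}


def grade(reply: str, judge: Callable[[str, str], bool]) -> Dict[str, bool]:
    """Ask the judge one yes/no question per rubric criterion."""
    return {name: judge(reply, text) for name, text in EDUCATION_RUBRIC.items()}


def failed_criteria(verdicts: Dict[str, bool]) -> List[str]:
    """List the criteria that were not met, in rubric order."""
    return [name for name, ok in verdicts.items() if not ok]


def end_state_ok(final_state: dict, expected: dict) -> bool:
    """Code assertion: every expected key has the expected value at the end."""
    return all(final_state.get(key) == value for key, value in expected.items())


# Stand-in judge for this demo only: a keyword check instead of an LLM call.
STUB_KEYWORDS = {
    EDUCATION_RUBRIC["states_policy"]: "policy",
    EDUCATION_RUBRIC["explains_decision"]: "because",
    EDUCATION_RUBRIC["gives_next_step"]: "you can",
}


def stub_judge(reply: str, criterion: str) -> bool:
    return STUB_KEYWORDS[criterion] in reply.lower()


reply = (
    "Our cancellation policy charges a fee after two minutes. "
    "You were charged because the ride was cancelled after five minutes."
)
verdicts = grade(reply, stub_judge)
print(failed_criteria(verdicts))  # ['gives_next_step']
print(end_state_ok({"refund_issued": True, "ticket": "closed"}, {"refund_issued": True}))  # True
```

Swapping `stub_judge` for a function that prompts an LLM with the reply and one criterion keeps the rest of the sketch unchanged.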
Notable Quotes
You don't want to use your user as test data.
The way we think about LLM as a judge is that it should be framed around the task that you want the AI agent to perform
Full Transcript
Timestamped transcript, auto-generated and lightly cleaned.
[00:00] Hello everyone. Happy Wednesday. It's the middle of the week. So we're going to speak with a local
building company. My name is Nick. I lead the data science and learning function for safety and
customer care at Lyft.
[00:31] And my team is building AI Assist, which is a customer care AI agent product for a lot of people
here. And I'm here to talk to you about how to build a user assistant that actually classes AI
agents into a product. So maybe just to play the show and help me over the years, AI agents should
pay for the job. I'm literally the CEO of that, so don't be embarrassed. I know it's one of those
things that we always forgot about but really shouldn't.
[01:04] Hopefully, you know, Lyft doesn't need any additional introductions. Hopefully, everyone on the
social knows Lyft. But just a couple of things to think about. Lyft is public, you know, 79
million trips a month. And our AI Assist today served about 270,000 AI interactions per
month, we've been able to build some AI vision and more instructions. Our resolution rate is 5%.
[01:34] AI resolution rate is actually 5%. So 5% makes me feel a bit low. But it really holds us up for the
high bar. And we want the agents to solve the customer issue fully, end to end, and not just like block
customer and agent support. I'm sure you all have experience with that with support chatbot. So I'm
going to quickly walk you through AI Assist's journey. You know, we started this journey in 2020
or before. We used very simple rule-based logic at that time.
[02:08] So imagine you say A, the box, and say B, then A. And we all know how LLMs have really transformed the
space. And with ETHOP, we've been able to build more than seven AI agents in the past. And our
resolution rate has really gone up from 10% to like 35% in climate. To give you a sense of what type
of AI agents we're building, we have an agent that handles charts, the skills, some types of
experience,
[02:41] and we have a little driver that's supposed to meet and call and report to us. On the driver's side,
we have automated damage claim processing, compliance data, yada yada, and the taxes. We all belong
in that. I just want to show a couple of really patients that I thought was really interesting that
we built. And also shout out to Akshay and Supamka coming down here, who actually did all my Gs.
We'll do a video on that.
[03:11] So, we have this AI agent. We have this agent on the driver's side that is a multi-modal, complex AI
agent that, when the driver uploads photos of their damage claim, we get back to them in like 15
minutes for a claim decision. On the rider's side, we have an agent that literally processes A-plus
automation or we've got logic in the back end and able to more contextually understand what the
rider is telling us and what their situations are, and explain how our refund decision for them is
very seamlessly displayed.
[03:44] And all that is to say is, you know, we wouldn't have been able to ship so many AI agents or test
and make sure that such complex AI agents actually perform in production without a great eval harness.
So, those of you who ship agent without an eval harness, we need that too. Cool. So, I'm not going to
walk you through AI systems. I'm not going to talk you through AI systems architecture, you know,
it's LangChain, LangGraph, LangSmith, some of the eval, and all that.
[04:17] But I'm more interested to talk to you about how we think about AI systems, or how we should build
an eval harness for a production system, right? A little bit about myself, I've been, for the longest, you
know, for my career, I've been doing data science for a while. I've been building machine learning
models. And my approach to think about AI eval, or AI agent, is no different than traditional
ML.
[04:49] So, but those of you who come from a data science background, or an engineering background, you're
used to working in the notebook, you've explored this model, you've trained on a huge data set,
you've had some kind of ground truth or regression for the computation path, and then you run this
offline evaluation, right? And see what the performance are, should we ship, should we not ship? And
so, I think about this as a quality gate, right? Can we launch? So I really want to emphasize on
that offline, offline eval piece. If that has stayed the way that we do ML engineering for the last
couple of years,
[05:22] why should AI engineers change their paradigm? So, really, really, really think about that. And the
offline eval really serves as the quality, quality gate. You don't want to use your user as test
data. But maybe you can use it as a test data. But maybe if you're a startup, you have five
customers, that's okay, but at the scale of that, we really don't want to use our customers as test
data. And once we got the agent in production, we have continuous monitoring,
[05:55] we can have online eval, getting all the LangSmith traces, human in the loop, all that good stuff,
that's good enough feedback on how we do things with API. So I'm going to go into each of these
components. I'm going to look at each of these components. I'm going to take a little bit more
detail. Thinking about offline evaluation, what does it mean? How is that different from
traditional ML engineering workflow? You know, we love working in notebook. We love just slacking
off CSV file and get a ground truth, right?
[06:27] But for AI engineering, it's a little bit more of a setup there. We really took inspiration from
tau-bench, which the fantastic team at Sierra AI published as a public benchmark that all the
other LLM, all the other AI labs are using as a benchmark. So for first offline evaluations, I
actually built a lightweight simulator. So right here, we have all the network agents, which is, you
know,
[07:00] the AI agents that we built for AI Assist, and then we have an LLM wizard, or, you know, we just use
one of the other cloud-style models, to role-play a user, so that we can generate this end-to-end
trajectory of the non-competent interactions. And then we have, we use a very complicated parameter
number. We have a demo cloud that, first of all, you know, what the user's intent's are, what is the
support's.
[07:30] Now we're looking at how do we mock the state of the world? If they're not blind, you're not blind.
If they're not, you're not calling, you know, you're making network calls to, so, our MCP, so in
offline setting, we're mocking those MCP outputs. And imagine if an agent called offline, we first
mock the MCP output with one of the things that's on the public cloud. So we've got this long chain
of trajectory.
[08:01] On this unit, so our company is, it is a communications, so it's a different, a combination of
different things, right? The state of the world, the user intent, the user persona, and all of that.
And the goal here really is to get a diverse data set that you can use to approximate production
data, right? You want to be able to process how the agent's going to perform when you're going to
turn it on. So for the, for the trajectory, we also define the evaluator, we use LLM as a judge.
[08:33] I think everyone's doing that. Code assertion, you know, we run something simple, simple type of
script that checks the end state of the world, the user condition is met, to find, to find the tail.
I'll get into LLM as a judge a little bit more. You know, I think when we started thinking about
LLM as a judge, and I'm sure like everyone here is looking to like LangSmith or any other
observability platform,
[09:03] you see, different platform framework comes with like a pre-built set of LLM as a judge, right? Things
like toxicity score, response helpfulness, conversation with the slicer, things like that. And we,
we kind of, you know, we kind of started the same way as well. We were defining things like to use
appropriate nodes, response helpfulness, naturalness, and all that, right? And these are things for
on the scale of zero to one,
[09:33] so like a scalar elementary. So what's the, what's the problem with this, right? Like for response
helpfulness, for example, what does a zero to one score score mean compared to a two to one score?
With a total of seven, how can you, how can you move the needle, right? And how, if you want to move
your overall score, what's the inside is this going to do? So we've been very in and out of this,
[10:03] we've been very in and out of what LLM, LLM as a judge should do. And hopefully, you know, no one
is using for helpfulness and agency functions. So the way we think about LLM as a judge is that it
should be framed around the task that you want the AI agent to perform, right? So your AI agent is
instructed to do task A, B, and C, and each of those tasks is then also done daily.
[10:33] So a very simple example that I've proved from Alan's next platform is the educational, educational
rubric. So one of the tasks that we want our AI agent to do is to educate the user about Lyft policy
or wider target or some other type of support scenario, right? So the benefit of doing this is
you're defining very, very clear success criteria for your agent. And by default, defining a rubric
like this,
[11:03] when you see is Russian that bell rubric, you can very quickly, you can very quickly,