Building Reliable Agents - Agent Evaluations

Harrison Chase
Summary
Harrison Chase from LangChain discussed the critical role of evaluations (evals) in moving AI agents from prototype to production quality. Key points include:
- The Problem: Quality is the primary blocker for deploying more agents into production, as identified in a survey of agent builders.
- Eval-Driven Development: This is the adopted technique to bridge the quality gap, involving continuous measurement of app performance.
- Three Types of Evals as a Continuous Journey:
- Offline Evals: Pre-production testing against a constructed dataset to compare models and prompts and to track performance over time (see the sketch after this summary).
- Online Evals: Scoring a subset of real-time production data to track app performance on actual user queries.
- In-the-Loop Evals: Occur during agent runtime to self-correct, improve response quality, and block bad responses. Beneficial for low error tolerance or long-running agents, despite increased time/cost.
- Core Components of Evals: Data and Evaluators:
- Data: Academic datasets are often unrepresentative. LangChain emphasizes building custom datasets. LangSmith facilitates this by allowing users to add traces (inputs, outputs, intermediate steps) to datasets. "Great evals start with great observability."
- Evaluators:
- Code-based: Deterministic, cheap, fast (e.g., exact match, regex, JSON validation).
- LLM-as-a-Judge: Powerful for complex/natural language outputs but tricky to set up and trust. LangSmith is launching private preview features (based on AlignEval and "Who Validates the Validators?") to simplify this and offer eval calibration.
- Human Annotation: Live user feedback (thumbs up/down) or background scoring via annotation queues.
- LangChain's Contributions:
- LangSmith: Provides tracing, dataset creation from traces, and upcoming features for LLM-as-a-judge setup and calibration.
- OpenEvals (Open Source Package): Offers off-the-shelf evaluators for common use cases (code linting, RAG, extraction, tool calling) and customizable evaluators (agent trajectories, chat simulations).
- Key Takeaway: Evals are a continuous journey, essential at all stages (offline, online, and increasingly in-the-loop) for building high-quality agents.
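As a concrete illustration of the offline-eval loop summarized above, here is a minimal sketch in plain Python. It is not LangSmith's API: `run_app`, the toy dataset, and the exact-match scorer are hypothetical stand-ins for your own app, your own dataset, and your own evaluators.

```python
# Minimal offline-eval sketch (illustrative only; not the LangSmith API).
# `run_app` stands in for the agent or LLM app being evaluated.

def run_app(question: str) -> str:
    # A real implementation would call your agent, chain, or model here.
    return "Paris" if "France" in question else "unknown"

dataset = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What is the capital of Japan?", "expected": "Tokyo"},
]

def exact_match(output: str, expected: str) -> float:
    """Ground-truth (reference) evaluator: 1.0 on an exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

scores = [exact_match(run_app(ex["input"]), ex["expected"]) for ex in dataset]
print(f"offline eval score: {sum(scores) / len(scores):.2f}")  # track this number over time
```

Rerunning the same loop after changing a model or prompt gives a like-for-like comparison on the same dataset.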
Transcript
Before this, we heard a bunch of talks about agents and how people were building. And one theme that was mentioned a few times was evals. And so for the next series of talks, we're going to deep dive into that. We've got a few speakers who are going to shed a lot of light on how they are using evals or thinking about new research in evals. But before that, I wanted to take a little bit of time to set the context for why we think evals are important, and some general things that we are doing in the space. The clicker is not working... there we go. All right, we're going to go all the way back. This is time travel; it's a feature in LangGraph. All right, so there we go. So we ran a survey of agent builders about six months ago where we asked them what was the biggest blocker for getting more agents into production. And the number one thing that they cited, by far, was quality. So we talked a little bit with Michele about the trade-offs between quality and latency and cost, and quality is still the top thing blocking people from getting to production. And in order to kind of like bridge that gap from prototype to production and increase that quality, one of the techniques that we've seen people adopt is eval-driven development. So using evals throughout a bunch of different stages of development to measure your app's performance and then make sure that you're constantly climbing that ladder of performance. And one of the things that I want to emphasize is that evals are really a continuous journey. So there are three different types of evals that we see people running. Most people are thinking about evals in maybe one, maybe two of these, but we think it's a continuous journey throughout. Speaker B: The whole life cycle. Speaker A: So what do I mean by that? First, let's talk about offline evals. This is probably what most people think of when they say or hear the word evals. This is before you go into production: you get some data set, you run your app against that data set, and then you measure performance using some evaluators to score it, and you can track its performance over time. You can compare different models, different prompts, things like that, and get a sense for whether the changes you're making are actually increasing or decreasing your app's performance on this data set that you've constructed. Of course, that data set isn't perfect. And so there's another type of eval called online evals, which we commonly see people doing. This is when you take your app that's running in production, you take some subset of the data that's coming into the app, and you score that. And so then you can start to track the performance of your app in real time as it's running on real production data, as these are real queries that users are sending in. So it's not a static kind of like golden set. And so these are the two types of evals that people are most familiar with. But we also think there's a third type of eval, which we call in-the-loop evals. And so these are evals that occur during the agent while it's running. So Michele talked a little bit about some of what they were doing at Replit with this, where they were adding some evals based on things like trying it out and testing it in the browser, or running checks on the code syntax. These are examples of running some evals in the loop to correct the agent as it's running. And then if it messes up, like in any of these scenarios here, you can feed it back into the agent and have it kind of like self-correct.
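To make that in-the-loop pattern a bit more concrete, here is a hedged sketch in plain Python. `generate` is a hypothetical stand-in for an agent or LLM call, and the checker is deliberately simple (valid JSON); none of this is a LangChain or Replit API.

```python
import json

def generate(prompt: str, feedback: str | None = None) -> str:
    # Hypothetical stand-in for an agent/LLM call. In a real agent, feedback
    # from a failed check would be appended to the prompt so it can self-correct.
    return '{"answer": "example"}'

def check_output(output: str) -> str | None:
    """In-the-loop evaluator: return an error message, or None if the output passes."""
    try:
        json.loads(output)
        return None
    except json.JSONDecodeError as exc:
        return f"Output is not valid JSON: {exc}"

def run_with_in_the_loop_eval(prompt: str, max_retries: int = 2) -> str:
    feedback = None
    for _ in range(max_retries + 1):
        draft = generate(prompt, feedback)
        feedback = check_output(draft)
        if feedback is None:
            return draft  # passed the runtime check, safe to return
    # Blocking the response entirely is the other option mentioned in the talk.
    raise RuntimeError("Response failed in-the-loop checks; blocking it.")
```

The extra generate/check round-trips are where the added latency and cost discussed next come from.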
And so you can add this in a bunch of different domains and use it to basically check the agent. And so this has some obvious benefits. It improves response quality: you're not monitoring it after the fact, it actually improves it before it responds, and it can block bad responses. The big downside is that this takes more time and it costs more money. And so we see this commonly being used when the tolerance for mistakes is really low, or when latency is not an issue. And as we see more and more long-running agents, I actually think that's a perfect time to start thinking about putting these in-the-loop evals into your agent. When we think about evals, there are generally kind of like two parts to evals that we see: the data that you run over, and then the evaluators that you use. And those three types of evals each handle the data and the evaluators a little differently. So in the offline sense, you've got your data set; in online evals, the data is the production data, and the scoring happens after the fact; in the loop, it's also the production data, but the scoring happens in the loop. And then the evaluators can be different as well. So when you have your golden-truth data set, you can compare against it, and so that's useful. Those are what we call ground-truth or reference evals. And then reference-free evals are when you don't have this ground truth, and this is what you do online or in the loop, because you don't know what the ground truth is. So basically, data and evaluators are the two parts of evals, no matter what type of eval you're doing. And so we want to make it easy for people to build data sets and run things over their data, as well as build their own evaluators. Because one thing that we've noticed is that all the academic data sets that you might see or get started with, those aren't representative of how users are using your application. They're oftentimes not even kind of like in the same domain. And so when we talk about data and evals, it's really about making it easy for any application developer to build a data set or build evaluators that are specific to their use case. So how do we help do that? On the data side, one thing we've talked about is tracing. Traces are what you run these online evaluators over. So we have really good tracing in LangSmith, and you can send everything there. We track not only the inputs and outputs, but all the steps as well. And so you can then run evaluators over these traces for the online evals part. We've also made it really, really easy to add these to a data set and start building up this ground-truth data set for offline evals. So there's a nice little button in LangSmith that you can click, add to data set. It will take this kind of like input-output pair, you can actually modify the output as well, and then it adds it to a data set. So we've tried to make it really easy for people to build up these data sets in LangSmith, powered by the observability. And so one of the things that we like to say is that great evals start with great observability. And that's how we think about it in LangSmith: they're tied together, they're not separate things. Now let's talk about evaluators for a little bit. So there are a few different types of evaluators that we see people using. First is maybe just using code to evaluate things. So this would be like exact match or regex or checking if it's valid JSON, things like that. These are great.
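For illustration, here are a few code-based evaluators of that kind, written as plain Python functions. The specific rules (an email-leak regex, a JSON validity check, a required-keys check) are made-up examples, not LangSmith or OpenEvals evaluators.

```python
import json
import re

def contains_no_email(output: str) -> bool:
    """Regex check: fail if the response contains something email-shaped."""
    return re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", output) is None

def is_valid_json(output: str) -> bool:
    """Structural check: does the output parse as JSON at all?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def has_required_keys(output: str, keys: tuple[str, ...] = ("answer", "sources")) -> bool:
    """Schema-ish check: the parsed JSON must be an object with the expected keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(k in parsed for k in keys)
```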
Checks like these are deterministic, they're cheap, they're fast to run, but oftentimes they don't catch all the things you want to catch, especially if you have natural-language responses. So one of the things that we see popping up here is using LLM-as-a-judge techniques: using an LLM to score the outputs of your agent or LLM application. And so this is useful because it can handle more complex things. There are some downsides to this: they're tricky to get working, and we'll talk about this later. But generally the idea of using an LLM to judge outputs is pretty promising. And the third type that we see is just good old human annotation. This can happen kind of like live from the user as they're using the app: you can collect thumbs up, thumbs down, send those to LangSmith, and track them there. Or you can have a human go in the background and use something like our annotation queues to score these runs. So one of the things that we've been building over the past month or so is a set of open-source evaluators to try to make it easy to get started with these evaluators. And so there are a few use cases that we think are common and that you can cover off the shelf. These include things like code, RAG, extraction, and tool calling. So for code, for example, we have some off-the-shelf utils that will lint Python code or lint TypeScript code, and then you can take those errors and feed them back into the LLM. This is great: you can use these off the shelf, and there's little configuration needed. But of course, for a lot of use cases you are going to want to configure evaluators to your specific use case and your specific domain. And so we have a few customizable evals also included in OpenEvals. One of these is LLM-as-a-judge, making it really easy to get started with that. A little bit more interesting are things around agent trajectories: taking in a trajectory of messages, passing it into one of these evaluators, and customizing it so you can choose what to look for. And then one of the things that we're launching today is chat simulations. So a lot of applications are chatbots, or they have some back-and-forth component. So sure, you can test a single kind of like interaction, but you often want to see how it performs in a conversational setting. And so we're launching some utils to both run and then score those simulations. These are all open source in the OpenEvals package. One of the most common techniques is LLM-as-a-judge evaluators. These can be really powerful, but they're also tricky to get working properly. You now have a separate prompt that you have to worry about. You have to prompt engineer this prompt, which is grading your other prompt. And so there's this extra work that goes into it. And so while this is powerful, we oftentimes see people struggling to set it up or to trust the process. And so we're launching in private preview some features specifically designed to help with this. So first, we're launching some work that's based off of AlignEval, which is some research by Eugene Yan. You'll hear from Shreya later on; a lot of this work builds on hers. She wrote a great paper called "Who Validates the Validators?". And so a lot of this work we're incorporating into LangSmith to make it really easy to get started with LLM-as-a-judge techniques. And then, of course, after you get started, how do you know that it's working?
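As a rough sketch of what an LLM-as-a-judge evaluator looks like under the hood, here is a minimal example using an OpenAI-style chat completions client. This is not the OpenEvals or LangSmith implementation; the grading prompt, the model name, and the crude score parsing are all assumptions made for illustration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Reply with a single integer from 1 (poor) to 5 (excellent)."""

def llm_as_judge(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    """Toy LLM-as-a-judge: ask a second model to score the first model's output."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
    )
    text = response.choices[0].message.content or ""
    digits = [c for c in text if c.isdigit()]
    # Crude parsing; the judge prompt itself needs prompt engineering and
    # calibration, which is exactly the difficulty described above.
    return int(digits[0]) if digits else 0
```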
So we're launching some eval calibration techniques in LangSmith where you can blindly score how the evals are doing, and then compare them and watch that over time. And as they start to drift apart, then you go back into this align-and-evaluate kind of like state. So this is in private preview. We're really excited to launch it and to figure out what the right UX for building these LLM-as-a-judge evaluators is. The thing that I want to emphasize is that evals are a continuous journey. You're not done with it once you build a data set and run it once; you're going to want to keep on running it. You're not done with it just because you did it on the offline part; you're going to want to do it online. And eventually, I think more and more, we're going to start building it into the agents themselves. And so this is one of the key takeaways that I'd kind of leave you with. With that, I want to hand it over to some of the amazing speakers that we have.
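To close, here is a rough, made-up illustration of the calibration idea from this last section: compare an LLM judge's pass/fail verdicts against blind human labels and watch the agreement rate over time. The data and the 80% threshold are arbitrary, and LangSmith's actual calibration workflow is not shown here.

```python
# Illustrative calibration check (made-up data, arbitrary threshold).
# In practice you would sample recent production runs that have both an
# LLM-judge score and a blind human score.
judge_verdicts = [True, True, False, True, False, True]
human_verdicts = [True, True, True, True, False, True]

agreements = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
agreement_rate = agreements / len(human_verdicts)

print(f"judge/human agreement: {agreement_rate:.0%}")
if agreement_rate < 0.8:
    print("Judge may be drifting; revisit the judge prompt and realign.")
```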