Building Reliable Agents - Evaluation Challenges

Sayantan Mukhopadhyay

Summary
Summary of Building an LLM as a Judge for Money Transfer Requests
- Initial Evaluation Challenges: The team started by evaluating their model's performance against humans. Initial results showed that while humans achieved 80% accuracy, the model lagged at 51%, highlighting the need for improvement.
- Fine-Tuning and Model Selection: Fine-tuning improved the score to 59%. Further gains came from prompt changes and from moving to a larger model (from 4o-mini to 4o), each iteration improving performance until they reached 79%, near human level.
- Iterative Process and Tools: The process involved iterating quickly, facilitated by a robust online tracing system. This allowed efficient testing of different prompts and models, underscoring the importance of having the right tools in place.
- Focus on Nuances: Beyond technical checks like hallucination detection, the team highlighted the need for empathy and an appropriate tone of voice. These elements are crucial for real-world applications but require careful consideration in model training.
- Conclusion and Takeaways: The success underscores that building reliable agents demands consistent evaluation against real-world scenarios. Without this focus, the quality and effectiveness of the models would be compromised.
In essence, the journey demonstrates the necessity of persistent effort and thorough evaluation to ensure AI systems meet user needs effectively.
Transcript
Pipelines, just to give a picture of the kind of research that we do at Berkeley around data processing. What do I mean by data processing? Organizations have lots of unstructured data documents that they want to analyze, extract insights from, and make sense of. For example, in customer service reviews they might want to extract themes, summarize them, and figure out actionable next steps. Maybe they want to look through their emails to figure out, for a sales agent, which clients they could have won but didn't, and how to move forward from that. All sorts of domains have these kinds of tasks: in traffic safety or aviation safety, what are the causes of accidents, and how can we mitigate them? And when people write pipelines to use LLMs to solve these problems, their number one complaint is that this is really hard, it doesn't work. So I want to put you in that mindset to figure out why.

Imagine you are a real estate agent trying to find a place that meets your client's needs, and your client has a pet; he's a dog owner. You might want to know: which neighborhoods in, say, SF have the most restrictive pet policies? I want to tell that to my client. So you might write this pipeline as a sequence of LLM operations on a bunch of real estate rental contracts. You might start with a map operation, which for every document gives you some extracted output, then more map operations, for example to categorize or classify, and then aggregate these clauses together, maybe by neighborhood or by city, and come up with a summary or report for each. People write these pipelines, and the number one thing they tell us is: my prompts don't work. And the number one solution they're given is: oh, just iterate on your prompts. So in today's talk I really want to dive into what this kind of iteration entails, why this problem is hard, and how you can feel like you're not just hacking away at nothing to make progress.

At UC Berkeley, we put our research hats on, or HCI hats on, and studied how people write these kinds of data processing pipelines. The very first thing we observed is that people did not even know what the right question was, and many of you might resonate with this a little bit. In our real estate agent example, someone might think they want to extract all pet policy clauses and then realize, only after looking at the documents and the outputs, that they only wanted dog and cat pet policy clauses. Then, once they feel like they know the right question to ask, they need to figure out how to specify that question. We all know when working with LLMs that we need very well specified, clear, unambiguous prompts, and things that we as humans think are unambiguous are actually pretty ambiguous. For example, just saying "dog and cat policy clauses" doesn't tell the LLM much; maybe you need to spell out weight limits or restrictions, breed restrictions, quantity limits, and so forth to improve the LLM's performance.

So, zooming out a bit, what do these challenges mean? Iteration reveals a lot of these insights, if you do it correctly. But when we help people build data processing pipelines, what we really want to do is close the gaps between the user or developer, the data they're trying to query and make sense of, and the pipeline being written. And as researchers, we realized that there's so much tooling in the bottom half, in LLM accuracy.
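To make the shape of such a pipeline concrete, here is a minimal sketch in plain Python with a stubbed llm() call, covering the map, classify, and aggregate steps described above. Function names and prompts are illustrative assumptions, not DocETL's actual API.

```python
# Hypothetical sketch of the map -> classify -> aggregate pipeline described
# above, in plain Python with a stubbed llm() call.
from collections import defaultdict

def llm(prompt: str) -> str:
    """Stand-in for a real LLM call (an actual client would go here)."""
    return "stubbed response"

def extract_pet_clauses(contract_text: str) -> str:
    # Map step: pull out pet policy clauses from one rental contract.
    return llm(f"Extract all pet policy clauses verbatim:\n\n{contract_text}")

def classify_clause(clause: str) -> str:
    # Second map step: categorize each clause (breed restriction, weight limit, ...).
    return llm(f"Classify this pet clause into one category:\n\n{clause}")

def summarize_neighborhood(clauses: list[str]) -> str:
    # Reduce step: aggregate clauses per neighborhood into a short report.
    return llm("Summarize how restrictive these pet policies are:\n\n" + "\n".join(clauses))

def run_pipeline(contracts: list[dict]) -> dict[str, str]:
    by_neighborhood = defaultdict(list)
    for doc in contracts:
        clause = extract_pet_clauses(doc["text"])
        category = classify_clause(clause)
        by_neighborhood[doc["neighborhood"]].append(f"[{category}] {clause}")
    return {n: summarize_neighborhood(c) for n, c in by_neighborhood.items()}

if __name__ == "__main__":
    print(run_pipeline([{"neighborhood": "Mission", "text": "No dogs over 25 lbs."}]))
```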
When you have a very well specified pipeline, there is tooling to make sure it generalizes to all documents and all our needs. But there's virtually no tooling for the data understanding and intent specification gaps. So in today's talk I want to spend the rest of the time telling you how we're thinking about closing these gaps, and insights that you might apply when you are trying to iterate on your own problems.

First, I'll talk about the data understanding gap. Going back to our real estate rental contract example, the core challenge is: what are the types of documents in the data, and what are the unique failure modes that happen for each type of document? For example, all of these types of pet clauses might exist: breed restriction clauses, clauses on the number of pets, service animal exemptions. Many people don't even know this until they look at the data. So when we're building tools, we might want to automatically extract them for end users, so they can look at examples of failure modes for each type.

Then we see that there's a really, really long tail of failure modes, and this is not unique to real estate settings; we observe it for pretty much any application, and for ML in general. There are so many different types of failure modes that they're difficult to make sense of. For example, clauses might be phrased unusually and the LLM might miss extracting them. LLMs might overfit to certain keywords and extract things that are unrelated because a keyword is superficially related, and so forth. It's not uncommon to see people flag hundreds of issues in a 1,000-document collection.

Putting this all together and zooming out, what does it mean to close this data understanding gap? I mentioned that we want tooling to help people find anomalies and failure modes in their data, but also to design evals on the fly for each of these different failure modes. Some of the solutions we're prototyping in our research stack have people look at clusters of outputs and annotate them in situ, so that we can help organize them. To give a concrete example of what a real estate agent, or a developer working for one, might do: for each failure mode, whether we organized it or they labeled it themselves, we can identify that the examples have been labeled the same way and say, okay, here's a dataset you can design evals on. And maybe there are some potential strategies, for example generating alternative phrasings with an LLM, or doing keyword checks in hybrid with LLMs (see the sketch below). This is where it gets a little fuzzy and interesting: how do we build these for our users? I think a lot of different domains have very exciting versions of this challenge.

Now I want to move over to the intent gap: once we know there are lots of failure modes in our data, how do we even go about improving the pipeline? Much of this revolves around reducing query ambiguity or prompt ambiguity. Maybe I want to change "pet related clauses" to "dog and cat related clauses". This is a very simple example, but you can imagine that with hundreds of failure modes, figuring out how to translate them into actual pipeline improvements is very difficult. Do we prompt engineer? Do we add new operations? Do we do task decomposition? Do we look at subsections of the document and unify the results? People often get very lost in that.
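As a rough illustration of the keyword-checks-in-hybrid-with-LLMs idea mentioned above, here is a minimal sketch under stated assumptions: a cheap keyword and substring check handles the obvious cases, and an LLM check runs only for the ambiguous ones. The llm_judge() stub and the regex are invented for illustration.

```python
# A minimal sketch of a hybrid keyword + LLM check over a failure-mode dataset.
import re

PET_KEYWORDS = re.compile(r"\b(dog|cat|pet|breed|leash|animal)\b", re.IGNORECASE)

def llm_judge(clause: str, extracted: str) -> bool:
    """Stand-in for an LLM call answering: did the extraction capture this clause?"""
    return True  # replace with a real model call

def clause_was_captured(clause: str, extracted: str) -> bool:
    if clause.strip().lower() in extracted.lower():
        return True                      # present verbatim: cheap pass
    if not PET_KEYWORDS.search(extracted):
        return False                     # extraction has no pet content at all: cheap fail
    return llm_judge(clause, extracted)  # paraphrased or ambiguous: fall back to the LLM

# Run the check over a small failure-mode dataset labeled during iteration.
failure_mode_dataset = [
    {"clause": "Tenants may keep one dog under 25 lbs.",
     "extracted": "Tenants may keep one dog under 25 lbs."},
]
results = [clause_was_captured(ex["clause"], ex["extracted"]) for ex in failure_mode_dataset]
print(f"pass rate: {sum(results)}/{len(results)}")
```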
So one of the solutions we're prototyping, and that's available in our DocETL project, is the ability to take notes that users provide and automatically translate them into prompt improvements, in an interface where people can interactively give feedback, edit, and maintain their revision history, so it's fully steerable.

All right, now my last slide. You might be wondering: why does this matter? I might not be building agents for data processing; what can I take away from this? Great question. Here are my takeaways. First, in every single domain where people are processing data, we find that evals are very, very fuzzy and they're never done. People are always collecting new failure modes as they run pipelines, always creating new subsets of documents or example traces that will serve as evals to run in the future. And failure modes really hide in the long tail: we see people constantly checking for tens or twenties of different failure modes. The next thing we've observed to be very helpful is when our users unpack the cycle of iteration into distinct stages. I mentioned that people try strategies like query decomposition or prompt optimization to turn a well specified pipeline into a generalizable one. However, we find that people first need to figure out how to specify their pipeline in the first place. So first, understand your data; do this as a stage by itself. Don't worry about having good accuracy, just learn what's going on in your failure modes. Second, figure out how to get your prompts as well specified as possible; make sure there's no ambiguity, such that if you sent them to a human, they would not be misinterpreted. Only then do people get really good gains from applying well known accuracy optimization strategies. With that, thanks so much. Feel free to email me at shreyashankar@berkeley.edu, and I'm happy to chat with anyone afterwards.

Speaker B: Thank you, Shreya. Next up is Sayantan Mukhopadhyay, General Manager at Nubank. He'll be discussing challenges in building reliable systems and how they use LangSmith to help build reliable agents.

Hello everyone, my name is Sayantan, and it's very hard to follow a full-blown researcher from UC Berkeley who's figuring out how to write the best prompt in the world. But I decided to talk about Nubank here. As I said, I have been at Nubank for the last four years. We are building the AI private banker for all of our customers. The idea is that people are notoriously bad at making financial decisions; even which subscription to cancel is a difficult decision for many people. So imagine decisions about loans and investments, especially in a situation like Nubank's, where we are the third largest bank in Brazil and the fastest growing bank in Mexico and Colombia. We have given first credit card access to 21 million people in Brazil in the last five years. And to be fair, this number is actually outdated, because we just released our numbers on yesterday's call and all of these numbers have gone up since then. The interesting part is that from the very beginning we got into the world of ChatGPT. We have been working very closely with LangChain, LangSmith, and all the teams here, and I'm excited to talk about a few things we have built, how we built them, and how we evaluate them effectively.
So just to be clear, I'm going to talk about two different applications here. The first one is a chatbot. No surprise: we have almost 120 million users, and we get almost eight and a half million contacts every month. Of those, chat is our main channel, and in that channel right now 60% of contacts are first handled by LLMs. The results are improving and we are building agents for different situations; I'll talk about them. The second application is actually more interesting. We'll talk about both of these applications because, in the world of finance, as you heard from Herbie, the complexity of legal matters is important. Every single dollar, every single thing matters, because that creates trust and makes sure users are happy with our service.

So, just for a second, let's look at this application. You can see that we have built an agent system that can do money transfers over voice, image, and chat with very low inaccuracy; I will show you the numbers really soon. Here the user is connecting their account with WhatsApp. We ask them multiple times for their password, et cetera, to make sure it's the right person. Once they have done all of that, they can give a very simple instruction in chat: hey, make a transfer to Jose for 100k. We confirm this is the transfer the user wants, and once they confirm, it makes the transfer. It used to take around 70 seconds to make this transfer through nine different screens; it now takes less than 30 seconds, and you can see the CSAT is more than 90%, with less than 0.5% inaccuracy, and so on. And we are doing that at scale. I will talk about where evaluations matter and what kinds of things matter.

Just to be on the same page, the experience of building the chatbot and building an application like this are very different, because you need to iterate, but at the same time you need to think about how not to build a one-off solution. If you build it for just one application of this kind, then in the finance world, where you are doing hundreds of these operations to make changes, move money, or make micro decisions, you end up building hundreds of separate systems and agents, which is just not scalable.

Taking a step back, what does Nubank's generative AI ecosystem look like? It has four different layers: core engine, testing and eval, tools, and development experience. I unfortunately don't have time to go over each of them, but in three of them we are working very closely with LangChain and LangSmith. On testing and eval, we'll talk about LLM as a judge and online quality evaluation; and on the developer experience side, we have used LangChain from the very beginning and are now using LangGraph as well.

How do we use it and why does it matter? Let's see. The first thing is that without LangGraph we cannot iterate faster, and we cannot standardize a canonical approach for building agentic systems or any kind of RAG system. The learnings there are that complex LLM flows can be hard to analyze, and that a centralized LLM log repository and a graphical interface help people make faster decisions. Because we don't want only our developers to make decisions; we want our business users to contribute too.
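To ground the money transfer flow described above, here is a minimal sketch of it as a small state machine in plain Python: parse the request, verify the account, confirm with the user, then execute. Nubank's actual LangGraph implementation isn't shown in the talk, so the node names and checks here are assumptions for illustration only.

```python
# Hypothetical sketch of the transfer flow: parse -> verify -> confirm -> execute.
from dataclasses import dataclass

@dataclass
class TransferState:
    message: str
    recipient: str | None = None
    amount: float | None = None
    confirmed: bool = False
    status: str = "pending"

def parse_intent(state: TransferState) -> TransferState:
    # In production this would be an LLM extraction step; stubbed here.
    state.recipient, state.amount = "Jose", 100_000.0
    return state

def verify(state: TransferState) -> TransferState:
    # Balance, fraud, and pending-collection checks mentioned in the talk.
    has_funds, fraud_suspect = True, False
    state.status = "verified" if has_funds and not fraud_suspect else "blocked"
    return state

def confirm(state: TransferState) -> TransferState:
    # The agent echoes the transfer back and waits for an explicit yes.
    state.confirmed = True  # stand-in for the user's reply
    return state

def execute(state: TransferState) -> TransferState:
    state.status = "executed" if state.confirmed and state.status == "verified" else "cancelled"
    return state

def run(message: str) -> TransferState:
    state = TransferState(message=message)
    for node in (parse_intent, verify, confirm, execute):  # linear graph for brevity
        state = node(state)
    return state

print(run("make a transfer to Jose for 100k"))
```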
And the way to do that is by democratizing data and giving access to our business analysts, our product managers, our product operations, and our other operations teams, so they can make faster decisions in terms of prompts, in terms of adding inputs, in terms of adding parameters of different kinds. And last but not least, graphs can decrease the cognitive effort needed to represent flows. This relates to what Shreya was mentioning: human intent is difficult for a machine to understand, and the graph basically makes that process easier for us.

Now, to be true to my presentation, I will go to the evaluation part. On the evaluation side, let me first talk about a few different challenges we have overall. The first is that, as we heard from Herbie, they are everywhere except Antarctica; we are only in three countries, so it's a much smaller problem set. But still, you can imagine that when you're dealing with Portuguese and Spanish, the languages, the dialects, and the way people talk change across the country, and 58% of the Brazilian population are our customers. So we have to understand what users are talking about very extensively. The second thing is that Nubank's brand presence is huge; we are more popular than some well known brands like McDonald's or Nike, even in Brazil. So we cannot afford mistakes, especially when it comes to jailbreaks or guardrails; it's very important for us to keep a very high bar. And last but not least, we have to be accurate in our messaging, because at the end of the day we are dealing with people's money, and money is something people care about; losing trust over a money transfer is very easy.

Taking a step back again and moving forward a little: the customer service side and the money transfer use case have very different needs, from the business side and from the technical side, in terms of what kind of evaluation each needs. In the case of customer service, in addition to accuracy, what matters a lot is how we approach a customer. If a customer is asking, hey, where is my card, or hey, I see this charge that I don't recognize, and we give a very robotic experience, we lose the customer's trust. Empathy and tone matter. It's very easy for a human to have this connection and very hard for a machine. And we all know that too much flattery doesn't work either; I think all of you saw what happened with the ChatGPT 4.1 model last week, which was recalled and relaunched. So, to do these two jobs well, we need to think about whether we understand the customer's intent well, how we are retrieving content and context from the different sources we have internally, and what our deep link accuracy is. Imagine: our app has 3,000 pages, and across those 3,000 pages we have hundreds of deep links. Landing a user at the very top of the navigation and asking them to traverse through different clicks to reach the page where they can self-serve is very tedious and not very effective. And last but not least, we need to make sure that we are not hallucinating.

For money transfer, on the other hand, tone and sentiment are okay, but we need to be accurate. And accuracy is not only about the amount we transfer, but also about who we are transferring to, which source account we are using, whether the person has enough money in their account, whether they are a fraud suspect, and whether they have a pending collection.
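Since the talk stresses that the recipient, the amount, the timing, and the action all have to be right before a transfer can be executed, a natural offline eval is field-level accuracy against human-labeled transfer requests. The sketch below is a hypothetical version of such a check; the field names and example values are invented for illustration.

```python
# Hypothetical field-level eval for transfer-request understanding.
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TransferRequest:
    action: str        # "send" | "schedule" | "cancel"
    recipient: str     # resolved contact, e.g. which brother
    amount: float
    execute_on: str    # ISO date; "today"/"tomorrow" resolved to a date upstream

def field_accuracy(predicted: list[TransferRequest],
                   labeled: list[TransferRequest]) -> dict[str, float]:
    """Per-field accuracy, so we can see whether errors come from entity
    resolution, timing, or action classification."""
    totals = {f: 0 for f in asdict(labeled[0])}
    for p, y in zip(predicted, labeled):
        for f, y_val in asdict(y).items():
            totals[f] += int(asdict(p)[f] == y_val)
    return {f: hits / len(labeled) for f, hits in totals.items()}

labeled = [TransferRequest("schedule", "brother_joao", 100.0, "2025-05-02")]
predicted = [TransferRequest("send", "brother_joao", 100.0, "2025-05-01")]
print(field_accuracy(predicted, labeled))  # action/date wrong, recipient/amount right
```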
All of these things are very intricately connected, because our customers use not just one product but a whole suite of products, from lending to financing to investments to banking accounts to credit cards, and they often have dependent accounts and multiple cards. So if you look at all of this together, what matters is: can we do named entity recognition properly? You can say, hey, send $100 to my brother. If you only have one brother and you have sent him money before, it's easy. But imagine a situation where you have not done that, and you have multiple brothers, a favorite brother and a less favorite brother. Then we have to identify which brother you're talking about, because you don't want the money going to the less favorite brother. The next one is about getting the timing right, because if the user says, hey, I want to send $100, but do it tomorrow, that's a different instruction than doing it right now; if we do it right now, maybe the account ends up short. And last but not least, we need to identify the correct action, because the user might be saying: that last transfer you saw, I don't want to send it, I want to cancel it. We need to understand that as well. So all of these things matter, evaluations for all of them matter, and without them we cannot launch a product.

In the absence of evals, or in the absence of a tool like LangSmith, what happens is that we have a linear path of development. We run A/B tests, because we make all our decisions with A/B tests; I'm not sure if I covered this before, but we have 1,800 services, we do a deployment every two minutes, and we make every decision by A/B test. That is a linear path. But if we have a system that can connect the traces well and give us observability, logging, and alerting on top of it, then we have a full cycle: from observability to filtering, to defining datasets, to running experiments, and onward. This is the flywheel we have in other situations, and we are building it with LangSmith for our generative AI applications.

I've heard offline evaluation and online evaluation come up a few times in the last couple of talks, so I will not go into deep detail. In the case of offline evaluation, this comes after an A/B experiment: we take the results to our LLM evaluation apps, where we do individual evaluation and pairwise evaluation. I will talk about it more later, but we primarily use human labelers in that process, and we are also now using LLM judges and other custom evaluators. Based on all of that, we run statistical tests, and the winning variant is what we launch. Things actually get more interesting not at this stage but at the online stage. Offline, you can run things in sandboxes, in your own more comfortable environment; online, you have a more continuous loop of improvement and development. If we only do offline evaluation, then our decision-making speed, especially for developers and analysts, is much slower. But if we can do good online evaluation, with tracing, logging, alerting, et cetera, then our development speed goes up significantly. So we are doing both now.
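As a sketch of the pairwise offline evaluation step described above (with invented names and thresholds): a judge, human or LLM, compares variant A against variant B on the same traces, and a simple exact sign test indicates whether the winning variant is statistically distinguishable from a coin flip.

```python
# Hypothetical pairwise offline evaluation with an exact sign test.
from math import comb

def sign_test_p(wins_a: int, n: int) -> float:
    """Two-sided exact sign test against the null hypothesis of a 50/50 split."""
    k = max(wins_a, n - wins_a)
    return min(1.0, 2 * sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n)

def pick_winner(pairwise_prefs: list[str], alpha: float = 0.05) -> str:
    """pairwise_prefs holds 'A' or 'B' per trace (ties dropped upstream)."""
    wins_a = pairwise_prefs.count("A")
    n = len(pairwise_prefs)
    p = sign_test_p(wins_a, n)
    if p > alpha:
        return f"no significant winner (A won {wins_a}/{n}, p={p:.3f})"
    return f"launch {'A' if wins_a > n - wins_a else 'B'} (p={p:.3f})"

# e.g. judge preferences collected over 20 offline traces
print(pick_winner(["A"] * 16 + ["B"] * 4))
```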
Last but not least, I will talk about LLM as a judge, which has come up a few times. The question basically goes back to why we need it. Imagine the money transfer situation I was describing: you need to understand who you are sending to, what type of request it is, how much money, from which account, all of that, and we are currently doing a few hundred thousand to a few million transactions every day. At that volume, the amount of labeling required, even with sampling, is not enough to maintain the quality of the product. So we need to do more labeling, and doing it only with humans is not scalable, because of the training required, how differently people understand the task, the mistakes people make, and so forth. So our goal was: let's build an LLM as a judge and try to keep the quality of the judge at the same level as humans.

We started with a first test: a simple prompt on the 4o-mini model, because it's cheap, with no fine-tuning, to see how it works. The humans were making about 80% accurate decisions and 20% mistakes, or something like that; the F1 score doesn't show exactly that, but you can roughly think of it as an accuracy metric. We got 51%. In test 2 we moved to a fine-tuned model and increased the F1 score from 51 to 59. In the next test we changed the prompt, going to a v2 prompt, and got a big jump of 11 points, to 70. We made another iteration, moving from 4o-mini to 4o, a bigger and better model, changed the prompt again in test 5, and changed the fine-tuning again in test 6. That is where we landed: an F1 of 79%, compared to 80% for humans, which is quite comparable. Now you might ask: why did we move from test 5 to test 6 when the F1 scores are 80 and 79? Because at 79 we identify inaccurate information better; we catch more of the inaccuracies there. That's why we are here. And, no surprise, or actually it might be surprising: this whole development took us around two weeks with a couple of developers to go through these six iterations. We could only do it because we have the online tracing system in place; otherwise it would not have been possible.

So, wrapping it all up: there is no magic in building any agent or any LLM application. It's hard work. Evals are hard work, and if you don't evaluate, you don't know what you're building; and if you don't know what you're building, you cannot ship it to the world. So do more evals, spend more time, understand what your users are saying, and think not only about hallucinations and red-teaming but also about nuanced situations like empathy and tone of voice. Those things matter. We are in a very exciting time, and thank you all for listening. I'm happy to take any questions.

Speaker B: Okay, well, thank you, Sayantan. We are now going to a 20-minute break before our next session. One event note: we have a boba bar located on the Adri rooftop, and espresso carts as well. That DocETL thing, I'm going to dive into that; that's super important, basically what I've been doing, creating that. There are all these different open source versions and they're not exactly what we need, but I'm looking at that.
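As a footnote to the test 5 versus test 6 decision discussed in the talk: two judges can have nearly identical F1 scores while one has noticeably higher recall on the "inaccurate" class, which is the class that matters most for money transfers. The counts below are made up purely to illustrate that trade-off.

```python
# Worked example (made-up numbers): near-equal F1, very different recall
# on the "inaccurate transfer" class.
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Confusion counts for the "inaccurate" class over an illustrative eval set.
test5 = prf(tp=78, fp=17, fn=22)   # fewer false alarms, but misses more inaccuracies
test6 = prf(tp=88, fp=36, fn=12)   # noisier, but catches more of the real problems

for name, (p, r, f1) in [("test 5", test5), ("test 6", test6)]:
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# test 5: f1 ~0.80 with recall 0.78; test 6: f1 ~0.79 with recall 0.88,
# so the lower-F1 judge is preferable when catching inaccuracies matters most.
```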