From Pilot to Platform - Agentic Developer Products with LangGraph

Sourabh Shirhatti


Matas Rastenis

Summary
Uber's Use of LangGraph (via LangFX) and LangChain: A Summary
- AI Developer Tools (LangFX Framework): Uber is leveraging LangGraph and LangChain through their internal opinionated framework called LangFX. This framework is used to create AI tools that assist developers in tasks like test generation (AutoCover), best practice violation flagging (Validate), build management, and security analysis. These tools aim to automate and enhance processes that were previously manual.
- Test Generation and Management (AutoCover):
  - AutoCover generates high-quality, coverage-raising business case tests.
  - It uses a graph of "domain expert agents" (e.g., Scaffolder, Generator, Executor) to mimic and supercharge an engineer's test-writing heuristics.
  - It can handle hundreds of concurrent code generation and execution iterations.
  - Validate (another agent) is composed within AutoCover to ensure generated tests meet quality standards.
- Best Practice and Security Violation Flagging (Validate):
  - Validate is an IDE experience that automatically flags best practice violations and security issues in code.
  - It's a LangGraph agent that can compose multiple sub-agents, including LLM-based checkers and deterministic linters.
  - It gives users explanations and pre-computed fixes, or the option to send fixes to an IDE assistant.
- Deterministic Tools and Composable Sub-Agents:
  - Emphasis on using deterministic tools/sub-agents where possible for reliability (e.g., linting within Validate).
  - LangGraph enables the composition of these deterministic parts with LLM-powered agents.
- Encapsulation and Collaboration (Reusable Primitives):
  - LangFX promotes building reusable, "super capable domain expert" agents (primitives) that can be composed into various applications (e.g., a build system agent used in both AutoCover and Validate).
  - Well-defined abstractions allow different teams (e.g., security) to contribute rules/logic without needing deep AI/graph knowledge.
- Graph-Based Interactions Mirroring Developer Workflows:
  - The agent graphs are designed to model how developers interact with systems.
  - Improvements made to make these agentic systems efficient often benefit non-AI developer workflows as well.
- Impact and Benefits:
  - AutoCover significantly increased developer platform test coverage.
  - Validate handles thousands of fix interactions daily.
  - The approach improves developer productivity and code quality by addressing inefficiencies.
In essence, Uber is using LangGraph and LangChain via their LangFX framework to build sophisticated, composable AI developer tools. They focus on creating highly capable, domain-specific agents and leveraging graph-based architectures to automate complex tasks, enhance developer workflows, and improve code quality across the organization.
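One concrete way to picture the "hundreds of concurrent iterations" point above is LangGraph's map-reduce fan-out. The sketch below uses the `Send` API to run many test-generation tasks in parallel; the node names, state schema, and stubbed generation logic are hypothetical illustrations, not Uber's actual implementation.

```python
# Minimal sketch: fan many generation tasks out in parallel with LangGraph's
# Send API. All names and logic here are hypothetical, not Uber's code.
import operator
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.types import Send


class State(TypedDict):
    targets: list[str]                          # functions still needing tests
    tests: Annotated[list[str], operator.add]   # results merged across workers


def plan(state: State) -> dict:
    # A real scaffolder would analyze the source to find untested functions.
    return {"targets": state["targets"]}


def fan_out(state: State) -> list[Send]:
    # One parallel worker invocation per target function.
    return [Send("generate_one", {"target": t}) for t in state["targets"]]


def generate_one(task: dict) -> dict:
    # Placeholder for an LLM call that writes a test for one function.
    return {"tests": [f"func Test_{task['target']}(t *testing.T) {{ ... }}"]}


builder = StateGraph(State)
builder.add_node("plan", plan)
builder.add_node("generate_one", generate_one)
builder.add_edge(START, "plan")
builder.add_conditional_edges("plan", fan_out, ["generate_one"])
builder.add_edge("generate_one", END)
graph = builder.compile()

result = graph.invoke({"targets": ["CreateTrip", "CancelTrip"], "tests": []})
print(result["tests"])  # two test stubs, produced by concurrent workers
```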
Transcript
Next up are Sourabh and Matas from Uber. They'll be discussing rolling out agents using LangFX, which is their framework built with LangGraph. So please welcome them.
Speaker B: Hello everyone, thanks for joining us on this nice Wednesday afternoon. My name is Matas Rastenis and this is my colleague Sourabh. Today we're going to present how we built AI developer tools at Uber using LangGraph. To start off with a little context: Uber is a massive company, serving 33 million trips a day across 15,000 cities. That's enabled by a massive codebase of hundreds of millions of lines, and it's the job of our developer platform team to keep it all humming along smoothly. Really, all you need to know is that we have about 5,000 hard-to-please developers that we have to keep happy, which is not so easy. To accomplish that, we've built out a large corpus of devtools for our engineers, and today we'll present a few of them, along with some of the key insights we picked up while building them. Sourabh, take us through the agenda.

Speaker A: All right, we'll dive right in with the 10,000-foot view of the AI developer tool landscape at Uber. As part of that we'll highlight a couple of products: we'll show you what the user experience is like, and then we'll tell you about the reusable tools that power them. After that, since we can only focus on a couple, we'll do a quick look through a few more products we've built, just to show you how this has proliferated across Uber. And finally we'll tell you what we learned, and hopefully there's something in it you can reuse.

Speaker B: Let's do it.

Speaker A: Our AI devtools strategy rests on three pillars. The first is the products, the bets we choose to take. We pick things that directly improve how engineers work today, like code review, and ask: how do we make this better, how do we make this faster, what toil can we eliminate? We bet on a few areas based on where we think we can make the most impact, and we're always learning; that's why we're here, to see what everyone else is up to and what else we can target. The second pillar is what we call cross-cutting primitives. There are foundational AI technologies that show up in pretty much every one of these solutions (you all probably feel this too), and having the right abstractions, frameworks, and tooling in place helps us build more solutions and build them faster. The last pillar, and I'd say the cornerstone of the strategy, is what we call intentional tech transfer. We've taken bets on a few product areas and we want to build them as fast as possible, but we do stop and ask: what here is reusable? What can be spun out into something that lowers the barrier for the next problem we want to solve? LangFX is our opinionated framework that wraps LangGraph and LangChain and makes them work well with Uber's systems, and it was born out of exactly that necessity. The first couple of products emerged wanting to solve problems in an agentic manner and to build reusable nodes, and LangGraph was the perfect fit. When we saw it proliferating across the organization, we made it broadly available and built the opinionated framework around it. So, enough of the overview; let's dive into one of the products. Matas, walk us through Validate.

Speaker B: Yeah, absolutely. The first product we'll showcase today is called Validate.
It's an IDE experience that automatically flags best practice violations and security issues in code for engineers. Effectively, it's a LangGraph agent that we've built an IDE experience around. Let's take a look at how it works. We have a screenshot here showing a user opening a file, and in it a flagged violation. Mousing over it brings up a nice modal saying: hey, the way you're creating this temporary test file here will leak it after the test; there's an alternative that cleans it up for you automatically. So what does the user do about it? They have multiple choices: they can apply a pre-computed fix that we have prepared for them in the background, or, if they choose, they can ship the fix off to their IDE agentic assistant. That's what we have on the next slide: the fix is requested and shipped off, we get a fix back from the IDE assistant, the issue is no longer present, and the user is happy. The code smell is gone.

Here are some of the key ideas we found while building this. The main one is that the agent abstraction lets us compose multiple sub-agents under a central Validate agent. For example, we have a best-practices sub-agent that calls into the LLM with a list of practices and returns those points of feedback. But there's also a deterministic part: we want to discover lint issues from static linters, and there's nothing stopping us from running a lint tool and passing all those findings to the rest of the graph, which lets us pre-compute fixes even for those. That's the learning. In terms of impact, we're seeing thousands of fix interactions a day, plus satisfied engineers who fix problems in their code before those problems come back to bite them.

Speaker A: I think we've built a compelling experience here. We met the developers where they are, in the IDE. We have tooling that runs in the background and can combine capabilities: we use AST parsing tools to find out where the test boundaries lie, we evaluate a set of curated best practices, we flag violations, and we pick the most ergonomic way to deliver it all back to the user, shown in the IDE with a way to apply fixes. But why stop there?
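To make that composition concrete, here is a minimal sketch of a Validate-style graph in LangGraph. The node names (`run_linters`, `llm_best_practices`, `precompute_fixes`), the state schema, and the stubbed logic are hypothetical illustrations, not Uber's actual LangFX code; the deterministic lint checker and the LLM checker run in parallel and fan in to a node that pre-computes fixes.

```python
# Minimal sketch of a Validate-style graph: a deterministic lint node and an
# LLM-based best-practices node run in parallel, then a fix node pre-computes
# a suggested fix for every finding. Names are hypothetical, not Uber's code.
import operator
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END


class ValidateState(TypedDict):
    code: str
    findings: Annotated[list[dict], operator.add]  # merged from both checkers
    fixes: list[dict]


def run_linters(state: ValidateState) -> dict:
    # Deterministic sub-agent: no LLM involved. Stubbed here; a real node
    # would shell out to the static lint toolchain.
    return {"findings": [{"rule": "lint/unused-var", "line": 12}]}


def llm_best_practices(state: ValidateState) -> dict:
    # LLM sub-agent: send the code plus a curated list of best practices to
    # a model and parse the violations it returns. Stubbed for this sketch.
    return {"findings": [{"rule": "tests/temp-file-leak", "line": 40}]}


def precompute_fixes(state: ValidateState) -> dict:
    # Prepare a fix for each finding so the IDE can apply it in one click.
    return {"fixes": [{"for": f["rule"], "patch": "..."} for f in state["findings"]]}


builder = StateGraph(ValidateState)
builder.add_node("run_linters", run_linters)
builder.add_node("llm_best_practices", llm_best_practices)
builder.add_node("precompute_fixes", precompute_fixes)
builder.add_edge(START, "run_linters")            # both checkers start
builder.add_edge(START, "llm_best_practices")     # in the same superstep
builder.add_edge(["run_linters", "llm_best_practices"], "precompute_fixes")
builder.add_edge("precompute_fixes", END)
validate = builder.compile()

report = validate.invoke({"code": "package trip", "findings": [], "fixes": []})
print(report["fixes"])  # one pre-computed fix per finding
```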
Speaker B: So why stop at Validate? Let's help engineers by automating their tests from the get-go. The tool we're showing off here is called AutoCover, and it helps engineers generate building, passing, coverage-raising, business-case tests that are validated and mutation-tested. Really high-quality tests are what we're shooting for here. The intent is to save the engineer time: they're developing code, and we want to get them their tests quickly so they can move on to the next business feature they want to implement. The way we got here is that we took a bunch of domain expert agents, we threw Validate in there as well (more on that later), and we arrived at this test generation tool.

Speaker A: So let's take a look at how it works.

Speaker B: We have a screenshot of a Go source file as an example, and the user can invoke the tool in multiple ways; for instance, to run it on the whole file they can just right-click and invoke it, as shown in the screenshot. Once they press the button, a whole bunch of stuff happens in the background. We start by adding the target to the build system, we set up a test file, and we run an initial coverage check to get a baseline to operate against. All while that is being done, we also analyze the surrounding source to pull out the business context, so we know what to test against. What the user sees is simply that they get switched to an empty, not-yet-populated test file. Because we did all that setup in the background, we start generating tests pretty quickly. The user will see generated tests come in and the file in constant flux: tests arriving at high speed; if one doesn't build or doesn't add coverage, we take it out; some tests get removed because they're redundant; benchmarks and concurrency tests might come in later. The user watches this experience, and at the end they're left with a nice set of valuable tests. That's the magic we want for our users here. Let's dive a bit deeper into the graph to see how it actually functions. Here's the graph; on the bottom right you can see Validate, the same agent we just talked about, so you can already see some of the composability learnings we found useful.

Speaker A: So how did we arrive at this graph?

Speaker B: We looked at the heuristics an engineer would use while writing tests. You want to prepare a test environment and think about which business cases to test: that's the job of the scaffolder. You want to come up with new test cases, whether by extending existing tests or writing new ones: that's the job of the generator. Then you want to run your builds and your tests, and if those pass, run a coverage check to see what you missed: that's the job of the executor. And we go on to complete the graph this way. Because there's no longer a human in the loop, we can supercharge the graph and juice it up so that it handles 100 code generation iterations at the same time, and another 100 executions at the same time; for a sufficiently large source file, you can do that. That's where our key learning comes in: we found that having these super-capable domain expert agents gives us exceptional performance compared to other agentic coding tools. We benchmarked against the industry's agentic coding tools available for test generation, and we get about two to three times more coverage in about half the time, because of the speed-ups we built into this graph and the custom tooling we built into our agents. In terms of impact, this tool has helped raise developer platform coverage by about 10%, which maps to roughly 21,000 developer hours saved, and we're seeing thousands of tests generated monthly. We're very happy about that. Sourabh, let's tour some more products.
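The scaffold/generate/execute loop described here maps naturally onto a LangGraph cycle. Below is a minimal sketch with hypothetical node names mirroring the scaffolder, generator, and executor roles; the LLM generation and the build/coverage steps are stubbed out, and the real system additionally runs many generations and executions concurrently.

```python
# Minimal sketch of an AutoCover-style loop: scaffold once, then alternate
# generate -> execute until a coverage target is met. All names and the
# stubbed logic are hypothetical; Uber's actual graph is far richer.
from typing import TypedDict

from langgraph.graph import StateGraph, START, END

COVERAGE_TARGET = 0.80


class CoverState(TypedDict):
    source: str
    tests: list[str]
    coverage: float


def scaffolder(state: CoverState) -> dict:
    # Prepare the test environment: register the build target, create an
    # empty test file, and take a baseline coverage measurement.
    return {"tests": [], "coverage": 0.0}


def generator(state: CoverState) -> dict:
    # Propose new test cases (an LLM call in the real system, stubbed here).
    return {"tests": state["tests"] + ["TestNewBusinessCase"]}


def executor(state: CoverState) -> dict:
    # Build and run the tests, then re-measure coverage. In the real system
    # many generations and executions run concurrently against one file.
    return {"coverage": min(1.0, state["coverage"] + 0.3)}


def should_continue(state: CoverState) -> str:
    # Loop back to the generator until the coverage target is reached.
    return "generator" if state["coverage"] < COVERAGE_TARGET else END


builder = StateGraph(CoverState)
builder.add_node("scaffolder", scaffolder)
builder.add_node("generator", generator)
builder.add_node("executor", executor)
builder.add_edge(START, "scaffolder")
builder.add_edge("scaffolder", "generator")
builder.add_edge("generator", "executor")
builder.add_conditional_edges("executor", should_continue, ["generator", END])
autocover = builder.compile()

print(autocover.invoke({"source": "service.go", "tests": [], "coverage": 0.0}))
```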
Speaker A: Yeah, we don't want to stop at test generation; we built these primitives, right? I just want to give you a sneak peek of what else we've been able to do across the organization with them. What you see on screen right now is our Uber Assistant Builder. Think of it as our internal custom GPT store, where you can build chatbots steeped in Uber knowledge. One of them you see on screen is a security scorecard bot, and it has access to some of the same tools we showcased earlier: it can speak to Uber's best practices and detect security anti-patterns. So even before I'm in my IDE writing code, I can ask questions about architecture and figure out whether my implementation is secure or not. Same primitives, powering a different experience. Next up we have Picasso, our internal workflow management platform, on which we built a conversational AI that understands workflow automation, understands the sources of truth, and gives you feedback grounded in product truth, aware of what the product actually does. The third thing I want to show you (and this is not an exhaustive list) is our tool called uReview. Obviously we build things in the IDE and try to flag issues earlier in the process, but sometimes things still slip through the cracks, so why not reinforce and make sure quality is enforced at the point where code gets landed, before your PR gets merged? Again, it's powered by some of the same tools you saw earlier, like the validator and the test generator, and it surfaces code review comments and code suggestions that developers can apply at review time. I think with that we'll jump over to the learnings.

Speaker B: Yeah, sounds good. In terms of the learnings, we already touched on this, but we found that building super-capable domain expert agents is the way to get outsized results. They use context better, you can encode rich state, they hallucinate less, and the resulting output is much better. An example I already talked about is the executor agent: we taught our build system to let us execute 100 tests against the same file without them colliding, and to get separate coverage reports back. That's a domain expert that's super capable, and it gives us the performance we want. Secondly, we found that, when possible, composing agents with deterministic sub-agents, or making a whole agent deterministic, makes a lot of sense if the problem can be solved deterministically. One example was the lint agent under Validate: we want correct output, and if a deterministic tool can give us that signal, we don't need to rely on the LLM; the reliable algorithm can pass its findings on to the rest of the graph to get them fixed. And third, we found we can scale our leverage quite a bit by solving a problem once, as an agent, and then reusing it in multiple applications. You already saw it with Validate: the standalone experience, and Validate within AutoCover for test generation validation. But I'll give you one more lower-level example, and that's the build system agent, which is actually used by both of those products. It's the lower-level abstraction required for our agents to be able to execute builds and execute tests in our build system.
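That reuse learning corresponds to LangGraph's support for using one compiled graph as a node inside another. Here is a minimal sketch, assuming the `ValidateState` schema and compiled `validate` graph from the earlier sketch, with a hypothetical test-generation node standing in for the rest of an AutoCover-style pipeline.

```python
# Minimal sketch of agent reuse: a compiled LangGraph graph can itself be
# added as a node in another graph, which is how a primitive like Validate
# (or a build-system agent) gets composed into multiple products. Assumes
# `ValidateState` and `validate` from the earlier sketch; `generate_tests`
# is a hypothetical stand-in.
from langgraph.graph import StateGraph, START, END


def generate_tests(state: ValidateState) -> dict:
    # Hypothetical generation step that produces code for the validator.
    return {"code": state["code"] + "\nfunc TestCreateTrip(t *testing.T) {}"}


builder = StateGraph(ValidateState)     # same state schema as the subgraph
builder.add_node("generate_tests", generate_tests)
builder.add_node("validate", validate)  # reused compiled graph as a node
builder.add_edge(START, "generate_tests")
builder.add_edge("generate_tests", "validate")
builder.add_edge("validate", END)
pipeline = builder.compile()

result = pipeline.invoke({"code": "package trip", "findings": [], "fixes": []})
print(result["findings"])  # findings produced by the reused Validate graph
```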
Speaker B: So Sourabh, take us through some of the strategic learnings now.

Speaker A: Yeah. This one is less about the tech benefits, but it's probably what I'm most excited to share: you can set up your organization for success if you want to build agentic AI, and I think we've done a pretty good job of it at Uber. We haven't devolved into an AI arms race; we're all building in collaboration, and these are our biggest takeaways. The first is encapsulation and collaboration. When there are well-thought-out abstractions like LangGraph, with opinions on how to do things like handle state management and deal with concurrency, it really allows us to scale development horizontally. It lets us tackle more problems, and more complex problems, without creating an operational bottleneck. The example I'll give you is that our security team was able to write rules for Validate, the product you saw earlier, to detect security anti-patterns. We knew nothing about that corner of security, and the security team knew nothing about AI agents or how the graph was constructed, but they were still able to add value and improve the lives of our developers. A natural segue from that: once you're working within these well-defined norms, graphs are the next thing you think about.

Speaker B: Right?

Speaker A: Graphs help us model these interactions perfectly, and they often have to mirror how developers already interact with the system. So when we do the costly process engineering and identify process bottlenecks and inefficiencies, it doesn't just help accelerate the AI workflows; it also improves the experience for people who aren't even using the AI tools. So it's not an either/or question, should we build agent systems or should we improve our existing systems; the two usually end up helping each other. We spoke about our test generation agent, and along the way we found multiple inefficiencies: how do you do mock generation quickly, how do you modify build files, how do you interact with the build system or even execute the tests? In the process of fixing all those paper cuts, we improved the experience for non-agentic workflows too, for developers interacting directly with our systems, and that has been hugely beneficial. And with that, I want to bring this to an end. We really enjoyed presenting here; thank you for the opportunity. Hopefully you learned something and will take something back to your organizations.

Thank you Sourabh and Matas for that, and thank you LinkedIn, Uber, and BlackRock for sharing how they're building their enterprise platforms on LangGraph and LangSmith. This is a trend that we see more and more as a lot of the building blocks are starting to get figured out.