Skip to main content

How We Built It: LangSmith Engine – Ben Tannyhill & Vivek Trivedy

Speaker(s): Ben Tannyhill (Product Manager, LangChain); Vivek Trivedy (Product Manager, LangChain)
Session: Interrupt 2026 · Day 1 (May 13) · ~2:20 PM PT
Source: in-person audio recording, transcribed locally with Whisper large-v3.

Summary

Ben Tannyhill and Vivek Trivedy of LangChain walk through how they built LangSmith Engine, positioned as the future of agent observability, evaluation, and autonomous agent improvement. The motivation came from their own go-to-market agent: despite close monitoring they kept hitting 'micro-regressions'—small fixes that get deprioritized and turn the work into a game of whack-a-mole because writing good evals is hard. Engine is itself an agent that pulls error traces, clusters them into a prioritized inbox of issues, proposes fixes (prompt, skills, AGENTS.md, or code) that can be opened as a GitHub PR in one click, and strengthens the eval suite by creating custom online evaluators and generating ground-truth dataset examples from production errors. They recount the early 'wind-up toy' version (traces auto-sent to a coding agent that spun up PRs via a GitHub Action) and the key lesson that finding meaningful issues—not generating fixes—is the hardest, most important step, since Engine was initially 'too good at finding problems.' Engine runs on LangChain's own deep agents framework and LangSmith sandboxes, is in use with ~15 customers, learns per-team priorities via an 'agent coverture' memory file, and even improves itself by feeding its own traces back into Engine.

Key Points

  • LangSmith Engine is an agent for observability, evaluation, and autonomous agent improvement; the motivating problem was 'micro-regressions'—small, deprioritized fixes that create a whack-a-mole loop because good evals are hard to write.
  • Engine identifies errors from traces, clusters them into a prioritized inbox of issues (each with a description and an occurrence timeline), and lets you click into offending traces to gut-check whether an issue is real.
  • Beyond diagnosis, Engine proposes fixes to prompts and agent files and lets you open a GitHub PR directly from the page; it also creates custom online evaluators and generates ground-truth dataset examples from real production errors (via assertion, editable in the annotation queue).
  • The earliest 'wind-up toy' version auto-sent LangSmith traces to a coding agent that spun up PRs and ran on a GitHub Action; an early real win was catching a tool call that was failing silently, and another extended an internal Slack-invoked coding agent ('open suite') to review PR comments.
  • Key lesson: Engine was 'too good at finding problems'—finding non-real ones ('show me the man and I'll show you the crime')—so identifying meaningful, actionable issues is the most important part of the loop, more so than fix or eval generation.
  • Engine runs on LangChain's deep agents framework and newly announced LangSmith sandboxes, with multi-tenant orchestration and a distributed task queue on a single LangSmith deployment; it ingests condensed/summarized traces (not full traces) plus optional source code, since 'traces are the window to the agent's soul.'
  • It is in use with about 15 customers (named: clay, manta, camfire) and learns per-team priorities—one customer wanted hallucination/data-leak issues kept but context-bloat issues suppressed, another the reverse—via a memory file called 'agent coverture.'
  • Evaluating Engine relies on internal-trace evals plus synthetic evals (LLM-generated but needing heavy rubrics and human review, since LLMs are 'very bad' at writing good evals) and end-to-end representative-trace tasks; Engine is given a deliberately minimal toolset to stay autonomous, and Vivek notes prompt engineering is 'still very much alive.'
  • The most exciting frontier is self-autonomous improvement: Engine produces traces like any agent, which are fed back into a version of Engine to improve Engine itself (e.g. context management, model choice for finding issues).

Notable Quotes

this feels like a game of whack-a-mole

the traces really are the window to the agent's soul, and how it will perform in the wild

It was finding crimes in all sorts of places where there weren't any.

Slides

Slide of the LangSmith Engine architecture: a scheduled CRON trigger dispatching parallel per-issue fix runs through LangSmith deployment, sandboxes, and tracing LangSmith Engine architecture: a scheduled (every-6h CRON) ambient agent that dispatches one parallel fix run per code-rooted issue and proposes fixes — backed by a LangSmith deployment, per-run sandboxes, and live tracing.

Slide titled Rounding out the eval suite, with diverse eval data, end-to-end evals, scenario-based tasks, and sub-piece optimization Rounding out the eval suite: diverse eval data reflecting real-world variation, end-to-end evals across the full flow, scenario-based stress tests, and sub-piece optimization of Engine's internals.

Full Transcript

Show the full timestamped transcript (auto-generated; lightly cleaned)
[00:00] Hey everybody, we are super excited about the announcement and the launch of LangSmith Engine. We
really see this, as Harrison said in his keynote, as the future of agent observability and of
valuation and of spinning this agent improvement. And what we want to do today is kind of walk you
through the process by which we landed on what is now Engine, and this cool agent that runs on
LangSmith.

[00:37] And to start, here at LangSmith, we're the agent engineering platform. So, as you might imagine, we
love to build agents. And we like to think that we're pretty good at it. And internally, we have
built a ton of them, including a go-to-market agent. This go-to-market agent does account research
for our go-to-market team. It does account outreach for our customers as well. And as we were
building and improving this agent, we saw a ton of different problems that we would run into over
and over.

[01:08] The first being, as we would be carefully and meticulously monitoring and observing what our agent
was doing in the wild, we would still get the inevitable request from our go-to-market team saying,
hey, something is broken, this didn't work as it should have, even though we had been monitoring it
very closely. And for each one of those issues that would be brought up to us, the fix was
oftentimes very straightforward, but tons of these very small fixes start to add up, they get
deprioritized, etc. And with each one of these fixes, the process of creating good evals to ensure
that there aren't regressions is very steep.

[01:42] And so oftentimes, we find ourselves encountering these micro-regressions. And these are problems
that, as I talk to customers, I see all the time as our customers are building agents as well. They
are... looking through their traces in a very manual way, which is extremely error-prone. They are
finding that a lot of these small fixes that they need for their agent to perform well are not
getting prioritized because there are just so many minor ones. And, like Harrison said in his
keynote, this feels like a game of whack-a-mole.

[02:12] I'm going to fix one problem over here, and another is going to spring up because this process of
creating good evals is very difficult. So what we wanted to build was an agent that would do this
for us and resolve these different problems for us. Something that would find our issues
automatically for us. Something that would fix those issues. And something that would ask testing to
prevent against regression. And ideally, something that would be doing this all the time so that our
team could be focusing on expanding the functionality of the agent instead of fixing these
irritating bugs or errors.

[02:43] And what I'm excited to walk you through is what we landed on with Langston Engine. And I want to
show you what that looks like in the UI. And so... Langston Engine will identify errors from your
traces in your tracing project. And it will present them in this kind of prioritized inbox on the
left. Each issue has a nice description and a timeline of its occurrence. So I can see if this is
something new or something that's been going on for a long time. And if I want to dive in deeper
into any of these offending traces, I can actually just click into them right here on the page.

[03:13] And see exactly what might have occurred with the customer. And so I can kind of gut check, is this
a real issue and something I need to fix? Engine goes beyond the addressing and the diagnosis of
problems and actually works to fix those problems as well. So here you can see there's an adjustment
to my prompt being proposed as well as an adjustment to one of my agent files. And if I want to open
a PR right here on the page, I can do so very easily. The process of finding important problems is
also very difficult. And so Engine creates custom online evaluators that are tailored to this exact
problem.

[03:47] So that if it returns or recurs in any way, it will be able to fix it. The last part of this process
that is so painful is the creation of dataset examples for your offline eval. And Engine also makes
this much easier by taking the error inputs that you've seen in production and creating a version of
ground truth from each one of those examples. You can see here I have the various inputs that I've
seen in production, the wrong outputs that have been generated by a customer in our earlier version
of the agent. And ideally what we'd like to see.

[04:17] And this is created via assertion. So we can add this to a dataset or edit this in the annotation
queue if you have any concerns about what that ground truth looks like. So really excited about the
way that this connects and strings together all the different components of that Engine engineering
loop. Engine is an agent like we talked about. It pulls these different error traces, clusters them
into issues for you. These are issues that you can action on very easily. Generates fixes for you
and then works to improve your evaluation suite as well.

[04:47] Whether that's your online evaluation suite or your online evaluation suite. So we can add this to a
dataset or edit this in the annotation queue if you have any concerns about what that ground truth
looks like. These are issues that you can add this to a dataset or edit this in the annotation
queue. This is the first instance we've seen in this time tag. And this is my version of Engine that
you just saw. This is going to show you where we are running this out. Now the version of Engine
that you just saw and that we are describing is working with a handful of our customers today. We
have about 15 customers that are really awesome and are helping us to improve and make this version
of Engine much better. From clay to manta to camfire, they have given us really awesome feedback
that Engine is changing not just the way they interact with Laintsmith but the way that they are
engineering the regions in Tyra. And it is facilitating that process so much.

[05:17] So we can add this to a dataset or edit this in the annotation queue. process so much. But the
product that they're using and that we just walked through is not what we first started with. And I
want to talk to you about the earliest version of Engine. If Engine is like this motor, this
powerhouse of that agent improvement loop, the first version was kind of like a wind-up toy. And
what we created was something that would take our length of traces, automatically send them to a
coding agent, spin up a PR, and would run on a GitHub action all the time. And it was a little
hacky, but it worked

[05:50] and we started to surface different issues from our traces on our internal media. And so the very
first PR that was created by this early version of Engine was a real problem. It was some tool call
that was failing silently one of our agents, and Engine added some error in for that. Not an insane
fix by any means, but it was a real problem that was addressed by this agent. So we were excited
about that. But more excited about the subsequent fixes that we saw. Here's another one for one of
our internal agents that is an asynchronous

[06:22] coding that can be invoked via Slack. And Engine picked up that a lot of our users were requesting
that the agent review the PR comments for the PRs that had been created by this agent. And this open
suite, the name of our coding agent, would respond, I don't have that functionality, I can't help
you with it. And so Engine picked up on these users getting these unsatisfactory responses and
created this PR to extend the functionality of these. And this is where we started to get really
excited,

[06:53] feeling like this is not just going to find issues, but it's going to improve our agent in all these
different dimensions. Obviously this first version of Engine needed some work. And one of the very
first problems that was recognized with Engine was that it was too good at finding problems. It
would oftentimes find problems that weren't real. So this is a Soviet era quote, right? Show me the
man and I'll show you the crime. This is exactly what Engine was doing. It was finding crimes in all
sorts of places

[07:25] where there weren't any. And what we quickly realized was that the process of finding and
identifying real and meaningful issues and distilling those into actionable blockades that we could
work on was the most important part of the process. The generation of the right fix, and the
generation of these fouls for afterwards, was definitely connected to the identification of
meaningful errors. So a lot of our structural and architectural thinking was centered around this.
How can we find good and meaningful errors?

[08:00] Don't get into more of the architecture. I want to just address something I think is really
incredible, which is that to power Engine and this agent, we are using all sorts of blockchain
products, whether that's our deep agents framework that VivWorks put on, or whether that's our newly
announced Landsmith sandboxes. These are all powered by the deep. Now, more to the weeds, Landsmith
Engine is triggered from a whole set of different sources. An agent can kick it off ad hoc, but by
default, it runs on a schedule.

[08:30] Whenever it's kicked off and triggered, that hits our Landsmith backend. That feeds into this multi-
tenant orchestration with this distributed task queue that we have on our Landsmith backend. And
that all feeds into a single, multi-tenant orchestration. So a single Landsmith deployment that
serves all of the different tenants, all of the agents that we have being improved by Engine are all
on this single Landsmith deployment. Where we have these deep agents running and improving based on
those traces. These deep agents are pulling in traces and the source code of the agents that we're
running on

[09:02] into a Landsmith sandbox, one for each one. And this process that you can see here actually is
really just the process of identification of it. We decided early on, that because of the massive
context that it required to look through and investigate these different traces, that it was better
for us to split out the functionality from the identification of issues from the actual fixes to
them. So you can see here there's this little section for fixes, because it looks very identical to
the graph that you're seeing, and it's just this version specifically meant to create fixes and
attach those to the various issues that we have required.

[09:38] So once this set of issues is defined, and each issue has a defined fix, we service that to the user
in the UI. And all of this, I'll note here, all of this is being traced via Landsmith, and this is
what we're using to understand how our agents are doing. Now, I've mentioned that the engine ingests
traces of our users. And I just want to drive home the point that traces are far and away the most
valuable equipment that our engine operates in. The code of your agent is valuable, obviously,

[10:11] but it doesn't do a great job of telling you, how your agent will perform in the wild. There's a
massive amount of non-determinism when building with agents. And so the traces really are the window
to the agent's soul, and how it will perform in the wild. And so that is really what powers engine
null functionality. Now, we don't ingest the entirety of every single trace that is being traced to
Landsmith, the power engine. Instead, we send it a condensed and summarized version of your traces.
And the agent has the ability to grab,

[10:42] and investigate, and look through more closely into those traces, looking for patterns that it
thinks are interesting, or might hold some kind of error. Now, we also rely on our customer's source
code, right? This can be optimally connected into engine, so that we can more accurately diagnose
fixes, as well as create these PRs to very quickly address the problems that engine will serve. But
like I said,

[11:12] the traces are really what perform so much of what powers engine null. Now, when it comes to these
actual fixes that are proposed, sometimes these are small fixes. Sometimes this is a minor prompt
adjustment. Sometimes that's an adjustment to your skills file, or your agent's MD file. And many
times it's an adjustment to the actual code of the agent's structure. But what I feel really excited
about is not just the actual code adjustment, but this contribution back into your evaluation.
Whether that's creating online evaluators to test those problems that you're recurring, or creating
these regression examples that we talked through during that demo.

[11:45] That process of taking an input that you've seen in production with a bad or unsatisfactory output,
and creating a ground truth reference output is very tedious today. So we're really excited about
the way that engine proposes what ground truth might or should look like for those different inputs.
And then lastly, we want engine to work really well for our customers and for the different
priorities that they have. When we launch the first or formal version of the engine, to that group
of customers that I mentioned earlier, we saw that there was an important learning for us,

[12:17] which is that what's good for one of our customers, and one of our teams using engine, might be very
uninteresting to another team. We have a customer reach out to me early on and say, this is working
great, you guys are finding really awesome places where my agent is hallucinating, or leaking
sensitive data, please stop showing the issues related to our context being bloated. We don't care,
our agent is working great. On the other hand, we have customers saying, this is awesome, you're
pointing out places where our cost is too high,

[12:48] our latency is way too high, and please stop showing us places where our agent is hallucinating a
little bit, it's too in line. So what we learned is that for engine to work well, engine needs to
work differently for the different teams that we are servicing. And so with that in mind, we've
allowed for this improvement process for agents to learn from our customers' behavior. So whether
that is changing the priority of an issue that has surfaced, whether that is ignoring an issue
altogether, providing reasoning for why this doesn't matter to you, or providing any other kind of
natural language feedback,

[13:20] that all feeds into what we call this agent coverture. It's kind of like a memory file for engines
to understand what you care about and how you are an agent. So we're really excited about the
process that we've undergone to improve engine and make this better for our customers. We've been
observing it very closely, and we've been evaluating it very closely. And to talk about the process
of evaluating engines, I will pass it off to Viv. Thank you. Good job. Amazing.

[13:50] So like Ben mentioned, engine today did not look like engine four weeks ago. And the question that I
really care about is how do we know that engine is working? How do we know it's working today? How
do we know it was working four weeks ago? And how do we know it's going to work two months from now?
The answer to that question is, of course, eval. It's always eval. So like Ben mentioned, engine
does three main things. It identifies issue traces from a huge corpus of data.

[14:23] It generates fixes. And a little bit of eval section, it creates evals. And for all of those things,
we need evals. So we need evals for engine to create evals for you, basically. How do we actually
get those evals? Well, the main way that we source them is from internal traces. So like Ben
mentioned, we turned engine on for a background coding issue or for a GPM issue. That gave us real
user data with users for how is engine doing?

[14:53] How can we improve it? And that sort of kicked off this feedback cycle saying, OK, engine's not that
great here, but let's take those traces and turn them into evals. So all including is a really,
really good way, if you are your own users, to push traffic evals. The other thing is synthetic
eval, which is, let me ask an LLM to generate what a good eval looks like. I put human review in
here in particular because LLMs are really smart. They're actually very bad at this task, which we
found, which is generating a good eval.

[15:23] They need a lot of rubrics and a lot of guidance and a lot of human review to be like, OK, this is a
good eval. This is not a good eval. And the third thing is we love the open community of evals. But
what we found is that the best way to generate a good eval is to put human reviews in there. And the
third thing is that the best eval for our agent is one that's bespoke for our agent. So we use them
to dust check our evals, but we don't always use them directly. It's actually better. Great. So
we're getting evals from somewhere, but how do we actually build a diverse and rounded

[15:55] out eval system? So a trace can be short. It can be long. It can be from the finance domain. It can
be from the medical domain. We want to have evals that basically capture that entire suite of evals.
The other thing is scenario based tasks. So I might just want to measure how good engine is at
calling a specific tool. The best way to do that is by making a targeted task that just measures
that. And I just said targeted task and then I said NTEN and that's because really what we found

[16:25] is the best way to measure engine is by putting in a representative set of faces and seeing the
entire NTEN. And that allows us to do sub-divide data. So that's the first thing. like I said, let's
just focus on this tool. Let's just focus on this prompt. Let's just focus on the skill that we
thought of, right? Just focus on those things. And how do those, how do those beguile actually form
our decisions? Today, every month we see a new model drop. There's five five, four seven, there's
four six,

[16:56] there's four five. The fact that you all understand what those numbers mean is crazy. But every
month we have to look at what models we're picking. That could be based on performance, cost,
latency, and all of these are feedback signals that Engine takes in and can offer as suggestions for
our customers. The other thing is, context engineering still matters a ton. How should we design the
tools? We don't wanna put 30 tools into Engine. We really design Engine to be autonomous and give it
a minimal set of tools that it can orchestrate itself.

[17:27] And the last thing is, here's a blog that prompt engineering is dead. Me and Ben did not find this
to be true. Prompt engineering is dead. It's still very much alive and sometimes the best way to
actually decode what your customers care about is by writing very detailed case-taking prompts. And
that's one thing we found in our . And I just talked about, okay, I did this whole rigorous process
of, I designed this eval and I took feedback from this person, but now I'm talking about this bias.
I think there is a part of agent measurement

[18:00] that can fully be captured by evals all the time, especially very, very, very, very, very, very,
very, very, very early in our journey. We did not have any, granted. A big part of that was dog-
fooding the product, using it over and over again, and getting feedback like that. And the first
thing here that says trust user feedback, this is huge for us. Usually, when a customer says, hey,
Engine's kinda out of whack, and our eval's, hey, that's not out of whack, it's probably out of
whack, and we need to update our eval suite to measure this. Now, this last slide is kind of the
thing

[18:33] I'm most excited about. As a research junkie, but it's the first part we're seeing of this self-
autonomous agent improvement. So this is Engine, which is an agent improving Engine. Engine produces
traces just like any other agent. And we can take those traces, ease them back into a version of
Engine, and say, hey, how should we improve Engine? Your context management isn't good. Use a
different model on finding these issues. And that's one thing I'm really excited about.

[19:03] I will have a bunch more to share soon, but I will take it back to my brother, Ben, to bring us up.
Just to summarize, a couple of the key learnings we've had in this process. For us, defining that
very crucial point of finding important and meaningful issues was really, really valuable for us
creating a good product. That definition of the most crucial point of flow, I think, is an important
thing to nail as we're building. The other thing is that while we were building agents, we were very
closely monitoring and understanding how was our agent performing under different scenarios.

[19:36] And that was all possible by using tools that we provide at Engine, especially our hardware. And
then I made this comment about the preferences of users that we allow and accommodate for our
Engine. The truth is that Engine was surfacing real and accurate issues to our customers. And even
the accurate issues weren't always meaningful to them because they were misaligned with their
priorities. So the accommodation of this to make sure that our preferences are listened to and that
our customers are able to use our Engine was a small fix, but made a huge difference.

[20:06] Make Engine not just an okay and functioning product, but something that really makes a huge
difference for our customers. And then lastly, as David talked about, this process of creating a
balance is super important for making a good agent. High quality balance ultimately going to lead to
high quality. We are really excited about the future of Langsmith Engine. We really feel like this
is the future of Langsmith and the way that our customers are going to more easily have
observability and create fixes, create their evaluation. And we're excited for that loop to spin
faster

[20:37] and for it to spin more autonomously over time. And we really can't wait for you guys to try it out.
Thank you.