Run Untrusted Agent Code with LangSmith Sandboxes – Mukil Loganathan
Speaker(s): Mukil Loganathan (Engineering Manager, LangChain)
Session: Interrupt 2026 · Day 2 (May 14) · ~11:30 AM PT
Source: in-person audio recording, transcribed locally with Whisper large-v3.
Summary
Mukil Loganathan (Engineering Manager, LangChain) argues that agents are now writing real code across software engineering, data analysis, security pen-testing, and browser/computer control, which creates serious risk when that untrusted code runs on your infrastructure. He cites recent supply-chain and sandbox-escape incidents to motivate sandboxing as a core infrastructure primitive, then introduces LangSmith Sandboxes as a way to execute agent-generated code safely. He walks through four hard problems sandboxes must solve: user-facing latency (spin-up benchmarked at about 0.98 seconds, scaling to thousands of sandboxes), bad actors and container escapes (an egress proxy that carries all of the sandbox's network traffic, with allow/deny lists and credentials kept in the proxy), long-running agents (durable persistence, pause/resume, no time limits), and agents making mistakes (snapshot/checkpoint/restore and forking). LangSmith Sandboxes is available on all plans, can be started with one line, supports bring-your-own Docker images, and integrates with LangChain's observability for full tracing.
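The egress-proxy idea in the summary can be illustrated with a minimal sketch. This is not the LangSmith Sandboxes API: the `EgressPolicy` class, its fields, and the example hosts are all hypothetical, chosen only to show how an allow/deny list and proxy-held credentials might fit together. The sandboxed code never sees the token; in this sketch the proxy attaches it only to requests bound for an allowed host.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class EgressPolicy:
    """Hypothetical proxy-side policy; illustrative only, not a real SDK type."""

    allow_hosts: set[str] = field(default_factory=set)
    deny_hosts: set[str] = field(default_factory=set)
    # Secrets are stored with the proxy, keyed by destination host.
    credentials: dict[str, str] = field(default_factory=dict)

    def is_allowed(self, url: str) -> bool:
        """Deny wins over allow; anything not explicitly allowed is blocked."""
        host = (urlsplit(url).hostname or "").lower()
        if host in self.deny_hosts:
            return False
        return host in self.allow_hosts

    def forward_headers(self, url: str, headers: dict[str, str]) -> dict[str, str]:
        """Return the headers the proxy would send upstream, or raise if blocked."""
        if not self.is_allowed(url):
            raise PermissionError(f"egress blocked: {url}")
        host = (urlsplit(url).hostname or "").lower()
        outgoing = dict(headers)
        token = self.credentials.get(host)
        if token is not None:
            # The credential is injected here, outside the sandbox.
            outgoing["Authorization"] = f"Bearer {token}"
        return outgoing


# Example hosts and token are placeholders.
policy = EgressPolicy(
    allow_hosts={"api.example.com"},
    deny_hosts={"evil.example.net"},
    credentials={"api.example.com": "secret-token"},
)
```

With a default-deny policy like this, a prompt-injected agent that tries to post its environment to an unlisted site is stopped at the proxy, and there is no token inside the sandbox for it to steal.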
Key Points
- Industry shift: agents are writing real code today; by the speaker's estimate about 70% of the room raised a hand for using Claude Code, and he cited 75% of code at Google being generated by AI, a ~41% figure from GitHub's statistics, and Stripe generating 1,300 PRs per week with internal agents
- Code-execution use cases are expanding beyond software engineering into data analysis (e.g., Julius), automated security pen-testing, and browser/computer control (e.g., automated Playwright UI testing)
- Running untrusted agent code is risky; cited recent incidents this year including a supply-chain package attack, a sandbox-escape vulnerability, and prompt-injection-based sandbox escapes
- Problem 1 (latency/scale): sandboxes must be fast and scalable, with spin-up benchmarked around 0.98 seconds and the ability to spin up thousands without maintaining the compute yourself
- Problem 2 (bad actors / container escapes): all of the sandbox's network traffic is routed through an egress proxy, so you can lock down the environment with allow/deny lists and keep credentials in the proxy, which limits the damage if the agent is prompt-injected
- Problem 3 (long-running agents): durable persistence layer with pause/resume, shared space for local agents to interact with the sandbox, and no limits on run time (unlike providers that cap at ~4 hours)
- Problem 4 (agents make mistakes): snapshot/checkpoint/restore and forking of initial state; available on all plans, startable with one line via the SDK, supports bring-your-own Docker images, with full tracing for observability
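The snapshot, restore, and fork behavior described under Problem 4 can be sketched with a toy in-memory model. This is an assumption-laden illustration, not the LangSmith SDK: `ToySandbox` and its methods are hypothetical names, and a real sandbox would checkpoint a filesystem or machine image, not a Python dictionary.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToySandbox:
    """In-memory stand-in for sandbox state; purely illustrative."""

    files: dict[str, str] = field(default_factory=dict)
    _snapshots: dict[str, dict[str, str]] = field(default_factory=dict)

    def snapshot(self, name: str) -> None:
        """Record a named copy of the current state."""
        self._snapshots[name] = copy.deepcopy(self.files)

    def restore(self, name: str) -> None:
        """Roll this sandbox back to a named snapshot."""
        self.files = copy.deepcopy(self._snapshots[name])

    def fork(self, name: str) -> "ToySandbox":
        """Start an independent sandbox from a named snapshot."""
        return ToySandbox(files=copy.deepcopy(self._snapshots[name]))


sandbox = ToySandbox(files={"notes.txt": "important"})
sandbox.snapshot("before-agent")

# The agent makes a destructive mistake.
del sandbox.files["notes.txt"]

# Fork the checkpoint to try a different approach without touching the original.
branch = sandbox.fork("before-agent")
branch.files["experiment.txt"] = "try another approach"

# Roll the original back to the checkpoint.
sandbox.restore("before-agent")
```

Restore undoes the mistake in place, while fork leaves the original alone and gives the agent a separate copy to experiment in.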
Notable Quotes
Agents are writing real code today.
But as Uncle Ben once said, Spider-Man, great power, great responsibility.
We have no limits because we know that, you know, maybe you want your agent to just be running in the background for hours or even days.
Full Transcript
Timestamped transcript (auto-generated; lightly cleaned)
[00:00] If you look across the last few months, one thing that we're starting to see is a transition point
across the industry: agents are writing real code today. I guess a show of hands, who here uses
Claude Code, Gemini CLI, Devin, etc.? Yeah, I think pretty packed house, I would guess maybe
like 70% of the people in here raised their hand on using Claude. Sundar Pichai recently said 75% of
the code at Google is being generated by AI.
[00:32] GitHub released a lot of statistics on how many PRs and code percentages they're seeing. 41%
kind of commenced this year, same thing on AI Cloud. Stripe published a blog where 1,300 PRs per
week are just generated by their internal Minions. So we're kind of seeing that agents are
this meta-shift that's here today. The first kind of main use case where we've seen this is in the
field of software engineering, which is probably what brings most of you here. Open SWE is our
internal open source coding agent that we use as well.
[01:06] It's connected to our Slack, GitHub, etc. And it commits hundreds of PRs across our variety of
repositories. Claude Code needs no introduction, basically it's taking a document to the moon. Devin
is doing tons of cool stuff with sandboxing as well, with local coding, and being able to hand it out
to the community. And Ramp's written a lot of blogs about their insights into how they built this
validation where you can take screenshots of things to kind of code with really humbly humanism,
just doing the man-made stuff.
[01:38] But actually we're seeing this expand to other uses as well. It turns out that giving an agent
access to the computer is actually super powerful. Data analysis is a big emerging trend as well.
Instead of providing predefined SQL tools and things like that, you might just give your agent the
ability to write scripts and load data as a CSV. And then it can kind of figure out how to generate
a graph or do analysis as well. We're seeing this in the financial space, with Julius.
[02:09] We see it in data analysis, products like Hex. So very, very popular trend as well as extension.
Security is another super-cool one. Companies like Expo, Fordor, etc. will run automated pen tests
against your infrastructure. Tell an agent to try and figure out what's wrong, and they'll kind of
automatically run it. This will perform really powerful penetration testing that typically would be
super labor intensive. And this is kind of an extension of the ability to execute code.
[02:43] There's this idea of a browser and a computer. So not only are we allowing the agent to write code,
we're allowing the agent to control a browser. Or control your entire system. Some examples of this
would be StateCamp, which has automated Playwright testing. You can have it kind of autonomously go
through your UI and verify that the plot forward. You can deploy your whole system and do cool
things. But as Uncle Ben once said, Spider-Man, great power, great responsibility.
[03:14] It's actually super risky, basically, to allow this kind of untrusted thing to execute code manually.
These are all from this year. So earlier this year, we had the Versailles-Bellew attack, which I
guess earlier was the last month. And that would basically, if you install a package, export all
the .env files that were on your machine. A human, we don't even really trust humans to do it. I
don't necessarily trust an agent to do that either. Tech startups are even running this work. NetEnt
recently had a sandbox escape vulnerability where you could use certain JavaScript scripts to
escape.
[03:50] And that's the sandbox that they provided. And prompt injections, you know, they've been serving for
annually building of many apps. Recently, Google was being hijacked, perhaps, had an issue where you
could kind of inject a prompt to actually escape their sandbox environment. And that's where this
idea of a sandbox kind of comes in. It's, you know, this separate infrastructure primitive that allows
you to execute code safely and remove it from your code. There's a couple, you know, key
characteristics of this. There's also different levels of sandboxing. What do you mean by key
sandboxing?
[04:20] You have the sandboxing of the entire computer, et cetera. But, you know, that's what the core of
the primitive is. And that's also where, kind of, LangSmith Sandboxes comes in. Because
building this primitive is actually quite hard for a number of reasons as well. The first problem is
that agents are becoming user-facing. So users expect fast responses. I get pretty mad if I'm
waiting a few seconds for a response. And then we've done a little thing with it. With Bell, that's
a pass. With Bell, our table will respond. But you still want some of your dependencies. And
agent-facing workloads are also good.
[04:50] And user-facing workloads are also fairly good. You can scale thousands. You know, you might have
virtually workloads. So LangSmith Sandboxes, if you saw this, really good scale. So we're actually a
little bit faster than this. We kind of classically benchmarked this. And we're constantly working to,
you know, speed up our sandbox spin-up time. But we're at, like, 0.98 seconds right
now. So we're about to hit here. You can also spin up thousands of sandboxes with the dynamic. So
you don't need to maintain the compute to run thousands of these things.
[05:20] We'll kind of do that for you in the future as I go through the task problem. I guess the other
problem is that, you know, with users, you also get bad actors. You know, people who are trying to
exploit your compute and do bad things. Container escapes are a huge problem that we've seen. And
I'm sure that containers are not enough for this kind of problem. Just recently, there was a
copy-fail attack where you could have a 700-line, 700-byte script. And you could kind of get access to
the whole push-and-run and do the malicious thing. So prompt injections I already mentioned.
[05:52] Poison retreen or something. Get your agent to export the environment to a random website. But
malicious HTTP servers are also kind of another barrier that we're seeing. It's like, you know, like
a honeypot that gets your agents to do a bad thing. And so on the LangSmith Sandboxes side, we have this
concept of an auth proxy. Basically, all network traffic from the sandbox goes through it. And so you can
lock down the environment. And you can basically set allow lists and deny lists.
[06:22] And keep your credentials as part of the proxy. So that actually is not being sensitive in the
run-back. This is super powerful. You can block these. You know, you have to keep these humans. You get
prompt injected. You can survive. And we're really excited about this. It's kind of a full-fledged
environment. Third problem is that agents are kind of long-running now. And we're always seeing
these get longer. So in sandboxes, you need to be able to start. Randomly pause, resume. But you
also don't want to, you know, lose that state. If I turn off my local computer and I turn it back on, I
have all my files still there.
[06:56] The agent kind of expects that same kind of thing as well. And that's actually super powerful.
Traditional work was kind of done. So with LangSmith Sandboxes, we've been able to sort of bring a durable
persistence layer. So any state that you build up on your machine, you can use it the other way. We
support the ability to pause and resume. So because we have such fast spin-up, we can
spin it down in times of inactivity. Spin it back up to only 20 pages or something that you can drag
and use it. We also have no limits on how long the sandbox will run.
[07:28] Most of the other providers have a limit for, you know, we can keep it up for four hours and bring
it back. We have no limits because we know that, you know, maybe you want your agent to just be
running in the background for hours or even days. And then we also have the ability to have local
agents interact with the sandbox. So we have kind of like shared space if you want to multiply it on
a muscle or a computer, you can kind of do that as well. And the last problem that we're seeing is
that agents make mistakes. You know, most of you are probably doing evals,
[07:58] and parents prevent a lot of it. But, you know, agents always make mistakes. And this is just a
problem that's like inherent to these stochastic systems. So on the sandbox side, we support a bunch
of capabilities that help with this. At any point in time, you can snapshot. And what that really
empowers is that you can do this. And the ability to also have a preschool. So I recently had an
agent delete my Slack account in Slack as well. I wrote a blog on DNS. I wish I could have
checkpointed and restored that, but of course I didn't have that. But you can do this in the online
market.
[08:28] You can also fork to take that initial state and maybe try out and with it as well. If you want,
they're going to continue doing that. You can get started with one line of code today. It's part of
our LangSmith SDK. So feel free to get started. It's available on all LangSmith plans. You can bring
your own cloud systems. If you already have Docker images that you might be building, et cetera,
with the packages that you want installed, CLI, whatever, that works. You can plug it into your
stack.
[08:58] This would be LinkedIn. We have a rich open source ecosystem around our products. You can be
using something like Deep Agents, Open SWE. We have a CLI. Tons of features across the board. We've got a
lot on the roadmap. We're very early in the lifecycle with the products. I was mentioning a little bit
about HandOn, which we're excited about. ShareColumn is going to bring the same code or files to a
bunch of different engines. And then we're an observability company. We do also want to support
full execution tracing to see what's actually happening.
[09:29] Try it out today. It's available on all LangSmith plans. Release on all the rest of the days. And
we'd love to hear your feedback.