Building Replit Agent v2

Michele Catasta

Harrison Chase
Summary
Michele Catasta from Replit discusses the evolution of Replit Agent V2, emphasizing its increased autonomy compared to V1. Key to this advancement are early investments in evaluations and observability (using tools like LangSmith). While V1 required more human-in-the-loop interaction, V2 aims for greater autonomy, running tasks for 10-15 minutes. User trust and varying risk appetites are managed by allowing user control and providing notifications via a mobile app, though the goal is to reduce interruptions as agent reliability improves. Replit Agent is used by a diverse audience, from hobbyists to businesses building internal tools, with a notable trend towards unbundling SaaS features. The agent leverages frontier models like Claude 3.7 Sonnet and uses multiple models in a single run, though Replit maintains an opinionated stance on model selection rather than offering user choice. A significant challenge is balancing performance, cost, and latency, with Replit prioritizing performance and correctness, even if it means longer run times. Future developments for V3 include integrating computer use for testing, enhancing software testing capabilities, and exploring test-time compute with sampling and parallelism. Observability remains a critical but difficult aspect, akin to an "assembly era of debugging" for agents.
Auto-Highlights
Highlight: Replit Agent V2, Count: 2, Rank: 0.10; Highlight: autonomy, Count: 5, Rank: 0.09; Highlight: evaluations, Count: 2, Rank: 0.08; Highlight: observability, Count: 3, Rank: 0.08; Highlight: human in the loop, Count: 2, Rank: 0.07; Highlight: new models, Count: 2, Rank: 0.07; Highlight: coding agents, Count: 2, Rank: 0.06; Highlight: LangSmith, Count: 2, Rank: 0.06; Highlight: frontier models, Count: 1, Rank: 0.05; Highlight: Claude 3.7 Sonnet, Count: 1, Rank: 0.05; Highlight: computer use, Count: 1, Rank: 0.05; Highlight: software testing, Count: 1, Rank: 0.05; Highlight: test-time compute, Count: 1, Rank: 0.05
Transcript
With that, I'm thrilled to announce our next speaker, from Replit. You're all probably familiar with Replit. They've made programming accessible to anyone. They've revolutionized how we write, deploy, and collaborate on code, empowering a community of over 30 million developers to build more efficiently than ever before. I'd like to welcome my friend Michele, President at Replit, to the stage for our fireside chat. Welcome, Michele.
Speaker A: So I think most people are probably familiar with Replit and what they do. You guys launched V2 of Replit Agent six months ago.
Speaker B: Two months ago.
Speaker A: Two months ago. Okay.
Speaker B: Early access, then GA at the end of March.
Speaker A: And I've heard nothing but fantastic things about it. So if people haven't tried out Replit Agent in the last two months, what is new, what is different? Why should they try it out?
Speaker B: I think the shortest possible summary is autonomy. The level of autonomy is on another level compared to V1. If you tried V1, which launched in September last year, you'll recall that it was working autonomously for a couple of minutes at most, and right now it's not uncommon to see it running for 10, 15 minutes. And what I mean by running is not spinning its wheels, but rather doing useful work and accomplishing what the user wants. It took a lot of re-architecting, plus new models coming out, and things we learned. To be honest, shipping things into production teaches you a lot. I think we learned a lot of tweaks to make the agent a lot better in these months.
Speaker A: Are you able to share any of those tweaks?
Speaker B: Yeah. Where do I start? I usually point to two pillars, which, by the way, reiterate what you just explained in your talk. On one end, investing early in evaluations is extremely important. Otherwise, especially as your agent becomes more advanced, you have no idea whether you're introducing regressions or making progress. The other one is observability. We can go deep in there. As you know, we use LangSmith pretty thoroughly, and we also use another set of tools. I think we're all learning as a field how to do this. It's a completely different animal compared to the distributed systems of the past decades.
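To make the "investing early in evaluations" point concrete, here is a minimal sketch of a regression-style eval harness. The eval cases, the scoring rule, and the run_agent stub are illustrative assumptions, not Replit's actual setup.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    prompt: str              # what the user asks the agent to build
    must_contain: list[str]  # crude success signals in the final output


# Tiny illustrative eval set; a real one would have hundreds of cases.
CASES = [
    EvalCase("Build a Flask app with a /health endpoint", ["flask", "/health"]),
    EvalCase("Create a TODO app with SQLite persistence", ["sqlite", "todo"]),
]


def run_agent(prompt: str) -> str:
    """Stand-in for a real agent run; returns the agent's final artifact or log."""
    return f"generated flask app with /health, sqlite todo store, for: {prompt.lower()}"


def score(case: EvalCase, output: str) -> float:
    hits = sum(signal in output.lower() for signal in case.must_contain)
    return hits / len(case.must_contain)


if __name__ == "__main__":
    # Tracking this aggregate over time is what catches regressions between releases.
    scores = [score(c, run_agent(c.prompt)) for c in CASES]
    print(f"mean eval score: {sum(scores) / len(scores):.2f}")
```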
Speaker A: One of the things that I'd like to hear more about: we did a separate fireside chat, maybe in December, and we talked about the human-in-the-loop experience and how important that was at the time. Now you're saying these agents are more autonomous. How do you think about that? Has that changed, or is it just present in a different way?
Speaker B: Yeah, you're spot on. There is this constant tension between wanting to put the human in the loop, so that you can break the agentic flow and make sure that in case it goes sideways the human can bring it back on track, and, at the same time, what we're hearing from our users: when the agent is actually working correctly, they don't want to be bothered, they just want it to get things done. And the bar keeps rising, basically on a monthly basis. The more we can get done, it maybe takes a week for users to get used to it, and then they just want more. So the strategy that we're following at the moment is that we try to push notifications out to other platforms as well. The mobile app, for instance, obviously allows you to bring the user's attention back. But at the same time there is always a chat available where you can ask the agent to stop, or ask it to do different work, even while it's actually working. It depends, I think, on the user profile: some users tend to be more trusting and they delegate more agency to the agent, and some others are a bit more hands-on. We're trying to build a product that works for both of them, but I think overall we are all going towards more autonomy over time, and that's the big investment.
Speaker A: On the topic of users, how are people using Replit Agent? What types of things are they building? What are their backgrounds? Who are the users you're seeing?
Speaker B: Yeah, so starting from early February, we finally opened our free tier, so everyone can use Replit just by creating an account. We're on track to create about 1 million applications per month; that's the level of scale we've reached today. A lot of them are just testing what agents can do. I think the same high that we got when we were younger, when we wrote our first piece of code and actually saw it running: that's what a lot of people are chasing when first trying the agent, realizing that you can actually build software even without having any coding background. At the same time, some of them get hooked and realize, oh, I can build what I need for my business, I can build something that I need at work. And that's when they start to work on much more ambitious applications. So I think one of the key differences of our product is the fact that it's not used mostly to create simple landing pages or prototypes; rather, people find value in very long trajectories. I've seen people spending hundreds of hours on a single project with Replit Agent, writing absolutely no lines of code, just making progress with the agent. That is, first of all, a great technical challenge, because it makes things much harder for several different reasons. And the people spending that much time are usually building internal tools in companies. That's something I'm very excited about: there is this concept of unbundling SaaS that people talk about, the idea that, why would I spend seven figures buying a very expensive SaaS when I only need a couple of its features? I'd rather rebuild what I need and plug it in inside the company. So this is one direction that I see a lot of companies working on. And at the same time, also personalized applications for professionals, or even people that have their own hobby and want to build tools for it, things like that. So that's the landscape today.
Speaker A: Awesome. For people who are maybe starting with agents on the lower end of the autonomy spectrum and are thinking of letting them run for 10, 15 minutes: how did you get the confidence to let it do that? When was the point where you said, okay, bring the human out of the loop and we can start letting it run? Was that based on feedback from users, internal testing, metrics? What did the process to get that confidence look like?
Speaker B: I would say a lot of internal testing. Even before we launched V1, we had a prototype of it since early '24, so we had always been trying to make it work. The unlocks came partially from what the frontier labs are working on, so the new models that they give us, and at the same time from all of the stuff that we're building ourselves.
The moment it works well enough, that's when we start to feel we should launch it, or at least put it in front of some early users and go from there. What happened with V2 is that we re-architected it to best leverage the latest models out there, and then we started to use it a lot internally. We started with an approach that was a bit more similar to V1, so we were more cautious, and then we just gave it more leash: how far can we take this? How well is it going to work? And it turns out that it exceeded our expectations. So the confidence, in all honesty, as usual, came during the early access program, where we launched it as an opt-in. We asked users to just switch it on and go try it, and we received exceedingly positive feedback. And then, as a team, we rushed to basically go GA as soon as possible.
Speaker A: You mentioned models a few times. Are you able to share what models you are using, or how you generally think about the model landscape?
Speaker B: We are heavy users of the Sonnet models, especially 3.7, as it unlocked a new level of autonomy for coding agents. I see the overall industry pointing in that direction. The latest Gemini 2.5 is also following a very similar philosophy, and I do believe that the frontier labs are realizing that there is a lot of value in allowing companies like ours, and all of your customers, to create much more agentic, much more advanced tools compared to the past. So I wouldn't be surprised if, in the next releases, you are going to see all the frontier models exposing tools and being trained in such a way that allows you to have much more autonomy.
Speaker A: And do you let users choose what model is used under the hood, or is that hidden?
Speaker B: We are very opinionated, and it's also a product choice, in all honesty. There are platforms where, of course, you can pick your model; we use some of them ourselves, for example, to develop parts of Replit. So I think it's great to have a model selector and get the best possible performance from the different models on the market. In our case, it would be a fairly big challenge to allow you to switch models. We use multiple models, by the way, in one run. Claude 3.7 Sonnet is kind of like the fundamental building block for the IQ of the agent, but we also use a lot of other models for accessory functions, especially when we trade a bit of performance for latency; then we go with flash models, or faster models in general. So we don't give you that optionality, because it would be very hard for us to even maintain several different prompt sets. You know, we go very deep into the rabbit hole on prompts. It would be very hard for me to go from n = 1 to n = 3 sets of prompts; that would be quite a lot of work.
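For illustration, a minimal sketch of the "multiple models in one run" pattern described above, where a frontier model handles the core reasoning and a cheaper, faster model handles latency-sensitive accessory calls. The model names, task categories, and call_llm stub are assumptions, not Replit's implementation.

```python
from dataclasses import dataclass

# Hypothetical model identifiers, stand-ins only.
CORE_MODEL = "claude-3-7-sonnet"   # drives the agent's planning and coding "IQ"
FAST_MODEL = "small-flash-model"   # cheap, low-latency accessory work


@dataclass
class LLMCall:
    model: str
    prompt: str


def route(task_kind: str, prompt: str) -> LLMCall:
    """Pick a model per call: heavy reasoning goes to the frontier model,
    latency-sensitive accessory functions go to a faster, cheaper one."""
    accessory = {"summarize_logs", "name_commit", "classify_intent"}
    model = FAST_MODEL if task_kind in accessory else CORE_MODEL
    return LLMCall(model=model, prompt=prompt)


def call_llm(call: LLMCall) -> str:
    """Stub for whatever provider SDK you use; swap in a real client here."""
    return f"[{call.model}] would answer: {call.prompt[:40]}..."


if __name__ == "__main__":
    # One agent run mixes both: plan/code steps use the core model,
    # while quick accessory steps use the fast model.
    print(call_llm(route("plan_feature", "Add auth to the Flask app")))
    print(call_llm(route("name_commit", "diff: +login route, +session store")))
```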
Speaker A: Do you use any open source models as well as part of this, or is it mostly foundation models at this point?
Speaker B: It's mostly foundation models. We definitely spent some time testing DeepSeek, and I'm very bullish long term. The reason why we're not investing too much time today in fine-tuning or exploring open source models is because, again, the labs are moving at a completely different pace compared even to a year ago. Back when we last talked to each other, maybe there was a new release every six to ten months; now it's probably happening every couple of months. So it's probably better to explore what you can do today with the frontier labs, and then eventually, when things slow down (which will happen, by the way), or if there is a reason for us to take an open source model and tune it to optimize some of the actions that our agent takes, then I'll be happy to spend time there. But for now it's really frantic as it is.
Speaker A: You mentioned the trade-off between cost and latency, and there's also performance. How do you think about that now, and how have you thought about it over time? Because Replit Agent, at least based on what I see on Twitter, has exploded recently. I think everyone has some fear when they launch a generative AI application: if this becomes really popular, it's going to bankrupt me. Did you have that fear as you started to see things take off?
Speaker B: I had all of those fears, I think. I went on a podcast, probably in early October last year, saying that the three dimensions you want to optimize are performance, cost, and latency. For me, performance and cost are at almost the same level in terms of importance, and already back in the V1 days I was putting latency last. That hasn't changed much today; if anything, the gap is even wider.
Speaker A: Because it runs for so long.
Speaker B: It runs for so long, yeah. Possibly that was the scariest bet we made when we launched it, especially when we turned it on and made it GA. The reason is that we were really not emphasizing the latency component too much, and we strongly believe that it's far more important for the agent to get done what people want, especially for the ICP that we have in mind, which is non-technical people. So we went up almost an order of magnitude in terms of additional latency, and the reaction has been fairly uncontroversial. Maybe for the first week we heard some people being shocked about the amount of time it was taking, but then people realize how much more it gets done and the amount of headaches it solves for you, because you don't have to go and try to debug things. Even if you debugged with an older version of the agent, you had to know what to ask, and now that's often not the case anymore.
Speaker A: So do you see people modifying the code manually still, or is it completely hands-off?
Speaker B: That's a great question. We have an internal metric, and it's one of my North Star KPIs, to be honest. We track how often people go back into our editor, which, by the way, we have been hiding in the product since we launched Agent V1.
Speaker A: I mean, that was the main...
Speaker B: That was the goal, yeah, exactly. The main product, for those of you who didn't know of it before we launched the agent, was an editor in the cloud. We started by still showing the file tree; now it's hidden by default, and it takes some effort to get in front of the editor. We started with, I think, one user out of four actually still editing the code, especially the more professional ones. As of today we've arrived at a point where it's one out of ten, and my goal is zero users editing the lines of code.
Speaker A: One of the cool features of Replit that I remember from before Agent was the multiplayer collaboration thing as well. When people build agents, is there a collaborative aspect to it, or is it mostly kind of...
Sorry, when people build apps with Agent, is it mostly one person using the agent, or is there sometimes a collaborative aspect as well, with several people interacting with the agent?
Speaker B: For our consumers all around the world, most of them, I think, are a single-player experience. In a business and enterprise setting especially, we bring them in as a team, so everyone can see each other's projects, and we see them using the agent together. Now, we have a giant lock as of now, for reasons I'm happy to explain, but we oftentimes see in the chat logs that there are several people basically sending prompts to the agent. The challenge, the reason it's still hard to run a lot of agents in parallel, is not so much on the infrastructure side; we have everything it takes to run multiple instances, so that wouldn't be such a big leap. The real challenge is how you merge all the different patches, basically the PRs that the agent creates, which is an unsolved problem even for frontier AI models. Merge conflicts are hard, unfortunately.
Speaker A: You mentioned earlier that there's an app for Replit for getting notifications. Where I'm going with this is: when this agent is running for 10, 15 minutes, what are the communication patterns you're seeing? How do the users know when it's done? Are they just keeping the browser open and looking there? Do you have Slack notifications? Is it the app that sends them a push? What are you seeing being helpful there, and has that changed as the agent gets longer and longer runs?
Speaker B: Yeah. So with V1, most of the users were watching it all the time, because the feedback loop was relatively short, and I think there was also quite a bit to learn from what the agent was doing. That's still the case today: it's fairly verbose if you're curious. You can basically expand every single action it takes, and you can see the output of every single tool call if you want. We try to be as transparent as possible. So there is a subset of users that are using the agent not just because they want to build something, but also because they want to speed-run their learning experience. It teaches you how to build zero-to-one apps in the best possible way. There are also users that absolutely don't care, and they just submit a prompt and then step away; maybe they come back now and then to check on Replit. To make that loop a bit tighter, the Replit mobile app, available on the App Store and Android, sends you a notification when the agent wants your feedback. And the vision that we have for the next release is to send you even fewer notifications. Right now, one of the bottlenecks for us is the fact that we rely solely on humans for testing. As you know, more and more progress is happening on the computer use side: that started back in late October, if I remember correctly, OpenAI has followed, and open source is also catching up, with Hugging Face launching something similar a week ago. That is something we are actively working on, to remove even these additional hurdles from the user, because a lot of the time what we ask you to test is very tedious, like data entry and clicking around to verify every single feature. I expect us to be able to do that with computer use very soon, bring it into the product, and then jump from, say, 10 minutes of autonomy to one hour of autonomy. That is my target over the next few months.
Speaker A: How do you think about the fact that there's testing, but there's also making sure that it's doing what the human actually wanted? Oftentimes we're bad communicators and don't specify everything up front. How do you think about getting all that specification? Do you have something like deep research, where it grills the user back and forth at the start, or how do you think about that?
Speaker B: So we are changing the planning experience as we speak; we're going to ship a change very soon. It's hard to reconcile how most users have been trained by products like ChatGPT with how we actually expect them to use a coding agent, or an agent in general. If you have a quick generative task, you just ask for it in chat. In the case of building software, you basically want to submit a PRD. Some users are capable of doing that. Or what they do is write a two-line prompt, throw it into a chatbot, get back a long PRD, and then expect Replit to pedantically follow every single line of that PRD. We're not there yet.
Speaker A: So...
Speaker B: The challenge here is to make both kinds of people happy: people that love to use it as a chatbot, doing basically one single task at a time, and people who want to hand over a whole spec. We put some effort into training: we did a course with Andrew, who is going to be on stage in a few hours, just to tell people that if you want to use it that way, it's important to split your main goal into subtasks and execute them sequentially. But at the same time, I would love to reach a point where we go through the whole task decomposition, we get things done, and maybe we only ask for feedback after one hour; then it's up to you as a user to find out whether we accomplished everything you wanted. I think there is so much that can be done autonomously that it really brings you, say, 90% of the way to what the user wants, and then, when we get their attention back, we polish the user experience and fine-tune things to what they want.
Speaker A: You mentioned observability and thinking about that early on. What have you learned as Replit Agent has gone crazy viral?
Speaker B: Observability is even harder than expected, regardless of the fact that you guys are doing something awesome with LangSmith.
Speaker A: What are the hardest parts? Give us some product feedback, I guess.
Speaker B: So first of all, this is a bit like back in the days when we were discussing the best possible architecture for databases: one size does not fit all. There is the Datadog style of observability, which is still very useful. You want to have aggregates, you want dashboards that tell you something is failing 50% of the time and that trigger an alert, so you go ahead and fix it. At the same time, something like LangSmith is extremely important, because unfortunately we're still at the assembly era of debugging for agents. I think you will agree with me: when you are trying to understand why the agent made the wrong choice or went sideways, your last resort is to actually read the entire trace, all the generated output, and try to figure out why.
Speaker A: Certainly.
Speaker B: So it's much more effort to debug compared to a standard distributed system. Aggregates are not enough. You have something that looks like a step debugger, but rather than stepping over statements and inspecting memory, you need to read 100,000 tokens and figure out what's wrong. So I think we are at the early stages of observability. But what I recommend to everyone that is starting to seriously think about building an agent, or any agentic workflow, is to invest in it from day one. Otherwise you're going to be lost immediately, and you're probably going to give up, because you're going to think it's impossible to pull this off. I hope that we are proof, and many other companies are proof, that it's not impossible. It's just really hard.
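For readers who want to try the kind of tracing being described, here is a minimal sketch using the LangSmith Python SDK's traceable decorator, so every nested call in an agent step shows up as one readable trace. The function names, prompts, and the run_tests "tool" are hypothetical; only the langsmith import and decorator reflect the real SDK.

```python
from langsmith import traceable

# Tracing is enabled via environment variables, e.g.:
#   LANGSMITH_TRACING=true  LANGSMITH_API_KEY=<your key>


@traceable(run_type="tool", name="run_tests")
def run_tests(project_dir: str) -> str:
    # Placeholder tool: a real agent would shell out to an actual test runner here.
    return f"2 passed, 1 failed in {project_dir}"


@traceable(run_type="chain", name="agent_step")
def agent_step(user_request: str) -> str:
    # Each nested call shows up as a child run in the trace, so when the agent
    # goes sideways you can read the whole trajectory instead of guessing.
    test_report = run_tests("/home/runner/app")
    return f"Request: {user_request!r}; latest test report: {test_report}"


if __name__ == "__main__":
    print(agent_step("Add a login page to my app"))
```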
Speaker A: Who do you see being the best at this? Who debugs these agents? Is it everyone on the team? You guys are building a technical product, so presumably everyone has some product sense and product feel for it. But is there a particular persona that spends the majority of their time in LangSmith looking at logs, or who has the best skill or knack or intuition for that?
Speaker B: Given the size of Replit today, we are barely 75 people across the entire company, and the way it works is that everyone does a bit of everything. So even if you're an AI engineer and you're the person that's been optimizing the prompts, if there is a page and something is broken, most of the people on the technical team are capable of going all the way from the product surface down to the metal. What makes it a bit more challenging for Replit is that we own the entire stack. We have the execution plane, where we orchestrate all the containers, and we have the control plane, which is basically a combination of our agent codebase, LangGraph-style orchestration, all the way up to the product. So it's important, unfortunately, as of now, to be capable of reading the traces all the way down, because problems can happen anywhere. You know, maybe the way the agent calls one of the tools is correct, but it could be that the binding of the tool itself is broken.
Speaker A: We've talked a bit about the journey from V1 to V2. Maybe to close us off, what's coming in V3? What are some things on the roadmap that we can expect?
Speaker B: I already hinted at one of them: I expect us to bring in computer use, or in general make it easier to test applications. At the same time, I'm also very bullish on bringing in more software testing. The beauty of being a coding agent is that code is far more observable, and there are way more ways to probe the code to test whether it's correct or not. And last but not least, we want to work even further on test-time compute. As of today we already use a fair amount of tokens, as you know, but we definitely want to explore both sampling and parallelism. We see this especially at the beginning: a lot of our users open several projects in parallel and do the initial build so they can see which one matches their UI taste better. I imagine taking this concept and carrying it along the entire trajectory, where you sample, then you rank and pick the best solution for the problem. This would be for our high spenders, but it definitely helps to get better performance.
Speaker A: Awesome. Well, I'm looking forward to all of those. Thank you, Michele, for joining me.
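The sampling-and-parallelism idea described in that last answer can be sketched roughly as follows: run several candidate trajectories in parallel, then rank and keep the best. This is a hypothetical best-of-n illustration, not Replit's implementation; generate_candidate and rank are stand-ins for a real agent run and a real judge (tests, an LLM ranker, or the user).

```python
import asyncio
import random


async def generate_candidate(prompt: str, seed: int) -> str:
    """Stand-in for one full agent trajectory (plan -> code -> test)."""
    await asyncio.sleep(0)                  # pretend to do work
    quality = random.Random(seed).random()  # fake quality signal
    return f"candidate #{seed} for {prompt!r} (quality={quality:.2f})"


def rank(candidates: list[str]) -> str:
    """Pick the candidate with the highest (fake) quality score."""
    return max(candidates, key=lambda c: float(c.rsplit("=", 1)[1].rstrip(")")))


async def best_of_n(prompt: str, n: int = 4) -> str:
    # Sample n trajectories in parallel, then keep only the best one.
    candidates = await asyncio.gather(*(generate_candidate(prompt, i) for i in range(n)))
    return rank(list(candidates))


if __name__ == "__main__":
    print(asyncio.run(best_of_n("Build a CRM for a small bakery")))
```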