Observing and Testing CX Agents – Cisco

Speaker(s): Carlos Pereira (Fellow & Chief Architect, Cisco)
Session: Interrupt 2026 · Day 2 (May 14) · ~10:00 AM PT
Source: in-person audio recording, transcribed locally with Whisper large-v3.

Summary

Carlos Pereira (Cisco) follows up on his Day 1 talk to show how Cisco observes and tests its customer-experience agents at production scale, across both a renewals team and technical support. The core idea is a continuous feedback loop: production traces are captured via LangSmith, every thumbs-down becomes a signal, and AI/code agents triage failing traces, group correlated cases, diagnose root causes, and open PRs that humans review and merge so fixes become permanent regression tests. He stresses that evals are part of the infrastructure (not a side project), MCP is the integration layer that lets backends be swapped without changing agents, and that humans should be kept for decisions while subject-matter experts define the routing. For support, he demos an anonymized real customer assessment where AI surfaced the highest-risk findings (over 350 high/critical severity items out of 1,176 potential findings) using semantic routing across specialized agents.
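
The triage step of that loop can be sketched in a few lines. This is a minimal illustration, not Cisco's implementation: the `Trace` record, the `FALSE_POSITIVE_ERRORS` set, and grouping by component are assumptions made for the example. The talk only states that thumbs-down traces are filtered for false positives such as auth errors and that similar traces are grouped before a fix is opened.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Trace:
    """Hypothetical, simplified stand-in for a production trace record."""
    trace_id: str
    component: str                 # agent step that produced the answer
    feedback: str                  # "up" or "down"
    error: Optional[str] = None    # e.g. "auth"; None when no error was logged


# Assumed false-positive categories; the talk names only authentication errors.
FALSE_POSITIVE_ERRORS = {"auth"}


def triage(traces: List[Trace]) -> Dict[str, List[str]]:
    """Keep actionable thumbs-down traces and group their ids by component."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for trace in traces:
        if trace.feedback != "down":
            continue  # only negative feedback is treated as a signal here
        if trace.error in FALSE_POSITIVE_ERRORS:
            continue  # nothing to fix in the agent itself
        groups[trace.component].append(trace.trace_id)
    return dict(groups)
```

Grouping by component is a deliberately crude proxy for "similar traces"; the point is that many thumbs-down events collapse into a handful of candidate fixes.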

Key Points

Cisco's customer experience organization is ~19,000 people; the demo'd systems run in production at scale (around 10,000 cases at the same time)
Every thumbs-down is treated as a signal, not noise: traces are pulled from LangSmith, related code is read, similar traces are grouped, and false positives (e.g., auth errors) are filtered before opening a fix
AI/code agents diagnose failing traces and open PRs; humans review/edit and merge, and each merged fix becomes a permanent regression test that 'compounds'
Four success factors: treat evals like tests not experiments; do both component-level and end-to-end analysis; version datasets inside the code (not the dashboard); use MCP as the integration layer
MCP lets Cisco swap backends (Jira, etc.) without touching the agent, and the same MCP tools also power the UI and charts
Support: an anonymized real assessment last week found 1,176 potential security findings with over 350 high/critical severity items, surfaced via semantic recovery/routing across specialized agents (security, troubleshooting, assets, inventory)
Keep humans for decisions only and keep subject-matter experts defining the routing; LangSmith is used for observability and is described as super critical to the business
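
The "evals like tests, datasets versioned in the code" point can be sketched as follows. This is a hedged illustration only: `REGRESSION_CASES`, `add_regression_case`, `run_regression`, and the substring check are hypothetical names and a deliberately simple pass criterion, not anything shown in the talk.

```python
from typing import Callable, Dict, List

# The dataset lives in the repository next to the agent code (assumed layout),
# so every change to it is reviewed and versioned like any other code change.
REGRESSION_CASES: List[Dict[str, str]] = [
    {
        "id": "fix-001",
        "question": "When does my contract renew?",
        "must_contain": "renew",
    },
]


def add_regression_case(case_id: str, question: str, must_contain: str) -> None:
    """Record a merged fix as a permanent regression case."""
    REGRESSION_CASES.append(
        {"id": case_id, "question": question, "must_contain": must_contain}
    )


def run_regression(agent: Callable[[str], str],
                   cases: List[Dict[str, str]]) -> List[str]:
    """Run every case through the agent and return the ids that fail."""
    failures: List[str] = []
    for case in cases:
        answer = agent(case["question"])
        if case["must_contain"].lower() not in answer.lower():
            failures.append(case["id"])
    return failures
```

Run in CI, an empty failure list gates the merge; each new case added after a fix is what makes the suite compound over time.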

Notable Quotes

Treat evals like tests, not like experiments.

Keep humans only for the decision.

MCP is our integration layer.

Full Transcript

Timestamped transcript (auto-generated; lightly cleaned)

[00:00] And so I'm excited for you guys to hear about how they're building their customer experience agents.
Welcome, Carlos. Good morning everyone. Good morning and welcome back. Well, there's the Wipers, the
ones that actually made the second day. That was actually an awesome demo that we did today.

[00:32] So, we have been using a lot of what we are sharing, and some of this is running in production. So, day
two. My name is Carlos. For the ones that were here yesterday, we talked about the names, on the
building agents today, we're going to talk about how deep went and built their automation and
testing for that. Just one observation. In addition to myself, we have Vince, Emma, John, and Thomas
on our group out there. A lot of people came to understand the debate. There is so much I can thank
you for. Over there we have all the work photos, how we did the planning yesterday,

[01:05] and all the coding that we want to take up. With that said, I'm going to put a banner here for the
ones that are watching this later on YouTube to explain what this is in minutes. And then, big time,
on what is the team-hate for the youth that we did yesterday, what they also did to the testing. And
then, of course, support. Support at Cisco is an interesting piece. We have an average of 60 median
corrections a year. And there's a lot of takes and cases.

[01:35] Last year was 1.16 and this year is 1.4. So, it increases a lot of the deflections and faints for
some of the systems that I'm going to show you. And also, we are more proactive to the system. But
think about Cisco. We haven't had to work in security. So, when they haven't had to work out, people
immediately get into the store. And there is a lot of things that doesn't work. Pretty much
everything happens at this time. So, when you have those cases, we have sometimes catastrophic
outings that happen at this time.

[02:05] So, how we do that, and how you use agents for that. So, in a nutshell, we have customer
experience, same number of knowledge in the industry, and land, adopt, expand and renew, which is
going for the teams that we have. This is the customer experience organization, around 19,000
people. We talk about renewals. We have a new team yesterday, and the teammates that now are in
the same profession, belong to that team. We are going to also go to technical support today. So,
with that said, the renewals team that we mentioned yesterday,

[02:35] we are both from the initial agent foundation that we used last year. There was more about how we
earned the products and how we get an interface and a chatbot. So, now we have one in production, a
workflow-based approach, where you leave the notion of delegation, create a gigantic team-base, and
be part of the team, and has an alternative use of that, and you have the full file. So, that was
yesterday. For the ones that are watching on YouTube, I recommend you to just watch the YouTube
video from yesterday's session.

[03:05] But the question of today is, how can we close the loop between the production feedback that is
running at scale, and make this as code for fixing, using AI and agents for that. How we observe in
the case of support, and actually build a system, and how we can do that semantically, because I run
around 10,000 cases at the same time, so there is no way that any human is going to be able to fix all
of this at the same time. And how the whole technology can become your neighbor,

[03:35] but also your government if you don't want it. So, let's go for that. So, for yesterday, we learned
that, hey, we built a very nice chatbot interface, 95 plus percent accuracy, energy, stat, agency,
and the techniques we are back, all the things that were right, and then I told you that people, not
so long ago, they ghosted us at some time. So, I explained to you that a lot of this was optionality
that we gave them, and then we removed it, of course,

[04:06] and then another one was to see if it was more autonomous, and give them more personalized, and more
of an ability to automate. So, we went from low usage to usage climbing up. And we have, we then, we
have an abolition feedback. Right? Good problem to have. So, everything is now a signal. It's not
noise. When I put out there, and there is a thumbs-down, for feedback, this is a signal. It's not
something that can be ignored. Because otherwise, what comes up as an option is going to come up as
a feedback, so we just don't know what it is.

[04:36] Right? So, every thumbs-down is a signal. Every face in the mirror is a potential regression. And
every confused user is a description of a plan that needs to be addressed. Think about it. The human
becomes the bottleneck, because the same intensity that you've danced for a teammate cannot scale for
thousands of thumbs-down or thumbs-up at the same time. Which means that triage becomes very
important. Closer to the bottom, understand what false negative people, false choices, and false
objectives are.

[05:06] So, this is a real question, but this is actually becoming very important. And that's what I'm
saying, the real question is, what is the most important way for me to get a user's thumbs-down?
There's a feedback, there's a comment, to actually become a much-appreciated, who requests, without
particular quality, and having a chance to do it. Right? Very thing for a second. Your success, does
not matter. The success of your success, is a great thing, and it helps you get to the bottom of the
puzzle.

[05:36] So, Your success on the mobility is proportional to how much drag you're going to need to do with
it. I know it's early morning, but you know it's bigger. Okay, let's get going. So how we did that?
We actually create not a ticket viewing system. We went for a continuous feedback loop when we use
the production signals and the flow of it, particularly like change. Directly into flow change with
AI in the head with human in the loop very fast.

[06:09] So here is how we approach the problem. First thing, we actually built an engine to capture the
signal. So all thumbs-down, all the errors, all the normal test specifications, everything. We have
production trace that is captured by LangSmith and then we use the triage. What the triage agent
did is, it raised the big data, and then the announcement from, it leverages the LangSmith MCP,
and the year MCP, and thus analysis the diagram from the code,

[06:40] and has it in flow from the unit. Let me give you one comment. I will comment on Manpush and his
team on the LangSmith MCP, because LangSmith MCP is absolutely impossible. It's not like an
API that people put, crash API, and then call this an MCP, and then the crack API you have, the
crackers that MCP, and then the AI get all the MCP because of that. LangSmith MCP is really
really good. I recommend it to you. So you leverage this,

[07:10] and everything becomes part of the AI agent, we then feed the code to the agent. Realize that we
have two separate tasks, how we explain the volume and the size. And the code agent does the
different diagnostics. It analyzes what the features would be. It plots the signal to belong to the
same domain. I don't want to over-heat the data, I want to keep the data for the same potential
part. Right? If I get closer, then I have one. So it does all the failing traces analysis, and then
the review,

[07:41] you have the human in the loop. The human code-reviews and edits that. The human is
responsible for the PRs, that relates to rights. Everything that is freeing is on my API. And then last
but not least, we merge the PR and we ship this, to the code. And that is a new eval in
itself. We don't know that, hey, we come in the day, we fix it, nice, better than before. No. It
feeds itself as far away as possible. And we look at this from two angles.

[08:12] A proactive and a reactive. The proactive is what I have on that other side, which is 30 of all.
Before we ship any agent, we have the data subject to have an experience. And by the way, I didn't
do this, but you can do this. You can see a lot of the knowledge is on what we said before. We have
this in the kind of in production and scale already. We have the data SMEs. They're by the way,
they're out to use the leg draft as I mentioned yesterday.

[08:42] We have the regression tests, ship them safely, which we are trying to prevent issues before they
potentially happen. But then, when you have this from the production, the customer may say, hey,
that answer doesn't, is not my line. It doesn't have an accuracy. It doesn't fit. So they have a
feedback. Thumbs-down is an error or a feedback. So we need to do that. That's where AI agents and
the code agents come in. And after an agent ships, the real design gets PRs and the address. So
this is the methodology that we use.

[09:13] It runs in production at scale, which is the key for us. I don't know how many users it is at a
given time. And my team is very small. So there is no way of ever going to come to balance an
agreement approach. As for the ritualist PR case, it's a very simple process. So I get a thumbs
down. As an example, for people feeling feedback. And a thumbs down can be just thumbs down as a
signal or with a comment saying, hey, I don't like this. It's not good. It doesn't serve my purpose.
It doesn't like to have this. Whatever it is. So we fetch that tracing. We read the code that
actually matched that trace.

[09:43] We put this on the classified and you have a dismissal to go to the computer. So what does it do? In
production, it pulls the traces from LangSmith for the last X amount of time. We can define what
the X amount of hours are. And then it filters the thumbs-down errors that are local and reads the
related code, trying to understand what the agent was trying to do at that given time for
that particular set of traces. What I mean by

[10:13] set of traces: the agent clusters the ones that are similar, because we don't want
to first open a bunch of them that could be correlated. Second, we need to decide whether
this is a false positive or not. The guy may feel that way, or he's a person
that doesn't understand that answer, or maybe it was an authentication error, and that person
shouldn't even be on that environment, which is still something we need to trace,

[10:43] and we only open it to them when there is something to fix. And how it is built, as I said, we use
the LangChain Deep Agents library for the economists, and it's purchased in the system, and we
decompose those into explicit steps with sub-agents that go
through coding and actually find the notification of the person. It's important. I'm talking about
thousands at the same time.

[11:14] We need an agent that actually correlates them, say, hey, you need three problems, 10 problems, not
a thousand problems. You have thousands of years, the whole thing gets out of control. Very quickly.
And let's use MCP for tracing, G-RAN can be used for each . So we have separation of state and
church, very clearly. So things like that, because if you start to mix domains, the complex is
starting to get messed up, and the AI that's helping you moves complex

[11:44] about this and moves the properties. And every action is traced, including the PI itself. So it
traces itself, so it traces what the trace is doing, and the same thing, and you see more of that
here. What made it work? There are four things that are worth to highlight. First, please, please,
please, you do want to get this in production and scale. Grease evolves like tests, not like
experiments. Let me repeat it again.

[12:15] Grease evolves like actual tests, as part of the pump, not experiments. Evolves with indirect, not
ACI, not the black emerges, not in the power from the energy of your boss, not all that. Two
component level analysis, and end-to-end analysis of the product. Both. Not, hey, I made a
component, it looks good, and end-to-end, it looks bad. Data sets are versioning inside the code,

[12:46] not on the dashboard. Dashboard is what the boss is looking. The agent don't care about the
dashboard. I'm making fun of it. You realize what I'm trying to say, because it's very realistic to
the real world. MCP is our integration layer. Agent calls to Ransomit, to Jira, to Eats, and runs,
they're all running through MCP. Imagine what they just said before, just having running in
production for some time already. Swapping the backend without touching the agent

[13:18] is only possible if you use MCP as the layer interface. It gives you that decoupling, but you
don't have to implement it for each player. The same MCP tools now become what powers your UI,
powers the writing charts, you saw them on the clip before, the TI station for the code. The last
product, this is an issue before, too many of them, only on racks, not on rigs. Everything like rig
spacings, opening zeros, the graph key, the movie press, all of that is autonomous.
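
The backend-swap idea above can be sketched in plain Python. This is not the MCP SDK or Cisco's code: `ToolLayer`, the two ticket backends, and `agent_open_ticket` are hypothetical names used only to show an agent depending on a tool name rather than on a specific backend.

```python
from typing import Callable, Dict


class ToolLayer:
    """Stand-in for an integration layer: named tools, swappable backends."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., str]] = {}

    def register(self, name: str, fn: Callable[..., str]) -> None:
        # Registering an existing name again swaps the backend behind it.
        self._tools[name] = fn

    def call(self, name: str, **kwargs: str) -> str:
        return self._tools[name](**kwargs)


def jira_like_create(title: str) -> str:
    """Hypothetical backend A (a Jira-style tracker)."""
    return f"JIRA-LIKE:{title}"


def other_tracker_create(title: str) -> str:
    """Hypothetical backend B (any other tracker)."""
    return f"OTHER:{title}"


def agent_open_ticket(tools: ToolLayer, summary: str) -> str:
    """Agent-side code: it knows the tool name, never the backend."""
    return tools.call("create_ticket", title=summary)
```

Calling `tools.register("create_ticket", other_tracker_create)` changes where tickets go while `agent_open_ticket` stays untouched, and a UI could call the same tool name.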

[13:49] And I mean autonomous for you, you give out more, you do that, send the boundaries, everything that
is right, the human is alone, so the goal is to leverage, not to regret. So, some last on block.
Auto-mobility is your new world. You're gonna ship agents, and you're gonna ship them at scale, you
can might smile, but the ability is gonna be a problem. First, embrace it. Second, close the limited
agents,

[14:20] so we have the feedback, and you want to fix it. If you think you can have PR, you can have
diagnostic, you can have PR, use the agents to help you do the work of the agent. Keep humans only
for the decision. Don't fall for the threat that humans orchestration based DNA kind of system, it's
not gonna scale. When you get successful and you get adoption, that's where you get the feedback, you
want that. And if you become a foreign agent with feedback, your adoption is gonna fall.

[14:51] Third, evals are part of the infrastructure. Not a side project. Every fix should generate a
new test, obviously if you take down with the switch compounds, right, then regression becomes a
permanent thing. It's not an exception. And as you think about using agents for this, keep the agent
of your life. It means the data subject matter experts, which aligns to the term of that I was
talking before, keep agents with the routing guarantees,

[15:21] which means the subject matter experts, subject matter experts on a particular domain, on my case,
we knew, when your case is going to be whatever it was in your business, they know better what the
steps should be. The coding guys that are using the API, I know who. So these, the subject matter
experts, the ones that define the routing, that is embedded on the system, that has the full density
trigger code, and they interface with the identity shots, as was said before. So, evals for
renewals. And you can find out,

[15:55] let's go for the support. So, in the support, we have a very interesting thing. Every customer of
Cisco, that has a support program, has automatically access of all that we're going to show. This is
actually an anonymized scenario, that we ran last week. This is a network for a customer of
enterprise, and he has footprints on the screen, and he has a network of customers, and he has a
network of customers around the globe.

[16:25] So, we went and did an analysis of all the potential configuration assessment of every device on the
network. It's not big, but it's not big. Think about this. It's a very small network for our
customers, because it's a big amount of storage. A storage may have devices, connectivity, re-access
points. This may have a lot more systems. If you think about that, 5, 7 devices, 113 are probably
about the same. Now that we, for being an international, we're a clothing company, we're a
restaurant, we're a rally,

[16:57] we could be one. So, when you have 1,176 security potential findings, everyone, if you look at your
hands, you have 10 people, right? Some of them may be in packs. We have heard about the fat-feet-
tree configuration assessment. We get that a lot. A lot. So, when we did an assessment, this is one
customer configuration assessment. This is real. I just got to remove the name of the configuration.

[17:28] So, we had this line. Then the question is, hey, where do I even start? So, we then provided a
customer this, which is part of the support analysis that we have in assessment. So, we say, hey,
out of this environment that we have, you have some positive stuff. You have this product line that
is actually compliance and security. But it has some areas that you actually look at it. So, what
happened? You already went through the whole configuration environment

[17:59] for the customer globally. You already used an AI to analyze it and for the best for that customer.
We already find out the potential for the business characteristics. You already have the surface of
the package and said, hey, Google, here is your environment. Then we have an AI system. We explain
to the guy, hey, we can help you. And the feedback for the customer, hey, what should I focus on?
So, the gentleman is like, hey, what should I focus on? So, think about it. There is no
configuration.

[18:31] There is no assessment. There is no integration. None of those words that Google approves. The
customer just said, what should I focus on? This is like as big as it can get. It's like names. The
right comes to you and says, well, can we get out? You probably should say yes. You probably should
say no. You say yes. You are a monster because yesterday you said that I'm not good enough. You just
say no, you are a monster now. So, it's like here, this is a very high level content.

[19:01] I have some that people say, go, and, no. So, for an AI engine, you receive a word that, no, what
are you doing? Right? So, we realize that behind the scenes, you already have over 350 potential
high critical severity stuff that is going to hit this customer. So, we use the context of the
customer and the real time, plus the best threat,

[19:31] plus the story note, plus the pet feeders, plus all the customer cases and tickets that were opened
before, and we surface the most high risk scenario that we find. How do we do that? We do the
semantic recovery. So, the semantic recovery, what it does, is that we find ways of content which
specialized agent even a scenario that we don't have enough context on the prompt. It's exactly the
opposite that we had yesterday. Yesterday, I had a very detailed prompt

[20:03] that I need to create a planner which is hierarchical. Here, I don't have enough context on the
prompt. I need to actually have the contextualized of the real time environment, plus the story
note, to define how I approach across specialized agent. I have agent's post, creation for security,
for troubleshooting, for assets, for inventory, all of that. Who is going to hit? You run this in
file by file. And you must run in file to create the diagram. Let me explain why. Here, I'm actually
going to proactive fashion.

[20:35] I'm sharing with the customer what they have to do for it. There are some situations that they are
in the middle of the fire with their allies. And people get really heated. So, your son off
something, and then the model comes in, and you look back and say, I don't believe that the
technology that the media has on the video would help solve that problem. And I'm thinking, what if
I do? It's cursing us. Do you understand that? So, and, but it leads you to that diagram. Before
that, you gotta see how that makes the problem. And, you have semantic route group.

[21:06] This is semantic route people. And the final location. But we, again, at each step, each tracing.
So, it goes to the gardening, raiders, the agent selection, execution, through the landscape. So,
the current state feeds the routing on how the semantic approach. And every agent, every specific
specialized agent has a data set. And as you start to learn, the router now can update us on what
the context is in order to see the other route. So, this happens automatically.
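
The routing idea in this passage, where a vague question is combined with customer context before a specialized agent is chosen, can be sketched as below. It is an illustration under stated assumptions: the agent names come from the talk, but the keyword profiles in `AGENT_PROFILES` are invented, and bag-of-words cosine similarity is a stand-in for whatever semantic matching the production system uses.

```python
import math
import re
from collections import Counter
from typing import Optional

# Invented keyword profiles; only the four agent names appear in the talk.
AGENT_PROFILES = {
    "security": "vulnerability advisory severity critical exposure patch",
    "troubleshooting": "outage error crash latency packet loss",
    "assets": "contract coverage serial license end of life",
    "inventory": "device model software version count site",
}


def _vectorize(text: str) -> Counter:
    """Lowercased word counts; a crude substitute for an embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def route(query: str, context: str = "") -> Optional[str]:
    """Pick the best-matching agent, or None when nothing matches at all."""
    vector = _vectorize(query + " " + context)
    scores = {
        name: _cosine(vector, _vectorize(profile))
        for name, profile in AGENT_PROFILES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

With the bare question "What should I focus on?" nothing matches and `route` returns `None`; adding context such as the assessment findings is what makes a route possible, which mirrors the point made in the talk.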

[21:37] We have, with this goal, for multi-thousands at the same time. Because they cannot be judging what
the AI is gonna argue that. You know, it needs to be done at scale. So, we have this running and the
fast algorithm digress. And we have human-driven recommendations that are multi-priority and the
ones that are long and super. And we run this with LangSmith over and over again in real time as the
system self-runs itself. And, last but not least,

[22:07] we, both of, it, will serve. Because LangSmith becomes super critical for our business. So, we
actually see this as an incredible and you observe, and I just want to observe how these tools. So,
you get an idea. We have 153 days request from print. We add the time that we just need the
screenshots which are very slow and 100% has a lot of power.

[22:37] With that said, hopefully, we helped you as you invited your thanks for the book. We are all down in
the room. Thank you very, very much. Thank you. Thank you so much, Carlos. I think I need to pop the
up a little bit. Now, I'll take a short break for the next 25 minutes. Grab a snack and a drink and
check out our sponsors, play a game, check out the server we maintain in HQ.

[23:07] We'll see you back here at 1015 for the next 15 minutes. Thank you.
