Building AI for Healthcare – Janie Lee (Abridge)

Speaker(s): Janie Lee (VP Product, Abridge)
Session: Interrupt 2026 · Day 2 (May 14) · ~2:00 PM PT
Source: in-person audio recording, transcribed locally with Whisper large-v3.

Summary

Janie Lee, VP Product at Abridge, describes building AI for healthcare on a clinical intelligence platform that turns each patient-clinician conversation into a clinically useful note. Abridge has scaled to 250 of the largest US health systems (Mayo, Kaiser, Johns Hopkins) and processes over 100 million conversations annually, using de-identified data (with explicit customer data rights) plus clinician edits and queries to continuously improve and evaluate models. She stresses that in this high-stakes, HIPAA-regulated space the bar is incredibly high — 'trust is earned in drops but lost in buckets' — so velocity can't come at the expense of quality. Two case studies are presented: the core notes product, where migrating to LangGraph and using LangSmith for evals plus an automated judge-calibration (APO) framework cut release cycles from one-to-two months down to days; and an agent system (the Abridge assistant) built on top, a unified composable agent persisting before, during, and after a visit. Her takeaways: you don't have to sacrifice velocity for quality if you invest in eval infrastructure, and healthcare needs our best builders.

Key Points

  • Abridge is a clinical intelligence platform that captures each patient-clinician conversation (historically unstructured data) and turns it into a clinically useful chart/note submitted to the electronic health record; healthcare is ~20% of US GDP and much of it is downstream of that conversation
  • Scaled to 250 of the largest US health systems (Mayo, Kaiser, Johns Hopkins) and processes over 100 million conversations on the platform annually, with data rights received explicitly from customers to de-identify and improve models
  • Notes are far harder than meeting summaries because the note becomes the bill and part of the patient's longitudinal record; failure modes include mis-adjudication (assigning a statement/symptom/decision to the wrong person) and confabulation/hallucination (e.g., ordering an un-prescribed medication or wrong dosage) with patient-safety and legal consequences
  • Release process improvements: migrated to LangGraph (replacing too many custom tools) for unified datasets, annotation, tracing, and validation with self-hosting and access controls; used LangSmith as the eval platform — cutting releases from one-to-two months down to days
  • Eval methodology: prioritize model issues by prevalence and severity, group issues into categories (accuracy, compliance, style, completeness), build calibrated LLM judges, then move to an APO framework that auto-generates a calibrated judge from the annotation guide and labeled encounters; collect free-text explanations from annotators to verify data quality
  • Uses both reference-free judges (generalize across encounters, run online continuously to monitor note quality) and reference-based judges (specialty-level specificity); offline evals first, then backtesting against historical encounters, then silent A/B testing with a trusting first 10-15% of customers
  • Second product is a unified 'air conditioning'-style agent that persists across before/during/after a visit with composable capabilities (searching patient context across systems of record, editing notes, placing an order for a medication or MRI, clinical decision support against validated medical evidence)
  • Agent design principles: ambient like air conditioning, clinician agency/control, and responsiveness (learning a clinician's editing style); agent eval complexity rose by an order of magnitude (clinical quality, safety, boundary/adversarial testing, tool selection) versus single-step note models
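
The prioritization and judge-calibration steps in the points above could be sketched roughly as follows. This is an illustration only, not Abridge's implementation: the issue names, the 1-to-5 severity scale, the prevalence numbers, and the use of a simple agreement rate as the calibration check are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ModelIssue:
    name: str
    category: str      # e.g. accuracy, compliance, style, completeness
    prevalence: float  # fraction of sampled notes showing the issue (0..1)
    severity: int      # hypothetical scale: 1 (cosmetic) .. 5 (patient safety)


def prioritize(issues):
    """Rank issues by prevalence x severity, highest score first."""
    return sorted(issues, key=lambda i: i.prevalence * i.severity, reverse=True)


def agreement_rate(judge_labels, clinician_labels):
    """Fraction of encounters where the LLM judge matches the clinician label."""
    if len(judge_labels) != len(clinician_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == c for j, c in zip(judge_labels, clinician_labels))
    return matches / len(judge_labels)


# Hypothetical issue list; the values are made up for illustration.
issues = [
    ModelIssue("wrong dosage", "accuracy", prevalence=0.02, severity=5),
    ModelIssue("inconsistent headings", "style", prevalence=0.03, severity=1),
    ModelIssue("statement attributed to wrong speaker", "accuracy",
               prevalence=0.04, severity=4),
]
ranked = [issue.name for issue in prioritize(issues)]

# Hypothetical judge vs. clinician labels on four labeled encounters.
judge = ["pass", "fail", "pass", "pass"]
clinician = ["pass", "fail", "fail", "pass"]
rate = agreement_rate(judge, clinician)
```

In this sketch a judge would be treated as calibrated only once `rate` clears a threshold chosen per category; a rare but severe issue such as a wrong dosage can still outrank a more common cosmetic one.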

Notable Quotes

Something that we like to say internally is trust is earned in drops but lost in buckets.

We have over 100 million conversations that happen on our platform on an annual basis.

you don't have to sacrifice on velocity and or quality

Full Transcript

Timestamped transcript (auto-generated; lightly cleaned)
[00:00] and we've been in really high-tech and highly regulated industries, and Shuman has a lot of support.
I haven't heard. So, just to give a little bit of context about Abridge, we are a clinical
intelligence platform. We started by building tools for clinicians, solving as many of their
clerical and clinical workflows as possible, really starting at each patient-clinician conversation.
There are over two million conversations that happen every single year in the U.S., and we think
that the

[00:31] conversations are really important. So, the first conversation is probably one of the most important
workflows in healthcare. It's obviously where our patients and clinicians and doctors are giving and
receiving care, but if you think about the 20% of our GDP that goes towards healthcare, so much of
it is downstream or adjoining of that conversation, whether it's the notes, the billing codes, the
claims, the prior authorizations, the medications. And where we're at today, the first product that we
built,

[01:01] which is the Abridge, is the Abridge. We built the infra, the product, and the customer value to
capture this conversation, which has historically been fully unstructured data, and to turn it into a
clinically useful chart or note for doctors. And in a world today where there's a massive doctor
shortage in the country, especially because of COVID, and doctors are burning out, they spend 10
to 20 hours after hours during what they call pajama time on doctors and patients, we've done a
really good job on

[01:31] delivering on the ROI of saving time. And now that we have all of this data, have gotten into this
wedge of the most important workflow, the new products that we've been building are really to help
us deliver on our second and third acts, which are how do we help health systems save and make more money, and
given clinicians are opening our apps before, during, and after a patient walks in the room, how
might we actually save lives? And we say that with all humility. So with that, thank you. So with
that, we've scaled quickly.

[02:03] We're at 250 of the largest health systems in the country, the Mayos, the Kaisers, the Johns
Hopkins of the world. And we also put ourselves in the center of healthcare's data layer. The most
important signal in healthcare has historically, like I said, been unstructured and uncaptured. And
because we now have this scale and have all of this data, we've managed to get all of these
unparalleled data network effects. We have over 100 million conversations that happen on our
platform on an annual basis.

[02:34] With the data rights that we've received explicitly from our customers, we're able to de-identify it
and we're learning constantly, whether it's edits people are making to our notes, queries people are
asking with our Abridge AI agent, and using that to improve our models and continuously evaluate with
our LLM and human evaluations and continue to deploy better and better products. So that it
compounds over time. That's all I've got about the company. I'm excited to actually get into the
content, which is what does it

[03:06] mean to actually build AI agents in these high-stakes, high-trust environments? There are a few
things that shape all of our product, business, and technical decisions. I think the first one is
the bar for what we can ship is just incredibly high. And so velocity can't come at the expense of
quality. The reason for that is the space that we have. The stakes are quite high: if we incorrectly
label a medication or a dosage, that has real patient safety implications.

[03:37] And HIPAA and the way that we use and take care of PHI data have to shape every decision that we
make. And finally, especially in enterprise healthcare, buyers are choosing who they'll even talk to
based on security and trust. Something that we like to say internally is trust is earned in drops
but lost in buckets. And so we know that any single model release that has massively negative
implications could lose the trust we have in customers. We'll walk through two case studies to show
how we build and ship

[04:10] quickly despite all of these things that we're thinking about constantly. We'll start first with our
core notes product and then move on to our agent system, which is what we built on top. So the notes
we create for doctors are the core product. It's how we landed at the start. We're going to use the
scale that we did. Basic trial of all this is a doctor will open our app. They'll record the
conversation they have with the patient, with consent, and we'll turn that into a clinically useful
chart or note

[04:41] that then gets submitted into the electronic health record. Notes are really hard. It's not just a
summary of a meeting or a conversation. And the reason for that is the note that we create
ultimately becomes the bill, the thing that the hospital and the doctor get compensated for. And that
is the thing that the doctor sees is extremely important. And all of this also becomes part of a
patient's longitudinal record and quality matters in ways that I think a lot of ML products don't
experience. A few examples is we have mis-adjudication in our notes.

[05:13] So a note that might assign a statement or a symptom or a decision to the wrong person. These have
really, really large medically and legally significant ramifications. For example, a patient saying
that they have a symptom is very different than a doctor explicitly diagnosing it and billing for
it. And if you're diagnosing and billing the wrong thing, there will be severe consequences for
upcoding or downcoding. Similarly, confabulation or hallucinations,

[05:44] if we are putting orders for a medication that wasn't actually prescribed or putting the wrong
dosage, huge, huge patient safety issues. As we've had some of these issues, in internal testing,
the biggest pain in trying to fix these was our release process, mainly getting the confidence we
needed to actually ship changes. Prior to some of the changes being made in YLAGETTO, releases
would take us anywhere from one to two months. And we had clinicians annotating data, third-party

[06:16] labelers, and a lot of testing tools that didn't talk to each other. And we had a generic approach
across the platforms we chose, the actual ways in which we go fund, and then finally, just good old
human operational process improvements. From a platform perspective, we migrated to LangGraph. We were
using too many of our own custom tools, and as we were scaling, they weren't serving all of our use
cases.

[06:46] And LangGraph gave us unified data sets, annotation, tracing, and validation in one system. And from
an enterprise security perspective, we were allowed to self-host, and also have the access controls
and auditability needed. And we used LANGSMIT for our eval platform. It gave us all the
customization we needed from a UI perspective, and I think across altitudes, whether it was
debugging a single encounter or seeing across all model releases performance,

[07:18] we could do it in one place. From an actual improvement perspective, we generally have the gift of a
lot of feedback, given how important this product is to clinicians and workflows. And across all of
the clinician and user feedback, both in-product and out-of-product, we had a full list of model
issues we knew we wanted to improve. We would prioritize it by both prevalence, so how often was it
happening, as well as the severity of the issue.

[07:49] If we got it wrong, were the stakes really high? And with that, we organized all of the models, and
we put all of these model issues into an aggregate set of categories, so things like accuracy,
compliance, style, completeness. And we started to go one by one and create judges against these
model issues. Originally, it was pretty cumbersome. You would have a clinician who knew the subject
matter well, design the annotation guide, label encounters, and then someone would actually iterate
on prompts to get the judge

[08:22] to match the clinician labels. Really slow. We moved to an APO framework, and it now takes the
annotation guide and the labeled encounters and generates the calibrated judge automatically to
reduce turnaround times. The things that always keep me up at night are just making sure that the
underlying data is actually correct and high quality. This is the basis of everything that we do. A
few things that are very helpful to us, we make sure that the annotations are actually matching

[08:52] the expected outputs of our LLM judges. We also like to collect free-text explanations from our
annotators. Obviously, when there are issues or inconsistencies, it allows us to debug. But more
importantly, when things aren't broken, we ask for it because it allows us to check that the people
who are annotating and evaluating are actually paying attention, and so if people are just clicking
through, this is a good way to really understand the quality of our data. And then finally, we
always think about what is the actual expertise we need up front.

[09:26] There are certain evaluation or data sets where we need a board certified physician. Sometimes you
might need a specialist, but laying that out very, very explicitly has really helped. Another design
decision that's worth calling out is how we use both reference-free and reference-based judges to
make sure that we have a blended approach and set of layers that complement each other. We have
reference-free judges. They generalize across all encounters, and they'll look at the note and the
conversation and score it,

[10:00] and it doesn't need a gold standard note to compare against. And that means we can run them offline,
but more importantly, online at all times to continuously monitor the quality of our notes. They
definitely have limits. There's a lot of encounter-specific nuance and specificity that gets missed,
and notes are just like a book. It's so inherently subjective. So we supplement it with
reference-based judges that get as specific as the specialty level, and with these two approaches,

[10:30] we're able to have really high confidence in what we're serving. And even with these evals, I say
that we can't push the rug just because of the stakes that we're operating in. Offline evals happen
first. They now happen very, very quickly with the process I just described. They're really easy,
but you still will backtest against historical encounters also very quick. And I think something
that might be more unique to the industry that we're operating in is the way in which we do A/B
testing.

[11:01] When I first joined Abridge, I never thought A/B testing would be possible in healthcare at this
enterprise level. Because we've earned the trust, we've gotten explicit buy-in from a set of our
partners and customers who said, given what you're doing upstream, I'm okay with a silent release
without knowing about it. And we're willing to be, you know, in your first 10 to 15% of customers,
willing to innovate with you. Definitely a privilege and a lot of trust that we need to keep, but it
allows us to once again get more signal.

[11:31] Are people editing the note? What are the ratings they're leaving? Are we getting really, really
loud qualitative feedback? And this is usually signal we'll get instantly if it is so. And that
finally gives us the confidence to actually go through a full release and we'll do our continuous
online monitoring. Before this would take us one to two months, and now we can get through this in
days and get a nice square model and a new model update, and we feel really good about getting a
signal that we need.

[12:01] And so really exciting. We've seen improvements over the past few quarters in making sure that we
can continue to help clients and more importantly, our users are telling us when we ship models that
they feel it instantly. Now I'll move on to the second case study. Notes are the foundation of what
we do, and we built on top a more generic Abridge assistant. And generally we want to help
clinicians make the best decisions they can

[12:33] at the right moment with the right context. But the reality is most tools that clinicians use today
fail on adoption and aren't really used. There's a lot of constraints. First, those doctors are very
busy. They're going back to back, they're going to get an appointment. They don't have the time to
pull something up, look up the context and location. A lot of existing tools also make them choose
between security or no security. And with these fragmented workflows,

[13:05] what we did was build a unified agent that persists across the workflows. So before the visit,
during that visit, after the visit, there's a uniform UX and agent that they can work with. And we
decided to hold a lot of time to put together these separate features and capabilities into one
agent that has really composable capabilities. Examples are capabilities like being able to search
patient context across multiple systems of record, or actually doing actions like editing your
notes,

[13:37] or placing a medication or an MRI. And then finally, enabling something like clinical decision
support. So if a doctor has a question, can we support them via really searching against literature
that comes from validated medical evidence? And instead of these multiple features living in
different places, having it as a singular agent allows us to call the right tool at the right time,
but also deliver a really coherent experience. From a product perspective,

[14:09] these are some of the principles that we used in building this to make sure that we weren't
overwhelming our doctors and clinicians. The first is, we like to say that we try to be like air
conditioning. We want to be on in the background making things better, but not have to actively be
present unless we think there's something really important. The second is agency. In healthcare
especially, it is really, really important that clinicians know that they're in control. We might be
able to suggest things,

[14:40] but ultimately we want clinicians to use all the data at hand and make the right decision. The third
is responsiveness. This can come in a lot of different forms. If a clinician is explicitly giving us
the best of feedback, we want to make sure that our products and systems are actively listening and
improving quickly. But because of where we operate, we also can implicitly learn a lot to make the
product better. So this might be, if a clinician is constantly making the same edits on a note,

[15:13] can we learn their style to automatically apply this by default? Or given the context that we have about
the patients the clinician is seeing, how might we suggest through this agent things that they might
ask at certain moments of the visit with the patient to deliver better care? And as we build this, I
think similar underlying machinery from an evals perspective, but the complexity went up an order of
magnitude just because we're now not using single-step models.

[15:45] Some of the eval criteria that we thought about, clinical quality, is it actually accurate? Safety,
are we making sure that this agent isn't recommending things that would harm the patient?
Boundary and adversarial testing, what is happening at the edges? Making sure that the agent isn't
answering questions that it wasn't trained to answer. And
finally, from a tool selection perspective, is the agent picking the right tool and behaving the way
that the clinician would respond?

[16:18] So with that, I think the two takeaways I hope to leave folks with is, first, you don't have to
sacrifice on velocity and or quality. I think if you invest in the right eval infrastructure up
front and are really, really specific on the things at which your product needs to get right, you
can do both quickly and probably know the outcomes of your product even more quickly. And second,
maybe more on a personal note, hopefully this has excited some folks

[16:48] to potentially think about building in healthcare. I think from an impact perspective, healthcare is
probably where we need some of our best builders, and there's probably no more universal or current
problem. Given 20% of our GDP is going towards here, I also think this is where the largest
businesses will get created. And given the high stakes, I think some of the hardest AI challenges
will also need to be solved here. So thank you so much for your time. Enjoy the rest of your day.
It's been a pleasure.