It starts with iteratively refining the right combination of prompts, model choice, and model parameters, then building a continuous feedback loop in production so that the experience improves with each interaction.
Tactically, this means pairing offline evaluations (evals) with real-world signal: experimenting, tracking analytics, and refining the system as users interact with it.
Through our work on Statsig's AI support agent Statbot, we’ve seen firsthand how powerful it is to combine rigorous offline validation with insights from real customers.
From the beginning, Statsig treated support as a primary feedback channel for product development. Regardless of role—founder, senior engineer, product manager—when a customer hit a problem, it became everyone's problem.
That philosophy has stayed with us as we've grown. To this day, Statsig's CEO, Vijaye, still jumps in to debug customer issues and surface insights that shape our roadmap.
As our customer base and product surface area expanded, the “everyone does support” model needed a way to scale without losing the tight feedback loop that helped us stay customer obsessed in the first place.
About two years ago, we introduced Statbot. What started as a simple automation has evolved into an agentic workflow, and will soon be a full multi-agent system. Throughout its evolution, the scope has remained intentionally focused: agents should augment how we interact with customers and stakeholders, not replace those relationships entirely.

At the same time, the environment that Statbot operates in has changed significantly. We've shipped new features, removed constraints, expanded our customer base, and seen use cases grow more diverse. A support agent operating in this environment must evolve in parallel, continuously evaluating and refining prompts, tools, and behaviors to handle both legacy and new inquiries as the product evolves.
Improving AI agents is fundamentally a data analysis and evaluation problem.
Offline Evals provide repeatable guardrails and regression protection before changes reach users.
Online Evals reveal how the system behaves under real customer conditions, where missing context, evolving product behavior, and unexpected edge cases introduce additional complexity.
A/B Experimentation closes the gap between what evals say is “good” and what actually drives customer and business impact (the two often don't map 1:1 as neatly as one might expect).
Continuous improvement comes from combining these three signals into a single loop: detect issues, cluster failures, fix underlying knowledge or behavior gaps, refine graders, re-evaluate, and then roll out candidates through experiments. We run this loop weekly on Statsig’s platform to operate an AI-native support model at scale.

Every message sent to Statbot first passes through a classifier agent. This agent categorizes the message into product areas (Experimentation, SDKs, Analytics, etc.) and determines the appropriate action: ignore, respond, or escalate. Correct classification ensures customers are routed efficiently and that subject-matter experts can resolve issues quickly when human intervention is needed.
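To make the shape of this stage concrete, here is a minimal sketch of a triage classifier built as a single structured LLM call. This is not Statbot's actual implementation: the categories, prompt, and the injected `call_llm` client are placeholders for illustration.

```python
import json

PRODUCT_AREAS = ["Experimentation", "SDKs", "Analytics", "Other"]
ACTIONS = ["ignore", "respond", "escalate"]

TRIAGE_PROMPT = """You are a support triage classifier.
Classify the customer message into exactly one product area from {areas}
and choose exactly one action from {actions}.
Reply with JSON only: {{"area": "...", "action": "...", "reason": "..."}}

Customer message:
{message}
"""

def triage(message: str, call_llm) -> dict:
    """Classify a support message. `call_llm` is any text-in/text-out LLM client."""
    prompt = TRIAGE_PROMPT.format(areas=PRODUCT_AREAS, actions=ACTIONS, message=message)
    try:
        result = json.loads(call_llm(prompt))
    except json.JSONDecodeError:
        result = {}

    # Guardrail: if the model returns something unexpected, route to a human.
    if result.get("area") not in PRODUCT_AREAS or result.get("action") not in ACTIONS:
        return {"area": "Other", "action": "escalate", "reason": "unparseable classification"}
    return result
```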
Before any Statbot update goes live, we run a suite of offline evals designed to catch regressions, guardrail failures, and unintended behavior changes. These repeatable tests allow us to move quickly while maintaining reliability and consistency in customer-facing behavior.

To evaluate both classification and response quality, we started by uploading a curated dataset of real-world inputs: customer questions, validated answers, correct categorizations, and supporting metadata. This dataset draws from anonymized historical support tickets, known customer pain points, and key Statsig product concepts. Topics range from basic SDK misconfigurations to complex experimentation failures. As Statbot runs in production, we continuously expand this dataset through trace-driven error analysis, ensuring our offline tests stay representative of real usage.
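The exact schema matters less than pairing each anonymized input with a validated answer and correct classification to grade against. A couple of illustrative rows (not Statbot's real schema) might look like this:

```python
# Illustrative rows (not Statbot's real schema): an anonymized input, a validated
# answer, the correct classification, and metadata for slicing eval results.
eval_dataset = [
    {
        "input": "Our JavaScript SDK always returns the default flag value in production.",
        "expected_answer": "Check that the client key matches the production environment "
                           "and that initialization completes before gates are evaluated.",
        "expected_category": "SDKs",
        "expected_action": "respond",
        "source": "anonymized support ticket",
    },
    {
        "input": "My experiment has been running for two weeks but shows no results.",
        "expected_answer": "Verify that exposure events are being logged and that the "
                           "metric source is connected to the experiment.",
        "expected_category": "Experimentation",
        "expected_action": "respond",
        "source": "known customer pain point",
    },
]
```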
During a recent update to Statbot’s triage classifier, offline evals made it easy for me, as a PM, to iterate without touching production code. I made changes to prompt wording, model selection, and parameters and was able to measure impact against cosine similarity metrics, custom Python graders, and LLM-as-a-judge evaluations.
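To give a flavor of what those graders can look like, here are two minimal examples in Python: an embedding-based cosine similarity grader and an exact-match grader for the classifier. The `embed` function is a placeholder for whatever embedding model you use; this is a sketch, not our production grading code.

```python
import numpy as np

def cosine_similarity_grader(expected: str, actual: str, embed) -> float:
    """Score how semantically close the agent's answer is to the validated answer.
    `embed` is a placeholder for any text -> vector embedding function."""
    a = np.asarray(embed(expected), dtype=float)
    b = np.asarray(embed(actual), dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def category_match_grader(expected_category: str, predicted_category: str) -> float:
    """Exact-match grader for the triage classifier: 1.0 if correct, otherwise 0.0."""
    return 1.0 if expected_category == predicted_category else 0.0
```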

After seeing a 34% improvement in classification accuracy from our baseline prompt to the prod-triager prompt, the team was comfortable moving toward deployment. Because Statbot already runs on Statsig’s AI SDK, promotion was as simple as marking the prompt version as Live in the console, immediately routing online traffic to the new configuration.

Offline testing gives us the confidence to ship, but it can’t capture the full complexity of real-world usage.
Once a message is triaged, Statbot decides whether to retrieve knowledge, invoke tools, or escalate to a human. To maintain consistency between development and production, we reuse a subset of the same graders in both environments. These graders score each production interaction on key dimensions we've specified, like helpfulness, conciseness, and escalation behavior—ensuring the signals we optimize offline translate directly to live performance.
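As one example of what such a shared grader can look like, here is a sketch of an LLM-as-a-judge rubric grader that scores those dimensions and can run against both offline dataset rows and live interactions. The prompt and the injected `call_llm` client are placeholders, not our production rubric.

```python
import json

JUDGE_PROMPT = """Score the support reply below on each dimension from 1 (poor) to 5 (excellent).
Dimensions: helpfulness, conciseness, escalation_behavior.
Reply with JSON only, e.g. {{"helpfulness": 4, "conciseness": 5, "escalation_behavior": 3}}.

Customer message:
{message}

Agent reply:
{reply}
"""

def llm_judge_grader(message: str, reply: str, call_llm) -> dict:
    """Rubric-based LLM-as-a-judge grader. The same function can score offline
    dataset rows and live production interactions."""
    scores = json.loads(call_llm(JUDGE_PROMPT.format(message=message, reply=reply)))
    return {dim: int(scores[dim]) for dim in ("helpfulness", "conciseness", "escalation_behavior")}
```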
Under the hood, we use OTEL and our AI SDK to make everything Statbot does observable. Each LLM call, retrieval step, tool invocation, and grader evaluation emits detailed spans, producing an end-to-end trace from the initial customer message to final resolution.
These traces are critical for debugging and improvement. When a customer interaction goes wrong, we can inspect exactly which documents were retrieved, which tools were used, how the agent reasoned, and where the failure occurred.
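For illustration, a stripped-down version of that instrumentation using the OpenTelemetry Python API might look like the sketch below. It assumes the `opentelemetry-api` package and uses stand-in values where Statbot would call its real components; the richer spans our AI SDK emits are not shown here.

```python
from opentelemetry import trace

tracer = trace.get_tracer("support-agent")

def answer_customer(message: str) -> str:
    # One end-to-end trace per customer message, with a child span per step.
    with tracer.start_as_current_span("handle_support_message") as root:
        root.set_attribute("support.message_chars", len(message))

        with tracer.start_as_current_span("triage") as span:
            category = "SDKs"  # stand-in for the classifier call
            span.set_attribute("triage.category", category)

        with tracer.start_as_current_span("retrieve_docs") as span:
            docs = ["sdk-init-guide"]  # stand-in for knowledge base retrieval
            span.set_attribute("retrieval.doc_count", len(docs))

        with tracer.start_as_current_span("llm_generate") as span:
            reply = "Try re-checking which client key you initialized with."  # stand-in for the LLM call
            span.set_attribute("llm.output_chars", len(reply))

        return reply
```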
Traces also make human-in-the-loop analysis scalable. Support engineers and PMs routinely review low-scoring conversations by drilling into their traces. This lets us identify recurring patterns, such as knowledge base gaps or incorrect tool selection, validate the correct answer, anonymize the example, and add it back into the offline evals dataset to prevent future regressions.
For example, we identified a knowledge base retrieval failure affecting customers checking Statsig’s system health and applied the relevant corrections.
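A simplified sketch of how low-scoring, human-validated interactions can flow back into the offline dataset is shown below. The field names and the rough anonymization are illustrative, not our production pipeline.

```python
import re

def anonymize(text: str) -> str:
    """Very rough scrubbing for the sketch; real anonymization is more thorough."""
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<email>", text)
    return re.sub(r"\b(?:secret|client)-[A-Za-z0-9]+\b", "<api-key>", text)

def grow_offline_dataset(traced_interactions, dataset, score_threshold=0.6):
    """Append validated, anonymized low-scoring production examples to the offline dataset."""
    for interaction in traced_interactions:
        if interaction["grader_score"] >= score_threshold:
            continue  # only review interactions that graded poorly
        if not interaction.get("validated_answer"):
            continue  # wait until a human has validated the correct answer
        dataset.append({
            "input": anonymize(interaction["question"]),
            "expected_answer": interaction["validated_answer"],
            "expected_category": interaction["category"],
            "source": "trace-driven error analysis",
        })
    return dataset
```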

Online evals enrich datasets and sharpen grader rubrics, subsequent offline evals improve production behavior, and production behavior feeds back into the evals again, creating a continuous improvement cycle.
Even with strong eval signals, though, multiple candidates often look “good enough.” That's where experimentation becomes essential.
Offline and online evals tell us whether a new release candidate is scoring well, but they don’t always tell us which behavior is “better” for customers.
After using evals to refine grading criteria, datasets, and agent logic, we typically end up with multiple release candidates that all pass offline and online evals. These might be prompt variations, different agent architectures, or more advanced multi-agent workflows. At this point, evaluation alone can't tell us what to ship.
Controlled experimentation does.
We run A/B or A/B/n experiments with clear control and test traffic, measuring outcomes such as customer satisfaction, resolution speed, escalation rate, and downstream product usage. These results then feed back into how we define “better” in our evals, tightening the loop between evals, experiments, and real-world impact.
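As a rough sketch of how a release candidate can be gated behind an experiment, the snippet below follows the Statsig Python server SDK's experiment pattern. The experiment and parameter names are made up, and the exact SDK surface may differ from what's shown, so treat this as illustrative rather than a drop-in integration.

```python
from statsig import statsig, StatsigUser

statsig.initialize("server-secret-key")  # placeholder key, not a real secret

def pick_statbot_config(user_id: str) -> dict:
    """Choose which release candidate serves this customer based on experiment assignment.
    'statbot_prompt_rollout' and its parameter names are hypothetical."""
    user = StatsigUser(user_id)
    experiment = statsig.get_experiment(user, "statbot_prompt_rollout")
    return {
        "prompt_version": experiment.get("prompt_version", "baseline"),
        "model": experiment.get("model", "default-model"),
    }
```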
Last year, we ran a fun, short-lived experiment: we changed Statbot's tone to reply as "Dogbot" for a subset of traffic. Evals for correctness and helpfulness all passed without regressions. Customers' initial requests were resolved successfully. However, experimentation revealed that a secondary business metric (follow-up support inquiries) was trending in the wrong direction. While customers were amused, the data showed they were concerned that Dogbot (aka Statbot) had been let loose.

This is how evals and experimentation work together: evals gate safety and correctness; experiments select for optimal user behavior and downstream business outcomes. Observability and analytics ensure those behaviors stay aligned as Statbot, our product, and customer expectations evolve.
Building production AI agents like Statbot isn’t a one-time project. To iterate reliably and sustainably at scale, you can’t rely on ad-hoc patches or occasional fixes. You need a continuous evaluation and improvement loop that treats offline testing, online behavior, and experimentation as a single system.
Statsig brings offline evals, online evals, and A/B experimentation together in one product development platform. PMs, engineers, and AI operators work from the same data, metrics, and source of truth. Offline evals gate safety and correctness. Online evals surface real-world behavior and human feedback. Experiments determine which behaviors actually drive customer and business outcomes. With observability, metrics, and configuration in one place, teams stay aligned on what’s running, what’s being tested, and what’s driving statistically significant impact.
If you’re building production AI agents, we believe a practical starting loop looks like this:
Define the user problems and business outcomes you care about.
Prepare clean data sources and create a small but high-quality evaluation dataset.
Establish a baseline with offline evals and iterate to production quality.
Monitor live outputs, traces, and human feedback in production to drive an evaluation and iteration loop for new graders, datasets, and release candidates.
Promote qualified candidates into experiments and ship based on real customer and business outcomes.
This loop is how Statbot improves week over week, triaging thousands of questions and helping our customers build better and faster with Statsig.
We’re continuing to invest in Statsig’s AI capabilities, including richer eval tooling, deeper observability, multi-agent workflows, and more flexible MCP integrations.
AI Evals are currently in beta, and we're actively iterating toward general availability. If you're interested in building an evaluation + experimentation loop with Statsig, reach out to us, or ask Statbot to help you get started! Your feedback directly shapes what we build next.