Backed by Y Combinator

Improve your agents automatically.

Set up human feedback and AI judges to learn from every interaction, so your agents keep improving even after launch.

Read the docs
Before · gpt-5-nano

“You are a helpful assistant.”

After · auto-optimized · gemini-3-flash-preview

“You are a technical support specialist. Keep answers under 3 sentences. Always reference the error code.”

From 127 pieces of feedback · across 2,400 traced runs · ↑ 18% accuracy

The Why

You ship an agent. It works in demos. Then production happens. Failures pile up in logs, users hit edge cases nobody tested for, and the agent just sits there. Same as the day you deployed it.

We believe the companies that win the next decade of AI won’t be those that build the best agents. They’ll be the ones whose agents get better over time.

Your users already know what’s wrong, but the tooling to close that loop just didn’t exist. So we built it.

We think software should grow, not just get shipped and maintained: every AI system improving through its own usage. We’re early. We’re building for the long run.

The How

From setup to self-improving in minutes

01

Install

Let your agent handle it

Install ZeroEval Skills and tell your coding agent to set up tracing. Or add the SDK manually: two lines to instrument, sketched below.

  • Works with Cursor, Claude Code, Codex, and 30+ agents
  • Agent handles SDK install, first trace, and judge setup
  • Or drop in the SDK manually in seconds
$ npx skills add zeroeval/zeroeval-skills
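
If you take the manual path, the instrumentation can be as small as the Python sketch below. The zeroeval import, init() call, and span decorator are assumed names for illustration; the docs have the exact API.

    # Hypothetical two-line setup: initialize the SDK, then trace your agent's entrypoint.
    import zeroeval as ze

    ze.init(api_key="YOUR_API_KEY")     # assumed entry point that registers your project

    @ze.span(name="support_agent")      # assumed decorator that records each call as a traced span
    def answer(ticket: str) -> str:
        ...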
02

Evaluate

Score what matters

Run built-in or custom judges against your own quality bar, not a generic benchmark. Binary pass/fail, rubric-based scoring, sample rates. You decide what to evaluate and how.

  • Built-in judges for hallucinations, safety, frustration
  • Custom rubrics for domain-specific quality
  • Suggested judges from your production traffic
  • is_helpful (binary) · Pass / Fail evaluation · pass rate ▲ +2%
  • quality_score (rubric) · 1–10 scale with criteria · avg score ▲ +0.4
  • tone_check (binary) · LLM spans, 10% sample · pass rate → stable
  • safety_guard (binary) · All spans, 100% sample · pass rate ▲ +1%
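
To make the knobs concrete, here is what a judge definition could look like in Python. The field names (type, criteria, sample_rate, and so on) are assumptions for the sketch, not the documented schema.

    # Illustrative configurations: one rubric judge, one sampled binary judge.
    quality_score = {
        "name": "quality_score",
        "type": "rubric",               # scored 1-10 against explicit criteria
        "criteria": [
            "References the error code from the trace",
            "Stays under 3 sentences",
            "Proposes a concrete next step",
        ],
        "target": "llm_spans",
        "sample_rate": 1.0,             # evaluate every matching span
    }

    tone_check = {
        "name": "tone_check",
        "type": "binary",               # simple pass/fail verdict
        "instruction": "Fail if the reply is dismissive or sarcastic.",
        "target": "llm_spans",
        "sample_rate": 0.10,            # sample 10% of LLM spans to control cost
    }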
03

Calibrate

Calibrate the judge

When the judge gets it wrong, correct it. ZeroEval learns your standards over time. The evaluation layer itself improves, not just your prompts.

  • Thumbs up/down with reasoning
  • Provide expected output for comparison
  • Watch alignment score climb over time
no_hallucination · Judge Alignment: 89% → 90% with feedback
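
In code, a correction might look like the hypothetical call below; the function name and parameters are assumptions meant to show the shape of the feedback, not the real SDK surface.

    # Hypothetical calibration call: overrule a judge verdict and explain why.
    import zeroeval as ze

    ze.send_feedback(                       # assumed helper, shown for illustration only
        trace_id="tr_8f2c",
        judge="no_hallucination",
        verdict="pass",                     # thumbs up: the judge's "fail" was wrong here
        reasoning="The cited error code matches the logs; nothing was fabricated.",
        expected_output="Reset the device, then re-pair it using code E-104.",
    )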
04

Optimize

Improve the whole agent

Turn failure patterns into concrete changes: better prompts, different models, updated agent code. Compare candidates to the baseline, validate across configurations, and ship without redeploying.

  • Prompt, model, and code candidates from real failures
  • Side-by-side comparison with version history
  • Deploy changes without redeploying your app
Before
Prompt: “You are a helpful assistant.”
Model: gpt-5-nano

After
Prompt: “Technical support specialist. Under 3 sentences. Reference the error code.”
Model: gemini-3-flash-preview
Agent code: +18 −3
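
One way “deploy without redeploying” can work in practice is for the agent to resolve its prompt and model at runtime instead of hardcoding them. The get_prompt call and its fields below are assumed names used only to illustrate the pattern.

    # Sketch: pull the live prompt/model pair so a promoted candidate takes effect immediately.
    import zeroeval as ze
    from openai import OpenAI

    client = OpenAI()  # any OpenAI-compatible client works for the illustration
    cfg = ze.get_prompt("support_agent", tag="production")   # assumed versioned prompt registry

    def answer(history: list[dict]) -> str:
        resp = client.chat.completions.create(
            model=cfg.model,                                  # the optimized model once a candidate ships
            messages=[{"role": "system", "content": cfg.content}, *history],
        )
        return resp.choices[0].message.content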

The Loop

A self-improving system

Every interaction makes your agent smarter. Traces flow in, feedback shapes quality, and optimizations deploy automatically across your entire stack.


Use ZeroEval from code, CLI, or agents.

Instrument with the SDK or OpenTelemetry. Inspect failures from the CLI. Let coding agents query ZeroEval directly through the MCP server. They read where they’re failing and use that context to improve their own behavior.

  • Python & TypeScript SDKs, REST API, OpenTelemetry
  • CLI for querying evaluations and triggering optimization
  • MCP server for Cursor, Claude Code, and 30+ agents
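
If you would rather stay on plain OpenTelemetry, standard OTLP export is enough to get traces flowing; the endpoint and auth header below are placeholders, so take the real values from the docs.

    # Standard OpenTelemetry setup pointed at an OTLP endpoint (placeholder values).
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(
        endpoint="https://api.example-zeroeval-endpoint.com/v1/traces",  # placeholder endpoint
        headers={"Authorization": "Bearer YOUR_API_KEY"},                # placeholder auth header
    )))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("support-agent")
    with tracer.start_as_current_span("answer_ticket") as span:
        span.set_attribute("ticket.id", "T-1042")
        # call your model and tools here; child spans are captured under this one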
Read the docs

See the feedback loop on a real agent

Book a demo to see how ZeroEval turns traces, judges, and user feedback into better prompts and fewer regressions.

SOC 2 Type II compliant

Continuous third-party security audit

View report