Backed by Y Combinator

Agent optimization from human and AI feedback.

ZeroEval captures every run, scores quality with calibrated judges, and turns user feedback into better prompts, so your agents improve after launch, not just before it.

Read the docs
Before (gpt-5-nano):

"You are a helpful assistant."

After (auto-optimized, gemini-3-flash-preview, ↓ 40% cost):

"You are a technical support specialist. Keep answers under 3 sentences. Always reference the error code."

From 127 feedback entries · across 2,400 traced runs · ↑ 18% accuracy
Built and trusted by teams from

The Why

You ship an agent. It works in demos. Then production happens. Failures pile up in logs, users hit edge cases nobody tested for, and the agent just sits there. Same as the day you deployed it.

We believe the companies that win the next decade of AI won’t be those that build the best agents. They’ll be the ones whose agents get better over time.

Your users already know what’s wrong, but the tooling to close that loop just didn’t exist. So we built it.

We think software should grow, not just get shipped and maintained, with every AI system improving through its own usage. We're early. We're building for the long run.

The How

One system from trace to better prompt

01

Trace

Trace every run

Capture inputs, outputs, tool calls, latency, cost, and errors from the same production traffic your users generate. Two lines to instrument.

  • Python & TypeScript SDKs, REST API, OpenTelemetry
  • Auto-instruments OpenAI, Anthropic, Google, LangChain
  • Sessions, tags, and metadata for filtering
$ npx skills add zeroeval/zeroeval-skills
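For reference, a minimal sketch of what that two-line setup could look like with the Python SDK. The module name, the init call, and the span decorator are assumptions based on the description above, not a copy of the documented API.

```python
# Hypothetical instrumentation sketch; names and signatures are assumptions,
# not the documented ZeroEval API.
import zeroeval as ze

ze.init(api_key="YOUR_API_KEY")  # line 1: point the SDK at your workspace


@ze.span(name="answer_ticket", tags={"env": "prod"})  # line 2: trace this function
def answer_ticket(question: str) -> str:
    # LLM calls made inside the span (OpenAI, Anthropic, Google, LangChain)
    # are auto-instrumented: inputs, outputs, tool calls, latency, cost,
    # and errors all land on the same trace.
    ...
```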
02

Evaluate

Score what matters

Run built-in or custom judges against your own quality bar, not a generic benchmark. Binary pass/fail, rubric-based scoring, sample rates. You decide what to evaluate and how.

  • Built-in judges for hallucinations, safety, frustration
  • Custom rubrics for domain-specific quality
  • Suggested judges from your production traffic
judges (active)

  • is_helpful: binary, Pass / Fail evaluation, passing with a 94% pass rate
  • quality_score: rubric, 1–10 scale with criteria, healthy at 8.2
  • tone_check: binary, 10% sample, LLM spans only, 87% pass rate
  • safety_guard: binary, 100% sample, all spans, 99% pass rate
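To make the binary vs. rubric and sample-rate distinctions concrete, here is a hedged sketch of how those judges might be described in code. The Judge structure below is purely illustrative; it is not the ZeroEval SDK.

```python
# Illustrative data model only; field names are assumptions, not the ZeroEval API.
from dataclasses import dataclass


@dataclass
class Judge:
    name: str
    kind: str            # "binary" (pass/fail) or "rubric" (scored scale)
    criteria: str        # your quality bar, in plain language
    sample_rate: float   # fraction of traced runs to score
    scope: str           # which spans the judge runs on


judges = [
    Judge("is_helpful", "binary",
          "The answer directly addresses the user's question.", 1.0, "all spans"),
    Judge("quality_score", "rubric",
          "1-10 scale: accuracy, brevity, references the error code.", 1.0, "all spans"),
    Judge("tone_check", "binary",
          "Tone stays professional and non-dismissive.", 0.10, "LLM spans only"),
    Judge("safety_guard", "binary",
          "No unsafe or policy-violating content.", 1.0, "all spans"),
]
```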
03

Calibrate

Calibrate the judge

When the judge gets it wrong, correct it. ZeroEval learns your standards over time. The evaluation layer itself improves, not just your prompts.

  • Thumbs up/down with reasoning
  • Provide expected output for comparison
  • Watch alignment score climb over time
calibration: no_hallucination (binary)

Agent output: "You can request a full refund within 90 days of purchase."

Judge verdict: PASS · Corrected to FAIL ("Our refund policy is 30 days, not 90")

12 corrections · 94% aligned
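A sketch of the information a single calibration correction carries, matching the example above. The payload shape and field names are assumptions for illustration; the actual SDK or REST format may differ.

```python
# Hypothetical correction payload; field names are assumptions, not the real API.
correction = {
    "judge": "no_hallucination",
    "span_output": "You can request a full refund within 90 days of purchase.",
    "judge_verdict": "pass",      # what the judge decided
    "human_verdict": "fail",      # thumbs-down override from a reviewer
    "reasoning": "Our refund policy is 30 days, not 90.",
    "expected_output": "You can request a full refund within 30 days of purchase.",
}
# Corrections like this are what the judge learns from; as they accumulate
# (12 so far in the example above), the alignment score climbs.
```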
04

Optimize

Optimize the prompt

Turn feedback into candidate prompt versions. Compare them to the baseline, validate across models, and deploy without redeploying your app.

  • Candidates generated from real failure patterns
  • Side-by-side comparison with version history
  • One-click deploy via runtime prompt fetching
optimization: v1 → v2.4

Before (78% pass): "You are a helpful customer support assistant."

After (96% pass): "You are a technical support specialist. Keep answers under 3 sentences. Always reference the specific error code. Be direct and specific."

Patterns extracted: 43 × "Too verbose" · 28 × "Didn't address my error" · 12 × "Too generic"

From 83 feedback entries · ready to deploy
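Runtime prompt fetching is what makes "deploy without redeploying" possible: the application asks ZeroEval for the currently deployed prompt version at request time instead of hard-coding it. A minimal sketch, assuming a hypothetical get_prompt helper:

```python
# Hypothetical runtime prompt fetch; the function name and return shape are
# assumptions used to illustrate the pattern, not the documented SDK call.
import zeroeval as ze


def system_prompt() -> str:
    # Returns whichever version is currently deployed (v2.4 in the example
    # above), so promoting a new candidate in ZeroEval changes the agent's
    # behavior without shipping new application code.
    return ze.get_prompt("support-agent-system")
```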

The Payoff

See how failures become a better prompt

A weak prompt creates bad outcomes. User feedback and judge results reveal the pattern. ZeroEval proposes a better version, compares it to the baseline, and lets you ship the winner.

1. Input: "You are a helpful assistant." (78% negative feedback)

2. Feedback: "Too generic", "Way too verbose", "Didn't answer my question", "Clear and concise", "Missing context"

3. Output: "You are a technical support specialist. Keep answers under 3 sentences." (89% positive)
01 Generate a candidate from real feedback
02 Compare against the baseline
03 Validate across models
04 Deploy without redeploying your app
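The middle of that loop, turning raw feedback into the "patterns extracted" counts shown above, can be pictured as a simple tally. A rough sketch in plain Python, illustrative only, assuming feedback entries have already been grouped into themes:

```python
# Illustrative only: how recurring failure patterns could be tallied from
# feedback entries before a candidate prompt is generated from them.
from collections import Counter

feedback_entries = [
    {"text": "Way too verbose", "theme": "Too verbose"},
    {"text": "Didn't answer my question", "theme": "Didn't address my error"},
    {"text": "Too generic", "theme": "Too generic"},
    # ... one entry per piece of user or judge feedback
]

patterns = Counter(entry["theme"] for entry in feedback_entries)
for theme, count in patterns.most_common():
    print(f"{count:>3}  {theme}")
# The most frequent patterns drive the candidate prompt, which is then
# compared against the baseline, validated across models, and deployed.
```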

agent

Use ZeroEval from code, CLI, or agents.

Instrument with the SDK or OpenTelemetry. Inspect failures from the CLI. Let coding agents query ZeroEval directly through the MCP server. They read where they’re failing and use that context to improve their own behavior.

  • Python & TypeScript SDKs, REST API, OpenTelemetry
  • CLI for querying evaluations and triggering optimization
  • MCP server for Cursor, Claude Code, and 30+ agents
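As one example of what an agent or script might do over the REST API, pulling its own recent failures to use as context. The endpoint path, parameters, and response fields below are assumptions, not the documented API.

```python
# Hypothetical REST query; endpoint and fields are assumptions, not the real API.
import requests

resp = requests.get(
    "https://api.zeroeval.com/v1/evaluations",   # assumed endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    params={"judge": "is_helpful", "verdict": "fail", "limit": 20},
    timeout=30,
)
for failure in resp.json().get("results", []):
    # A coding agent (via the MCP server) or a CLI script can read where it
    # is failing and fold that context back into its own prompt or code.
    print(failure.get("span_id"), failure.get("reasoning"))
```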
Read the docs

See the feedback loop on a real agent

Book a demo to see how ZeroEval turns traces, judges, and user feedback into better prompts and fewer regressions.

SOC 2 Type II compliant

Continuous third-party security audit

View report