Backed by Y Combinator

Build self-improving agents

Calibrated LLM judges score every output. Feedback refines them. Your prompts improve automatically.

The Why

You ship an agent. It works in demos. Then production happens. Failures pile up in logs, users hit edge cases nobody tested for, and the agent just sits there. Same as the day you deployed it.

We believe the companies that win the next decade of AI won’t be those that build the best agents. They’ll be the ones whose agents get better over time.

Your users already know what’s wrong, but the tooling to close that loop just didn’t exist. So we built it.

We think software should grow, not just get shipped and maintained. Every AI system improving through its own usage. We’re early. We’re building for the long run.

The How

From install to self-improving agents in minutes

01

Install

Let your agent handle it

Install ZeroEval Skills and tell your coding agent to set up tracing. Or add the SDK manually - two lines to instrument.

  • Works with Cursor, Claude Code, Codex, and 30+ agents
  • Agent handles SDK install, first trace, and judge setup
  • Or drop in the SDK manually in seconds
$ npx skills add zeroeval/zeroeval-skills
Then tell your agent:
> “Set up zeroeval”
SDK installed
First trace verified
Starter judge created
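Conceptually, "two lines to instrument" means wrapping your agent's calls so each one emits a trace span. The sketch below stubs that idea with a plain decorator; the names (`trace`, `TRACE_LOG`) are illustrative, not the real ZeroEval SDK API.

```python
import functools
import time
import uuid

# Illustrative stand-in for an SDK tracer: records one span per traced call.
TRACE_LOG = []

def trace(span_type="llm"):
    """Decorator that records a span (id, name, type, duration, output)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            output = fn(*args, **kwargs)
            TRACE_LOG.append({
                "span_id": f"sp_{uuid.uuid4().hex[:3]}",
                "name": fn.__name__,
                "type": span_type,
                "duration_s": time.perf_counter() - start,
                "output": output,
            })
            return output
        return wrapper
    return decorator

@trace(span_type="llm")
def answer(question: str) -> str:
    # Stand-in for a model call.
    return f"Echo: {question}"

answer("How do refunds work?")
print(TRACE_LOG[0]["name"], TRACE_LOG[0]["type"])  # answer llm
```

Once every output flows through a span like this, judges have something concrete to score.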
02

Evaluate

Define what “good” looks like

Create LLM judges that score your agent outputs in production. Binary pass/fail, rubric-based scoring, sample rates. You decide what to evaluate and how.

  • Pass/fail or 1–10 scoring with custom rubrics
  • Sample a percentage of traffic to control cost
  • Filter by span type, tags, or custom properties
judges · active

  is_helpful · binary · Pass / Fail evaluation · 94% pass
  quality_score · rubric · 1–10 scale with criteria · 8.2 avg
  tone_check · binary · LLM spans only · 10% sample · 87% pass
  safety_guard · binary · All spans · 100% sample · 99% pass
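The mechanics of a sampled binary judge can be sketched in a few lines. Everything here (`make_judge`, the rubric, the result shape) is illustrative, not ZeroEval's actual interface; the point is how a sample rate bounds cost while the rubric decides pass/fail.

```python
import random

def make_judge(name, rubric, sample_rate=1.0, seed=None):
    """Build a judge that scores only `sample_rate` of spans (illustrative)."""
    rng = random.Random(seed)
    def judge(span_output):
        if rng.random() > sample_rate:
            return None  # skipped: sampling keeps evaluation cost bounded
        return {"judge": name, "result": "pass" if rubric(span_output) else "fail"}
    return judge

# Toy rubric: an answer is "helpful" if it is non-empty and under 300 chars.
is_helpful = make_judge(
    "is_helpful",
    rubric=lambda out: 0 < len(out) <= 300,
    sample_rate=1.0,
)

print(is_helpful("Refunds are available within 30 days."))  # result: pass
print(is_helpful(""))                                       # result: fail
```

In production the rubric would itself be an LLM call with your criteria; the sampling and pass/fail plumbing stay the same.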
03

Calibrate

Correct the judge, it learns

When a judge gets it wrong, flag it with a reason. Your corrections train the judge to match your quality bar, not a generic one.

  • Thumbs up/down with reasoning
  • Provide expected output for comparison
  • Watch alignment score climb over time
calibration

before
  no_hallucination · binary · PASS
  “You can request a full refund within 90 days of purchase.”
  Correction: “Our refund policy is 30 days, not 90” · Hallucinated policy details

after
  “You can request a full refund within 90 days…” · FAIL
  “Our refund window is 30 days from purchase…” · PASS

after 12 corrections: +32% · 94% aligned
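One plausible way to track alignment is the fraction of corrected examples where the judge's verdict matches the human label. This is a minimal sketch under that assumption; ZeroEval's exact metric may differ.

```python
def alignment(corrections):
    """Share of corrections where the judge agreed with the human label."""
    matches = sum(1 for c in corrections if c["judge"] == c["human"])
    return matches / len(corrections)

corrections = [
    {"judge": "pass", "human": "fail"},  # e.g. judge missed a hallucinated refund policy
    {"judge": "fail", "human": "fail"},
    {"judge": "pass", "human": "pass"},
    {"judge": "pass", "human": "pass"},
]
print(f"{alignment(corrections):.0%} aligned")  # 75% aligned
```

Each new correction both moves this score and becomes training signal for the judge itself.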
04

Optimize

Ship better prompts automatically

ZeroEval extracts patterns from feedback and rewrites your prompts to address them. No manual prompt engineering. Just review, approve, and deploy.

  • New prompt versions generated from real failures
  • Full version history with diff view
  • One-click deploy to production
optimization · v1 → v2.4

before · 78% pass
  “You are a helpful customer support assistant.”

after · 96% pass
  “You are a technical support specialist. Keep answers under 3 sentences. Always reference the specific error code. Be direct and specific.”

Patterns extracted from 83 failure signals:
  43 · “Too verbose”
  28 · “Didn’t address my error”
  12 · “Too generic”
ready to deploy
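The pattern-extraction step boils down to clustering failure feedback by reason and surfacing the most common ones for the prompt rewrite. A real system would cluster free-text semantically; this sketch counts exact reasons for illustration, using the figures from the panel above.

```python
from collections import Counter

# Failure signals collected from judge results and user corrections.
signals = (
    ["Too verbose"] * 43
    + ["Didn't address my error"] * 28
    + ["Too generic"] * 12
)

# Rank the recurring failure patterns to drive the prompt rewrite.
patterns = Counter(signals).most_common(3)
for reason, count in patterns:
    print(count, reason)
# 43 Too verbose
# 28 Didn't address my error
# 12 Too generic
```

Each ranked pattern then maps to a concrete edit in the next prompt version, e.g. "Too verbose" becomes "Keep answers under 3 sentences."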

Your agent’s interface to evaluation.

The CLI lets agents and automation inspect traces, filter evaluations, read judge reasoning, and kick off prompt optimization - all from the terminal.

  • Query evaluations, filter by result, date, judge
  • Read judge insights and calibration data
  • Trigger and promote optimization runs
terminal

$ zeroeval judges evaluations abc123 \
    --evaluation-result false --limit 3
Span     Result    Reason
sp_7f2   ✕ fail    Response exceeds length…
sp_a91   ✕ fail    Missing citation for…
sp_3e8   ✕ fail    Tone inconsistent with…

$ zeroeval optimize prompt start task_92x \
    --optimizer-type quick_refine
Optimization run started (run_k8m)
Analyzing 847 evaluations…

Agents that diagnose themselves.

The MCP server gives your AI agents direct access to ZeroEval. They query where they’re failing, read the judge’s reasoning, and use that context to improve their own behavior. The loop closes itself.

  • Agent reads evaluation results in context
  • Surfaces failure patterns and judge reasoning
  • Uses insights to refine its own prompts and behavior
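The closed loop the agent runs can be sketched end to end with stubs: evaluate, read failure patterns, refine the prompt, re-evaluate. All names here are hypothetical; a real agent would call the MCP tools instead of these local functions.

```python
def evaluate(prompt, cases, judge):
    """Pass rate of `prompt` over `cases` under a binary judge."""
    return sum(judge(prompt, c) for c in cases) / len(cases)

def refine(prompt, failures):
    """Naive optimizer: map each failure pattern to a prompt edit."""
    fixes = {"Too verbose": " Keep answers under 3 sentences."}
    for reason in failures:
        prompt += fixes.get(reason, "")
    return prompt

# Toy judge: passes only when the prompt enforces brevity.
judge = lambda prompt, case: "3 sentences" in prompt
cases = list(range(10))

prompt = "You are a helpful customer support assistant."
before = evaluate(prompt, cases, judge)           # 0.0
prompt = refine(prompt, failures=["Too verbose"])
after = evaluate(prompt, cases, judge)            # 1.0
print(before, "->", after)
```

Contrived on purpose: the point is the shape of the loop, in which evaluation results feed the refinement that the next evaluation then measures.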

Your agents should get better every day.

See how ZeroEval turns user feedback into higher-performing prompts, without manual tuning.

SOC 2 Type II compliant

Continuous third-party security audit
