Backed by Y Combinator

Agent optimization from human and AI feedback.

ZeroEval captures every run, scores quality with calibrated judges, and turns user feedback into better prompts, so your agents improve after launch, not just before it.

Read the docs
Before (gpt-5-nano):

"You are a helpful assistant."

After (auto-optimized, gemini-3-flash-preview, ↓ 40% cost):

"You are a technical support specialist. Keep answers under 3 sentences. Always reference the error code."

From 127 feedback entries · across 2,400 traced runs · ↑ 18% accuracy
Built and trusted by teams from

The Why

You ship an agent. It works in demos. Then production happens. Failures pile up in logs, users hit edge cases nobody tested for, and the agent just sits there. Same as the day you deployed it.

We believe the companies that win the next decade of AI won’t be those that build the best agents. They’ll be the ones whose agents get better over time.

Your users already know what’s wrong, but the tooling to close that loop just didn’t exist. So we built it.

We think software should grow, not just get shipped and maintained, with every AI system improving through its own usage. We're early. We're building for the long run.

The How

One system from trace to better prompt

01

Trace

Trace every run

Capture inputs, outputs, tool calls, latency, cost, and errors from the same production traffic your users generate. Two lines to instrument.

  • Python & TypeScript SDKs, REST API, OpenTelemetry
  • Auto-instruments OpenAI, Anthropic, Google, LangChain
  • Sessions, tags, and metadata for filtering
$ npx skills add zeroeval/zeroeval-skills
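For reference, a minimal sketch of what that two-line setup could look like with the Python SDK. The module name, the init call, and the span decorator are assumptions based on the description above, not a copy of the documented API.

```python
# Hypothetical instrumentation sketch; names and signatures are assumptions,
# not the documented ZeroEval API.
import zeroeval as ze

ze.init(api_key="YOUR_API_KEY")  # line 1: point the SDK at your workspace


@ze.span(name="answer_ticket", tags={"env": "prod"})  # line 2: trace this function
def answer_ticket(question: str) -> str:
    # LLM calls made inside the span (OpenAI, Anthropic, Google, LangChain)
    # are auto-instrumented: inputs, outputs, tool calls, latency, cost,
    # and errors all land on the same trace.
    ...
```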
02

Evaluate

Score what matters

Run built-in or custom judges against your own quality bar, not a generic benchmark. Binary pass/fail, rubric-based scoring, sample rates. You decide what to evaluate and how.

  • Built-in judges for hallucinations, safety, frustration
  • Custom rubrics for domain-specific quality
  • Suggested judges from your production traffic
judges (active)

  • is_helpful: binary, Pass / Fail evaluation, passing with a 94% pass rate
  • quality_score: rubric, 1–10 scale with criteria, healthy at 8.2
  • tone_check: binary, 10% sample, LLM spans only, 87% pass rate
  • safety_guard: binary, 100% sample, all spans, 99% pass rate
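To make the binary vs. rubric and sample-rate distinctions concrete, here is a hedged sketch of how those judges might be described in code. The Judge structure below is purely illustrative; it is not the ZeroEval SDK.

```python
# Illustrative data model only; field names are assumptions, not the ZeroEval API.
from dataclasses import dataclass


@dataclass
class Judge:
    name: str
    kind: str            # "binary" (pass/fail) or "rubric" (scored scale)
    criteria: str        # your quality bar, in plain language
    sample_rate: float   # fraction of traced runs to score
    scope: str           # which spans the judge runs on


judges = [
    Judge("is_helpful", "binary",
          "The answer directly addresses the user's question.", 1.0, "all spans"),
    Judge("quality_score", "rubric",
          "1-10 scale: accuracy, brevity, references the error code.", 1.0, "all spans"),
    Judge("tone_check", "binary",
          "Tone stays professional and non-dismissive.", 0.10, "LLM spans only"),
    Judge("safety_guard", "binary",
          "No unsafe or policy-violating content.", 1.0, "all spans"),
]
```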
03

Calibrate

Calibrate the judge

When the judge gets it wrong, correct it. ZeroEval learns your standards over time. The evaluation layer itself improves, not just your prompts.

  • Thumbs up/down with reasoning
  • Provide expected output for comparison
  • Watch alignment score climb over time
calibration: no_hallucination (binary)

Agent output: "You can request a full refund within 90 days of purchase."

Judge verdict: PASS · Corrected to FAIL ("Our refund policy is 30 days, not 90")

12 corrections · 94% aligned
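A sketch of the information a single calibration correction carries, matching the example above. The payload shape and field names are assumptions for illustration; the actual SDK or REST format may differ.

```python
# Hypothetical correction payload; field names are assumptions, not the real API.
correction = {
    "judge": "no_hallucination",
    "span_output": "You can request a full refund within 90 days of purchase.",
    "judge_verdict": "pass",      # what the judge decided
    "human_verdict": "fail",      # thumbs-down override from a reviewer
    "reasoning": "Our refund policy is 30 days, not 90.",
    "expected_output": "You can request a full refund within 30 days of purchase.",
}
# Corrections like this are what the judge learns from; as they accumulate
# (12 so far in the example above), the alignment score climbs.
```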
04

Optimize

Optimize the prompt

Turn feedback into candidate prompt versions. Compare them to the baseline, validate across models, and deploy without redeploying your app.

  • Candidates generated from real failure patterns
  • Side-by-side comparison with version history
  • One-click deploy via runtime prompt fetching
optimization: v1 → v2.4

Before (78% pass): "You are a helpful customer support assistant."

After (96% pass): "You are a technical support specialist. Keep answers under 3 sentences. Always reference the specific error code. Be direct and specific."

Patterns extracted: 43 × "Too verbose" · 28 × "Didn't address my error" · 12 × "Too generic"

From 83 feedback entries · ready to deploy
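Runtime prompt fetching is what makes "deploy without redeploying" possible: the application asks ZeroEval for the currently deployed prompt version at request time instead of hard-coding it. A minimal sketch, assuming a hypothetical get_prompt helper:

```python
# Hypothetical runtime prompt fetch; the function name and return shape are
# assumptions used to illustrate the pattern, not the documented SDK call.
import zeroeval as ze


def system_prompt() -> str:
    # Returns whichever version is currently deployed (v2.4 in the example
    # above), so promoting a new candidate in ZeroEval changes the agent's
    # behavior without shipping new application code.
    return ze.get_prompt("support-agent-system")
```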

The Payoff

See how failures become a better prompt

A weak prompt creates bad outcomes. User feedback and judge results reveal the pattern. ZeroEval proposes a better version, compares it to the baseline, and lets you ship the winner.

1. Input: "You are a helpful assistant." (78% negative feedback)

2. Feedback: "Too generic", "Way too verbose", "Didn't answer my question", "Clear and concise", "Missing context"

3. Output: "You are a technical support specialist. Keep answers under 3 sentences." (89% positive)
01 Generate a candidate from real feedback
02 Compare against the baseline
03 Validate across models
04 Deploy without redeploying your app
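The middle of that loop, turning raw feedback into the "patterns extracted" counts shown above, can be pictured as a simple tally. A rough sketch in plain Python, illustrative only, assuming feedback entries have already been grouped into themes:

```python
# Illustrative only: how recurring failure patterns could be tallied from
# feedback entries before a candidate prompt is generated from them.
from collections import Counter

feedback_entries = [
    {"text": "Way too verbose", "theme": "Too verbose"},
    {"text": "Didn't answer my question", "theme": "Didn't address my error"},
    {"text": "Too generic", "theme": "Too generic"},
    # ... one entry per piece of user or judge feedback
]

patterns = Counter(entry["theme"] for entry in feedback_entries)
for theme, count in patterns.most_common():
    print(f"{count:>3}  {theme}")
# The most frequent patterns drive the candidate prompt, which is then
# compared against the baseline, validated across models, and deployed.
```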

agent

Use ZeroEval from code, CLI, or agents.

Instrument with the SDK or OpenTelemetry. Inspect failures from the CLI. Let coding agents query ZeroEval directly through the MCP server. They read where they’re failing and use that context to improve their own behavior.

  • Python & TypeScript SDKs, REST API, OpenTelemetry
  • CLI for querying evaluations and triggering optimization
  • MCP server for Cursor, Claude Code, and 30+ agents
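As one example of what an agent or script might do over the REST API, pulling its own recent failures to use as context. The endpoint path, parameters, and response fields below are assumptions, not the documented API.

```python
# Hypothetical REST query; endpoint and fields are assumptions, not the real API.
import requests

resp = requests.get(
    "https://api.zeroeval.com/v1/evaluations",   # assumed endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    params={"judge": "is_helpful", "verdict": "fail", "limit": 20},
    timeout=30,
)
for failure in resp.json().get("results", []):
    # A coding agent (via the MCP server) or a CLI script can read where it
    # is failing and fold that context back into its own prompt or code.
    print(failure.get("span_id"), failure.get("reasoning"))
```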
Read the docs

See the feedback loop on a real agent

Book a demo to see how ZeroEval turns traces, judges, and user feedback into better prompts and fewer regressions.

SOC 2 Type II compliant

Continuous third-party security audit

View report