Backed by Y Combinator

Build self-improving agents

LLM judges that calibrate over time. Feedback loops that optimize your prompts automatically.

How it works

Four steps to agents that improve themselves

1

Install with two lines

Get started in seconds

Install the SDK and initialize it. Automatic instrumentation captures every LLM call from OpenAI, Anthropic, LangChain, LangGraph, and more.

  • Zero configuration required
  • Automatic tracing for all major LLM providers
  • Works with existing codebases instantly
terminal
$ pip install zeroeval

app.py
import zeroeval as ze
ze.init()
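
Once ze.init() runs, calls through supported SDKs are captured with no wrapping code. A minimal sketch of what that looks like with the OpenAI client (the model name and prompt are illustrative; nothing beyond the two ZeroEval lines is ZeroEval-specific):

example.py
import zeroeval as ze
from openai import OpenAI

ze.init()  # after this, calls through supported LLM clients are traced automatically

client = OpenAI()

# Captured as a span in ZeroEval, with no decorators or wrappers
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
)
print(response.choices[0].message.content)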
Dashboard preview: judge types at a glance
  • Binary Judge: pass/fail evaluation (e.g. is_helpful)
  • Scored Rubric: 1-10 scale with criteria (e.g. quality_score)
  • Smart Filter: evaluate specific spans (e.g. kind: llm)
  • Sample Rate: evaluate a % of traffic (e.g. 10%)
2

Create judges for any need

Binary, scored, filtered

Define what good looks like. Create judges with binary pass/fail, score-based rubrics, sample rates, and smart filters to evaluate exactly what matters.

  • Binary judges for clear pass/fail decisions
  • Scored rubrics with custom criteria and thresholds
  • Filter by span type, tags, or custom properties
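
Judges are created and managed in the dashboard. For a sense of how the options compose, here is a hypothetical sketch in code; ze.create_judge and its parameters are illustrative, not a documented SDK call:

judges.py (hypothetical)
import zeroeval as ze

ze.init()

# Binary judge: a clear pass/fail verdict on every response
ze.create_judge(
    name="is_helpful",
    kind="binary",
    instructions="Does the response resolve the user's question?",
)

# Scored rubric, filtered and sampled: 1-10 scale, LLM spans only,
# evaluating 10% of traffic
ze.create_judge(
    name="quality_score",
    kind="scored",
    instructions="Score clarity, accuracy, and tone from 1 to 10.",
    threshold=7,               # flag anything scoring below 7
    filter={"kind": "llm"},    # smart filter on span type
    sample_rate=0.1,           # sample rate: 10% of traffic
)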
3

Calibrate with feedback

Teach your judges

When a judge gets it wrong, tell it why. Your feedback trains the judge to align with your standards. The more feedback you provide, the smarter it becomes.

  • Thumbs up/down with optional reasoning
  • Expected output for comparison
  • Track alignment score over time
4

Optimize your agents

Close the loop

Turn feedback into action. We automatically rewrite your prompts based on patterns in user feedback. Your agents improve without manual tuning.

  • Automatic prompt optimization from feedback
  • Version control for all prompt changes
  • Deploy optimized versions with one click
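
A one-click deploy only pays off if agents pick the new version up automatically. Here is a hypothetical sketch of the consuming side, assuming prompts are fetched by slug at runtime; ze.get_prompt is illustrative, not a documented SDK call:

agent.py (hypothetical)
import zeroeval as ze

ze.init()

# Hypothetical fetch-by-slug: the agent always receives whichever
# version is currently deployed, so optimized prompts roll out
# without a code change
system_prompt = ze.get_prompt("customer-support")

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "My order arrived damaged."},
]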

Feedback in one API call.

Flag bad outputs with a thumbs down and a note on why they failed. We extract constraints from those complaints and evolve your prompts to address them.

feedback.py
import zeroeval as ze

ze.init()

# When a user gives feedback on a response
# (response is the traced LLM completion from your agent)
ze.send_feedback(
    prompt_slug="customer-support",
    completion_id=response.id,
    thumbs_up=False,
    reason="Response was too verbose",
    expected_output="Keep it under 3 sentences",
)
Prompt optimization preview (v2.4, optimized from 127 user signals):
  • Before (78% negative): “You are a helpful assistant.”
  • After (94% positive): “You are a technical support specialist. Keep answers under 3 sentences. Be direct and specific.”

Prompts that improve themselves.

ZeroEval analyzes feedback patterns and automatically generates optimized prompt versions. Deploy improvements without touching your code.

Make your AI agents smarter.

Join teams using production data to continuously improve their AI agents.