Build self-improving agents
Calibrated LLM judges score every output. Your feedback refines the judges. Your prompts improve automatically.
How it works
From install to self-improving agents in minutes
Install
Two lines to instrument
Add the SDK and call ze.init(). Every LLM call gets traced automatically. OpenAI, Anthropic, LangChain, LangGraph, and more.
- No config files, no boilerplate
- Auto-instruments OpenAI, Anthropic, LangChain, and more
- Drop into existing codebases in seconds
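A minimal sketch of that setup in Python, assuming the SDK is installed as the `zeroeval` package; only `ze.init()` comes from the description above, and the OpenAI call is standard `openai` SDK usage:

```python
import zeroeval as ze
from openai import OpenAI

ze.init()  # the one call from the step above; tracing starts here

client = OpenAI()  # auto-instrumented, no wrappers or config files
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize our refund policy in one sentence."}],
)
print(response.choices[0].message.content)
```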
Evaluate
Define what “good” looks like
Create LLM judges that score your agent outputs in production. Binary pass/fail, rubric-based scoring, configurable sample rates. You decide what to evaluate and how.
- Pass/fail or 1–10 scoring with custom rubrics
- Sample a percentage of traffic to control cost
- Filter by span type, tags, or custom properties
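To make those options concrete, here is a hypothetical sketch of a judge definition; the `ze.judges.create` call and its parameter names are assumptions, not the documented API:

```python
import zeroeval as ze

ze.init()

# Hypothetical judge definition -- parameter names are illustrative only.
ze.judges.create(
    name="conciseness",
    mode="pass_fail",            # alternatively a 1-10 rubric score
    rubric="Pass if the answer is under 3 sentences and directly addresses the question.",
    sample_rate=0.2,             # score 20% of production traffic to control cost
    filters={"span_type": "llm", "tags": ["support-bot"]},
)
```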
Calibrate
Correct the judge and it learns
When a judge gets it wrong, flag it with a reason. Your corrections train the judge to match your quality bar, not a generic one.
- Thumbs up/down with reasoning
- Provide expected output for comparison
- Watch the alignment score climb over time
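A correction might look like the hypothetical call below; the `ze.judges.correct` name and fields are assumptions, but the verdict, reason, and expected output map directly to the bullets above:

```python
import zeroeval as ze

ze.init()

# Hypothetical correction call -- names are illustrative, not the documented API.
ze.judges.correct(
    judge="conciseness",
    span_id="span_abc123",       # the scored output being re-labeled
    verdict="fail",              # thumbs down: the judge passed it, you disagree
    reason="Four paragraphs for a yes/no question is not concise.",
    expected_output="Yes, refunds are available within 30 days of purchase.",
)
```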
Optimize
Ship better prompts automatically
ZeroEval extracts patterns from feedback and rewrites your prompts to address them. No manual prompt engineering. Just review, approve, and deploy.
- New prompt versions generated from real failures
- Full version history with diff view
- One-click deploy to production
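If your application loads prompts at runtime, picking up the approved version might look like this hypothetical sketch; the `ze.prompts.get` call is an assumption, and the point is that a deployed rewrite takes effect without an app redeploy:

```python
import zeroeval as ze
from openai import OpenAI

ze.init()

# Hypothetical prompt fetch -- illustrative only: ask for the currently
# deployed version so an approved rewrite ships without a code change.
system_prompt = ze.prompts.get("support-bot")

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "How do I reset my password?"},
    ],
)
```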
Feedback in one API call.
User complains about a response? Send a thumbs down with the reason. ZeroEval extracts the constraint and bakes it into your next prompt version.
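As a sketch, that single call might look like this; the `ze.send_feedback` name and fields are hypothetical, and the trace ID is a placeholder:

```python
import zeroeval as ze

ze.init()

# Hypothetical feedback call -- one request ties the user's complaint to the trace.
ze.send_feedback(
    trace_id="trace_abc123",     # the traced response the user complained about
    rating="thumbs_down",
    reason="Too verbose, I just wanted the port number.",
)
```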
Before: “You are a helpful assistant.”
After: “You are a technical support specialist. Keep answers under 3 sentences. Be direct and specific.”
Prompts that improve themselves.
127 users said “too verbose.” ZeroEval rewrites your prompt to be concise. Your approval rate jumps from 78% to 94%.
Your agents should get better every day.
See how ZeroEval turns user feedback into higher-performing prompts, without manual tuning.