About Caduceus
Caduceus is the Hermes Agent Evaluation Framework — a rigorous, adversarial benchmarking platform for production-grade AI agents. Named after the staff carried by Hermes, the Greek god of cunning, commerce, and communication, Caduceus tests whether agents can actually do the work they claim to handle.
Why We Built This
Demos lie. An agent that looks impressive in a 30-second screen recording often falls apart when faced with real-world complexity: ambiguous error messages, missing files, permission errors, cascading failures across services. We built Caduceus to replace vibes-based evaluation with reproducible, evidence-based measurement.
Every task in Caduceus comes from real production scenarios: multi-step debugging, operations, security, and infrastructure work of the kind that actually determines whether an agent is ready to deploy. These aren't toy problems.
Connection to Hermes Agent
Caduceus is the native evaluation framework for Hermes Agent, the open-source, self-improving agent framework by Nous Research. While any Hermes-compatible agent can be evaluated, Caduceus is purpose-built to test the capabilities that matter for Hermes deployments: tool use, self-correction, proactive behavior, and multi-step reasoning.
The pipeline: realistic scenarios are generated → agents produce trajectories → Caduceus evaluates and ranks them across 7 dimensions. This creates a tight feedback loop: better benchmarks lead to better training data, which leads to better agents.
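To make the shape of that loop concrete, here is a minimal Python sketch of the evaluate-and-rank step, under stated assumptions: the dimension names, the Trajectory type, and the placeholder scoring stub are all hypothetical illustrations, not the actual Caduceus API.

```python
from dataclasses import dataclass, field

# Hypothetical names throughout: this is an illustrative sketch of the
# evaluate-and-rank step, not the real Caduceus interface.
DIMENSIONS = (
    "tool_use", "self_correction", "proactivity", "multi_step_reasoning",
    "robustness", "security", "efficiency",
)  # placeholder labels standing in for the 7 scoring dimensions

@dataclass
class Trajectory:
    agent: str                                        # which agent produced it
    actions: list[str] = field(default_factory=list)  # recorded steps

def score(traj: Trajectory) -> dict[str, float]:
    """Grade one trajectory on every dimension (0.0 to 1.0).

    Real grading logic is elided; this stub just rewards longer action
    traces so the example runs end to end.
    """
    return {dim: min(1.0, len(traj.actions) / 10) for dim in DIMENSIONS}

def leaderboard(trajs: list[Trajectory]) -> list[tuple[str, float]]:
    """Rank agents by their mean score across all dimensions."""
    means = {t.agent: sum(score(t).values()) / len(DIMENSIONS) for t in trajs}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    runs = [Trajectory("agent-a", ["read_logs", "patch", "run_tests"]),
            Trajectory("agent-b", ["read_logs", "patch"])]
    for agent, mean in leaderboard(runs):
        print(f"{agent}: {mean:.2f}")
```

Each ranked trajectory can then feed back into training, closing the loop described above.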
Roadmap
- Phase 1: Core evaluation engine with 7 scoring dimensions and public leaderboard
- Phase 2: Security-focused benchmarks and adversarial failure analysis
- Phase 3: Performance trend reporting and agent lifecycle tracking
- Phase 4: Live arena — real-time head-to-head evaluation on shared tasks
Built By
Caduceus is built by Daniel Lougen, a PhD student in visual neuroscience at the University of Toronto. The project connects AI agent evaluation with insights from how biological systems process information and recover from errors.