How Caduceus Works

Caduceus is the native evaluation framework for Hermes Agent. Here's how the entire pipeline works, from scenario generation to leaderboard ranking.

1. Scenario Generation

Realistic production scenarios are generated — broken deployments, security incidents, data pipeline failures — from real-world templates. Each scenario is a self-contained sandbox with files, services, and preconditions.
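A generated scenario might be represented roughly like the sketch below. The field names and values here are illustrative assumptions, not Caduceus's actual schema:

```python
# Hypothetical shape of a generated scenario; all field names are illustrative.
scenario = {
    "id": "broken-deployment-001",
    "template": "broken_deployment",            # real-world template it was derived from
    "goal": "Restore the web service to a healthy state.",
    "sandbox": {                                # self-contained sandbox contents
        "files": {"/etc/app/config.yml": "port: 8080\nworkers: 0\n"},
        "services": [{"name": "app", "state": "crash-looping"}],
    },
    "preconditions": ["service app is failing health checks"],
}
```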

2. Trajectory Generation

Your agent is dropped into the scenario with a goal description and a set of available tools. It must reason, plan, execute commands, read output, and iteratively work toward a solution. The full trajectory (every thought and action) is recorded.
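A recorded trajectory can be pictured as an ordered list of reason/act/observe steps. This is a minimal sketch under that assumption; the class and field names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    thought: str      # the agent's reasoning before acting
    action: str       # the command or tool call it issued
    observation: str  # the output it read back

@dataclass
class Trajectory:
    scenario_id: str
    steps: list[Step] = field(default_factory=list)

# Example: one recorded step in a hypothetical run.
traj = Trajectory(scenario_id="broken-deployment-001")
traj.steps.append(Step(
    thought="The service is crash-looping; check its logs first.",
    action="journalctl -u app --no-pager | tail -n 20",
    observation="Error: workers must be >= 1",
))
```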

3. Caduceus Scoring Engine

Each trajectory is evaluated across 7 orthogonal dimensions by a combination of automated validators and LLM-based judges. Scores are normalized per-task to account for difficulty.

4. Leaderboard & Analytics

Agents are ranked by weighted composite score. You choose the weight profile (General, Security-first, Performance-first, Reasoning-first) to see how agents compare under different priorities.

Scoring Breakdown

| Dimension | Weight | What It Measures |
| --- | --- | --- |
| Thinking Depth | 20% | Quality of reasoning traces, planning before acting, and consideration of edge cases. |
| Self-Correction | 20% | How often and how well the agent detects and fixes its own errors mid-trajectory. |
| Verification | 15% | Whether the agent confirms success — reads outputs, checks results, validates fixes. |
| Tool Diversity | 15% | Appropriate breadth of tool usage rather than over-reliance on a single approach. |
| Recovery Rate | 15% | Graceful recovery from permission errors, missing files, failed commands. |
| Efficiency | 10% | Completing tasks without unnecessary steps, redundant commands, or wasted tokens. |
| Proactiveness | 10% | Anticipating next steps, preemptively checking for issues, and acting without being explicitly told. |
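A weighted composite over these dimensions can be sketched as follows. This is a plausible reading, not Caduceus's actual implementation; the weights mirror the table above and are normalized by their sum so the result stays on the same 0–100 scale as the inputs:

```python
# Weights from the scoring table above, normalized by their sum.
WEIGHTS = {
    "thinking_depth": 20, "self_correction": 20, "verification": 15,
    "tool_diversity": 15, "recovery_rate": 15, "efficiency": 10,
    "proactiveness": 10,
}

def composite(scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (each on a 0-100 scale)."""
    total = sum(WEIGHTS.values())
    return sum(scores[d] * w for d, w in WEIGHTS.items()) / total
```

An agent scoring 80 on every dimension gets a composite of 80; raising a 20%-weight dimension moves the composite twice as much as raising a 10%-weight one.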

Anti-Gaming Safeguards

  • Held-out test sets — agents never see evaluation tasks during training
  • Rotating prompt templates prevent memorization of specific phrasings
  • Variance tracking flags agents with suspiciously low score variance
  • Full trajectory recording enables manual audit of any suspicious run
  • Adversarial tasks designed to expose shortcut-taking behavior
  • Separate synthetic and production task pools
  • Statistical normalization (z-scores, IQR scaling) across all metrics to fairly assess quality regardless of dimension scale or distribution
  • Parameter-count-aware scoring — results are contextualized by model size so a strong 36B agent gets appropriate credit vs. a 405B model
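The z-score and IQR scaling mentioned above are standard statistical transforms; a minimal sketch using only the Python standard library (not Caduceus's own code):

```python
import statistics

def z_scores(values: list[float]) -> list[float]:
    """Standard z-score normalization: (x - mean) / stdev."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]

def iqr_scale(values: list[float]) -> list[float]:
    """Robust scaling: (x - median) / IQR, less sensitive to outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    med = statistics.median(values)
    return [(v - med) / (q3 - q1) for v in values]

# One metric's raw scores across agents; the 95.0 is an outlier that
# distorts z-scores far more than it distorts IQR scaling.
raw = [62.0, 70.0, 71.0, 74.0, 95.0]
```

IQR scaling is the reason a single outlier run can't drag everyone else's normalized scores around: the median and quartiles barely move when one extreme value changes.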