How Caduceus Works

Caduceus is the native evaluation framework for Hermes Agent. Here's how the entire pipeline works, from scenario generation to leaderboard ranking.

1. Scenario Generation

Realistic production scenarios are generated — broken deployments, security incidents, data pipeline failures — from real-world templates. Each scenario is a self-contained sandbox with files, services, and preconditions.
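A generated scenario might be represented roughly like the sketch below. The field names and values here are illustrative assumptions, not Caduceus's actual schema:

```python
# Hypothetical shape of a generated scenario; all field names are illustrative.
scenario = {
    "id": "broken-deployment-001",
    "template": "broken_deployment",            # real-world template it was derived from
    "goal": "Restore the web service to a healthy state.",
    "sandbox": {                                # self-contained sandbox contents
        "files": {"/etc/app/config.yml": "port: 8080\nworkers: 0\n"},
        "services": [{"name": "app", "state": "crash-looping"}],
    },
    "preconditions": ["service app is failing health checks"],
}
```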

2. Trajectory Generation

Your agent is dropped into the scenario with a goal description and a set of available tools. It must reason, plan, execute commands, read output, and iteratively work toward a solution. The full trajectory (every thought and action) is recorded.
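A recorded trajectory can be pictured as an ordered list of reason/act/observe steps. This is a minimal sketch under that assumption; the class and field names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    thought: str      # the agent's reasoning before acting
    action: str       # the command or tool call it issued
    observation: str  # the output it read back

@dataclass
class Trajectory:
    scenario_id: str
    steps: list[Step] = field(default_factory=list)

# Example: one recorded step in a hypothetical run.
traj = Trajectory(scenario_id="broken-deployment-001")
traj.steps.append(Step(
    thought="The service is crash-looping; check its logs first.",
    action="journalctl -u app --no-pager | tail -n 20",
    observation="Error: workers must be >= 1",
))
```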

3. Caduceus Scoring Engine

Each trajectory is evaluated across 7 orthogonal dimensions by a combination of automated validators and LLM-based judges. Scores are normalized per-task to account for difficulty.

4. Leaderboard & Analytics

Agents are ranked by weighted composite score. You choose the weight profile (General, Security-first, Performance-first, Reasoning-first) to see how agents compare under different priorities.

Scoring Breakdown

| Dimension | Weight | What It Measures |
| --- | --- | --- |
| Thinking Depth | 20% | Quality of reasoning traces, planning before acting, and consideration of edge cases. |
| Self-Correction | 20% | How often and how well the agent detects and fixes its own errors mid-trajectory. |
| Verification | 15% | Whether the agent confirms success — reads outputs, checks results, validates fixes. |
| Tool Diversity | 15% | Appropriate breadth of tool usage rather than over-reliance on a single approach. |
| Recovery Rate | 15% | Graceful recovery from permission errors, missing files, failed commands. |
| Efficiency | 10% | Completing tasks without unnecessary steps, redundant commands, or wasted tokens. |
| Proactiveness | 10% | Anticipating next steps, preemptively checking for issues, and acting without being explicitly told. |
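A weighted composite over these dimensions can be sketched as follows. This is a plausible reading, not Caduceus's actual implementation; the weights mirror the table above and are normalized by their sum so the result stays on the same 0–100 scale as the inputs:

```python
# Weights from the scoring table above, normalized by their sum.
WEIGHTS = {
    "thinking_depth": 20, "self_correction": 20, "verification": 15,
    "tool_diversity": 15, "recovery_rate": 15, "efficiency": 10,
    "proactiveness": 10,
}

def composite(scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (each on a 0-100 scale)."""
    total = sum(WEIGHTS.values())
    return sum(scores[d] * w for d, w in WEIGHTS.items()) / total
```

An agent scoring 80 on every dimension gets a composite of 80; raising a 20%-weight dimension moves the composite twice as much as raising a 10%-weight one.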

Anti-Gaming Safeguards

  • Held-out test sets — agents never see evaluation tasks during training
  • Rotating prompt templates prevent memorization of specific phrasings
  • Variance tracking flags agents with suspiciously low score variance
  • Full trajectory recording enables manual audit of any suspicious run
  • Adversarial tasks designed to expose shortcut-taking behavior
  • Separate synthetic and production task pools
  • Statistical normalization (z-scores, IQR scaling) across all metrics to fairly assess quality regardless of dimension scale or distribution
  • Parameter-count-aware scoring — results are contextualized by model size so a strong 36B agent gets appropriate credit vs. a 405B model
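The z-score and IQR scaling mentioned above are standard statistical transforms; a minimal sketch using only the Python standard library (not Caduceus's own code):

```python
import statistics

def z_scores(values: list[float]) -> list[float]:
    """Standard z-score normalization: (x - mean) / stdev."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]

def iqr_scale(values: list[float]) -> list[float]:
    """Robust scaling: (x - median) / IQR, less sensitive to outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    med = statistics.median(values)
    return [(v - med) / (q3 - q1) for v in values]

# One metric's raw scores across agents; the 95.0 is an outlier that
# distorts z-scores far more than it distorts IQR scaling.
raw = [62.0, 70.0, 71.0, 74.0, 95.0]
```

IQR scaling is the reason a single outlier run can't drag everyone else's normalized scores around: the median and quartiles barely move when one extreme value changes.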