Docs / Features / Token Metrics

Token Metrics

AgentLens tracks token usage and timing at every level of the pipeline – per LLM call, per agent role, and across the entire pipeline. This makes it possible to compare model costs, identify bottlenecks, and optimize per-role model assignments.

Per-Call Metrics

Every LLM call to the Ollama API returns:

  • eval_count – number of tokens generated (completion tokens)
  • eval_duration – time spent generating tokens (nanoseconds)
  • prompt_eval_count – number of prompt tokens processed
  • prompt_eval_duration – time spent processing the prompt (nanoseconds)

These are captured for every call: each ReAct iteration, each Grader invocation, the Judge call, and the Fallback call (if triggered).
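Since Ollama reports durations in nanoseconds, a small amount of arithmetic turns one raw response into human-readable per-call metrics. A minimal sketch (the `call_throughput` helper and the sample payload are illustrative, not part of AgentLens; the four field names are the Ollama response fields listed above):

```python
def call_throughput(resp: dict) -> dict:
    """Derive token counts and tokens/sec from one Ollama API response."""
    ns = 1e9  # Ollama durations are in nanoseconds
    return {
        "completion_tokens": resp["eval_count"],
        "prompt_tokens": resp["prompt_eval_count"],
        "gen_tokens_per_sec": resp["eval_count"] / (resp["eval_duration"] / ns),
        "prompt_tokens_per_sec": resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / ns),
    }

# Fabricated example payload: 120 tokens generated in 2 s, 300 prompt
# tokens processed in 0.5 s.
resp = {
    "eval_count": 120,
    "eval_duration": 2_000_000_000,
    "prompt_eval_count": 300,
    "prompt_eval_duration": 500_000_000,
}
metrics = call_throughput(resp)
# 120 tokens / 2 s = 60 tokens/sec generation throughput
```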

Per-Role Aggregation

The orchestrator aggregates metrics by role across all rounds:

  • Agent tokens – sum of all ReAct iterations across all rounds (1-10 calls; the worst case is 5 iterations × 2 rounds)
  • Grader tokens – sum of grader calls (1-2, one per round)
  • Judge tokens – sum of judge calls (1-2, one per round)
  • Fallback tokens – single fallback call (0 or 1)

Each role's total tokens and timing are available in the pipeline response.
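The aggregation step above can be sketched as a fold over the per-call records. The role labels and the tuple shape here are assumptions for illustration, not AgentLens's actual internal data model:

```python
from collections import defaultdict

def aggregate_by_role(calls):
    """calls: iterable of (role, eval_count, eval_duration_ns) tuples."""
    totals = defaultdict(lambda: {"tokens": 0, "duration_ns": 0, "calls": 0})
    for role, tokens, duration_ns in calls:
        totals[role]["tokens"] += tokens
        totals[role]["duration_ns"] += duration_ns
        totals[role]["calls"] += 1
    return dict(totals)

# Fabricated single-round trace: two ReAct iterations, one grader, one judge.
calls = [
    ("agent", 80, 1_500_000_000),
    ("agent", 60, 1_000_000_000),
    ("grader", 40, 800_000_000),
    ("judge", 30, 1_200_000_000),
]
totals = aggregate_by_role(calls)
# totals["agent"] sums both ReAct iterations: 140 tokens over 2 calls
```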

Pipeline Metrics

The PipelineMetrics object in the response includes:

  • tokens_generated – total completion tokens across all roles
  • tokens_per_sec – overall throughput (total tokens / total time)
  • fallback_used – whether the Fallback path was triggered
  • fallback_ms – Fallback LLM call duration in milliseconds (if used)
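The fields above follow directly from the per-role totals. A sketch of how they might be assembled, assuming the per-role shape from the previous section (the dataclass mirrors the documented field names but is otherwise illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PipelineMetrics:
    tokens_generated: int   # total completion tokens across all roles
    tokens_per_sec: float   # total tokens / total time
    fallback_used: bool
    fallback_ms: Optional[float]

def summarize(role_totals: dict, fallback_ms: Optional[float] = None) -> PipelineMetrics:
    tokens = sum(r["tokens"] for r in role_totals.values())
    total_s = sum(r["duration_ns"] for r in role_totals.values()) / 1e9
    return PipelineMetrics(
        tokens_generated=tokens,
        tokens_per_sec=tokens / total_s if total_s else 0.0,
        fallback_used=fallback_ms is not None,
        fallback_ms=fallback_ms,
    )

# Fabricated per-role totals: 180 tokens over 3.3 s of generation time.
role_totals = {
    "agent": {"tokens": 140, "duration_ns": 2_500_000_000},
    "grader": {"tokens": 40, "duration_ns": 800_000_000},
}
m = summarize(role_totals)
# 180 / 3.3 ≈ 54.5 tokens/sec; no fallback triggered
```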

Live Tab Display

Token metrics appear in two places in the Live tab:

  • Stage headers – each completed stage (ReAct, Grader, Judge, Fallback) shows its token count and timing inline with the role-colored header
  • Done banner – the final Done event displays total tokens generated and overall tokens-per-second rate for the entire pipeline

This makes it easy to spot which stage dominates token usage. A common finding: the Judge (running a 24B model) generates fewer tokens than the ReAct Agent (running a 3B model with multiple iterations) but takes longer per token due to model size.

Model Comparison

Token metrics enable quantitative model comparison. Run the same query with different per-role model assignments and compare:

  • Total tokens generated (cost proxy)
  • Tokens per second (throughput)
  • Per-stage timing (latency breakdown)
  • Answer quality vs. token cost tradeoff
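A comparison along the first two axes can be as simple as ranking runs by throughput. The run labels and numbers below are fabricated for illustration; `compare` is a hypothetical helper, not an AgentLens API:

```python
def compare(runs: dict) -> list:
    """runs: {label: {"tokens_generated": int, "total_s": float}}.
    Returns (label, tokens, tokens/sec) rows, fastest first."""
    rows = [
        (label, r["tokens_generated"],
         round(r["tokens_generated"] / r["total_s"], 1))
        for label, r in runs.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Same query, two per-role model assignments (fabricated results).
runs = {
    "judge=24b, agent=3b": {"tokens_generated": 420, "total_s": 21.0},
    "judge=7b, agent=3b":  {"tokens_generated": 450, "total_s": 12.5},
}
for label, tokens, tps in compare(runs):
    print(f"{label}: {tokens} tokens @ {tps} tok/s")
```

Pairing a table like this with a side-by-side read of the answers gives the quality-vs-cost tradeoff in the last bullet.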