Token Metrics
AgentLens tracks token usage and timing at every level of the pipeline – per LLM call, per agent role, and across the entire pipeline. This makes it possible to compare model costs, identify bottlenecks, and optimize per-role model assignments.
Per-Call Metrics
Every LLM call to the Ollama API returns:
- `eval_count` – number of tokens generated (completion tokens)
- `eval_duration` – time spent generating tokens (nanoseconds)
- `prompt_eval_count` – number of prompt tokens processed
- `prompt_eval_duration` – time spent processing the prompt (nanoseconds)
These are captured for every call: each ReAct iteration, each Grader invocation, the Judge call, and the Fallback call (if triggered).
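As a minimal sketch, the per-call fields above can be turned into derived metrics like throughput. The field names match the Ollama API response; the helper name and output layout are illustrative, not AgentLens's actual code:

```python
# Derive per-call token metrics from an Ollama /api/generate response body.
# Durations are reported in nanoseconds, so convert before computing rates.
NS_PER_SEC = 1_000_000_000

def call_metrics(resp: dict) -> dict:
    """Summarize one LLM call's token usage and throughput."""
    eval_ns = resp.get("eval_duration", 0)
    return {
        "completion_tokens": resp.get("eval_count", 0),
        "prompt_tokens": resp.get("prompt_eval_count", 0),
        "generation_ms": eval_ns / 1e6,
        # Guard against zero duration (e.g. a cached or empty generation).
        "tokens_per_sec": (resp.get("eval_count", 0) * NS_PER_SEC / eval_ns)
                          if eval_ns else 0.0,
    }

# Example response fragment (values illustrative):
resp = {"eval_count": 120, "eval_duration": 2_000_000_000,
        "prompt_eval_count": 300, "prompt_eval_duration": 150_000_000}
m = call_metrics(resp)
# 120 tokens over 2 s of generation → 60 tokens/sec
```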
Per-Role Aggregation
The orchestrator aggregates metrics by role across all rounds:
- Agent tokens – sum of all ReAct iterations across all rounds (up to 10 calls in the worst case: 5 iterations × 2 rounds)
- Grader tokens – sum of grader calls (1-2, one per round)
- Judge tokens – sum of judge calls (1-2, one per round)
- Fallback tokens – single fallback call (0 or 1)
Each role's total tokens and timing are available in the pipeline response.
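The aggregation step can be sketched as a fold over per-call records. The role names mirror the pipeline stages above; the tuple layout and function name are assumptions for illustration:

```python
# Aggregate per-call metrics by role, as the orchestrator might across rounds.
from collections import defaultdict

def aggregate_by_role(calls):
    """calls: iterable of (role, eval_count, eval_duration_ns) tuples."""
    totals = defaultdict(lambda: {"tokens": 0, "duration_ns": 0, "calls": 0})
    for role, tokens, dur_ns in calls:
        t = totals[role]
        t["tokens"] += tokens
        t["duration_ns"] += dur_ns
        t["calls"] += 1
    return dict(totals)

calls = [
    ("agent", 80, 1_500_000_000),   # ReAct iteration 1
    ("agent", 60, 1_000_000_000),   # ReAct iteration 2
    ("grader", 20, 400_000_000),    # one grader call this round
    ("judge", 30, 900_000_000),     # one judge call this round
]
roles = aggregate_by_role(calls)
# roles["agent"] → 140 tokens across 2 calls
```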
Pipeline Metrics
The `PipelineMetrics` object in the response includes:
- `tokens_generated` – total completion tokens across all roles
- `tokens_per_sec` – overall throughput (total tokens / total time)
- `fallback_used` – whether the Fallback path was triggered
- `fallback_ms` – Fallback LLM call duration (if used)
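Rolling role totals up into pipeline-level metrics might look like the following. The field names follow the list above, but the dataclass shape and `summarize` helper are illustrative, not the actual response schema:

```python
# Roll per-role totals up into pipeline-level metrics.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PipelineMetrics:
    tokens_generated: int
    tokens_per_sec: float
    fallback_used: bool
    fallback_ms: Optional[float] = None

def summarize(role_totals: dict, fallback_ms: Optional[float] = None) -> PipelineMetrics:
    tokens = sum(r["tokens"] for r in role_totals.values())
    dur_ns = sum(r["duration_ns"] for r in role_totals.values())
    return PipelineMetrics(
        tokens_generated=tokens,
        # Convert nanoseconds to seconds for the throughput rate.
        tokens_per_sec=tokens * 1e9 / dur_ns if dur_ns else 0.0,
        fallback_used=fallback_ms is not None,
        fallback_ms=fallback_ms,
    )

totals = {"agent": {"tokens": 140, "duration_ns": 2_000_000_000},
          "judge": {"tokens": 60, "duration_ns": 2_000_000_000}}
pm = summarize(totals)
# 200 tokens over 4 s → 50.0 tokens/sec, no fallback triggered
```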
Live Tab Display
Token metrics appear in two places in the Live tab:
- Stage headers – each completed stage (ReAct, Grader, Judge, Fallback) shows its token count and timing inline with the role-colored header
- Done banner – the final Done event displays total tokens generated and overall tokens-per-second rate for the entire pipeline
This makes it easy to spot which stage dominates token usage. A common finding: the Judge (running a 24B model) generates fewer tokens than the ReAct Agent (running a 3B model with multiple iterations) but takes longer per token due to model size.
Model Comparison
Token metrics enable quantitative model comparison. Run the same query with different per-role model assignments and compare:
- Total tokens generated (cost proxy)
- Tokens per second (throughput)
- Per-stage timing (latency breakdown)
- Answer quality vs. token cost tradeoff
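A comparison across runs reduces to tabulating each run's pipeline metrics side by side. A minimal sketch, where the run names and numbers are purely illustrative:

```python
# Tabulate pipeline metrics for the same query under different
# per-role model assignments.
def compare(runs: dict) -> str:
    header = f"{'run':<12}{'tokens':>8}{'tok/s':>8}{'total_ms':>10}"
    rows = [header]
    for name, r in runs.items():
        rows.append(f"{name:<12}{r['tokens']:>8}"
                    f"{r['tokens_per_sec']:>8.1f}{r['total_ms']:>10.0f}")
    return "\n".join(rows)

runs = {
    "all-3b":    {"tokens": 420, "tokens_per_sec": 85.0, "total_ms": 4900},
    "24b-judge": {"tokens": 380, "tokens_per_sec": 42.0, "total_ms": 9000},
}
table = compare(runs)
print(table)
```

A table like this makes the tradeoff concrete: the larger-judge assignment may generate fewer tokens yet cost more wall-clock time, which is exactly the latency/quality tradeoff the last bullet refers to.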