Agents Overview
AgentLens uses a two-agent architecture built from scratch, with no LangGraph, LlamaIndex, or other frameworks. The TwoAgentOrchestrator manages the pipeline's structured feedback loop, coordinating a Retrieval Agent, a Grader, and a Quality Judge.
Pipeline Flow
When a query arrives, a heuristic classifier first decides whether retrieval is needed. Greetings and short non-questions bypass the pipeline and go straight to a single direct LLM call. For retrieval queries, the orchestrator runs the following loop (sketched in code after the list):
1. The Retrieval Agent runs a ReAct loop (1–5 tool calls) to find relevant documents.
2. The Grader scores each retrieved chunk on a 1–5 relevance scale and filters out irrelevant ones.
3. The Quality Judge evaluates the filtered chunks, generates an answer, and issues ACCEPT or RETRY.
4. On RETRY, the Retrieval Agent runs again with the Judge's targeted feedback.
5. If the Judge produces no extractable answer, the Fallback model generates a final answer via the PromptAssembler.
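A condensed sketch of that control flow is below. All names here are illustrative assumptions, not the actual AgentLens API: `classifier`, `retrieval_agent.run`, `grader.filter`, `judge.evaluate`, the `Verdict` shape, and the relevance cutoff are stand-ins chosen to mirror the steps above.

```python
from dataclasses import dataclass

# Illustrative shapes only; the real AgentLens types may differ.
@dataclass
class Verdict:
    decision: str            # "ACCEPT" or "RETRY"
    answer: str | None       # extractable answer, if any
    feedback: str | None     # targeted retrieval hints on RETRY

MAX_ROUNDS = 2               # one initial round plus one RETRY

def answer_query(query, classifier, retrieval_agent, grader, judge, fallback):
    if not classifier.needs_retrieval(query):    # heuristic: greetings, short non-questions
        return fallback.direct_answer(query)     # single direct LLM call, no retrieval

    feedback, relevant = None, []
    for _ in range(MAX_ROUNDS):
        chunks = retrieval_agent.run(query, feedback)   # ReAct loop, 1-5 tool calls
        relevant = grader.filter(chunks)                # keeps chunks scored relevant (1-5 scale)
        verdict = judge.evaluate(query, relevant)       # generates answer, issues ACCEPT/RETRY
        if verdict.decision == "ACCEPT" and verdict.answer:
            return verdict.answer
        feedback = verdict.feedback                     # fed back into the next retrieval round

    # Judge produced no extractable answer: hand off to the fallback model,
    # which builds its prompt via the PromptAssembler.
    return fallback.answer_from_chunks(query, relevant)
```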
Default Models
Each pipeline role runs on its own model, chosen to balance reasoning quality with inference speed. The defaults scale up in size as the pipeline progresses from retrieval to final judgment:
| Role | Default Model | Params | Why |
|---|---|---|---|
| ReAct Agent | nemotron-3-nano:30b | 30B | Fast tool-use planning with thinking support |
| Grader | qwen3-next:80b | 80B | Accurate chunk relevance scoring |
| Judge | gpt-oss:120b | 117B | Strong answer generation + verdict reasoning |
| Fallback | gemini-3-flash-preview | Cloud | Direct LLM answer when judge defers |
Models can be overridden per request via the UI dropdowns or per deployment via environment variables. See Models Overview for the full override priority chain.
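As a sketch, the priority chain could be resolved as below. The defaults come from the table above; the env var naming scheme (`AGENTLENS_<ROLE>_MODEL`) is an assumption for illustration, and the actual variable names are documented in Models Overview.

```python
import os

# Defaults from the table above.
DEFAULT_MODELS = {
    "react": "nemotron-3-nano:30b",
    "grader": "qwen3-next:80b",
    "judge": "gpt-oss:120b",
    "fallback": "gemini-3-flash-preview",
}

def resolve_model(role: str, request_override: str | None = None) -> str:
    """Priority: per-request override > deployment env var > built-in default."""
    if request_override:                                  # UI dropdown selection
        return request_override
    env_value = os.getenv(f"AGENTLENS_{role.upper()}_MODEL")  # illustrative name
    if env_value:                                         # per-deployment override
        return env_value
    return DEFAULT_MODELS[role]
```

With no overrides set, `resolve_model("judge")` returns the table default, `gpt-oss:120b`.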
LLM Call Budget
The common case uses 2–3 LLM calls (one retrieval round, plus the Grader and an accepting Judge verdict). The worst case is 7 calls: two full rounds of retrieval, grading, and judging (six calls) plus one fallback call. This is fewer than the 4–8 calls used in the prior single-agent design.
Fail-Open Design
If any agent produces unparseable output, the system defaults to accepting with a low confidence score of 0.1 rather than failing the query. This ensures users always get an answer, even when a model misbehaves.
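A minimal sketch of that fail-open path, assuming agents emit JSON verdicts; `parse_verdict` and the `Verdict` shape are illustrative, not the actual AgentLens parser:

```python
import json
from dataclasses import dataclass

@dataclass
class Verdict:
    decision: str        # "ACCEPT" or "RETRY"
    confidence: float

def parse_verdict(raw_output: str) -> Verdict:
    try:
        data = json.loads(raw_output)
        return Verdict(decision=data["decision"], confidence=float(data["confidence"]))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Fail open: unparseable output becomes an ACCEPT with confidence 0.1,
        # so the user still gets an answer instead of an error.
        return Verdict(decision="ACCEPT", confidence=0.1)
```

The accepting default means a malformed model response degrades into a low-confidence answer rather than a failed query.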