Quality Judge

The Quality Judge evaluates the pre-filtered chunks from the Grader in a single LLM call. It checks relevance, groundedness, and completeness, then generates an answer and issues a verdict.

Default Model

The Judge runs on gpt-oss:120b (117B, OpenAI) by default – the largest model in the pipeline. Placing the largest model at the judgment stage is intended to give stronger reasoning where it matters most: generating the answer and deciding the verdict. In the AgentLens Live tab, Judge events appear with green headers.

Verdict Protocol

The Judge outputs a structured response. The format differs based on the verdict:

On ACCEPT (answer is satisfactory):

VERDICT: ACCEPT
CONFIDENCE – a score between 0.0 and 1.0
ANSWER – detailed answer citing sources as [filename]
ASSESSMENT – one sentence about groundedness

On RETRY (more retrieval needed):

VERDICT: RETRY
FEEDBACK – specific guidance for the Retrieval Agent on what to search for differently

On ACCEPT, the Judge's answer becomes the final response. On RETRY, the feedback is forwarded to the Retrieval Agent for a targeted second round.
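The two response shapes and the routing rule above can be sketched as a small data structure. This is only an illustration: JudgeResult and next_step are hypothetical names, not identifiers from the pipeline, and the action labels are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class JudgeResult:
    """Hypothetical container for a parsed Judge response."""
    verdict: str                         # "ACCEPT" or "RETRY"
    confidence: Optional[float] = None   # set on ACCEPT only
    answer: Optional[str] = None         # set on ACCEPT only
    assessment: Optional[str] = None     # set on ACCEPT only
    feedback: Optional[str] = None       # set on RETRY only


def next_step(result: JudgeResult) -> Tuple[str, Optional[str]]:
    """Return (action, payload) describing what happens after the Judge."""
    if result.verdict == "RETRY":
        # Feedback goes back to the Retrieval Agent for a second round.
        return ("retrieve_again", result.feedback)
    # On ACCEPT the Judge's answer is the final response.
    return ("respond", result.answer)
```

Keeping the ACCEPT-only and RETRY-only fields optional mirrors the fact that each verdict carries a different set of fields.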

Retry Budget

The TwoAgentOrchestrator allows at most 2 rounds: the initial pass plus one retry. On the final round, the Judge receives a FINAL_ROUND_INSTRUCTION that forces it to respond with ACCEPT and provide the best possible answer from the available chunks, even if the evidence is incomplete.
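A minimal sketch of how such a round budget could be enforced follows. It is not the TwoAgentOrchestrator code: run_rounds, the retrieve and judge callables, and their signatures are assumptions made for illustration, and the instruction text is a placeholder because the real wording is not shown here.

```python
from typing import Callable, List, Optional, Tuple

MAX_ROUNDS = 2
# Placeholder wording; the actual FINAL_ROUND_INSTRUCTION text is an assumption.
FINAL_ROUND_INSTRUCTION = "Final round: respond with ACCEPT using the available chunks."

Retrieve = Callable[[str, Optional[str]], List[str]]
Judge = Callable[[str, List[str], Optional[str]], Tuple[str, Optional[str]]]


def run_rounds(question: str, retrieve: Retrieve, judge: Judge) -> Tuple[Optional[str], int]:
    """Run up to MAX_ROUNDS retrieve/judge rounds; return (answer, rounds_used)."""
    feedback: Optional[str] = None
    for round_no in range(1, MAX_ROUNDS + 1):
        chunks = retrieve(question, feedback)
        # Only the last round carries the instruction that forces ACCEPT.
        extra = FINAL_ROUND_INSTRUCTION if round_no == MAX_ROUNDS else None
        verdict, payload = judge(question, chunks, extra)
        if verdict == "ACCEPT":
            return payload, round_no
        feedback = payload  # RETRY: payload is guidance for the next retrieval
    # Assumption: if the budget runs out without ACCEPT, there is no answer.
    return None, MAX_ROUNDS
```

Passing the feedback into the next retrieve call is what makes the second round targeted rather than a repeat of the first.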

Fallback Trigger

If the Judge produces no extractable answer (the ANSWER field is empty or unparseable), the pipeline moves to the Fallback stage. This is distinct from RETRY – fallback triggers when the Judge accepts but has nothing to return.
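The trigger condition can be written as a single predicate. The function name should_fallback is hypothetical, and treating a whitespace-only answer as empty is an assumption.

```python
from typing import Optional


def should_fallback(verdict: str, answer: Optional[str]) -> bool:
    """True when the Judge accepted but returned nothing usable."""
    has_answer = bool(answer and answer.strip())
    # RETRY never triggers fallback; it goes back to retrieval instead.
    return verdict == "ACCEPT" and not has_answer
```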

Fail-Open Parsing

If part of the Judge's output cannot be parsed, each field falls back to a default:

Unparseable VERDICT defaults to ACCEPT
Unparseable CONFIDENCE defaults to 0.1
Unparseable ANSWER defaults to None (triggers fallback)
Unparseable FEEDBACK defaults to "Search for more specific information"

This ensures the pipeline never stalls on a misbehaving model.
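The defaults above could be applied by a parser along these lines. This is a sketch under stated assumptions, not the pipeline's parser: it assumes each field appears as a "NAME: value" line, the function names are hypothetical, and treating an out-of-range confidence as unparseable is a choice made for the example.

```python
import re
from typing import Optional

DEFAULT_FEEDBACK = "Search for more specific information"
_LABELS = "VERDICT|CONFIDENCE|ANSWER|ASSESSMENT|FEEDBACK"


def _field(text: str, name: str) -> Optional[str]:
    """Return the text after 'NAME:' up to the next field label, or None."""
    pattern = rf"^{name}\s*:\s*(.*?)(?=^(?:{_LABELS})\s*:|\Z)"
    match = re.search(pattern, text, flags=re.MULTILINE | re.DOTALL)
    if not match:
        return None
    return match.group(1).strip() or None


def parse_judge_output(text: str) -> dict:
    """Parse the Judge's response, failing open on every field."""
    raw_verdict = (_field(text, "VERDICT") or "").upper()
    verdict = raw_verdict if raw_verdict in ("ACCEPT", "RETRY") else "ACCEPT"

    try:
        confidence = float(_field(text, "CONFIDENCE") or "")
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence out of range")
    except ValueError:
        confidence = 0.1  # documented default for unparseable CONFIDENCE

    return {
        "verdict": verdict,
        "confidence": confidence,
        "answer": _field(text, "ANSWER"),  # None triggers fallback
        "feedback": _field(text, "FEEDBACK") or DEFAULT_FEEDBACK,
    }
```

With this shape, a response containing none of the expected labels still yields a complete result: ACCEPT with confidence 0.1 and no answer, which hands control to the Fallback stage.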