Keyword Search

The Keyword Search tool uses BM25 (Best Matching 25) via the rank-bm25 library (BM25Okapi variant) to find documents by exact term overlap. The Retrieval Agent calls this tool when the query contains specific terms, names, or technical identifiers that benefit from literal matching.

How BM25 Works

BM25 scores documents based on term frequency (how often query terms appear in the document) and inverse document frequency (how rare those terms are across the corpus). Common words contribute less to the score than distinctive terms.
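The interaction between term frequency, inverse document frequency, and length normalization can be made concrete with a small hand-rolled scorer. This is a minimal sketch of the textbook Okapi formula for a single query term, not the rank-bm25 source. The function name, argument names, and the "+1 inside the log" IDF form are assumptions chosen for illustration, and rank-bm25's own IDF handling may differ.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document (illustrative, not rank-bm25's code)."""
    # Inverse document frequency: terms found in fewer documents get a larger weight.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: documents longer than average are penalized, scaled by b.
    norm = 1 - b + b * (doc_len / avg_doc_len)
    # Term frequency saturates: the fraction approaches (k1 + 1) as tf grows.
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)

# A rare term (5 of 1000 docs) versus a common one (900 of 1000 docs).
rare = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=5)
common = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=900)

# One mention versus ten mentions of the same term.
one = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=5)
ten = bm25_term_score(tf=10, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=5)
```

Here `rare` comes out well above `common`, and `ten` is less than twice `one`, which is the saturation effect: repetition helps, but with quickly diminishing returns.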

BM25 Preprocessing Pipeline

AgentLens uses a custom 5-stage tokenizer (not Porter Stemmer). The same tokenizer runs on both index and query sides to ensure consistent matching.

BM25 Preprocessing Pipeline

Setting	Value	Notes
Tokenizer	`custom 5-stage` (AgentLens)	Lowercase, punctuation stripping (preserves hyphens/underscores), compound ID expansion (TC-409-USR-010 becomes [TC-409-USR-010, TC, 409, USR, 010]), stop word removal (42-word frozenset), suffix stemming (20+ rules, longest-first, minimum 3-char stem). Industry default is Porter Stemmer (studying -> study).
Compound splitting	`custom regex` (AgentLens)	Split hyphenated IDs, error codes, and SKUs into searchable sub-tokens. Split on hyphens, underscores, and camelCase boundaries while keeping the original compound as an additional token.
Lowercasing	`always`	Case-insensitive matching. Benchmark data shows lowercasing alone provides the largest single lift in BM25 accuracy.
Stopwords	`42-word frozenset` (AgentLens)	Remove common words ("the", "is", "at") that add noise. Industry default: NLTK English stopword list. Relatedly, Elasticsearch's BM25 uses an IDF formulation that never goes negative, which keeps very common terms from distorting scores.
BM25 variant	`BM25Okapi`	Standard variant. Consider BM25Plus if your chunks are short (< 100 tokens). It ensures matched terms always contribute a positive score, reducing bias against short documents.
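The five stages described above can be sketched as a single function. This is a hedged illustration, not the AgentLens tokenizer: the stop word set and suffix list are small stand-ins for the 42-word frozenset and 20+ rules, the names `tokenize`, `expand_compound`, and `stem` are hypothetical, and camelCase splitting is left out because it would have to run before lowercasing.

```python
import re

# Illustrative subsets only; the real lists are larger (42 stop words, 20+ suffix rules).
STOP_WORDS = frozenset({"the", "is", "at", "a", "an", "of", "in", "for", "to", "and"})
SUFFIXES = sorted(["ational", "ization", "ing", "ed", "es", "ly", "s"], key=len, reverse=True)
MIN_STEM = 3  # never strip a suffix if fewer than 3 characters would remain

def expand_compound(token):
    """Keep the original compound and add its hyphen/underscore-separated parts."""
    parts = [p for p in re.split(r"[-_]", token) if p]
    return [token] + parts if len(parts) > 1 else [token]

def stem(token):
    """Strip the longest matching suffix, leaving at least MIN_STEM characters."""
    if not token.isalpha():
        return token  # leave IDs, numbers, and compounds untouched
    for suffix in SUFFIXES:  # already ordered longest-first
        if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM:
            return token[: -len(suffix)]
    return token

def tokenize(text):
    text = text.lower()                                  # 1. lowercase
    text = re.sub(r"[^\w\s-]", " ", text)                # 2. strip punctuation, keep - and _
    tokens = []
    for raw in text.split():
        raw = raw.strip("-_")                            # drop dangling separators
        if raw:
            tokens.extend(expand_compound(raw))          # 3. compound ID expansion
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 4. stop word removal
    return [stem(t) for t in tokens]                     # 5. suffix stemming
```

Because lowercasing runs first in this sketch, the expanded sub-tokens come out lowercase. Calling the same `tokenize` on both documents and queries is what keeps the two sides consistent.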

Key insight: The tokenizer determines what BM25 can even search for.

If the tokenizer does not split compound IDs like TC-409-USR-010 into tc, 409, usr, 010, then a query for only part of the ID (for example USR-010) shares no token with the document, and BM25 cannot find it. The score will be 0 no matter what k1 and b are set to.
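A short comparison shows why. The sketch below contrasts a whitespace-only tokenizer with one that also emits sub-tokens, and checks which terms a partial-ID query shares with a document. Both tokenizers, the helper `shared_terms`, and the sample strings are hypothetical and exist only for this illustration.

```python
import re

def naive_tokens(text):
    """Lowercase and split on whitespace only; compound IDs stay as one token."""
    return text.lower().split()

def splitting_tokens(text):
    """Same as above, but also emit the parts of hyphen/underscore compounds."""
    tokens = []
    for tok in text.lower().split():
        tokens.append(tok)  # keep the original compound
        parts = [p for p in re.split(r"[-_]", tok) if p]
        if len(parts) > 1:
            tokens.extend(parts)
    return tokens

def shared_terms(query, doc, tokenizer):
    """Terms BM25 could score: the overlap between query and document tokens."""
    return set(tokenizer(query)) & set(tokenizer(doc))

doc = "Test case TC-409-USR-010 fails on login"
query = "USR-010"

without_split = shared_terms(query, doc, naive_tokens)
with_split = shared_terms(query, doc, splitting_tokens)
```

With the naive tokenizer the overlap is empty, so every term contributes nothing to the score. With splitting, `usr` and `010` are shared and BM25 has something to rank on.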

BM25 Search Configuration

BM25 Search Parameters

Parameter	Value	Notes
k1	`1.2` (industry default)	Term frequency saturation. Controls how much repeated words boost the score before diminishing returns. Higher k1 = more credit for repetition. Lower k1 = one mention is nearly as good as ten. Range 0-3. For short documents try 1.5-2.0, for long documents 1.0-1.2 is fine.
top_k	`5` (AgentLens)	Number of top-scoring documents to return. Industry standard: retrieve 10-20 candidates initially, then rerank down to 3-5 for the LLM context. AgentLens default: 5.
b	`0.5` (AgentLens)	Document length normalization. Controls how much longer-than-average documents are penalized. b=1.0 means full normalization, b=0.0 means none. Industry default is 0.75. AgentLens uses 0.5 because all chunks are pre-sized to a uniform 512-word length. If chunks are uniform, lower to 0.3-0.5. If lengths vary widely, keep 0.75.
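As a rough sketch, the three parameters above might be wired up as follows. The constant names and the helpers `build_index` and `top_k_hits` are hypothetical. `BM25Okapi(corpus, k1=..., b=...)` and `get_scores` are part of rank-bm25, whose own constructor defaults (k1=1.5, b=0.75 at the time of writing) differ from the values here, so they are passed explicitly.

```python
import heapq

K1 = 1.2    # term frequency saturation
B = 0.5     # length normalization, lowered for uniformly sized chunks
TOP_K = 5   # number of results handed back to the agent

def build_index(tokenized_chunks):
    """Build a BM25Okapi index over pre-tokenized chunks (lists of tokens)."""
    # Imported lazily so the rest of this sketch runs without rank-bm25 installed.
    from rank_bm25 import BM25Okapi
    return BM25Okapi(tokenized_chunks, k1=K1, b=B)

def top_k_hits(scores, k=TOP_K):
    """Return (chunk_index, score) pairs for the k highest scores, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
```

A typical call would be `top_k_hits(build_index(chunks).get_scores(query_tokens))`, where `chunks` and `query_tokens` come from the same tokenizer.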

Score Thresholds

The Retrieval Agent's system prompt embeds these BM25 score thresholds so it can reason about result quality:

BM25 Score Thresholds

Score	Quality	Meaning
`15+`	strong	Exact ID/code match
`8-15`	good	Multiple term matches
`3-8`	partial	Some keyword overlap
`1-3`	weak	Marginal relevance
`0`	none	No matching tokens

Industry practice: set a minimum threshold of 1.5-2.0 for production. Results with score 0 are filtered out before being returned to the agent.
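The bands and the minimum-score filter can be expressed in a few lines. This is an illustrative sketch, not the Retrieval Agent's code: `classify_score` and `drop_low_scores` are hypothetical names, lower bounds are treated as inclusive because the table leaves boundary values open, and scores between 0 and 1 are assumed to count as weak.

```python
# (lower bound, label), checked from the highest band down.
# Assumption: a score exactly on a boundary belongs to the higher band.
BANDS = [(15.0, "strong"), (8.0, "good"), (3.0, "partial")]

def classify_score(score):
    """Map a raw BM25 score to the quality label the agent reasons with."""
    if score <= 0:
        return "none"
    for lower, label in BANDS:
        if score >= lower:
            return label
    return "weak"  # assumption: anything above 0 but below 3

def drop_low_scores(hits, min_score=1.5):
    """Keep only (doc_id, score) pairs at or above the production threshold."""
    return [(doc_id, score) for doc_id, score in hits if score >= min_score]
```

Raw BM25 scores depend on corpus size and term statistics, so bands like these are worth re-checking whenever the index changes substantially.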

Production Checklist

Setting	Recommended	Impact
BM25 k1	`1.2-1.5`	Keep default unless chunks are very short
BM25 b	`0.5` if chunks are uniform	Less penalty for length if chunks are pre-sized
Tokenizer	Custom compound splitting	Enables exact-ID lookup for hyphenated codes

Interactive Comparison

See the BM25 vs Vector Search page for an interactive demo comparing keyword and semantic scoring side by side.

Sources

Elastic Blog: Practical BM25 Part 3: Picking b and k1
Trotman, Puurula, Burgess: "Improvements to BM25 and Language Models Examined" (2014)
Microsoft Azure: Configure BM25 Relevance Scoring
AutoRAG: BM25 Tokenizer Benchmarks
LangChain: BM25 Retriever Integration