Keyword Search

The Keyword Search tool uses BM25 (Best Matching 25) via the rank-bm25 library (BM25Okapi variant) to find documents by exact term overlap. The Retrieval Agent calls this tool when the query contains specific terms, names, or technical identifiers that benefit from literal matching.

How BM25 Works

BM25 scores each document by combining term frequency (how often the query terms appear in the document), inverse document frequency (how rare those terms are across the corpus), and a normalization for document length. Common words contribute less to the score than distinctive terms.
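
To make the scoring concrete, here is a minimal pure-Python sketch of a BM25 scorer. It is an illustration, not the rank-bm25 implementation: the function name `bm25_scores` and the toy corpus are hypothetical, and the IDF uses the non-negative form with 1 added inside the logarithm, which is an assumption that differs from how rank-bm25's BM25Okapi handles very common terms.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.2, b=0.75):
    """Score every tokenized document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        # Longer-than-average documents get a larger denominator.
        length_norm = 1 - b + b * len(doc) / avgdl
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue  # a term that is absent contributes nothing
            # Rare terms get a higher IDF; the "1 +" keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical toy corpus, already tokenized.
corpus = [
    "the login error appears in the login form".split(),
    "the payment page shows the total".split(),
    "the error code tc409 is in the release notes".split(),
]
common = bm25_scores(["the"], corpus)    # appears in every document
rare = bm25_scores(["tc409"], corpus)    # appears in one document
```

In this toy corpus the single occurrence of `tc409` outscores the two occurrences of `the` in the same document, because `the` appears in every document and so carries very little IDF weight.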

BM25 Preprocessing Pipeline

AgentLens uses a custom 5-stage tokenizer with its own rule-based suffix stemming instead of the Porter stemmer. The same tokenizer runs on both the index side and the query side so that terms match consistently.

Tokenizer: custom 5-stage (AgentLens)
Lowercasing, punctuation stripping (hyphens and underscores are preserved), compound ID expansion (TC-409-USR-010 becomes [tc-409-usr-010, tc, 409, usr, 010] after lowercasing), stop word removal (42-word frozenset), and suffix stemming (20+ rules, applied longest-first, minimum 3-character stem). The industry default is the Porter stemmer (studying -> studi).

Compound splitting: custom regex (AgentLens)
Splits hyphenated IDs, error codes, and SKUs into searchable sub-tokens. Splits on hyphens, underscores, and camelCase boundaries while keeping the original compound as an additional token.

Lowercasing: always
Case-insensitive matching. Benchmark data shows lowercasing alone provides the largest single lift in BM25 accuracy.

Stopwords: 42-word frozenset (AgentLens)
Removes common words ("the", "is", "at") that add noise. Industry default: NLTK English stopword list. Lucene-based engines such as Elasticsearch limit the influence of very common terms on the scoring side as well: their IDF formula adds 1 inside the logarithm, so it never goes negative.

BM25 variant: BM25Okapi
Standard variant. Consider BM25Plus if your chunks are short (< 100 tokens). It ensures that every matched term contributes a positive score, which limits how much length normalization can penalize a document.

Key insight: The tokenizer determines what BM25 can even search for.

If the tokenizer does not expand a compound ID such as TC-409-USR-010 into tc, 409, usr, and 010, a query that contains only part of the ID shares no token with the document. BM25 then scores that document 0 for the query term, no matter what k1 and b are set to.
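
The five stages can be sketched roughly as follows. This is an illustrative approximation, not the AgentLens tokenizer: the stop word set and suffix list are short hypothetical stand-ins for the real 42-word frozenset and 20+ rules, the helper names are invented, and camelCase splitting is left out for brevity.

```python
import re

# Hypothetical stand-ins; the real stop list and suffix rules are not shown in this page.
STOP_WORDS = frozenset({"the", "is", "at", "a", "an", "of", "in", "for", "to", "and"})
SUFFIXES = sorted(["ing", "ed", "es", "s", "ly"], key=len, reverse=True)  # longest first
MIN_STEM = 3


def expand_compound(token):
    """Return the token plus its sub-parts if it splits on hyphens or underscores."""
    parts = [p for p in re.split(r"[-_]", token) if p]
    return [token] + parts if len(parts) > 1 else [token]


def stem(token):
    """Strip the longest matching suffix, keeping at least MIN_STEM characters."""
    if not token.isalpha():
        return token  # leave IDs and numbers untouched
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    text = text.lower()                                  # 1. lowercase
    text = re.sub(r"[^\w\s-]", " ", text)                # 2. strip punctuation, keep - and _
    tokens = []
    for raw in text.split():
        raw = raw.strip("-_")
        if raw:
            tokens.extend(expand_compound(raw))          # 3. compound ID expansion
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 4. stop word removal
    return [stem(t) for t in tokens]                     # 5. suffix stemming


# The same function is used for documents and for queries.
doc_tokens = tokenize("Error TC-409-USR-010 is failing at login")
query_tokens = tokenize("usr-010")
overlap = set(doc_tokens) & set(query_tokens)
```

Because the compound is expanded on both sides, the partial query `usr-010` overlaps with the document on the sub-tokens `usr` and `010`, even though the full token `usr-010` never appears in the document.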

BM25 Search Configuration

BM25 Search Parameters
k1: 1.2 (industry default)
Term frequency saturation. Controls how much repeated words boost the score before diminishing returns. Higher k1 = more credit for repetition. Lower k1 = one mention is nearly as good as ten. Range 0-3. For short documents try 1.5-2.0, for long documents 1.0-1.2 is fine.

top_k: 5 (AgentLens)
Number of top-scoring documents to return. Industry standard: retrieve 10-20 candidates initially, then rerank down to 3-5 for the LLM context. AgentLens default: 5.

b: 0.5 (AgentLens)
Document length normalization. Controls how much longer-than-average documents are penalized. b=1.0 means full normalization, b=0.0 means none. Industry default is 0.75. AgentLens uses 0.5 because all chunks are pre-sized to uniform 512-word length. If chunks are uniform, lower to 0.3-0.5. If lengths vary widely, keep 0.75.
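
The effect of k1 and b is easiest to see by isolating the term-frequency part of the BM25 formula. The sketch below is illustrative only: `tf_component` is a hypothetical helper, the IDF factor is left out, and the document lengths are made-up values chosen around the 512-word chunk size.

```python
def tf_component(tf, doc_len, avg_len, k1=1.2, b=0.5):
    """Term-frequency part of BM25 for one query term in one document (IDF omitted)."""
    length_norm = 1 - b + b * doc_len / avg_len
    return tf * (k1 + 1) / (tf + k1 * length_norm)


# Saturation: ten mentions score higher than one, but nowhere near ten times higher.
one = tf_component(1, doc_len=512, avg_len=512)
ten = tf_component(10, doc_len=512, avg_len=512)

# Length normalization: a document twice the average length loses more as b grows.
no_norm = tf_component(3, doc_len=1024, avg_len=512, b=0.0)
half_norm = tf_component(3, doc_len=1024, avg_len=512, b=0.5)
full_norm = tf_component(3, doc_len=1024, avg_len=512, b=1.0)
```

The component can never exceed k1 + 1, which is why extra repetitions stop helping. When a document's length equals the average, the normalization factor is 1 for any b, so with uniformly sized chunks the choice of b has little effect on ranking.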

Score Thresholds

The Retrieval Agent's system prompt embeds these BM25 score thresholds so it can reason about result quality:

BM25 Score Thresholds
Score   Quality   Interpretation
15+     strong    Exact ID/code match
8-15    good      Multiple term matches
3-8     partial   Some keyword overlap
1-3     weak      Marginal relevance
0       none      No matching tokens

Industry practice: set a minimum threshold of 1.5-2.0 for production. Results with score 0 are filtered out before being returned to the agent.
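
As a rough sketch, the thresholds and the zero-score filter could be applied like this. The function names are hypothetical, and two details are assumptions because the table does not specify them: each band's lower bound is treated as inclusive, and scores between 0 and 1 are labeled weak.

```python
def label_score(score):
    """Map a raw BM25 score to the quality labels from the threshold table."""
    if score >= 15:
        return "strong"
    if score >= 8:
        return "good"
    if score >= 3:
        return "partial"
    if score > 0:
        return "weak"  # assumption: scores between 0 and 1 also count as weak
    return "none"


def select_results(scored_docs, top_k=5, min_score=0.0):
    """Keep the top_k highest-scoring (doc_id, score) pairs strictly above min_score."""
    kept = [(doc_id, s) for doc_id, s in scored_docs if s > min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [(doc_id, s, label_score(s)) for doc_id, s in kept[:top_k]]


# Hypothetical scores for four documents.
scored = [("a", 16.2), ("b", 0.0), ("c", 4.1), ("d", 1.4)]
default_hits = select_results(scored)                # drops only the zero-score result
strict_hits = select_results(scored, min_score=1.5)  # also drops the weak match
```

With the default `min_score` of 0.0 only non-matching documents are removed; raising it to 1.5 also removes the weak match.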

Production Checklist

Setting     Recommended                 Impact
BM25 k1     1.2-1.5                     Keep default unless chunks are very short
BM25 b      0.5 if chunks are uniform   Less penalty for length if chunks are pre-sized
Tokenizer   Custom compound splitting   Enables exact-ID lookup for hyphenated codes

Interactive Comparison

See the BM25 vs Vector Search page for an interactive demo comparing keyword and semantic scoring side by side.

Sources

  1. Elastic Blog: Practical BM25 Part 3: Picking b and k1
  2. Trotman, Puurula, Burgess: "Improvements to BM25 and Language Models Examined" (2014)
  3. Microsoft Azure: Configure BM25 Relevance Scoring
  4. AutoRAG: BM25 Tokenizer Benchmarks
  5. LangChain: BM25 Retriever Integration