Keyword Search
The Keyword Search tool uses BM25 (Best Matching 25) via the rank-bm25 library (BM25Okapi variant) to find documents by exact term overlap. The Retrieval Agent calls this tool when the query contains specific terms, names, or technical identifiers that benefit from literal matching.
How BM25 Works
BM25 scores documents based on term frequency (how often query terms appear in the document) and inverse document frequency (how rare those terms are across the corpus). Common words contribute less to the score than distinctive terms.
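The scoring described above can be sketched in plain Python. This is a simplified implementation of the Okapi BM25 formula using the common "+1"-smoothed IDF; rank-bm25's BM25Okapi handles negative IDF values slightly differently (via an epsilon floor), so treat this as an illustration of the math, not the library's exact output. The corpus and query below are invented examples.

```python
import math

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Score one document against a query with the Okapi BM25 formula.

    corpus: list of token lists (the full collection, needed for IDF
    and the average document length)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_tokens:
        # n(t): number of documents containing the term
        n_t = sum(1 for d in corpus if term in d)
        # Rare terms get a high IDF, common terms a low one
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)
        # f(t, D): raw term frequency in this document
        tf = doc_tokens.count(term)
        # Saturating TF component; b controls length normalization
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score

corpus = [
    ["error", "handling", "guide"],
    ["tc", "409", "usr", "010", "test", "case"],
    ["usr", "manual"],
]
print(bm25_score(["usr", "010"], corpus[1], corpus))  # both terms match
```

A document containing none of the query tokens scores exactly 0, which is why the zero-score filtering described later is safe.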
BM25 Preprocessing Pipeline
AgentLens uses a custom 5-stage tokenizer (not Porter Stemmer). The same tokenizer runs on both index and query sides to ensure consistent matching.
Named stages include compound-ID splitting (TC-409-USR-010 becomes [TC-409-USR-010, TC, 409, USR, 010]), stop word removal (a 42-word frozenset), and suffix stemming (20+ rules, applied longest-first, with a minimum 3-character stem). The industry default is the Porter Stemmer (studying -> study). If the tokenizer does not split compound IDs like TC-409-USR-010 into tc, 409, usr, and 010, BM25 cannot match them: the score will be 0 no matter what k1 and b are set to.
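The compound-ID splitting stage can be sketched as follows. This is not AgentLens's actual 5-stage tokenizer (only three stages are documented above); the stop word set here is an illustrative subset of the 42-word frozenset, and stemming is omitted.

```python
STOP_WORDS = frozenset({"the", "a", "an", "is", "to", "of"})  # illustrative subset only

def tokenize(text):
    """Illustrative tokenizer: lowercase, split compound IDs, drop stop words.

    A compound ID such as TC-409-USR-010 is kept whole AND split into its
    segments, so both exact-ID queries and partial queries can match."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:()[]")
        if not word or word in STOP_WORDS:
            continue
        if "-" in word:
            tokens.append(word)  # keep the full hyphenated ID
            tokens.extend(p for p in word.split("-") if p)  # plus each segment
        else:
            tokens.append(word)
    return tokens

print(tokenize("Run the test case TC-409-USR-010"))
# contains 'tc-409-usr-010' plus 'tc', '409', 'usr', '010'
```

Because the same function runs on both the index and the query side, a query for "USR 010" and a query for the full ID both land on the same indexed tokens.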
BM25 Search Configuration
Score Thresholds
The Retrieval Agent's system prompt embeds these BM25 score thresholds so it can reason about result quality:
| Score | Quality | Meaning |
|---|---|---|
| 15+ | Strong | Exact ID/code match |
| 8-15 | Good | Multiple term matches |
| 3-8 | Partial | Some keyword overlap |
| 1-3 | Weak | Marginal relevance |
| 0 | None | No matching tokens |
Industry practice is to set a minimum score threshold of 1.5-2.0 in production. Results with a score of 0 are filtered out before being returned to the agent.
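The threshold bands and zero-score filtering above can be sketched as a small helper. The function names (`quality`, `filter_results`) and the sample hits are invented for illustration; the band boundaries and the 1.5 default come from the table and guidance above.

```python
def quality(score):
    """Map a BM25 score to the quality band from the table above."""
    if score >= 15:
        return "strong"
    if score >= 8:
        return "good"
    if score >= 3:
        return "partial"
    if score > 0:
        return "weak"
    return "none"

def filter_results(results, threshold=1.5):
    """Drop zero- and low-score hits, per the 1.5-2.0 production guidance."""
    return [(doc, s) for doc, s in results if s >= threshold]

hits = [("doc-a", 16.2), ("doc-b", 2.1), ("doc-c", 0.0)]
print(filter_results(hits))  # doc-c is filtered out
```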
Production Checklist
| Setting | Recommended | Impact |
|---|---|---|
| BM25 k1 | 1.2-1.5 | Keep default unless chunks are very short |
| BM25 b | 0.5 if chunks are uniform | Less penalty for length if chunks are pre-sized |
| Tokenizer | Custom compound splitting | Enables exact-ID lookup for hyphenated codes |
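To see why the checklist recommends lowering b when chunks are pre-sized, it helps to isolate BM25's term-frequency component, the only part of the formula that k1 and b touch. This is a standalone sketch of that component, not library code; the lengths and frequencies are invented.

```python
def tf_component(tf, doc_len, avgdl, k1=1.2, b=0.75):
    """BM25 term-frequency component only, to isolate the effect of b.

    b=0.75 penalizes documents longer than average; b=0 ignores length."""
    return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avgdl))

avgdl = 100
# Same term frequency, but one document is 4x the average length
short = tf_component(3, 100, avgdl)
long_ = tf_component(3, 400, avgdl)
print(short, long_)  # the long document is penalized at b=0.75
print(tf_component(3, 400, avgdl, b=0.0))  # with b=0, length is ignored
```

When a chunking pipeline produces uniformly sized chunks, doc_len/avgdl is close to 1 for every chunk, so aggressive length normalization adds noise rather than signal; that is the rationale for the b = 0.5 recommendation above.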
Interactive Comparison
See the BM25 vs Vector Search page for an interactive demo comparing keyword and semantic scoring side by side.
Sources
- Elastic Blog: Practical BM25 Part 3: Picking b and k1
- Trotman, Puurula, Burgess: "Improvements to BM25 and Language Models Examined" (2014)
- Microsoft Azure: Configure BM25 Relevance Scoring
- AutoRAG: BM25 Tokenizer Benchmarks
- LangChain: BM25 Retriever Integration