
BM25 vs Vector Search

Every RAG pipeline needs a retrieval step: given a user query, find the most relevant document chunks. There are two main approaches: keyword-based scoring (BM25) and semantic embeddings (vector search). This page explains how each works, when each wins, and why AgentLens uses both.

How Each Mechanism Works

BM25

Keyword scoring

Counts exact word matches, weighted by rarity

1. Tokenize – split the query into individual words
2. IDF weight – rare words score higher than common ones (e.g. “transformer” > “the”)
3. TF saturation – diminishing returns after a word appears several times
4. Length norm – shorter documents score higher per match
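The four steps can be sketched in a few lines of Python. This is a toy Okapi BM25 implementation over a tokenized in-memory corpus, not the AgentLens code; the `k1` and `b` defaults are the classic Okapi values.

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query (Okapi BM25)."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    tf = Counter(doc)
    N = len(corpus)
    score = 0.0
    for term in query:
        # IDF weight: terms rare across the corpus count for more
        df = sum(1 for d in corpus if term in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        # TF saturation + length normalization
        freq = tf[term]
        score += idf * (freq * (k1 + 1)) / (
            freq + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "the transformer model uses attention".split(),
    "the cat sat on the mat".split(),
]
query = "transformer attention".split()
# The document that actually contains the query terms scores higher
assert bm25_score(query, corpus[0], corpus) > bm25_score(query, corpus[1], corpus)
```

Note how a document with zero matching tokens scores exactly 0 – the bottom of the unbounded scale described below.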
Vector

Semantic embedding

Compares meaning, not words

1. Embed query – encode into a 384-dim vector using all-MiniLM-L6-v2
2. Nearest neighbors – find stored chunks closest in vector space
3. Cosine similarity – score = cos(query, chunk); cosine ranges [-1, 1], but for these embeddings scores land in roughly [0, 1]
4. Rank – return top-k chunks by similarity
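A minimal sketch of steps 2–4, using hand-made 3-dim vectors in place of real 384-dim MiniLM embeddings (in production the query and chunks would both be encoded by the same model):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, chunks, k=2):
    """Rank stored (chunk_id, vector) pairs by cosine similarity."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in chunks]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

# Toy 3-dim "embeddings" standing in for 384-dim MiniLM vectors
chunks = [
    ("token-budget", [0.9, 0.1, 0.0]),
    ("unrelated",    [0.0, 0.2, 0.9]),
]
query = [0.8, 0.2, 0.1]
results = top_k(query, chunks, k=2)
assert results[0][0] == "token-budget"
```

The key difference from BM25: nothing here looks at shared tokens, only at proximity in the embedding space.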

An Analogy

BM25

Like a book index. You look up a keyword, and it tells you exactly which pages mention that word. Fast and precise, but it cannot find pages about the same concept when different words are used.

Vector

Like a librarian. You describe what you are looking for in plain language, and they bring you books that are about the right topic – even if the books never use your exact words.

Score Ranges

BM25 Thresholds
15+ – strong (exact ID/code match)
8-15 – good (multiple term matches)
3-8 – partial (some keyword overlap)
1-3 – weak (marginal relevance)
0 – none (no matching tokens)

Unbounded scale. Values depend on corpus size and query length.

Vector Thresholds
0.5+ – strong (near-semantic match)
0.35-0.5 – good (clearly relevant)
0.2-0.35 – marginal (possibly relevant)
0.1-0.2 – noise (likely irrelevant)
<0.1 – garbage (random proximity)

Normalized [0, 1]. Using all-MiniLM-L6-v2 embeddings.

Live Comparison

The example below shows how BM25 and vector search score the same document chunk differently. The scores are from real AgentLens pipeline runs.

Matched chunk

The PromptAssembler uses a 4-layer token budget to order system prompt, retrieved context, conversation history, and user query within the model context window.

BM25: 8.2
Vector: 0.72
Tie – both methods effective

Both match well. BM25 finds exact "PromptAssembler" and "token" keywords. Vector understands the semantic concept of token budgeting.

Why Hybrid Retrieval

Neither approach is strictly better. BM25 excels at exact-match queries – names, error codes, specific terms. Vector search excels at conceptual queries where the user does not know the exact terminology.

AgentLens uses hybrid retrieval: the ReAct agent has access to both vector_search and keyword_search tools, and decides which to call (or both) based on the query.

Finding the right chunks is not enough. With 3B-parameter models, retrieval accuracy means nothing if the LLM cannot reason over what it found. AgentLens solves this with a multi-agent architecture: three specialized agents that each own a different failure mode. ReAct retrieves and reasons. Grader filters out irrelevant chunks before they reach the answer. Judge rejects answers that are not grounded in evidence. No single LLM handles everything.

Hybrid Search with RRF

As of 2025-2026, hybrid search is the production default for enterprise RAG. Every major vector database (Pinecone, Qdrant, Weaviate, Elasticsearch) supports it natively. BM25 and vector search have complementary strengths, and combining them yields 15-20% precision improvement over either alone.

Reciprocal Rank Fusion (RRF): Industry-Standard Fusion

RRF merges ranked results from BM25 and vector search without needing to normalize their incompatible score scales. It works on rank position, not raw scores.

# Reciprocal Rank Fusion (Cormack et al., SIGIR 2009)
# The industry-standard way to combine BM25 + Vector results

def reciprocal_rank_fusion(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# k=60 is the standard default (from the original paper)
# Documents appearing in BOTH lists get boosted
# No score normalization needed. Pure rank-based fusion
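To see the boosting behavior concretely, here is a toy run (the doc IDs are invented for illustration; the function from above is repeated so the snippet stands alone):

```python
def reciprocal_rank_fusion(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, 1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

bm25_ranking = ["doc1", "doc2", "doc3"]    # keyword-search order
vector_ranking = ["doc2", "doc4", "doc1"]  # vector-search order
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])

# doc2 (ranks 2 and 1) edges out doc1 (ranks 1 and 3):
# documents found by both retrievers beat documents found by only one
assert [d for d, _ in fused][:2] == ["doc2", "doc1"]
```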
k parameter (default 60) – smoothing constant. k=60 is the paper default and works well without tuning; typical range is 50-100. Higher k gives more weight to lower-ranked results.

Alpha weighting (0.5-0.7 BM25) – when using weighted fusion instead of RRF, data heavy on codes and IDs typically benefits from weighting BM25 higher (0.6-0.7). General knowledge bases: 0.3-0.5 BM25.

Why RRF Over Score Normalization

BM25 scores are unbounded (0 to 20+) while cosine similarity is bounded (0 to 1). You cannot simply combine them. A BM25 score of 16.8 and a cosine score of 0.44 are incomparable. RRF solves this elegantly by ignoring raw scores entirely and fusing only rank positions. This is why it is the industry default: zero calibration needed.

Optimized RAG Retrieval Pipeline

Based on industry research, here is the production-standard retrieval pipeline used by top-performing RAG systems:

1. Parallel retrieval: BM25 + Vector (AgentLens: ReAct agent-driven tool calling)
Run both searches simultaneously. BM25 with k1=1.2, b=0.75, and a custom tokenizer for compound IDs. Vector with cosine similarity, using the same embedding model as at indexing time.

2. Score filtering (AgentLens: vector cutoff = 0.25, BM25 filters score 0)
BM25: drop results with score < 1.5. Vector: drop results with cosine similarity < 0.25. This eliminates noise before fusion.

3. RRF fusion (k=60)
Merge the filtered BM25 and vector results using Reciprocal Rank Fusion. Documents appearing in both lists get boosted. Take the top 10-20 candidates.

4. Cross-encoder reranking (AgentLens: the Grader agent scores and ranks chunks via LLM)
Pass the top 10-20 candidates through a cross-encoder reranker (e.g., BGE Reranker, Cohere Rerank), which scores each document against the query with much higher accuracy. Return the top 3-5.

5. Context injection to the LLM (AgentLens: purpose-built prompts per agent role)
Pass the top 3-5 reranked chunks as context to the LLM. The agent needs to see full chunks, not truncated previews, to judge sufficiency.
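Put together, the five steps look roughly like this. The search and rerank callables are hypothetical stubs, not AgentLens APIs; the cutoffs mirror the values suggested above:

```python
def retrieve(query, bm25_search, vector_search, rerank,
             bm25_cutoff=1.5, vec_cutoff=0.25, k_rrf=60):
    # 1. Parallel retrieval (run sequentially here for simplicity)
    bm25_hits = bm25_search(query)    # [(doc_id, score), ...], unbounded scale
    vec_hits = vector_search(query)   # [(doc_id, cosine), ...], roughly [0, 1]

    # 2. Score filtering: drop noise before fusion
    bm25_hits = [(d, s) for d, s in bm25_hits if s >= bm25_cutoff]
    vec_hits = [(d, s) for d, s in vec_hits if s >= vec_cutoff]

    # 3. RRF fusion on rank positions; keep the top 20 candidates
    scores = {}
    for hits in (bm25_hits, vec_hits):
        for rank, (doc_id, _) in enumerate(hits, 1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k_rrf + rank)
    candidates = sorted(scores, key=scores.get, reverse=True)[:20]

    # 4. Cross-encoder reranking; keep the top 5
    reranked = sorted(candidates, key=lambda d: rerank(query, d), reverse=True)
    return reranked[:5]   # 5. These chunks go to the LLM as context
```

Swapping the stubs for real backends leaves the control flow unchanged, which is the point: the pipeline shape is independent of the specific search engines plugged in.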
Benchmark result from Anthropic (2024):

Reranked Contextual Embedding + Contextual BM25 reduced the top-20-chunk retrieval failure rate by 67% (5.7% to 1.9%). Hybrid search + reranking is worth the investment.

Sources

  1. Cormack, Clarke, Büttcher: "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods" (SIGIR 2009)
  2. Elastic: A Comprehensive Hybrid Search Guide
  3. Google Cloud: About Hybrid Search in Vertex AI