
Semantic Search

The Semantic Search tool uses vector embeddings to find documents by meaning rather than exact words. It encodes the query into a dense vector and retrieves the closest document chunks from ChromaDB using cosine similarity.

How Vector Search Works

Vector search maps text into high-dimensional vectors where geometric proximity corresponds to semantic similarity. A query is encoded into the same vector space as the indexed documents, and a nearest-neighbor lookup returns the closest chunks. Documents about similar topics cluster together even when they share no exact words.
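The nearest-neighbor lookup described above can be sketched with toy vectors. This is a minimal, self-contained example using NumPy; the 4-dimensional "embeddings" and chunk texts are made up for illustration (real embeddings come from the model and are 384-dimensional):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (magnitude-independent)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings for three indexed chunks.
chunks = {
    "resetting your password":   np.array([0.9, 0.1, 0.0, 0.1]),
    "recovering account access": np.array([0.8, 0.2, 0.1, 0.0]),
    "quarterly revenue report":  np.array([0.0, 0.1, 0.9, 0.3]),
}

# Query embedding for "I forgot my login" -- shares no words with the chunks.
query = np.array([0.85, 0.15, 0.05, 0.05])

# Nearest-neighbor lookup: rank chunks by cosine similarity to the query.
ranked = sorted(
    chunks.items(),
    key=lambda kv: cosine_similarity(query, kv[1]),
    reverse=True,
)
for text, vec in ranked:
    print(text, round(cosine_similarity(query, vec), 3))
```

The two account-related chunks outrank the revenue chunk even though none of them contain the query's words: proximity in the vector space stands in for shared meaning.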

Embedding Model Comparison

AgentLens uses all-MiniLM-L6-v2 from Sentence Transformers to generate 384-dimensional embeddings. This model balances speed and quality: it encodes text in milliseconds while capturing semantic relationships between concepts.

Embedding Models

| Model | Dimensions | Notes |
|---|---|---|
| all-MiniLM-L6-v2 (AgentLens default) | 384 | Open-source, fast, good for local/on-prem deployment. Sub-millisecond encoding with strong semantic quality for general English text. |
| text-embedding-3-large | 3072 | OpenAI. Highest quality; supports Matryoshka truncation (reduce dimensions without retraining). |
| text-embedding-3-small | 1536 | OpenAI. Good balance of cost and quality for production workloads. |
| Cohere Embed v3 | 1024 | Enterprise-grade; supports int8/binary compression for reduced storage and faster retrieval. |
| BGE-M3 | 1024 | Multilingual, hybrid sparse+dense embeddings in a single model. Strong for cross-lingual retrieval. |
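The Matryoshka truncation mentioned for text-embedding-3-large can be sketched as "keep the leading dimensions, then re-normalize." The vector below is random, purely for illustration; what makes this safe in practice is that Matryoshka-trained models concentrate the most information in the leading dimensions:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Matryoshka-style truncation: keep the first `dims` components,
    then re-normalize so cosine similarity still behaves."""
    head = vec[:dims]
    return head / np.linalg.norm(head)

# Stand-in for a full 3072-dim embedding (random here, just for shape).
full = np.random.default_rng(0).normal(size=3072)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 256)
print(small.shape)  # (256,)
```

Storing 256 dimensions instead of 3072 cuts index size roughly 12x; the quality trade-off depends on the model and must be measured on your own retrieval benchmarks.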
Key insight: The embedding model is the "tokenizer" of vector search.

Just as BM25's tokenizer determines what keywords are searchable, the embedding model determines what meanings are capturable. A model that was not trained on domain jargon will not embed specialized terms meaningfully. This is why domain-specific fine-tuning or contextual chunk enrichment (prepending context to chunks before embedding) dramatically improves retrieval quality.
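Contextual chunk enrichment, as described above, amounts to prepending document-level context to each chunk before it is embedded. A minimal sketch; the `enrich_chunk` helper and its template are hypothetical, not AgentLens's actual code:

```python
def enrich_chunk(chunk: str, doc_title: str, section: str) -> str:
    """Prepend document-level context so the embedding captures what
    the chunk is about, not just its literal words.
    (Hypothetical helper; the exact template is an assumption.)"""
    return f"Document: {doc_title}\nSection: {section}\n\n{chunk}"

raw = "Set the retry limit to 3 and back off exponentially."
enriched = enrich_chunk(
    raw,
    doc_title="Payments API Runbook",
    section="Handling gateway timeouts",
)
print(enriched)
```

Embedding `enriched` rather than `raw` lets a query like "payment timeout retries" match the chunk even though the raw text never mentions payments or timeouts.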

Vector Search Configuration

Vector Search Parameters

| Parameter | Value | Notes |
|---|---|---|
| Similarity metric | cosine | Cosine similarity is the standard for text. Measures the angle between vectors, ignoring magnitude. Range: -1 to 1 (in practice 0 to 1 for text). Alternatives: dot product (when the model was trained with it), Euclidean distance (for clustering). |
| top_k | 5-20 | Number of nearest neighbors to retrieve. Industry standard: retrieve 10-20 candidates initially, then rerank down to 3-5 for the LLM context. AgentLens default: 5. |
| Chunk size | 400-800 tokens | Too small loses context; too large dilutes the embedding. The enterprise standard (per Anthropic's contextual retrieval benchmarks) is 400-800 tokens with overlap. |
| HNSW | ef=200, M=16 | Approximate nearest-neighbor index. Higher ef_construction improves index quality at the cost of build time; M controls the number of bi-directional links per node. Typical production values. |
| similarity_cutoff | 0.25 (AgentLens) | Results below this cosine similarity are filtered out before being returned to the agent. Without a cutoff, irrelevant queries return results with scores of 0.06-0.14, which reflect random vector proximity rather than real matches. Standard practice is to add a cutoff so irrelevant queries return empty results instead of noise. |
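Applied as a post-filter, the cutoff might look like the following sketch. The `apply_cutoff` helper and the scored results are hypothetical, not AgentLens's actual implementation:

```python
CUTOFF = 0.25  # AgentLens default similarity_cutoff

def apply_cutoff(
    results: list[tuple[str, float]],
    cutoff: float = CUTOFF,
) -> list[tuple[str, float]]:
    """Drop results whose cosine similarity falls below the cutoff,
    so irrelevant queries return [] instead of noise."""
    return [(text, score) for text, score in results if score >= cutoff]

scored = [
    ("chunk about auth", 0.52),
    ("tangential chunk", 0.31),
    ("random chunk", 0.09),  # random vector proximity, not a real match
]
print(apply_cutoff(scored))  # the 0.09 result is filtered out
```

Returning an empty list for off-topic queries gives the agent a clear "nothing found" signal it can act on, rather than plausible-looking noise it has to second-guess.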

Score Thresholds

The Retrieval Agent's system prompt embeds these vector score thresholds so it can reason about result quality:

Vector (Cosine) Score Thresholds

| Score | Quality | Interpretation |
|---|---|---|
| 0.5+ | strong | Near-semantic match |
| 0.35-0.5 | good | Clearly relevant |
| 0.2-0.35 | partial | Possibly relevant |
| 0.1-0.2 | weak | Likely irrelevant |
| <0.1 | none | Random proximity |

Industry practice: set similarity_cutoff to at least 0.25-0.3. LlamaIndex uses SimilarityPostprocessor(similarity_cutoff=0.75) for high-precision retrieval; LangChain uses score_threshold=0.25 (where cosine distance = 1 - similarity).
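The threshold table above can be expressed as a small classifier the agent (or a test suite) can call. The bucket boundaries follow the table; `score_quality` is an illustrative name, not part of the AgentLens API:

```python
def score_quality(score: float) -> str:
    """Map a cosine similarity score to the quality buckets
    from the threshold table."""
    if score >= 0.5:
        return "strong"   # near-semantic match
    if score >= 0.35:
        return "good"     # clearly relevant
    if score >= 0.2:
        return "partial"  # possibly relevant
    if score >= 0.1:
        return "weak"     # likely irrelevant
    return "none"         # random proximity

print(score_quality(0.62))  # strong
print(score_quality(0.28))  # partial
print(score_quality(0.05))  # none
```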

Production Checklist

| Setting | Recommended | Impact |
|---|---|---|
| Vector threshold | similarity_cutoff=0.25 | Eliminates noise from irrelevant queries |
| Reranker | BGE Reranker or cross-encoder | 67% reduction in retrieval failure (Anthropic benchmark) |

Interactive Comparison

See the BM25 vs Vector Search page for an interactive demo comparing keyword and semantic scoring side by side.