Document Indexing

The ingestion service processes documents into searchable chunks stored in ChromaDB. This is a one-time operation per document – once indexed, chunks are available to both keyword and semantic search tools.

Supported File Types

  • .txt – plain text (simple read)
  • .pdf – via PyMuPDF (fitz)
  • .docx – via python-docx (python-docx does not read the legacy binary .doc format)
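
As a rough sketch, extraction can dispatch on the file extension. The extract_text helper below is a hypothetical name, not the service's actual API:

  import fitz                 # PyMuPDF
  from docx import Document   # python-docx

  def extract_text(path: str) -> str:
      """Hypothetical dispatcher over the supported file types."""
      if path.endswith(".txt"):
          with open(path, encoding="utf-8") as f:
              return f.read()                     # simple read
      if path.endswith(".pdf"):
          with fitz.open(path) as doc:
              return "\n".join(page.get_text() for page in doc)
      if path.endswith(".docx"):
          return "\n".join(p.text for p in Document(path).paragraphs)
      raise ValueError(f"unsupported file type: {path}")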

Chunking Strategy

Documents are split with a sentence-boundary chunker, which breaks text on sentence-ending punctuation and accumulates sentences until the word-count threshold is reached:

  • Chunk size: 512 words
  • Chunk overlap: 50 words
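
A minimal sketch of such a chunker, assuming a naive regex sentence split (the service's actual splitter may differ):

  import re

  CHUNK_SIZE = 512   # words per chunk
  OVERLAP = 50       # words carried into the next chunk

  def chunk_text(text: str) -> list[str]:
      # Split on sentence-ending punctuation followed by whitespace.
      sentences = re.split(r"(?<=[.!?])\s+", text.strip())
      chunks: list[str] = []
      current: list[str] = []
      fresh = 0                              # words added since the last emitted chunk
      for sentence in sentences:
          words = sentence.split()
          current.extend(words)
          fresh += len(words)
          if len(current) >= CHUNK_SIZE:
              chunks.append(" ".join(current))
              current = current[-OVERLAP:]   # seed the next chunk with the overlap
              fresh = 0
      if fresh:                              # flush the tail, but not a pure-overlap remainder
          chunks.append(" ".join(current))
      return chunks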

Each chunk includes metadata: document ID, filename, chunk index, start/end character offsets, and ingestion timestamp. Chunk IDs follow the format filename#idx for traceability.
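
For illustration, a chunk record could be assembled like this (the field names are assumptions based on the list above):

  from datetime import datetime, timezone

  def chunk_record(document_id: str, filename: str, idx: int,
                   start: int, end: int) -> tuple[str, dict]:
      chunk_id = f"{filename}#{idx}"          # e.g. "handbook.txt#3"
      metadata = {
          "document_id": document_id,
          "filename": filename,
          "chunk_index": idx,
          "start_char": start,
          "end_char": end,
          "ingested_at": datetime.now(timezone.utc).isoformat(),
      }
      return chunk_id, metadata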

Dual Indexing

Each chunk is indexed twice:

  • Vector index – the chunk text is encoded into a 384-dimensional embedding with all-MiniLM-L6-v2 and stored in ChromaDB in an HNSW index using cosine similarity
  • BM25 index – the chunk text is tokenized using the custom tokenizer and added to an in-memory BM25 index for keyword search
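
A rough sketch of both writes, assuming chromadb, sentence-transformers, and rank_bm25 (the collection name is made up, and a plain lowercase split stands in for the custom tokenizer):

  import chromadb
  from rank_bm25 import BM25Okapi
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("all-MiniLM-L6-v2")    # 384-dim embeddings
  client = chromadb.Client()
  collection = client.get_or_create_collection(
      name="documents",                              # assumed collection name
      metadata={"hnsw:space": "cosine"},             # HNSW with cosine similarity
  )

  def index_chunks(ids: list[str], texts: list[str],
                   metadatas: list[dict]) -> BM25Okapi:
      # Vector index: embed each chunk and store it in ChromaDB.
      embeddings = model.encode(texts).tolist()
      collection.add(ids=ids, embeddings=embeddings,
                     documents=texts, metadatas=metadatas)
      # BM25 index: tokenize and build the in-memory keyword index.
      tokenized = [text.lower().split() for text in texts]
      return BM25Okapi(tokenized)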

Ingestion Endpoints

  • POST /ingest – upload and index a single file (idempotent: removes existing chunks with the same filename before re-ingesting)
  • POST /ingest-samples – bulk ingest all sample documents from the data directory
  • POST /refresh-index – rebuild the BM25 index from current ChromaDB contents
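
For example, calling the single-file endpoint with requests (the host, port, and multipart field name are assumptions):

  import requests

  BASE = "http://localhost:8000"   # assumed host and port

  # Upload and index one file; re-running replaces any chunks
  # previously stored under the same filename.
  with open("handbook.txt", "rb") as f:
      resp = requests.post(f"{BASE}/ingest", files={"file": f})
  resp.raise_for_status()

  # Rebuild the BM25 index from current ChromaDB contents.
  requests.post(f"{BASE}/refresh-index").raise_for_status()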

Sample Documents

AgentLens ships with 12 sample documents organized across 6 categories: education, enterprise, legal, support, technical, and telecom. These are ingested automatically on first startup via the /ingest-samples endpoint.