Document Indexing
The ingestion service processes documents into searchable chunks stored in ChromaDB. This is a one-time operation per document – once indexed, chunks are available to both keyword and semantic search tools.
Supported File Types
- .txt – plain text (simple read)
- .pdf – via PyMuPDF (fitz)
- .doc / .docx – via python-docx (note: python-docx reads only the .docx format; legacy binary .doc files must be converted to .docx first)
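A minimal sketch of the extension-based dispatch this implies (the function name `extract_text` is illustrative, not the service's actual API; lazy imports keep the optional PDF/DOCX dependencies out of the plain-text path):

```python
from pathlib import Path

def extract_text(path: str) -> str:
    """Dispatch on file extension to the matching extractor (illustrative)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".txt":
        # Plain text: simple read.
        return Path(path).read_text(encoding="utf-8")
    if suffix == ".pdf":
        import fitz  # PyMuPDF
        with fitz.open(path) as doc:
            return "\n".join(page.get_text() for page in doc)
    if suffix == ".docx":
        import docx  # python-docx
        return "\n".join(p.text for p in docx.Document(path).paragraphs)
    raise ValueError(f"unsupported file type: {suffix}")
```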
Chunking Strategy
Documents are split using a sentence-boundary chunker. The chunker splits text on sentence-ending punctuation, then accumulates sentences until the word count threshold is reached:
- Chunk size: 512 words
- Chunk overlap: 50 words
Each chunk includes metadata: document ID, filename, chunk index, start/end character offsets, and ingestion timestamp. Chunk IDs follow the format filename#idx for traceability.
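The chunking strategy above can be sketched as follows. This is an illustrative reimplementation, not the service's code: the sentence-splitting regex, metadata keys, and offset handling (omitted here for brevity) are assumptions.

```python
import re
import time

CHUNK_SIZE = 512    # max words per chunk
CHUNK_OVERLAP = 50  # words carried over into the next chunk

def chunk_document(text: str, doc_id: str, filename: str):
    """Split text on sentence boundaries, then pack sentences into chunks."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, word_count = [], [], 0
    for sent in sentences:
        current.append(sent)
        word_count += len(sent.split())
        if word_count >= CHUNK_SIZE:
            chunks.append(" ".join(current))
            # Seed the next chunk with the last CHUNK_OVERLAP words.
            overlap = " ".join(" ".join(current).split()[-CHUNK_OVERLAP:])
            current, word_count = [overlap], len(overlap.split())
    if current:
        chunks.append(" ".join(current))
    # Attach metadata; chunk IDs follow the filename#idx convention.
    return [
        {
            "id": f"{filename}#{idx}",
            "text": chunk,
            "metadata": {
                "document_id": doc_id,
                "filename": filename,
                "chunk_index": idx,
                "ingested_at": time.time(),
            },
        }
        for idx, chunk in enumerate(chunks)
    ]
```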
Dual Indexing
Each chunk is indexed twice:
- Vector index – the chunk text is encoded into a 384-dimensional embedding using all-MiniLM-L6-v2 and stored in ChromaDB with HNSW cosine similarity
- BM25 index – the chunk text is tokenized using the custom tokenizer and added to an in-memory BM25 index for keyword search
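The keyword side of the dual index can be illustrated with a minimal in-memory Okapi BM25 implementation. This is a sketch: the service's custom tokenizer is not shown in this document, so a simple lowercase-alphanumeric tokenizer stands in for it, and the class and parameter names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text: str):
    # Stand-in for the service's custom tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())

class BM25Index:
    """Minimal in-memory Okapi BM25 index (illustrative only)."""

    def __init__(self, k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs = {}       # chunk_id -> term counts
        self.doc_len = {}    # chunk_id -> total token count
        self.df = Counter()  # term -> number of chunks containing it

    def add(self, chunk_id: str, text: str):
        tokens = tokenize(text)
        counts = Counter(tokens)
        self.docs[chunk_id] = counts
        self.doc_len[chunk_id] = len(tokens)
        self.df.update(counts.keys())

    def search(self, query: str, top_k: int = 5):
        n = len(self.docs)
        avgdl = sum(self.doc_len.values()) / max(n, 1)
        scores = {}
        for term in tokenize(query):
            idf = math.log(1 + (n - self.df[term] + 0.5) / (self.df[term] + 0.5))
            for cid, counts in self.docs.items():
                tf = counts.get(term, 0)
                if not tf:
                    continue
                denom = tf + self.k1 * (1 - self.b + self.b * self.doc_len[cid] / avgdl)
                scores[cid] = scores.get(cid, 0.0) + idf * tf * (self.k1 + 1) / denom
        return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
```

The vector side is handled by ChromaDB itself; at index time each chunk is simply added to both structures under the same `filename#idx` ID so the two search paths return comparable results.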
Ingestion Endpoints
- POST /ingest – upload and index a single file (idempotent: removes existing chunks with the same filename before re-ingesting)
- POST /ingest-samples – bulk ingest all sample documents from the data directory
- POST /refresh-index – rebuild the BM25 index from current ChromaDB contents
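The idempotency guarantee of the single-file ingest can be sketched server-side. Here `ChunkStore` is a hypothetical in-memory stand-in for the ChromaDB collection; the real service deletes and re-adds chunks through the ChromaDB API instead.

```python
class ChunkStore:
    """In-memory stand-in for the ChromaDB collection (illustrative)."""

    def __init__(self):
        self.chunks = {}  # chunk_id -> {"filename": ..., "text": ...}

    def delete_by_filename(self, filename: str):
        self.chunks = {cid: c for cid, c in self.chunks.items()
                       if c["filename"] != filename}

    def ingest(self, filename: str, chunk_texts):
        # Idempotency: drop chunks from any previous ingest of this file,
        # so a re-upload never leaves stale chunks behind.
        self.delete_by_filename(filename)
        for idx, text in enumerate(chunk_texts):
            self.chunks[f"{filename}#{idx}"] = {"filename": filename, "text": text}
        return len(chunk_texts)
```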
Sample Documents
AgentLens ships with 12 sample documents organized across 6 categories: education, enterprise, legal, support, technical, and telecom. These are ingested automatically on first startup via the /ingest-samples endpoint.