Document Indexing

The ingestion service processes documents into searchable chunks stored in ChromaDB. This is a one-time operation per document – once indexed, chunks are available to both keyword and semantic search tools.

Supported File Types

  • .txt – plain text (simple read)
  • .pdf – via PyMuPDF (fitz)
  • .docx – via python-docx (python-docx does not read the legacy binary .doc format)
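
As a rough sketch, extraction can dispatch on the file extension. The extract_text helper below is a hypothetical name, not the service's actual API:

  import fitz                 # PyMuPDF
  from docx import Document   # python-docx

  def extract_text(path: str) -> str:
      """Hypothetical dispatcher over the supported file types."""
      if path.endswith(".txt"):
          with open(path, encoding="utf-8") as f:
              return f.read()                     # simple read
      if path.endswith(".pdf"):
          with fitz.open(path) as doc:
              return "\n".join(page.get_text() for page in doc)
      if path.endswith(".docx"):
          return "\n".join(p.text for p in Document(path).paragraphs)
      raise ValueError(f"unsupported file type: {path}")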

Chunking Strategy

Documents are split with a sentence-boundary chunker, which breaks text on sentence-ending punctuation and accumulates sentences until the word-count threshold is reached:

  • Chunk size: 512 words
  • Chunk overlap: 50 words
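
A minimal sketch of such a chunker, assuming a naive regex sentence split (the service's actual splitter may differ):

  import re

  CHUNK_SIZE = 512   # words per chunk
  OVERLAP = 50       # words carried into the next chunk

  def chunk_text(text: str) -> list[str]:
      # Split on sentence-ending punctuation followed by whitespace.
      sentences = re.split(r"(?<=[.!?])\s+", text.strip())
      chunks: list[str] = []
      current: list[str] = []
      fresh = 0                              # words added since the last emitted chunk
      for sentence in sentences:
          words = sentence.split()
          current.extend(words)
          fresh += len(words)
          if len(current) >= CHUNK_SIZE:
              chunks.append(" ".join(current))
              current = current[-OVERLAP:]   # seed the next chunk with the overlap
              fresh = 0
      if fresh:                              # flush the tail, but not a pure-overlap remainder
          chunks.append(" ".join(current))
      return chunks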

Each chunk includes metadata: document ID, filename, chunk index, start/end character offsets, and ingestion timestamp. Chunk IDs follow the format filename#idx for traceability.
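
For illustration, a chunk record could be assembled like this (the field names are assumptions based on the list above):

  from datetime import datetime, timezone

  def chunk_record(document_id: str, filename: str, idx: int,
                   start: int, end: int) -> tuple[str, dict]:
      chunk_id = f"{filename}#{idx}"          # e.g. "handbook.txt#3"
      metadata = {
          "document_id": document_id,
          "filename": filename,
          "chunk_index": idx,
          "start_char": start,
          "end_char": end,
          "ingested_at": datetime.now(timezone.utc).isoformat(),
      }
      return chunk_id, metadata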

Dual Indexing

Each chunk is indexed twice:

  • Vector index – the chunk text is encoded into a 384-dimensional embedding with all-MiniLM-L6-v2 and stored in ChromaDB in an HNSW index using cosine similarity
  • BM25 index – the chunk text is tokenized using the custom tokenizer and added to an in-memory BM25 index for keyword search
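
A rough sketch of both writes, assuming chromadb, sentence-transformers, and rank_bm25 (the collection name is made up, and a plain lowercase split stands in for the custom tokenizer):

  import chromadb
  from rank_bm25 import BM25Okapi
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("all-MiniLM-L6-v2")    # 384-dim embeddings
  client = chromadb.Client()
  collection = client.get_or_create_collection(
      name="documents",                              # assumed collection name
      metadata={"hnsw:space": "cosine"},             # HNSW with cosine similarity
  )

  def index_chunks(ids: list[str], texts: list[str],
                   metadatas: list[dict]) -> BM25Okapi:
      # Vector index: embed each chunk and store it in ChromaDB.
      embeddings = model.encode(texts).tolist()
      collection.add(ids=ids, embeddings=embeddings,
                     documents=texts, metadatas=metadatas)
      # BM25 index: tokenize and build the in-memory keyword index.
      tokenized = [text.lower().split() for text in texts]
      return BM25Okapi(tokenized)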

Ingestion Endpoints

  • POST /ingest – upload and index a single file (idempotent: removes existing chunks with the same filename before re-ingesting)
  • POST /ingest-samples – bulk ingest all sample documents from the data directory
  • POST /refresh-index – rebuild the BM25 index from current ChromaDB contents
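
For example, calling the single-file endpoint with requests (the host, port, and multipart field name are assumptions):

  import requests

  BASE = "http://localhost:8000"   # assumed host and port

  # Upload and index one file; re-running replaces any chunks
  # previously stored under the same filename.
  with open("handbook.txt", "rb") as f:
      resp = requests.post(f"{BASE}/ingest", files={"file": f})
  resp.raise_for_status()

  # Rebuild the BM25 index from current ChromaDB contents.
  requests.post(f"{BASE}/refresh-index").raise_for_status()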

Sample Documents

AgentLens ships with 12 sample documents organized across 6 categories: education, enterprise, legal, support, technical, and telecom. These are ingested automatically on first startup via the /ingest-samples endpoint.