Document Indexing
Ingest, chunk, embed, and retrieve documents with LocalDocumentIndex.
Overview
LocalDocumentIndex handles the full document lifecycle: you provide raw text (strings, files, or URLs), and Vectra splits it into chunks, generates embeddings, stores everything on disk, and retrieves ranked results at query time.
ingest → chunk → embed → store → query → rank → render sections
This page covers each stage in detail. For the item-level API where you supply your own vectors, see Core Concepts.
Creating a document index
import { LocalDocumentIndex, OpenAIEmbeddings } from 'vectra';
const embeddings = new OpenAIEmbeddings({
apiKey: process.env.OPENAI_API_KEY!,
model: 'text-embedding-3-small',
maxTokens: 8000,
});
const docs = new LocalDocumentIndex({
folderPath: './my-doc-index',
embeddings,
});
if (!(await docs.isIndexCreated())) {
await docs.createIndex({ version: 1 });
}
Constructor options
| Option | Type | Default | Description |
|---|---|---|---|
| `folderPath` | `string` | – | Path to the index folder (required) |
| `embeddings` | `EmbeddingsModel?` | – | Embeddings provider (required for upsert/query) |
| `tokenizer` | `Tokenizer?` | `GPT3Tokenizer` | Tokenizer for chunk counting |
| `chunkingConfig` | `Partial<TextSplitterConfig>?` | See below | Chunking configuration |
| `storage` | `FileStorage?` | `LocalFileStorage` | Storage backend |
| `codec` | `IndexCodec?` | `JsonCodec` | Serialization format |
| `indexName` | `string?` | `'index.json'` | Name of the index file |
Adding documents
upsertDocument
Add or update a document. Vectra splits the text into chunks, generates embeddings for each chunk, and stores everything atomically.
await docs.upsertDocument('doc://readme', 'Full document text here...', 'md');
| Parameter | Type | Description |
|---|---|---|
| `uri` | `string` | Unique document identifier (any string: URLs, file paths, custom IDs) |
| `text` | `string` | Document content |
| `docType` | `string?` | Hint for chunking separators (e.g., `'md'`, `'txt'`, `'html'`, `'py'`) |
If a document with the same URI already exists, it is replaced (old chunks deleted, new chunks inserted).
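The replace-on-same-URI semantics can be pictured with a toy in-memory catalog (an illustration of the behavior only, not Vectra's actual storage; `catalog` and `upsert` here are hypothetical names):

```typescript
// Toy illustration of upsert semantics: a map from URI to chunk texts,
// where writing the same URI replaces the old chunks wholesale.
const catalog = new Map<string, string[]>();

function upsert(uri: string, chunks: string[]): void {
  catalog.delete(uri); // drop any previous chunks for this URI
  catalog.set(uri, chunks); // insert the freshly generated chunks
}

upsert('doc://readme', ['old chunk 1', 'old chunk 2']);
upsert('doc://readme', ['new chunk']); // same URI → old chunks are gone
```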
Ingesting from files and URLs
Vectra provides fetcher classes to read content before passing it to upsertDocument:
FileFetcher — Read local files or recursively scan directories (Node.js only):
import { FileFetcher } from 'vectra';
const fetcher = new FileFetcher();
await fetcher.fetch('./docs/', async (uri, text, docType) => {
await docs.upsertDocument(uri, text, docType);
return true; // continue processing
});
WebFetcher — Fetch web pages and convert HTML to markdown (Node.js only):
import { WebFetcher } from 'vectra';
const fetcher = new WebFetcher({ htmlToMarkdown: true });
await fetcher.fetch('https://example.com/page', async (uri, text, docType) => {
await docs.upsertDocument(uri, text, docType);
return true;
});
| Option | Type | Default | Description |
|---|---|---|---|
| `htmlToMarkdown` | `boolean` | `true` | Convert HTML to markdown |
| `summarizeHtml` | `boolean` | `false` | Extract summary from HTML |
| `headers` | `Record<string, string>?` | Browser-like defaults | Custom HTTP headers |
| `requestConfig` | `RequestInit?` | – | Custom fetch options |
BrowserWebFetcher — Fetch web pages in browsers/Electron using Fetch API + DOMParser:
import { BrowserWebFetcher } from 'vectra/browser';
const fetcher = new BrowserWebFetcher();
await fetcher.fetch('https://example.com', async (uri, text, docType) => {
await docs.upsertDocument(uri, text, docType);
return true;
});
Chunking
When a document is upserted, Vectra splits it into chunks using TextSplitter. Each chunk becomes one item in the underlying vector index.
How TextSplitter works
- **Recursive splitting**: tries the first separator in the list. If a resulting piece still exceeds `chunkSize` tokens, it recurses with the next separator.
- **Separators by docType**: markdown files get `\n##`, `\n###`, `\n\n`, `\n`, and a space; Python gets `\nclass`, `\ndef`, `\n\n`, `\n`, and a space; etc.
- **Overlap**: when `chunkOverlap > 0`, each chunk stores leading/trailing overlap tokens for context continuity.
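The recursion can be sketched in a few lines. This is a simplified illustration, not Vectra's actual TextSplitter: token counts are approximated by word counts, and `splitRecursive` is a hypothetical name.

```typescript
// Simplified recursive splitter: try each separator in order; any piece
// that still exceeds the token budget is split again with the next one.
function splitRecursive(text: string, separators: string[], maxTokens: number): string[] {
  const tokenCount = (s: string) => s.split(/\s+/).filter(Boolean).length; // crude proxy
  if (tokenCount(text) <= maxTokens) return [text];
  const [sep, ...rest] = separators;
  if (sep === undefined) return [text]; // out of separators; emit as-is
  const pieces = text.split(sep).filter((p) => p.length > 0);
  return pieces.flatMap((p) =>
    tokenCount(p) > maxTokens ? splitRecursive(p, rest, maxTokens) : [p],
  );
}

const md = '## A\npara one two three\n\npara four\n## B\nshort';
const chunks = splitRecursive(md, ['\n##', '\n\n', '\n'], 4);
// Headings split first, then paragraphs, then lines, until every
// chunk fits within the 4-token budget.
```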
Chunking defaults
LocalDocumentIndex applies its own defaults, which differ from standalone TextSplitter:
| Setting | LocalDocumentIndex | TextSplitter (standalone) |
|---|---|---|
| `chunkSize` | 512 | 400 |
| `chunkOverlap` | 0 | 40 |
| `keepSeparators` | `true` | `false` |
Configuring chunking
const docs = new LocalDocumentIndex({
folderPath: './my-index',
embeddings,
chunkingConfig: {
chunkSize: 256,
chunkOverlap: 50,
keepSeparators: true,
},
});
| Option | Type | Description |
|---|---|---|
| `chunkSize` | `number` | Maximum tokens per chunk |
| `chunkOverlap` | `number` | Overlap tokens between adjacent chunks |
| `keepSeparators` | `boolean` | Preserve separator text at chunk boundaries |
| `docType` | `string?` | Default document type for separator selection |
| `separators` | `string[]?` | Custom separator list (overrides `docType`) |
| `tokenizer` | `Tokenizer?` | Custom tokenizer for token counting |
Chunking tips
- Start with the defaults (512 tokens, 0 overlap); they work well for most use cases
- Smaller chunks (128-256 tokens) improve precision but may lose context
- Add overlap (20-50 tokens) when queries need cross-chunk context
- Keep `chunkSize` under your embedding provider's `maxTokens`
- Use `docType` hints (e.g., `'md'`, `'py'`) to get appropriate separators
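To see what overlap buys, here is a sketch of slicing a token array so each chunk repeats the tail of the previous one. This illustrates the concept only, not Vectra's implementation; `chunkWithOverlap` is a hypothetical helper.

```typescript
// Slice tokens into chunks where each chunk (after the first) starts with
// the last `overlap` tokens of the previous chunk, preserving context
// across chunk boundaries. Requires overlap < chunkSize.
function chunkWithOverlap(tokens: string[], chunkSize: number, overlap: number): string[][] {
  const chunks: string[][] = [];
  const step = chunkSize - overlap; // advance by less than a full chunk
  for (let i = 0; i < tokens.length; i += step) {
    chunks.push(tokens.slice(i, i + chunkSize));
    if (i + chunkSize >= tokens.length) break; // last chunk reached the end
  }
  return chunks;
}

const tokens = 'a b c d e f g h'.split(' ');
const chunks = chunkWithOverlap(tokens, 4, 1);
// chunks → [['a','b','c','d'], ['d','e','f','g'], ['g','h']]
// note 'd' and 'g' appear twice: once as tail, once as head
```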
Querying documents
queryDocuments
const results = await docs.queryDocuments('What is Vectra?', {
maxDocuments: 5,
maxChunks: 20,
});
| Option | Type | Default | Description |
|---|---|---|---|
| `maxDocuments` | `number?` | 10 | Maximum documents to return |
| `maxChunks` | `number?` | 50 | Maximum chunks to evaluate per query |
| `filter` | `MetadataFilter?` | – | Metadata filter (on chunk metadata) |
| `isBm25` | `boolean?` | `false` | Enable hybrid BM25 keyword retrieval |
Returns an array of LocalDocumentResult objects sorted by relevance score.
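The relevance score is cosine similarity between the query embedding and each chunk's embedding. A self-contained sketch of that scoring, using toy 3-dimensional vectors rather than real embeddings:

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1] where 1 means
// the vectors point in the same direction.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const queryVec = [1, 0, 1];
const chunkVecs = [
  { id: 'c1', vector: [1, 0, 1] }, // same direction as the query
  { id: 'c2', vector: [0, 1, 0] }, // orthogonal to the query
  { id: 'c3', vector: [1, 1, 0] },
];
const ranked = [...chunkVecs].sort(
  (x, y) => cosine(queryVec, y.vector) - cosine(queryVec, x.vector),
);
// ranked[0].id → 'c1' (score 1), then 'c3' (~0.5), then 'c2' (0)
```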
Rendering sections
Each LocalDocumentResult can render its matched chunks into readable sections within a token budget — ideal for feeding into an LLM prompt:
const results = await docs.queryDocuments('search query', { maxDocuments: 3 });
for (const result of results) {
console.log('Document:', result.uri, 'Score:', result.score.toFixed(4));
const sections = await result.renderSections(2000, 1, true);
for (const section of sections) {
console.log(' Section score:', section.score.toFixed(4));
console.log(' Tokens:', section.tokenCount);
console.log(' BM25 match:', section.isBm25);
console.log(section.text);
}
}
| Parameter | Type | Description |
|---|---|---|
| `maxTokens` | `number` | Token budget per section |
| `sectionCount` | `number` | Number of sections to render |
| `overlap` | `boolean?` | Include overlapping context between sections |
renderSections merges adjacent high-scoring chunks into contiguous text sections, respecting the token budget. Each section includes:
- `text`: the rendered text
- `score`: relevance score
- `tokenCount`: actual token count
- `isBm25`: whether this was a keyword (BM25) match
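The merging idea can be sketched as follows. This illustrates the concept only, not Vectra's actual algorithm; the `Chunk` shape and `mergeAdjacent` helper are hypothetical.

```typescript
// Chunks that are contiguous in the source document are joined into one
// section as long as the combined token count fits the budget.
interface Chunk { startPos: number; endPos: number; text: string; tokens: number }

function mergeAdjacent(chunks: Chunk[], maxTokens: number): Chunk[][] {
  const sorted = [...chunks].sort((a, b) => a.startPos - b.startPos);
  const sections: Chunk[][] = [];
  for (const chunk of sorted) {
    const current = sections[sections.length - 1];
    if (current) {
      const tokens = current.reduce((sum, c) => sum + c.tokens, 0);
      const last = current[current.length - 1];
      if (last.endPos + 1 === chunk.startPos && tokens + chunk.tokens <= maxTokens) {
        current.push(chunk); // contiguous and within budget: extend section
        continue;
      }
    }
    sections.push([chunk]); // gap in the document or over budget: new section
  }
  return sections;
}

const sections = mergeAdjacent([
  { startPos: 0, endPos: 9, text: 'A', tokens: 50 },
  { startPos: 10, endPos: 19, text: 'B', tokens: 50 },
  { startPos: 40, endPos: 49, text: 'C', tokens: 50 },
], 200);
// sections → [[A, B], [C]]: A and B are contiguous, C is not
```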
Hybrid retrieval (BM25)
By default, queries use semantic retrieval only — ranking chunks by embedding cosine similarity. Enable hybrid mode to also find strong keyword matches using Okapi BM25:
const results = await docs.queryDocuments('error handling patterns', {
maxDocuments: 5,
maxChunks: 20,
isBm25: true,
});
When `isBm25: true`:
- Semantic search runs first, returning the top chunks by cosine similarity
- BM25 keyword search runs on the remaining chunks
- The top BM25 matches are appended to the results
- Each rendered section's `isBm25` flag tells you which source it came from
Use hybrid retrieval when:
- Queries contain exact terms (function names, error codes, identifiers)
- You want to catch matches that semantic search might miss
- Users search with short, keyword-heavy queries
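For intuition, here is a self-contained sketch of Okapi BM25 scoring over a toy tokenized corpus. It uses the standard `k1`/`b` defaults and is not Vectra's internal implementation; `bm25Score` is a hypothetical helper.

```typescript
// Okapi BM25: rare terms get higher IDF weight, repeated terms saturate
// via k1, and longer documents are penalized via b.
function bm25Score(
  query: string[],
  doc: string[],
  corpus: string[][],
  k1 = 1.2,
  b = 0.75,
): number {
  const N = corpus.length;
  const avgdl = corpus.reduce((sum, d) => sum + d.length, 0) / N;
  let score = 0;
  for (const term of query) {
    const n = corpus.filter((d) => d.includes(term)).length; // docs with term
    const idf = Math.log((N - n + 0.5) / (n + 0.5) + 1);
    const tf = doc.filter((t) => t === term).length; // term frequency in doc
    score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * doc.length) / avgdl));
  }
  return score;
}

const corpus = [
  'error handling in typescript'.split(' '),
  'async patterns for node'.split(' '),
  'error codes reference'.split(' '),
];
const query = 'error handling'.split(' ');
// corpus[0] matches both query terms and outscores the partial and
// non-matching documents
```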
FolderWatcher
FolderWatcher keeps a LocalDocumentIndex in sync with a directory of files. It performs an initial full sync, then monitors for real-time changes.
import { FolderWatcher } from 'vectra';
const watcher = new FolderWatcher({
index: docs,
paths: ['./docs', './notes'],
extensions: ['.txt', '.md'],
debounceMs: 500,
});
watcher.on('sync', (uri, action) => console.log(`${action}: ${uri}`));
watcher.on('error', (err, uri) => console.error(`Error: ${uri}`, err));
watcher.on('ready', () => console.log('Initial sync complete'));
await watcher.start();
// Later:
await watcher.stop();
| Option | Type | Default | Description |
|---|---|---|---|
| `index` | `LocalDocumentIndex` | – | The index to sync into |
| `paths` | `string[]` | – | Folders or files to watch |
| `extensions` | `string[]?` | all files | File extensions to include |
| `debounceMs` | `number?` | 500 | Debounce interval in ms |
Node.js only — uses fs.watch() for filesystem monitoring. For the CLI equivalent, see vectra watch.
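The `debounceMs` behavior can be sketched with a generic debounce helper (an illustration of the technique, not FolderWatcher's internals):

```typescript
// Debounce: coalesce a burst of change events so a file saved several
// times in quick succession triggers only one re-index.
function debounce<T extends unknown[]>(fn: (...args: T) => void, ms: number): (...args: T) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: T) => {
    clearTimeout(timer); // reset the quiet window on every new event
    timer = setTimeout(() => fn(...args), ms);
  };
}

let syncs = 0;
const onChange = debounce(() => { syncs += 1; }, 50);
onChange();
onChange();
onChange(); // three rapid events → one sync after ~50 ms of quiet
```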
Deleting documents
await docs.deleteDocument('doc://readme');
Removes the document body, all its chunks from the vector index, and updates the catalog. If the URI doesn’t exist, this is a no-op.
Listing documents
const documents = await docs.listDocuments();
Returns all documents in the catalog.
On-disk layout
A document index folder contains:
my-doc-index/
├── index.json # chunk vectors + chunk metadata (documentId, startPos, endPos)
├── catalog.json # URI ↔ document ID mapping + document count
├── <docId>.txt # document body (plain text)
├── <docId>.json # optional per-chunk metadata
└── ...
With Protocol Buffers enabled, .json files become .pb files (except document bodies which stay .txt). See Storage Formats for details.