Self-Hosted RAG With DuckDB vss

Retrieval-augmented generation is the default way to ground an LLM in private data: chunk the data, embed it, store the vectors, retrieve the most similar ones at query time, and hand them to the model. The interesting question for an enterprise is not whether to do RAG; it is what stores the vectors. The I-Machine platform runs the entire retrieval pipeline in-process on top of DuckDB and its vss extension. This article describes what that means concretely and why it is the right default for self-hosted infrastructure.

The components

Three pieces live inside the AIHubPlugin and cooperate to make RAG work:

CodeChunker takes a source document (source code, internal documentation, ticket history, whatever the deployment wires up) and emits chunks suitable for embedding. Chunk boundaries respect structural cues (functions, paragraphs, sections) rather than fixed token counts.
VectorStore persists chunks and their embeddings in DuckDB with the vss extension enabled. Embeddings are computed by the LLM backend and stored alongside the source text and the metadata needed for filtering at query time.
RAGPipeline orchestrates the query path: embed the question, search the store, rank, optionally rerank, and return the top-k chunks ready to be appended to the LLM prompt.
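To make the idea of structure-aware chunk boundaries concrete, here is a minimal sketch in Python. It is not the CodeChunker implementation: the names `Chunk` and `chunk_by_paragraph`, the blank-line paragraph rule, and the character budget are all assumptions chosen for illustration. The point it shows is that whole paragraphs are packed into a chunk and a paragraph is never cut in the middle.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    index: int  # position of the chunk within the document
    text: str   # one or more whole paragraphs


def chunk_by_paragraph(document: str, max_chars: int = 800) -> list[Chunk]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    Hypothetical helper: boundaries fall only on blank lines, so a paragraph
    longer than max_chars becomes its own oversized chunk rather than being split.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]
    chunks: list[Chunk] = []
    current: list[str] = []
    size = 0
    for para in paragraphs:
        # Account for the two-character separator when joining paragraphs.
        added = len(para) + (2 if current else 0)
        if current and size + added > max_chars:
            # The budget would overflow: close the current chunk first.
            chunks.append(Chunk(len(chunks), "\n\n".join(current)))
            current, size = [], 0
            added = len(para)
        current.append(para)
        size += added
    if current:
        chunks.append(Chunk(len(chunks), "\n\n".join(current)))
    return chunks
```

A chunker for source code would presumably use function or class boundaries in place of blank lines; the packing logic stays the same.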

Every component runs in the same process as the MCP server. There is no separate vector database deployment, no network hop per query, no driver to manage, no firewall rule to request.

Why DuckDB vss

The standard answer to “where do I store vectors?” in 2026 is a managed service (Pinecone, Qdrant Cloud, Weaviate Cloud) or a dedicated database (Qdrant, Weaviate, Milvus, Chroma) running alongside the application. These work well at scale and ship with operational maturity.

DuckDB with vss is a different shape of answer. DuckDB is an embedded analytical database that runs inside the host process; the vss extension adds approximate nearest-neighbour search on top of column-stored vectors. Three properties make it the right primitive for a self-hosted MCP platform:

Zero external dependency. No service to stand up, no port to open, no credential to rotate. The vector store is a file (or set of files) on the host filesystem.
Zero egress. The embeddings, the chunks, and the query never leave the process. For organisations under GDPR or with sensitive code/document repositories, this is the only architecture that makes the legal conversation simple.
SQL and analytical primitives in the same engine. Filtering by metadata (which project, which date, which access level) is a normal SQL WHERE clause composed with the vector search. There is no second query language to learn and no second consistency model to reconcile.
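The third property can be sketched as a single query that combines a metadata filter with a distance ordering. This is an illustration under stated assumptions, not the platform's VectorStore: the `chunks` table, its columns, the three-dimensional vectors, and the helper names `search_sql` and `demo` are hypothetical. It assumes the `duckdb` Python package is installed and that the vss extension can be installed on the host.

```python
def search_sql(dim: int, top_k: int) -> str:
    """Build a filtered nearest-neighbour query (hypothetical schema).

    The query vector and the project name are bound as parameters;
    dim and top_k are validated integers formatted into the statement.
    """
    if dim <= 0 or top_k <= 0:
        raise ValueError("dim and top_k must be positive")
    return (
        "SELECT id, body, "
        f"array_cosine_distance(embedding, ?::FLOAT[{dim}]) AS distance "
        "FROM chunks WHERE project = ? "
        f"ORDER BY distance LIMIT {top_k}"
    )


def demo() -> list:
    """Run the query against an in-memory database with toy data."""
    import duckdb  # third-party; imported lazily so search_sql has no dependency

    con = duckdb.connect()  # in-memory database, nothing leaves the process
    con.execute("INSTALL vss")
    con.execute("LOAD vss")
    con.execute(
        "CREATE TABLE chunks ("
        "id INTEGER, project VARCHAR, body VARCHAR, embedding FLOAT[3])"
    )
    rows = [
        (1, "billing", "invoice rounding rules", [1.0, 0.0, 0.0]),
        (2, "billing", "refund workflow", [0.0, 1.0, 0.0]),
        (3, "search", "tokenizer notes", [1.0, 0.1, 0.0]),
    ]
    con.executemany("INSERT INTO chunks VALUES (?, ?, ?, ?::FLOAT[3])", rows)
    # HNSW index from the vss extension. Whether the planner uses it for a
    # filtered query may depend on the DuckDB version; the query itself is
    # valid with or without the index.
    con.execute(
        "CREATE INDEX chunks_hnsw ON chunks "
        "USING HNSW (embedding) WITH (metric = 'cosine')"
    )
    return con.execute(search_sql(3, 2), [[1.0, 0.0, 0.0], "billing"]).fetchall()
```

Calling `demo()` returns rows from the `billing` project only, ordered by cosine distance to the query vector. The filter is an ordinary WHERE clause, which is the point the list above makes.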

Operational consequences

Per-query latency is dominated by the embedding step, not by the vector search. The search itself runs in the same address space as the caller; the framing overhead is a function call. A query that would take 50–150 ms against a managed service sitting across the public internet returns in single-digit milliseconds against DuckDB locally.

Backups are normal database backups, taken with the same tooling as the rest of the platform. Disaster recovery is a file restore. Migration to a different machine is a copy of one directory.

Scale is the honest trade-off. DuckDB vss handles millions of vectors comfortably on commodity hardware; it is not designed for hundreds of millions across multiple nodes. For platforms whose total document corpus is in the tens of gigabytes (which describes most enterprise RAG use cases), the ceiling is not in sight. For consumer-facing search over billions of documents, the architecture is different and so is the storage choice.
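A back-of-the-envelope calculation helps place a corpus relative to that ceiling. The sketch below estimates only the raw size of the embedding column, assuming 32-bit floats; the 768-dimension figure is an assumed example, not a platform setting, and the overhead of the index structure is not modelled.

```python
def raw_vector_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for n_vectors embeddings of dim float32 values each."""
    return n_vectors * dim * bytes_per_value


def to_gib(n_bytes: int) -> float:
    """Convert bytes to gibibytes (2**30 bytes)."""
    return n_bytes / 2**30


# One million 768-dimensional float32 vectors:
# 1_000_000 * 768 * 4 = 3_072_000_000 bytes, roughly 2.86 GiB.
one_million = raw_vector_bytes(1_000_000, 768)
print(f"{to_gib(one_million):.2f} GiB")
```

Under these assumptions a few million vectors occupy on the order of ten gigabytes of raw data, which is within reach of a single commodity machine; hundreds of millions would not be.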

What this replaces

The conventional self-hosted alternative is to deploy Qdrant or Weaviate as a service alongside the application. That works, but it introduces a second runtime to monitor, a second backup schedule, a second access-control surface, and a second network path on every query. For a platform whose other components (LLM, agent loop, plugin host) are already in-process, adding a separate vector database service breaks the locality that made the rest cheap.

DuckDB vss brings RAG back inside the locality boundary. The cost is accepting that vector search is one workload among others in the host process rather than its own service; the benefit is that everything else stays simple.

For where this fits in the broader MCP architecture, see Self-Hosted MCP Infrastructure for Enterprise. For the deployment side of the same self-hosted philosophy, see Deploying MCP in a Private Network.