Self-Hosted MCP Infrastructure for Enterprise: Architecture, Plugins, and Deployment
The Model Context Protocol (MCP) is rapidly becoming the standard way to plug AI models into the real systems an organisation runs on: databases, internal APIs, file shares, ticketing tools, CRMs, IDEs. Most of the public conversation about MCP, however, revolves around quick wins: spinning up a server in TypeScript, wiring it to Claude Desktop, calling a couple of tools. That works as a demo. It is not what an enterprise needs.
This article is the long-form version of how we think about MCP at I-Machine. It covers the protocol choices, the plugin model, the AI capabilities you actually need to host, the agent orchestration layer on top, and the realities of shipping all of that to production inside a corporate network. It links out to deeper articles for each individual decision. Think of this article as the map and the others as the territory.
1. Why self-hosted MCP
The default mental model when developers discover MCP is “tools for Claude”. That framing obscures the actual value proposition for an enterprise. There are four reasons an organisation runs MCP on its own infrastructure rather than as a hosted SaaS adapter.
Data residency and privacy. A useful AI agent invariably needs to read sensitive data: customer records, financial transactions, source code, contracts. An MCP server that runs on your hardware never sends a row of that data to a third party. The LLM, if local, doesn’t either. For European organisations subject to GDPR, this is not a preference; it is the only viable architecture for many use cases.
Latency. Every external hop adds tens to hundreds of milliseconds. An agent that issues twenty tool calls during one reasoning turn feels qualitatively different when each call is 1 ms versus 100 ms. We routinely see agent loops complete in 2–5 seconds on a self-hosted setup that would take 30 seconds when the tool layer is hosted across the public internet.
Cost. A single mid-sized enterprise running tens of thousands of tool invocations a day quickly outruns the unit economics of any per-call SaaS pricing model. Local hosting amortises the fixed cost of a few GPU-equipped servers across all teams.
Control. Self-hosted MCP lets the organisation decide what gets versioned, what gets audited, what gets deprecated: the plugin set, the LLM weights, the RBAC matrix, the audit log schema. All of it is your code, in your VCS, deployed by your pipelines.
2. Anatomy of an enterprise MCP server
An MCP server is conceptually small: it speaks a JSON-RPC dialect, advertises a set of tools, and answers calls. The complexity lies entirely in what surrounds that core.
Our server is written in C++26, compiled with GCC 16, and exposes its endpoints over a persistent WebSocket connection speaking JSON-RPC 2.0. The reasons for those choices are not aesthetic. We needed:
- Per-call latency in the low milliseconds, even when tools wrap hot loops over local data (a database query plan, a DuckDB scan, a vector search).
- Bidirectional streaming, so the server can push notifications (a plugin reloads, a long-running job completes, a peer agent emits a proposal) without the client polling.
- A plugin model where new capabilities can be added at runtime without restarting the server or interrupting connected clients.
- Modern C++ ergonomics for the parts where they matter most: reflection (for tool schemas), coroutines (for I/O), and ranges (for the data layer).
We explore the protocol decision in detail in MCP vs REST: Protocol Design for AI Agents, and the C++26 decision in Why We Built Our MCP Server in C++26.
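On the wire, that protocol core is small enough to show. The sketch below hand-assembles the JSON-RPC 2.0 request a client sends to invoke a tool (the tools/call method from the MCP specification); the tool name query_db is hypothetical, and a production server would of course use a real JSON layer rather than string concatenation.

```cpp
#include <string>

// Builds the JSON-RPC 2.0 "tools/call" request an MCP client sends over the
// WebSocket. Hand-assembled for illustration only; "query_db" below is a
// hypothetical tool name, not one from our plugin set.
std::string make_tool_call(int id, const std::string& tool,
                           const std::string& args_json) {
    return std::string("{\"jsonrpc\":\"2.0\",\"id\":") + std::to_string(id) +
           ",\"method\":\"tools/call\",\"params\":{\"name\":\"" + tool +
           "\",\"arguments\":" + args_json + "}}";
}
```

The response travels back over the same socket with the matching id, which is what makes bidirectional streaming and server-initiated notifications natural on this transport.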
3. The plugin model
A bare MCP server with three hand-written tools is a curiosity. A production MCP infrastructure has dozens of tools, grouped into cohesive capability bundles, owned by different teams, evolving on different cadences. The architecture that makes that scale is a plugin host.
Each plugin in our system is a .so shared library that registers itself with the host on load. It declares its tools, its data dependencies, its configuration schema, and optionally a UI client counterpart. Plugins are loaded by the host at startup and can be hot-reloaded at runtime, an essential property when you have multiple teams iterating on capabilities and you don’t want a server restart to interrupt active agent loops.
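The host/plugin boundary can be sketched in a few lines. Everything here is illustrative: IPlugin, create_plugin and PluginHost are stand-in names, and the in-process load() stands in for the real dlopen/dlsym path.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Illustrative plugin ABI: each .so exports an extern "C" factory the host
// resolves with dlsym() after dlopen(). Names are a sketch, not our
// production interface.
struct IPlugin {
    virtual ~IPlugin() = default;
    virtual std::string name() const = 0;
    virtual std::vector<std::string> tools() const = 0;  // tools it registers
};

// Example plugin as it would live inside one shared library:
struct EchoPlugin : IPlugin {
    std::string name() const override { return "echo"; }
    std::vector<std::string> tools() const override { return {"echo.say"}; }
};

// The exported factory symbol:
extern "C" IPlugin* create_plugin() { return new EchoPlugin; }

// Host-side registry; hot-reload is conceptually unload() followed by a
// fresh load() of the new library version.
class PluginHost {
    std::map<std::string, std::unique_ptr<IPlugin>> plugins_;
public:
    void load(std::unique_ptr<IPlugin> p) { plugins_[p->name()] = std::move(p); }
    void unload(const std::string& name) { plugins_.erase(name); }
    std::vector<std::string> all_tools() const {
        std::vector<std::string> out;
        for (const auto& [_, p] : plugins_)
            for (const auto& t : p->tools()) out.push_back(t);
        return out;
    }
};
```

The real host additionally has to drain in-flight calls and detach client subscriptions before dropping the old library, which is where the dlopen traps discussed below come from.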
The structural pattern we converged on across the forty-plus plugins currently in production is hexagonal architecture: a pure domain layer, an application-service layer with explicit ports, and adapters at the edges, one for the MCP server, one for the native ImGui client, one for the Angular web client. The cost of that discipline is some up-front design; the payoff is that we can refactor any layer in isolation, swap UI front-ends without touching domain code, and test the application layer without spinning up a server. We document the pattern at length in Hexagonal Architecture for Plugin Systems.
Hot-reload itself is non-trivial in C++. The interaction between dlopen, the host’s plugin registry, the message queue, and active client subscriptions hides several traps that only manifest under load. We covered those in Plugin Hot-Reload in C++: Three dlopen Constraints.
Tool schemas are generated automatically from C++ structs via the C++26 Reflection feature (P2996, available on GCC 16). That means a tool author writes a normal C++ struct for the input arguments, and the JSON schema served to the LLM is derived from it at compile time: no manual JSON, no drift between schema and signature. See C++26 Reflection in Production for the details.
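To make the mapping concrete, here is a hypothetical tool-argument struct next to the JSON schema that reflection derives from it. Because P2996 support is not yet widespread, the sketch writes the derived schema out by hand; the struct, tool name, and function are invented for illustration.

```cpp
#include <string>

// Input arguments for a hypothetical "search_tickets" tool. With P2996
// reflection the schema below is derived from this struct at compile time;
// here the derivation's result is written by hand to show the mapping.
struct SearchTicketsArgs {
    std::string query;   // -> "query": {"type":"string"}
    int         limit;   // -> "limit": {"type":"integer"}
};

// Stand-in for the reflection-generated schema function.
std::string schema_for_search_tickets() {
    return R"({"type":"object","properties":{)"
           R"("query":{"type":"string"},)"
           R"("limit":{"type":"integer"}},)"
           R"("required":["query","limit"]})";
}
```

The point of the compile-time derivation is precisely that this second artefact cannot drift: rename a field and the served schema changes with it.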
4. AI capabilities you actually need to host
An MCP server is a bridge to capabilities, but in an enterprise setting many of the capabilities you want to expose are AI capabilities themselves: LLM inference, embeddings, speech-to-text, text-to-speech, image generation, vector search, model fine-tuning. Hosting these inside the same infrastructure has compounding advantages: a shared GPU pool, unified RBAC, a single audit trail, coherent observability.
Our AIHubPlugin bundles all of these behind a uniform tool surface. Under the hood, it manages :
- Local LLM inference via llama.cpp, exposed through an OpenAI-compatible HTTP layer, with model lifecycle managed by a long-lived server process.
- A retrieval pipeline built on DuckDB’s vss extension: embeddings, chunking, vector search, all running in-process, with no egress to a managed vector database. Latency and privacy both benefit.
- Whisper for STT, a small TTS backend, and Stable Diffusion when image generation is needed.
- A training backend that wraps QLoRA / LoRA fine-tuning, with the GPU pool able to switch to exclusive mode when training runs.
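At its core, the retrieval step of that pipeline is a k-nearest-neighbour search over embedding vectors. The brute-force cosine version below is not the DuckDB vss code path, just the same operation spelled out, to show what the pipeline computes in-process.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity between two embedding vectors of equal length.
double cosine(const std::vector<double>& a, const std::vector<double>& b) {
    double dot = 0, na = 0, nb = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Indices of the k stored chunks most similar to the query embedding.
std::vector<std::size_t> top_k(const std::vector<std::vector<double>>& chunks,
                               const std::vector<double>& query,
                               std::size_t k) {
    std::vector<std::size_t> idx(chunks.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    k = std::min(k, idx.size());
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
        [&](std::size_t l, std::size_t r) {
            return cosine(chunks[l], query) > cosine(chunks[r], query);
        });
    idx.resize(k);
    return idx;
}
```

An indexed extension replaces the linear scan with an approximate index, but the contract, embeddings in, nearest chunk IDs out, is the same.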
GPU memory is the constraint that decides whether several models can be hosted together. Our VRAMManager allocates GPU buffers behind RAII handles, evicts under LRU when memory pressure hits, shares weights across plugins that need the same model, and lets the agent on the critical path override eviction priority. When a training job runs, the manager switches to exclusive mode so the trainer gets the full card.
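A minimal sketch of the eviction idea, with byte counts standing in for real GPU allocations and illustrative names throughout (this is not the production VRAMManager, which adds RAII handles, weight sharing, priority overrides, and exclusive mode):

```cpp
#include <cstddef>
#include <list>
#include <map>
#include <stdexcept>
#include <string>

// LRU-managed model residency: loading a model evicts least-recently-used
// entries until it fits. Re-loading an already-resident model is omitted
// for brevity.
class VramManagerSketch {
    std::size_t capacity_;
    std::size_t used_ = 0;
    std::list<std::string> lru_;                 // front = most recently used
    std::map<std::string, std::size_t> sizes_;
public:
    explicit VramManagerSketch(std::size_t capacity) : capacity_(capacity) {}

    void load(const std::string& model, std::size_t bytes) {
        if (bytes > capacity_) throw std::runtime_error("model larger than VRAM");
        while (used_ + bytes > capacity_) evict_lru();
        lru_.push_front(model);
        sizes_[model] = bytes;
        used_ += bytes;
    }
    void touch(const std::string& model) {       // mark as recently used
        lru_.remove(model);
        lru_.push_front(model);
    }
    bool resident(const std::string& model) const { return sizes_.count(model) > 0; }
    std::size_t used() const { return used_; }
private:
    void evict_lru() {
        const std::string victim = lru_.back();
        lru_.pop_back();
        used_ -= sizes_[victim];
        sizes_.erase(victim);
    }
};
```

The priority override mentioned above amounts to pinning the entry the agent loop needs so evict_lru skips it; exclusive mode drains the list entirely before a training run.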
5. Agent orchestration
Tools and AI primitives are necessary but not sufficient. What actually unlocks value for the business is an agent loop, a control-flow primitive that lets an LLM observe state, propose actions, take them, observe again, until a goal is reached.
Our agent runtime implements the ReAct pattern (reason ↔ act) on top of MCP. It supports multiple backends (local model, Claude API), persists conversation state for resume and replay, exposes a tool surface filtered by RBAC, and emits streaming notifications so a watching UI can render the chain of thought in real time.
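Stripped of backends, persistence, and streaming, the loop itself is short. The skeleton below uses stand-in types (LlmStep and both callbacks are illustrative), not our runtime:

```cpp
#include <functional>
#include <string>

// One step of model output: either a tool request or a final answer.
struct LlmStep {
    bool        is_tool_call;  // true: act via a tool; false: final answer
    std::string tool;          // tool to call when is_tool_call is true
    std::string payload;       // tool arguments, or the final answer text
};

using Llm  = std::function<LlmStep(const std::string& transcript)>;
using Tool = std::function<std::string(const std::string& tool,
                                       const std::string& args)>;

// ReAct skeleton: reason, act, observe, repeat until the model answers
// or the step budget runs out.
std::string run_agent(const Llm& llm, const Tool& call_tool,
                      std::string transcript, int max_steps = 8) {
    for (int i = 0; i < max_steps; ++i) {
        LlmStep step = llm(transcript);                        // reason
        if (!step.is_tool_call) return step.payload;           // goal reached
        std::string obs = call_tool(step.tool, step.payload);  // act
        transcript += "\nObservation: " + obs;                 // observe
    }
    return "";  // step budget exhausted
}
```

Everything else in the runtime, persistence, RBAC filtering of the tool surface, streamed notifications, hangs off this loop's three points of contact: the model call, the tool call, and the transcript.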
Where we go further than most off-the-shelf agent frameworks is in the mission-as-template abstraction. An agent is not a hardcoded prompt; it is a typed capability schema with input bindings, allowed tools, and an output contract. A user invokes a mission by filling in inputs against the schema; the schema itself is server-side, versioned, and audited. That removes an entire class of prompt-injection and configuration-drift problems, and lets non-technical users instantiate agents without writing prompts.
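A mission template can be as simple as a record plus a validation step performed before any agent is instantiated; the field and mission names below are illustrative:

```cpp
#include <map>
#include <set>
#include <string>

// Sketch of a mission-as-template record: a versioned, server-side schema
// constraining what an instantiated agent may receive and do.
struct MissionTemplate {
    std::string           name;
    int                   version;
    std::set<std::string> required_inputs;  // keys the user must bind
    std::set<std::string> allowed_tools;    // RBAC-filtered tool surface
};

// Rejects an invocation whose input bindings don't satisfy the template;
// no free-form prompt ever reaches the agent runtime.
bool valid_invocation(const MissionTemplate& t,
                      const std::map<std::string, std::string>& inputs) {
    for (const auto& key : t.required_inputs)
        if (!inputs.count(key)) return false;
    return true;
}
```

Because the template, not the user, decides the allowed_tools set, a compromised or careless input binding cannot widen what the agent is permitted to call.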
6. Multi-client by design
Internal AI tools live in three places: a desktop interface for power users, a web interface for everyone else, and command-line or IDE integrations for developers. A common mistake is to build each of these against a different backend. Tool surfaces drift, auth diverges, audit logs scatter.
Because everything in our infrastructure goes through the same WebSocket JSON-RPC layer, the same plugin tools are reachable from :
- A native ImGui client compiled for Linux desktops
- An Angular web client served alongside the server
- A WebAssembly build of the native client, hosted in any browser
- Command-line bridges (mcp-ws-bridge) for scripting and CI
All four speak the same protocol against the same plugins. There is one RBAC enforcement point. There is one audit trail. The cost of adding a tool is paid once and consumed everywhere.
7. Deployment
A server that only the developer can run is not infrastructure. For a self-hosted MCP stack to be production-grade, deployment has to be boringly repeatable.
We package the server and each plugin as .deb artefacts, publish them to a private APT repository served by nginx and signed with a project-owned GPG key, and run the server under a systemd unit on the target host. A new plugin version is an apt install away, rollback is apt install <pkg>=<prev-version>, and the host has no idea any of this happens unless we tell it.
Configuration is versioned, layered per environment, and held outside the artefacts. Secrets are pulled at boot from the organisation’s secret manager, so binaries never carry plaintext credentials. Logs are structured and shipped to the organisation’s existing observability stack, with enough correlation context to reconstruct an incident deterministically.
8. What’s next
The architecture we’ve described here is what runs today. The next frontier we’re actively building is modular AI: composing small specialised models, a router, a math evaluator, a domain-specific summariser, a code-understanding model, behind the same plugin host that serves everything else. The idea is to treat micro-transformers the same way we treat database connectors: as plugins with explicit capability contracts, hot-reloadable, scoped by RBAC, accountable in the same audit trail.
If your organisation is moving from “a few prompts in a Slack bot” to “AI agents that take real action on real systems”, the foundations described in this article are most of what stands between you and that outcome. Get in touch; we help organisations get there.
