Modular AI: Micro-Transformers as Hybrid Plugins

Hosting one large language model behind a tool surface is the obvious shape of an AI platform. The less obvious shape is what we treat as Phase 0 of modular AI in the I-Machine platform: small specialised models, each in its own plugin, composed behind the same MCP plugin host that serves every other capability. This article describes what the contracts look like and what the first concrete plugins do.

The motivation

A monolithic LLM is forced to do three jobs at once: decide what kind of input it has, decide what kind of action that input deserves, and actually produce the action. Each of those is a different cognitive task with its own latency, accuracy, and cost profile. A 70B-parameter model classifying a query as “math” versus “language” is the wrong tool for the classification step; a symbolic math evaluator answering “42” is the wrong tool for everything that isn’t math.

Modular AI breaks the monolith. Specialised modules handle the parts of the work they are good at; a router decides which module sees which input; the expensive general model handles only what no specialised module can. Each module is small, independently versioned, independently deployed, and addressable through the same protocol as every other plugin.

The contracts

A module is a plugin that implements ITensorModule on top of the usual IMcpPlugin. Four further contracts let it participate in the shared infrastructure:

VRAMManager manages GPU buffers through RAII handles, LRU eviction, shared-weight references, priority overrides, and exclusive mode for training. Every module that allocates on the GPU goes through it.
EmbeddingBus is the shared conduit for vector representations. Modules that produce or consume embeddings publish and subscribe on the bus rather than reaching into each other’s state.
KVCachePool shares attention caches across modules that benefit from them, again behind a contract so the modules themselves don’t know which neighbour wrote a given entry.
TrainingHook is the contract a module implements to expose its parameters to the training backend. Modules that don’t train ignore it.
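To make the VRAMManager idea concrete, here is a minimal sketch of scoped handles combined with LRU eviction. It is an illustration under stated assumptions, not the platform's implementation: the class name VRAMManagerSketch, the acquire method, and the byte-budget bookkeeping are all hypothetical, no GPU memory is touched, and a Python context manager stands in for an RAII handle. Shared-weight references are approximated by a pin count; priority overrides and exclusive mode are left out.

```python
from collections import OrderedDict
from contextlib import contextmanager


class OutOfVRAM(RuntimeError):
    """Raised when a request cannot fit even after evicting every idle buffer."""


class VRAMManagerSketch:
    """Toy byte-budget accountant (hypothetical). No real GPU memory is touched."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        # name -> [size_bytes, pin_count]; dict order doubles as LRU order.
        self._buffers = OrderedDict()

    def resident(self):
        """Buffer names from least to most recently used."""
        return list(self._buffers)

    def _make_room(self, needed):
        if needed > self.capacity:
            raise OutOfVRAM(f"{needed} bytes exceeds total capacity")
        # Walk from least to most recently used until enough space is free.
        for name in list(self._buffers):
            if self.capacity - self.used >= needed:
                break
            size, pins = self._buffers[name]
            if pins == 0:  # only idle (unpinned) buffers may be evicted
                del self._buffers[name]
                self.used -= size
        if self.capacity - self.used < needed:
            raise OutOfVRAM(f"need {needed} bytes, everything left is pinned")

    @contextmanager
    def acquire(self, name, size_bytes):
        """Scoped handle: the buffer is pinned for the duration of the block."""
        if name not in self._buffers:
            self._make_room(size_bytes)
            self._buffers[name] = [size_bytes, 0]
            self.used += size_bytes
        self._buffers.move_to_end(name)  # mark as most recently used
        self._buffers[name][1] += 1      # pin
        try:
            yield name
        finally:
            self._buffers[name][1] -= 1  # unpin; stays resident until evicted


mgr = VRAMManagerSketch(capacity_bytes=100)
with mgr.acquire("router_weights", 60):
    pass
with mgr.acquire("lang_weights", 30):
    pass
with mgr.acquire("kv_cache", 50):
    # router_weights was the least recently used idle buffer, so it was evicted.
    residents = mgr.resident()
```

Releasing a handle only unpins the buffer. It stays resident until another allocation needs the space, which is what lets a module that is called again shortly afterwards skip the reload.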

The Tensor type is a thin opaque handle to a buffer, carrying shape, dtype, and the residency information the manager needs. The contracts above traffic in Tensor, not raw pointers.
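A handle of this kind might look roughly like the following sketch. The field names, the Residency states, and the dtype strings are assumptions made for illustration; the source only states that the handle carries shape, dtype, and residency information.

```python
from dataclasses import dataclass, replace
from enum import Enum
from math import prod

# Bytes per element for a few illustrative dtypes (assumed names).
ITEM_BYTES = {"f32": 4, "f16": 2, "i8": 1}


class Residency(Enum):
    HOST = "host"
    DEVICE = "device"
    EVICTED = "evicted"


@dataclass(frozen=True)
class TensorHandle:
    """Opaque, immutable description of a buffer. It holds no data itself."""

    buffer_id: int      # key into the manager's tables, not a memory address
    shape: tuple
    dtype: str
    residency: Residency

    @property
    def nbytes(self):
        """Size the manager would have to account for."""
        return prod(self.shape) * ITEM_BYTES[self.dtype]

    def moved_to(self, residency):
        """Return a new handle; the original is never mutated."""
        return replace(self, residency=residency)


t = TensorHandle(buffer_id=7, shape=(2, 3, 4), dtype="f16",
                 residency=Residency.DEVICE)
evicted = t.moved_to(Residency.EVICTED)
```

Because the handle carries an identifier and not a pointer, a consumer cannot dereference memory the manager has already evicted; it has to ask the manager first.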

LangModule: The general-purpose language module

LangModule is an IMcpPlugin, an ITensorModule, and an ITrainingHook all at once. It wraps a llama.cpp HTTP backend and exposes four MCP tools:

lang_generate: standard generation against the loaded model.
lang_embed: produce embeddings for downstream consumers.
lang_score: score a candidate against a prompt for ranking and routing.
lang_status: report load state, residency, and warm-up progress.
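Since these are ordinary MCP tools, a client reaches them with a standard tools/call JSON-RPC request. The sketch below builds such a request for lang_score. The envelope follows the MCP convention; the argument names prompt and candidate are assumptions, because the article does not give the tools' input schemas.

```python
import json


def tool_call_request(request_id, tool, arguments):
    """Serialise an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


# Argument names are hypothetical; the real schema belongs to the plugin.
request = tool_call_request(
    1,
    "lang_score",
    {"prompt": "Translate to French:", "candidate": "Bonjour"},
)
decoded = json.loads(request)
```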

Because LangModule implements the bus and pool contracts, the same instance can serve as the embedding producer for a vector search elsewhere and as the model behind a router decision elsewhere again. The plugin host loads it once; the capabilities are addressable from every other plugin.
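The publish and subscribe pattern behind the EmbeddingBus can be sketched in a few lines. Everything here is hypothetical: the class name, the topic strings, and the callback signature are illustrative choices, and plain Python lists stand in for Tensor handles.

```python
from collections import defaultdict


class EmbeddingBusSketch:
    """Minimal in-process publish/subscribe bus keyed by topic (hypothetical)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register callback(producer_name, vector) for a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, producer, vector):
        """Deliver a vector to every subscriber; return how many received it."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(producer, vector)
        return len(callbacks)


bus = EmbeddingBusSketch()
search_index = []    # stands in for a vector-search plugin
router_inbox = []    # stands in for a router that also consumes embeddings

bus.subscribe("query_embedding", lambda who, vec: search_index.append((who, vec)))
bus.subscribe("query_embedding", lambda who, vec: router_inbox.append(len(vec)))

# One producer, two independent consumers, no direct references between them.
delivered = bus.publish("query_embedding", "lang_embed", [0.1, 0.2, 0.3])
unheard = bus.publish("unused_topic", "lang_embed", [0.0])
```

The producer never learns who consumed the vector, which is the property the contract is meant to guarantee.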

NeuralRouter: The CPU-only dispatcher

NeuralRouter sits in front of the language capabilities and decides who handles a given input. It does not load a model. Its job is to classify cheaply and dispatch, and cheap here means CPU heuristics.

Two components live inside it. The InputClassifier applies fast heuristics on CPU to label inputs by domain. The MathEvaluator handles the math cases without touching the GPU, returning a numeric answer when it can recognise an expression. Whatever the classifier can route deterministically never reaches the large model; the large model handles only the residual.
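The two components can be approximated with the standard library alone. This is a sketch of the general technique, not the router's actual heuristics: the regular expression, the two labels, and the set of supported operators are assumptions. The evaluator walks a parsed syntax tree instead of calling eval, so only plain arithmetic is ever executed.

```python
import ast
import operator
import re

# Heuristic (assumed): only digits, whitespace, and arithmetic punctuation.
_MATH_CHARS = re.compile(r"^[\d\s+\-*/().]+$")

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def classify(text):
    """Label an input 'math' or 'language' without loading any model."""
    stripped = text.strip()
    if _MATH_CHARS.match(stripped) and any(c.isdigit() for c in stripped):
        return "math"
    return "language"


def evaluate(expression):
    """Return a number, or None when the expression is not recognised."""

    def walk(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")

    try:
        return walk(ast.parse(expression.strip(), mode="eval").body)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return None


label = classify("6 * 7")
answer = evaluate("6 * 7")
unsupported = evaluate("2 ** 8")  # exponent is outside the sketch's operator set
```

Returning None instead of raising is what lets a caller fall through to the general model: an input can look like math to the classifier and still be something the evaluator declines.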

The router exposes two tools: neural_route for the dispatch decision and neural_route_status for the watching UI. It is small enough to be production-ready on day one: thirty-eight tests in the test suite cover its full behaviour, and all pass.

What modular looks like at runtime

A query enters neural_route. The classifier labels it and the router dispatches. A math query goes straight to the evaluator and returns in microseconds. A language query goes to LangModule, which loads its weights through the VRAMManager if they are not already resident, runs inference, and returns a result. A downstream consumer of embeddings subscribes on the EmbeddingBus to whichever module produced the relevant vectors.
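The dispatch flow described above can be sketched as a small registry with a fallback. The names RouterSketch, register, and route are hypothetical, and the classifier and both handlers are deliberately trivial stubs; the point is the control flow, in which a specialised handler may decline and the general module picks up the residual.

```python
class RouterSketch:
    """Classify, try the specialised handler, fall back to the general module."""

    def __init__(self, classify, fallback):
        self._classify = classify
        self._fallback = fallback
        self._handlers = {}

    def register(self, label, handler):
        """Handlers return a string answer, or None to decline the input."""
        self._handlers[label] = handler

    def route(self, text):
        """Return (who_handled_it, answer)."""
        label = self._classify(text)
        handler = self._handlers.get(label)
        if handler is not None:
            answer = handler(text)
            if answer is not None:
                return label, answer
        return "general", self._fallback(text)


def looks_like_math(text):
    """Stub classifier: arithmetic characters only means 'math'."""
    if text and all(c in "0123456789+-*/(). " for c in text):
        return "math"
    return "language"


def add_only(text):
    """Stub evaluator that understands nothing but 'a + b' on integers."""
    parts = text.split("+")
    if len(parts) == 2 and all(p.strip().isdigit() for p in parts):
        return str(int(parts[0]) + int(parts[1]))
    return None


def general_model(text):
    """Stub for the expensive general module."""
    return f"[general model] {text}"


router = RouterSketch(looks_like_math, general_model)
router.register("math", add_only)

handled_cheaply = router.route("40 + 2")
declined = router.route("40 * 2")   # classified as math, evaluator declines
residual = router.route("Summarise this paragraph")
```

Adding a capability in this sketch is one register call, which mirrors the claim that new modules need no changes to the core.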

The host owns the lifecycle, the GPU, the bus, and the cache. The modules own their domain knowledge. Adding a new module means writing a new plugin against the same four contracts; no core changes are required for the host to discover and schedule it.

What Phase 0 does not solve yet

The plumbing is in place. The harder problems live in the layers above: composing modules into pipelines that learn over time, propagating gradient information across module boundaries when several participate in a training run, and managing the LoRA inheritance graph that lets a specialised module fine-tune from a general one without duplicating the base weights. Those are the topics that drive Phase 1 of the modular AI work.

For the broader infrastructure context, see Self-Hosted MCP Infrastructure for Enterprise. For the VRAM management layer this modular work relies on, see VRAM Allocation for Modular Neural Plugins.