VRAM Allocation for Modular Neural Plugins

The Neural layer of the I-Machine platform hosts several ITensorModule plugins inside the same process. They all want GPU memory, none of them know about each other, and physical VRAM is a single hard-bounded resource. The VRAMManager is the contract that lets them coexist. This article describes its allocation model, scoped to what the class actually does today.

The contract

VRAMManager lives in source/include/Mcp/Neural/VRAMManager.h. A neural plugin obtains GPU residency for a model or a buffer through one of its allocation entry points and receives back a VRAMHandle. The handle is the only interface the plugin holds onto; the manager remains the owner of the underlying device memory.

Three properties of the handle matter at the plugin level:

It is an RAII handle. Its destructor releases the allocation back to the manager. Dropping it on early return, exception, or scope exit cannot leak GPU memory.
It carries a priority declared at allocation time, used by the eviction policy when memory tightens.
It can be a shared-weight handle, where several plugins consume references to the same underlying buffer.
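The RAII property is the one plugin authors rely on most, so it is worth seeing in miniature. The sketch below is not the real VRAMManager: MockVramManager, its nested Handle, allocate, and bytesInUse are hypothetical names, and the "device memory" is only a byte counter. It shows a move-only handle whose destructor returns the allocation to its owner, which is the behaviour described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the manager: byte accounting only, no real GPU calls.
class MockVramManager {
public:
    // Move-only RAII handle; the destructor gives the allocation back.
    class Handle {
    public:
        Handle() = default;
        Handle(MockVramManager* owner, std::uint64_t id, int priority)
            : owner_(owner), id_(id), priority_(priority) {}
        Handle(const Handle&) = delete;
        Handle& operator=(const Handle&) = delete;
        Handle(Handle&& other) noexcept { moveFrom(other); }
        Handle& operator=(Handle&& other) noexcept {
            if (this != &other) { reset(); moveFrom(other); }
            return *this;
        }
        ~Handle() { reset(); }

        bool valid() const { return owner_ != nullptr; }
        int priority() const { return priority_; }

        // Releases the allocation early; safe to call more than once.
        void reset() {
            if (owner_) { owner_->release(id_); owner_ = nullptr; }
        }

    private:
        void moveFrom(Handle& other) {
            owner_ = std::exchange(other.owner_, nullptr);
            id_ = other.id_;
            priority_ = other.priority_;
        }
        MockVramManager* owner_ = nullptr;
        std::uint64_t id_ = 0;
        int priority_ = 0;  // declared at allocation time
    };

    Handle allocate(std::size_t bytes, int priority) {
        std::uint64_t id = nextId_++;
        sizes_[id] = bytes;
        inUse_ += bytes;
        return Handle(this, id, priority);
    }
    std::size_t bytesInUse() const { return inUse_; }

private:
    void release(std::uint64_t id) {
        auto it = sizes_.find(id);
        assert(it != sizes_.end());
        inUse_ -= it->second;
        sizes_.erase(it);
    }
    std::unordered_map<std::uint64_t, std::size_t> sizes_;
    std::size_t inUse_ = 0;
    std::uint64_t nextId_ = 1;
};

// Hypothetical plugin code: an early return cannot leak the allocation.
MockVramManager::Handle loadModel(MockVramManager& mgr, bool failHalfway) {
    auto weights = mgr.allocate(512, /*priority=*/2);
    if (failHalfway) return {};  // `weights` is destroyed here, bytes go back
    return weights;              // success: ownership moves to the caller
}
```

On the failure path the caller gets an empty handle and the manager's byte count is back where it started, without any explicit cleanup in loadModel.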

LRU eviction with priority

When an allocation does not fit, evictForSpace walks the manager’s tracked handles and evicts until the requested space is free. The order is computed by lruOrder, which sorts the handles by priority first and by lastUsed timestamp second. The effect: lower-priority handles are evicted first, so higher-priority handles survive longer; among handles of equal priority, the coldest one goes first.
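The two-key ordering is easy to get backwards, so here is a minimal sketch of it. The function names lruOrder and evictForSpace come from the class, but the signatures, the TrackedAllocation record, and the conventions that a larger priority value means "more important" and that lastUsed is a monotonically increasing tick are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical bookkeeping record for one tracked handle.
struct TrackedAllocation {
    std::string name;
    std::size_t bytes;
    int priority;            // assumption: larger value = more important
    std::uint64_t lastUsed;  // assumption: monotonically increasing tick
    bool resident;
};

// Eviction order over resident entries: lowest priority first,
// and among equal priorities the oldest lastUsed first.
std::vector<std::size_t> lruOrder(const std::vector<TrackedAllocation>& tracked) {
    std::vector<std::size_t> order;
    for (std::size_t i = 0; i < tracked.size(); ++i)
        if (tracked[i].resident) order.push_back(i);
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         if (tracked[a].priority != tracked[b].priority)
                             return tracked[a].priority < tracked[b].priority;
                         return tracked[a].lastUsed < tracked[b].lastUsed;
                     });
    return order;
}

// Evicts in lruOrder until `needed` bytes are free or nothing is left to evict.
// Returns the names of the evicted entries, in eviction order.
std::vector<std::string> evictForSpace(std::vector<TrackedAllocation>& tracked,
                                       std::size_t capacity, std::size_t needed) {
    std::size_t used = 0;
    for (const auto& t : tracked)
        if (t.resident) used += t.bytes;
    assert(used <= capacity);

    std::vector<std::string> evicted;
    for (std::size_t i : lruOrder(tracked)) {
        if (capacity - used >= needed) break;  // enough room already
        tracked[i].resident = false;
        used -= tracked[i].bytes;
        evicted.push_back(tracked[i].name);
    }
    return evicted;
}
```

With a full 1000-byte pool holding "encoder" (400 bytes, priority 2, lastUsed 10), "draft" (300, priority 1, lastUsed 30), and "scratch" (300, priority 1, lastUsed 20), a request for 500 bytes evicts "scratch" and then "draft". The encoder is the coldest entry of the three, yet it stays resident because priority is compared before recency.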

Eviction releases the device-side allocation but keeps the handle’s identity. A subsequent access through the handle triggers re-materialisation from the backing source. The plugin code never branches on evicted-versus-resident state; the handle abstracts that lifecycle.
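One way to picture that lifecycle is a buffer that reloads itself on access. The following is a sketch under stated assumptions, not the VRAMHandle implementation: LazyResidentBuffer, its loader callback, and loadCount are hypothetical, a std::vector stands in for the device allocation, and the first load is deferred to first access for brevity.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical handle-like object; a host vector stands in for device memory.
class LazyResidentBuffer {
public:
    using Loader = std::function<std::vector<float>()>;

    explicit LazyResidentBuffer(Loader loader) : loader_(std::move(loader)) {}

    // Called by the (hypothetical) manager under memory pressure.
    // The object keeps its identity; only the storage is dropped.
    void evict() {
        data_.clear();
        data_.shrink_to_fit();
        resident_ = false;
    }

    // The plugin's only access path: reloads from the backing source if needed.
    const std::vector<float>& data() {
        if (!resident_) {
            data_ = loader_();
            resident_ = true;
            ++loads_;
        }
        return data_;
    }

    bool resident() const { return resident_; }
    int loadCount() const { return loads_; }

private:
    Loader loader_;
    std::vector<float> data_;
    bool resident_ = false;
    int loads_ = 0;
};

// Plugin-side code: no branch on evicted-versus-resident state.
float sumWeights(LazyResidentBuffer& buf) {
    float total = 0.0f;
    for (float w : buf.data()) total += w;
    return total;
}
```

sumWeights gives the same result before and after an evict() call; the only observable difference is that the loader runs a second time.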

Shared weights with refcounting

A common neural deployment runs several plugins that want the same base model loaded. VRAMManager exposes acquireSharedWeights and releaseSharedWeights: plugins that request the same weights receive handles to the same physical buffer, and the manager keeps a refcount. The buffer survives until the last reference is released, and eviction pressure counts it as one allocation rather than as one per consumer.
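The refcounting can be sketched as a small registry keyed by a weights identifier. The names acquireSharedWeights and releaseSharedWeights are taken from the class, but everything else here is assumed for illustration: the SharedWeightRegistry type, the string key, the signatures, and the simplification of returning the refcount instead of a handle.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical registry; byte accounting only, no real device memory.
class SharedWeightRegistry {
public:
    // The first caller for a key pays for the buffer; later callers only
    // bump the refcount. Returns the refcount after the acquire.
    int acquireSharedWeights(const std::string& key, std::size_t bytes) {
        auto it = entries_.find(key);
        if (it == entries_.end()) {
            it = entries_.emplace(key, Entry{bytes, 0}).first;
            inUse_ += bytes;
        }
        return ++it->second.refcount;
    }

    // Drops one reference; the buffer is freed when the count reaches zero.
    // Returns the remaining refcount.
    int releaseSharedWeights(const std::string& key) {
        auto it = entries_.find(key);
        assert(it != entries_.end() && it->second.refcount > 0);
        int remaining = --it->second.refcount;
        if (remaining == 0) {
            inUse_ -= it->second.bytes;
            entries_.erase(it);
        }
        return remaining;
    }

    std::size_t bytesInUse() const { return inUse_; }
    // One entry per shared buffer, regardless of how many consumers it has.
    std::size_t allocationCount() const { return entries_.size(); }

private:
    struct Entry {
        std::size_t bytes;
        int refcount;
    };
    std::unordered_map<std::string, Entry> entries_;
    std::size_t inUse_ = 0;
};
```

Two plugins acquiring the same 4096-byte key leave bytesInUse at 4096 and allocationCount at 1, which is the "one allocation rather than one per consumer" accounting described above. The bytes are returned only when the second release brings the count to zero.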

What this article does not claim

Two clarifications are worth making, because the public framing around “VRAM management” tends to conflate distinct layers:

VRAMManager is a Neural-layer component. The AIHubPlugin, which hosts the production-scale LLM, Whisper, TTS, and image backends, manages its GPU usage through a separate component (GPUResourceManager) and through per-backend bookkeeping (for example currentVRAM_MB_ on AILLMBackend). The VRAMManager contract described above does not sit underneath those backends.
Training-mode resource handover (suspending other workloads to give a training job the full card) is not a feature of VRAMManager as it stands. The class is an allocation manager with eviction, not an arbiter between training and inference.

Where it fits

VRAMManager is one of the foundational contracts of the modular AI layer (Phase 0), alongside EmbeddingBus, KVCachePool, and TrainingHook. A new ITensorModule plugin participates in the shared GPU pool by allocating through the manager and respecting the handle lifecycle. There are no other rules, and the manager does not need to know anything about the plugin.

For the broader modular AI design that this allocation layer supports, see Modular AI: Micro-Transformers as Hybrid Plugins. For the wider infrastructure, see Self-Hosted MCP Infrastructure for Enterprise.