
AI Hub

Unified AI backend integrating local LLMs, speech processing, image generation, RAG pipelines, and model training.

LLM Integration

Local LLM inference via llama.cpp with an OpenAI-compatible HTTP API. Supports model loading, GPU offloading, context management, and multiple concurrent sessions. The Copilot backend provides code-aware assistance with project context.
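As a sketch of how a client might talk to such a server: the snippet below assembles a request body for an OpenAI-style /v1/chat/completions endpoint. The URL, model name, and parameter values are illustrative placeholders, not this project's actual configuration.

```python
import json

# Hypothetical endpoint for a locally running llama.cpp server.
LLAMA_SERVER_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(prompt, system=None, temperature=0.7, max_tokens=512):
    """Assemble the JSON body for an OpenAI-compatible chat completion call."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return {
        "model": "local-gguf",   # llama.cpp serves whichever GGUF model is loaded
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": True,          # stream tokens for interactive sessions
    }

body = build_chat_request("Explain GPU offloading.",
                          system="You are a concise assistant.")
payload = json.dumps(body)
```

Because the API is OpenAI-compatible, existing OpenAI client libraries can be pointed at the local server by overriding the base URL.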

llama.cpp · OpenAI API · GGUF models · GPU offload · context window

Speech & Audio

Speech-to-Text

Whisper integration for real-time transcription. Supports multiple languages and model sizes from tiny to large-v3.
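A minimal sketch of how a transcription request might be parameterized, assuming the public Whisper checkpoint names; the field names here are illustrative, not the project's actual API.

```python
# Public Whisper model sizes, smallest/fastest to largest/most accurate.
WHISPER_MODELS = ["tiny", "base", "small", "medium", "large-v3"]

def build_transcribe_request(audio_path, language=None, model="large-v3"):
    """Parameterize a speech-to-text request (hypothetical field names)."""
    if model not in WHISPER_MODELS:
        raise ValueError(f"unknown Whisper model: {model}")
    req = {"file": audio_path, "model": model, "task": "transcribe"}
    if language:
        req["language"] = language  # ISO code, e.g. "de"; autodetect if omitted
    return req

req = build_transcribe_request("meeting.wav", language="de", model="small")
```

Smaller checkpoints trade accuracy for the lower latency that real-time transcription needs; `large-v3` suits offline batch jobs.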

Text-to-Speech

Neural TTS with configurable voices, speed, and output format. Streaming audio generation for real-time playback.
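The streaming idea can be sketched as a generator that yields fixed-size audio chunks so playback can start before synthesis finishes. `synthesize` below is a stand-in for the real neural TTS engine and just returns silence; chunk size and signatures are assumptions.

```python
CHUNK_BYTES = 4096

def synthesize(text: str, voice: str = "default", speed: float = 1.0) -> bytes:
    # Placeholder for the neural TTS engine: ~1 KiB of silence per word.
    return b"\x00" * (1024 * max(1, len(text.split())))

def stream_tts(text, voice="default", speed=1.0):
    """Yield synthesized audio in CHUNK_BYTES pieces for real-time playback."""
    audio = synthesize(text, voice=voice, speed=speed)
    for i in range(0, len(audio), CHUNK_BYTES):
        yield audio[i:i + CHUNK_BYTES]

chunks = list(stream_tts("hello streaming text to speech"))
```

A real implementation would synthesize incrementally rather than buffering the whole utterance, but the consumer-facing contract (an iterator of audio chunks) is the same.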

RAG Pipeline

Retrieval-Augmented Generation built on DuckDB's vss extension for vector similarity search. Documents are chunked, embedded, and stored in a vector database; queries retrieve relevant context before LLM generation.
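The retrieval step amounts to ranking stored chunk embeddings by similarity to the query embedding. A minimal sketch with cosine similarity over toy 3-d vectors (the real pipeline stores model-generated embeddings in DuckDB):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, store, k=2):
    """store: list of (chunk_text, embedding). Return the k best chunks."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

store = [
    ("GPU offloading notes", [0.9, 0.1, 0.0]),
    ("TTS voice config",     [0.0, 1.0, 0.1]),
    ("llama.cpp context",    [0.8, 0.2, 0.1]),
]
hits = top_k([1.0, 0.0, 0.0], store, k=2)
```

In the actual store this ranking is pushed down into SQL, where the vss extension can serve it from an index instead of a full scan.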

VectorStore

DuckDB-backed vector database with HNSW indexing. The index is promoted automatically once the chunk count exceeds a configured threshold.
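The promotion logic can be sketched as follows: below the threshold a brute-force scan is cheapest, past it an HNSW index is built. The threshold value and method names here are illustrative assumptions.

```python
HNSW_THRESHOLD = 10_000  # illustrative; the real threshold is configurable

class VectorStore:
    def __init__(self):
        self.chunks = []
        self.index = None  # set to "hnsw" once promoted

    def add(self, embedding):
        self.chunks.append(embedding)
        if self.index is None and len(self.chunks) >= HNSW_THRESHOLD:
            self._build_hnsw()

    def _build_hnsw(self):
        # The real store would issue a CREATE INDEX ... USING HNSW
        # statement via DuckDB's vss extension here.
        self.index = "hnsw"

store = VectorStore()
for i in range(HNSW_THRESHOLD):
    store.add([float(i)])
```

Deferring index construction keeps small collections cheap: HNSW build time and memory are only paid once a linear scan would actually be slower.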

CodeChunker

Language-aware code chunking that respects function/class boundaries for accurate code search and retrieval.
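For Python sources, boundary-respecting chunking can be done with the standard `ast` module: split on top-level function and class definitions so each chunk is a semantically complete unit. A minimal sketch (the real chunker handles more languages and nesting):

```python
import ast

def chunk_python(source: str):
    """Split Python source into chunks at top-level def/class boundaries."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)):
            # end_lineno is available on AST nodes since Python 3.8
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks

code = "def a():\n    return 1\n\nclass B:\n    x = 2\n"
parts = chunk_python(code)
```

Chunks that align with definitions embed much better than fixed-size windows, because a window that cuts a function in half produces a vector for a fragment no query will ever resemble.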

RAGPipeline

End-to-end pipeline: ingest sources, chunk, embed, store, query, and augment LLM prompts with relevant context.
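The stages above can be tied together in a small sketch. The word-count "embedding" and overlap scoring are deliberately toy stand-ins for the real embedding model and DuckDB-backed store; only the pipeline shape (ingest → chunk → embed → store → query → augment) mirrors the description.

```python
def embed(text):
    """Toy embedding: a bag-of-words count dict (stands in for a model)."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def overlap(a, b):
    """Toy similarity: shared word mass between two count dicts."""
    return sum(min(v, b.get(w, 0)) for w, v in a.items())

class RAGPipeline:
    def __init__(self):
        self.store = []  # list of (chunk_text, embedding)

    def ingest(self, doc, size=6):
        words = doc.split()
        for i in range(0, len(words), size):
            chunk = " ".join(words[i:i + size])
            self.store.append((chunk, embed(chunk)))

    def query(self, question, k=1):
        q = embed(question)
        ranked = sorted(self.store, key=lambda c: overlap(q, c[1]),
                        reverse=True)
        context = "\n".join(text for text, _ in ranked[:k])
        return f"Context:\n{context}\n\nQuestion: {question}"

pipe = RAGPipeline()
pipe.ingest("whisper handles speech to text while the llm answers "
            "questions about code and gpu offloading")
prompt = pipe.query("gpu offloading")
```

The augmented prompt is what finally reaches the LLM: retrieved chunks prepended as context, followed by the user's question.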

Training & GPU Management

Fine-tuning with LoRA and QLoRA, orchestrated through Python subprocesses. GPU memory is managed with VRAM LRU eviction, plus an exclusive mode that reserves the device for training workloads.
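The VRAM policy can be sketched with an ordered map: loaded models are tracked in least-recently-used order and evicted when a new load would exceed the budget, while exclusive mode frees everything so a training job gets the whole GPU. Sizes, names, and the budget are illustrative.

```python
from collections import OrderedDict

class VramManager:
    def __init__(self, budget_mb=16_000):
        self.budget = budget_mb
        self.loaded = OrderedDict()  # name -> size_mb, oldest entry first

    def used(self):
        return sum(self.loaded.values())

    def load(self, name, size_mb):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # refresh LRU position on reuse
            return
        # Evict least-recently-used models until the new one fits.
        while self.used() + size_mb > self.budget and self.loaded:
            self.loaded.popitem(last=False)
        self.loaded[name] = size_mb

    def exclusive(self):
        """Free all inference models before training claims the GPU."""
        self.loaded.clear()

mgr = VramManager(budget_mb=10_000)
mgr.load("llm", 7_000)
mgr.load("whisper", 2_000)
mgr.load("sdxl", 6_000)  # forces eviction of "llm", the LRU entry
```

On a fixed-memory card like a Tesla V100 this keeps inference models resident opportunistically, while training, which needs predictable headroom, takes the device exclusively.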

QLoRA · LoRA · VRAM LRU · Tesla V100 · exclusive mode