Platform Documentation
Excruciating detail on platform mechanics, E2E architecture, design decisions, and real-world implementation samples.
1. Distributed Fleet Topology
Modelsmith runs on the reduced post-handover fleet: two DGX Spark nodes for high-memory workloads plus an RTX 4090 Ubuntu node for training and inference. The nodes are connected over a Tailscale mesh VPN, with 200 Gbps QSFP links where available, and workload placement is kept explicit while the customer hardware handover proceeds.
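For illustration, the sketch below models "explicit workload placement" as a static role map; the file path, host names, and role assignments are placeholders, not the actual fleet configuration.
// config/fleet.ts (hypothetical path; host names and role assignments are placeholders)
type NodeRole = 'inference' | 'training-primary' | 'training-overflow';

interface FleetNode {
  host: string;        // Tailscale hostname on the mesh
  memoryGiB: number;   // unified memory (GB10) or discrete VRAM
  roles: NodeRole[];   // workloads this node is allowed to claim
}

// Placement is declared up front rather than scheduled dynamically.
export const FLEET: FleetNode[] = [
  { host: 'spark-01', memoryGiB: 128, roles: ['inference'] },
  { host: 'spark-02', memoryGiB: 128, roles: ['training-primary'] },
  { host: 'rtx-4090', memoryGiB: 24,  roles: ['training-overflow'] },
];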
2. Grace Blackwell (GB10) Architecture
NVIDIA's Grace Blackwell superchips provide a unified memory space (128 GiB per node) that is shared seamlessly between the ARM CPU and the Blackwell GPU. This eliminates traditional PCIe bottlenecks but introduces shared I/O contention.
Unified Memory I/O Contention
Because memory is unified, massive sequential file reads (like downloading a 60GB safetensors model) flood the Linux page cache. This page cache lives in the same physical RAM modules as the GPU's KV cache. We enforce strict node-level lock files to prevent downloads from evicting actively generating LLM context windows.
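A minimal sketch of what a node-level download lock might look like; the lock path and API are assumptions, not the actual implementation.
// scripts/download-lock.ts (hypothetical; lock path and interface are assumptions)
import { open, unlink } from 'node:fs/promises';

const LOCK_PATH = '/var/lock/modelsmith-download.lock';

// Acquire an exclusive lock by creating the file with O_EXCL semantics ('wx');
// if another download holds it, fail fast instead of flooding the page cache.
export async function withDownloadLock(job: () => Promise<void>): Promise<void> {
  let handle;
  try {
    handle = await open(LOCK_PATH, 'wx'); // throws EEXIST if the lock is held
  } catch {
    throw new Error('Another download holds the node-level lock; retry later');
  }
  try {
    await job();
  } finally {
    await handle.close();
    await unlink(LOCK_PATH);
  }
}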
3. vLLM Engine & TurboQuant35
Modelsmith uses a highly customized branch of vLLM. To maximize the batch size for AdTech inference across the 128 GiB nodes, we quantize both the model weights and the KV cache.
AWQ 4-bit Weights
Qwen3-32B is quantized using Activation-aware Weight Quantization (AWQ), compressing the 32B model from ~64 GB at FP16 (2 bytes per parameter) to ~18 GB of VRAM (4-bit weights plus quantization scales), while retaining >99% of FP16 accuracy.
ADR-038: TurboQuant35
The KV cache is quantized dynamically using TurboQuant35. This yields ~2x the context capacity of standard FP8, allowing us to hit 64 concurrent workers on a single Spark.
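Since the customized vLLM branch serves an OpenAI-compatible endpoint on port 8000, the 64-worker target can be exercised with plain HTTP; the sketch below is illustrative only, and the served model name is an assumption.
// scripts/load-probe.ts (illustrative; uses vLLM's OpenAI-compatible API, model name assumed)
const ENDPOINT = 'http://localhost:8000/v1/chat/completions';

async function oneRequest(i: number): Promise<number> {
  const res = await fetch(ENDPOINT, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'qwen3-32b-awq',                            // assumed served model name
      messages: [{ role: 'user', content: `probe ${i}` }],
      max_tokens: 64,
    }),
  });
  return res.status;
}

// Fire 64 requests at once to mirror the 64-concurrent-worker target on a single Spark.
const statuses = await Promise.all(Array.from({ length: 64 }, (_, i) => oneRequest(i)));
console.log(statuses.every((s) => s === 200) ? 'all 64 workers served' : 'some requests failed');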
4. Tiered Agent Pipeline via LangGraph
AdTech bid requests are highly complex. Instead of relying on a single zero-shot prompt, Modelsmith utilizes a LangGraph Directed Acyclic Graph (DAG) to orchestrate a tiered team of specialized agents.
Input -> Tier 1: Domain -> Tier 2: Profile -> Tier 3: Legal
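A minimal sketch of how the three tiers might be wired as a LangGraph graph; the node names, state shape, and evaluator bodies are assumptions, while the repo snippet that follows shows the real subagent selection and invocation.
// src/lib/multi-agent/tiered-graph.ts (sketch only; state shape and evaluators are assumptions)
import { StateGraph, Annotation, START, END } from '@langchain/langgraph';

// Shared state flowing through the DAG: the raw bid request plus accumulated findings.
const TierState = Annotation.Root({
  bidRequest: Annotation<string>(),
  findings: Annotation<string[]>({ reducer: (a, b) => a.concat(b), default: () => [] }),
});

// Each tier is a node; edges encode the Input -> Domain -> Profile -> Legal flow.
const tieredGraph = new StateGraph(TierState)
  .addNode('domain', async (s) => ({ findings: [`domain check on ${s.bidRequest.slice(0, 32)}`] }))
  .addNode('profile', async () => ({ findings: ['profile tier: placeholder verdict'] }))
  .addNode('legal', async () => ({ findings: ['legal tier: placeholder verdict'] }))
  .addEdge(START, 'domain')
  .addEdge('domain', 'profile')
  .addEdge('profile', 'legal')
  .addEdge('legal', END)
  .compile();

const out = await tieredGraph.invoke({ bidRequest: 'raw OpenRTB bid request JSON here' });
console.log(out.findings);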
// src/lib/multi-agent/langgraph-workflow.ts
// Deduplicate candidate agents and cap fan-out at the configured concurrency limit.
const selectedSubagents = [...new Set(baseAgents)].slice(0, config.maxConcurrentEvaluations);
// Run the compiled LangGraph workflow; thread_id keys checkpointed state per request.
const result = await this.workflow.invoke(initialState, { configurable: { thread_id } });
5. GPU Exclusivity & Memory Isolation
ADR-031 enforces strict GPU exclusivity: running inference and training concurrently in the same unified memory space leads to catastrophic OOMs and CUDA hangs, so each node owns exactly one workload class at a time (a minimal guard sketch follows the node list below).
- Inference node: locked to inference (port 8000); never co-runs training.
- Primary training node: GRPO workload owner.
- Overflow nodes: failover targets when the primary is unreachable.
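A minimal sketch of the kind of pre-flight guard this exclusivity rule implies: before claiming a training run, check whether the node is already serving inference on port 8000. The check and its location are assumptions, not the actual ADR-031 mechanism.
// scripts/exclusivity-guard.ts (hypothetical guard; actual enforcement may differ)
async function inferenceIsLive(host = 'localhost', port = 8000): Promise<boolean> {
  try {
    // vLLM's OpenAI-compatible server exposes /health; any OK response means the GPU is claimed.
    const res = await fetch(`http://${host}:${port}/health`, { signal: AbortSignal.timeout(2000) });
    return res.ok;
  } catch {
    return false; // connection refused or timeout: no inference server on this node
  }
}

if (await inferenceIsLive()) {
  console.error('Refusing to start training: inference engine owns this GPU (ADR-031).');
  process.exit(1);
}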
6. Network & Tailscale Serve Proxy
To avoid port drift and secure the platform within the mesh, Modelsmith leverages Tailscale Serve combined with a Caddy reverse proxy.
# infra/caddy/Caddyfile
:8080 {
    reverse_proxy localhost:3002 {
        lb_try_duration 30s
        lb_try_interval 1s
    }
}
# Tailscale route: http://modelsmith/ -> localhost:8080
7. Expert-per-context Training (ADR-036)
Instead of training one massive LoRA that suffers from catastrophic forgetting, we dynamically route requests to 4 domain-clustered LoRAs: Exchange, Gaming, Campaign, and Trust. This architecture allows deep specialization in niche AdTech fraud vectors without degrading baseline conversational capabilities.
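A minimal sketch of request-to-adapter routing; the adapter IDs and the keyword classifier are assumptions standing in for the real domain-clustering logic.
// src/lib/lora-router.ts (hypothetical; classification is a naive stand-in)
type Domain = 'exchange' | 'gaming' | 'campaign' | 'trust';

// Map each domain cluster to its served LoRA adapter ID (names assumed).
const ADAPTERS: Record<Domain, string> = {
  exchange: 'lora-exchange',
  gaming: 'lora-gaming',
  campaign: 'lora-campaign',
  trust: 'lora-trust',
};

// Naive keyword routing as a placeholder for the real domain classifier.
export function selectAdapter(bidRequest: string): string {
  const text = bidRequest.toLowerCase();
  if (text.includes('campaign')) return ADAPTERS.campaign;
  if (/game|gaming|app/.test(text)) return ADAPTERS.gaming;
  if (/fraud|trust|ivt/.test(text)) return ADAPTERS.trust;
  return ADAPTERS.exchange; // default cluster
}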
8. Fleet Routing Layer & Affinity Scoring
ADR-057a: The training pool is not statically defined. `select_training_host()` executes parallel SSH probes across the fleet (5s timeout). Hosts are scored: primary+idle=100, non-primary+idle=50, busy=10, unreachable=0. The highest-scoring node automatically claims the GRPO workload.
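The scoring itself reduces to a small pure function. The sketch below mirrors the documented scores; the probe mechanics (SSH command, idle detection) are assumptions, and the real `select_training_host()` lives in the repo.
// scripts/affinity-score.ts (illustrative; probe mechanics are assumptions)
interface HostProbe {
  host: string;
  reachable: boolean; // SSH probe answered within the 5 s timeout
  idle: boolean;      // no training/inference workload detected
  isPrimary: boolean; // designated primary training node
}

// Scores per ADR-057a: primary+idle=100, non-primary+idle=50, busy=10, unreachable=0.
function affinityScore(p: HostProbe): number {
  if (!p.reachable) return 0;
  if (!p.idle) return 10;
  return p.isPrimary ? 100 : 50;
}

// The highest-scoring reachable host claims the GRPO workload.
export function selectTrainingHost(probes: HostProbe[]): string | undefined {
  const ranked = [...probes].sort((a, b) => affinityScore(b) - affinityScore(a));
  return ranked.length && affinityScore(ranked[0]) > 0 ? ranked[0].host : undefined;
}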
9. Observability & Telemetry
Traditional GPU monitoring tools fail on GB10 architectures. We use custom heuristics querying `nvidia-smi dmon -s u -c 1` for SM utilization and `free -b` for system memory. LangSmith tracing is natively integrated via the `ProductAnalysisTracer` to track token burn rates across the LangGraph DAG.
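A minimal sketch of the polling heuristic, assuming the default `nvidia-smi dmon` column layout and `free -b` output; the actual collector and its thresholds may differ.
// scripts/gb10-telemetry.ts (illustrative collector; parsing assumes default tool output)
import { execSync } from 'node:child_process';

// SM utilization: `nvidia-smi dmon -s u -c 1` prints commented header lines, then one
// sample row per GPU whose second column is SM utilization in percent.
function smUtilization(): number {
  const out = execSync('nvidia-smi dmon -s u -c 1', { encoding: 'utf8' });
  const row = out.split('\n').find((l) => l.trim() && !l.startsWith('#'));
  return row ? Number(row.trim().split(/\s+/)[1]) : NaN;
}

// System memory: the `free -b` "Mem:" line lists total, used, free, ... in bytes.
function memUsedBytes(): number {
  const out = execSync('free -b', { encoding: 'utf8' });
  const mem = out.split('\n').find((l) => l.startsWith('Mem:'));
  return mem ? Number(mem.trim().split(/\s+/)[2]) : NaN;
}

console.log({ smUtil: smUtilization(), memUsedBytes: memUsedBytes() });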
10. Hardware Provisioning
The fleet is treated as immutable infrastructure: bare-metal nodes are rebuilt from scratch via `bootstrap/setup-spark.sh`, which provisions Docker, OFED drivers, and Tailscale, rather than being mutated in place. The standalone RTX node is reserved primarily for training and inference workloads, with embedding services discovered from the shared embedding registry rather than hard-pinned to that host.
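A minimal sketch of registry-based discovery, assuming a simple JSON registry; the registry location, schema, and service names are all assumptions.
// src/lib/embedding-registry.ts (hypothetical schema and path)
import { readFile } from 'node:fs/promises';

interface EmbeddingEntry {
  name: string;    // embedding service identifier
  baseUrl: string; // where the service is currently reachable on the tailnet
}

// Resolve an embedding service by name instead of hard-pinning a host.
export async function resolveEmbedding(name: string, registryPath = 'config/embedding-registry.json'): Promise<string> {
  const entries: EmbeddingEntry[] = JSON.parse(await readFile(registryPath, 'utf8'));
  const entry = entries.find((e) => e.name === name);
  if (!entry) throw new Error(`No embedding service registered under "${name}"`);
  return entry.baseUrl;
}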