Modelsmith
Vision · Methodology · Roadmap

Learn

Deep dive into the Modelsmith vision — what autonomous, agentic-first, and config-driven mean in practice.

What Is Modelsmith?

Create, evaluate, and operate domain-specialised language models for your workflows.

Modelsmith helps you continuously produce low-latency, domain-trained small language models that outperform frontier general models on narrow, commercially important tasks.

Modelsmith automates the full lifecycle of your domain-specialised models: evaluate how well a model performs on a specific task, train it to improve, re-evaluate to confirm the improvement, and promote it to production when it meets your quality gates. Modelsmith is designed so that your engineers, product managers, analysts, and domain experts can participate through an agentic operating surface instead of a notebook-only workflow. Agents can identify what the model gets wrong, generate candidate training data, and prepare promotion evidence; your team approves the judgment gates that change evals, datasets, policy, or production state.

You get domain-native models that are faster, smaller, and more accurate on your target task than prompting a general-purpose frontier model — while remaining governable through structured promotion, rollback, and lineage tracking.

Get Started

Fork the repo, run Modelsmith in your own environment, and send reusable platform improvements back upstream.

The starting path is the same from the web app, CLI, or an agentic coding workflow. Your team operates from a customer-controlled fork, connects private data inside your environment, and contributes generic platform improvements back to Agentsia through pull requests.

01
Fork the repo

Create an organisation-owned fork of the Agentsia Modelsmith repo. Deployment-specific configuration and private connectors start there.

02
Clone and install

Install dependencies in your approved development environment and run baseline checks before attaching sensitive data.

03
Configure private infrastructure

Set secrets, training hosts, inference substrates, storage paths, network policy, and deployment settings outside the upstream repo.

04
Stage controlled data

Connect proprietary datasets, operational failure modes, approved documents, and domain-specific knowledge inputs inside your environment.

05
Select the first wedge

Pick one commercially important workflow with measurable success criteria, clear frontier baselines, and a governed eval set.

06
Run onboarding validation

Run schema checks, smoke tests, data access checks, and a small eval to prove the fork and infrastructure are ready (a sketch of such a script follows this list).

07
Run the first specialisation loop

Evaluate, train, re-evaluate, inspect regressions, prepare rollback, and produce promotion evidence from the same accepted state.

08
Contribute reusable improvements

Open upstream PRs for platform changes such as bugfixes, adapters, runbooks, API improvements, UI fixes, docs, and agent-operating patches.
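
Step 06 is the most script-like of the eight. Below is a minimal sketch of what onboarding validation could look like in Python; the check names and data path are illustrative assumptions, and the real entry points are defined by your fork. Only defaults.training_hosts in config/clusters.json is named elsewhere on this page.

    import json
    import sys
    from pathlib import Path

    CONFIG = Path("config/clusters.json")

    def check_schema() -> bool:
        """Validate that the cluster config parses and carries the one key
        this page names (defaults.training_hosts); other keys vary by fork."""
        cfg = json.loads(CONFIG.read_text())
        return "training_hosts" in cfg.get("defaults", {})

    def check_data_access(paths: list[str]) -> bool:
        """Confirm staged datasets are reachable from this environment."""
        return all(Path(p).exists() for p in paths)

    def main() -> int:
        checks = {
            "schema": check_schema(),
            "data_access": check_data_access(["data/staged"]),  # hypothetical path
        }
        for name, ok in checks.items():
            print(f"{name}: {'ok' if ok else 'FAILED'}")
        return 0 if all(checks.values()) else 1

    if __name__ == "__main__":
        sys.exit(main())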

Flows upstream

Bugfixes, integration adapters, runbooks, API improvements, UI fixes, documentation, and agent-operating patches.

Stays in your environment

Private datasets, eval evidence, specialist weights, secrets, private configs, deployment state, and business-specific rubrics.

Where Modelsmith Sits In The Stack

Three separate responsibilities: your workflow, your specialist control plane, and your compute substrate.

Application Layer
Where you use the specialist
Consumes specialists
Owns

Your product workflow, operating experience, business rules, and final user-facing decision.

Examples
  • An RTB quality gate inside an exchange
  • A moderation decision inside a trust workflow
  • A campaign diagnostic surfaced in your operations console
Boundary

It does not train, evaluate, promote, or version the specialist model.

Modelsmith Layer
Where you create and govern the specialist
Creates and governs specialists
Owns

Eval design, failure analysis, training runs, promotion gates, rollback state, lineage, and specialist packaging.

Examples
  • Turn failed evals into new training data
  • Promote qwen3-4b-adtech only after quality gates pass
  • Produce the serving artifact and rollback contract
Boundary

It does not own your workflow or compete with raw inference vendors on serving throughput.

Substrate Layer
Where your work physically runs
Runs workloads
Owns

Training hardware, inference runtimes, GPUs, hosted serving platforms, queues, networking, and storage.

Examples
  • DGX Spark for local training and validation
  • Groq or Cerebras for low-latency serving
  • Fireworks or your cloud account for hosted deployment
Boundary

It does not decide what specialist should exist, whether it is good enough, or when it should be promoted.

Modelsmith opens model specialisation to the functions adjacent to data science that already understand your workflows: engineering, product management, analytics, operations, and domain leadership. They do not need to become ML infrastructure specialists. They work through issues, PRs, runbooks, eval reviews, and approval records, while human-in-the-loop gates keep methodology, policy, promotion, and rollback accountable.

Your application asks for a decision. Modelsmith supplies the accepted specialist, evidence, version, and rollback contract. The substrate runs the model wherever you choose to serve it. Modelsmith owns the specialisation lifecycle, not the app workflow or the raw compute layer.
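
The hand-off in the previous paragraph can be pictured as a structured payload. This is a sketch only: every field name is an illustrative assumption rather than the actual Modelsmith API, and the specialist name reuses the qwen3-4b-adtech example from this page.

    # Hypothetical shape of what Modelsmith supplies to the application layer.
    # Field names and values are assumptions, not a real API contract.
    accepted_specialist = {
        "model": "qwen3-4b-adtech",
        "version": "2025.06.12-r3",              # promoted version the app pins
        "evidence": {
            "composite_score": 0.91,             # score that cleared the gate
            "held_out_regressions": 0,
        },
        "rollback": {
            "previous_version": "2025.05.30-r7",
            "automatic": True,                   # if post-promotion metrics degrade
        },
        "endpoint": "https://inference.example.internal/v1",  # substrate-owned
    }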

Example: your campaign console flags suspicious supply. Your application owns the operations screen and business action. Modelsmith owns the adtech specialist, eval evidence, promotion state, and version history. The substrate owns the training job and the production inference endpoint.

Groq and Cerebras optimise model execution. Fireworks optimises hosted model customisation. Modelsmith optimises the creation, improvement, and operation of domain specialists. Describing it as a cheaper self-hosted fine-tuning stack obscures what it does; its category is closer to a specialisation control plane.

IP Ownership

Your domain artefacts stay yours. Contributions to the Modelsmith platform stay with the platform owner.

Your eval suites, training data, model weights, domain policies, operational examples, and deployment decisions remain under your control. They are the business-specific artefacts that make your specialist useful.

The Modelsmith repo is different. Bi-directional contributions to that repo, including bugfixes, integration improvements, runbook changes, and platform patches made by you or your team, are owned by Agentsia. That keeps the shared control plane coherent while preserving your ownership of your data, eval evidence, and specialist outputs.

Bi-directional means fork-first contribution. You operate Modelsmith from your fork, then contribute reusable platform work back through pull requests. It does not mean automatic upstreaming of private data, evaluation evidence, specialist weights, secrets, deployment state, or business-specific approval decisions.

Design Principles

The architectural reasoning behind how Modelsmith approaches model specialisation.

01
Specialisation Outperforms Scale On Narrow Tasks
A 4B-parameter model trained on domain-specific reward signals can outperform Claude, ChatGPT, or Gemini on the same narrow task — while being cheaper and easier to serve at production scale. Smaller models also allow faster iteration cycles: each eval-train-re-eval loop completes in hours instead of days.
02
Retrieval And Training Solve Different Problems
RAG gives a model access to knowledge it was not trained on — useful for facts that change frequently. Domain training internalises judgment and reasoning patterns that retrieval alone cannot provide. Modelsmith uses both: RAG for dynamic knowledge, training for durable domain competence.
03
Latency Is An Engineering Constraint
Many production workflows (real-time bidding, live content moderation, in-session personalisation) require sub-second responses. Modelsmith trains and develops specialists locally, then packages them for deployment on your chosen cloud inference vendor, where small task-specific models are easier to tune for tight latency budgets than frontier lab APIs.
04
Domain Eval And Training Data Compound Over Time
Each iteration of the eval-train loop produces better evaluation scenarios, more targeted training data, and a more accurate model. This compounding effect means that early iterations improve slowly, but later iterations improve rapidly as the eval suite and training corpus grow in coverage.
05
A Fleet Of Specialists Mirrors Microservices
Rather than one universal model handling every task, Modelsmith creates a fleet of specialists — each narrow enough to be excellent and small enough to be fast. This mirrors the microservices pattern: independent deployment, independent scaling, and independent improvement cycles.

How The Iterate Loop Works

The core automation: evaluate, train, re-evaluate, repeat until the model meets its target.

The iterate loop is the engine that drives model improvement. It runs autonomously, deciding what to improve next based on evaluation results and stopping when the model reaches its composite score target.

Evaluate → Extract failures → Generate training data → Train (GRPO) → Re-evaluate → Promote or continue
Eval phase

The model is tested against a suite of domain-specific scenarios. Each scenario has success criteria — rubric points the model's output must satisfy. The composite score combines core accuracy, robustness (rephrased variants), and micro-benchmarks into a single number.
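
One way to read "combines ... into a single number" is a weighted sum. Here is a sketch under that assumption; Modelsmith's actual weighting is not specified on this page.

    def composite_score(core: float, robustness: float, micro: float,
                        weights: tuple[float, float, float] = (0.6, 0.25, 0.15)) -> float:
        """Weighted sum of core accuracy, robustness on rephrased variants,
        and micro-benchmarks. The weights are illustrative assumptions."""
        w_core, w_robust, w_micro = weights
        return w_core * core + w_robust * robustness + w_micro * micro

    # Example: 0.92 core, 0.85 robustness, 0.80 micro-benchmarks -> 0.8845
    score = composite_score(0.92, 0.85, 0.80)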

Train phase

Failed scenarios are extracted and used to generate augmented training data. The model is then fine-tuned using GRPO (Group Relative Policy Optimisation), which uses a reward function derived from the same eval rubric. This means training directly optimises for eval performance.
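
GRPO scores groups of sampled completions with a reward function, so deriving that reward from the eval rubric might look like the sketch below. The rubric representation is an assumption; the point carried over from the text is that eval and training share the same criteria.

    def rubric_reward(output: str, rubric: list[dict]) -> float:
        """Fraction of weighted rubric points the output satisfies.
        The rubric structure here is an illustrative assumption."""
        total = sum(p["weight"] for p in rubric)
        earned = sum(p["weight"] for p in rubric if p["check"](output))
        return earned / total if total else 0.0

    rubric = [  # hypothetical criteria for an adtech specialist
        {"weight": 2.0, "check": lambda out: "supply" in out.lower()},
        {"weight": 1.0, "check": lambda out: len(out) < 2000},
    ]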

What happens on failure
  • Failure scenarios are extracted with their rubric
  • Augmented training rows are generated and validated
  • GRPO training runs on the training fleet
  • The loop continues to the next iteration
What happens on success
  • Composite exceeds target for sustained iterations
  • Held-out scenarios are tested (no regression)
  • Model is promoted: shadow, then canary, then production
  • Rollback is automatic if post-promotion metrics degrade
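
Taken together, the two columns reduce each iteration to a single decision. A sketch follows; the target value, the sustain window, and the function name are assumptions.

    TARGET = 0.90          # hypothetical composite target
    SUSTAIN_WINDOW = 3     # hypothetical number of sustained iterations

    def should_promote(history: list[float], held_out_regressed: bool) -> bool:
        """Promote only after the composite holds above target for a
        sustained run of iterations with no held-out regression."""
        recent = history[-SUSTAIN_WINDOW:]
        sustained = len(recent) == SUSTAIN_WINDOW and all(s >= TARGET for s in recent)
        return sustained and not held_out_regressed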
Real Example
Crash recovery via saved state
.iterate/state.json
The loop saves its progress after every phase change. If the process crashes, the machine reboots, or SSH drops mid-training, re-running the script resumes from the last completed phase. Eval phases are verified against the database before being skipped, so a crash during eval does not produce phantom "completed" results.
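
.iterate/state.json is the only artefact named here; the fields inside it and the database check are assumptions in this sketch of the resume logic.

    import json
    from pathlib import Path

    STATE = Path(".iterate/state.json")
    PHASES = ["evaluate", "extract", "generate", "train", "re_evaluate"]

    def eval_verified_in_db(state: dict) -> bool:
        """Hypothetical check that eval results actually landed in the
        database, so a mid-eval crash cannot be skipped as 'completed'."""
        return False  # a real implementation would query the eval store

    def resume_phase() -> int:
        """Index of the phase to run next after a crash or reboot."""
        if not STATE.exists():
            return 0
        state = json.loads(STATE.read_text())  # field names are assumptions
        last = state.get("last_completed_phase")
        if last == "evaluate" and not eval_verified_in_db(state):
            return PHASES.index("evaluate")    # re-run to avoid phantom results
        return PHASES.index(last) + 1 if last in PHASES else 0

    def mark_done(phase: str) -> None:
        STATE.parent.mkdir(exist_ok=True)
        STATE.write_text(json.dumps({"last_completed_phase": phase}))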

Fleet Coordination: The Routing Layer

A deterministic rules-based router dispatches queries to the right specialist. Routing is not the moat — the specialists are.

A fleet of specialists implies a coordination mechanism. Modelsmith provides a routing layer that dispatches incoming queries to the appropriate specialist. This router is deliberately deterministic: a config-driven rules engine based on query classification — not a trained ML model.

Why deterministic?

A deterministic router is predictable, auditable, and debuggable. It does not introduce an additional trained component whose failure modes cascade across the entire fleet. As the fleet grows, the routing config grows; the router logic stays stable.
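
A rules engine of this kind can be very small. In the sketch below, the query classes and specialist names (other than qwen3-4b-adtech, used elsewhere on this page) are assumptions.

    # Deterministic dispatch: a config lookup, not a trained model.
    ROUTES = {  # would live in routing config alongside clusters.json
        "adtech.supply_quality": "qwen3-4b-adtech",
        "trust.moderation": "qwen3-4b-moderation",  # hypothetical specialist
    }
    DEFAULT = "qwen3-4b-general"  # hypothetical fallback

    def route(query_class: str) -> str:
        """Same class in, same specialist out: predictable and auditable."""
        return ROUTES.get(query_class, DEFAULT)

As the fleet grows, only ROUTES grows; route() never changes.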

MCP per specialist

Each specialist is also exposed as an MCP tool, allowing your existing agentic coding environment or orchestration framework to invoke specialists directly — bypassing the router when that is the better fit.
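
Assuming the official MCP Python SDK, exposing a specialist as a tool could look like this sketch. The server name, tool name, and helper function are all illustrative assumptions.

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("modelsmith-specialists")

    def call_specialist(model: str, query: str) -> str:
        """Hypothetical helper that calls the promoted specialist's
        serving endpoint (substrate-owned)."""
        raise NotImplementedError

    @mcp.tool()
    def adtech_supply_quality(query: str) -> str:
        """Ask the adtech specialist directly, bypassing the fleet router."""
        return call_specialist("qwen3-4b-adtech", query)

    if __name__ == "__main__":
        mcp.run()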

Key Architecture Decisions

Engineering trade-offs that shape how your platform operates.

Config-driven
All hyperparameters, host addresses, thresholds, and model profiles live in a single JSON file (clusters.json). This means an agentic tool can tune training parameters by editing JSON — no bash scripts, no Docker compose files, no Python constants to hunt down.
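Only clusters.json itself and, later on this page, defaults.training_hosts are named; the rest of this sketch of the config shape is assumed.

    import json

    # Assumed shape of config/clusters.json; keys other than
    # defaults.training_hosts are illustrative.
    example = {
        "defaults": {
            "training_hosts": ["dgx-spark-01", "dgx-spark-02"],  # hypothetical
            "ssh_timeout_seconds": 5,     # probe timeout mentioned below
            "composite_target": 0.90,     # hypothetical threshold
        },
        "models": {
            "qwen3-4b-adtech": {"learning_rate": 1e-5, "batch_size": 8},
        },
    }

    with open("config/clusters.json") as f:  # the single source of truth
        clusters = json.load(f)

An agentic tool tunes a run by editing these values in place; nothing else has to change.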
Self-healing loop
When training crashes, the loop classifies the failure (OOM, SSH timeout, CUDA hang, corrupt checkpoint) and attempts automatic recovery before retrying. Unknown failures are logged with enough context for the next monitoring cycle to propose a new recovery strategy.
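The classify-then-recover shape might look like the sketch below; the failure classes come from the text, while the log patterns and recovery actions are assumptions.

    # Failure classes from the text; patterns and actions are assumptions.
    RECOVERY = {
        "oom": "reduce batch size and retry",
        "ssh_timeout": "re-probe the host pool and fail over",
        "cuda_hang": "reset the GPU and restart from the last checkpoint",
        "corrupt_checkpoint": "fall back to the previous checkpoint",
    }

    def classify(log_tail: str) -> str:
        low = log_tail.lower()
        if "out of memory" in low:
            return "oom"
        if "timed out" in low:
            return "ssh_timeout"
        if "cuda" in low:
            return "cuda_hang"
        if "checkpoint" in low:
            return "corrupt_checkpoint"
        return "unknown"  # logged with context for the next monitoring cycle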
Agent-first operation
Your primary operator is an agentic coding tool, not a human at a terminal. If a problem requires someone to SSH in and run a one-off command, that command becomes code in the iterate loop. Manual fixes are treated as architecture bugs.
Promotion gates
Models move through shadow, canary, and production stages — similar to how code goes through staging and production. Promotion requires sustained composite scores above a threshold with no regression on held-out scenarios. Rollback is automatic if post-promotion metrics drop.
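The stage progression reads as a small state machine. Here is a sketch, with the stage order taken from the text and the gate checks as assumptions.

    STAGES = ["shadow", "canary", "production"]

    def next_stage(current: str, gates_passed: bool) -> str:
        """Advance one stage only when the quality gates pass."""
        i = STAGES.index(current)
        return STAGES[i + 1] if gates_passed and i + 1 < len(STAGES) else current

    def should_rollback(post_promotion_composite: float, threshold: float) -> bool:
        """Automatic rollback when post-promotion metrics drop below the
        gate. Where the threshold comes from is an assumption."""
        return post_promotion_composite < threshold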
Real Example
Fleet pool failover
config/clusters.json → defaults.training_hosts
Training hosts are treated as a pool, not fixed assignments. When training starts, all hosts are probed in parallel (5-second SSH timeout) and scored by affinity and availability. If the primary host is down, training transparently fails over to the best available alternate. Adding a new machine to the fleet is a single JSON entry — all models can fail over to it with zero code changes.
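
A sketch of the probe-and-score selection follows; the 5-second SSH timeout is from the text, while the probe command and affinity scoring are assumptions.

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def probe(host: str) -> bool:
        """SSH reachability probe with the 5-second timeout from the text."""
        result = subprocess.run(
            ["ssh", "-o", "ConnectTimeout=5", host, "true"],
            capture_output=True,
        )
        return result.returncode == 0

    def pick_host(hosts: list[str], affinity: dict[str, float]) -> str | None:
        """Probe all hosts in parallel, then take the live host with the
        highest affinity score. Affinity values are assumptions."""
        with ThreadPoolExecutor(max_workers=max(len(hosts), 1)) as pool:
            alive = [h for h, ok in zip(hosts, pool.map(probe, hosts)) if ok]
        return max(alive, key=lambda h: affinity.get(h, 0.0), default=None)

Adding a machine is then one new entry in defaults.training_hosts plus, optionally, an affinity score.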