Learn
Deep dive into the Modelsmith vision — what autonomous, agentic-first, and config-driven mean in practice.
What Is Modelsmith?
Create, evaluate, and operate domain-specialised language models for your workflows.
Modelsmith helps you continuously produce low-latency, domain-trained small language models that outperform frontier general models on narrow, commercially important tasks.
Modelsmith automates the full lifecycle of your domain-specialised models: evaluate how well a model performs on a specific task, train it to improve, re-evaluate to confirm the improvement, and promote it to production when it meets your quality gates. Modelsmith is designed so that your engineers, product managers, analysts, and domain experts can participate through an agentic operating surface instead of a notebook-only workflow. Agents can identify what the model gets wrong, generate candidate training data, and prepare promotion evidence; your team approves the judgment gates that change evals, datasets, policy, or production state.
You get domain-native models that are faster, smaller, and more accurate on your target task than prompting a general-purpose frontier model — while remaining governable through structured promotion, rollback, and lineage tracking.
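The split between agent-prepared work and human-approved gates can be made concrete with a small sketch. Everything below (the `Proposal` shape, the gate names, the `apply` function) is an illustrative assumption, not Modelsmith's actual API.

```python
# Minimal sketch of the agent-propose / human-approve split described above.
# All names here are illustrative assumptions, not Modelsmith's actual API.
from dataclasses import dataclass

# Judgment gates: changes that always require a human approval record.
JUDGMENT_GATES = {"evals", "datasets", "policy", "production_state"}

@dataclass
class Proposal:
    kind: str        # e.g. "augment_dataset", "promote_model"
    touches: str     # which surface the change affects
    evidence: dict   # eval results, lineage, rollback plan, etc.

def apply(proposal: Proposal, approved_by: str | None) -> bool:
    """Agents prepare proposals; gated changes only apply with an approver."""
    if proposal.touches in JUDGMENT_GATES and approved_by is None:
        return False  # parked until a human signs off
    # ... apply the change and record lineage ...
    return True
```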
Get Started
Fork the repo, run Modelsmith in your own environment, and send reusable platform improvements back upstream.
The starting path is the same from the web app, CLI, or an agentic coding workflow. Your team operates from a customer-controlled fork, connects private data inside your environment, and contributes generic platform improvements back to Agentsia through pull requests.
1. Create an organisation-owned fork of the Agentsia Modelsmith repo. Deployment-specific configuration and private connectors start there.
2. Install dependencies in your approved development environment and run baseline checks before attaching sensitive data.
3. Set secrets, training hosts, inference substrates, storage paths, network policy, and deployment settings outside the upstream repo (a configuration sketch follows this list).
4. Connect proprietary datasets, operational failure modes, approved documents, and domain-specific knowledge inputs inside your environment.
5. Pick one commercially important workflow with measurable success criteria, clear frontier baselines, and a governed eval set.
6. Run schema checks, smoke tests, data access checks, and a small eval to prove the fork and infrastructure are ready.
7. Evaluate, train, re-evaluate, inspect regressions, prepare rollback, and produce promotion evidence from the same accepted state.
8. Open upstream PRs for platform changes such as bugfixes, adapters, runbooks, API improvements, UI fixes, docs, and agent-operating patches.
Contribute upstream: bugfixes, integration adapters, runbooks, API improvements, UI fixes, documentation, and agent-operating patches.
Keep in your fork: private datasets, eval evidence, specialist weights, secrets, private configs, deployment state, and business-specific rubrics.
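One way to keep deployment settings out of the upstream repo is to load them from the environment inside your fork at runtime. This is only a sketch; the variable names are assumptions, not Modelsmith's documented configuration keys.

```python
# Sketch: deployment settings resolved from the environment at runtime,
# never committed upstream. All variable names are illustrative assumptions.
import os

settings = {
    "training_host": os.environ["TRAINING_HOST"],         # e.g. a DGX Spark box
    "inference_substrate": os.environ["INFERENCE_URL"],   # e.g. a Groq endpoint
    "storage_path": os.environ.get("ARTIFACT_STORE", "/srv/modelsmith"),
    "api_key": os.environ["SUBSTRATE_API_KEY"],           # secret, from your vault
}
```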
Where Modelsmith Sits In The Stack
Three separate responsibilities: your workflow, your specialist control plane, and your compute substrate.
Your application owns your product workflow, operating experience, business rules, and the final user-facing decision. For example:
- An RTB quality gate inside an exchange
- A moderation decision inside a trust workflow
- A campaign diagnostic surfaced in your operations console
It does not train, evaluate, promote, or version the specialist model.
Modelsmith owns eval design, failure analysis, training runs, promotion gates, rollback state, lineage, and specialist packaging. For example, it can:
- Turn failed evals into new training data
- Promote qwen3-4b-adtech only after quality gates pass
- Produce the serving artifact and rollback contract
It does not own your workflow or compete with raw inference vendors on serving throughput.
The compute substrate owns training hardware, inference runtimes, GPUs, hosted serving platforms, queues, networking, and storage. For example:
- DGX Spark for local training and validation
- Groq or Cerebras for low-latency serving
- Fireworks or your cloud account for hosted deployment
It does not decide what specialist should exist, whether it is good enough, or when it should be promoted.
Modelsmith opens model specialisation to the functions adjacent to data science that already understand your workflows: engineering, product management, analytics, operations, and domain leadership. They do not need to become ML infrastructure specialists. They work through issues, PRs, runbooks, eval reviews, and approval records, while human-in-the-loop gates keep methodology, policy, promotion, and rollback accountable.
Your application asks for a decision. Modelsmith supplies the accepted specialist, evidence, version, and rollback contract. The substrate runs the model wherever you choose to serve it. Modelsmith owns the specialisation lifecycle, not the app workflow or the raw compute layer.
Example: your campaign console flags suspicious supply. Your application owns the operations screen and business action. Modelsmith owns the adtech specialist, eval evidence, promotion state, and version history. The substrate owns the training job and the production inference endpoint.
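The call path across the three layers might look like the following sketch. Every name in it is an illustrative stand-in; there is no published SDK implied here.

```python
# Hypothetical call path across the three layers; every name below is an
# illustrative stand-in, not a published Modelsmith SDK.
from dataclasses import dataclass

@dataclass
class Release:
    version: str
    endpoint: str        # substrate serving URL
    rollback_to: str     # rollback contract: the last accepted version

def current_release(specialist: str) -> Release:
    """Control plane: resolve the accepted specialist and its contract."""
    return Release(version="1.4.2",
                   endpoint="https://serve.example/adtech",
                   rollback_to="1.4.1")

def infer(endpoint: str, payload: dict) -> str:
    """Substrate: run the model wherever you serve it (stubbed here)."""
    return "suspicious" if payload.get("ivt_score", 0) > 0.9 else "clean"

def flag_suspicious_supply(event: dict) -> str:
    """Application: your workflow owns the final business action."""
    release = current_release("adtech-supply-quality")
    verdict = infer(release.endpoint, event)
    return "hold_for_review" if verdict == "suspicious" else "allow"
```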
Groq and Cerebras optimise model execution. Fireworks optimises hosted model customisation. Modelsmith optimises the creation, improvement, and operation of domain specialists. Describing it as a cheaper self-hosted fine-tuning stack obscures what it does; its category is closer to a specialisation control plane.
IP Ownership
Your domain artefacts stay yours. Contributions to the Modelsmith platform stay with the platform owner.
Your eval suites, training data, model weights, domain policies, operational examples, and deployment decisions remain under your control. They are the business-specific artefacts that make your specialist useful.
The Modelsmith repo is different. Bi-directional contributions to that repo, including bugfixes, integration improvements, runbook changes, and platform patches made by you or your team, are owned by Agentsia. That keeps the shared control plane coherent while preserving your ownership of your data, eval evidence, and specialist outputs.
Bi-directional means fork-first contribution. You operate Modelsmith from your fork, then contribute reusable platform work back through pull requests. It does not mean automatic upstreaming of private data, evaluation evidence, specialist weights, secrets, deployment state, or business-specific approval decisions.
Design Principles
The architectural reasoning behind how Modelsmith approaches model specialisation.
How The Iterate Loop Works
The core automation: evaluate, train, re-evaluate, repeat until the model meets its target.
The iterate loop is the engine that drives model improvement. It runs autonomously, deciding what to improve next based on evaluation results and stopping when the model reaches its composite score target.
The model is tested against a suite of domain-specific scenarios. Each scenario has success criteria — rubric points the model's output must satisfy. The composite score combines core accuracy, robustness (rephrased variants), and micro-benchmarks into a single number.
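As an illustration of how the three tracks could fold into one number, here is a weighted-mean sketch. The weights are assumptions, not Modelsmith defaults.

```python
# Illustrative composite score: weighted mean of the three eval tracks.
# The weights are assumptions, not Modelsmith defaults.

def composite_score(core: float, robustness: float, micro: float,
                    weights: tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Combine core accuracy, robustness on rephrased variants, and
    micro-benchmarks (each in [0, 1]) into a single number."""
    w_core, w_rob, w_micro = weights
    return w_core * core + w_rob * robustness + w_micro * micro

# e.g. composite_score(0.91, 0.84, 0.88) -> 0.883
```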
Failed scenarios are extracted and used to generate augmented training data. The model is then fine-tuned using GRPO (Group Relative Policy Optimisation), which uses a reward function derived from the same eval rubric. This means training directly optimises for eval performance.
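Because the reward is derived from the same rubric the eval uses, a reward function can be as simple as the fraction of rubric points an output satisfies. The sketch below assumes boolean rubric checks; it is not Modelsmith's actual reward implementation.

```python
# Sketch of a rubric-derived reward, assuming each rubric point is a boolean
# check over the model's output. Not Modelsmith's actual implementation.
from typing import Callable

RubricPoint = Callable[[str], bool]

def rubric_reward(output: str, rubric: list[RubricPoint]) -> float:
    """Fraction of rubric points the output satisfies, in [0, 1].
    GRPO compares this reward across a group of sampled outputs."""
    return sum(check(output) for check in rubric) / len(rubric)

# Example rubric for an adtech scenario (illustrative checks):
rubric = [
    lambda out: "IVT" in out,               # names the fraud signal
    lambda out: len(out) < 400,             # stays concise
    lambda out: "recommend" in out.lower()  # ends with an action
]
```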
When the composite score misses its target, the loop feeds failures back into training:
- Failure scenarios are extracted with their rubric
- Augmented training rows are generated and validated
- GRPO training runs on the training fleet
- The loop continues to the next iteration

When the composite score meets its target, promotion proceeds under guard rails (a gate sketch follows):
- Composite exceeds target for sustained iterations
- Held-out scenarios are tested (no regression)
- Model is promoted: shadow, then canary, then production
- Rollback is automatic if post-promotion metrics degrade
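A promotion gate consistent with those criteria could look like this sketch; the sustain window, threshold handling, and stage names are assumptions.

```python
# Sketch of the promotion gate implied by the criteria above. Thresholds,
# window sizes, and stage names are illustrative assumptions.

STAGES = ["shadow", "canary", "production"]

def ready_to_promote(history: list[float], target: float,
                     held_out_delta: float, sustain: int = 3) -> bool:
    """True when the composite beat the target for `sustain` consecutive
    iterations and held-out scenarios did not regress."""
    sustained = len(history) >= sustain and all(s > target for s in history[-sustain:])
    no_regression = held_out_delta >= 0.0
    return sustained and no_regression
```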
Fleet Coordination: The Routing Layer
A deterministic rules-based router dispatches queries to the right specialist. Routing is not the moat — the specialists are.
A fleet of specialists implies a coordination mechanism. Modelsmith provides a routing layer that dispatches incoming queries to the appropriate specialist. This router is deliberately deterministic: a config-driven rules engine based on query classification — not a trained ML model.
A deterministic router is predictable, auditable, and debuggable. It does not introduce an additional trained component whose failure modes cascade across the entire fleet. As the fleet grows, the routing config grows; the router logic stays stable.
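In practice this can be as small as first-match dispatch over a rules table. The sketch below assumes keyword predicates and borrows the qwen3-4b-adtech specialist name from the example above; the rest is illustrative.

```python
# Sketch of a deterministic, config-driven router: first matching rule wins,
# no trained component. Rules and specialist names are illustrative.

ROUTES = [
    # (predicate over the classified query, specialist to dispatch to)
    (lambda q: "bid" in q or "supply" in q, "qwen3-4b-adtech"),
    (lambda q: "abuse" in q or "policy" in q, "moderation-specialist"),
]
DEFAULT = "generalist-fallback"

def route(query: str) -> str:
    """Deterministic dispatch: same query, same specialist, every time."""
    q = query.lower()
    for predicate, specialist in ROUTES:
        if predicate(q):
            return specialist
    return DEFAULT

assert route("Is this supply path suspicious?") == "qwen3-4b-adtech"
```

Growing the fleet means appending rows to the routes table, which keeps the router's behaviour auditable as a diff rather than a retraining event.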
Each specialist is also exposed as an MCP tool, allowing your existing agentic coding environment or orchestration framework to invoke specialists directly — bypassing the router when that is the better fit.
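With the official MCP Python SDK, exposing a specialist as a tool might look like the sketch below; the server name, tool body, and return value are illustrative.

```python
# Sketch: exposing a specialist as an MCP tool with the official MCP Python
# SDK (`pip install mcp`). The tool body and names are illustrative.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("modelsmith-specialists")

@mcp.tool()
def adtech_supply_check(query: str) -> str:
    """Send a query straight to the adtech specialist, bypassing the router."""
    # ... call your serving endpoint for the accepted specialist version ...
    return "verdict: clean"

if __name__ == "__main__":
    mcp.run()  # serve over stdio for an agentic coding environment
```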
Key Architecture Decisions
Engineering trade-offs that shape how your platform operates.