Evaluations
Model performance across 818 scenarios, 9 clusters, and 3 domains. Compare frontier and fine-tuned models side by side.