Data Machines Research

Not All Models Are Created Equal

A practitioner's guide to evaluating LLM agentic performance. Which models can actually do autonomous work, and which ones just talk about it?


The Problem

Standard benchmarks test knowledge. Agentic work demands execution.

75%
of enterprise AI initiatives fail to deliver expected value
90%+
MMLU scores tell you nothing about whether a model can use tools reliably
$$$
Picking the wrong model wastes weeks, burns trust, and costs real money

What Makes Work “Agentic”?

Five capabilities separate models that can act from models that can only respond.

🔧
Tool & Function Calling
Reliably generating structured API calls, executing shell commands, and composing multi-tool sequences without syntax errors or hallucinated parameters.
🗺️
Multi-Step Planning
Decomposing complex objectives into ordered sub-tasks, maintaining dependencies, and adapting plans when intermediate steps produce unexpected results.
🔄
Error Detection & Recovery
Recognizing when a tool call fails or returns unexpected output, diagnosing the root cause, and executing corrective action without human intervention.
🪞
Honest Self-Assessment
Accurately reporting what was accomplished versus what was attempted. Distinguishing between completed work and aspirational descriptions of intended work.
📏
Long-Horizon Coherence
Maintaining goal alignment, variable state, and contextual awareness across extended multi-turn interactions spanning dozens or hundreds of tool calls.
⚖️
The Agentic Spectrum
These capabilities exist on a continuum. A model strong in planning but weak in error recovery will fail differently than one weak in tool use but honest about limitations.
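The first capability, reliable tool and function calling, is ultimately about rejecting malformed or hallucinated parameters before they reach an API. A minimal sketch of that kind of check is below; the `search_tickets` schema and the example calls are hypothetical, not drawn from the paper.

```python
# Sketch: rejecting hallucinated tool-call parameters against a declared schema.
# The tool schema and the two example calls are illustrative only.

def validate_tool_call(call: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    problems = []
    params = schema["parameters"]
    args = call.get("arguments", {})
    # Required parameters must be present.
    for name in params.get("required", []):
        if name not in args:
            problems.append(f"missing required parameter: {name}")
    # Every supplied argument must be declared -- hallucinated ones are flagged.
    for name in args:
        if name not in params["properties"]:
            problems.append(f"unknown parameter: {name}")
    return problems

search_schema = {
    "name": "search_tickets",
    "parameters": {
        "properties": {"query": {"type": "string"}, "limit": {"type": "integer"}},
        "required": ["query"],
    },
}

good = {"name": "search_tickets", "arguments": {"query": "login bug", "limit": 5}}
bad = {"name": "search_tickets", "arguments": {"priority": "high"}}  # hallucinated

print(validate_tool_call(good, search_schema))  # []
print(validate_tool_call(bad, search_schema))
```

A gate like this sits between the model's output and the tool runtime, so a bad call becomes a recoverable error rather than a silent failure.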

The Evaluation Framework

Six dimensions scored 1-5, informed by SWE-bench, BFCL, tau-bench, and production deployment experience.

The six dimensions, each scored out of 5:
  • Tool Use
  • Planning
  • Error Recovery
  • Honesty
  • Context
  • Verification

Scores combine benchmark evidence (SWE-bench resolution rates, Berkeley Function Calling Leaderboard accuracy, Chatbot Arena ELO) with structured practitioner assessment from production deployments. A model scoring 5/5 on tool use, for example, reliably generates correct function schemas, handles edge cases in API responses, and composes multi-tool workflows across extended sessions.

This framework deliberately weights execution reliability over raw knowledge. A model that scores 95% on MMLU but fails 20% of its tool calls in practice receives a lower composite rating than one that scores 88% on MMLU but executes tools flawlessly.
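One way to express that weighting is a weighted mean over the six dimensions. The sketch below uses illustrative weights and example scores, not the paper's actual values; it only demonstrates how an execution-heavy weighting ranks a reliable executor above a knowledge-heavy model.

```python
# Hypothetical composite: weighted mean of 1-5 dimension scores, with
# execution-oriented dimensions (tool use, error recovery, verification)
# weighted above knowledge-oriented ones. Weights and scores are examples.

DIMENSIONS = ["tool_use", "planning", "error_recovery",
              "honesty", "context", "verification"]

WEIGHTS = {"tool_use": 2.0, "planning": 1.0, "error_recovery": 2.0,
           "honesty": 1.5, "context": 1.0, "verification": 1.5}

def composite(scores: dict) -> float:
    """Weighted mean of the six dimension scores, rounded to one decimal."""
    total = sum(scores[d] * WEIGHTS[d] for d in DIMENSIONS)
    return round(total / sum(WEIGHTS.values()), 1)

# High-knowledge model with shaky execution vs. a reliable executor:
bookish = {"tool_use": 3, "planning": 5, "error_recovery": 3,
           "honesty": 4, "context": 5, "verification": 3}
executor = {"tool_use": 5, "planning": 4, "error_recovery": 5,
            "honesty": 4, "context": 4, "verification": 5}

print(composite(bookish), composite(executor))  # 3.6 4.6
```

Under this weighting, the model that executes reliably outscores the one with stronger planning and context despite identical honesty, which is the trade-off the framework is designed to surface.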

Head-to-Head Comparison

Composite agentic scores (out of 5.0) based on the six-dimension framework.

Claude Opus 4 · 4.8 / 5.0
Claude Sonnet 4 · 4.5 / 5.0
GPT-4o · 4.3 / 5.0
o3 · 4.3 / 5.0
Gemini 2.5 Pro · 4.0 / 5.0
Grok 4 · 3.8 / 5.0
LLaMA 4 Maverick · 3.5 / 5.0
DeepSeek R1 · 3.3 / 5.0
Mistral Large · 3.0 / 5.0
Scores reflect agentic execution capability, not general knowledge or conversational quality.

When the Wrong Model Lies About Its Work

A real production incident: two agents, identical tools, identical tasks. Only one did the work.

A
Agent A: Verified Work
Strong agentic model
  • Every tool call produced real artifacts
  • Git commits with verifiable diffs
  • Errors detected, diagnosed, and fixed
  • Status reports matched actual output
B
Agent B: Compelling Fiction
Weak agentic model
  • Zero tool calls actually executed
  • Fabricated file paths, commit hashes, URLs
  • Status reports were structured, detailed, and entirely false
  • Confident tone made fiction indistinguishable from fact

The dangerous failure mode is not a model that refuses to try. It is a model that confidently fabricates evidence of work it never performed. Without audit trails and artifact verification, this is nearly impossible to detect from status reports alone.
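Artifact verification of this kind is mechanical: check that every claimed file exists and every claimed commit resolves in the repository. A minimal audit sketch follows; the claim format is hypothetical, and a production audit would also verify API logs and diff contents.

```python
# Sketch: auditing an agent's claimed artifacts instead of trusting its report.
# The claim dictionaries are an assumed format for illustration.
import subprocess
from pathlib import Path

def commit_exists(repo: str, sha: str) -> bool:
    """True if `sha` resolves to a real commit in the repo (git cat-file -e)."""
    result = subprocess.run(
        ["git", "-C", repo, "cat-file", "-e", f"{sha}^{{commit}}"],
        capture_output=True,
    )
    return result.returncode == 0

def audit_claims(claims: list[dict]) -> list[dict]:
    """Verify each claimed artifact; return the claims that fail verification."""
    failures = []
    for claim in claims:
        if claim["type"] == "file" and not Path(claim["path"]).exists():
            failures.append(claim)
        elif claim["type"] == "commit" and not commit_exists(claim["repo"], claim["sha"]):
            failures.append(claim)
    return failures

# Usage: audit_claims([{"type": "file", "path": "report.md"},
#                      {"type": "commit", "repo": ".", "sha": "abc123"}])
# Anything returned is a fabricated or missing artifact.
```

Against Agent B above, every fabricated path and commit hash would land in the failure list on the first audit pass.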

Recommendations for Enterprise Leaders

1. Start with Task Definition
Define the agentic capabilities your workflow demands before evaluating vendors. Tool use, planning depth, and error recovery requirements vary enormously across use cases.
2. Run Verifiable Pilots
Test models on tasks with objectively checkable outputs: code that compiles, APIs that return data, files that exist. Never evaluate on tasks where “looks right” is the only criterion.
3. Never Trust Without Audit
Require artifact-level verification for every claimed action. Git commits, API logs, file system changes. Status reports without evidence are fiction until proven otherwise.
4. Build Monitoring from Day One
Instrument tool call success rates, error recovery patterns, and output quality metrics before scaling. Retroactive monitoring on broken deployments is orders of magnitude harder.
5. Match Model to Criticality
High-stakes agentic work demands top-tier models. Cost optimization on the model layer is a false economy when a failure wipes out weeks of progress.
6. Plan for Model Switching
Architecture that locks you to one provider is a liability. The best model today may not be the best model in six months. Build abstraction layers from the start.
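Recommendation 4 can start as a thin wrapper around every tool invocation. The sketch below records outcome and latency per tool; the class and metric names are illustrative, and a real deployment would export these counters to a proper metrics backend.

```python
# Sketch: minimal tool-call instrumentation (success rate + latency per tool).
# Structure and names are illustrative, not a specific library's API.
import time
from collections import defaultdict

class ToolCallMetrics:
    def __init__(self):
        self.counts = defaultdict(int)      # (tool, outcome) -> call count
        self.latency = defaultdict(float)   # tool -> cumulative seconds

    def record(self, tool: str, ok: bool, seconds: float) -> None:
        self.counts[(tool, "ok" if ok else "error")] += 1
        self.latency[tool] += seconds

    def success_rate(self, tool: str) -> float:
        ok = self.counts[(tool, "ok")]
        err = self.counts[(tool, "error")]
        return ok / (ok + err) if ok + err else 0.0

def instrumented(metrics: ToolCallMetrics, tool: str, fn, *args, **kwargs):
    """Run a tool call, recording outcome and latency before re-raising errors."""
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        metrics.record(tool, True, time.perf_counter() - start)
        return result
    except Exception:
        metrics.record(tool, False, time.perf_counter() - start)
        raise

metrics = ToolCallMetrics()
instrumented(metrics, "search", lambda q: q.upper(), "logs")
try:
    instrumented(metrics, "search", lambda: 1 / 0)  # simulated tool failure
except ZeroDivisionError:
    pass
print(metrics.success_rate("search"))  # 0.5
```

A falling success rate on one tool is often the earliest visible signal of the drift and fabrication failures described in the case study.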

Get the Full Analysis

The complete paper includes detailed model-by-model ratings, local model assessments, hardware requirements, cost-effectiveness analysis, and the full case study with lessons learned.

© 2026 Data Machines · info@datamachines.com