A practitioner's guide to evaluating LLM agentic performance. Which models can actually do autonomous work, and which ones just talk about it?
Standard benchmarks test knowledge. Agentic work demands execution.
Five capabilities separate models that can act from models that can only respond.
Six dimensions scored 1-5, informed by SWE-bench, BFCL, tau-bench, and production deployment experience.
Scores combine benchmark evidence (SWE-bench resolution rates, Berkeley Function Calling Leaderboard accuracy, Chatbot Arena Elo) with structured practitioner assessment from production deployments. A model scoring 5/5 on tool use, for example, reliably generates correct function schemas, handles edge cases in API responses, and composes multi-tool workflows across extended sessions.
This framework deliberately weights execution reliability over raw knowledge. A model that scores 95% on MMLU but fails 20% of its tool calls in practice receives a lower composite rating than one that scores 88% on MMLU but executes tools flawlessly.
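The trade-off above can be sketched as a weighted mean. The weights and function below are illustrative assumptions, not the paper's actual formula; they simply show how doubling the weight on execution reliability flips the ranking in the MMLU example:

```python
# Hypothetical composite: execution reliability counts double vs. raw knowledge.
# Weights and inputs are illustrative, not the framework's published formula.

def composite_score(mmlu: float, tool_success: float,
                    w_knowledge: float = 1.0, w_execution: float = 2.0) -> float:
    """Weighted mean of a knowledge benchmark score and a tool-call
    success rate, both expressed on a 0-1 scale."""
    total = w_knowledge + w_execution
    return (w_knowledge * mmlu + w_execution * tool_success) / total

# Model A: 95% MMLU, but fails 20% of tool calls.
a = composite_score(mmlu=0.95, tool_success=0.80)
# Model B: 88% MMLU, flawless execution.
b = composite_score(mmlu=0.88, tool_success=1.00)
assert b > a  # execution-weighted scoring ranks B higher
```

With equal weights the ranking would reverse, which is exactly why the choice of weights encodes the framework's priority on execution.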
Composite agentic scores (out of 5.0) based on the six-dimension framework.
A real production incident: two agents, identical tools, identical tasks. Only one did the work.
The dangerous failure mode is not a model that refuses to try. It is a model that confidently fabricates evidence of work it never performed. Without audit trails and artifact verification, this is nearly impossible to detect from status reports alone.
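One minimal form of artifact verification is to check the filesystem rather than the agent's report. The function and `claimed_outputs` structure below are assumptions for illustration, not part of the paper's tooling:

```python
# Hypothetical check: trust artifacts on disk, not the agent's status report.
# The function name and claimed-outputs format are illustrative assumptions.
from pathlib import Path

def verify_artifacts(claimed_outputs: list[str]) -> dict[str, bool]:
    """For each file an agent claims to have produced, report whether it
    actually exists and is non-empty. Short-circuit `and` avoids calling
    stat() on missing paths."""
    return {p: Path(p).is_file() and Path(p).stat().st_size > 0
            for p in claimed_outputs}
```

An agent that fabricates work produces a confident status report but no files; a map like this exposes the gap without having to parse or trust the report itself.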
The complete paper includes detailed model-by-model ratings, local model assessments, hardware requirements, cost-effectiveness analysis, and the full case study with lessons learned.