Data Machines Research

From GPT-3 to Agentic AI
Six Years That Changed Everything

How large language models evolved from research curiosities into autonomous agents capable of real work, and why understanding that journey matters for every enterprise decision-maker.


Seven Eras in Six Years

Each inflection point introduced new capabilities or exposed limitations that shaped today's landscape.

Jun 2020
GPT-3: 175B params, few-shot learning
2021-22
Chinchilla scaling laws, InstructGPT, RLHF
Nov 2022
ChatGPT: 100M users in 60 days
Feb 2023
LLaMA: open-source revolution begins
Mar 2023
GPT-4: reasoning leap, 90th-percentile bar exam
Dec 2023
Mixtral 8x7B: MoE efficiency breakthrough
Mar 2024
Claude 3: 200K context, tiered pricing
Mid 2024
Function calling becomes standard
Sep 2024
OpenAI o1: inference-time reasoning
Jan 2025
DeepSeek-R1: open-weight reasoning
2025
Opus 4, o3, Grok 4: purpose-built agents
2026
Agentic AI enters enterprise production

Milestone Moments

Six moments that redefined what AI systems could do.

Jun 2020
GPT-3
175 billion parameters proved that scale alone could unlock generality. Few-shot prompting replaced fine-tuning for many tasks, establishing the API-first paradigm.
Nov 2022
ChatGPT
RLHF made AI conversational and accessible. 100 million users adopted it in two months, forcing every enterprise board to ask about AI strategy.
Mar 2023
GPT-4 + LLaMA
GPT-4 showed that reasoning ability could leap between generations. LLaMA proved capable models could be open source. Together they split the field into two futures.
Dec 2023
Mixtral 8x7B
Mixture-of-experts: 46.7B total params, only 12.9B active per token. GPT-3.5-class performance on a single consumer GPU. Efficiency won.
Sep 2024
OpenAI o1
Chain-of-thought reasoning at inference time. Models could now "think harder" about difficult problems by spending more compute, trading speed for depth.
2025
Agentic Era
Opus 4, o3, and Grok 4 were trained with autonomous tool use as a first-class objective. Models stopped just talking about work and started doing it.

The Scale Explosion

Parameter counts grew by orders of magnitude, then efficiency gains changed the game. Active parameters now matter more than total size.

Parameters (billions), logarithmic scale:

GPT-2 (2019): 1.5B
GPT-3 (2020): 175B
PaLM (2022): 540B
GPT-4 (2023, est.): ~1.7T

MoE era: total vs. active parameters diverge:

Mixtral 8x7B (active): 12.9B
DeepSeek-V3 (active): 37B
LLaMA 4 Maverick (active): 17B
LLaMA 4 Maverick (total): 400B

1,000x
scale increase, GPT-2 to GPT-4
23x
MoE efficiency ratio (Maverick)
1 GPU
can run GPT-3.5-class models
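The efficiency figures above reduce to simple arithmetic. A quick sketch, using the parameter counts quoted in this article (plus DeepSeek-V3's commonly reported 671B total, which is not stated above):

```python
# Illustrative arithmetic for the scale and MoE figures quoted above.
models = {
    # name: (total_params_B, active_params_B)
    "GPT-3":            (175.0, 175.0),  # dense: every parameter is active
    "Mixtral 8x7B":     (46.7,  12.9),   # mixture-of-experts
    "DeepSeek-V3":      (671.0, 37.0),   # 671B total is the commonly cited figure
    "LLaMA 4 Maverick": (400.0, 17.0),
}

for name, (total, active) in models.items():
    ratio = total / active  # how many times fewer params are touched per token
    print(f"{name}: {total}B total, {active}B active ({ratio:.1f}x efficiency)")

# Scale increase from GPT-2 (1.5B) to GPT-4 (~1.7T, estimated):
print(f"GPT-2 -> GPT-4: ~{1700 / 1.5:,.0f}x")
```

Running this reproduces the callouts: Maverick's 400B/17B gives the ~23x efficiency ratio, and 1.7T over 1.5B gives the roughly 1,000x scale increase.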

The Great Price Collapse

Per-token costs dropped by two orders of magnitude while capabilities soared. Frontier AI went from a luxury to a commodity in under four years.

USD per million output tokens (lower is better):

GPT-3 (2021): $60
GPT-3.5 Turbo (2022): $2
GPT-4 (2023): $60
GPT-4 Turbo (2023): $30
GPT-4o (2024): $10
o3 (2025): $8
DeepSeek-V3 (2025): $0.28
214x cheaper
GPT-3 Davinci (2021) vs. DeepSeek-V3 (2025) at comparable capability
7.5x cheaper
o1 to o3 for frontier reasoning: same capability class, dramatic price cut

Open vs. Closed: The Strategic Split

What started as an insurmountable gap has narrowed to a single generation. Enterprise strategy now hinges on this dynamic.

🔒

Closed / API

Frontier capability for complex agentic work
Zero infrastructure management
Rapid iteration as providers improve models
Data leaves your network
Vendor lock-in and pricing risk
No fine-tuning on proprietary data
🔓

Open Weights / Local

Full data sovereignty, nothing leaves your servers
Zero marginal cost after hardware investment
Fine-tuning and customization freedom
~1 generation behind frontier on agentic tasks
GPU hardware procurement and maintenance
Weaker honesty calibration under pressure
The Hybrid Pattern

Leading enterprises use open-weight models for high-volume routine tasks and closed APIs for complex, mission-critical work. Model-agnostic frameworks make this practical.
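The hybrid pattern is, at its core, a routing decision. A minimal sketch, with all names (Task, route, the complexity threshold) hypothetical rather than taken from any specific framework:

```python
# Minimal sketch of the hybrid pattern: route routine work to a local
# open-weight model, escalate complex work to a frontier API.
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    complexity: int   # e.g. 1-10, scored by a classifier or heuristic
    sensitive: bool   # data that must not leave the network

def route(task: Task) -> str:
    if task.sensitive:
        return "local"         # data sovereignty overrides everything
    if task.complexity >= 7:
        return "frontier-api"  # complex, mission-critical agentic work
    return "local"             # high-volume routine work, near-zero marginal cost

print(route(Task("Summarize this ticket", complexity=2, sensitive=False)))
print(route(Task("Plan a multi-step migration", complexity=9, sensitive=False)))
```

A model-agnostic framework makes the two branches interchangeable, so the threshold can shift as open-weight models close the capability gap.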

The Breakthroughs That Made It Possible

Six technical innovations drove the evolution from text generators to autonomous agents.

⚡
Transformers
Parallel attention replaced sequential processing, enabling models to scale from millions to trillions of parameters.
🎯
RLHF
Reinforcement learning from human feedback turned capable-but-chaotic text generators into helpful, usable tools.
🧩
Mixture of Experts
Activating only a fraction of parameters per token decoupled capability from compute cost.
📚
Long Context
Context windows expanded from 2K to 1M+ tokens, enabling agents to hold entire codebases in working memory.
🧠
Chain-of-Thought
Inference-time reasoning let models "think harder" about difficult problems by spending more compute at runtime.
🔧
Function Calling
Structured tool invocation replaced fragile text parsing, making the model-to-action interface clean and reliable.
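The function-calling point is easiest to see in code. A provider-agnostic sketch: the tool registry, schema shape, and dispatch helper below are illustrative inventions, not any vendor's actual API, but the pattern (declared schema, validated JSON call, direct execution) is what replaced regex-scraping free text:

```python
# The model emits a machine-readable call (name + JSON arguments) that is
# validated against a declared schema before execution.
import json

TOOLS = {
    "get_weather": {
        "description": "Fetch current weather for a city",
        "parameters": {"city": str},
        "fn": lambda city: f"22C and sunny in {city}",  # stub implementation
    },
}

def dispatch(model_output: str) -> str:
    """Parse a structured tool call emitted by the model and execute it."""
    call = json.loads(model_output)  # fails loudly on malformed output
    tool = TOOLS[call["name"]]       # fails loudly on unknown tools
    args = call["arguments"]
    for param, expected_type in tool["parameters"].items():
        if not isinstance(args.get(param), expected_type):
            raise TypeError(f"bad argument: {param}")
    return tool["fn"](**args)

print(dispatch('{"name": "get_weather", "arguments": {"city": "Berlin"}}'))
```

Malformed or unexpected calls fail at parse or validation time instead of silently producing a wrong action, which is what makes the model-to-action interface reliable enough for agents.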

The Models Are Ready.
Is Your Organization?

Costs are falling. Capabilities are rising. Open-weight models are catching up. Purpose-built agentic systems are entering production.

Organizations that build adaptable infrastructure, develop evaluation competency, and verify rather than trust will capture each successive wave of improvement.

Start Your AI Journey →