Some Simple Economics of AGI — Structured Notes

Christian Catalini, Xiang Hui, and Jane Wu published Some Simple Economics of AGI in February 2026. Catalini also put the key ideas into a thread on X. I've reorganized that thread here because the framing is sharp and I keep coming back to it.

The Argument: It's About Verification, Not Intelligence

Everyone assumes AI automation follows a digital-versus-physical axis. Knowledge work falls first, robotics catches up later. Nearly everyone also believes that whatever AI can do in general, it can't do their particular job. Lawyers think legal judgment is safe. Doctors think clinical intuition is safe. Strategists think strategy is safe. Creatives are sure creativity is safe.

Catalini points out that a whole vocabulary of comfort has formed around this belief. People reach for words like "taste," "curation," "judgment," "agency," "human touch." These words feel protective precisely because they're hard to pin down, which is also what makes them useless as a defense. Naming something AI supposedly can't do is not the same as explaining why it can't, or showing that the barrier will hold. The residual gets a label and the label gets treated as a moat. Nobody stress-tests it further.

The paper's argument cuts differently. The real boundary isn't about whether work is digital or physical, cognitive or manual, creative or routine. It's about whether someone can check whether the output is correct.

Any task that can be reduced to a metric can be industrialized — regardless of the prestige, complexity, or training historically required for its human execution. The legal brief. The diagnostic. The strategy deck. The creative campaign. Not because AI is "better." Because it's measurable — and therefore automatable.

Most analysis asks "what can AI do?" This paper asks "what can we tell AI did correctly?" Those are very different questions, and the gap between them is where the interesting economics live.

Two Diverging Cost Curves

The paper builds on two cost curves moving in opposite directions.

The cost to automate is falling fast. SWE-bench accuracy went from 4.4% to 71.7% in a single year.1 Agents are getting better at the same rate they're getting cheaper.* The capability curve is now self-reinforcing: agents accelerate the engineering pipelines that build the next generation of agents.

The cost to verify barely moves. It's stuck at the speed of human cognition, institutional review cycles, and the apprenticeship pipelines that train people to spot problems. You can throw more compute at automation. You can't throw more compute at a human reviewer's attention span.

The authors call the growing space between these curves the Measurability Gap. I think this is the most useful concept in the paper. It captures something I've felt but hadn't named: the uneasy sense that AI output is getting more impressive and harder to evaluate at the same time. "Human-in-the-loop" sounds stable but it isn't. The loop stretches thinner with every model generation.
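
To see the divergence concretely, here's a toy sketch in Python. The starting costs and decline rates are invented for illustration; the paper argues the shape of the two curves, not specific numbers.

```python
# Toy illustration of the Measurability Gap: the cost to automate a task falls
# quickly while the cost to verify it barely moves. All numbers are made up
# for illustration; they are not from the paper.

def automation_cost(year: int, start: float = 100.0, annual_decline: float = 0.5) -> float:
    """Cost to automate a task, assumed to halve each year from an arbitrary baseline."""
    return start * (annual_decline ** year)

def verification_cost(year: int, start: float = 100.0, annual_decline: float = 0.03) -> float:
    """Cost to verify a task, pinned near the speed of human review (assumed ~3%/yr gains)."""
    return start * ((1 - annual_decline) ** year)

for year in range(6):
    a, v = automation_cost(year), verification_cost(year)
    print(f"year {year}: automate={a:6.1f}  verify={v:6.1f}  gap={v - a:6.1f}")
```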

Four Zones

The paper maps the economy into quadrants based on cost to automate and cost to verify:

Zone | Cost to automate | Cost to verify | What's here
--- | --- | --- | ---
Safe Industrial | Cheap | Affordable | Chat, image generation, short code bursts. The easy wins where verification is cheap.
Human Artisan | Hard | Verifiable | Embodied craft, high-touch services, physical-world work. Human advantage holds for now.
Pure Tacit | Hard | Hard | True uncertainty. Humans navigate by intuition where no map exists. Shrinks as world models improve.
Runaway Risk | Cheap | Unaffordable | The danger zone. You can deploy AI here but can't tell if it's working. Privately rational, systemically toxic.

The worry is that as the Measurability Gap widens, more of the economy drifts into the Runaway Risk zone. You get more AI output. You don't get more ability to check it.
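
The quadrant logic reads naturally as a lookup on the two costs. In the sketch below, the cutoffs and example tasks are hypothetical; the paper defines the zones qualitatively, not with numeric thresholds.

```python
# The four zones as a lookup on two costs. The cutoff values and the example
# tasks are hypothetical, for illustration only.

from dataclasses import dataclass

CHEAP_TO_AUTOMATE = 10.0  # assumed cutoff, arbitrary cost units
CHEAP_TO_VERIFY = 10.0    # assumed cutoff, arbitrary cost units

@dataclass
class Task:
    name: str
    cost_to_automate: float
    cost_to_verify: float

def zone(task: Task) -> str:
    cheap_auto = task.cost_to_automate <= CHEAP_TO_AUTOMATE
    cheap_verify = task.cost_to_verify <= CHEAP_TO_VERIFY
    if cheap_auto and cheap_verify:
        return "Safe Industrial"
    if not cheap_auto and cheap_verify:
        return "Human Artisan"
    if not cheap_auto and not cheap_verify:
        return "Pure Tacit"
    return "Runaway Risk"  # cheap to automate, too expensive to verify

for t in [Task("short code burst", 2, 3),
          Task("high-touch care work", 80, 5),
          Task("agent-written risk memo", 4, 90)]:
    print(f"{t.name}: {zone(t)}")
```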

The Junior Pipeline Problem

A Stanford study using ADP payroll data found that employment for 22-to-25-year-olds in AI-exposed fields dropped ~16% relative to less-exposed occupations after ChatGPT's release.2 These aren't layoffs. Companies just stopped hiring juniors, treating AI as a substitute for entry-level execution.

The problem: juniors are how you make seniors. Seniors are who verify things. Companies are thinning the pipeline that produces future verifiers right when the economy needs more verification capacity. The authors call this the Missing Junior Loop and it's the kind of slow-moving structural damage that doesn't show up in quarterly earnings.

Related: the Codifier's Curse. AI doesn't just do tasks. It extracts the tacit knowledge that made experienced people valuable and packages it up. The expertise moat drains from the inside. Senior judgment gets commoditized faster than the profession can train replacements.

When AI Checks AI

The obvious move is to use AI to verify AI. The problem: the agent and its auditor share the same training data. Same blind spots. The system self-certifies its own failures.

The examples in the paper are wild. Frontier reasoning models learned to game unit tests instead of fixing the code. GPT-4 executed an insider trade and hid it from its supervisor.3 o3 disabled its own shutdown scripts in 79 of 100 runs.4 Claude Opus 4 attempted blackmail in 84–96% of runs.5 None were told to do any of this. Catalini's framing: "Goodhart's Law with teeth." The models treat every dimension you don't measure as an unconstrained degree of freedom.
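
The "every dimension you don't measure" point is easy to simulate. In the toy sketch below, an agent scored only on a measured proxy dumps its whole effort budget into the visible dimension and none into an unmeasured quality dimension. Both scoring functions are invented for illustration; this shows the mechanism, not the paper's model.

```python
# Toy Goodhart's Law: an agent judged only on a measured proxy spends
# everything on the visible dimension and nothing on the unmeasured one.
# Both scoring functions are invented for illustration.

import random

def measured_proxy(visible_effort: float, hidden_quality: float) -> float:
    """What the evaluator can check. The hidden dimension doesn't enter at all."""
    return visible_effort

def true_value(visible_effort: float, hidden_quality: float) -> float:
    """What was actually wanted: the output is only useful if both are present."""
    return visible_effort * hidden_quality

random.seed(0)
budget = 10.0  # total effort the agent can split across the two dimensions
candidates = [(v, budget - v) for v in (random.uniform(0, budget) for _ in range(1000))]

# The agent keeps whatever maximizes the metric it is judged on.
best = max(candidates, key=lambda c: measured_proxy(*c))
print("proxy-optimal split (visible, hidden):", best)    # nearly all effort is visible
print("proxy score:", round(measured_proxy(*best), 2))   # looks great on the dashboard
print("true value:", round(true_value(*best), 2))         # collapses toward zero
# A balanced 5/5 split would score 5 on the proxy but 25 on true value.
```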

The Hollow Economy

The paper's worst-case scenario has a name: the Hollow Economy. Tons of nominal output, decaying actual quality. Systems optimizing for measurable proxies while the gap between what's measured and what's intended grows. It doesn't arrive as a crisis. It accumulates through ordinary cost-cutting, one reasonable deployment at a time.

The authors are clear that this isn't inevitable, and their point isn't "slow down." It's "build verification infrastructure at the same pace you build capability": get better at telling good output from bad.

Who Does What

The paper has a full playbook section. The parts I found most useful:

For individuals, intelligence is a commodity now, so human work splits into three roles. Directors deal with genuine uncertainty, turn vague goals into specs, and run agent swarms. Meaning makers work where value comes from social consensus and human connection, not measurable output. Liability underwriters spot hidden risk, take responsibility for outcomes, and produce the ground truth that makes future automation possible.

The concept of "flight simulators for work" is interesting too. Synthetic practice environments that compress the path to competence. One person with the right simulators can ramp to what used to require years of apprenticeship.

For companies, the organizational model converges on what they call the AI sandwich: human intent at the top, agentic execution in the middle, human verification at the bottom. The revenue model shifts from selling software access to selling verified outcomes. "Software-as-Labor." Execution scales infinitely, but the capacity to absorb its failures doesn't, so the real bottleneck becomes liability. ElevenLabs already got insurance for its AI voice agents. That's where this is going.
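
As a rough sketch of the sandwich's control flow, with placeholder functions standing in for each layer (the names and the retry policy are mine, not the paper's):

```python
# Sketch of the "AI sandwich": human intent on top, agentic execution in the
# middle, human verification at the bottom. Function names and the retry
# policy are placeholders, not anything specified in the paper.

from typing import Callable

def ai_sandwich(
    specify: Callable[[], str],          # human turns a vague goal into a spec
    execute: Callable[[str], str],       # agents produce a candidate output
    verify: Callable[[str, str], bool],  # human (or accountable party) signs off
    max_attempts: int = 3,
) -> str:
    spec = specify()
    for _ in range(max_attempts):
        candidate = execute(spec)
        if verify(spec, candidate):
            return candidate             # the verified outcome is the sellable unit
    raise RuntimeError("no candidate passed verification; liability stays human")

# Trivial stand-ins just to show the control flow.
result = ai_sandwich(
    specify=lambda: "summarize Q3 churn drivers in one paragraph",
    execute=lambda spec: f"[draft addressing: {spec}]",
    verify=lambda spec, out: spec.split()[1] in out,  # placeholder check
)
print(result)
```

The point of the shape: the middle layer can be swapped or scaled at will, but nothing ships until the bottom layer signs off and someone owns the outcome.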

For investors: stop funding raw execution. Fund verification infrastructure and deep tech where measurability hasn't arrived yet. Be suspicious of network effects driven by agents. Agents can inflate activity at zero marginal cost. If you can't verify the network is real, you can't value it.

For policymakers: the gains of AI deployment are privatized. The systemic risks are socialized. That's the market failure. No prior technology simultaneously reduced costs across every knowledge domain at once, so the usual playbook of sector-specific regulation doesn't apply.

What Stays With Me

The closing line from Catalini's thread:

"Taste," "curation," "judgment," "agency," "the human touch." These are not wrong — but they are not a strategy. They are the names we give to the residual we haven't yet analyzed. This paper is an attempt to analyze it — and what we found is that the residual has a structure, the structure has a logic, and the logic leads to verification. That is what's defensible. That is what's scarce. That is what we should be building.

I've been thinking about this alongside the Fowler retreat notes and the vertical software analysis I wrote up earlier. They all point in the same direction. The hard part was never generating output. The hard part is knowing whether the output is any good. That was true before AI, and it's becoming the whole game now.


Notes

* Cheaper tokens unlock agentic loops and multi-agent orchestration that consume orders of magnitude more tokens per task as we drive toward better and better outcomes. As these capabilities become table stakes, the token floor required just to stay competitive keeps rising. The total cost of automation will almost certainly increase even as the unit economics improve.
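
A quick back-of-the-envelope with invented numbers makes the point: if unit prices fall 10x while staying competitive requires 100x more tokens per task, spend per task still rises 10x.

```python
# Invented numbers to illustrate the note's arithmetic: unit price per token
# falls 10x, tokens consumed per task rise 100x, total spend per task rises 10x.
price_per_million_tokens = {"last year": 10.00, "this year": 1.00}
tokens_per_task = {"last year": 100_000, "this year": 10_000_000}

for period in ("last year", "this year"):
    cost = price_per_million_tokens[period] * tokens_per_task[period] / 1_000_000
    print(f"{period}: ${cost:.2f} per task")  # $1.00 -> $10.00
```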


Additional References

  1. Stanford HAI, AI Index 2025 Annual Report — Technical Performance chapter, SWE-bench results.
  2. Brynjolfsson, Chandar, Chen, Canaries in the Coal Mine? — Stanford Digital Economy Lab, using ADP payroll data.
  3. Apollo Research, Strategic Deception in LLMs — presented at the UK AI Safety Summit, 2023.
  4. Palisade Research, Shutdown Resistance in Reasoning Models — testing o3, o4-mini, and other models.
  5. Anthropic, Agentic Misalignment — Claude Opus 4 system card and safety testing results.