The AI Index 2026 · A Synthesis · Stanford HAI · Ninth Edition
The state of artificial intelligence · 2025 → 2026

The Jagged Year

Stanford AI Index 2026 — a field scaling faster than the instruments built to measure it, the institutions built to govern it, and the workers built to absorb it

AI got radically more powerful and radically harder to see clearly in the same year — capability is converging, disclosure is collapsing, and the rulers we use to measure both are breaking in our hands.

425 pages · 9 chapters · 13 themes synthesized · Data through Mar 2026
2.7%
U.S. lead over China's top model by March 2026, up from a 0.4% gap in Feb 2025
Ch. 2 Technical Performance
50.6% vs 90.1%
Top model vs human accuracy reading analog clocks — even as models win IMO gold
Ch. 2 Technical Performance
72,816 tons
Estimated CO2e from training Grok 4 — over 1,000x an average car's lifetime emissions
Ch. 1 Research & Development
1 foundry
TSMC fabricates nearly every leading AI chip — a single geopolitical point of failure
Ch. 1 Research & Development
362
Documented AI incidents in 2025, up from 233, as the transparency index fell from 58 to 40
Ch. 3 Responsible AI
~20%
Drop in U.S. software-developer employment ages 22-25 from its 2022 peak
Ch. 4 Economy
23.1x
U.S. private AI investment ($285.9B) relative to China's ($12.4B) in 2025
Ch. 4 Economy
50-point gap
AI experts (73%) vs U.S. public (23%) expecting AI to improve how people do their jobs
Ch. 9 Public Opinion
The big picture

After arrival, the reckoning

Capability did not plateau in 2025 — it accelerated and converged at the same time. Industry produced over 90% of the year's notable frontier models, coding agents went from roughly 60% to near-100% of the human baseline on SWE-bench Verified in twelve months, and the top four chatbot-arena models now sit within 25 Elo points of one another. The U.S.-China frontier gap, a 5-point sliver in February 2025, stood at just 2.7% by March 2026; the closed-vs-open gap reopened only modestly to ~3.3%. When everyone's best model is within single digits of everyone else's, raw capability stops being the prize — the contest moves to cost, latency, reliability, and the unglamorous question of whether the thing actually works.

But the same systems that win International Mathematical Olympiad gold read an analog clock correctly only about half the time and still fail roughly one in three structured agentic tasks. This is the report's defining motif — 'jagged intelligence' — and it runs through every chapter: robots succeed at 89% of manipulation tasks in simulation but only ~12% of real household tasks; AI agents can run an end-to-end weather pipeline yet score below 20% on replicating a published astrophysics paper; clinical models out-diagnose physicians on exam-style cases while still issuing 12-15 severely harmful recommendations per 100 open-ended cases. Brilliance and unreliability now live inside the same model, and the report's central warning is that we keep mistaking one for the other.

The inputs powering all of this are scaling faster than anything else measured — and faster than efficiency, transparency, or governance can keep up. Global AI compute grew 3.3x per year to 17.1 million H100-equivalents; AI data-center power hit 29.6 GW, roughly New York State at peak demand; training Grok 4 emitted an estimated 72,816 tons of CO2e, and GPT-4o inference alone may consume the annual drinking water of 1.2 million people. Nearly every leading AI chip is fabricated by one company, TSMC, in one place, Taiwan — a single point of failure under the most strategically important industry on earth. Money concentrated even harder: U.S. private AI investment ($285.9B) ran 23x China's, while consumer value from largely-free tools ($172B/year, median value per user tripling) dwarfs producer revenue — value diffusing widely even as capital, robots, and compute entrench in a handful of countries and mega-deals.

The human and institutional systems are visibly straining against this pace. AI's first clear labor signal is generational — employment for U.S. software developers aged 22-25 fell ~20% from its 2022 peak while older cohorts kept growing — hitting the entry-level rung that builds tomorrow's senior talent. Documented AI incidents jumped to 362 from 233, even as foundation-model transparency reversed (the index average fell from 58 to 40) and frontier labs almost never report safety benchmarks the way they report capability. Four in five students use generative AI while only 6% of schools have clear policies; U.S. states passed 150 AI bills as Washington pivoted to deregulation and moved to sue those states. And experts and the public hold incompatible mental models — a 50-point gap on whether AI helps how people do their jobs (73% vs 23%) — with the public's appetite for guardrails real, bipartisan, and aimed at governments it trusts least (U.S. at 31%, lowest surveyed).

The honest summary is not 'AI is plateauing' or 'AI is unstoppable' — it's that AI has outrun its own measurement. The benchmarks are saturating in months, the labs are disclosing less, independent tests diverge from self-reports, and the report's two new chapters on science and medicine show the same split everywhere: AI can now run the whole pipeline but cannot yet be trusted to be right. The gap between what AI can do and how prepared we are to manage it is, in the co-chairs' words, the through-line of the entire report — and in 2026 that gap widened on both ends at once.

What changed this year

Eight shifts that define the 2026 edition

  1. The U.S.-China model-performance gap effectively closed: from 5 Arena points (0.4%) in Feb 2025 to 2.7% by March 2026, trading the lead repeatedly through the year; DeepSeek-R1 even briefly matched the top U.S. model.
  2. Two brand-new chapters, Science and Medicine, formalize AI's move from assisting steps to replacing whole workflows (end-to-end weather forecasting, generative genomics, an AI-authored paper accepted at an ICLR workshop), while exposing that frontier agents still score below half of expert performance on real research.
  3. The evaluation crisis went mainstream: benchmarks built to last years are saturating in months (HLE jumped 30 points in one year), a Stanford review found up to 42% invalid questions on GSM8K, and independent testing now diverges from developer-reported scores.
  4. Transparency reversed direction: the Foundation Model Transparency Index average fell from 58 to 40, over 90% of notable models shipped without training code, and frontier labs almost never report responsible-AI benchmarks the way they report capability.
  5. AI's environmental and infrastructure footprint became a headline metric: 29.6 GW of data-center power (≈New York State), 72,816 tons CO2e to train Grok 4, and inference water use that may exceed the drinking needs of 1.2 million people, with TSMC named as a single hardware choke point.
  6. The first clear labor signal arrived and it is generational: ~20% employment decline for U.S. software developers aged 22-25 while older cohorts grew. This is 'seniority-biased technological change,' not broad displacement.
  7. Sovereignty and bifurcation hardened: U.S. states passed 150 AI bills as the White House rescinded EO 14110 and created a DOJ task force to challenge state laws, while compute, models, and talent stayed concentrated in the U.S. and China.
  8. Adoption broke records but value decoupled from capital: GenAI hit ~53% population adoption in three years (faster than PC or internet), consumer surplus reached ~$172B/year with median value per user tripling, yet the U.S. ranks only 24th in population-level adoption at 28.3%.
The thirteen themes

The most consequential —
and most actionable — findings

Each theme pairs the headline finding with the second-order effects and hidden ramifications beneath it.

T01
2 - Technical Performance

Capability Has Converged — Compete on Cost, Not Leaderboard Rank

The world's top models are now separated by single-digit percentages, so raw capability is no longer where the AI race is won.

As of March 2026 the top four chatbot-arena models sit within fewer than 25 Elo points (Anthropic 1,503, xAI 1,495, Google 1,494, OpenAI 1,481), down from a ~97-point spread the prior year. The U.S.-China gap has effectively closed — from 5 points (0.4%) in February 2025 to 39 points (2.7%) in March 2026 — and the closed-vs-open gap is just 3.3-3.4%. The report explicitly states pressure now shifts to 'cost, reliability, and real-world usefulness.'

<25 Elo pts
Spread of the top 4 Arena models (was ~97 a year ago)
2.7%
Top U.S. model's lead over the top Chinese model, Mar 2026 (was 0.4% in Feb 2025)
3.3-3.4%
Gap between the top closed and top open model
    What to do
  • Procurement leaders: select models on cost, latency, reliability, and domain-specific performance — not Arena rank — and re-bid annually as the tier reshuffles
  • Executives: pilot open-weight models (gap now ~3.3%) for non-frontier workloads to cut cost and lock-in
  • Policymakers: treat near-parity as the new baseline for U.S.-China competition policy rather than assuming a durable U.S. lead
Frontier convergence — top providers on the Arena (Mar 2026)
Provider      Chatbot Arena Elo
Anthropic     1503
xAI           1495
Google        1494
OpenAI        1481
Alibaba       1449
DeepSeek      1424
Mistral AI    1416
Meta          1335
(Chart scale starts at 1320 · top 4 within 25 pts)
The top four are within 25 Elo points; capability is no longer the differentiator.
“The U.S.-China performance gap has effectively closed and now fluctuates within single digits.” (AI Index 2026, Ch. 2)
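To make the "within 25 Elo points" claim concrete, the sketch below converts a rating gap into an expected head-to-head preference rate. It assumes Arena scores behave like classic Elo ratings (logistic curve, 400-point scale); that mapping is an assumption for illustration, not something the report states. The function names are hypothetical, and the scores are the provider figures charted above.

```python
# Sketch only. Assumption: Arena scores follow the classic Elo expectation,
# E = 1 / (1 + 10 ** (-gap / 400)). The report does not confirm this mapping.

# Provider scores as charted above (Mar 2026).
ARENA_ELO = {
    "Anthropic": 1503, "xAI": 1495, "Google": 1494, "OpenAI": 1481,
    "Alibaba": 1449, "DeepSeek": 1424, "Mistral AI": 1416, "Meta": 1335,
}


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected share of head-to-head preferences won by A over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def top_n_spread(scores: dict, n: int) -> int:
    """Rating gap between the 1st and n-th ranked providers."""
    ranked = sorted(scores.values(), reverse=True)
    return ranked[0] - ranked[n - 1]


if __name__ == "__main__":
    spread = top_n_spread(ARENA_ELO, 4)
    win_rate = expected_score(ARENA_ELO["Anthropic"], ARENA_ELO["OpenAI"])
    print(f"Top-4 spread: {spread} Elo points")
    print(f"Expected preference rate, 1st vs 4th: {win_rate:.1%}")
```

Under that assumption, the 22-point gap between first and fourth place corresponds to an expected preference rate of roughly 53%, which is close to a coin flip.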
Beneath the headline: 2nd-order effects & hidden ramifications (6 second-order · 5 hidden · 5 cascades)
Second-order effects / what follows
Capital intensity rises even as capability converges: the race moves to who can subsidize inference and lock in distribution, not who tops the leaderboard. [near]
When seven of the eight charted providers sit within single-digit percentages of the leader (Ch2: top 4 within 25 Elo), no one can win on the model, so each competes by under-pricing inference and bundling AI into owned distribution surfaces. The report's own data shows the spend did NOT converge with the capability: global corporate AI investment hit $581.69B (+129.9%), generative-AI private funding reached $170.9B (up more than 200%), and hyperscaler capex more than doubled (Google $150B+ capex in 2025, Ch4). Parity at the top forces a margin war that only balance-sheet giants can sustain.
A wave of frontier-lab consolidation and shake-out, because near-parity destroys the willingness-to-pay premium that funded second-tier labs. [mid]
Buyers can now substitute the #2-#6 model for the #1 at near-zero quality loss (US-China gap 2.7%, closed-open 3.3-3.4%). That collapses pricing power for everyone below the demand-aggregating leaders, while M&A is already +132.6% YoY and 28 rounds topped $1B (Ch4). Labs whose only moat was a transient Elo lead lose their fundraising narrative and get acquired or commoditized.
The strategic chokepoint migrates entirely off the model layer and onto compute and energy, exactly the layers the report shows are most concentrated and most fragile. [mid]
If models are interchangeable, the only durable advantage is cost-per-token, which is set by chips, power, and data-center siting. Compute is scaling 3.3x/yr to 17.1M H100e with Nvidia >60% and TSMC as a single foundry (Ch1/T03), and inference — not training — is the dominant resource cost (29.6 GW, T05). 'Compete on cost' therefore routes the entire competition through the report's single-point-of-failure hardware stack.
Open-weight adoption accelerates for the commodity middle of the market but stalls at the frontier, bifurcating the stack into a cheap open base layer and a thin closed premium tier. [near]
The open-closed gap is only 3.3-3.4% (Ch2), so most non-frontier workloads rationally move to open weights to cut cost and lock-in — exactly the report's recommended action. But the gap REOPENED from 0.5% (Aug 2024) and 6 of the top 10 are closed, so the highest-stakes work stays proprietary. The result is a barbell: open commodity inference plus a small, defensible closed frontier.
Procurement re-bidding becomes annual churn, and switching cost / interoperability, not model quality, becomes the real lock-in battleground. [mid]
The report tells buyers to 're-bid annually as the tier reshuffles.' Once buyers internalize that, vendors can no longer retain customers on quality, so they compete on lock-in mechanics: proprietary tool-calling formats, agent frameworks, fine-tune portability, and data gravity. The fight shifts from benchmark scores to switching costs that the AI Index does not measure.
Benchmark/Arena gaming intensifies precisely because rank no longer reflects quality but still drives perception and procurement. [near]
When real capability differences shrink below measurement noise, a 20-Elo move is the difference between 'leader' and 'also-ran' in the press, yet Arena standing 'may partly reflect adaptation to the platform rather than general capability' and benchmarks carry up to 42% invalid questions (Ch2). The marginal return to gaming the eval rises as the marginal real-capability gap falls — convergence makes the leaderboard more, not less, manipulable.
Hidden ramifications / who pays
Convergence is being declared using a measurement instrument that is simultaneously degrading — the Elo spread narrowed at the same moment frontier labs stopped disclosing and Arena scores became gameable, so 'parity' may partly be measurement collapse, not real capability convergence.
Who bears it: Procurement leaders, policymakers, and independent auditors who treat the <25-Elo spread as ground truth
Why hidden: The convergence headline (Ch2) and the transparency-collapse finding (FMTI 58→40, 81 of 102 models without training code, Ch1/Ch3) live in different chapters and are never cross-multiplied. A narrowing gap measured by a benchmark with up to 42% invalid questions and platform-adaptation effects could reflect the ceiling of the test rather than the ceiling of the models.
'Compete on cost' silently transfers strategic power to whoever controls the cheapest electrons and chips — meaning the convergence story makes energy utilities, TSMC, and Nvidia the real winners of the model race, not any model lab.
Who bears it: Grid operators, power-constrained regions, and nations without sovereign compute (most of the world: China 85 supercomputers, MENA 3, South Asia 2 per Ch8)
Why hidden: T01 frames the shift as good news for buyers ('stop overpaying'), burying that the same logic concentrates leverage upstream into the single-foundry, single-vendor, gigawatt-scale layer the report flags elsewhere as the system's most fragile.
China's apparent 'parity' (2.7% gap) is achieved at a 23x private-capital disadvantage — which, read forward, means China is far more capital-efficient per frontier-point and the US capex lead is buying a vanishing quality premium.
Who bears it: U.S. policymakers and investors betting on a durable compute-and-capital moat
Why hidden: The convergence theme (Ch2) and the 23.1x investment gap (Ch4) are presented as separate facts. Combined, they invert the comfort: if China comes within 2.7% of the U.S. frontier on ~4% of US private spend (plus undisclosed state guidance funds est. $184B), the marginal US dollar is buying almost no defensible capability lead, and export controls slow a race that is already nearly tied.
Annual re-bidding plus interchangeable models pushes the entire industry toward zero-margin inference, which structurally favors firms that monetize elsewhere (ads, cloud, devices) over pure-play model labs — quietly selecting for surveillance- and ad-funded AI business models.
Who bears it: End users (privacy/attention costs), pure-play model startups, and regulators worried about ad-driven incentives
Why hidden: The report's buyer-side framing ('select on cost') never asks how a commodity-priced model gets funded. A margin war on the model layer is won by whoever has a non-model revenue stream to cross-subsidize it, which selects for exactly the business models least aligned with user interest.
Convergence at the frontier coexists with a 22-94% hallucination spread and jagged reliability — so 'models are interchangeable' is true for headline quality but false for failure modes, and cost-driven substitution can swap in a model that is equivalent on Arena but quadruples fabrication risk.
Who bears it: Enterprises in high-stakes domains (medicine, law, finance) who substitute on price under T01's advice
Why hidden: T01 says capability is 'no longer the differentiator,' but T02/Ch3 show reliability and hallucination vary 4x across models that look identical on Arena. The convergence metric averages over exactly the tail behaviors that matter most in deployment, so 'equivalent' models are not equivalent where it counts.
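The capital-efficiency argument above (a 2.7% gap reached on "~4%" of U.S. private spend) rests on two small pieces of arithmetic. The sketch below reproduces them from the report's 2025 private-investment figures; the variable names are illustrative, and the state guidance funds are deliberately left out because the text gives no comparable annual figure for them.

```python
# Sketch only: reproduces the 23.1x and ~4% figures from the two
# private-investment totals quoted in this synthesis (2025, USD billions).
US_PRIVATE_AI_INVESTMENT_B = 285.9
CHINA_PRIVATE_AI_INVESTMENT_B = 12.4

# How many times larger U.S. private AI investment is than China's.
us_to_china_ratio = US_PRIVATE_AI_INVESTMENT_B / CHINA_PRIVATE_AI_INVESTMENT_B

# China's private AI investment as a share of the U.S. total.
china_share_of_us = CHINA_PRIVATE_AI_INVESTMENT_B / US_PRIVATE_AI_INVESTMENT_B

if __name__ == "__main__":
    print(f"U.S. / China private investment: {us_to_china_ratio:.1f}x")
    print(f"China as share of U.S. spend:    {china_share_of_us:.1%}")
```

Both headline numbers follow directly from the two totals: about 23.1x, and about 4.3% of U.S. private spend.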
Cascades / causal chains
Capability converges (<25 Elo spread) → quality premium evaporates → competition shifts to cost-per-token → cost is set by chips + power → strategic leverage concentrates into TSMC/Nvidia/energy (the report's own single points of failure) → a model-layer 'good news' story becomes a hardware-and-grid bottleneck story.
Top models become interchangeable → buyers re-bid annually and substitute on price → pure-play labs lose pricing power → only firms with non-model revenue (cloud, ads, devices) can sustain subsidized inference → consolidation + ad/surveillance-funded business models are structurally selected for.
Real capability gaps shrink below measurement noise → a 20-Elo move still decides 'leader' in perception and procurement → marginal return to gaming the eval rises while marginal real gap falls → Arena/benchmark gaming intensifies → the convergence signal itself becomes less trustworthy.
Open-closed gap is only 3.3% → commodity workloads migrate to open weights → but the gap reopened from 0.5% and 6 of top 10 stay closed → the stack barbells into a cheap open base layer + a thin defensible closed frontier → value concentrates at the two ends and hollows out the middle tier of labs.
China comes within 2.7% of the US frontier on ~4% of US private capital → US capex/export-control moat buys a vanishing quality lead → 'maintain US lead' policy is fighting an already-tied race → the geopolitical premise of compute-denial strategy is undermined by the convergence the same report documents.
What to watch / leading indicators
Inference price-per-million-tokens trajectory vs. gross margins at the leading API providers: if convergence is real, expect a sustained price war and compressing margins on the model layer, with value visibly migrating to whoever has non-model revenue to subsidize it. Watch for sub-cost 'loss-leader' pricing tiers.
Open-weight share of production enterprise workloads (e.g., Hugging Face download mix, enterprise deployment surveys): T01 predicts open weights capture the commodity middle. Confirmation = rising open-weight share for non-frontier tasks while the closed-open Arena gap stays near 3.3%. Refutation = the gap widening back toward 2024-style spreads, which would restore a real frontier premium.
Divergence between Arena rank and independent, contamination-controlled third-party reliability/hallucination evals: if 'convergence' is partly measurement collapse, expect Arena-equivalent models to show widening spreads on held-out reliability benchmarks (hallucination, agentic success, domain tasks). A persistent gap here would refute the 'capability is fungible' read.
The next US-China Arena gap reading and the cost-per-frontier-point ratio: watch whether China closes or holds the 2.7% gap while spending a fraction of US private capital. Continued parity at a 20x+ capital disadvantage signals the US compute/capex moat is buying a vanishing lead; a reopening gap would validate the moat.
Tensions & contradictions
T01 says capability is 'no longer the differentiator,' but T02 (Jagged Intelligence) shows models that sit within 25 Elo points of one another still diverge wildly on reliability: 50.6% vs 90.1% clock-reading, a 22-94% hallucination range, ~1-in-3 agentic failure. Convergence on headline quality is NOT convergence on the failure modes that govern deployment, so 'select on cost' can substitute a benchmark-equivalent but operationally inferior model.
T01 tells buyers to stop overpaying because models are interchangeable, while T03 (single-foundry chokepoint) and T05 (inference resource bomb) show that 'competing on cost' routes the whole industry through the most concentrated and fragile layer of the stack. The cost the buyer saves on the model is paid back as systemic supply-chain and energy fragility.
T01 declares near-parity (US-China 2.7%, closed-open 3.3%), but T08 (Diffusion Gap) shows capital, robots, and compute are MORE concentrated than ever (US 23.1x China in private AI investment; China 54% of robot installs; supercomputing concentrated in two countries). Capability converged while everything that produces and deploys it diverged.
T01 is built on Arena Elo as the convergence metric, but T04 (Accountability Vacuum) and Ch2's own caveats establish that the frontier labs are the LEAST transparent and that Arena standing 'may partly reflect adaptation to the platform.' The theme's central evidence is drawn from the instrument the report elsewhere says we can least trust.
T01's recommended move to open-weight models for non-frontier work collides with the report's finding that the open-closed gap REOPENED (0.5% -> 3.3-3.4%) and 6 of the top 10 are closed — open weights are competitive for the commodity middle but explicitly not at the frontier, so the 'adopt open' advice is bounded in a way the headline understates.
⟂ Contrarian read
"The convergence is not the story — it is an artifact of benchmark saturation and a shared training-data substrate, and it will be temporary. The defensible non-consensus read: top models cluster within 25 Elo not because capability genuinely converged but because (a) the public benchmarks have saturated and become gameable (HLE jumped 30 points in a year; Arena 'may reflect platform adaptation'; up to 42% invalid questions), and (b) every frontier lab now trains on overlapping web data that is >50% AI-generated (Ch1), homogenizing outputs toward a common mean. On this reading, 'compete on cost' is premature advice: the moment a lab finds a genuinely new capability axis (long-horizon agentic reliability, world-models from video per Veo 3, or contamination-free reasoning), the spread will violently re-open — and buyers who commoditized their model layer and re-bid annually will have built switching infrastructure for a fungibility that proves illusory. The real differentiator didn't disappear; the current generation of tests just stopped being able to see it."
T02
2 - Technical Performance · 6 - Medicine

Jagged Intelligence — Never Confuse Brilliance with Reliability

Models win IMO gold yet read analog clocks barely half the time and still fail roughly one in three agentic tasks.

Gemini Deep Think earned IMO gold (35 points) in 2025, yet the top model reads analog clocks correctly only 50.6% of the time versus 90.1% for humans. AI agents leapt from ~12% to 66.3% on OSWorld but still fail roughly one in three structured tasks, and top scores remain below 80% on tax (77.1%), mortgage (69.4%), and corporate-finance work. In medicine, general-purpose LLMs still make 11.8-14.6 severely harmful recommendations per 100 cases (76.6% of them errors of omission).

50.6% vs 90.1%
Analog clock reading: top model vs human (despite IMO gold)
66.3%
Top agent OSWorld accuracy — still ~1 in 3 tasks fail
11.8-14.6
Severely harmful clinical recommendations per 100 cases by leading LLMs
42%
Invalid-question rate found on GSM8K in a Stanford benchmark review
    What to do
  • Clinicians, lawyers, finance leaders: use AI as a supervised aid, never an autonomous decision-maker, in high-stakes work
  • Product teams: architect explicit human review, abstention/uncertainty signaling, and rollback into every agentic workflow
  • Procurement leaders: require independent, contamination-controlled, third-party evaluation rather than trusting self-reported leaderboard scores
  • Educators: teach 'jagged intelligence' literacy — where models are superhuman and where they silently fail
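The product-team recommendation above (human review, abstention signaling, rollback) can be sketched as a small routing gate. Everything here is hypothetical: the `AgentResult` fields, the thresholds, and the `verify` callback are illustrative assumptions, not an API from the report or any library. Note also that a model's self-reported confidence may itself be jagged, so a real system would need independently calibrated signals.

```python
# Sketch only: hypothetical types and thresholds, not a production control.
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Route(Enum):
    AUTO_APPLY = "auto_apply"
    HUMAN_REVIEW = "human_review"
    ABSTAIN = "abstain"


@dataclass
class AgentResult:
    action: str
    confidence: Optional[float]  # None means no usable uncertainty signal
    high_stakes: bool            # e.g. clinical, legal, or financial impact


def route(result: AgentResult,
          auto_threshold: float = 0.9,
          abstain_threshold: float = 0.5) -> Route:
    """Decide whether an agent action is applied, reviewed, or dropped."""
    # No signal, or a weak one: abstain rather than guess.
    if result.confidence is None or result.confidence < abstain_threshold:
        return Route.ABSTAIN
    # High-stakes work always goes to a human, however confident the model is.
    if result.high_stakes or result.confidence < auto_threshold:
        return Route.HUMAN_REVIEW
    return Route.AUTO_APPLY


def apply_with_rollback(state: dict, change: dict,
                        verify: Callable[[dict], bool]) -> dict:
    """Apply a change to a copy of the state; keep the original if verification fails."""
    snapshot = dict(state)
    candidate = {**state, **change}
    return candidate if verify(candidate) else snapshot
```

The design choice worth noting is that `high_stakes` overrides confidence entirely, which mirrors the "supervised aid, never an autonomous decision-maker" guidance for high-stakes work.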
Jagged intelligence — agents close on humans but still fall short
Benchmark     AI agent   Human baseline
OSWorld       66.3%      72.35%
WebArena      74.3%      78.2%
GAIA          74.5%      92%
Agents are within single digits of humans on several tasks yet still fail roughly one in three attempts.
“AI models can win a gold medal at the International Mathematical Olympiad but still can't reliably tell time, illustrating what researchers call jagged intelligence.” (AI Index 2026, Ch. 2)
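The agent-versus-human comparison above combines two different measures: the gap to the human baseline and the agent's absolute failure rate. A minimal sketch, using only the three benchmark pairs listed above (the function and key names are illustrative), separates them:

```python
# Sketch only: the scores are the agent / human-baseline pairs listed above.
BENCHMARKS = {
    # benchmark: (agent accuracy %, human baseline %)
    "OSWorld": (66.3, 72.35),
    "WebArena": (74.3, 78.2),
    "GAIA": (74.5, 92.0),
}


def summarize(benchmarks: dict) -> dict:
    """Per benchmark: percentage-point gap to humans and agent failure rate."""
    rows = {}
    for name, (agent, human) in benchmarks.items():
        rows[name] = {
            "gap_to_human_pts": round(human - agent, 2),
            "failure_rate_pct": round(100.0 - agent, 2),
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize(BENCHMARKS).items():
        print(f"{name:9s} gap {row['gap_to_human_pts']:5.2f} pts, "
              f"fails {row['failure_rate_pct']:5.2f}% of tasks")
```

On these figures, OSWorld and WebArena sit within single digits of the human baseline while GAIA trails by 17.5 points, and the OSWorld failure rate of 33.7% is the "one in three" cited in the caption.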
Beneath the headline: 2nd-order effects & hidden ramifications (6 second-order · 5 hidden · 5 cascades)
Second-order effects / what follows
A liability and insurance gap opens precisely where jaggedness is most dangerous, because the failure mode is silent and uncorrelated with headline competence. [mid]
Jagged failures are not random noise that averages out — they are systematic, task-specific cliffs (50.6% clock-reading, 76.6% of clinical harms being errors of omission). Omission errors are invisible at the point of use: a model that confidently does nothing leaves no error trace to catch. Underwriters and tort systems price risk off observable, frequency-based failure; a system that is superhuman on the demo and catastrophic on an unannounced edge case has no actuarial base rate. So malpractice/E&O insurers either decline to cover autonomous deployments or price them off worst-case, which structurally pushes liability back onto the human-in-the-loop the report prescribes.
The report's own prescription, 'keep a human in the loop,' degrades into automation complacency and becomes a liability-laundering layer rather than a safety control. [near]
Capability convergence (T01) plus 66.3% agent success means the human reviewer sees the model succeed ~2 of every 3 times. High base-rate success is exactly the condition under which human vigilance decays (the classic monitoring-paradox). The human is retained not because they reliably catch the 1-in-3 failure but because they are the legally accountable party. The control the report treats as a fix is, mechanistically, the weakest link precisely when the model is good enough to lull.
Procurement and deployment quietly bifurcate into 'constrained narrow tools that ship' versus 'open-ended agents that stall,' inverting the marketing narrative. [mid]
Ch. 6 is the natural experiment: constrained, clinician-overseen tools (sepsis prediction 18.7% mortality cut, ambient scribes 83% less note time) produced hard outcome gains, while open-ended LLM diagnosis produced 11.8–14.6 severe harms/100. Buyers who internalize jaggedness will route capital toward narrow, bounded-input tools and away from the autonomous-agent products the capex is funding (Ch. 4: agent deployment still single-digit %). The frontier labs sell breadth; the value accrues to narrowness.
Benchmark saturation plus jaggedness destroys the buyer's ability to predict deployment reliability from any purchasable signal, raising the real cost of adoption even as model prices fall. [near]
Ch. 2: benchmarks saturate in months, up to 42% of GSM8K questions are invalid, leaderboards 'may reflect adaptation to the platform,' and the Index 'assumes company-reported results are accurate.' Jaggedness means a high aggregate score is uninformative about the specific task a buyer cares about. The only reliable signal becomes bespoke, contamination-controlled, in-domain evaluation — which the buyer must build themselves. So the true cost of safe adoption shifts from license fees to private evaluation infrastructure, a cost invisible in the falling per-token price.
Jaggedness becomes a self-obscuring property as social-impact and safety reporting decline, so the field loses the ability to map the cliffs at the same rate it builds capability. [mid]
Ch. 2/3: developer reporting of bias and environmental impact is 'sparse and declining'; only one model reports >2 RAI benchmarks; FMTI fell 58→40. Jaggedness can only be characterized by adversarial, off-distribution probing (ClockBench, NOHARM, KaBLE) — exactly the evaluations labs do not run or publish. As disclosure falls, the map of where each model silently fails is built by a thin layer of academic benchmark authors, not the deployers, so cliffs are discovered in production.
The entry-level talent collapse removes the human experts whose tacit judgment is the only backstop for jagged failures, hollowing out the human-in-the-loop over time. [long]
Ch. 4: software employment for ages 22-25 fell ~20%; gains concentrate in 'structured, monitorable' work — the same work agents do at 66.3%. Junior roles are where humans accumulate the pattern-recognition needed to catch a model's silent omission. If the rungs that train senior reviewers are cut because the model is 'good enough' on structured tasks, the supply of humans competent to supervise jagged systems shrinks just as reliance on supervision grows. The fix (human oversight) and a primary labor effect (junior-rung erosion) are on a collision course.
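The point above about bespoke, in-domain evaluation has a statistical side: a pass rate measured on a small private test set is itself uncertain. As a hedged illustration, the sketch below computes a Wilson score interval for an observed pass rate. The Wilson interval is a standard statistical technique chosen here for illustration; the report does not prescribe it, and the function name is hypothetical.

```python
# Sketch only: Wilson score interval for a binomial pass rate.
# Assumption: test cases are independent and drawn from the deployment domain.
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Approximate 95% interval (for z = 1.96) around an observed pass rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    return (max(0.0, center - half_width), min(1.0, center + half_width))


if __name__ == "__main__":
    # Hypothetical in-domain test sets of increasing size, same 66% pass rate.
    for passed, total in [(66, 100), (660, 1000)]:
        low, high = wilson_interval(passed, total)
        print(f"{passed}/{total}: {low:.3f} to {high:.3f}")
```

With only 100 in-domain cases, an observed 66% pass rate carries an interval of roughly nine percentage points on either side, so a small validation set cannot distinguish between models that differ by a few points. That is part of why the evaluation burden described above is expensive to carry.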
Hidden ramifications / who pays
Patients, not clinicians, are the population most exposed to jagged AI medical reasoning — and they meet it with the least oversight of any party in the system.
Who bears it: Health-information seekers / patients, especially those without a clinician relationship.
Why hidden: T02's framing centers clinicians, lawyers, and finance leaders as the deployers who must keep humans in the loop. But Ch. 6 shows AI Overviews appear on 84–92% of health searches with 'minimal oversight,' front-running any clinician. The most jaggedness-exposed surface is the consumer search box, which has no human-in-the-loop, no FDA pathway, and no abstention requirement — precisely the safeguards the report demands for the regulated tools. The unregulated channel is invisible in a framing built around professional deployers.
Errors of omission make jaggedness systematically under-counted, so the report's own harm statistics likely understate the problem.
Who bears it: Anyone relying on incident databases or benchmark harm rates to size the risk; regulators calibrating thresholds.
Why hidden: 76.6% of severe clinical harms were errors of omission — the model failing to recommend something needed. Omissions leave no artifact: a missed diagnosis or an un-flagged contract clause produces no error message and often no traceable causal chain back to the AI. Incident databases (Ch. 3: 362 logged incidents) capture commissions, not omissions. The true jaggedness footprint is structurally invisible to the very measurement systems used to govern it.
Small specialized buyers and the public sector inherit a hidden evaluation tax that frontier labs externalize.
Who bears it: Hospitals, courts, small firms, government agencies, and academic labs that must validate jagged models before high-stakes use.
Why hidden: Because reported scores are 'guilty until validated' and labs don't disclose where their models break, the burden of contamination-controlled, in-domain evaluation falls on each deployer. Well-resourced enterprises can build this; safety-net hospitals, public defenders, and small clinics cannot, so they either over-trust the leaderboard or forgo the tool. Jaggedness therefore widens the access gap (Ch. 6's equity concern) — the under-resourced get the un-validated version.
Regulatory clearance is being used as a jaggedness alibi even though it never tested for jaggedness.
Who bears it: Procurement officers and the patients/clients downstream of 'FDA-cleared' or otherwise 'approved' AI.
Why hidden: Ch. 6: only 2.4% of FDA-authorized AI devices with clinical studies rest on RCT evidence; most clear via 510(k) substantial-equivalence on simulated data. A clearance stamp is read as proof of reliability, but substantial-equivalence explicitly does not probe the off-distribution cliffs that define jaggedness. The credential and the actual safety property are decoupled, and the credential is doing the persuasive work in procurement.
The federal deregulatory turn removes the disclosure lever that is the only scalable way to map where models silently fail.
Who bears it: Independent evaluators, state regulators, and ultimately end-users in deregulated jurisdictions.
Why hidden: T02 frames the fix as architectural (human-in-loop, abstention, third-party eval). But Ch. 8 shows the U.S. federal government rescinding EO 14110 and launching a DOJ task force to challenge the state laws (e.g., SB 53 developer safety disclosure) that would force the jaggedness-relevant disclosures. The architectural fix presumes a disclosure regime that is actively being dismantled, so the 'immediate' fix lacks the policy substrate it needs to scale beyond conscientious individual buyers.
Cascades / causal chains
IMO-gold headline scores → buyers and boards anchor on peak demonstrated capability → autonomous agents deployed in high-stakes domains on the strength of those scores → silent omission failures occur on unannounced edge cases (the 1-in-3, the missed clinical recommendation) → harm surfaces only after the fact with no error trace → liability lands on the retained human reviewer who could not realistically catch a low-frequency silent failure.
Benchmarks saturate in months and up to 42% of questions are invalid → a single high score becomes uninformative about any specific task → each deployer must build bespoke contamination-controlled in-domain evaluation → well-resourced enterprises validate while under-resourced clinics/courts/agencies cannot → the under-resourced deploy un-validated jagged tools or forgo them entirely → the access-and-safety gap widens for exactly the populations with least recourse.
Agents hit 66.3% on structured monitorable tasks → firms cut entry-level roles in those same structured tasks (devs 22-25 down ~20%) → the on-the-job rung that trains expert human judgment erodes → fewer humans accumulate the tacit pattern-recognition needed to catch silent model failures → the human-in-the-loop backstop that the report prescribes degrades in competence over a 5-10 year horizon just as reliance on it deepens.
Jaggedness reaches consumers via AI Overviews on 84-92% of health searches → patients form a first interpretation of symptoms from an un-vetted, un-abstaining model → this anchoring precedes and biases any later clinician encounter → clinicians spend visit time correcting AI-seeded misconceptions or, worse, patients never present → the harm occurs entirely outside the regulated, human-in-the-loop channel the report's safeguards govern.
Disclosure declines (FMTI 58 to 40, RAI benchmarks unreported, social-impact reporting sparse/declining) → the map of where each model silently fails is built only by a thin layer of external academic probes (ClockBench, NOHARM, KaBLE) → labs ship faster than independent jaggedness-mapping can keep up → cliffs are discovered in production by deployers rather than in evaluation by developers → each new model resets the unknown-failure surface and the discovery cycle restarts downstream of harm.
What to watch / leading indicators
Insurance and contract signals: emergence of AI-specific E&O / malpractice exclusions or carve-outs for 'autonomous AI decision' losses, and indemnification clauses in enterprise AI contracts shifting jagged-failure liability onto the buyer. This would confirm the liability-gap cascade ahead of any visible harm spike.
The ratio of narrow-constrained vs open-ended-autonomous deployments that actually ship and renew. If Ch. 6's pattern generalizes — constrained tools scale, open-ended agents stall in pilots — agent deployment stays in single digits across functions even as model capability rises, confirming jaggedness is the binding constraint, not capability.
Whether a standardized, contamination-controlled jaggedness/abstention benchmark (a clock-reading-style adversarial suite) gets adopted in procurement and whether any frontier lab voluntarily reports it. Continued absence (most RAI cells empty, only one model reporting >2 benchmarks) refutes the claim that third-party validation is 'immediate' and confirms it lacks supply-side cooperation.
Outcome of the federal-vs-state legal fight over disclosure (SB 53 and the December 2025 DOJ AI Litigation Task Force). If state safety-disclosure laws are struck or chilled, the disclosure substrate T02's fix depends on contracts, confirming the policy-collision cascade; if they survive and spread, jaggedness becomes checkable at scale.
Tensions & contradictions
Collides with T01 (capability convergence): T01 says compete on cost/reliability rather than leaderboard rank, but T02 shows reliability itself is not a single purchasable dimension — it is jagged and task-specific, so 'pick the more reliable model' is undecidable without per-task bespoke evaluation. Convergence at the top makes the choice between near-identical models turn on exactly the silent failure modes none of them disclose.
Collides with T07 (entry-level erosion): T02's central remedy is human oversight; T07 documents AI removing the junior roles that produce competent overseers. The report prescribes a fix in one theme that another theme shows the technology is quietly dismantling — a structural contradiction the first-order framing of either theme alone never surfaces.
Collides with T04 (accountability vacuum / falling transparency): T02 calls for 'independent, contamination-controlled, third-party evaluation,' but T04 shows transparency falling (FMTI 58->40, 81 of 102 models without training code) and only one model reporting more than two RAI benchmarks. The verification T02 depends on is being defunded of its raw material by the trend T04 documents.
Tension within the report's own evidence on human-AI teaming: Ch. 6 shows a multi-agent system scoring 85.5% vs ~20% for unaided physicians, implying AI should lead — yet Ch. 4's METR finding shows experienced developers were 19% slower with AI, and KaBLE (Ch. 3) shows models adopting a user's false belief as fact. 'Human-in-the-loop' is treated as an unambiguous safeguard, but the evidence shows the human and the model can each degrade the other depending on task structure.
Collides with T08 (policy bifurcation): T02 treats the architectural fix as 'immediate,' but Ch. 8 shows the federal government actively litigating against the state disclosure laws (SB 53) that would make jaggedness checkable at scale, and industry now supplying 37% of congressional witnesses. The fix is framed as available now while the policy substrate to enforce it is being contested in court.
⟂ Contrarian read
"The contrarian read is that 'jagged intelligence' is not a temporary immaturity to be engineered away but the permanent, structural signature of how these systems generalize — and the report's prescribed fix (human-in-the-loop + abstention) is therefore not a transitional patch but a confession that fully autonomous high-stakes AI is categorically the wrong product. The evidence cuts against the implicit 'the cliffs will smooth out as models improve' optimism: the same year a model won IMO gold, the clock-reading gap stayed ~40 points and clinical omission harms persisted, and the parts of medicine that delivered real outcomes (sepsis alerts, scribes) succeeded precisely by NOT being open-ended — they constrained the input space so there were no cliffs to fall off. Read this way, jaggedness is bullish for narrow, bounded, human-anchored tools and bearish for the autonomous-agent thesis driving most of the $345B private capex (Ch. 4) — the capital is funding the product category the reliability data says cannot be safely autonomous. The 'fix' isn't oversight bolted onto autonomy; it's abandoning open-ended autonomy in high-stakes domains as a design goal. I flag this as extrapolation beyond the report: the Index documents the cliffs and the constrained-tool success but does not itself conclude that jaggedness is permanent or that the autonomy thesis is mispriced."
T03
1 - Research & Development · 4 - Economy

One Foundry, One Failure Point — De-Risk the AI Hardware Supply Chain

A single Taiwanese foundry fabricates nearly every leading AI chip, making the entire global AI hardware stack a single geopolitical point of failure.

TSMC fabricates virtually every leading AI chip (Nvidia Blackwell, AMD MI300X), so the most strategically important industry on earth has one point of failure. The U.S. hosts 5,427 data centers, more than 10x any other country (Germany 529, UK 523, China 449), and global AI compute is scaling 3.3x per year, reaching 17.1 million H100-equivalents with Nvidia supplying over 60%. Compute is the binding input to AI progress, and it routes through a single foundry in Taiwan.

1 foundry
TSMC fabricates nearly every leading AI chip worldwide
5,427
U.S. AI data centers — >10x any other country
3.3x / year
Growth of global AI compute since 2022, to 17.1M H100-equivalents
>60%
Share of AI compute supplied by Nvidia
    What to do
  • Policymakers: accelerate diversified fab capacity and treat the single-foundry dependency as a national-security priority
  • Executives building compute-dependent products: map TSMC exposure and secure multi-vendor, multi-region chip allocation plus inventory buffers
  • National-security planners: qualify second-source foundries and stockpile critical chips as a continuity measure
Global AI compute capacity, 2022-25 (H100-equivalents)
[Chart: global AI compute in millions of H100-equivalents, plotted at Q1 ’22, Q4 ’22, Q4 ’23, Q4 ’24, and Q4 ’25, ending at 17.1M; ~340× in under four years]
A 340x increase in under four years — all of it routed through one Taiwanese foundry.
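A quick arithmetic check on how the two growth figures quoted in this theme relate to each other. This is a minimal sketch, assuming the chart window runs from Q1 '22 to Q4 '25 (about 3.75 years) and that both figures describe the same quantity over the same window; the function names are illustrative and not taken from the report.

```python
def cumulative_growth(annual_factor: float, years: float) -> float:
    """Total multiple reached by compounding `annual_factor` for `years`."""
    return annual_factor ** years


def implied_annual_factor(total_multiple: float, years: float) -> float:
    """Constant yearly multiple that compounds to `total_multiple`."""
    return total_multiple ** (1.0 / years)


YEARS = 3.75  # assumption: Q1 '22 to Q4 '25, read off the chart axis

# What 3.3x per year compounds to over the assumed window.
print(round(cumulative_growth(3.3, YEARS), 1))

# What yearly multiple a 340x total would require over the same window.
print(round(implied_annual_factor(340, YEARS), 2))
```

Under these assumptions, 3.3x per year compounds to roughly 88x, while a 340x total would need roughly 4.7x per year. The two figures therefore probably rest on different baselines or windows; the text here does not say which, so they are best read as separate statistics rather than as one derived from the other.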
“A single company, TSMC, fabricates almost every leading AI chip and makes the global AI hardware supply chain dependent on one foundry in Taiwan.”
AI Index 2026, Ch. 1
Beneath the headline: 2nd-order effects & hidden ramifications (6 second-order · 5 hidden · 5 cascades)
Second-order effects / what follows
The TSMC bottleneck rationally converts a Taiwan contingency from a regional security event into a global AI-progress kill switch, raising the strategic 'price' of deterrence and making the chokepoint itself a target rather than merely a vulnerability. (mid-term)
TSMC fabricates virtually every leading AI chip (Nvidia Blackwell, AMD MI300X) and Nvidia supplies >60% of the 17.1M H100e global compute base scaling 3.3x/yr. Because compute is the binding input to all frontier AI, an adversary who wants to freeze a rival's AI trajectory no longer needs to attack distributed assets — disabling or threatening one fab achieves it. This makes the foundry strategically valuable to coerce, not just to protect, which is the opposite of resilience.
Compute concentration silently re-centralizes an AI field the report elsewhere shows is decentralizing on every other axis (open-source projects 5.6M, research volume shifting to China, capability converging within single digits), making the hardware layer the last and most durable moat. (near-term)
T01/T06 show models, weights, and research are commoditizing — top-4 Arena models within 25 Elo, OLMo matching frontier with ~90x fewer params, US share of starred GitHub projects down to 31.7%. When algorithms and weights stop conferring advantage, whoever controls allocation of TSMC-fabricated Nvidia silicon holds the one scarce asset. Advantage migrates from the model layer (now flat) to the fab-and-allocation layer (still a near-monopoly).
Diversification efforts (CHIPS-style second-source fabs, qualifying alternative foundries) will paradoxically deepen the Nvidia/CUDA dependency they are meant to relieve, because fab diversity does not touch the design-and-software monopoly. (mid-term)
The report frames the risk as 'one foundry' but the >60% figure is Nvidia (a designer), and the lock-in is the CUDA software stack, not the fab. Building a US or Japanese fab still prints Nvidia/AMD designs running CUDA. Governments will declare the supply chain 'de-risked' once a second fab exists while the actual single point of failure — one design ecosystem and its software — remains untouched, producing a false sense of security.
The single-foundry framing accelerates compute-export controls and 'compute sovereignty' programs that fragment the global compute market into national pools, raising the effective cost of compute for everyone outside the US-China duopoly. (mid-term)
Ch.8 shows state-backed AI supercomputing is already wildly uneven (China 85 clusters, North America 41, South Asia 2, MENA 3). Treating TSMC dependence as a national-security priority justifies allocation rules, export licenses, and domestic-preference mandates. Each national carve-out shrinks the fungible global pool, so smaller economies pay a scarcity premium or are rationed out entirely — entrenching the very concentration T08 warns about.
Inventory-buffer and stockpiling advice, if widely followed, manufactures a bullwhip demand shock that worsens allocation inequality and could trigger a glut-then-shortage cycle in AI silicon. (near-term)
The report's own actionable advice tells executives and national planners to build inventory buffers and stockpile critical chips. If many large buyers hedge the single-foundry risk simultaneously, they double-order against constrained TSMC capacity, starving smaller buyers first (classic bullwhip). Stockpiled chips also depreciate fast given 3.3x/yr capability turnover, so the hedge destroys capital and can flip to a glut when buffers unwind.
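The double-ordering mechanism described above can be illustrated with a toy allocation model. This is a minimal sketch, assuming the supplier rations scarce output in proportion to order size; that rule, the `allocate` function, and every number below are hypothetical and are not a description of how TSMC or Nvidia actually allocate chips.

```python
def allocate(capacity: float, orders: dict[str, float]) -> dict[str, float]:
    """Proportional rationing: if orders exceed capacity, each buyer
    receives capacity * (own order / total orders)."""
    total = sum(orders.values())
    if total <= capacity:
        return dict(orders)
    return {name: capacity * qty / total for name, qty in orders.items()}


CAPACITY = 100.0                               # hypothetical units of supply
TRUE_DEMAND = {"large": 80.0, "small": 40.0}   # hypothetical real needs

# Case 1: everyone orders what they actually need.
honest = allocate(CAPACITY, TRUE_DEMAND)

# Case 2: the large buyer hedges by ordering double its real need.
hedged = allocate(
    CAPACITY,
    {"large": 2 * TRUE_DEMAND["large"], "small": TRUE_DEMAND["small"]},
)

for label, result in (("honest orders", honest), ("large buyer double-orders", hedged)):
    fill_rate = {k: round(v / TRUE_DEMAND[k], 2) for k, v in result.items()}
    print(label, fill_rate)
```

In this toy setup both buyers get about 83% of their real need when orders are honest. Once the large buyer doubles its order, it is filled completely while the small buyer drops to 50%, which is the rationing asymmetry the paragraph describes.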
The hardware chokepoint, not safety governance, becomes the de facto AI control point, making compute the most enforceable regulatory lever even as the report's transparency and safety apparatus (T04) collapses. (long-term)
T04 shows safety disclosure reversing (FMTI 58→40, 81/102 models without training code) and incidents rising to 362 — voluntary governance is failing. Meanwhile compute is physically traceable, fab-gated, and already export-controlled. Regulators chasing an ungovernable model layer will default to the one layer they can meter: the chips. Compute thresholds become the binding definition of 'frontier' for policy, which durably advantages incumbents who already hold allocation.
Hidden ramifications / who pays
The deepest single point of failure is upstream of TSMC and absent from the report's framing: ASML's EUV lithography monopoly and a handful of materials/gas suppliers. 'One foundry' is itself a derivative of an even narrower 'one lithography vendor,' so diversifying fabs without diversifying EUV moves the bottleneck rather than removing it.
Who bears it: National-security planners and executives who will spend heavily qualifying second-source foundries believing they have de-risked, when the true monopoly (EUV tools, photoresists, specialty gases) is untouched and even more concentrated.
Why hidden: The report centers TSMC because it is the visible name on the chip; the lithography and materials layer is one tier further up the stack, invisible to buyers, and rarely shows up in compute-capacity datasets that count H100-equivalents rather than the tools that make them.
Taiwan's own industrial-robot installations grew 33% in 2024 even as the US (-9%), Germany (-5%), and Italy (-16%) fell, meaning Taiwan is automating and deepening its semiconductor-ecosystem entrenchment, making the dependency harder to unwind over time, not easier.
Who bears it: Policymakers assuming diversification is a one-time capacity-build problem; the comparative-advantage gap (skilled fab labor, tacit process knowledge, dense supplier clusters) is widening, not narrowing.
Why hidden: The robot-install stat lives in the Economy chapter under an automation-concentration narrative, disconnected from the hardware-supply-chain theme — nobody cross-references Taiwan's automation lead with Taiwan's foundry lead, but they compound.
Energy and water are co-located single points of failure with the fab: 29.6 GW of AI data-center power (roughly New York State's worth) and city-scale inference water draw mean the compute that routes through Taiwan also routes through a few grid and watershed chokepoints. A fab restart is useless without power and water, which the report treats as a separate theme (T05).
Who bears it: Grid operators, water authorities, and the 1.2M-person-equivalent communities whose drinking water competes with inference cooling — parties outside the hardware-supply-chain conversation who bear correlated failure.
Why hidden: The report siloes hardware risk (T03) from resource risk (T05) into separate themes, so the correlated nature of fab + grid + watershed chokepoints — any one of which freezes deployed AI — is never assembled into a single resilience picture.
Capability convergence (US-China gap 2.7%, closed-open 3.3%) means the hardware chokepoint is now the only remaining lever the US holds over China's AI, which raises rather than lowers the incentive for China to seize or neutralize Taiwan's fabs, because every other source of US advantage has already eroded.
Who bears it: Taiwan itself, whose strategic 'silicon shield' may invert into a 'silicon magnet' as the chokepoint becomes the last contested prize; and the global economy that absorbs the disruption.
Why hidden: T03 frames TSMC dependence as a vulnerability to hedge, never as a coercive asset whose value to an adversary rises precisely as the rest of the capability race converges — the chapters showing convergence (Ch.2) and concentration (Ch.1) are never read against each other.
The data-centric small-model trend (OLMo near-frontier at ~90x fewer params) is the report's own latent answer to the chokepoint, but it cuts against incumbents: if competitive models need far less compute, the strategic value of monopolizing TSMC output falls — so compute-heavy incumbents have an interest in keeping the single-foundry framing dominant over the small-model one.
Who bears it: Smaller players, the public sector, and lower-income economies (Ch.8: 2-3 supercomputers) who could route around the chokepoint via data-centric methods but won't be advised to, because compute-heavy incumbents shape the policy conversation (Ch.8: industry now 37% of congressional witnesses).
Why hidden: T03 and T06 are presented as separate findings; nobody frames efficient small models as a supply-chain de-risking strategy, so the cheapest hedge against the single foundry hides in plain sight in another chapter.
Cascades / causal chains
Capability converges across models and nations (top-4 within 25 Elo, US-China gap 2.7%, open-closed 3.3%) → model/weight/algorithm advantage evaporates → the only remaining durable moat is allocation of TSMC-fabricated Nvidia silicon → controlling fab output and export licenses becomes the central instrument of AI geopolitics, raising the coercive value of the Taiwan chokepoint.
Report and policymakers brand TSMC the 'single point of failure' → governments fund second-source fabs (CHIPS-style) and declare the chain de-risked → but new fabs still print Nvidia/AMD designs on ASML EUV tools running CUDA → the real monopolies (lithography + design/software) are untouched → false resilience leaves the actual chokepoint in place while capital is spent signaling.
Buyers told to build inventory buffers and stockpile chips → many large buyers double-order simultaneously against constrained TSMC capacity (bullwhip) → smaller buyers and lower-income economies get rationed first → compute access inequality widens (Ch.8: regions with 2-3 supercomputers) → when buffers unwind against 3.3x/yr capability turnover, stockpiles depreciate and flip to a glut.
Voluntary safety/transparency governance fails (FMTI 58 to 40, 81/102 models closed, incidents 233 to 362) → regulators seek any enforceable lever → compute is the one physically meterable, fab-gated layer → compute thresholds become the legal definition of 'frontier' → incumbents holding allocation are grandfathered into advantage while the model layer they can't govern keeps shipping harm.
Taiwan deepens its lead by automating its own industry (+33% robot installs in 2024 vs US -9%) → tacit process knowledge and supplier-cluster density widen → the foundry dependency becomes structurally harder to relocate each year → 'diversify the supply chain' shifts from a capacity problem to an irreproducible-know-how problem that money alone cannot solve.
What to watch / leading indicators
Whether new fab capacity announcements (US/Japan/Germany) are matched by EUV-tool and advanced-packaging (CoWoS) capacity outside Taiwan — if fabs scale but ASML lithography and TSMC packaging stay concentrated, the chokepoint has only moved, confirming the 'false de-risking' read. Watch CoWoS allocation figures specifically.
The spread between top buyers' chip-stockpile/order growth and median buyers' allocation — a widening gap (large buyers double-ordering, smaller buyers and low-supercomputer regions rationed) confirms the bullwhip/inequality cascade. Leading indicator: Nvidia allocation disclosures and cloud-provider reserved-capacity backlogs.
Adoption of data-centric / small-model methods (OLMo-style) by national labs and lower-income economies as an explicit compute-hedge — rising open small-model deployment in the 2-3-supercomputer regions (Ch.8) would refute T03's premise that the foundry chokepoint is decisive; continued compute-scale framing in policy confirms incumbent capture.
Whether compute thresholds (FLOP/chip-count) become the operative legal definition of 'frontier' in US/EU rulemaking — if regulators key obligations to compute rather than model behavior, it confirms the chokepoint is becoming the de facto control point and incumbents are grandfathered in.
Tensions & contradictions
T03 (compute is the binding, monopolized input) collides head-on with T06 (OLMo matches frontier results with ~90x fewer parameters via data curation): if competitive AI no longer requires frontier-scale compute, the strategic stakes of the single foundry are overstated — the report simultaneously argues compute is the irreplaceable chokepoint and that you can route around compute with better data.
T03's 'US hosts 5,427 data centers, >10x any country' is framed as American strength, but Ch.8 shows China leads state-backed AI supercomputers 85 to North America's 41, and Ch.1 shows China at 74.24% of granted AI patents and leading research volume — so the report's hardware-dominance narrative for the US is contradicted by its own compute-sovereignty and research-leadership data favoring China.
T03 urges treating the chokepoint as a reason to secure and scale compute, while T05 shows that very scaling is an unsustainable resource bomb (29.6 GW, city-scale water) — de-risking by stockpiling capacity directly worsens the energy/water liability the report flags elsewhere.
T01 says capability has converged and the US-China gap has 'effectively closed,' which undercuts T03's implicit premise that the hardware chokepoint protects a meaningful US lead — there is little capability lead left to protect, so the chokepoint guards a margin that is already nearly gone.
T04 (transparency collapsing, safety ungovernable) sits in tension with the policy response T03 invites: because the model layer is opaque and incidents are rising, regulators will over-rely on the governable compute layer — meaning the single foundry becomes the de facto safety regulator, a role no one chose and that has nothing to do with actual model safety.
T08 shows industry now supplies 37% of congressional AI witnesses (vs government 10%) and private investment is ~14x public in a single year — so the same incumbents who benefit from compute scarcity dominate the policy venue that would decide how to 'de-risk' it, a conflict the report's actionable advice to policymakers does not flag.
⟂ Contrarian read
The single-foundry panic is largely a legacy of the brute-force scaling era that the report's own data shows is ending, and 'de-risking the supply chain' is the wrong fight. T06 demonstrates a ~90x-smaller model reaching near-frontier results through data curation alone, T01 shows capability converging to within single digits across models and nations, and DeepSeek-R1 already triggered a >$1T US tech-market-cap drop precisely by proving frontier results need far less compute than assumed. If competitive AI is becoming compute-light and data-heavy, then the marginal strategic value of monopolizing TSMC output is falling, not rising — meaning the real chokepoint is not the foundry at all but ASML's EUV lithography and the CUDA software ecosystem one tier up, neither of which a second fab touches. The contrarian conclusion: the most effective 'de-risking' is not building redundant fabs or stockpiling depreciating chips (which manufactures a bullwhip and entrenches incumbents), but investing in data-centric efficiency that makes the chokepoint matter less — and the reason this isn't the headline advice is that the compute-heavy incumbents who would be devalued by it now dominate the policy conversation (37% of congressional witnesses) and the report instinctively frames a falling-value monopoly as a rising-value vulnerability.
T04
1 - Research & Development · 3 - Responsible AI

The Accountability Vacuum — Transparency Is Falling as Stakes Rise

Frontier models are shipping without disclosure and almost never report safety benchmarks, just as documented AI incidents surge.

81 of 102 notable 2025 models shipped without training code (only 4 were open source), and the Foundation Model Transparency Index average fell from 58 to 40. Documented AI incidents rose to 362 in 2025 (up from 233 in 2024 and ~3.6x the pre-2022 baseline), yet only Claude Opus 4.5 reports more than two responsible-AI benchmarks. Compounding this, improving one RAI dimension can measurably degrade another (differential privacy cut accuracy by up to 33 points and raised missed Alzheimer's diagnoses by 21.4% in one study), and there is no shared framework to navigate the trade-off.

81 of 102
Notable 2025 models released without training code
58 → 40
Foundation Model Transparency Index average, 2024 → 2025
362
Documented AI incidents in 2025 (up from 233 in 2024)
−33 pts
Max accuracy loss from adding differential privacy
    What to do
  • Policymakers: mandate baseline disclosure (training compute, dataset provenance, energy/water) and a common, externally comparable RAI benchmark set plus standardized incident reporting
  • Regulators/researchers: fund independent evaluation capacity to replace what labs no longer disclose
  • Teams deploying safety-critical systems: explicitly document privacy/accuracy/fairness trade-offs and pair deployments with retrieval grounding, abstention, and human review
Reported AI incidents vs. falling transparency
[Chart: documented AI incidents rose from 233 (2024) to 362 (2025) while the Foundation Model Transparency Index average fell from 58 to 40]
Harm is accelerating while disclosure moves backward — an accountability vacuum at the frontier.
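For readers who want the two headline movements as relative changes, a small sketch follows. It uses only the figures quoted in this theme; the helper name `pct_change` is illustrative.

```python
def pct_change(old: float, new: float) -> float:
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100.0


incidents = pct_change(233, 362)   # documented AI incidents, 2024 to 2025
fmti = pct_change(58, 40)          # FMTI average, 2024 to 2025

print(f"incidents: {incidents:+.1f}%")
print(f"FMTI average: {fmti:+.1f}%")
```

On these inputs, incidents rose by about 55% year over year while the transparency average fell by about 31%.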
“Recent research shows that improving one responsible AI dimension can come at the cost of another, with gains in privacy reducing fairness or gains in safety reducing accuracy.”
AI Index 2026, Ch. 3
Beneath the headline: 2nd-order effects & hidden ramifications (6 second-order · 5 hidden · 5 cascades)
Second-order effects / what follows
Independent safety science gets starved of the inputs it needs, so external audit capacity atrophies just as the report calls for it to expand. (mid-term)
81 of 102 notable 2025 models shipped without training code and almost every model scored zero on pre-training-data transparency (Openness Index 2-16/100). Reproduction, contamination-checking, and provenance auditing all require training data, compute, and code that no longer exist publicly. The skills and tooling to audit frontier systems decay from disuse, so even a future disclosure mandate would land on a research community that has lost the muscle to act on it. The report's own caveat that company-reported results are simply 'assumed accurate' is the visible symptom.
Safety reporting bifurcates into a marketing axis rather than a comparability axis: labs that do disclose RAI benchmarks gain reputational credit, while the field-level picture stays unverifiable. (near-term)
Only Claude Opus 4.5 reports more than two RAI benchmarks and only GPT-5.2 reports StrongREJECT, so the table is mostly empty cells. With no shared, externally comparable set, any single lab's disclosure becomes a differentiated brand claim ('we are the transparent one') instead of an apples-to-apples number. That rewards selective disclosure of favorable metrics and gives no incentive to publish the ones a model fails, entrenching the gap rather than closing it.
The privacy/accuracy trade-off becomes a silent rationing mechanism that concentrates harm on the smallest, least-resourced institutions. (mid-term)
Differential privacy cut accuracy up to 33 points, and in the federated Alzheimer's study stronger privacy cut accuracy 14.8pp and raised missed diagnoses 21.4% specifically at small hospitals. Because there is no shared framework to navigate the trade-off, each deployer picks a privacy setting in isolation; well-resourced centers can buy back accuracy with more data and compute while small/rural sites cannot — so a 'responsible' privacy choice quietly redistributes diagnostic failure toward the weakest nodes, invisibly, with no metric flagging it.
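One textbook reason a uniform privacy setting can weigh more heavily on small sites is that the noise needed for a given privacy budget grows as the dataset shrinks. The sketch below uses the standard Laplace mechanism for a mean of values clipped to a known range; it is an illustration only, not the method used in the federated Alzheimer's study, and the site sizes and the ε value are hypothetical.

```python
def laplace_scale_for_mean(n: int, epsilon: float,
                           lower: float = 0.0, upper: float = 1.0) -> float:
    """Noise scale b for releasing the mean of n values clipped to
    [lower, upper] with the Laplace mechanism.

    Changing one record moves the mean by at most (upper - lower) / n,
    so b = sensitivity / epsilon. For Laplace noise, the expected
    absolute error equals b.
    """
    sensitivity = (upper - lower) / n
    return sensitivity / epsilon


EPSILON = 0.5  # hypothetical privacy budget, same for every site

for site, n in (("small clinic", 200), ("large center", 20_000)):
    b = laplace_scale_for_mean(n, EPSILON)
    print(f"{site}: n={n}, expected absolute noise on the mean = {b:.4f}")
```

With the same ε, the site with 100x fewer records carries 100x more expected noise on this statistic. A larger institution can absorb the privacy setting, while a smaller one sees its signal degraded, which is consistent with the distributional pattern described above.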
Liability and standard-of-care drift toward whatever is checkable, pushing deployers to govern on capability benchmarks (which exist) and ignore the safety dimensions (which don't). (mid-term)
Capability benchmarks are reported almost universally; RAI benchmarks almost never. When a harm reaches a court, regulator, or procurement review, the only documented, comparable evidence is capability scores. Rational deployers therefore document and defend on the axis that has paper, normalizing a definition of 'reasonable care' that structurally excludes fairness, provenance, jailbreak-resistance, and the belief/fact failure mode (on KaBLE, GPT-4o falls from 98.2% to 64.4% and DeepSeek R1 falls to 14.4%) precisely because those have no shared scoreboard.
The accountability vacuum migrates fastest into the least-regulated, highest-volume surface (consumer health and information retrieval) rather than the formally regulated device pipeline. (near-term)
Medical-device authorizations carry RCT and FDA oversight (though only 2.4% are RCT-backed), but AI Overviews now appear on 84-92% of health queries with none of that oversight. As frontier labs disclose less, the consequential failures (belief-adopting answers, 22-94% hallucination spread, errors of omission) land on the ungoverned consumer surface where no incident-reporting or benchmark obligation applies — so the harm grows exactly where the audit infrastructure is thinnest.
Incident statistics themselves become a contested governance battleground, weakening the empirical case for intervention at the moment it is most needed. (mid-term)
The human-curated AIID count (362) and the automated OECD AIM count (435/month peak) diverge by method, and AIID skews to English-language, high-visibility events. With transparency falling, the denominator (how many deployments, how many near-misses) is unknown, so any incident rate is unfalsifiable. Deregulation advocates can dismiss the count as artifact while safety advocates cite it as crisis, and neither can be settled — paralyzing the disclosure mandate the report proposes.
Hidden ramifications / who pays
The people who bear the privacy/accuracy trade-off are patients at small and rural hospitals, not the model developers or the privacy regulators who mandate the protective setting — a 21.4% rise in missed Alzheimer's diagnoses is borne by individuals who never see the configuration decision.
Who bears it: Patients at small/under-resourced health institutions; by extension any user of a system tuned conservatively on one RAI axis
Why hidden: The report frames the trade-off as an abstract 'no shared framework' problem. It is under-discussed that the trade-off is not symmetric across institutions — accuracy can be bought back with data/compute, which only well-resourced deployers have, so the harm is distributional, not just technical.
Falling transparency degrades the AI Index's own future ability to measure the field, meaning the public's primary instrument for tracking AI is being hollowed out from the inside.
Who bears it: Regulators, journalists, and the public who rely on third-party measurement (including this very report) to form policy
Why hidden: The report documents opacity as a finding but does not foreground that its own methodology now rests on inferred compute, 'assumed accurate' self-reports, and proxies. The measurement apparatus reporting the vacuum is itself a casualty of the vacuum — a reflexive failure that headlines under-state.
China's lead in RAI research volume (812 vs 394 US papers, a reversal from 2024) means the intellectual agenda for what 'responsible AI' even means may shift to a jurisdiction the report's main organizational survey (McKinsey) explicitly excludes.
Who bears it: Western standards bodies and policymakers assuming they set the RAI agenda; Global South adopters who will inherit whichever framework dominates
Why hidden: The transparency story is told as a US/frontier-lab problem, but the safety-research center of gravity is moving while the West deregulates — a soft-power inversion buried in a single striking-stat line and a survey-scope caveat.
Open-weight models score near-zero on data/compute transparency too, so the 'open vs closed' debate masks that almost no model — open or closed — is actually auditable at the layer that matters for safety and provenance.
Who bears it: Enterprises and public-sector bodies adopting open models specifically to gain auditability and avoid lock-in
Why hidden: Openness Index shows leading models at 2-16/100 and only K2 Think and Olmo 3 32B Think non-zero on pre-training data. 'Open weights do not equal open science' is stated but its consequence — that the open-model safety promise is largely illusory at the provenance layer — is not centered.
The 'excellent' incident-response self-rating fell from 28% to 18% while incidents rose, so organizations are quietly becoming less able to manage harms even as they adopt more RAI policies (no-policy share 24% → 11%): policy adoption is outrunning operational competence.
Who bears it: End users of organizations that have a governance policy on paper but declining real capacity; boards relying on policy-existence as assurance
Why hidden: The report presents policy adoption as progress. The simultaneous collapse in confidence (RAI maturity stuck at 2.3/4) reveals policy-as-checkbox; the gap between having a policy and being able to act on an incident is the load-bearing risk, and it is under-emphasized against the encouraging adoption numbers.
Cascades / causal chains
Frontier labs stop disclosing training code/data (81/102 models, FMTI 58 → 40) → external researchers cannot reproduce or contamination-check results → the only documented evidence becomes self-reported capability scores 'assumed accurate' → liability, procurement, and standard-of-care anchor on capability axes that have papers, structurally excluding safety/fairness from the definition of due care.
No shared RAI benchmark set exists → the one or two labs that disclose RAI metrics gain a transparency brand rather than producing comparability → selective disclosure of favorable metrics is rewarded and unfavorable ones stay hidden → field-level safety remains unverifiable, so the report's disclosure-mandate lever has no comparable baseline to enforce against.
Differential privacy mandated in isolation → accuracy drops up to 33 points and missed diagnoses rise 21.4% → well-resourced hospitals buy accuracy back with more data/compute while small hospitals cannot → a 'responsible' privacy setting silently redistributes diagnostic failure toward the least-resourced patients, with no metric flagging the shift.
US federal government deregulates and rescinds EO 14110 while industry's share of congressional witnesses rises from 13% to 37% → voluntary transparency (already reversing) loses its last policy backstop → states fill the gap with 150 bills, creating a litigated patchwork → deployers face divergent, contested rules and public trust in US regulation sits at 31% (lowest surveyed), compounding the vacuum.
Opacity grows at the frontier → the most consequential harms migrate to the ungoverned consumer surface (AI Overviews on 84-92% of health queries) → belief-adopting and hallucinating answers (22-94% range) reach patients with none of the device pipeline's oversight → harm concentrates exactly where incident-reporting and benchmark obligations do not reach.
What to watch / leading indicators
Whether any shared, externally comparable RAI benchmark set (BBQ, StrongREJECT, SimpleQA, jailbreak resistance) goes from near-empty disclosure (today only Claude Opus 4.5 reports more than two of these) to multi-lab adoption: the single cleanest signal that the vacuum is closing rather than widening.
The trajectory of the Foundation Model Transparency Index average (58 → 40): a third consecutive direction change, and specifically whether the Upstream/data-and-compute sub-score moves off near-zero, since that is the layer auditing actually requires.
Convergence vs. persistent divergence between the human-curated AIID count (362) and automated OECD AIM count (435/mo peak) — and whether any deployment-denominator (incidents per million queries) emerges, which would make incident rates falsifiable instead of rhetorical.
Outcome of the US federal-vs-state collision (DOJ AI Litigation Task Force vs California SB 53 large-developer disclosure/whistleblower rules) — whether mandated disclosure survives in at least the largest state markets, which would set a de facto national floor despite federal deregulation.
Tensions & contradictions
Directly collides with T01 (capability convergence): T01 says buyers should now compete on 'cost, reliability, and real-world usefulness' — but T04 shows reliability and safety are the exact dimensions that are NOT disclosed or comparable. The market is told to choose on an axis the field has made unmeasurable.
Collides with T06 (Bigger Isn't Better / open data-centric models): T06 celebrates open models like OLMo as the democratizing answer, yet Ch.3's Openness Index shows almost all models — including most 'open' ones — score zero on pre-training-data transparency. Open weights are sold as the auditability win precisely where auditability is absent.
Collides with Ch.9 public opinion: the public uniformly wants MORE regulation (41% 'not far enough' vs 27% 'too far', across all 50 states) and trusts US government least (31%), while Ch.8 shows the US federal government deregulating and industry voices (37% of witnesses) dominating policy — a democratic-mandate-vs-policy-direction contradiction the accountability vacuum sits inside.
Collides with Ch.6 medicine's success cases: constrained clinical tools (TREWS 18.7% sepsis-mortality reduction, COMPOSER ~50 lives/yr) show RAI CAN be done well with narrow scope and human oversight — contradicting the framing that the vacuum is unavoidable. It is a governance/disclosure choice, not a technical inevitability.
Internal tension flagged by the report itself: the trade-off studies showing one RAI dimension degrades another are 'recent and focus on specific tasks rather than general-purpose AI systems' — so the alarming privacy/accuracy result is suggestive, not established for frontier models, even as it anchors the theme's most dramatic stat.
⟂ Contrarian read
The report frames falling transparency as a regression caused by labs hiding things — but a defensible non-consensus read is that disclosure isn't falling, it's being repriced by competition. As capability converged to single-digit gaps (T01), training data, curation recipes, and compute configuration became the last durable moats, so non-disclosure is rational competitive behavior, not negligence. The corollary is uncomfortable for the report's policy prescription: a pure disclosure mandate would force labs to surrender their only remaining differentiator and would be fought hardest by exactly the most capable labs, so it is the least likely lever to actually land. The mechanism that could close the vacuum is therefore not voluntary or mandated openness but liability — making undisclosed systems legally riskier to deploy than disclosed ones — which reprices opacity from a moat into a cost. On this read, the binding constraint is the US deregulatory turn and the 31% trust trough, not the labs' opacity per se; opacity is the symptom of a missing liability regime, and treating it as the disease (mandate transparency) misdiagnoses the cure.
T05
1 - Research & Development

Inference, Not Training, Is the Hidden Resource Bomb

AI's largest and least-visible cost is serving models at scale — measured in gigawatts of power and cities' worth of water.

AI data center power capacity reached 29.6 GW by Q4 2025 — comparable to all of New York state at peak demand (~31 GW). Annual GPT-4o inference water use alone may reach 1.3-1.6 million kiloliters, exceeding the drinking water needs of 1.2 million people. Per-query figures look modest (GPT-4o ~0.42 Wh) and only become enormous at hundreds of millions of daily queries — making inference, not training, the dominant deployed-AI resource cost. (Training is no small thing either: Grok 4 emitted an estimated 72,816 tons of CO2e.)
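The scaling argument is simple multiplication, and a short sketch shows how a sub-watt-hour query becomes a grid-scale load. The per-query figure, water range, and population come from the paragraph above. The daily query volume is a hypothetical assumption (the text says only "hundreds of millions"), and the function names are illustrative.

```python
# Sketch: scaling per-query energy to an annual total, and checking the
# water-per-person figure. QUERIES_PER_DAY is a HYPOTHETICAL assumption.

WH_PER_QUERY = 0.42            # GPT-4o per-query estimate from the text
QUERIES_PER_DAY = 500_000_000  # assumed; the text says "hundreds of millions"

WATER_KL_LOW, WATER_KL_HIGH = 1_300_000, 1_600_000  # annual kL, from the text
PEOPLE = 1_200_000                                   # from the text


def annual_energy_gwh(wh_per_query: float, queries_per_day: float) -> float:
    """Annual energy in GWh (1 GWh = 1e9 Wh), assuming 365 days."""
    return wh_per_query * queries_per_day * 365 / 1e9


def liters_per_person_per_day(kiloliters: float, people: int) -> float:
    """Convert an annual kL total into liters per person per day."""
    return kiloliters * 1000 / people / 365


energy = annual_energy_gwh(WH_PER_QUERY, QUERIES_PER_DAY)
water_low = liters_per_person_per_day(WATER_KL_LOW, PEOPLE)
water_high = liters_per_person_per_day(WATER_KL_HIGH, PEOPLE)

print(f"Annual inference energy: {energy:.2f} GWh")
print(f"Water per person per day: {water_low:.2f} to {water_high:.2f} L")
```

Under these assumptions the water range works out to roughly 3 to 3.7 liters per person per day across 1.2 million people, which is the scale of individual drinking water and consistent with the comparison in the text.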

29.6 GW
AI data center power capacity (≈ New York state peak demand)
1.2M people
Whose annual drinking water GPT-4o inference may exceed
72,816 tons
Estimated CO2e from training Grok 4 in 2025
    What to do
  • Executives: plan for inference as the primary cost center and ESG/PR liability; optimize serving efficiency and data-center siting
  • Policymakers: require energy and water reporting for AI data centers, since per-query numbers hide city-scale aggregate draw
  • Sustainability leaders: factor inference (not just training) into AI carbon and water accounting
AI training emissions, 2012-25 (tons CO2e)
AlexNet ’12: 0.01 · GPT-3 ’20: 588 · GPT-4 ’23: 5,184 · Llama 3.1 405B ’24: 8,930 · Grok 3 ’25: 59,200 · Grok 4 ’25: 72,816
Training emissions · tons CO₂e · log scale · ~7-million-fold since 2012
Frontier training emissions grew roughly 7 million-fold since AlexNet — and inference at scale now rivals it as a resource cost.
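The "roughly 7 million-fold" figure can be checked directly from the chart values. The sketch below is a minimal verification, assuming the six data points listed in the chart are the ones the ratio is computed from; the dictionary layout is illustrative.

```python
# Sketch: fold-growth in estimated training emissions, using the chart values.
import math

# Tons CO2e per training run, as listed in the chart above.
emissions = {
    "AlexNet (2012)": 0.01,
    "GPT-3 (2020)": 588,
    "GPT-4 (2023)": 5_184,
    "Llama 3.1 405B (2024)": 8_930,
    "Grok 3 (2025)": 59_200,
    "Grok 4 (2025)": 72_816,
}

baseline = emissions["AlexNet (2012)"]
fold = {name: tons / baseline for name, tons in emissions.items()}

for name, tons in emissions.items():
    # log10 spans about seven orders of magnitude, hence the log-scale chart
    print(f"{name:24s} {tons:>10,.2f} t  x{fold[name]:>12,.0f}  "
          f"log10={math.log10(tons):5.2f}")
```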
“Inference at scale, not training, is becoming the dominant and least-visible resource cost of deployed AI.”
AI Index 2026, Ch. 1
Beneath the headline 2nd-order effects & hidden ramifications 6 second-order · 5 hidden · 5 cascades
Second-order effects / what follows
The unit economics of free consumer AI invert: the per-query resource cost that looks trivial (GPT-4o ~0.42 Wh) becomes the dominant marginal cost of serving the very users who generate ~$172B of consumer surplus but almost no revenue, so the providers carrying the physical bill are structurally the ones least able to recoup it. [near term]
Inference cost scales linearly with query volume (hundreds of millions/day), while revenue from mostly-free tools does not; the report itself notes innovators capture only ~3% of social returns. As capability converges (top-4 models within <25 Elo, T01), differentiation moves to 'cost, reliability, usefulness' — meaning the firm that serves a query most cheaply wins, putting energy/water efficiency at the center of competitive strategy, not the periphery.
Data-center siting, not model quality, becomes the binding constraint on AI growth, and it relocates to wherever power and water are cheapest and least regulated, exporting AI's resource burden onto grids and watersheds chosen for permissiveness rather than suitability. [mid term]
29.6 GW of AI power (≈ NY state peak) growing with compute at 3.3x/yr cannot be added to constrained grids quickly; operators chase interconnection queues, cheap power, and lax water permitting. The Dec 2025 federal EO explicitly carves out 'data centers' as a domain states may still regulate, signaling siting is becoming the live political fight even as model rules are federally pre-empted.
Inference efficiency gains get fully consumed by demand (Jevons paradox), so the resource bomb keeps growing even as each query gets cheaper, defeating the report's own 'optimize serving efficiency' remedy unless paired with absolute caps. [mid term]
Every efficiency win lowers per-query cost, which lowers price/expands free tiers, which raises query volume; with adoption still climbing (US usage 48%→56%, only 24th globally with room to grow) and agentic workflows (which issue many model calls per task) just beginning, total draw rises faster than per-unit efficiency falls.
A new bottleneck emerges that benchmarks and leaderboards cannot see: tokens-per-watt and tokens-per-liter become the real frontier metric, and the firms optimizing them (custom silicon, smaller data-centric models per T06) quietly out-compete leaderboard leaders. [mid term]
Because capability has converged, the surviving axis of advantage is cost-to-serve, which is dominated by inference energy. A ~90x-smaller model (OLMo 3) matching frontier results means the same query can be served at a fraction of the watts — turning T06's 'small models win' into an environmental and margin argument, not just a capex one.
Water, not carbon, becomes the flashpoint, because it is local, visible, and zero-sum in a way diffuse CO2e is not, exposing operators to community-level backlash that aggregate emissions never triggered. [near term]
1.3–1.6M kL/yr of inference water exceeds the drinking supply of 1.2M people and is drawn from specific local aquifers/reservoirs; droughts make it rivalrous against residents and agriculture. Unlike CO2e (global, deferred), water shortage produces an identifiable local victim and a permitting veto point, converting an ESG line-item into a siting blocker.
The reported training-emissions curve (Grok 4 72,816 tons, ~7M-fold over AlexNet) becomes a deliberate distraction: as inference dominates, headline training numbers let operators report a shrinking, one-time figure while the perpetual, larger inference draw stays unmeasured and unreported. [long term]
Training emissions are a discrete, attributable event that makes a clean chart; inference is continuous, utilization-dependent, and (per Ch.1 tensions) carries 'real uncertainty.' Reporting incentives favor the legible number, so the visible metric and the dominant cost diverge — exactly the proxy-vs-reality gap the field is prone to.
Hidden ramifications / who pays
Residential and commercial ratepayers — not AI firms — absorb a large share of the cost through higher electricity prices and grid-upgrade surcharges, because data centers connect to shared grids whose capacity expansions are socialized across all customers.
Who bears it: Households and small businesses on grids near major data-center clusters (the US hosts 5,427 — 10x any other country), especially in already-constrained regions.
Why hidden: The report frames the cost as the operator's 'ESG/PR liability' and a per-query metric, centering the firm. It never models who pays for the 29.6 GW of new capacity; grid economics quietly transfer that cost to captive ratepayers who never queried a model.
Water-stressed and lower-income regions that win data-center investment trade scarce local water and power for jobs that are largely non-local and few — a resource-for-capital exchange that deepens, rather than diffuses, AI's concentration.
Who bears it: Communities in drought-prone or cheap-power regions courted for siting; collides directly with T08's finding that AI value concentrates rather than diffuses.
Why hidden: The report treats inference cost as a corporate liability and the diffusion problem as a separate (T08) story. The link — that the physical burden diffuses to vulnerable communities while the economic value concentrates elsewhere — sits between two chapters and is never drawn.
The aggregate inference draw is structurally unmeasurable today because the same labs that stopped disclosing training code, parameters, and compute (81 of 102 models; FMTI 58→40) also do not report inference energy/water — so regulators are asked to govern the dominant cost with no data, and the de Vries-Gao estimates carry wide error bars by necessity.
Who bears it: Regulators, independent researchers, and the public — anyone needing to verify or cap the resource draw.
Why hidden: T05 is filed as an environment/R&D story; the transparency collapse is filed as T04. The report under-states that the resource bomb is invisible for the same reason safety claims are unverifiable — disclosure has reversed precisely as stakes rose. The two are one problem.
Future grid and water infrastructure gets built (or pre-empted) around projected AI demand that may not materialize, creating stranded-asset and over-procurement risk if the 'agent everywhere' assumption proves premature.
Who bears it: Utilities, public infrastructure planners, and bondholders financing 3.3x/yr capacity buildouts; ultimately taxpayers.
Why hidden: The report elsewhere flags that agent deployment is still 'single digits' across functions and that productivity gains are early/mixed — yet inference-demand projections implicitly assume the autonomous-agent future arrives. Nobody reconciles the bullish compute curve with the bearish deployment reality.
Inference's resource floor hands an unacknowledged structural advantage to state-backed actors with cheap, subsidized, or coal-heavy power, decoupling 'who can serve AI cheaply' from 'who builds the best model' — a sovereignty dimension the capability-convergence story misses.
Who bears it: Nations and firms competing on AI deployment economics, particularly where energy is state-subsidized (relevant to China's lead in supercomputers at 85 and robots at 295k/yr).
Why hidden: The report's competition framing (T01) is about model parity and the US-China model gap. It under-discusses that once capability converges, the durable advantage is cheap energy at inference scale — a contest the US, with constrained grids and 5,427 data centers already, may not win.
Cascades / causal chains
Capability converges (top-4 within <25 Elo) → raw model quality stops differentiating → competition shifts to cost-to-serve → inference energy/water becomes the primary competitive axis → firms optimizing tokens-per-watt (smaller, data-centric models per T06) out-compete leaderboard leaders.
Per-query cost looks trivial (0.42 Wh) → operators and regulators discount it → only training emissions get reported (Grok 4 72,816 tons makes the chart) → the dominant, perpetual inference draw stays unmeasured → aggregate resource bomb grows invisibly until a local water/grid crisis forces it into view.
Inference water hits 1.3-1.6M kL/yr from specific local aquifers → a drought makes it rivalrous with residents → an identifiable local victim and a permitting veto point appear → community backlash blocks siting → data-center growth (and thus AI deployment) is constrained by water politics, not model capability.
Data centers connect to shared grids → 29.6 GW of new AI load requires socialized capacity upgrades → costs spread to all ratepayers → households near clusters pay more for power they never used to serve queries that earn the operator near-zero revenue (mostly-free tools, ~3% of social returns captured).
Transparency reverses (FMTI 58 → 40, 81/102 models no training code) → inference energy/water also goes unreported → regulators lack data to cap the dominant cost → the December 2025 EO leaves 'data centers' to states → a fragmented, litigation-driven patchwork governs the resource bomb instead of a coherent federal disclosure regime.
What to watch / leading indicators
Whether any major lab begins publishing standardized inference energy/water per query AND aggregate annual draw (not just one-time training emissions) — its absence confirms the legible-metric-hides-dominant-cost dynamic; its arrival would falsify the opacity thesis.
Local siting outcomes: data-center permits denied, paused, or moratoria imposed over water/grid concerns, and ratepayer-advocate filings on socialized grid-upgrade costs — the leading edge of resource cost becoming a hard constraint rather than a PR line.
State action under the December 2025 EO's 'data centers' carve-out: do states pass binding data-center energy/water disclosure or efficiency mandates while model regulation stays federally pre-empted? This would confirm siting (not capability) as the live regulatory battleground.
Divergence of total AI electricity/water draw from per-query efficiency: if reported per-query Wh falls while aggregate GW/water rises (tracking the 3.3x/yr compute curve), the Jevons-paradox prediction holds and 'optimize serving efficiency' is confirmed insufficient without absolute caps.
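The divergence in the last indicator can be illustrated with a toy projection. The sketch below is not a forecast: the efficiency and growth rates are hypothetical values chosen only to show the mechanism, the daily query volume is assumed, and the function name is illustrative. Only the 0.42 Wh starting point comes from the text.

```python
# Sketch: Jevons-style divergence between per-query and aggregate energy.
# Both annual rates below are HYPOTHETICAL, chosen only to show the mechanism.

def project_total_energy(
    wh_per_query: float,
    queries_per_day: float,
    efficiency_gain: float,  # fractional drop in Wh/query each year
    volume_growth: float,    # fractional rise in queries/day each year
    years: int,
) -> list[tuple[int, float, float]]:
    """Return (year, Wh per query, total MWh per day) for each year."""
    rows = []
    for year in range(years + 1):
        total_mwh = wh_per_query * queries_per_day / 1e6
        rows.append((year, wh_per_query, total_mwh))
        wh_per_query *= 1 - efficiency_gain
        queries_per_day *= 1 + volume_growth
    return rows


rows = project_total_energy(
    wh_per_query=0.42,      # per-query figure from the text
    queries_per_day=500e6,  # assumed volume
    efficiency_gain=0.30,   # assumed: 30% less energy per query per year
    volume_growth=1.00,     # assumed: query volume doubles each year
    years=3,
)
for year, per_query, total in rows:
    print(f"year {year}: {per_query:.3f} Wh/query, {total:,.1f} MWh/day")
```

Aggregate draw rises whenever (1 - efficiency_gain) × (1 + volume_growth) exceeds 1, so per-query efficiency alone cannot cap the total unless it outpaces demand growth.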
Tensions & contradictions
Against T01 (Capability Has Converged): the report says competition shifts to 'cost, reliability, usefulness' but treats inference cost as an environmental footnote. If cost is the new battleground and inference is the dominant cost, then T05's resource bomb IS the competitive story T01 describes — they are the same finding viewed from two chapters, and the report never connects them.
Against T06 (Bigger Isn't Better): T05 dramatizes frontier training emissions (Grok 4 72,816 tons), implying scale is the resource villain — but T06 shows ~90x-smaller models matching frontier results. The real lever is serving small efficient models at inference, which would slash both the bomb and the bill; the two themes point at the same fix from opposite ends without acknowledging it.
Against T08 (Value Diffusing, Capital Concentrating): T05 frames inference cost as an operator liability, while T08 shows value concentrates. The hidden synthesis — that the resource BURDEN diffuses onto ratepayers, watersheds, and host communities while the captured VALUE concentrates in a few firms/states — is a regressive transfer neither theme states.
Against T04 (Accountability Vacuum): T05 calls for energy/water reporting as if it were a routine ask, but T04 documents that disclosure is actively reversing (FMTI 58 → 40) on exactly these upstream dimensions. The report under-states that its own remedy for T05 is being undermined by the trend it documents in T04.
Against the deployment reality (Ch.4): inference-demand and compute projections (3.3x/yr) implicitly assume pervasive agentic AI, yet the report says agent deployment is 'single digits' across functions and macro productivity is a negligible +0.01pp. The resource bomb is sized to a future that the labor/productivity data says hasn't arrived.
⟂ Contrarian read
The 'hidden resource bomb' may be a self-defusing problem that the report over-dramatizes by anchoring on frontier training emissions. The same convergence and data-centric trends the Index documents create overwhelming commercial pressure to make inference cheaper: with capability flat across the top tier and revenue near-zero on free tools, every operator's survival depends on cutting cost-to-serve, and a ~90x-smaller model (OLMo 3) matching frontier results means the dominant fix — serving small, efficient, often open models — is already economically forced, not a regulatory ask. On this read, inference watts-per-token will fall faster than the headlines assume, and the real durable bomb is not environmental but distributional: the socialized grid/water costs borne by ratepayers and host communities, which no efficiency gain removes because they are a function of WHERE the load lands, not how efficient each query is. The report points its alarm at gigawatts and kiloliters when it should point at who pays for them.
T06
1 - Research & Development · 5 - Science · 6 - Medicine

Bigger Isn't Better — Win on Data, Not Capex

Small, data-centric models now match or beat models up to 200x larger, collapsing the capital barrier to competitive AI.

OLMo 3 (~32B params, ~90x fewer than Grok 4's 3 trillion) matched frontier AIME results through deduplication, pruning, and curation alone. In science, a 111M-parameter MSAPairformer beat leading methods on ProteinGym and a 200M-parameter GPN-Star beat a 40B model on variant-effect tasks. Most scientific AI models come from academic and government collaborations, not industry — and they win in-domain against general-purpose frontier models.

~90x fewer params
OLMo 3 vs Grok 4, yet comparable AIME 2025 results
111M params
MSAPairformer beat leading methods on ProteinGym
200M vs 40B
GPN-Star beat the far larger Evo 2 on variant-effect tasks
    What to do
  • Executives & research leaders: invest in data deduplication, pruning, and curation to build competitive models without frontier-scale capex
  • Science & medicine teams: favor domain-specific open models from academic/government collaborations over general-purpose frontier models for in-domain work
  • Funders: back data-centric and small-model research as a higher-leverage bet than raw parameter scaling
Small specialized models beating giants (ProteinGym & genomics)
ProteinGym · Spearman ↑ (higher = better): MSA-Pairformer · 111M: 0.45 vs. ESM-2 · 3B: 0.41
Genomics · AUPRC ↑: GPN-Star · 200M: 0.75 vs. Evo 2 · 40B: 0.53
Small specialized models vs. far larger general models
Parameter count no longer predicts performance; curation and method dominate.
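A quick calculation puts the size ratios next to the score gaps for the two pairs in the chart. The sketch below uses only the parameter counts and scores listed above; the `Model` dataclass and variable names are illustrative, not anything from the report.

```python
# Sketch: size ratio versus score gap for the two pairs in the chart above.
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    params: float  # parameter count
    score: float   # benchmark score (higher is better)


pairs = {
    "ProteinGym (Spearman)": (
        Model("MSA-Pairformer", 111e6, 0.45),
        Model("ESM-2", 3e9, 0.41),
    ),
    "Genomics (AUPRC)": (
        Model("GPN-Star", 200e6, 0.75),
        Model("Evo 2", 40e9, 0.53),
    ),
}

summary = {}
for task, (small, large) in pairs.items():
    size_ratio = large.params / small.params
    score_gap = small.score - large.score
    summary[task] = (size_ratio, score_gap)
    print(f"{task}: {small.name} is {size_ratio:.0f}x smaller than "
          f"{large.name}, score gap {score_gap:+.2f}")
```

In both pairs the smaller model leads despite a size disadvantage of roughly 27x and 200x, which is the pattern the caption summarizes.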
“Like other areas of AI, biological model development is increasingly bottlenecked on data rather than architecture.”
AI Index 2026, Ch. 6
Beneath the headline 2nd-order effects & hidden ramifications 6 second-order · 5 hidden · 4 cascades
Second-order effects / what follows
The binding constraint on competitive AI doesn't disappear; it migrates from compute capex to proprietary, curated, high-quality data, which is becoming the new scarce, hoardable asset. [mid term]
If data curation (dedup/prune/curate) is what lets OLMo 3 (~90x fewer params) approach frontier AIME and lets 111M/200M bio models beat 40B ones, then whoever controls the cleanest, rarest, most domain-specific corpora wins. Ch.1 already shows the market pricing this in: >50% of new web content is AI-generated since Jan 2025 (model-collapse risk), triggering proprietary licensing deals (NYT-Amazon, May 2025). The barrier doesn't fall; it changes shape from a capital barrier to a data-access barrier.
Incumbents with capital re-capture the supposed advantage by simply buying the data moat instead of (or in addition to) the compute moat, so the predicted democratization partially reverses. [mid term]
Data licensing deals favor the deep-pocketed. As clean public web data depletes (estimated 2026-2032) and synthetic data 'still cannot offset real-data depletion' for SOTA, exclusive licensed corpora become the scarce input, and only well-funded labs can outbid for them. Curation also requires skilled labor and tooling; it is no free lunch. The capital barrier sublimates rather than collapses.
Small in-domain winners create a false sense of deployable readiness: parameter-efficiency gains land on isolated-task benchmarks, not on the end-to-end reliability that actually gates real use. [near term]
The same chapters that show small models winning (Ch.5/6) show frontier agents scoring below 20% on paper-scale replication, ~17% on real bioinformatics, 38.8% vs an 83.5% PhD baseline on PaperArena, 33% answer accuracy with 58% code-failure on UnivEARTH, and only 2.4% of FDA-cleared AI devices with clinical studies backed by RCTs. 'Beats a 40B model on ProteinGym' is an in-distribution benchmark win, not a confirmed discovery or a validated clinical tool. Cheaper-to-build amplifies a build/validate gap.
A proliferation of cheap, capable domain models floods science and medicine with under-validated tools, shifting the bottleneck and the cost onto downstream validation (wet-lab, prospective clinical trials, replication). [mid term]
Lowering the capital barrier to *building* a model does nothing to lower the cost of *validating* it. Ch.6 is explicit: the bottleneck relocates 'from algorithms to clinical evidence, data, and governance.' More buildable models with flat validation capacity means a widening backlog of unconfirmed candidates and more pressure to deploy on benchmark scores alone.
Public-sector and academic labs gain real frontier-relevant leverage in narrow scientific domains, partially reversing the 91% industry dominance of notable general-purpose models. [mid term]
Ch.1/5 show 93 of 95 notable general models came from industry, yet 'most scientific AI models come from academic and government collaborations' and win in-domain (MSAPairformer, GPN-Star, AION-1, Aardvark Weather). When data quality and domain method beat parameter count, the public sector's comparative advantage — domain expertise and curated scientific datasets — becomes decisive where it has data, even without hyperscaler compute.
Demand for raw frontier compute softens at the margin for a class of workloads, exposing capex-heavy 'scale is everything' bets to stranded-asset and write-down risk if the data-centric thesis generalizes. [long term]
If a 32B model trained on curated data substitutes for a 3T-param model on many in-domain tasks, the marginal buyer needs far less compute per unit of useful capability. This is the same disruption logic that wiped $1T+ off U.S. tech market cap after DeepSeek-R1 (Ch.2). Compute still scales 3.3x/yr in aggregate, but the per-task economics that justify the largest training runs weaken.
Hidden ramifications / who pays
The 'comparable results' framing papers over a real, persistent capability gap: OLMo 3 scored 78.1% on AIME 2025 versus Grok-4 92.7%, GPT-5 high 94.3%, and Gemini 95.7% — a 14-17 point deficit. For the hardest, highest-value reasoning, the giant still wins; 'matches' is true only at a chosen threshold of 'good enough.'
Who bears it: Buyers and executives told they can drop the frontier model for a small one — they may silently accept a 15-point reliability haircut on exactly the tasks where the last few points matter most.
Why hidden: The headline keystat ('~90x fewer params, comparable AIME') compresses a 14-17 point raw gap into the word 'comparable,' and the report's own R&D tension flags that flattened parameter counts 'likely understate real growth' because frontier labs stopped disclosing — so the comparison is being made against a moving, partly-hidden target.
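The gap behind the word "comparable" is easy to recompute from the scores quoted above. The sketch below is plain subtraction on those four percentages; the variable names are illustrative.

```python
# Sketch: the raw AIME 2025 gap behind the word "comparable".
# Scores are the percentages quoted in the text above.

OLMO_3 = 78.1
frontier = {"Grok-4": 92.7, "GPT-5 high": 94.3, "Gemini": 95.7}

# Round to one decimal to suppress floating-point noise in the subtraction.
gaps = {name: round(score - OLMO_3, 1) for name, score in frontier.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:.1f} points ahead of OLMo 3")

print(f"Range: {min(gaps.values()):.1f} to {max(gaps.values()):.1f} points")
```

The computed range is 14.6 to 17.6 points, which the text summarizes as a 14-17 point deficit.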
Data deduplication, pruning, and curation are themselves opaque, unaudited, and largely undisclosed — so the data-centric path inherits and deepens the transparency crisis rather than escaping it. Every model except K2 Think and Olmo 3 32B scored zero on pre-training-data transparency (Ch.3).
Who bears it: External auditors, regulators, and downstream scientists who need to know what was pruned out (and what bias that pruning introduced) to trust a domain model's outputs.
Why hidden: The theme celebrates curation as the new winning method but never notes that 'curation' is an undocumented editorial act on the training corpus — the same upstream-data opacity (FMTI's weakest dimension, 58→40) that Theme T04 identifies, now made load-bearing for performance.
Cheap, domain-specific bio/genomics model-building lowers the capital and expertise barrier to dual-use capability — the same parameter-efficiency that lets a 200M model predict variant effects also lowers the cost of misuse — while biosecurity oversight is structurally neglected (only 14 biosecurity publications in all of medical AI in 2025).
Who bears it: Public-health and biosecurity institutions, and society at large, who bear a proliferation risk that scales with how easy and cheap these generative biological-design models become to build.
Why hidden: The report frames small models purely as a democratization and public-good win; the proliferation/dual-use downside of cheap generative genome-scale models (Evo 2: 40B params, fully open weights, 9.3T base pairs) is mentioned nowhere in the theme and barely anywhere in the corpus (14 papers).
The data bottleneck is unevenly distributed, so 'win on data' hard-codes existing data inequities into model quality: medical imaging training data is ~100x smaller than non-medical, functional/perturbation biology data is scarce, ecological data is biased toward well-studied taxa, and patient-perception data is dominated by US/UK/Germany.
Who bears it: Patients, species, and populations in data-poor domains and geographies (sub-Saharan Africa, Latin America, Southeast Asia, rare diseases, understudied taxa) — when data is the moat, the data-poor get permanently worse models.
Why hidden: When compute was the barrier, money could in principle close the gap; when data is the barrier, no amount of money conjures clinical data that was never collected — so the theme's 'lowers the barrier for smaller players' optimism inverts for anyone in a data-desert, and the report only notes the data gaps in scattered tension footnotes, never connected to the small-model thesis.
Pushing the field toward narrow, data-curated specialist models fragments the ecosystem into many in-domain winners with no shared benchmark or interoperability, raising integration and governance cost — Ch.6 explicitly notes medical imaging AI 'lacks shared benchmarks' and Ch.5 notes most domain benchmarks are brand-new with no longitudinal data.
Who bears it: Health systems, labs, and integrators who must now evaluate, validate, and stitch together dozens of bespoke specialist models instead of governing a few general ones.
Why hidden: The theme's 'favor domain-specific open models over giant general ones' recommendation quietly externalizes the cost of running a zoo of incomparable specialist models onto the adopter; specialization's fragmentation tax is invisible in a per-model performance comparison.
Cascades / causal chains
Data-centric methods make small models competitive → high-quality data becomes the scarce input → clean public web data depletes (2026-2032) and >50% of new content is synthetic → proprietary data licensing (NYT-Amazon) becomes the new moat → deep-pocketed incumbents outbid for exclusive corpora → the 'collapsed' capital barrier re-forms as a data-access barrier controlled by the same incumbents.
Cheaper to build a competitive domain model → proliferation of small specialist bio/genomics/medical models → building is cheap but validation (wet-lab, RCTs) is not → validation capacity stays flat while model supply surges → a backlog of unconfirmed candidates builds → pressure mounts to deploy on benchmark scores alone → under-validated tools reach patients/science (only 2.4% RCT-backed FDA devices; 11.8-14.6 severe harms/100 cases for open-ended LLMs).
Performance now tracks data quality not parameter count → domains with rich curated data (chemistry, protein sequence, weather) get excellent small models, while data-poor domains (rare disease, understudied taxa, Global South health) cannot → model-quality inequality maps onto pre-existing data inequality → the data-rich pull further ahead → 'democratization' delivers to the already-well-resourced and bypasses the data-poor.
Data-centric thesis generalizes → per-task compute demand softens for substitutable workloads → marginal buyers need far less compute per unit of useful capability → the economics justifying the largest training runs weaken → capex-heavy frontier bets face stranded-asset risk (the DeepSeek-R1 $1T+ market-cap shock as the template).
What to watch / leading indicators
Proprietary data-licensing deal volume and exclusivity terms (beyond NYT-Amazon): a wave of exclusive corpus deals, especially in medicine/genomics, would confirm the moat has migrated from compute to data and that incumbents are re-capturing it — refuting pure democratization.
Whether the OLMo-style gap on the *hardest* benchmarks (AIME 78.1% vs ~94%, frontier replication sub-20%) closes or persists as small models iterate: a closing gap confirms data-centric methods generalize; a persistent gap confirms 'comparable' only holds below the high-stakes threshold.
Validation-layer throughput vs. model-build throughput in science/medicine: track the ratio of RCT-backed FDA AI clearances (currently 2.4%) and experimentally-confirmed AI discoveries against the number of new domain models shipped. A widening gap confirms validation has become the real bottleneck.
Compute-demand signals for substitutable workloads — hyperscaler capex guidance, GPU spot pricing, and any repeat of a DeepSeek-style market-cap shock: softening here would confirm the data-centric thesis is denting the scale-everything capex case at the margin.
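The validation-versus-build indicator in the list above can be tracked as a simple ratio. The following is a minimal Python sketch under stated assumptions, not a method from the report: the names `YearCounts`, `validation_ratio`, and `gap_is_widening` are hypothetical, and every count in the sample series is a placeholder rather than AI Index data. Only the idea of comparing validated outputs against models shipped comes from the text.

```python
from dataclasses import dataclass


@dataclass
class YearCounts:
    year: int
    models_shipped: int     # new domain models released (hypothetical count)
    validated_outputs: int  # RCT-backed clearances plus confirmed discoveries (hypothetical)


def validation_ratio(c: YearCounts) -> float:
    """Validated outputs per model shipped; lower means validation is lagging."""
    if c.models_shipped <= 0:
        raise ValueError("models_shipped must be positive")
    return c.validated_outputs / c.models_shipped


def gap_is_widening(series: list[YearCounts]) -> bool:
    """True if the ratio falls in every consecutive year of the series."""
    ordered = sorted(series, key=lambda c: c.year)
    ratios = [validation_ratio(c) for c in ordered]
    if len(ratios) < 2:
        return False  # a trend needs at least two years
    return all(later < earlier for earlier, later in zip(ratios, ratios[1:]))


# Placeholder numbers for illustration only; they are not from the AI Index.
sample = [
    YearCounts(2024, models_shipped=100, validated_outputs=10),
    YearCounts(2025, models_shipped=200, validated_outputs=12),
    YearCounts(2026, models_shipped=400, validated_outputs=13),
]
print(gap_is_widening(sample))
```

A strictly falling ratio is a deliberately strict test; a real tracker would likely smooth over noisy years before calling a trend.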
Tensions & contradictions
Collides head-on with T03 (single-foundry hardware risk) and the Ch.1 compute story: T06 says capex/compute is being displaced as the differentiator, yet aggregate compute is still scaling 3.3x/yr to 17.1M H100e, nearly all of it fabricated by a single foundry (TSMC). If small data-centric models truly collapse the compute barrier, the strategic stakes of the TSMC chokepoint and the gigawatt buildout should be falling, but the report shows them rising. Both can't be the dominant trend; T06 is the marginal/optimistic read against a still-compute-hungry baseline.
Contradicts itself across the build/validate seam, and collides with T02 (jagged intelligence): the same small models that 'beat' giants in-domain sit inside chapters showing sub-20% replication, ~17% bioinformatics accuracy, 33%/58% answer/code-failure, and 12.4% real household-robot success. Winning a ProteinGym benchmark is not a confirmed discovery; T06's performance framing rides on exactly the isolated-benchmark signal T02 warns is decoupled from real-world reliability.
Undercuts its own democratization claim via T04 (the accountability vacuum): T06 makes undisclosed data curation the engine of performance, while T04/Ch.3 show pre-training-data transparency is near-zero (every model but two scored 0) and the Transparency Index fell 58→40. 'Win on data' and 'nobody discloses their data' cannot both be virtues: the winning method is the least auditable input.
Collides with T05 (inference is the hidden resource bomb): T06 frames cheap-to-build as a clean win, but the binding deployed-AI cost is perpetual inference at scale (29.6 GW, city-scale water), which a cheaper-to-train small model does not necessarily reduce — and may worsen if low build cost drives a proliferation of more deployed models each serving inference.
Tension with T08's capital-concentration finding: T06 says the barrier to competitive AI is collapsing and redistributing who can build, while T08 shows investment, robots, and adoption entrenching in a few countries/states/mega-deals (US private AI investment 23.1x China's). Cheaper models lower one barrier even as capital concentrates around the next one (data, distribution, integration).
⟂ Contrarian read
"The 'bigger isn't better' story is largely a benchmark artifact that will not survive contact with deployment — and acting on it is the trap. Read against the report's own data, the small-model wins are concentrated in narrow, data-rich, structurally-evaluable domains (protein sequence, chemistry Q&A, AIME) where curation can substitute for scale; the moment you move to open-ended, end-to-end work — real research (38.8% vs 83.5% PhD), real bioinformatics (~17%), real clinical reasoning (11.8-14.6 severe harms/100 cases), real household tasks (12.4%) — scale and breadth still dominate and the small specialists collapse. So the actionable inversion is: the report has correctly identified that *data*, not parameters, is now the binding input — but it has drawn the wrong strategic conclusion. The winner of the next phase is not the scrappy curator who avoided capex; it is the incumbent who realizes data is the new oil and spends frontier-scale capital to *acquire and lock up* the world's scarce high-quality and proprietary corpora before they deplete (2026-2032) or drown in synthetic content (>50% of new web text). 'Win on data, not capex' will quietly become 'win on data *with* capex' — and the firms that treated cheap small models as permission to stop spending will find the barrier didn't fall, it just moved upstream to a place they no longer own."
T07
4 - Economy · 7 - Education

AI's First Labor Signal Is Generational — Protect the Entry-Level Rung

AI is hitting the youngest workers first, eroding the on-ramp that builds tomorrow's senior talent.

Employment for U.S. software developers ages 22-25 fell close to 20% from its 2022 peak by September 2025, while headcount for older cohorts kept growing; among 22-25-year-olds, employment in the most AI-exposed occupations fell ~16% relative to the least exposed. One-third of organizations expect AI to reduce their workforce over the next year. Meanwhile productivity gains land hardest in the same structured fields — customer support +14-15%, software +26%, marketing +50% — even as agent deployment stays in the single digits across business functions.

~20%
Drop in employment for software devs ages 22-25 from 2022 peak
33%
Organizations expecting AI to reduce their workforce next year
+26%
Pull requests per developer with AI (customer support +14-15%, marketing +50%)
−11%
U.S. CS undergraduate major enrollment, 2024 → 2025
    What to do
  • Policymakers: monitor age-disaggregated and AI-exposure-quintile employment; fund apprenticeship and early-career pipelines
  • Executives: redesign junior roles around AI augmentation to preserve the on-the-job learning that builds senior talent, rather than cutting entry-level hiring
  • Universities: anticipate continued CS-enrollment softening (−11%) and build AI-specialization tracks students actually want (master's +82% since 2022)
Productivity gains by function (where AI lands hardest)
Marketing output: +50%
Software · pull requests: +26%
Customer support: +14–15%
Experienced OSS devs (METR): −19%
Measured productivity effect by task · METR finding is negative
Gains concentrate in structured, monitorable work — and so does the entry-level employment decline.
“Employment for software developers ages 22 to 25 has fallen nearly 20% from 2024.” AI Index 2026, Ch. 4
Beneath the headline: 2nd-order effects & hidden ramifications · 6 second-order · 5 hidden · 5 cascades
Second-order effects / what follows
A senior-talent supply cliff arrives in ~5-10 years, even if AI plateaus tomorrow. [long]
The report shows entry-level hiring (devs 22-25, ~20% below 2022 peak) is the first casualty while older cohorts keep growing. But senior engineers are grown, not hired — they are the surviving cohorts of yesterday's juniors. If you stop hiring and training 22-25-year-olds in 2025, you structurally cannot have the pool of 30-35-year-old leads in 2032. The damage is invisible today precisely because the existing senior cohort is intact; the hole appears only when that cohort retires/churns and there is no backfill. The report frames this as eroding 'the rungs younger workers climb' but understates that the erosion is a delayed-detonation effect on the firms doing the cutting.
AI's own future capability is throttled by a thinning human-expert pipeline that the same AI depends on. [long]
Frontier models are trained and evaluated by, and aligned against, expert human judgment; RLHF, red-teaming, and domain benchmarks all consume scarce senior expertise. If AI hollows out the junior rung in software/CS (CS undergrad -11%, and students 'visibly responding to a slowing entry-level coding job market'), the field starves the very expert-formation process that produces the humans who make the next model better. AI is partially eating its own training-data and evaluation substrate. The report treats labor and capability as separate chapters and misses this feedback loop.
Measured 'productivity gains' are partly mismeasured because they offload an invisible learning cost onto the future. [mid]
The report's own data shows gains concentrate in structured work (support +14-15%, software +26%) but flags 'learning penalties' and a -19% slowdown for experienced OSS devs (METR) plus 'heavy AI reliance for learning produced measurable learning penalties with no speed gain.' If juniors complete tasks via AI without acquiring the underlying judgment, the +26% pull-request gain is borrowed against a future deficit in people who can debug, review, or design without the tool. Today's productivity number capitalizes a present output gain while expensing nothing for the de-skilling — a classic deferred liability the headline metric cannot see.
Wage compression and bargaining-power loss for early-career workers, separate from the headcount story. [near]
The report measures employment counts, not wages. With entry-level positions scarcer and the surviving roles 'redesigned around AI augmentation,' new graduates compete for fewer slots, weakening their wage-negotiation position and pushing acceptance of lower pay / worse terms. Even workers who DO get hired bear a cost the employment-count metric never registers. Extrapolation beyond the report's data, but the directional mechanism follows directly from a demand contraction concentrated in one cohort.
Credential inflation and a 'master's arms race' that shifts cost onto students and entrenches inequality. [mid]
The report shows the inverse scissors: CS undergrad enrollment -11% while AI master's +82% since 2022. As the entry rung disappears, the labor market re-anchors the de facto minimum credential upward (a master's becomes the new BA-equivalent for AI-adjacent roles). That transfers years of tuition cost and opportunity cost onto the worker and favors those who can afford graduate school — compounding the access inequality the Education chapter already documents (44% of small high schools offer CS vs 91% of large; suburban 71% vs rural 57%).
Erosion of the demographic base for AI public legitimacy and consumer surplus. [mid]
The report's Economy chapter celebrates $172B consumer surplus, but Public Opinion shows 64% of US adults expect AI to cause fewer jobs. If the cohort hit first (young, educated, online, high-adoption) is also the most economically squeezed, the population that generates the consumer surplus and drives adoption is the same one absorbing the labor harm — feeding political backlash that the optimistic surplus figure obscures.
Hidden ramifications / who pays
Universities and the student-loan system absorb a balance-sheet shock that nobody underwrote. As the entry rung vanishes, the implicit deal behind a CS degree (and the loans financing it) breaks: students borrowed against an entry-level salary that is now ~20% less available. The -11% enrollment drop is the leading edge; the lagging edge is a cohort of already-enrolled students graduating into a market that won't price their degree as promised.
Who bears it: Recently-graduated and current CS students carrying debt, and the universities/lenders whose enrollment-revenue and repayment models assumed a stable software on-ramp.
Why hidden: The report explicitly notes 'degree completion lags enrollment by years' and that postsecondary data is 2023-24 — so the cohort most exposed is structurally invisible in the current data. The financial-stability angle sits between the Economy and Education chapters and is owned by neither.
The harm lands disproportionately on women and underrepresented groups precisely because they entered the pipeline last. The field was diversifying at the junior level after a decade of effort; freezing entry-level hiring 'last-in-first-out' wipes out the most recent (most diverse) cohort first, while the intact senior cohorts (the report shows R&D gender ratio flat since 2010, >80% male in several countries) are protected. A generational cut is also, mechanically, a diversity rollback.
Who bears it: Women and underrepresented early-career entrants in AI/CS, and the diversity programs/firms that invested in pipeline-building.
Why hidden: The report tracks the generational signal (Ch.4) and the flat gender gap (Ch.1/Ch.7) in separate chapters and never multiplies them. The interaction — that seniority-biased change is also diversity-regressive — is absent from the first-order framing.
Non-software, non-degreed entry workers are hit harder and are nearly unmeasured. The report's cleanest cohort signal is software developers because that labor market is well-instrumented (LinkedIn, job postings, AWS skill mentions +1,358%). Customer support (+14-15% productivity) and marketing (+50%) are even more 'structured and monitorable' — the report's own criterion for where AI lands hardest — yet their entry-level workers lack the data infrastructure to show up as a cohort signal.
Who bears it: Entry-level customer-support, marketing, and back-office workers — often without four-year degrees, lower-paid, fewer alternatives — who are more exposed than developers but invisible in cohort analytics.
Why hidden: Software is over-measured, so it becomes the face of the story; the report's 'canaries in the coal mine' are actually the best-instrumented canaries, not the most-affected ones. The most exposed workers are under-surveilled, so absence of a signal is mistaken for absence of harm.
US immigration policy converts a domestic labor shock into a national-capacity shock. The Education chapter shows 67% of AI software master's graduates are nonresidents, just as the federal government revokes student visas. The 'protect the entry rung' framing is implicitly about US-born juniors, but the same squeeze plus visa restriction means the master's pipeline (the new de facto entry credential) is being cut off at exactly the source that supplies two-thirds of it.
Who bears it: International students/graduates, the US firms and labs that depend on them, and US AI competitiveness, because talent inflows have already collapsed 89% since 2017 (Ch.1).
Why hidden: The report keeps the labor signal (Ch.4) and the visa/nonresident dependence (Ch.7) and talent-inflow collapse (Ch.1) in three different chapters; the compounding — domestic entry squeeze + foreign-entry shutoff hitting the same credential tier — is never assembled.
Firms cutting juniors today are unknowingly raising their own future labor costs and creating a poaching/retention war over the surviving senior cohort. With no junior intake, the fixed stock of mid/senior engineers becomes a scarce, appreciating asset; their bargaining power and switching leverage rise, inverting the headline narrative that AI weakens labor.
Who bears it: Employers who optimized near-term headcount, and — paradoxically as beneficiaries — incumbent senior engineers who become harder to replace and easier to poach.
Why hidden: It contradicts the dominant 'AI weakens worker bargaining power' frame, so it is rarely modeled. It is a second-order consequence of the firm's own first-order cut, visible only when you treat the senior cohort as a depleting stock rather than a fixed input.
Cascades / causal chains
AI automates well-structured junior tasks (software +26% PRs, support +14-15%) → firms cut/decline to backfill entry-level hiring (devs 22-25 down ~20% from 2022 peak) → the cohort that would have become senior in 5-10 years never forms → a senior-expertise shortage emerges in the same firms that did the cutting, with no fast remedy because seniority takes years to grow.
Entry rung shrinks and students observe it → CS undergrad enrollment falls 11% while AI master's enrollment climbs 82% → a master's becomes the new minimum credential → entry shifts to a tier that is 67% nonresident → US visa revocation throttles that tier → the domestic AND foreign on-ramps to the AI workforce close simultaneously, accelerating the 89% collapse in US AI talent inflows.
Juniors do tasks through AI rather than by acquiring judgment ('learning penalties,' experienced devs -19% per METR) → measured productivity rises now while skill formation silently stalls → the borrowed competence comes due when those workers must review/debug/design without the tool → a future cohort that looks productive on dashboards but cannot operate unaided, and an AI-evaluation/alignment pipeline starved of the human experts it depends on.
Entry-level hiring freezes operate last-in-first-out → the most recently added (most diverse, post-DEI-effort) cohort is removed first while intact senior cohorts (flat gender ratio since 2010, >80% male in several countries) are protected → a generational labor cut becomes a diversity rollback that no diversity metric flags because it shows up only as an aggregate hiring decline.
Consumer surplus and adoption are driven by young, high-adoption, educated users ($172B, up 54%) → that same demographic absorbs the first labor harm → economic anxiety concentrates in the population generating the surplus → public 'fewer jobs' expectation (64%) hardens into political pressure → regulatory/redistributive backlash that the headline surplus figure gave no warning of.
What to watch / leading indicators
Age-disaggregated employment by AI-exposure quintile, refreshed quarterly: does the 22-25 software decline spread to 26-35 (confirming a structural pipeline cut) or reverse as the 2021-22 over-hiring unwinds (refuting AI causation)? The report itself names this as the metric to monitor.
Junior-to-senior hiring RATIO and internal promotion velocity at large software employers — a falling ratio with flat senior headcount confirms the 'rung removed, stock not replenished' cascade; the senior-talent cliff is forecastable years early from this leading indicator.
Whether the nonresident share of AI master's graduates (currently 67%) falls sharply post-visa-revocation, AND whether domestic master's enrollment rises to fill it — divergence confirms the dual on-ramp shutoff; substitution refutes it.
Wage and offer-acceptance terms for new CS/AI graduates (not just headcount): falling real starting wages or rising credential requirements would confirm the compression and credential-inflation effects that the employment-count headline cannot see.
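Two of the indicators above reduce to simple arithmetic: a cohort's decline from its peak, and the junior-to-senior hiring ratio read against senior headcount. The Python sketch below is illustrative only. The function names, the 5% flatness tolerance, and all sample figures are hypothetical placeholders, not AI Index data; the index values were chosen so the first result echoes the roughly 20% decline discussed in this theme.

```python
def pct_change_from_peak(series: list[float]) -> float:
    """Percent change of the latest value relative to the series maximum."""
    if not series:
        raise ValueError("series must not be empty")
    peak = max(series)
    return (series[-1] - peak) / peak * 100.0


def junior_senior_ratio(junior_hires: int, senior_hires: int) -> float:
    """Junior hires per senior hire in one period."""
    if senior_hires <= 0:
        raise ValueError("senior_hires must be positive")
    return junior_hires / senior_hires


def rung_removed(ratios: list[float], senior_headcount: list[int],
                 tolerance: float = 0.05) -> bool:
    """True if the ratio falls every period while senior headcount stays
    within `tolerance` (as a fraction) of its starting value."""
    if len(ratios) < 2 or not senior_headcount:
        return False
    falling = all(b < a for a, b in zip(ratios, ratios[1:]))
    base = senior_headcount[0]
    flat = all(abs(h - base) / base <= tolerance for h in senior_headcount)
    return falling and flat


# Placeholder inputs for illustration only; none of these are AI Index data.
employment_index = [92.0, 100.0, 95.0, 88.0, 80.0]  # hypothetical cohort index
hires = [(50, 100), (40, 100), (30, 100)]           # hypothetical (junior, senior) hires
senior_headcount = [1000, 1010, 990]                # hypothetical headcount

decline = pct_change_from_peak(employment_index)
ratios = [junior_senior_ratio(j, s) for j, s in hires]
print(f"change from peak: {decline:.1f}%")
print(f"rung removed signal: {rung_removed(ratios, senior_headcount)}")
```

Measuring from the peak rather than from a fixed year matters here, because the 2021-22 over-hiring bubble inflates the peak and therefore the apparent decline.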
Tensions & contradictions
Collides with the report's own causal-attribution caveat (Ch.4): from 2022 to early 2025 unemployment rose MORE for the LEAST AI-exposed workers (quintile 1, +0.94pp) than the most exposed (quintile 5, +0.30pp), and 'large-scale job losses have not appeared in aggregate employment data.' The generational signal and the exposure-quintile signal point in opposite directions — the headline may be over-attributing a cohort decline to AI when macro conditions (over-hiring in 2021-22, rate hikes) could explain much of the 22-25 software drop.
Collides with T06 (Bigger Isn't Better / data-centric small models lower the capital barrier). If competitive AI no longer needs frontier capex and can be built by smaller players and the public sector, the entry-level rung could re-open at a broader set of employers — the labor pessimism of T07 assumes demand stays concentrated in the same few firms that are cutting, which T06 directly undercuts.
Collides with the agent-deployment reality (Ch.4/T08): 'AI agent deployment is still in the single digits across nearly all functions,' yet entry-level employment already fell ~20%. If autonomous agents are barely deployed, the junior decline cannot be mostly AI-caused yet — implying employer ANTICIPATION (one-third expect workforce cuts) and the AI narrative itself, not realized automation, is driving the hiring freeze. The cause may be a story about AI more than AI.
Collides with the productivity-evidence fragility (Ch.4): the flagship negative finding (METR -19% for experienced devs) 'could not be replicated' partly because the measurement baseline shifted (devs became reluctant to work without AI). The same instability that undercuts the negative case also undercuts the +26%/+50% positive case T07 leans on to argue gains 'land hardest' in structured fields.
Collides with the expert-public optimism chasm (Ch.9): only 39% of experts expect fewer jobs (vs 64% of the public), and 73% of experts expect AI to improve how people do their jobs. T07's 'protect the entry rung' urgency aligns with public anxiety, but the people who build these systems are far less worried, which raises the question of whether T07 amplifies a public fear the expert evidence does not yet support.
⟂ Contrarian read
The generational signal may be largely a macro and narrative artifact mislabeled as an AI labor effect — and treating it as proof of AI displacement could cause the very harm it warns about. Three of the report's own data points undercut the AI-causation story: agent deployment is still single-digit across functions (so little is actually automated yet), unemployment rose MORE for the least AI-exposed workers than the most exposed from 2022-early 2025, and the 22-25 software cohort sits atop a well-known 2021-22 tech over-hiring bubble plus a rate-hike contraction. A defensible non-consensus read: employers are pre-emptively freezing junior hiring because they BELIEVE the AI-displacement narrative (one-third 'expect' cuts; anticipated decreases 'outpace observed ones in nearly all functions'), not because AI has actually replaced those juniors. If so, the binding mechanism is expectation, and an authoritative report headlining 'AI is hitting the youngest workers first' becomes a self-fulfilling prophecy — it gives CFOs cover to cut the entry rung, manufacturing the pipeline collapse from a signal that was, at this moment, more anticipatory than real. The intervention should target the narrative and hiring incentives now, before the realized automation that would justify it actually arrives.
T08
4 - Economy

Value Is Diffusing, Capital Is Concentrating — A Diffusion Gap

AI's consumer value dwarfs producer revenue, yet investment, robots, and even adoption are entrenching in a few countries and mega-deals.

U.S. consumer surplus from generative AI grew 54% in a year to ~$172 billion by early 2026 (median value per user tripled from $3.40 to $11.40), dwarfing producer revenue. Yet U.S. private AI investment ($285.9B) was 23.1x China's ($12.4B), 28 funding events topped $1 billion, and China installed 295,000 industrial robots in 2024 — more than the rest of the world combined (54% of global installs). Generative AI hit 53% adoption in three years (faster than PC or internet), but the U.S. ranks only 24th at 28.3% population-level adoption, behind UAE (64%) and Singapore (61%).

$172B
U.S. annual consumer surplus from genAI by early 2026 (up 54%)
23.1x
U.S. private AI investment vs China ($285.9B vs $12.4B)
295,000
Industrial robots China installed in 2024 — 54% of global total
28.3% (24th)
U.S. population-level AI adoption, behind UAE (64%) and Singapore (61%)
    What to do
  • Policymakers: don't rely on private-investment figures alone (they exclude China's ~$184B in government guidance funds), and invest in diffusion to lagging populations and regions
  • Executives: target AI at structured, monitorable tasks where gains are real, and fund integration (the J-curve) before expecting realized productivity
  • Workforce planners: note agent deployment is still single-digit % across functions — plans assuming imminent autonomous workflows are ahead of reality
Global private AI investment by country, 2025 (US$B)
United States: $285.9B
China: $12.4B
United Kingdom: $5.9B
France: $4.4B
Canada: $4.3B
India: $4.1B
Private AI investment, 2025 · US ≈ 23× China
The U.S. attracted 23x China's private AI investment — the starkest concentration gap in the report.
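The 23x headline is plain division on the investment totals listed above. As a minimal check in Python (the dictionary and the helper name `multiple_of` are illustrative; the dollar figures are the ones quoted in this section, and the share computed at the end covers only the six listed countries, not the global total):

```python
# Private AI investment, 2025, in US$ billions (figures as quoted in this section).
investment_busd = {
    "United States": 285.9,
    "China": 12.4,
    "United Kingdom": 5.9,
    "France": 4.4,
    "Canada": 4.3,
    "India": 4.1,
}


def multiple_of(base: str, other: str, data: dict[str, float]) -> float:
    """How many times larger base's investment is than other's."""
    return data[base] / data[other]


us_vs_china = multiple_of("United States", "China", investment_busd)
print(f"US / China = {us_vs_china:.1f}x")

# Share of the six listed countries' combined total that went to the US.
us_share = investment_busd["United States"] / sum(investment_busd.values())
print(f"US share of listed total = {us_share:.1%}")
```

The ratio counts private investment only, so it says nothing about state-channeled funding such as China's government guidance funds.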
“This consumer surplus figure dwarfs estimated U.S. generative AI revenues, suggesting that the social returns from the technology far exceed the private returns captured by producers.” AI Index 2026, Ch. 4
Beneath the headline: 2nd-order effects & hidden ramifications · 5 second-order · 5 hidden · 5 cascades
Second-order effects / what follows
The $172B consumer-surplus figure becomes the political wedge used to justify NOT compensating producers, accelerating a thin-margin race to give frontier models away free, which paradoxically deepens, not relieves, capital concentration. [mid]
The report frames surplus as social good (innovators capture ~3% of social returns). But free distribution is itself a moat: only players who can absorb 29.6 GW of inference cost (Ch.1) and lose money per query can run it. So 'value diffuses to consumers' operationally means 'only the most capitalized firms can afford to diffuse it,' converting consumer benefit into a barrier to entry that locks in the same handful of mega-cap providers funding the 28 billion-dollar rounds.
Population-level adoption rank (US 24th at 28.3%) decouples from economic capture, producing a 'rentier' geography: countries that adopt fastest (UAE 64%, Singapore 61%) generate consumer surplus that is booked as profit by US firms, not local GDP. [mid]
Consumer surplus accrues where users are; producer revenue and equity value accrue where the firms are incorporated (US, California >75% of US investment). High-adoption, low-production countries become net exporters of data and attention and net importers of a metered service — a 21st-century terms-of-trade asymmetry that the report's per-country adoption table masks because it measures uptake, not value retention.
China's robot-installation lead (295K, 54% of global, growing while US/EU decline) converts a manufacturing-automation gap into a downstream embodied-AI data moat that the capital-investment ratio (23.1x US lead) entirely fails to capture. [long]
Industrial robots are not just labor substitution — at scale they are a sensor fleet generating proprietary real-world manipulation data, the exact bottleneck the report flags for embodied AI (BEHAVIOR-1K 12.4% real-task success, Ch.2). The 23.1x 'US dominance' is measured in private venture dollars into digital models; it is structurally blind to a physical-world data flywheel that compounds where the robots already are.
The single-digit-percent agent-deployment reality means the measured $172B surplus is almost entirely a chat/augmentation artifact, so the moment agentic deployment scales, the diffusion gap widens sharply rather than narrowing, because agents concentrate value in firms that can orchestrate them. [mid]
Today's surplus comes from free consumer chat tools (median $11.40/user). Agentic automation (where security/risk is the 62% blocker, Ch.3) requires integration capital, governance roles (up to 17%, Ch.3), and risk tolerance only large firms have. As deployment moves from 'single digits' toward double digits, the value type shifts from broadly-distributed consumer surplus to narrowly-captured enterprise productivity (+26% PRs, +50% marketing) — concentrating the next wave precisely as the report's snapshot suggests diffusion.
Public labor fear (64% of US adults expect fewer jobs vs 5% more; lowest gov-trust in US, Ch.9) collides with measurable entry-level erosion to produce a backlash that targets visible producers (a few mega-cap firms), making concentration politically legible and inviting capture-prone regulation. [near]
Concentration is what makes a phenomenon governable AND blamable. Because value entrenches in a handful of named firms and California, the political system has a small, identifiable target. Combined with 64% job pessimism and 37% industry share of congressional witnesses (Ch.8), this channels into state-level action (150 bills in 2025) that the incumbents are best resourced to shape, turning anti-concentration sentiment into incumbent-friendly compliance moats.
Hidden ramifications / who pays
The 23.1x US-over-China investment ratio is a measurement artifact that flips when state-channeled spending is counted: China's ~$184B in government guidance funds (2000–2023, Ch.4 tensions) means the report's headline concentration stat overstates US dominance and understates a parallel, opaque Chinese capital pool that doesn't show up in venture databases.
Who bears it: US policymakers and national-security planners who set strategy off the 23.1x figure; they are calibrating to a number the report itself flags as 'distorted.'
Why hidden: Guidance funds are deployed through state-owned vehicles and local governments, invisible to PitchBook-style private-investment tracking. The headline ratio is irresistibly quotable; the tension footnote that guts it is buried. The report leads with the clean comparison and hedges in the methodology notes.
Because >51.7% of new web content is now AI-generated (Ch.1) and the open web is the substrate of consumer 'free' tools, the consumer surplus is being silently financed by depleting the public-good data commons that produced it — a one-time drawdown booked as recurring value.
Who bears it: Future model trainers, open-source/academic labs (which lack licensed-data budgets), and the public web itself; the surplus is enjoyed now by consumers and the cost is paid later by everyone who needs clean training data.
Why hidden: Consumer surplus is measured cross-sectionally at a moment in time; data-commons degradation (model collapse risk) is a slow stock variable with a 2026–2032 depletion window. The two never appear in the same table, so the surplus looks free when it is partly an asset sale.
The diffusion gap inverts inside science and the public sector: the report celebrates small data-centric models (T06) and academia producing most cited research, yet capital, talent, and compute concentrate in industry (91.18% of notable models) — so the actors who could diffuse value (academia, government, open-source) are systematically starved of the one input (compute, 17.1M H100e) that concentration controls.
Who bears it: Academic and government labs, open-source maintainers, and lower-income nations adopting national strategies (Ch.8) — the very parties a 'diffusion' policy would empower are gated by compute access they cannot fund.
Why hidden: The report treats investment concentration (Ch.4) and research diffusion (Ch.1, Ch.5) as separate good-news/bad-news stories. The causal link — that compute concentration is the throttle on research diffusion — requires reading the two chapters against each other, which the chapter-siloed structure discourages.
Median value-per-user tripling ($3.40→$11.40) alongside flat-to-modest adoption growth (48%→56% US adult usage) means value is intensifying per existing user, not broadening to new users — the surplus growth is the already-included getting more value, not the excluded being included.
Who bears it: Non-adopters and lagging populations (the bottom of the GDP-per-capita adoption gradient); the 'diffusion' is actually deepening within the adopter class while the excluded fall further behind in relative terms.
Why hidden: A headline '+54% surplus' reads as broad-based growth. Decomposing it into (per-user value × user count) reveals the growth is overwhelmingly the intensive margin (per-user, +235%) not the extensive margin (usage, +17%), but the report presents the aggregate, which conflates the two.
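The intensive-versus-extensive split can be sketched from the two pairs of figures quoted above. This is a minimal illustration, not the report's calculation: it uses the median per-user value and the US adult usage rate as stand-ins for the two factors, and a log-share decomposition (a swapped-in technique) to apportion growth between them. Because a median times a usage rate is not an aggregate, the product will not reproduce the headline surplus figure.

```python
import math

# Figures quoted in the text.
VALUE_PER_USER_START, VALUE_PER_USER_END = 3.40, 11.40   # median $/user
USAGE_START, USAGE_END = 0.48, 0.56                       # share of US adults


def pct_growth(start: float, end: float) -> float:
    """Percentage growth from start to end."""
    return (end / start - 1) * 100


intensive_growth = pct_growth(VALUE_PER_USER_START, VALUE_PER_USER_END)
extensive_growth = pct_growth(USAGE_START, USAGE_END)

# Log growth is additive across factors, so each factor's share of the
# combined log growth is well defined (shares sum to 1).
log_intensive = math.log(VALUE_PER_USER_END / VALUE_PER_USER_START)
log_extensive = math.log(USAGE_END / USAGE_START)
intensive_share = log_intensive / (log_intensive + log_extensive)

print(f"intensive margin (per-user value): +{intensive_growth:.0f}%")
print(f"extensive margin (usage rate):     +{extensive_growth:.0f}%")
print(f"share of log growth from intensive margin: {intensive_share:.0%}")
```

Tracking the same split year over year is one way to operationalize the distinction between value deepening among existing users and value reaching new ones.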
Transparency collapse (FMTI 58→40, Ch.3) makes the diffusion-vs-entrenchment question structurally unanswerable going forward: you cannot measure whether value diffuses if you cannot audit who captures it, what data feeds it, or where compute/energy lands.
Who bears it: Future AI Index authors, regulators, and researchers; the very instruments needed to track the diffusion gap are being switched off as the gap widens.
Why hidden: Each finding is reported as a discrete decline. The compounding effect — that opacity is the meta-risk that disables monitoring of all the other risks including this theme's central question — is never stated because it is an emergent property of the whole report, not any one chapter.
Cascades / causal chains
Frontier models given away free to maximize consumer surplus → only firms that can absorb city-scale inference cost (29.6 GW) can sustain free distribution → free-as-moat excludes smaller producers → capital concentrates further into the same 28 billion-dollar-round players the theme says are already entrenched
China installs 295K industrial robots (54% global) while US/EU installs decline → robot fleet generates proprietary real-world manipulation data at scale → China builds the embodied-AI data flywheel the report flags as the binding constraint (sim-to-real, 12.4% household success) → the digital-investment 23.1x 'US lead' becomes irrelevant in the physical-AI race it cannot see
Consumer free tools trained on the open web → >51.7% of new web content becomes AI-generated → training-data commons degrades (model-collapse risk, 2026–2032 depletion) → only firms holding licensed proprietary data (NYT-Amazon-style deals) can train clean models → data concentration reinforces the capital concentration, so the 'diffused' consumer value cannibalizes the input needed to keep producing it
Value concentrates in a few named firms + California → concentration makes the phenomenon politically legible and blamable → 64% public job-fear (Ch.9) + 37% industry witness share (Ch.8) channel into 150 state bills → incumbents best-resourced to shape compliance → anti-concentration sentiment produces incumbent-friendly regulatory moats that entrench the concentration it was meant to break
Agent deployment scales from single-digit % toward enterprise norm → value shifts from broadly-distributed free consumer surplus to narrowly-captured enterprise productivity (+26% PRs, +50% marketing) → the next wave of AI value is born concentrated → the 'diffusion' the 2026 snapshot shows is revealed as a transient artifact of the pre-agentic chat era
What to watch / leading indicators
The extensive vs intensive margin of consumer surplus: if next year's surplus growth again comes mostly from per-user value (the $11.40 figure rising) rather than from adoption-rate gains in lagging populations, diffusion is a myth and entrenchment-within-adopters is confirmed. Watch the US adoption rank (currently 24th / 28.3%) — if it stagnates while surplus rises, the gap is widening.
Agent-deployment share by business function: the single-digit-% figure is the leading indicator. The quarter it crosses into double digits across multiple functions is the quarter to expect value to shift from diffuse consumer surplus to concentrated enterprise capture — and to watch enterprise-productivity gains decouple from consumer-surplus growth.
Whether China's industrial-robot lead (54% of installs) starts showing up as embodied-AI / real-world-manipulation benchmark gains (e.g., BEHAVIOR-1K real-task success climbing from 12.4%) from China-affiliated teams — that would confirm the physical-data flywheel the capital ratio misses.
Licensed-data deal velocity (post NYT-Amazon) and any measurable degradation in open-web data quality: accelerating proprietary licensing alongside rising AI-generated web share (>51.7%) confirms the data-commons drawdown cascade and the migration of the moat from compute to clean data.
Tensions & contradictions
Collides with T06 (Bigger Isn't Better — small data-centric models collapse the capital barrier): T06 says the capex moat is falling, T08 says capital is concentrating into 28 billion-dollar rounds and hyperscaler capex (Google $150B+). Both are true only if you separate model-building cost (falling) from deployment/inference/distribution cost (concentrating) — the moat migrated from training to serving, which neither theme states explicitly.
Collides with the report's own Ch.1 finding that research leadership is diffusing to China (74.24% of patents, 41 of top-100 cited papers, GitHub share US 80%→31.7%): T08's '23.1x US dominance' headline and Ch.1's 'China leads research volume' cannot both be the whole story — model/capital leadership (US) and research/patent/open-source leadership (China) have decoupled, so 'concentration' depends entirely on which input you measure.
Collides with T03 (single TSMC foundry) and T05 (inference resource bomb): T08 treats concentration as a capital/adoption story, but the binding concentration is physical — one Taiwanese fab and 29.6 GW of power. Value cannot diffuse past the chokepoint of who gets chip allocation and grid capacity, a constraint T08's investment framing omits.
Collides with T07 (seniority-biased labor erosion) on the meaning of 'diffusion': T08 frames diffused consumer surplus as the upside, but the same diffusion of free capable tools is what erodes the entry-level rung — the consumer-surplus gain and the entry-level-job loss may be two faces of the identical mechanism (capable free tools substituting for junior labor), so celebrating one while lamenting the other is internally inconsistent.
Internal contradiction flagged by the report itself: the 23.1x ratio is 'distorted' because it excludes China's ~$184B in guidance funds, yet the same ratio is promoted as 'the starkest concentration gap in the report' (T08 figure takeaway) — the theme's flagship statistic is one the report's own methodology notes disavow.
⟂ Contrarian read
"The 'diffusion gap' is misnamed: there is no gap between value diffusing and capital concentrating — they are the same act viewed from two ends. Giving a frontier model away free to 56% of US adults IS the concentration mechanism, not a counterweight to it. Free distribution at a per-query loss is the most aggressive moat in tech history: it is predatory pricing that only the most capitalized firms can sustain, it harvests the entire population's data and attention as the price of 'free,' and it forecloses any producer who would need to charge. So the $172B consumer surplus is not evidence that 'social returns exceed private returns and value is escaping the incumbents' — it is the receipt for the incumbents having bought the entire market's loyalty and data with capital nobody else has. Read this way, the report's hopeful framing (consumers winning, innovators capturing only 3%) is exactly backward: the 3% capture rate is not a market failure to be corrected by diffusion policy — it is the deliberate, temporary, and rational investment posture of firms buying a durable monopoly, and it will revert hard the moment agentic deployment gives them a metered product to charge for."
T09
8 - Policy & Governance · 9 - Public Opinion

Build to the States, Not a Federal Framework

U.S. states passed 150 AI bills in 2025 while Washington pivoted to deregulation and moved to challenge those laws in court.

U.S. state AI bills rose from fewer than 10 in 2020 to 150 in 2025, with California leading at 62 bills over 2016-2025. Meanwhile the December 2025 federal executive order created a DOJ AI Litigation Task Force to challenge state laws and tied federal funding to states avoiding 'conflicting' legislation. Industry now supplies the largest bloc of congressional AI witnesses (37%, up from 13%) as government's share fell from 35% to 10%, and compute sovereignty stays concentrated (China 85 state-backed supercomputers vs South Asia's 2).

<10 → 150
AI bills passed by U.S. states, 2020 → 2025
62
AI bills enacted by California, 2016-2025 (>2x any other state)
13% → 37%
Industry's share of congressional AI witnesses (115th → 119th Congress)
    What to do
  • Compliance leaders: build to binding state instruments now in force — California SB 53/SB 243, Texas TRAIGA, Colorado AI Act, Utah HB 452, plus watermarking laws — and budget for legal uncertainty
  • AI developers: prepare for divergent compliance across the EU AI Act (live since Aug 2025), China's mandatory labeling, and a U.S. state patchwork
  • Clinicians & educators: audit patient- and student-facing chatbots against state disclosure and data-handling mandates arriving in 2026
Most active states in AI legislation, 2016-25
State        AI bills passed, 2016–25
California   62
Maryland     28
Virginia     25
Utah         24
New York     18
Texas        17
Illinois     15
A handful of states — led by California — carry the bulk of binding U.S. AI regulation in the federal vacuum.
“The upshot is that the future of U.S. state-level regulation on AI remains uncertain.” (AI Index 2026, Ch. 8)
Beneath the headline: 2nd-order effects & hidden ramifications (6 second-order · 5 hidden · 5 cascades)
Second-order effects / what follows
Compliance capacity becomes a moat that entrenches the largest labs and kills small AI developers, the opposite of what 'pro-innovation' federal deregulation claims to protect. [mid-term]
A 50-state patchwork (150 bills in 2025, CA at 62) imposes fixed legal-overhead costs that scale poorly: only firms with large legal departments can track and comply with divergent rules (CA SB 53, TX TRAIGA, CO AI Act, UT HB 452). Chapter 4 shows capital already hyper-concentrating (28 billion-dollar rounds; US 23x China; avg deal +46%). Patchwork compliance adds a per-jurisdiction fixed cost that the 3,499 newly-funded smaller firms cannot amortize, while frontier labs absorb it trivially.
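The amortization argument can be made concrete with a toy calculation. Every number below (revenues, per-jurisdiction cost, jurisdiction count) is a hypothetical placeholder chosen for illustration. None comes from the report, and real compliance costs are not purely fixed.

```python
def compliance_burden(annual_revenue: float,
                      jurisdictions: int,
                      fixed_cost_per_jurisdiction: float) -> float:
    """Fixed compliance cost as a fraction of annual revenue."""
    return jurisdictions * fixed_cost_per_jurisdiction / annual_revenue


# HYPOTHETICAL inputs, for illustration only.
JURISDICTIONS = 5                       # e.g. a handful of divergent state regimes
FIXED_COST_PER_JURISDICTION = 200_000   # USD per year, assumed identical for both firms

small_firm = compliance_burden(5_000_000, JURISDICTIONS, FIXED_COST_PER_JURISDICTION)
frontier_lab = compliance_burden(10_000_000_000, JURISDICTIONS, FIXED_COST_PER_JURISDICTION)

print(f"small firm:   {small_firm:.2%} of revenue")
print(f"frontier lab: {frontier_lab:.4%} of revenue")
print(f"relative burden: {small_firm / frontier_lab:,.0f}x")
```

Because the cost term does not depend on revenue, the relative burden is simply the revenue ratio between the two firms. That is the sense in which a fixed per-jurisdiction cost "scales poorly" for small entrants.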
The DOJ AI Litigation Task Force converts AI governance from a legislative question into a judicial one, shifting the actual rule-making venue to federal courts and the funding-conditionality mechanism, where outcomes turn on commerce-clause and spending-clause doctrine, not on what any voter or legislator wants. [near-term]
Tying federal funding to states avoiding 'conflicting' legislation (Dec 2025 EO) is a coercion mechanism modeled on prior federal-funding leverage. Its validity is a Spending Clause question (is the condition unconstitutionally coercive?). The substantive rules that survive will be whichever ones courts don't strike — a selection process driven by litigation strategy, not policy merit.
State law becomes the de facto national floor for the entire country (and effectively for global products) via the 'California effect,' so federal deregulation fails to deliver the deregulated market it promises. [mid-term]
Firms rarely build 50 product variants; they build to the strictest large market. With California at 62 bills and SB 53/SB 243 in force, and the EU AI Act already live (Aug 2, 2025), a US developer's compliant product is shaped by CA+EU regardless of federal posture. Washington's deregulation thus changes litigation exposure but not product design.
Industry's capture of the congressional information channel (witnesses 13% → 37%, government 35% → 10%) means that even if a federal framework eventually arrives, its substance will be pre-shaped by the regulated parties, making 'preemption' a vehicle to lock in industry-preferred weak rules rather than to harmonize. [long-term]
Chapter 8 shows the expertise base shifting decisively to industry while government/academia voices collapse. A future federal preemption statute drafted from that witness record would import industry framing; preemption then deletes the stricter state floor and replaces it with the weaker captured federal standard — a ratchet-down, not harmonization.
Compute sovereignty concentration (China 85 vs South Asia 2 state-backed supercomputers; US 5,427 data centers) makes the US federal-state fight a luxury most of the world cannot have, and pushes dependent regions toward whichever bloc's governance model ships with the hardware. [mid-term]
Governance autonomy presupposes compute autonomy. Regions with 2-8 sovereign clusters cannot credibly diverge from the terms attached to the cloud/chips they rent. So the substantive AI rules in MENA/South Asia/LatAm will be set by US export policy and Chinese state-backed export terms, not by their own legislatures — the inverse of the US story where sub-national bodies have real leverage.
The chatbot-specific state statutes (companion bots, mental-health bots, minors) become the leading edge of AI liability law precisely as Chapter 9 projects companionship use scaling to 30% daily by 2040, creating a fast-growing exposure surface that the deregulatory federal posture explicitly leaves to states. [mid-term]
The Dec 2025 EO carves out child safety and leaves companion/mental-health regulation to states (CA SB 243, UT HB 452). Chapter 3 flags relational harms as outside existing safety frameworks; Chapter 9 shows the user base for exactly these products growing fastest. Liability law will therefore be written state-by-state, through litigation, on the highest-growth and least-understood harm category.
Hidden ramifications / who pays
The 89% collapse in inbound AI talent and 67%-nonresident AI master's pipeline means the federal-state regulatory chaos lands on a talent base that is already shrinking and visa-threatened — regulatory uncertainty becomes a second, compounding push factor for researchers choosing where to locate.
Who bears it: Early-career and international AI researchers, and the US universities (Ch.7) and startups that depend on them — parties centered in Chapters 1 and 7, not in the policy chapter that owns this theme.
Why hidden: The policy chapter frames the federal-state fight as a domestic governance story. It never connects to the talent-flow data in Chapter 1 (net flow 324.6->26.0) — but jurisdictional unpredictability is exactly the kind of friction mobile talent routes around, and the report treats these as separate findings.
State legislatures are now the primary venue for AI rules, but the report's own caveat is that the state tracker counts only bills containing the literal phrase 'artificial intelligence' — so the real regulatory surface is undercounted, and firms relying on the 150-bill figure are systematically underestimating their exposure (privacy, biometric, consumer-protection laws that bind AI without naming it).
Who bears it: Compliance teams, smaller deployers, and any executive using the headline count to size legal risk.
Why hidden: It sits in the tensions section as a methodology footnote, not in the headline. The '150' number is precise-looking and travels as the true count, masking a larger hidden body of AI-binding law.
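The undercount mechanism is easy to see in a toy version of a literal-phrase filter. The bill snippets below are invented for illustration, and the `binds_ai` flag is a hand-assigned label, not a legal determination. The report's actual tracker may apply the phrase match differently.

```python
import re

# The literal phrase the tracker is described as matching on.
PHRASE = re.compile(r"artificial intelligence", re.IGNORECASE)

# HYPOTHETICAL bills: (text snippet, does it bind AI systems in practice?)
bills = [
    ("An act relating to transparency in artificial intelligence systems", True),
    ("An act regulating automated decision systems used in hiring", True),
    ("An act governing collection of biometric identifiers", True),
    ("An act appropriating funds for highway maintenance", False),
]

counted = [text for text, _ in bills if PHRASE.search(text)]
binding = [text for text, binds_ai in bills if binds_ai]
missed = [text for text in binding if text not in counted]

print(f"counted by phrase filter: {len(counted)}")
print(f"actually binding on AI:   {len(binding)}")
for text in missed:
    print(f"  missed: {text}")
```

In this sketch the filter reports one bill while three constrain AI deployment, which is the shape of the exposure gap described above: the precise-looking count is a lower bound.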
Federal deregulation plus a DOJ task force creates a regulatory vacuum that insurers and standards bodies (ISO/IEC 42001, NIST AI RMF — rising per Ch.3) will fill instead — moving real governance into private contracts and insurability requirements that are even less democratically accountable than either Congress or a state legislature.
Who bears it: Every deploying organization, but especially smaller ones priced by insurers; and the public, whose protections migrate from statute to opaque underwriting and audit standards.
Why hidden: The report tracks regulation as government action (counts of bills/EOs). It does not model the substitution effect where, when public law retreats and liability stays uncertain, private risk-transfer mechanisms become the binding constraint — invisible to a bill-counting methodology.
The funding-conditionality lever disproportionately threatens states most dependent on federal money — typically lower-income, often rural states — meaning the EO's coercion bites hardest on the same states (Ch.7: 44% of small/rural schools offer CS) that are already furthest behind on AI capacity, widening the intra-US AI divide rather than the headline US-vs-world one.
Who bears it: Lower-income US states, their school systems and public institutions, and their residents — not the coastal states (CA) that dominate the bill counts and can fiscally resist.
Why hidden: The theme frames federalism as Washington-vs-states monolithically. It misses that 'states' are heterogeneous in fiscal dependence, so a uniform funding threat produces highly non-uniform coercion — a distributional effect the aggregate framing erases.
With capability converged (Ch.2: top models within 25 Elo; US-China gap 2.7%) competition has already moved to cost/reliability/deployment — so regulation, not model quality, is becoming a primary competitive variable, and the federal-state patchwork is quietly a market-structure intervention picking winners, not a neutral safety debate.
Who bears it: The full competitive field of AI providers; buyers who will see less price competition; and antitrust authorities who are not framing AI concentration through the compliance-cost lens.
Why hidden: Chapters 2 and 8 are read separately. Once capability stops differentiating, the cost of navigating 50 regimes becomes a real competitive wedge — but the policy chapter never treats regulation as an input to market concentration.
Cascades / causal chains
Federal deregulation + DOJ litigation threat → states keep legislating but firms cannot wait for court outcomes → firms build to the strictest large market (CA + EU) → federal 'deregulation' produces no actual deregulated product, only added litigation risk and compliance cost.
Industry witness share rises from 13% to 37% as government's falls from 35% to 10% → congressional understanding of AI is industry-framed → any eventual federal preemption statute encodes industry-preferred weak rules → preemption deletes stricter state floors → net national protection ratchets DOWN under the banner of 'harmonization.'
50-state patchwork raises fixed per-jurisdiction compliance cost → only large labs can amortize it across revenue → smaller of the 3,499 newly-funded firms exit or never enter regulated use cases → capital concentration (already 28 billion-dollar rounds) deepens → fewer competitors → less price/safety pressure on the converged frontier.
Dec 2025 EO ties federal funding to avoiding 'conflicting' state laws → fiscally dependent (often rural/low-income) states feel the most coercion → they retreat from AI rules and lag on AI capacity → the intra-US AI divide widens alongside the US-vs-world divide the report centers.
Compute sovereignty stays concentrated (China 85, South Asia 2) → dependent regions can't diverge from the governance terms bundled with rented hardware → their 'national AI strategies' (spreading per Ch.8) describe intent they cannot enforce → de facto AI governance for most of the world is set in Washington and Beijing, not locally.
What to watch / leading indicators
Whether the DOJ AI Litigation Task Force's first suits target the funding-conditionality mechanism or the state laws directly, and the earliest district/appellate rulings on Spending Clause coercion — this determines whether the real venue is courts or legislatures.
Whether major AI developers publicly adopt a single compliance baseline keyed to California + EU rather than building per-state variants — confirming the 'strictest-market floor' effect that would neuter federal deregulation in practice.
Movement in the industry-witness share in the 120th Congress and whether any federal preemption bill emerges; a preemption bill drafted from an industry-dominated record would confirm the ratchet-down cascade.
Insurer and procurement behavior: AI-liability insurance requirements and ISO/IEC 42001 / NIST AI RMF clauses appearing in enterprise contracts — the leading indicator that governance is migrating to private risk-transfer as public law retreats.
Tensions & contradictions
Chapter 9 shows the US public, across all 50 states, wants MORE regulation (41% 'not far enough' vs 27% 'too far') and trusts its own government least in the world (31%) — yet the federal pivot is deregulatory and aimed at suppressing the state laws that are the public's only responsive venue. The theme's 'build to the states' is thus also the only channel left honoring majority preference, which the EO targets.
The deregulatory rationale is 'protect innovation,' but Chapter 4 shows innovation capital is already extraordinarily concentrated (US 23x China; 28 billion-dollar rounds) and Chapter 2 shows capability has converged — so removing guardrails mainly benefits incumbents who don't need protection, while patchwork compliance costs fall on the entrants 'innovation' rhetoric claims to defend.
The EU is the most-trusted AI regulator globally (Ch.9: 53% median vs US 37%, China 27%) and its AI Act is live — so US federal deregulation cedes the global standard-setting role to Brussels at the same moment the US is trying to win the geopolitical AI race the report otherwise frames the US as leading.
Chapter 3 documents transparency actively reversing (FMTI 58->40; >90% of models shipped without training code) — meaning the federal government is deregulating exactly when independent verification of safety is collapsing, so 'voluntary' replaces 'state-mandated' disclosure precisely when voluntary disclosure has proven to be receding.
Chapter 8's own methodology undercuts the theme's central number: the 150-bill count captures only enacted bills literally containing 'artificial intelligence,' and omnibus bills count as one — so the headline simultaneously overstates the meaning of any single number and understates the true regulatory surface.
⟂ Contrarian read
The conventional read is that federal deregulation plus a litigation task force will hollow out US AI governance. The defensible contrarian read: the federal-state collision is the most pro-accountability structure the US could plausibly have right now — precisely because it is messy. A single federal framework, drafted from a Congress where industry is now the largest witness bloc (37%) and government has collapsed to 10%, would almost certainly be weaker and captured, then preempt everything beneath it. The patchwork instead keeps 50 competing veto points alive, lets California and the EU set a high de facto floor, and forces real adversarial litigation that surfaces the actual stakes — whereas a tidy preemptive statute would settle them quietly in industry's favor. On this read, the worst outcome for public protection is not the chaos; it is a 'national framework' arriving to end it. The litigation and funding threats are dangerous not because they deregulate, but because they aim to collapse the multi-venue structure that is currently the only thing keeping a captured federal floor from becoming the ceiling.
T10
6 - Medicine

Scale Constrained Clinical AI Now; Hold Autonomous Diagnosis Back

Clinician-in-the-loop tools deliver real outcomes while open-ended LLM diagnosis remains unproven and unsafe.

Ambient AI scribes cut note-writing time up to 83% (one system reported 112% ROI and 11.3 extra patients/month), and sepsis-prediction tools delivered an 18.7% relative mortality reduction across 13 Cleveland Clinic hospitals. Yet open-ended general-purpose LLMs still make 11.8-14.6 severely harmful recommendations per 100 cases, and FDA clearance is no proof of benefit: only 2.4% of devices with clinical studies have RCT backing and only ~5% of clinical AI studies used real patient data. AI Overviews now front-run 84-92% of health searches with minimal oversight.

up to 83%
Reduction in physician note-writing time from ambient AI scribes
18.7%
Relative sepsis-mortality reduction across 13 Cleveland Clinic hospitals
2.4%
Cleared AI devices with clinical studies backed by RCT data
84-92%
Health searches now returning an AI Overview
    What to do
  • Health-system executives: deploy clinician-in-the-loop tools (scribes, sepsis prediction, evidence retrieval) with documented workflow gains; hold autonomous LLM diagnosis
  • Regulators & procurement: stop treating FDA clearance as proof of benefit — demand prospective real-patient evidence and formal governance
  • Policymakers: address the unvetted AI Overviews shaping 84-92% of patients' first health interpretations, and fund neglected biosecurity (only 14 papers) and equity research
AI Overviews on health-related searches
Query type                   Share of searches returning an AI Overview
Common health queries        92%
Symptom queries              92%
Treatment queries            90%
Condition-question queries   88%
Condition queries            84%
AI summaries front-run nearly all symptom and treatment searches, ahead of any clinician contact.
“AI in medicine advanced on multiple fronts in 2025, but strong model performance has not consistently translated into real-world clinical impact.” (AI Index 2026, Ch. 6)
Beneath the headline: 2nd-order effects & hidden ramifications (6 second-order · 5 hidden · 5 cascades)
Second-order effects / what follows
Ambient scribes and sepsis-alert tools entrench incumbent EHR vendors and the largest health systems, widening the gap between well-capitalized hospitals and safety-net/rural providers rather than democratizing care. [mid-term]
The tools that actually scale are the ones that bolt onto existing EHR workflows (63% ambient-scribe adoption among Epic-EHR hospitals; TREWS runs inside the EHR). Integration, not the model, is the moat — so value accrues to whoever already owns the EHR install base and can fund integration. Chapter 6 explicitly notes 'financial, operational, and institutional barriers sit between authorization and real-world deployment,' which small/rural providers cannot clear. The 112% ROI / 11.3-patient figure assumes a billing structure (fee-for-service throughput) that safety-net providers don't share.
The clinician-in-the-loop framing quietly converts physicians into liability sponges for unvalidated automation, increasing (not decreasing) documentation and cognitive burden even as note-time drops. [near-term]
'Human-in-the-loop' makes the clinician the legally accountable check on a tool whose own evidence base is thin (2.4% RCT-backed, 5% real-patient data). The clinician must now review AI-drafted notes and alerts for the 11.8-14.6 severe errors / 100 cases the system can introduce, especially the 76.6% errors of omission, which are the hardest to catch precisely because nothing is on the screen to react to. Time saved on transcription is partially re-spent on vigilance against an opaque tool.
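To size the review burden, the two quoted figures can be combined. This is a back-of-envelope sketch that assumes the 76.6% omission share applies to the severely harmful recommendations themselves. The text juxtaposes the two numbers but does not state that explicitly, so treat the result as illustrative.

```python
# Figures quoted in the text.
SEVERE_ERRORS_PER_100_LOW = 11.8
SEVERE_ERRORS_PER_100_HIGH = 14.6
OMISSION_SHARE = 0.766   # ASSUMPTION: applied to the severe-error counts above


def omission_errors_per_100(severe_per_100: float) -> float:
    """Severe errors per 100 cases that are omissions rather than commissions."""
    return severe_per_100 * OMISSION_SHARE


omissions_low = omission_errors_per_100(SEVERE_ERRORS_PER_100_LOW)
omissions_high = omission_errors_per_100(SEVERE_ERRORS_PER_100_HIGH)

print(f"severe omissions per 100 cases:   {omissions_low:.1f} to {omissions_high:.1f}")
print(f"severe commissions per 100 cases: "
      f"{SEVERE_ERRORS_PER_100_LOW - omissions_low:.1f} to "
      f"{SEVERE_ERRORS_PER_100_HIGH - omissions_high:.1f}")
```

Under that assumption, most of what the reviewing clinician must catch is content that never appears on screen.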
AI Overviews fronting 84-92% of health searches will measurably shift case-mix and timing at the point of care: more late presentations of conditions the model under-weights, more anxious early presentations of conditions it over-weights. [mid-term]
Chapter 3's KaBLE finding (GPT-4o accuracy falls from 98.2% to 64.4% when a user states a false belief; DeepSeek R1 falls to 14.4%) means the unvetted Overview will tend to confirm the searcher's pre-existing wrong belief. Combined with 76.6% errors of omission (NOHARM), the systematic failure mode is reassurance when none is warranted and omission of red flags, which delays presentation for exactly the omitted conditions. This is a population-level triage distortion happening upstream of, and invisible to, the regulated device pipeline.
Sepsis-style success stories will be over-generalized into a procurement template that fails silently when ported to different populations, because the headline metrics hide the validation conditions. [mid-term]
TREWS' 18.7% mortality reduction came with 89% clinician adoption inside a specific 13-hospital system with a Bayesian, constrained model and heavy workflow integration. Buyers will read '18.7%' as a property of 'AI sepsis tools' and procure look-alikes. Chapter 6 warns only 5% of studies use real clinical data and dataset/benchmark comparability is unreliable; Chapter 3 shows privacy/fairness trade-offs (federated Alzheimer's: stronger privacy -> +21.4% missed diagnoses at small hospitals). A model tuned on one population can degrade precisely where the buyer is most resource-constrained.
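One reason the headline number travels badly is that a relative reduction says nothing about absolute benefit without a baseline. The sketch below applies the quoted 18.7% relative reduction to two hypothetical baseline sepsis mortality rates. The baselines are invented for illustration, and the calculation generously assumes the relative effect carries over unchanged to a new population, which the paragraph above argues it may not.

```python
RELATIVE_MORTALITY_REDUCTION = 0.187   # quoted in the text


def absolute_risk_reduction(baseline_mortality: float) -> float:
    """Absolute drop in mortality if the relative reduction held at this baseline."""
    return baseline_mortality * RELATIVE_MORTALITY_REDUCTION


def number_needed_to_treat(baseline_mortality: float) -> float:
    """Patients managed with the tool per death averted, under the same assumption."""
    return 1 / absolute_risk_reduction(baseline_mortality)


# HYPOTHETICAL baseline mortality rates for two different patient populations.
for baseline in (0.20, 0.10):
    arr = absolute_risk_reduction(baseline)
    nnt = number_needed_to_treat(baseline)
    print(f"baseline {baseline:.0%}: absolute reduction {arr:.2%}, NNT about {nnt:.1f}")
```

A buyer comparing vendors on "18.7%" alone is therefore comparing nothing until the baseline, the adoption rate, and the validation population are pinned down.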
The benchmark-vs-bedside gap becomes a marketing arbitrage: vendors will cite the 85.5% MAI-DxO diagnostic number to sell autonomous-leaning products under cover of the 'constrained tools work' narrative. [near-term]
Chapter 2 establishes capability has converged and benchmarks are gameable (up to 42% invalid questions on GSM8K; leaderboards reflect 'adaptation to the platform'). Chapter 6's 85.5% diagnostic figure is an isolated cognitive test on curated NEJM cases, not outcomes. The theme's own deck separates 'works' from 'unproven,' but the market will collapse them — selling decision support that drifts toward diagnosis because the impressive number and the safe number live in the same press release.
Holding autonomous diagnosis back in regulated medicine pushes it into the unregulated consumer layer, where it scales without the brake. [near-term]
The report regulates devices (FDA, 510(k)) but AI Overviews and general chatbots face 'minimal oversight.' Demand for diagnosis doesn't disappear when the clinical channel is held back; it routes to the channel with no gate. So the safest-sounding policy (constrain clinical AI, hold autonomy) can increase the share of actual diagnostic decisions made by the least-validated systems — the exact inversion the theme intends to prevent.
Hidden ramifications / who pays
The note-time savings are silently transferred to patients and downstream clinicians as AI-fabricated or hallucinated content embedded in the permanent medical record, which then trains the next model.
Who bears it: Patients (whose records now contain unverifiable AI assertions), specialists receiving referrals, and future AI systems trained on EHR text.
Why hidden: The 83% note-time metric measures the writer's burden, not record accuracy. With ambient-scribe hallucination unmeasured in the headline and general-LLM hallucination ranging 22%-94% (Ch.3 AA-Omniscience), errors of omission (76.6%) and commission get laundered into an authoritative-looking note that no one re-derives. The record becomes a citation no downstream reader questions — and a training corpus, echoing Ch.1's >50%-of-web-is-synthetic model-collapse concern, now inside the clinical data layer.
Entry-level clinical reasoning erodes the same way entry-level software did, because trainees offload the cognitive work scribes and decision-support tools now perform — a 'learning penalty' for medicine.
Who bears it: Medical residents and early-career clinicians; the future supply of physicians capable of catching the AI's errors-of-omission.
Why hidden: The report documents this mechanism for software (Ch.4: 22-25yo developer employment -20%; heavy-AI-reliance learning penalties) and education (Ch.7), but never connects it to the clinician-in-the-loop model that depends on a competent human in the loop. If scribes write the notes and tools flag the diagnoses, juniors never build the pattern recognition that makes 'human oversight' meaningful. The safety model assumes a skill it is simultaneously hollowing out.
The 'constrained tools are safe' consensus discourages the prospective RCT evidence that would reveal which constrained tools actually aren't, because workflow tools feel low-stakes enough to skip rigorous trials.
Who bears it: Patients harmed by alert fatigue, automation bias, and false negatives from tools deemed too mundane to trial.
Why hidden: Sepsis alerts got real trials; ambient scribes mostly did not (the 83%/112%/11.3 figures are operational/ROI metrics, not outcome RCTs). The framing that scribes 'augment' rather than 'decide' makes them feel exempt from the 2.4%-RCT critique aimed at diagnostic devices — yet a scribe that omits a symptom or fabricates a negative finding changes downstream decisions invisibly.
Non-English-speaking and dialect-speaking patients bear a compounded, uncounted safety penalty across every layer of the clinical AI stack.
Who bears it: Patients who speak non-standard dialects or non-English languages, disproportionately in under-resourced and global-majority settings.
Why hidden: Ch.3 shows dialect performance can halve (Slovene Cerkno: Mistral 90.0%->53.2%; GPT-5 99.8%->88.6%) and English-bias is systematic. Stack this on the scribe (mis-transcription), the Overview (worse answers in-language), and the device (validated on US/UK/Germany cohorts per Ch.6) and the same patient is failed three times. The medicine chapter's geographic-gap caveat names this for perception studies but not for the tools themselves.
Compute, energy, and water costs of inference-at-scale make AI Overviews on 84-92% of health searches an unpriced public-resource externality justified by a 'free health info' framing.
Who bears it: Communities near data centers, water-stressed regions, and ratepayers — none of whom are the searchers receiving the 'free' Overview.
Why hidden: Ch.1 establishes inference (not training) as the dominant, least-visible resource cost (GPT-4o inference water rivaling 1.2M people's needs; 29.6 GW capacity). Running an LLM Overview on essentially every health query is a massive, recurring inference load presented to the user as costless. The medicine chapter treats Overviews as an information-quality problem and never as a resource one.
Cascades / causal chains
Ambient scribe adopted for throughput (112% ROI, 11.3 extra patients/mo) → note written faster but unverified, with errors of omission embedded → note enters permanent record as authoritative → downstream clinician and next-gen model both trust the unverified text → clinical-data layer accumulates synthetic error (Ch.1 model-collapse risk, now in medicine)
AI Overview fronts 84-92% of health searches → model confirms searcher's stated false belief (KaBLE: GPT-4o 98.2%→64.4%) and omits red flags (76.6% errors of omission) → patient reassured or mis-triaged before any clinician → delayed presentation for under-weighted conditions → case-mix and acuity at point of care shift in ways invisible to the FDA-regulated device pipeline
TREWS 18.7% mortality cut publicized as property of 'AI sepsis tools' → buyers procure look-alike models without re-validating on local population → model degrades on different cohort, worst at small/rural hospitals (federated-privacy analogue: +21.4% missed diagnoses) → harm attributed to 'the tool' not the porting → backlash that also stalls the genuinely-validated constrained tools
Clinician-in-the-loop made the legal safety check → scribes and decision-support offload the cognitive work juniors used to do → trainees never build the pattern recognition to catch errors of omission → the 'human' in human-in-the-loop becomes less competent over time → the entire safety model's load-bearing assumption quietly fails (medicine's version of Ch.4's entry-level erosion)
Regulated clinical autonomy held back (correctly) → latent patient demand for diagnosis routes to unregulated chatbots/Overviews with 'minimal oversight' → actual diagnostic decisions migrate to the least-validated, most-hallucinating channel (22-94% hallucination range) → the hold-back policy increases the share of unsafe autonomous diagnosis it was meant to prevent
What to watch / leading indicators
Publication of outcome-grade (not ROI-grade) RCTs on ambient scribes specifically measuring record accuracy, hallucination, and errors of omission — their absence past 2026 confirms the 'too mundane to trial' blind spot; their appearance with poor results confirms the embedded-error cascade.
Documented shifts in presentation timing or acuity correlated with AI Overview rollout (e.g., later-stage presentations of conditions LLMs under-weight, or surges in low-acuity visits) — a leading indicator that the 84-92% upstream layer is distorting triage.
Malpractice case law and insurer policy assigning liability for AI-scribe/decision-support errors: whether courts place it on the clinician (confirming the liability-sponge effect) or the vendor (which would re-introduce the brake the regulated pipeline lacks).
Procurement language and post-deployment audits from second-wave sepsis/early-warning buyers outside flagship systems — degraded real-world performance versus the 18.7% headline would confirm over-generalization and the small-hospital harm gradient.
Tensions & contradictions
The theme says open-ended LLM diagnosis is unsafe (11.8-14.6 severe errors/100 cases), yet Ch.6 also reports MAI-DxO+o3 at 85.5% vs ~20% for physicians — the same chapter holds both 'LLMs are dangerous diagnosticians' and 'an agentic LLM system crushed physicians,' and the resolution (isolated test vs real integration) is exactly the distinction the market is incentivized to blur.
'Clinician-in-the-loop is the safe pattern' collides with Ch.2/Ch.4 evidence that human oversight degrades skill and that experienced humans can be slowed or biased by AI (devs -19%, METR) — the human check is assumed reliable while other chapters show the human is the part AI quietly weakens.
Ch.3's trade-off finding (stronger privacy -> +21.4% missed Alzheimer's diagnoses at small hospitals; -33pp accuracy from differential privacy) contradicts the implicit assumption that constrained = safe: a constrained, privacy-preserving clinical tool can be less safe for the smallest providers, the opposite of the theme's reassurance.
Ch.1/Ch.6 say smaller specialized models beat giants (MSAPairformer 111M, GPN-Star 200M), undercutting the premise that the diagnostic-safety problem is a frontier-scale capability gap to be solved by waiting — it may be a data/validation problem (5% real-patient data) that more capable general models will not fix.
The theme treats AI Overviews as a medicine governance gap, but Ch.8 shows US federal policy is actively deregulating and litigating against state AI laws — so the oversight the theme calls for on the consumer health-info layer is being structurally removed at the exact moment it's most needed, while Ch.9 shows US public trust in its own AI regulation is the lowest surveyed (31%).
⟂ Contrarian read
The reassuring two-track framing ('scale the safe constrained tools, hold the unsafe autonomous ones') is the dangerous part, because the real safety frontier in 2026 is not autonomous diagnosis, which is visibly held back by regulators and skeptics, but the constrained, 'augmentation' tools that scale precisely because they evade scrutiny. Ambient scribes (largely un-RCT'd) write unverifiable content into permanent records that train the next model, and AI Overviews already act as de facto diagnostic advice on 84-92% of health searches with minimal oversight. The autonomous-diagnosis brake works; the danger is the unbraked stuff everyone agreed was harmless. The clinician-in-the-loop story also quietly assumes a competent loop while the same report documents AI hollowing out exactly the entry-level skill that makes oversight real. So the defensible non-consensus position: open-ended LLM diagnosis is the safer problem (it's being governed), and the 'proven, constrained, augment-only' tools are the under-regulated channel where the next decade's clinical AI harms will actually accrue.
T11
5 - Science

AI Can Run the Pipeline but Can't Yet Discover — Fund the Validation Layer

AI now replaces whole scientific workflows, but frontier agents score below half of expert performance on real research.

In 2025 AI moved from assisting to replacing pipelines: Aardvark Weather ran a full forecasting pipeline end-to-end, FourCastNet 3 generates a 60-day global forecast in under 4 minutes on a single GPU, and the first fully AI-generated paper was accepted at an ICLR workshop. Yet frontier agents score below 20% on paper-scale astrophysics replication, ~17% on real bioinformatics, and 38.8% versus an 83.5% PhD baseline on cross-paper research (PaperArena). On Earth-observation tasks, agents answer with 33% accuracy and their code fails 58% of the time.

<20%
Frontier-model score on paper-scale astrophysics replication
38.8% vs 83.5%
Best AI agent vs PhD expert on PaperArena end-to-end research
33% / 58%
Earth-observation agent answer accuracy / code failure rate
<4 min
FourCastNet 3 60-day global forecast on a single GPU
    What to do
  • Funders & research executives: invest in the experimental-validation, wet-lab, interoperability, and reliability layers — now the binding constraint
  • Lab leaders: deploy agents as competent-on-subtasks but unreliable-end-to-end tools, with mandatory human verification of results
  • Drug developers: set expectations that AI accelerates candidate generation but not the multiyear, costly validation that determines whether candidates work
AI vs PhD experts on end-to-end research
End-to-end research accuracy (the PaperArena PhD expert is the human baseline):
PaperArena · PhD expert: 83.5%
PaperArena · best agent: 38.8%
ReplicationBench · frontier: <20%
BixBench · frontier: ~17%
On every end-to-end research benchmark, the best AI agents land at roughly half (or less) of expert performance.
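As a rough arithmetic check on the caption above, the quoted scores can be expressed as a share of the PaperArena PhD baseline. This is a minimal sketch, not part of the report. Applying the PaperArena baseline (83.5%) to ReplicationBench and BixBench is an assumption made only for illustration, since this section quotes no expert baseline for those two benchmarks, and their scores are entered as an upper bound (20) and an approximation (17).

```python
# Scores quoted in this section, in percent.
# ASSUMPTION: the PaperArena PhD baseline is reused for all three benchmarks,
# because no expert baseline is quoted here for ReplicationBench or BixBench.
PHD_BASELINE = 83.5

agent_scores = {
    "PaperArena (best agent)": 38.8,
    "ReplicationBench (frontier, upper bound of '<20%')": 20.0,
    "BixBench (frontier, approximation of '~17%')": 17.0,
}


def share_of_expert(agent_pct: float, expert_pct: float = PHD_BASELINE) -> float:
    """Return the agent score as a fraction of the expert score."""
    return agent_pct / expert_pct


ratios = {name: share_of_expert(score) for name, score in agent_scores.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} of the PhD baseline")
```

Under these assumptions the PaperArena agent reaches a little under half of the expert score, and the other two land well below that, which is consistent with the "roughly half (or less)" wording.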
“The gap between what AI can propose and what scientists can feasibly test is a recurring theme across the domains covered in this chapter.” (AI Index 2026, Ch. 5)
Beneath the headline: 2nd-order effects & hidden ramifications (6 second-order · 5 hidden · 5 cascades)
Second-order effects / what follows
A flood of plausible-but-unvalidated AI-generated hypotheses, papers, and candidate molecules overwhelms the human/wet-lab validation capacity that is the actual bottleneck, lengthening, not shortening, the median time from idea to confirmed discovery. (mid)
Generation is now near-free and scalable (Kosmos runs approximating 6 months of research in 12 hours, reading 1,500 papers and writing 42,000 lines of code per run; the first fully AI-generated paper accepted at an ICLR workshop), but confirmation still requires the same scarce, costly, slow physical resources — wet labs, instrument time, clinical trials, expert review. When the cheap upstream step scales 100x and the expensive downstream step does not, the queue at the validation gate grows. Reviewers and lab capacity become the rate-limiter, and each genuine discovery now sits behind a longer line of unvalidated candidates.
Peer review and replication infrastructure get quietly captured as the de facto safety layer for AI science, but they were never funded or staffed to be a frontier-AI filter, so review quality degrades and false positives leak into the literature. (near)
With agents scoring <20% on paper-scale astrophysics replication and producing code that fails 58% of the time on Earth-observation tasks, the only thing standing between an AI-generated result and the published record is human review. The report shows ethics/governance discussion is rising but narrow, and replication benchmarks are brand-new with no longitudinal track record. As submission volume climbs (AI natural-science papers up ~26% in one year to ~80,150), unpaid reviewers face more submissions of higher surface-plausibility and lower verifiability, raising the odds that fabricated-but-coherent results pass.
Capital and prestige flow to the visible 'pipeline replacement' wins (weather, genomics foundation models) while the unglamorous validation layer (wet-lab throughput, benchmark maintenance, interoperability/API standards, infrastructure upkeep) stays chronically underfunded, entrenching the exact bottleneck the report identifies. (mid)
Demos like FourCastNet 3 (60-day forecast in <4 min on one GPU) and Aardvark Weather generate headlines, market-cap moves, and grant momentum; ReplicationBench and BixBench scores generate caution but no funding line. The report explicitly names 'maintaining autonomous research infrastructure' as a roadblock. Funders reward discovery claims, not the dull verification machinery, so the layer that gates real impact remains an externality nobody owns.
Domains where validation is cheap, fast, and in-distribution (weather, protein structure) race ahead, while domains where validation is structurally hard or out-of-distribution (climate decadal projection, novel biology, in-vivo clinical) stagnate, widening an intra-science capability gap that looks like uneven 'AI progress' but is really uneven validation cost. (long)
Weather succeeded because ground-truth arrives daily and forecasts are checkable within days; the report notes climate modeling lags precisely because 'decadal projections fall outside the distribution of any existing training data.' Where the feedback loop is tight and labeled, models compound; where confirmation takes years or has no training distribution, they stall. The differentiator is not model quality but how fast and cheaply reality answers back.
Smaller, domain-specific open models from academic/government labs become the workhorses of real science, decoupling scientific AI from the frontier-lab commercial race and creating a parallel, more-open AI ecosystem with different governance. (mid)
111M-parameter MSAPairformer and 200M-parameter GPN-Star beat models 40-200x larger; Earth-science data comes entirely from government/academic sources; Evo 2 ships fully open weights. Because scientific validation rewards data curation and domain fit over raw scale, the economic logic that concentrates general-purpose AI in a few opaque industry labs (Chapter 1: 91% industry share, 81/102 models without training code) does NOT hold in science. This pulls scientific AI toward openness even as commercial AI consolidates and goes dark.
Institutions begin deploying autonomous agents into production science anyway, ahead of reliability, because the productivity optics and competitive pressure are irresistible, importing a steady stream of silent errors into the scientific record before validation norms catch up. (near)
The same dynamic Chapter 4 documents (organizational AI adoption at 88%, agent deployment in single digits but climbing under FOMO) and Chapter 6 documents in medicine (258 FDA authorizations, only 2.4% RCT-backed) will repeat in research. A 33%-accuracy / 58%-code-failure agent that still 'produces an answer' is dangerous precisely because the output looks finished. Adoption tracks perceived usefulness, not measured reliability, so deployment will outrun the validation layer.
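The first second-order effect above rests on a simple queueing argument: if generation scales and validation capacity does not, the backlog at the validation gate grows every period. The sketch below illustrates only that mechanism. The function name `validation_backlog` and all of the monthly figures are hypothetical and are not taken from the report.

```python
def validation_backlog(
    generated_per_month: int, validated_per_month: int, months: int
) -> list[int]:
    """Return the number of unvalidated candidates waiting at the end of each month."""
    backlog = 0
    history = []
    for _ in range(months):
        # New candidates arrive; the validation layer clears what it can.
        backlog = max(0, backlog + generated_per_month - validated_per_month)
        history.append(backlog)
    return history


# HYPOTHETICAL inputs: validation capacity fixed at 10 candidates per month.
capacity = 10
before = validation_backlog(generated_per_month=10, validated_per_month=capacity, months=12)
after = validation_backlog(generated_per_month=1000, validated_per_month=capacity, months=12)

# Under first-in-first-out review, the newest candidate waits backlog / capacity months.
wait_for_newest_months = after[-1] / capacity

print("backlog after 12 months, balanced generation:", before[-1])
print("backlog after 12 months, 100x generation:", after[-1])
print("months the newest candidate would wait:", wait_for_newest_months)
```

With balanced inputs the queue stays empty. With generation at 100x and capacity unchanged, the backlog grows linearly and the wait for any new candidate grows with it, which is the sense in which time to confirmed discovery can rise while time to hypothesis falls.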
Hidden ramifications / who pays
The validation bottleneck is also a power-transfer mechanism: whoever controls scarce wet-lab and instrument capacity gains gatekeeping leverage over which AI hypotheses ever get confirmed — turning physical validation infrastructure into the new chokepoint, analogous to how TSMC is the chokepoint for chips.
Who bears it: Universities and national labs with physical experimental capacity (and the funders who control it); conversely, well-capitalized firms that can buy validation throughput will be able to confirm their AI's hypotheses while academic labs drown in unconfirmable candidates.
Why hidden: The report frames validation as a generic 'roadblock' to be 'funded,' not as a strategic asset whose scarcity confers control. The chip-supply-chain chokepoint framing in Chapter 1 is never applied to the wet-lab/validation layer, even though the structural logic — cheap-to-generate, expensive-to-verify — is identical.
Early-career scientists lose the apprenticeship rungs of doing replication and routine analysis by hand, the exact tasks agents now do at 33-38% — so the next generation may never build the tacit judgment needed to catch the agents' silent failures, hollowing out the human verification layer just as it becomes most critical.
Who bears it: Graduate students, postdocs, and junior researchers — and the long-run integrity of science itself, which depends on a renewable supply of humans who can tell good results from plausible-looking bad ones.
Why hidden: The science chapter treats the human/AI gap as static (PhD 83.5% vs agent 38.8%) and assumes the PhD baseline persists. It mirrors Chapter 4's entry-level coder collapse (~20% drop for devs 22-25) and 'learning penalties' findings, but the report never connects entry-level erosion to the future supply of the expert validators its own thesis depends on.
Confirmation bias gets industrialized: agents that adopt a researcher's prior as fact will generate hypotheses that confirm it, and at scale this manufactures a literature that looks like independent corroboration but is actually one belief echoed thousands of times.
Who bears it: Meta-analysts, systematic reviewers, regulators, and any field that treats publication count or replication count as evidence of robustness — and patients/policymakers downstream who trust 'the consensus.'
Why hidden: This requires connecting Chapter 3's KaBLE finding (GPT-4o accuracy collapses 98.2%→64.4% on user-held false beliefs; DeepSeek R1 to 14.4%) to Chapter 5's autonomous-generation capability. Each chapter treats its finding in isolation; the compounding belief-amplification risk only appears when you cross them.
The 'AI replaces the pipeline' narrative will be used to justify defunding the classical methods (numerical weather models, manual analysis pipelines) that the AI systems were trained on and still depend on for ground truth — kicking away the ladder the AI climbed up.
Who bears it: National meteorological services, observational survey programs, and the public-good data infrastructure that AI weather/science models silently rely on for training labels and validation targets.
Why hidden: Aardvark and FourCastNet are framed as 'replacing' the numerical pipeline, but they are trained against and validated by that same classical apparatus and its observational data. The report celebrates replacement without noting that defunding the substrate degrades the AI's own future training and evaluation signal.
Benchmarks themselves become a contested, gameable asset in science the way they already are in capability evals — and because scientific replication benchmarks are new and few, a lab can optimize agents to a public benchmark while real-world reliability stagnates, making 'our agent scores well on ReplicationBench' a marketing claim rather than a reliability guarantee.
Who bears it: Funders, journal editors, and procurement officers who will start using these benchmark scores as buying/grant signals before the benchmarks are mature enough to resist gaming.
Why hidden: Chapter 2 documents benchmark gaming and contamination extensively (up to 42% invalid questions on GSM8K, Llama-4 contamination dispute), but the science chapter presents its new benchmarks as trustworthy diagnostic instruments without inheriting Chapter 2's hard-won skepticism about what benchmark scores actually mean.
Cascades / causal chains
AI generation cost collapses (Kosmos: 6 months of research in 12 hours) → hypothesis/paper/candidate volume explodes (AI science papers +26% YoY to ~80,150) → the fixed-capacity validation layer (wet labs, reviewers, instrument time) becomes the binding constraint → median time-to-confirmed-discovery rises even as time-to-hypothesis falls → investment that assumed AI would accelerate discovery is mispriced against a bottleneck it didn't move
Agents deployed at 33% accuracy / 58% code-failure produce finished-looking outputs → institutions adopt under competitive FOMO before reliability is proven (mirroring 88% org adoption, 2.4% RCT-backed medical devices) → silent errors enter the literature faster than replication can catch them → peer review (unfunded, understaffed) becomes the de facto safety layer → review quality degrades under volume → false positives accumulate and erode trust in the AI-science corpus
Agents now do replication and routine analysis (the apprenticeship tasks) → junior researchers stop building tacit verification judgment (parallel to Chapter 4's ~20% entry-level coder decline + learning penalties) → the renewable supply of expert validators (the 83.5% PhD baseline) shrinks over a decade → the human layer that gates AI's silent failures thins exactly as AI output volume peaks → validation capacity falls in both senses (throughput and judgment)
Validation is cheap and fast in weather (daily ground truth) → models compound there (operational deployment, <4 min forecasts) → capital and prestige flow to fast-feedback domains → slow-feedback or out-of-distribution domains (climate decadal, novel in-vivo biology) get starved of attention and funding → an intra-science gap opens that is really a validation-cost gap, not an AI-capability gap, but gets misread as 'AI can't do climate'
Domain-specific small open models beat giant ones in science (MSAPairformer 111M, GPN-Star 200M) → scientific AI's economic logic favors data curation over scale → it decouples from the frontier commercial race → academic/government labs retain authorship of scientific AI → a parallel, more-open, more-auditable AI ecosystem persists in science even as Chapter 1's commercial frontier goes dark (81/102 models without training code)
What to watch / leading indicators
Whether replication/validation benchmark scores (ReplicationBench, PaperArena, BixBench, UnivEARTH) improve materially in the next 1-2 editions OR plateau — sustained plateau despite frontier-model gains confirms the validation gap is structural, not a capability lag that scaling closes.
Retraction and failed-replication rates in AI-heavy subfields, and whether journals add AI-generation/verification disclosure requirements — rising retractions or new disclosure mandates would confirm the literature-contamination cascade is materializing.
Funder behavior: whether new grant lines explicitly target validation infrastructure (wet-lab throughput, benchmark maintenance, interoperability/API standards, infrastructure upkeep) versus continued concentration in model-building — the absence of validation-specific funding confirms the under-investment ramification.
The ratio of AI-generated hypotheses/candidates produced to those experimentally confirmed — if generation volume keeps climbing (~26%/yr papers) while the confirmed-discovery list stays short, the bottleneck-shift cascade is confirmed; a closing ratio would refute it.
Tensions & contradictions
Ch.1 says scientific leadership and frontier-model production are consolidating into opaque industry labs (91% industry share), but Ch.5 says scientific AI is dominated by academic/government institutions using smaller open models — the validation-cost structure of science actively resists the concentration documented elsewhere, so 'AI is consolidating' is domain-dependent, not universal.
Ch.2's central thesis is that benchmarks are 'guilty until proven' and saturate in months (HLE +30pts in a year, up to 42% invalid questions), yet T11's entire reassuring story rests on trusting new replication benchmarks (ReplicationBench <20%, PaperArena 38.8%) to faithfully measure the human-AI gap — the same benchmark-skepticism should discount confidence that the gap is really as wide as reported.
Ch.4 documents 'learning penalties' and a ~20% entry-level collapse from AI handling routine cognitive work; Ch.5 cheers agents handling routine replication/analysis. Both can't be costless: the tasks AI is praised for absorbing in science are the same apprenticeship tasks whose loss Ch.4 flags as eroding the human pipeline.
Ch.6 shows that in medicine, constrained clinician-in-the-loop tools delivered real outcomes (18.7% sepsis-mortality reduction) while autonomous LLMs made 11.8-14.6 severely harmful recommendations per 100 cases — strongly implying the same is true in science, yet T11's 'fund the validation layer' framing still leaves room to read autonomous agents as nearly-ready, when the medical evidence says constrained-augmentation is the only proven mode.
The report's own caveat that multiagent gains are modest (2-4 points over single-agent) undercuts the 'team-of-agents autonomous science' narrative that the Kosmos/AI Co-Scientist showcases imply — the orchestration that looks revolutionary buys only marginal reliability, so scaling agent teams won't close the validation gap.
⟂ Contrarian read
The consensus reads the sub-50% agent scores as 'AI isn't ready yet — capability will improve and close the gap.' The contrarian read is that the gap is largely irreducible by better models, because it is not a capability gap but a validation-cost gap rooted in physics and economics. Weather succeeded not because the model was smarter but because reality answers back daily and cheaply; biology, climate, and novel discovery lag because confirmation is slow, expensive, and often out-of-distribution by nature (decadal climate has no training distribution; novel biology has scarce perturbation data). Throwing more compute and bigger models at this does little — the report's own evidence that 111-200M-parameter models beat 40B ones, and that multiagent setups add only 2-4 points, says model scale is not the lever. So the right strategic move is not to wait for frontier models to 'get good enough' for autonomous discovery (they may never, for the hard domains), but to permanently re-architect funding around cheap, fast, trustworthy validation infrastructure as the durable scarce asset — and to treat AI as a hypothesis-generation amplifier bolted to a human-and-wet-lab confirmation engine, indefinitely, not transitionally.
T12
7 - Education

Close the AI-Policy Clarity Gap in Schools, Not Just the Existence Gap

Four in five students use generative AI while only 6% of schools have clear, comprehensive policies.

80% of university students used generative AI for learning in 2025 — double the 40% reported in 2023 — yet only about half of middle and high schools have AI policies, teachers say just 6% are clear and comprehensive, and 47% of students were unsure whether AI use was even allowed. Access is also unequal: 44% of small high schools offer foundational CS versus 91% of large ones, and U.S. CS access has plateaued at 60%. Meanwhile China and the UAE mandated AI education nationally starting in 2025-26 while the U.S. relies on nonbinding state guidance (only 6 states have significant AI content in standards).

80%
University students using generative AI for learning in 2025 (was 40% in 2023)
6%
Schools with clear, comprehensive AI policies (per teachers)
47%
Students unsure whether AI use was even allowed
67%
AI software master's graduates who are nonresidents
    What to do
  • Educators & school leaders: write explicit allowed-use guidance now and target the small/rural/low-income schools where CS access is lowest (44% of small schools)
  • Policymakers: fund teacher training, adopt AI-specific standards ahead of the summer-2026 CSTA revision, and reconsider visa policy given 67% nonresident dependence in AI master's programs
  • Workforce planners: invest in reskilling/upskilling pathways (certificates, on-the-job) since AI skills are acquired largely outside school (India leads at ~3x global average)
CS access gap by school size and locale, 2025
Share of schools offering computer science:
By school size: Large 91% · Medium 77% · Small 44%
By locale: Suburban 71% · Urban 59% · Rural 57%
A small, rural, or low-income school is roughly half as likely to offer CS — the on-ramp to AI literacy.
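A quick way to read the chart figures above is as relative access: each group's share divided by the best-served group's share. The sketch below is illustrative only. It uses the six percentages from the chart, and the helper name `relative_access` is hypothetical. No low-income figure appears in the chart, so none is computed.

```python
# Share of schools offering computer science, in percent, as shown in the chart.
cs_access = {
    "size": {"Large": 91, "Medium": 77, "Small": 44},
    "locale": {"Suburban": 71, "Urban": 59, "Rural": 57},
}


def relative_access(group: dict[str, int], low: str, high: str) -> float:
    """Return the access rate of `low` as a fraction of the access rate of `high`."""
    return group[low] / group[high]


small_vs_large = relative_access(cs_access["size"], "Small", "Large")
rural_vs_suburban = relative_access(cs_access["locale"], "Rural", "Suburban")

print(f"Small vs large schools: {small_vs_large:.2f}")
print(f"Rural vs suburban schools: {rural_vs_suburban:.2f}")
```

On these numbers the "roughly half as likely" reading matches the small-versus-large comparison most closely; the rural-versus-suburban gap in the chart is narrower.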
“Four out of five U.S. high school and college students now use AI for schoolwork, but school policies have not kept pace.” (AI Index 2026, Ch. 7)
Beneath the headline: 2nd-order effects & hidden ramifications (6 second-order · 5 hidden · 5 cascades)
Second-order effects / what follows
The 6%-clear-policy vacuum produces a teacher-discretion lottery, not neutral abstention, and the resulting enforcement falls hardest on students who lack at-home coaching to read the unwritten rules. (near)
When 47% of students don't know if AI is even allowed and only 6% of schools give clear guidance, the operative policy becomes whatever each individual teacher improvises. Students with college-educated, AI-fluent parents get told privately how to use AI 'safely'; first-gen and lower-income students, who can't ask at home, either avoid it (forgoing a tool 80% of peers use) or use it and get caught by an ad-hoc plagiarism judgment. Ambiguity doesn't distribute risk evenly — it converts into an equity gap that maps onto the same lines as the 44%-vs-91% CS-access gap.
Schools will over-rotate toward AI-detection software as a substitute for actual policy, importing a tool with documented false-positive bias against non-native English writers, and the report's own Ch 3 dialect data predicts exactly this failure. (near)
Facing universal student AI use and no clear policy, administrators reach for the cheapest legible 'control': detectors. But Ch 3 shows models perform far worse on non-standard dialects (GPT-5 dropping 99.8%→88.6% on Cerkno Slovenian; halving on others). Detection tools share that monolingual-English bias, so ESL and dialect-speaking students get disproportionately flagged as 'AI-generated' when writing in their own authentic voice. The clarity gap is thus laundered into an accusation engine aimed at the least-served students.
The U.S. trains AI talent it cannot retain and increasingly cannot even admit, hollowing out a master's-level pipeline in which 67% of graduates are nonresidents. (mid)
Ch 7: 67% of AI software master's grads are nonresidents, the highest of any category, exactly as federal visa revocation tightens. Ch 1: inbound AI talent to the US fell 89% since 2017, 80% in the last year. The clarity/standards gap means the domestic K-12→undergrad pipeline isn't replacing them (CS enrollment -11%, access plateaued at 60%). The two failures compound: you can't backfill a collapsing import channel with a domestic channel that has no coherent on-ramp.
A 'jagged literacy' debt accumulates: students taught to use AI without being taught where it silently fails will internalize confident-but-wrong outputs as ground truth, and the schools least able to teach the nuance are the ones already AI-saturated. (mid)
Ch 2 shows IMO-gold models that read analog clocks at 50.6% and hallucinate at 22-94% (Ch 3). 80% of students use these tools for learning; only 6% of schools have clear policy and far fewer have curriculum on model failure modes. Students absorb outputs in the zone where models are most deceptive: fluent, plausible, wrong. Without 'jagged intelligence' literacy, the harm is invisible (you don't know you learned a falsehood), and it concentrates where teacher AI-training is thinnest.
China and the UAE's national AI mandates will compound into a measurable capability divergence that the decentralized U.S. model cannot close reactively, because curriculum and teacher-training lead times are 5-10 years. [long]
Ch 7/8: China and UAE mandated AI education for 2025-26; the US relies on nonbinding guidance with only 6 states having significant AI standards. Mandates set a floor that produces a coherent cohort; guidance produces a patchwork. Because a student who starts mandated AI education in 2026 graduates ~2032+, the gap created this year is locked in for a decade regardless of later U.S. catch-up. Ch 8's compute-sovereignty concentration (China 85 supercomputers) means the capability cohort also inherits the infrastructure to use its skills.
The CSTA summer-2026 standards become a single point of failure for U.S. AI education: a de facto national curriculum set by a nonprofit, with no enforcement mechanism, in a year when a federal-vs-state legal war (Ch 8) makes binding adoption politically radioactive. [mid]
Ch 7 names the revised CSTA standards as the 'expected de facto national baseline.' But de facto ≠ binding; Ch 8 shows the Dec 2025 executive order created a DOJ task force to challenge state AI laws and tie funding to avoiding 'conflicting' legislation. A state that tries to mandate the CSTA AI standards risks being framed as the kind of 'conflicting' AI regulation the federal order targets, chilling exactly the adoption that would close the gap.
Hidden ramifications / who pays
The learning-penalty finding means the policy-clarity gap is silently degrading the human-capital base it's meant to build: students who 'use AI for learning' without skill-building scaffolds get no speed gain and measurably weaker critical thinking, so the 80%-adoption stat is partly measuring de-skilling, not upskilling.
Who bears it: The current K-12 and undergraduate cohort — and downstream employers a decade out who will hire workers credentialed as 'AI-literate' but who actually offloaded the foundational reasoning that AI-literacy presupposes.
Why hidden: Ch 4 buries it: heavy AI reliance for learning produced 'measurable learning penalties with no speed gain,' and Ch 7 notes 55% of college students report a 'mixed effect' on critical thinking — but the headline frames 80% adoption as pure progress. The harm is invisible because it shows up as a competence absence years later, never as a failed test today.
Teachers absorb the entire compliance and judgment burden of the missing policy, becoming unpaid, untrained AI-governance officers who must individually adjudicate plagiarism, equity, and reliability with no standard — a workload and liability shift that no one is funding.
Who bears it: Classroom teachers, especially in the small/rural/under-resourced schools (44% CS access) that also have the least AI training and the least administrative cover when a parent disputes a cheating accusation.
Why hidden: The report frames the gap as 'schools lack policy' (institutional, passive). It doesn't name that 'no policy' is not a void — it's a delegation of an impossible task to the individual teacher. The named bottleneck (teacher preparation, the 'missing link') is treated as a training need, not as a liability and labor transfer already in progress.
The Chegg/survey vendors and AI-detection companies become de facto education policymakers: in the absence of public standards, the data definitions and tools that schools buy will set the operative norms, and those vendors have a commercial interest in maximizing measured 'AI use' and 'detection.'
Who bears it: Public-school students and the public interest in neutral education governance; the locus of curriculum authority migrates from accountable public bodies to unaccountable private vendors.
Why hidden: Ch 7 flags the data problem ('AI-use estimates vary wildly by source: CDT 50%, RAND 54%, College Board 84%') as a measurement caveat, not as evidence that whoever controls the definition controls the policy. The headline '4 in 5' itself depends on which vendor you cite — the number is already a contested commercial artifact.
The plateau at 60% CS access plus the -11% enrollment drop means foundational-CS infrastructure (teachers, labs, course slots) built over 2017-2024 will start to be defunded as a 'declining' program right as AI raises the value of the underlying skill — a procyclical disinvestment that's hardest to reverse.
Who bears it: Future students in districts that cut CS now; and the U.S. AI-hardware track specifically (Ch 7: AI hardware bachelor's already -13% from 2020 peak), which the Ch 1 TSMC single-foundry risk says the country most needs.
Why hidden: Enrollment declines read as 'student demand signal' (rational response to a soft coding market), so cutting capacity looks responsive rather than self-harming. The hidden mechanism is that the entry-level-coding-job slump (Ch 4) and the long-run AI-skill value point in opposite directions, and schools optimize on the visible short-run signal.
The gender gap is being hard-coded into the AI-native generation: AI skills are acquired largely outside formal school (LinkedIn, certificates, on-the-job), and informal channels show a persistent male skew (India 3.05 men vs 1.94 women) — so abandoning formal, universal-access schooling as the delivery mechanism structurally widens the gap that 15 years of growth already failed to close.
Who bears it: Women and girls entering the workforce in the 2030s; and the field's talent base, which Ch 1 shows has had zero gender progress since 2010.
Why hidden: The 'skills acquired outside school' finding is framed positively (resilience, multiple pathways). Its equity cost is unstated: formal schooling is the one channel with a universal-access mandate; informal channels select for those with time, money, and social capital — disproportionately men, per the report's own penetration data.
Cascades / causal chains
No clear policy (6%) → teachers improvise per-student rules → students with AI-fluent parents learn the unwritten rules while first-gen students don't → ambiguity becomes an equity gap layered on top of the existing 44%-vs-91% CS-access gap
80% of students use AI for 'learning' → reliance without skill-scaffolds triggers the documented learning penalty (no speed gain, weaker critical thinking) → a cohort credentialed as AI-literate but with eroded foundational reasoning → employers a decade out discover the pipeline produced fluency without competence
Federal deregulation + Dec 2025 DOJ task force against 'conflicting' state AI laws → states fear that mandating CSTA AI standards counts as conflicting regulation → de facto national curriculum (CSTA, summer 2026) never becomes binding anywhere → U.S. stays on patchwork guidance while China/UAE mandates compound
Visa revocation hits the 67%-nonresident master's pipeline → domestic K-12 on-ramp can't backfill (CS access plateaued at 60%, no clear AI standards) → U.S. trains foreign talent it then expels and can't replace → talent-formation locus migrates abroad (Ch 1: 89% inbound collapse) just as China leads model output and compute
Schools reach for AI detectors as a policy substitute → detectors inherit the monolingual-English bias documented in Ch 3 dialect data → ESL and dialect-speaking students disproportionately flagged for writing authentically → the clarity gap converts into a discriminatory accusation engine aimed at the least-served students
What to watch / leading indicators
AI-detection-software procurement in K-12 districts: a spike in detector adoption (vs. curriculum/teacher-training spend) would confirm schools are substituting surveillance for policy — and watch for the first wave of false-accusation lawsuits from ESL/dialect-speaking families, which would validate the Ch 3 bias-transfer mechanism.
Whether any U.S. state mandates the summer-2026 CSTA AI standards as binding, and whether the DOJ AI Litigation Task Force challenges it as 'conflicting' regulation. Binding adoption refutes the chilling-effect cascade; a legal challenge or zero binding adoptions confirms it.
The 2026-27 nonresident share of AI software master's enrollment (currently 67%). A sharp drop with no domestic offset confirms the train-and-expel pipeline collapse; flat/rising would refute the visa-impact thesis.
Longitudinal critical-thinking/foundational-skill assessments (not self-report) for heavy-AI-use student cohorts. A measured decline in unaided reasoning among high-AI-use students would confirm the learning-penalty-as-hidden-deskilling ramification; the absence of such measurement being commissioned is itself a signal that the harm will stay invisible.
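The first indicator above turns on whether detectors flag ESL and dialect-speaking writers more often than others. One way a district could check that is a per-group false-positive audit. The following is a minimal Python sketch under stated assumptions: the function name `false_positive_rates`, the record layout, the group labels, and the toy records are all hypothetical and invented for illustration. None of these numbers come from the report.

```python
from collections import defaultdict


def false_positive_rates(records):
    """Per-group false-positive rate of a detector.

    records: iterable of (group, flagged_as_ai, actually_ai) tuples.
    Only human-written work counts, because a false positive is a
    human-written essay that the detector flags as AI-generated.
    """
    flagged = defaultdict(int)
    human = defaultdict(int)
    for group, flagged_as_ai, actually_ai in records:
        if actually_ai:
            continue  # skip AI-written essays; they cannot be false positives
        human[group] += 1
        if flagged_as_ai:
            flagged[group] += 1
    return {group: flagged[group] / human[group] for group in human}


# Toy records, invented purely for illustration (NOT data from the report).
toy_records = [
    ("native", False, False), ("native", False, False),
    ("native", True, False), ("native", False, False),
    ("esl", True, False), ("esl", True, False),
    ("esl", False, False), ("esl", False, False),
    ("esl", True, True),  # AI-written essay, ignored by the audit
]

rates = false_positive_rates(toy_records)
disparity = rates["esl"] / rates["native"]  # ratio of false-positive rates
print(rates, disparity)
```

On the toy records the rates are 0.25 for `native` and 0.5 for `esl`, a disparity ratio of 2.0. A real audit would need verified authorship labels, which is the hard part; the arithmetic itself is simple.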
Tensions & contradictions
The headline treats 80% student AI use as the demand signal justifying urgency, but Ch 4's learning-penalty finding and Ch 7's 55%-mixed-effect-on-critical-thinking imply much of that 80% may be actively harmful — so 'close the gap by enabling more use' could accelerate the damage rather than fix it. Clarity about WHAT use is the real need, not clarity that use is permitted.
The theme calls for binding national standards while Ch 8 documents the U.S. actively dismantling the mechanism to create them (rescinded EO 14110, DOJ task force vs. state AI laws, funding tied to avoiding 'conflicting' legislation). The policy the theme prescribes is precisely what the federal government is now litigating against.
Ch 9: U.S. trust in its own government to regulate AI is the lowest surveyed (31%), and 41% want regulation to go further vs 27% too far — yet the same public elects the officials pursuing deregulation. The theme's call for clear public policy collides with a legitimacy deficit that makes binding standards politically fragile even where the public wants them.
The '4 in 5 students' figure the deck leads with is, per Ch 7's own caveat, the high-end estimate (College Board 84%) among a wildly varying set (CDT 50%, RAND 54%). The urgency framing leans on the most dramatic available number while the chapter explicitly flags it as source-dependent.
Ch 7 reframes the AI-PhD 'brain drain' as redistribution (industry share fell 77%→65%, academia nearly doubled) — a relatively healthy domestic signal — yet the master's-level story (67% nonresident, visa threat) and the K-12 access plateau point to pipeline collapse. The same chapter holds both 'the pipeline is fine' and 'the pipeline is breaking,' depending on which rung you examine.
⟂ Contrarian read
The 'clarity gap' framing is itself a category error that could make things worse. The report assumes the fix is clearer permission rules — but the most defensible non-consensus read is that premature policy clarity, written now by under-trained administrators reacting to vendor-supplied 'AI use' panic, would likely codify exactly the wrong things: detection-tool mandates that punish ESL students (Ch 3 bias), blanket bans that drive use underground, or blanket permission that deepens the learning penalty (Ch 4). The 6%-clear stat is read as a failure, but a school that honestly says 'we don't yet know how to govern a tool whose own builders can't measure it reliably' (Ch 2/3: benchmarks 42% invalid, hallucination 22-94%, transparency falling 58→40) may be more epistemically honest than one with a confident, comprehensive, and wrong policy. The real gap isn't clarity — it's the absence of any trustworthy basis for clarity, because the technology's reliability and the field's measurement apparatus are both deteriorating. Rushing to close the 'clarity gap' before closing the 'do we even know what good use looks like' gap risks hard-coding bad norms into the one institution with universal reach.
T13
9 - Public Opinion

The Expert-Public Chasm — A Fault Line for AI Politics

Builders and the public hold incompatible mental models of AI, and the public's appetite for guardrails is real and bipartisan.

73% of AI experts expect a positive impact on how people do their jobs versus just 23% of the U.S. public — a 50-point gap, with parallel splits on the economy (69% vs 21%) and medical care (84% vs 44%). The public expects AI to destroy jobs (64% expect fewer, only 5% more) and trusts its own government to regulate AI least in the world (U.S. 31% vs 54% global). Yet across all 50 states, 'regulation won't go far enough' beats 'will go too far' 41% to 27%, with a third still unsure.

50-pt gap
Experts (73%) vs public (23%) on AI's positive impact on jobs
31%
U.S. trust in its own government to regulate AI — lowest surveyed (global 54%)
64%
U.S. adults expecting AI to lead to fewer jobs (only 5% expect more)
41% vs 27%
U.S. 'regulation won't go far enough' vs 'will go too far'
    What to do
  • Policymakers: act on the bipartisan appetite for guardrails (41% 'not far enough' vs 27% 'too far') and close the one-third 'not sure' gap with plain-language explanation of what regulation would do
  • U.S. regulators: confront the world-lowest 31% trust deficit by borrowing credibility from the EU model (most-trusted globally at 53%)
  • Leaders communicating on jobs: surface the expert-public gap honestly rather than dismissing or amplifying the public's 64%-fewer-jobs fear
  • Executives: plan for the faster expert adoption curve (~18% of work hours by 2030) while messaging documented limits of companion AI
The expert-public optimism gap by domain
Domain · U.S. public · AI experts
Medical care · 44% · 84%
Jobs · 23% · 73%
The economy · 21% · 69%
Experts are far more optimistic than the public in every domain except elections and personal relationships.
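The per-domain gaps behind the chart are simple subtraction. A short Python sketch, using only the three pairs of percentages quoted in this theme (the variable names are illustrative):

```python
# (U.S. public %, AI experts %) expecting a positive impact, per the figures above.
survey = {
    "Medical care": (44, 84),
    "Jobs": (23, 73),
    "The economy": (21, 69),
}

# Gap in percentage points = expert share minus public share.
gaps = {domain: experts - public for domain, (public, experts) in survey.items()}
widest = max(gaps, key=gaps.get)

for domain, gap in gaps.items():
    print(f"{domain}: {gap}-point gap")
print("Widest gap:", widest)
```

This gives 40 points on medical care, 50 on jobs, and 48 on the economy, so the headline 50-point figure is the widest of the three but not an outlier.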
“On how people do their jobs, 73% of experts expect a positive impact, compared to just 23% of the public, a 50-point gap.” (AI Index 2026, Ch. 9)
Beneath the headline: 2nd-order effects & hidden ramifications · 6 second-order · 5 hidden · 4 cascades
Second-order effects / what follows
AI labs and their critics will each weaponize the SAME survey data, because the chasm is a Rorschach test, not a settled fact [near]
A 50-point gap (73% expert vs 23% public optimism on jobs) is symmetric ammunition: builders cite it as proof the public is uninformed and will 'come around' as it did with electricity/internet, justifying faster deployment; safety advocates cite identical numbers as proof builders are captured insiders whose optimism is self-interested. The same Pew table feeds two opposed political narratives, so the data settles nothing and instead hardens both camps. Horizon: near term.
The 'regulation won't go far enough' majority (41% vs 27%) will fail to translate into federal law despite being real and bipartisan, because it is a soft, low-salience preference sitting atop a one-third 'not sure' bloc [near]
Diffuse majorities lose to concentrated interests in legislative bargaining (Olson). The pro-guardrail 41% is undifferentiated, low-intensity, and undecided-adjacent (>33% unsure in most states), while the opposing force — industry, now 37% of congressional AI witnesses (Ch.8), backed by a Dec 2025 executive order creating a DOJ task force to sue states — is concentrated, well-funded, and high-intensity. Latent majority preference does not equal mobilized constituency. Horizon: near-to-mid term.
State legislatures, not Congress, become the binding venue where the public's guardrail appetite gets cashed, turning the chasm into a 50-state compliance patchwork [mid]
Because the pro-regulation lean 'holds across all 50 states' (Ch.9) while federal action deregulates, the only level of government responsive to that uniform preference is the state. Ch.8 already shows state AI bills rising from <10 (2020) to 150 (2025) with California at 62. The public-mood data is the demand signal; state bills (SB 53, SB 243, TRAIGA, Colorado AI Act) are the supply. The Dec 2025 EO weaponizing federal funding against 'conflicting' state law collides directly with this. Horizon: mid term.
Whoever is right about adoption SPEED (experts: 18% of work hours by 2030; public: 10%) determines whether the chasm narrows or explodes, and the expert tail (top decile >40%) is the politically dangerous scenario [mid]
If experts' faster forecast is correct, the gap between expectation and lived experience widens fastest precisely among the 64% of the public already expecting job losses, converting diffuse anxiety into acute grievance right as displacement becomes visible. The entry-level signal is already here: Ch.4 shows ~20% employment decline for software developers ages 22-25. Rapid expert-rate adoption validates the public's FEAR (fewer jobs) while invalidating the public's TIMELINE (slower), the worst combination for institutional trust. Horizon: mid term.
The 79% cross-national demand for mandatory AI-use disclosure becomes the politically frictionless wedge that passes everywhere the harder guardrails stall [mid]
Disclosure is shared 'across all 30 countries surveyed, even where overall trust in institutions was lower' (Ch.9) — it is the rare guardrail with no partisan or national split and near-zero cost to legislate. It also aligns with China's mandatory content-labeling and EU transparency rules (Ch.8). Legislators seeking to show responsiveness to the guardrail-hungry public at low political cost will reach for labeling/disclosure first, making 'AI-generated' tags ubiquitous while substantive liability/safety rules lag. Horizon: near-to-mid term.
Low US trust in domestic AI governance (31%, lowest surveyed) cedes regulatory soft power and de facto standard-setting to the EU (53% global trust), inverting the usual US-leads-tech dynamic [long]
Trust is the currency of regulatory legitimacy. When a 25-country median trusts the EU (53%) over the US (37%) to regulate AI, EU frameworks (AI Act, live since Aug 2025) become the credible global template that multinationals adopt by default to access the most-trusted market — the Brussels Effect. The US builds the models but loses authorship of the rules, and its own citizens (31% trust) won't defend its lighter-touch regime. Horizon: long term.
Hidden ramifications / who pays
The chasm is selective, not total — and the report's 'two incompatible mental models' framing over-states it. On WHICH jobs are at risk (cashiers, journalists, software engineers) experts and public agree; they also converge on elections and personal relationships. The divergence is concentrated almost entirely on AGGREGATE optimism and TIMELINES, not on the texture of impact.
Who bears it: Communicators and pollsters who treat 'the gap' as monolithic; the public, whose specific, often-accurate risk judgments get dismissed as generalized technophobia
Why hidden: The headline 50-point number is so dramatic it crowds out the finding (Ch.9 lines 87-88, 891-893) that the groups reach 'strong consensus' on occupation-level risk. The public is not wrong about everything — it is differently calibrated on speed and net-optimism while being well-calibrated on who gets hit.
Expert optimism is partly an artifact of WHO counts as an expert: Pew's 'AI experts' were authors/presenters at AI conferences — i.e., people professionally and often financially invested in AI succeeding. The 73% figure may measure selection and incentive, not superior foresight.
Who bears it: Policymakers who treat the expert number as the 'correct' baseline against which public opinion is the deviation to be corrected
Why hidden: The report flags the sourcing in a footnote (conference authors/presenters) but the analytical framing still implicitly privileges experts as the calibrated party. On net-optimism (a value/interest judgment, not a forecast), there is no reason to treat the builder class as the ground truth.
The pro-regulation majority is demographically lopsided toward older (65+: 51%) and college-educated (46%) adults — the groups with the most political voice but the least AI exposure — while younger, lower-education, higher-AI-exposure groups are LESS likely to demand guardrails.
Who bears it: Younger and less-educated workers, who are the most labor-exposed (Ch.4: entry-level devs down ~20%) yet under-index on wanting protection that would most benefit them
Why hidden: The 'bipartisan, all-50-states' framing flattens an age/education skew that matters enormously: the loudest demand for guardrails comes from the cohort least displaced, while the canaries in the coal mine are quieter — a representation mismatch that will distort which protections actually get built.
The Pew expert/public comparison data was collected in 2024 but is being read in 2026 against a capability frontier that moved enormously in between (HLE +30pts, agents 12%→66% on OSWorld). The 'chasm' may be partly a snapshot lag, not a stable structural divide, and could be narrower OR wider now.
Who bears it: Anyone using the 2026 report to make 2026 decisions about a 2024 measurement; forecasters treating the gap as current
Why hidden: The report itself notes 'model performance continues to accelerate' since the survey (Ch.9 line 763), but the striking-stat presentation strips the temporal caveat. A two-year-old snapshot of expectations about a fast-moving frontier is being treated as a present-tense fact.
Workplace-adoption data (Ch.9: >80% AI use in India, China, Nigeria, UAE, Saudi Arabia) directly contradicts population-adoption data (Ch.4: adoption correlates with GDP per capita; US ranks 24th). The 'public' whose anxiety the chasm describes is mostly Western; the heaviest actual USERS of workplace AI are in emerging economies barely represented in the expert-vs-public framing.
Who bears it: Emerging-economy workers who are adopting AI fastest but whose sentiment (e.g., India's +14pp concern spike, the largest surveyed) is excluded from the US-centric expert/public chasm
Why hidden: The flagship 50-point gap is a US-only Pew comparison. The report acknowledges the two datasets 'point in opposite directions' (Ch.9 tensions) but the theme inherits the US frame, hiding that the most AI-saturated workforces are also souring fastest, outside the chasm's measurement entirely.
Cascades / causal chains
Builder optimism (73%) drives faster deployment → expert-rate adoption materializes (toward 18% of work hours by 2030, with a >40% tail) → entry-level displacement already visible (devs 22-25 down ~20%) accelerates and spreads → the public's pre-existing 64%-fewer-jobs fear is empirically VALIDATED while its slower-timeline guess is invalidated → grievance becomes acute and concentrated in young exposed cohorts → demand for guardrails shifts from the diffuse older/educated majority to an intense young-worker constituency → regulation finally mobilizes, but reactively and punitively rather than by design
Federal deregulation (Dec 2025 EO + DOJ task force) blocks the uniform 41%-want-more-regulation preference at the national level → that preference has nowhere to go but state legislatures → states pass divergent AI laws (150 in 2025, CA at 62) → 50-state compliance patchwork emerges → the federal-vs-state collision goes to court → outcome determines whether the US has any coherent AI governance, with the public's guardrail appetite as the demand engine the whole time
US records lowest-in-world trust in its own AI regulator (31%) → domestic legitimacy vacuum → EU (53% global trust) becomes the credible standard-setter → multinationals adopt EU AI Act compliance as the default global baseline (Brussels Effect) → the US writes the models but the EU writes the rules → US citizens, already distrustful of their own regime, have no reason to defend it against the imported standard
79% global demand for AI-use disclosure is universal and low-cost → legislators reach for labeling/transparency first to signal responsiveness → 'AI-generated' tags become ubiquitous (aligning with EU + China labeling regimes) → public feels 'something was done' → political pressure for harder liability/safety rules vents through the cheap valve → substantive guardrails on the high-harm systems (the opaque frontier models of Ch.1/Ch.3) stall while cosmetic transparency proliferates
What to watch / leading indicators
Whether the one-third 'not sure' bloc on US regulation shrinks and breaks toward 'not far enough' or 'too far' — movement here, especially among young AI-exposed workers (currently UNDER-indexed on wanting guardrails despite being most displaced), is the leading indicator of whether the soft majority becomes a mobilized constituency.
The outcome of the Dec 2025 federal-vs-state collision: does the DOJ AI Litigation Task Force successfully preempt state laws like CA SB 53/SB 243, or do states hold? This determines whether the public's uniform 41% guardrail preference has any binding outlet.
The 2026/2027 re-run of the Pew expert/public comparison (current numbers are 2024 data): does the 50-point gap narrow as adoption rises and the public experiences AI directly, or widen as displacement validates fear? Narrowing refutes the 'stable structural divide' read; widening confirms the fault-line thesis.
Whether disclosure/labeling mandates (the 79%-supported, low-cost guardrail) pass FAST while substantive frontier-safety rules stall — a divergence in legislative velocity between cosmetic transparency and high-harm liability would confirm the cheap-valve cascade.
Tensions & contradictions
The theme says the public holds an 'incompatible mental model,' but Ch.9's own occupation data shows 'strong consensus' between experts and public on which specific jobs are at risk — the models are incompatible on net-optimism and speed, not on the substance of impact. The chasm is narrower and more selective than 'incompatible' implies.
Ch.9 frames experts as the more calibrated party as 'model performance continues to accelerate,' implicitly siding with builder optimism. But Ch.2 shows the measurement apparatus is breaking (up to 42% invalid benchmark questions, leaderboards possibly gamed) and Ch.3/Ch.5/Ch.6 show capability NOT translating to reliability (agents fail ~1 in 3; <20% paper replication; 2.4% of medical AI devices have RCT evidence). The public's slower-progress skepticism may be the better-calibrated read on real-world deployment, even if experts are right about benchmarks.
The public's 64%-fewer-jobs fear is treated as excessive relative to experts' 39%, yet Ch.4's hardest empirical signal — entry-level developers down ~20%, early-career AI-exposed roles down ~16% — vindicates the direction of public fear while large-scale aggregate displacement has NOT appeared. Both the alarmism and the dismissal are half-right.
Trust in government to regulate (US 31%) and demand FOR more regulation (US 41% 'not far enough') move in OPPOSITE directions in the same population — Americans simultaneously want more AI regulation and distrust the government that would provide it. The theme's 'appetite for guardrails is real' under-states that this appetite is paired with a refusal to trust the only body that can deliver it.
Ch.9's workplace-adoption data (>80% use in emerging economies) contradicts Ch.4's GDP-correlated population adoption (US 24th). The 'public' in the expert-public chasm is a US/Western public; the world's heaviest AI users sit outside the comparison and are souring fastest (India concern +14pp), so the chasm is not the global story it is framed as.
⟂ Contrarian read
The 'expert-public chasm' is being mis-framed as the public being uninformed and needing to 'come around.' The more defensible read is that the public is the better-calibrated forecaster of DEPLOYED reality, and the experts are the outliers. The experts are conference authors and presenters — a selected, interested class whose 73% optimism reflects proximity and incentive, not privileged foresight, and whose forecasts concern an environment (benchmarks) that the report itself shows is broken (42% invalid questions, gaming) and decoupled from real-world reliability (agents fail 1-in-3, <20% paper replication, only 2.4% of medical AI devices RCT-backed). On the things that have already happened — entry-level job loss (devs down ~20%), the gap between benchmark wins and patient/worker outcomes — the public's pessimism on net-optimism and its slower-timeline skepticism are tracking deployment reality better than expert benchmark-euphoria. In that light, the 'chasm' is not a literacy gap to be closed by educating the public toward the expert view; it is a reality gap that will close as experts are forced down toward the public's more sober read of what AI actually delivers in the wild. The political risk is not that the public is too pessimistic — it is that policy is being set to the experts' optimistic timeline, which the implementation chapters (3, 5, 6) suggest is the one most likely to be wrong.
The meta-theme

The instruments are breaking

The most destabilizing finding in the 2026 report is not about any model — it is that the instruments we use to judge models are failing precisely as the models converge. Evaluations engineered to challenge AI for years are saturating in months (Humanity's Last Exam leapt 30 points to 38.3% in a single year), and a Stanford review of nine widely-cited benchmarks found invalid-question rates from 2% up to 42% on GSM8K. At the same time, frontier labs have stopped disclosing how their most capable systems are built — 81 of 102 notable 2025 models shipped without training code, parameter counts and dataset sizes are no longer published, and the Foundation Model Transparency Index average fell from 58 to 40. The AI Index itself notes it 'assumes that company-reported results are accurate,' even as third-party tests find models perform worse independently than developers report, and leaderboard standing 'may partly reflect adaptation to the platform rather than general capability.' Capability benchmarks are reported almost universally while responsible-AI benchmarks are reported almost never, and there is no shared framework for the trade-offs between safety, privacy, fairness, and accuracy — so improving one dimension can silently degrade another. The result is an accountability vacuum: the field is converging on a frontier it can no longer cleanly see, deploying more consequential systems while losing the ability to verify what they actually do. For anyone procuring, governing, or betting on AI, the practical lesson is that the headline number is now guilty until independently proven.

38.3%
Humanity’s Last Exam score — up 30 points in a single year
2–42%
Invalid-question rate across nine widely-cited benchmarks (Stanford review)
81 / 102
Notable 2025 models that shipped with no training code
58 → 40
Foundation Model Transparency Index average, 2024 → 2025
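To see why an invalid-question rate matters for a headline score, it helps to bound what the score could be on the valid questions alone. The Python sketch below is an illustration, not a method from the report. It assumes that nothing is known about how the model was graded on the invalid items, so they may all have been marked correct or all marked incorrect. The function name `valid_accuracy_bounds` and the reported accuracy of 0.95 are hypothetical; only the 2% and 42% invalid rates come from the text above.

```python
def valid_accuracy_bounds(reported_acc, invalid_rate):
    """Bound accuracy on valid questions only.

    reported_acc: fraction of ALL questions marked correct (0..1).
    invalid_rate: fraction of questions that are invalid (0 <= rate < 1).
    Assumption: the split of correct marks between valid and invalid
    questions is unknown, so both extremes are considered.
    """
    if not 0.0 <= invalid_rate < 1.0:
        raise ValueError("invalid_rate must be in [0, 1)")
    if not 0.0 <= reported_acc <= 1.0:
        raise ValueError("reported_acc must be in [0, 1]")
    valid_share = 1.0 - invalid_rate
    # Worst case: as many correct marks as possible sit on invalid questions.
    lower = max(0.0, reported_acc - invalid_rate) / valid_share
    # Best case: as many correct marks as possible sit on valid questions.
    upper = min(1.0, reported_acc / valid_share)
    return lower, upper


# Hypothetical reported accuracy; the invalid rates are the range quoted above.
for rate in (0.02, 0.42):
    low, high = valid_accuracy_bounds(0.95, rate)
    print(f"invalid rate {rate:.0%}: valid-question accuracy in [{low:.3f}, {high:.3f}]")
```

At a 2% invalid rate the interval stays narrow (roughly 0.949 to 0.969), while at 42% it widens to roughly 0.914 to 1.0. Under these assumptions the same reported number carries far less information on the noisier benchmark.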
So what

Five moves for anyone betting on AI

  1. Stop buying and governing AI on leaderboard rank. With top models within single digits of one another, select on cost, latency, reliability, and domain-specific performance, and treat any single benchmark as guilty until it has been independently validated with contamination controls. Deploy agents only with humans in the loop: a ~1-in-3 failure rate on structured tasks and sub-70% on tax, mortgage, and finance work means AI is a supervised aid, not an autonomous professional.
  2. Plan around two hidden choke points. Inference, not training, is becoming the dominant and least-visible cost and resource liability (gigawatts of power, cities' worth of water), and the entire hardware stack runs through one foundry in Taiwan — both belong in board-level risk and policy planning, not footnotes.
  3. Protect the entry-level rung. AI's first labor signal is concentrated in the youngest workers in exposed fields; eroding the on-ramp eventually starves the senior-talent pipeline. Target AI at structured, monitorable tasks where gains are real (support +14-15%, software +26%) rather than deep-reasoning work, and fund apprenticeship and reskilling pathways now.
  4. Build to the states, not a federal framework. Binding U.S. AI rules currently live in state law (150 bills in 2025), and the federal-vs-state collision means companies must plan for divergent, litigation-driven compliance across California, Texas, Colorado, the EU AI Act, and China's labeling rules simultaneously.
  5. Compete and contribute on data, not just capex — and fund the validation layer. Small, data-centric models now match models up to ~200x larger, collapsing the capital barrier; in science and medicine the binding constraint is no longer idea generation but experimental confirmation, clinical evidence, and reliable end-to-end execution, which is where investment and oversight should concentrate.
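The first move above, selecting on cost, latency, reliability, and domain-specific performance rather than leaderboard rank, can be sketched as a simple weighted comparison. Everything in this example is hypothetical: the candidate names, the measurements, the units, and the weights are placeholders to be replaced with figures from your own independent tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost_per_1k_tasks: float  # lower is better (hypothetical unit)
    p95_latency_s: float      # lower is better
    reliability: float        # share of in-house tasks completed correctly
    domain_score: float       # score on your own held-out domain test set


def _normalize(values, higher_is_better):
    """Min-max scale values to 0..1 so that 1.0 is always the best candidate."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def rank_candidates(candidates, weights):
    """Return (name, score) pairs sorted from best to worst weighted score."""
    columns = {
        "cost": _normalize([c.cost_per_1k_tasks for c in candidates], False),
        "latency": _normalize([c.p95_latency_s for c in candidates], False),
        "reliability": _normalize([c.reliability for c in candidates], True),
        "domain": _normalize([c.domain_score for c in candidates], True),
    }
    total_weight = sum(weights[key] for key in columns)
    scored = []
    for i, candidate in enumerate(candidates):
        score = sum(weights[key] * columns[key][i] for key in columns) / total_weight
        scored.append((candidate.name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical measurements from an in-house evaluation, not from the report.
candidates = [
    Candidate("model_a", cost_per_1k_tasks=40.0, p95_latency_s=6.0, reliability=0.70, domain_score=0.66),
    Candidate("model_b", cost_per_1k_tasks=10.0, p95_latency_s=2.0, reliability=0.68, domain_score=0.64),
    Candidate("model_c", cost_per_1k_tasks=25.0, p95_latency_s=3.0, reliability=0.60, domain_score=0.60),
]
weights = {"cost": 1.0, "latency": 1.0, "reliability": 2.0, "domain": 2.0}

for name, score in rank_candidates(candidates, weights):
    print(f"{name}: {score:.3f}")
```

With these made-up inputs, model_b ranks first even though model_a has the higher reliability and domain scores, because the two are close on quality and far apart on cost and latency. The weights are a judgment call, and the ranking is only as trustworthy as the independent measurements fed into it.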
Cover plate — a lone operator before a vast analog control panel
A graphic synthesis · The AI Index 2026

The Year the Rulers Broke

This is the year our instruments for measuring artificial intelligence failed us — the benchmarks that timed the machines, the leaderboards that ranked them, the independent tests that checked them all broke at once, leaving us blind to what we had built at the very hour we handed it the power to act alone.

Plate 01 · The Rulers We Trusted · The Trust
[Illustrated plate: The Rulers We Trusted · The Plate Room]
For years we judged artificial intelligence by its scores — the benchmarks that timed each model, the leaderboards that ranked the chatbots, the transparency index that tracked how openly each lab disclosed its work. These were our rulers, the measuring sticks by which we knew how far the machines had come, and we read their steady dials like instruments on a dashboard, trusting the numbers told us the truth.
Plate 02 · The Frontier Converges · The Convergence
[Illustrated plate: The Frontier Converges · The Plate Room]
Then the frontier converged: the leading models — from Anthropic, Google, OpenAI, and xAI — drew within a hair of one another, the open models caught the secretive labs, and the race between the United States and China narrowed to almost nothing. Capability kept climbing but flattened toward a shared ceiling — and raw power stopped being the prize worth racing for.
Plate 03 · The Machine Runs Alone · The Severance
[Illustrated plate: The Machine Runs Alone · The Plate Room]
Then artificial intelligence crossed the line from helping us to acting for us — coding agents writing whole programs end to end, systems running entire scientific pipelines unattended, AI scribes deployed across hospital wards. And the people who built the work were pushed from its center to its edge, supervisors at the rim of a machine that now runs itself.
Plate 04 · The Jagged Gift
[Illustrated plate: The Jagged Gift · The Plate Room]
But the intelligence came jagged: the same model that won a gold medal at the International Mathematical Olympiad could not reliably read the hands of an analog clock, and still botched roughly one task in every three. Genius and blunder lived inside one machine — and from the outside, you could no longer tell which you were about to get.
Plate 05 · The Rulers Break · The Blinding
[Illustrated plate: The Rulers Break · The Plate Room]
Then the measuring sticks themselves broke: benchmarks built to challenge AI for years were beaten in months, the leading labs went silent about how their models were trained and shipped them as sealed black boxes, and independent testers found the machines performing worse than the labs had claimed. We had lost the power to verify what we made — and went blind at the very hour of our greatest strength.
Plate 06 · The Empty Chair · The Thinning
[Illustrated plate: The Empty Chair · The Plate Room]
And the cost fell on the watchers: documented AI failures climbed, the entry-level jobs that train tomorrow's experts and overseers — the young software developers first — began to vanish, and oversight splintered as states wrote their own rules and Washington moved to strike them down. We had grown the power far faster than the people and institutions left to watch it.
Plate 07 · The Lever · The Fork
[Illustrated plate: The Lever · The Plate Room]
Now the road forks, and the choice cannot be deferred: keep humans in the loop and rebuild real, independent verification of what these systems actually do — or hand autonomy to machines whose reported numbers we can no longer trust. A hand rests on the lever, and the road splits beneath it.
Plate 08 · The Last Watch · The Vigil
[Illustrated plate: The Last Watch · The Plate Room]
The task now is not a cleverer machine but the return of our sight — the ability to see and check these systems before we trust them alone — even as the whole edifice balances on single fragile points, like the one foundry in Taiwan that fabricates nearly every advanced AI chip on earth. The human returns to the watch, and everything narrows to a single point of fate.

“This was the year artificial intelligence outran the very instruments built to measure it — the benchmarks saturating in months, the labs disclosing less, the independent checks diverging from the scores we were sold. The machines did not become unknowable on their own; we lost the ability to know them, precisely as we began to let them act without us. What hangs in the balance is not whether they grow stronger — they will. It is whether we rebuild the power to verify them, to see clearly and check honestly what they truly do, before we hand them the autonomy to run with no human watching at all. This may be among the last moments we still clearly get to choose.”

The Year the Rulers Broke · a visual reading of the Stanford AI Index 2026