The 'comparable results' framing papers over a real, persistent capability gap: Olmo 3 scored 78.1% on AIME 2025 versus 92.7% for Grok-4, 94.3% for GPT-5 high, and 95.7% for Gemini, a deficit of 14.6 to 17.6 points. For the hardest, highest-value reasoning, the giant still wins; 'matches' is true only at a chosen threshold of 'good enough.'
Who bears it: Buyers and executives told they can drop the frontier model for a small one. They may silently accept a reliability haircut of roughly 15 to 18 points on exactly the tasks where the last few points matter most.
Why hidden: The headline keystat ('~90x fewer params, comparable AIME') compresses a raw gap of 14.6 to 17.6 points into the word 'comparable.' The report's own R&D tension also flags that flattened parameter counts 'likely understate real growth' because frontier labs stopped disclosing them, so the comparison is being made against a moving, partly hidden target.
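The arithmetic behind that gap can be checked directly. The sketch below is illustrative only: the four scores are the AIME 2025 figures quoted above, while the function names and the error-rate view (treating 100 minus the score as a miss rate) are assumptions added here, not part of the report.

```python
# AIME 2025 scores as quoted in the text (percent correct).
# The dictionary layout and function names are hypothetical.
AIME_2025 = {
    "Olmo 3": 78.1,
    "Grok-4": 92.7,
    "GPT-5 high": 94.3,
    "Gemini": 95.7,
}


def point_gaps(scores, baseline):
    """Percentage-point lead of each other model over the baseline."""
    base = scores[baseline]
    return {
        name: round(score - base, 1)
        for name, score in scores.items()
        if name != baseline
    }


def error_rate_ratios(scores, baseline):
    """How many times more problems the baseline misses than each model.

    Assumes miss rate = 100 - score, which is an interpretation
    introduced for this sketch.
    """
    base_err = 100.0 - scores[baseline]
    return {
        name: round(base_err / (100.0 - score), 2)
        for name, score in scores.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, gap in point_gaps(AIME_2025, "Olmo 3").items():
        print(f"{name}: +{gap} points over Olmo 3")
    for name, ratio in error_rate_ratios(AIME_2025, "Olmo 3").items():
        print(f"Olmo 3 misses {ratio}x as many problems as {name}")
```

On these figures the point gaps are 14.6, 16.2, and 17.6, and the small model misses roughly three to five times as many problems as each frontier model. Read as miss rates rather than scores, 'comparable' looks even more generous.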
Data deduplication, pruning, and curation are themselves opaque, unaudited, and largely undisclosed — so the data-centric path inherits and deepens the transparency crisis rather than escaping it. Every model except K2 Think and Olmo 3 32B scored zero on pre-training-data transparency (Ch.3).
Who bears it: External auditors, regulators, and downstream scientists who need to know what was pruned out (and what bias that pruning introduced) to trust a domain model's outputs.
Why hidden: The theme celebrates curation as the new winning method but never notes that 'curation' is an undocumented editorial act on the training corpus. It is the same upstream-data opacity (FMTI's weakest dimension, 58→40) that Theme T04 identifies, now made load-bearing for performance.
Cheap, domain-specific bio/genomics model-building lowers the capital and expertise barrier to dual-use capability: the same parameter efficiency that lets a 200M model predict variant effects also lowers the cost of misuse. Meanwhile, biosecurity oversight is structurally neglected (only 14 biosecurity publications in all of medical AI in 2025).
Who bears it: Public-health and biosecurity institutions, and society at large, who bear a proliferation risk that scales with how easy and cheap these generative biological-design models become to build.
Why hidden: The report frames small models purely as a democratization and public-good win; the proliferation/dual-use downside of cheap generative genome-scale models (Evo 2: 40B params, fully open weights, 9.3T base pairs) is mentioned nowhere in the theme and barely anywhere in the corpus (14 papers).
The data bottleneck is unevenly distributed, so 'win on data' hard-codes existing data inequities into model quality: medical imaging training data is ~100x smaller than non-medical, functional/perturbation biology data is scarce, ecological data is biased toward well-studied taxa, and patient-perception data is dominated by US/UK/Germany.
Who bears it: Patients, species, and populations in data-poor domains and geographies (sub-Saharan Africa, Latin America, Southeast Asia, rare diseases, understudied taxa) — when data is the moat, the data-poor get permanently worse models.
Why hidden: When compute was the barrier, money could in principle close the gap; when data is the barrier, no amount of money conjures clinical data that was never collected. The theme's 'lowers the barrier for smaller players' optimism therefore inverts for anyone in a data desert, and the report notes the data gaps only in scattered tension footnotes, never connecting them to the small-model thesis.
Pushing the field toward narrow, data-curated specialist models fragments the ecosystem into many in-domain winners with no shared benchmark or interoperability, raising integration and governance costs. Ch.6 explicitly notes that medical imaging AI 'lacks shared benchmarks,' and Ch.5 notes that most domain benchmarks are brand-new with no longitudinal data.
Who bears it: Health systems, labs, and integrators who must now evaluate, validate, and stitch together dozens of bespoke specialist models instead of governing a few general ones.
Why hidden: The theme's 'favor domain-specific open models over giant general ones' recommendation quietly externalizes the cost of running a zoo of incomparable specialist models onto the adopter; specialization's fragmentation tax is invisible in a per-model performance comparison.