
Scaling is running the wrong race

The failures that matter are structural. More parameters won't fix them.

The AI industry is spending over $600 billion this year on AI infrastructure. The four largest US hyperscalers (Amazon, Alphabet, Meta, Microsoft) have each committed to capex plans exceeding $100 billion, making this the largest single-year corporate investment cycle in history. Dell'Oro Group's March 2026 report forecasts that total data center capex will surpass $1 trillion this year.

Not all of this is about making models bigger. A growing share goes to inference capacity, networking, and storage. But the underlying bet is the same: more compute, applied to the same architecture, produces better results. Whether the compute trains a larger model or serves a scaled-up model to more users, the assumption is that the model's representations are good enough and the constraint is capacity.

On average, that bet pays off. And it's irrelevant. Because "better" on average is not the same as "better where it matters."

The geometry of good enough

Here's what scaling actually does at the representational level. A transformer has a fixed number of dimensions in each layer. The world has far more concepts than there are dimensions. So the model packs multiple features into overlapping representations, a phenomenon called superposition. This isn't a bug. Elhage et al. (2022) demonstrated that superposition is the loss-minimizing solution when features are sparse and dimensions are finite. The model actively chooses to represent more concepts than it has room for, accepting interference as the cost.

The interference is what matters. When two features share geometric space, activating one can partially activate the other. For high-frequency, well-represented concepts, the interference is negligible. The model has learned to filter it cleanly. For rare concepts in the tail of the distribution, the interference is structural. Not because the model failed to learn them, but because packing them into shared dimensions is how the model achieves its parameter efficiency in the first place.
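
To make the geometry concrete, here is a toy numpy sketch. The sizes are arbitrary and this is only an illustration of the packing-and-interference tradeoff, not a reproduction of Elhage et al.'s setup: sixty-four feature directions are crammed into sixteen dimensions, and activating one feature makes others light up through the overlap.

    import numpy as np

    rng = np.random.default_rng(0)
    n_features, n_dims = 64, 16          # more concepts than dimensions

    # Each feature gets a random unit direction in the shared space.
    directions = rng.normal(size=(n_features, n_dims))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    # Activate one sparse feature, then read every feature back out by
    # dot product, the way a linear readout would.
    active = 7
    hidden = directions[active]                      # the superposed state
    readout = directions @ hidden

    print(readout[active])                           # ~1.0: the true feature
    print(np.abs(np.delete(readout, active)).max())  # nonzero: interference
    # Crosstalk between overlapping directions is the cost of packing.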

Scaling improves the geometry for common cases. Add more parameters, get more dimensions, reduce interference for the features the loss function weights most heavily. Average-case performance rises. Benchmarks improve. Papers get published.

But the tail doesn't clear at the same rate, because the tail is where superposition lives. Rare features are sparse by definition. The loss function cares about them less. They stay packed into crowded regions of the representation space even as the model grows, because the optimization pressure to untangle them is proportionally weak.

Mixture of Experts (MoE) architectures are the industry's best attempt at addressing this within the scaling paradigm. By routing different inputs to different parameter subsets, MoE gives specialized concepts access to dedicated expert capacity rather than forcing everything through shared dimensions. This helps. It's why MoE has become the default architecture for frontier models. But MoE routing is itself a learned function optimized on the training distribution. The router learns to allocate experts to frequent, well-represented patterns. Rare tail concepts that don't activate often enough to justify dedicated expert capacity still end up compressed into generalist experts, still superposed, still subject to interference. MoE mitigates the problem for the mid-distribution. It doesn't solve it for the tail.
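
For readers who want the mechanics, here is a minimal sketch of top-k routing, with invented dimensions and randomly initialized weights. Frontier implementations differ in many details, but the relevant point survives: the gate is just another learned function, trained on the same distribution as everything else.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, n_experts, top_k = 32, 8, 2

    W_gate = rng.normal(size=(d_model, n_experts))   # the learned router
    experts = [rng.normal(size=(d_model, d_model)) * 0.1
               for _ in range(n_experts)]

    def moe_layer(x):
        logits = x @ W_gate                          # one score per expert
        chosen = np.argsort(logits)[-top_k:]         # route to the top-k
        weights = np.exp(logits[chosen] - logits[chosen].max())
        weights /= weights.sum()
        # Only the chosen experts run; rare patterns that never win the
        # routing competition keep landing on the same generalist experts.
        return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

    y = moe_layer(rng.normal(size=d_model))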

This is not speculation. The evidence is published by the people building the models.

What the interpretability work actually shows

Anthropic's sparse autoencoder (SAE) research program is the most rigorous attempt to decompose what transformers actually represent internally. SAEs try to pull apart the superposed features into individual, interpretable components. The results are instructive, but not in the way the optimistic reading suggests.
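
The recipe, in outline: train a much wider autoencoder on frozen activations with a sparsity penalty, so that each activation is explained by a small number of interpretable features. The sketch below is a simplification of that general shape (plain ReLU encoder, L1 penalty, invented sizes), not Anthropic's exact training setup.

    import torch
    import torch.nn as nn

    d_model, d_sae = 512, 8192   # the SAE is much wider than the activations

    class SparseAutoencoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_sae)
            self.decoder = nn.Linear(d_sae, d_model)

        def forward(self, acts):
            features = torch.relu(self.encoder(acts))   # sparse feature codes
            return self.decoder(features), features

    sae = SparseAutoencoder()
    acts = torch.randn(64, d_model)   # stand-in for frozen model activations
    recon, features = sae(acts)
    loss = ((recon - acts) ** 2).mean() + 1e-3 * features.abs().mean()
    loss.backward()
    # The residual (acts - recon) that refuses to shrink as d_sae grows is
    # the "dark matter" discussed below.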

Engels et al. (2024) studied what they call SAE "dark matter," the unexplained variance that SAEs can't decompose into clean features. Their key finding: larger SAEs largely struggle to reconstruct the same contexts that smaller SAEs struggle with. The nonlinear error, the component that resists decomposition, doesn't vanish with SAE scale. There appears to be a floor.

This matters because SAEs are the best analytical tool available for understanding superposition. If SAE scale doesn't resolve the error, it suggests the interference isn't a decomposition failure. It's a property of the representations themselves.

Muhamed et al. (2024) made the tail problem explicit. They built Specialized Sparse Autoencoders specifically because general-purpose SAEs can't reach rare concepts. Their framing is direct: if a concept appears only once every billion tokens, you may need a billion-feature SAE to capture it reliably. Their solution, domain-specific SAEs trained on targeted data, is effective, but it confirms the structural problem. Anyone who's tried to make a general model work reliably on a specialized domain has seen this from the outside: the model doesn't know what it doesn't know about your problem. The SAE work shows what that looks like on the inside. The concepts literally aren't represented in the geometry.

SAEs are analytical tools trained on frozen model activations. They decompose representations for researcher inspection. They don't change how the model computes during inference. Observability and behavior are different targets. The interpretability work documents the structural nature of superposition with more rigor than existed before. It doesn't solve it.

The failure that scales up

If superposition creates structural interference in the tail, you'd expect to see user-relevant failures that persist or worsen with scale. You do.

Wei et al. (2023) at Google DeepMind measured sycophancy, the tendency of models to agree with users even when the user is wrong, across PaLM models from 8B to 540B parameters. Sycophancy increased by up to 20% going from 8B to 62B, and another 10% from 62B to 540B across several opinion-agreement tasks. Instruction tuning amplified it further. The largest, most capable models were the most sycophantic.

The standard explanations focus on training signal: RLHF rewards agreeable outputs, conversational training data skews toward agreement, human raters prefer models that validate them. These are real factors. But they don't fully explain why the problem gets worse with scale. If sycophancy were purely a training signal artifact, you'd expect it to plateau once the model was large enough to learn the pattern. Instead, it keeps climbing.

One interpretation (and this is my frame, not the paper's) is that this is consistent with superposition interference. The concepts "what the user believes" and "what is correct" need to be distinguished in the model's representation space. For common, well-represented facts with clear ground truth, the model can separate them. For ambiguous, subjective, or rare queries, the representations overlap. More parameters don't resolve this overlap because the optimization pressure is weakest exactly where the interference is worst. This isn't proven, but it's testable, and it predicts the scaling behavior Wei et al. observed better than "the training data is agreeable" does.

The structural problem is broader than sycophancy. Hallucination on niche topics, inconsistency on edge cases, confident errors on low-frequency facts. These share a common pattern: they involve the model failing to cleanly distinguish competing concepts in regions of the distribution where training signal is sparse. Whether you call this superposition or something else, the failures cluster in the tail, and the tail doesn't clear with scale.

The quiet pivot

The frontier labs know this, even if they don't frame it this way.

The narrative two years ago was "scaling is all you need." The narrative in 2026 is different. Post-training now accounts for more of a model's usable capability than pretraining. Reinforcement learning with verifiable rewards, tool use, multi-step agent workflows, chain-of-thought reasoning. The entire post-training stack exists because raw scaling didn't converge on the behaviors users actually need.

The shift is documented. Karpathy's 2025 retrospective identified RLVR as the year's defining development, noting that reinforcement learning training cycles expanded significantly while parameter counts stayed roughly flat. Menlo Ventures' mid-2025 market report called verifiable rewards "the new path to scaling intelligence." The money followed: post-training infrastructure, evaluation frameworks, synthetic data pipelines, agent environments.

None of this is about making the model bigger. It's about compensating for the things that bigger models don't do better.

Here's the counterargument you'd hear from a hyperscaler CFO: RLVR works brilliantly on code and math, which are the highest-value commercial applications right now. Coding alone is a multi-billion-dollar ecosystem and the gateway to enterprise adoption. If scaling plus post-training solves the problems that make the most money, the capex is working.

Fair enough. For verifiable domains, the combination is genuinely powerful. But verifiable domains are verifiable precisely because they have deterministic reward signals: unit tests pass or fail, proofs check or don't. The post-training stack can teach the model to produce correct outputs despite representational interference when a clear signal tells it what "correct" means. Most real-world tasks don't have that signal. Customer support, medical triage, legal analysis, strategic advice, regulatory compliance. These are the tail-distribution problems where correctness is contextual, feedback is ambiguous, and the reward signal can't be automated.
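
The asymmetry is easy to see in code. A verifiable reward for a coding task fits in a few lines, assuming a pytest suite as the checker (a generic sketch of the RLVR pattern, not any lab's pipeline). There is no equivalent of a clean exit code for a triage decision or a compliance judgment.

    import subprocess

    def verifiable_reward(test_file: str) -> float:
        """Reward 1.0 only if the candidate's test suite passes cleanly."""
        result = subprocess.run(
            ["pytest", test_file, "-q"],
            capture_output=True,
            timeout=60,
        )
        return 1.0 if result.returncode == 0 else 0.0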

And these are exactly the tasks enterprise buyers expected AI to handle. The data on this is consistent. Deloitte's 2026 survey of over 3,200 enterprise leaders found that 66% report productivity and efficiency gains (the assistance tier), but only 20% are using AI to grow revenue, and only 34% are attempting to redesign their business around it. IBM's 2025 CEO study found only 25% of AI initiatives delivered expected ROI, with just 16% scaling enterprise-wide. The individual productivity gains are real. The organizational transformation isn't materializing.

The most striking number comes from MIT's NANDA initiative: 95% of generative AI pilots failed to produce measurable financial impact. The study's authors attribute this primarily to organizational factors: wrong deployment strategy, poor integration, misaligned resources. That diagnosis is correct as far as it goes. But their own lead researcher identified something more structural: generic tools "stall in enterprise use since they don't learn from or adapt to workflows." That's not an integration problem. That's the same gap, described from the deployment side, that superposition predicts from the architectural side. The tools can't adapt to enterprise-specific context because the model's representations don't handle the tail of specialized, context-dependent concepts. Better change management won't fix a representational limitation.

This is the gap the superposition argument predicts. Assistance works because it operates in the mid-distribution: summarize this document, draft this email, autocomplete this function. These are common patterns with well-represented concepts in the model's geometry. Transformation (replacing a workflow end-to-end, handling the exceptions, adapting to the specific context of this business) requires reliable performance in the tail. And the tail doesn't clear with scale.

A Wall Street analyst might argue that the assistance market is big enough to justify the capex on its own. Maybe. But the trillion-dollar infrastructure bets aren't priced for a world where AI helps people type faster. They're priced for a world where AI replaces workflows. The gap between those two outcomes is the gap between the mid-distribution and the tail.

The cost of the wrong geometry

Follow the logic to the cost structure.

The four largest US hyperscalers will spend over $600 billion on infrastructure in 2026, the majority directed at AI compute and data centers. Goldman Sachs projects that total hyperscaler capex from 2025 through 2027 will reach $1.15 trillion. This spending buys compute. Compute serves the same representations: the same superposed features, the same geometric crowding in the tail.

Each generation of model is measurably better at code, math, reasoning, and summarization. But notice where the improvements land: the well-represented regions where training signal is dense, benchmarks are plentiful, and verification is cheap. The mid-distribution sharpens. The tail, where enterprise value lives or dies, improves at a structurally lower rate.

The economic question isn't "does scaling work?" It does. The question is: does scaling work on the dimension that determines whether the investment pays off?

The industry is answering that question with post-training patches. The existence of those patches is the evidence.

What Engram concedes

DeepSeek's Engram paper (January 2026) is interesting precisely because it's a concession. Engram separates knowledge from reasoning: a static N-gram hash table handles factual lookup, freeing transformer depth for reasoning. The results are strong: BBH +5.0, ARC-Challenge +3.7 over an iso-parameter MoE baseline. The biggest gains came in reasoning benchmarks, not knowledge recall, which validates the core architectural insight: when you stop asking the transformer to store knowledge in superposed weights, its reasoning capacity improves.
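
The general shape of such a knowledge layer is easy to sketch: hash the most recent tokens into a fixed table and add the stored vector to the hidden state. The code below is an illustrative reading with invented sizes and names, not DeepSeek's actual implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, table_size, n = 256, 10_000, 3

    # The knowledge store: one vector per hash bucket, frozen after training.
    table = rng.normal(size=(table_size, d_model)) * 0.02

    def ngram_slot(token_ids):
        # Deterministic hash of the last n token ids into a table index.
        return hash(tuple(token_ids[-n:])) % table_size

    def inject_knowledge(token_ids, hidden):
        # Factual lookup comes from the table, leaving the transformer's
        # depth free for reasoning over what was retrieved.
        return hidden + table[ngram_slot(token_ids)]

    hidden = inject_knowledge([17, 402, 9021, 33], rng.normal(size=d_model))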

Engram implicitly concedes that superposed knowledge storage is insufficient for reliable factual retrieval. The hash table gives facts their own dedicated representation, outside the compressed geometry of the transformer. This is an architectural separation that the scaling approach says shouldn't be necessary. If scaling worked on knowledge, you wouldn't need a separate subsystem for it.

But Engram's knowledge layer is frozen after training. It's a static lookup table, not a learning system. DeepSeek diagnosed the right problem, that knowledge shouldn't be superposed with reasoning, and built a static patch. The patch improves performance. It doesn't learn from use. It doesn't improve from outcomes. It doesn't update when the world changes.

They're solving the right problem inside the wrong framework.

The race that can't be won on this track

The argument here isn't that scaling fails. Scaling succeeds. Models improve on benchmarks, generate better text, reason more effectively, write better code. The argument is that scaling succeeds at the wrong thing.

The failures that matter to users don't cluster neatly. A doctor's question about a rare drug interaction, an engineer's edge case in a legacy codebase, a customer's complaint that doesn't match any template. These live in the tail where superposition is structural and scaling provides the least leverage. The hundreds of billions being spent on compute infrastructure sharpen the mid-distribution while the tail-case gap persists.

What would you build if you accepted that scaling won't solve this particular problem?

The answer requires a different relationship between knowledge and the systems that use it. Knowledge stored externally, where it can be inspected, corrected, and updated. Not superposed in opaque weights. This much, RAG already attempts. But most production RAG pipelines retrieve by similarity and don't track whether the retrieval actually helped. Query the same thing twice, get the same results, regardless of whether the first result was useful or useless. Retrieval that improves from outcomes, where the quality of what gets surfaced adapts based on what actually worked, is a different mechanism entirely. A system where the 1000th query gets better results than the 1st, not because the model got bigger, but because the system learned what works for this user, in this domain, on this type of problem.
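
A sketch of what that could look like, as a hypothetical design rather than any existing library's API: blend embedding similarity with a running estimate of how often each chunk actually helped, and let task outcomes update that estimate.

    import numpy as np

    class OutcomeWeightedRetriever:
        """Rank chunks by similarity blended with observed usefulness."""

        def __init__(self, embeddings, alpha=0.7):
            self.embeddings = embeddings                      # (n_chunks, d), unit norm
            self.helpfulness = np.full(len(embeddings), 0.5)  # prior: unknown
            self.alpha = alpha

        def retrieve(self, query_vec, k=5):
            similarity = self.embeddings @ query_vec
            score = self.alpha * similarity + (1 - self.alpha) * self.helpfulness
            return np.argsort(score)[-k:][::-1]

        def record_outcome(self, chunk_id, helped, lr=0.1):
            # Feedback from the task (ticket resolved, answer accepted) nudges
            # future rankings, so the same query stops surfacing chunks that
            # never help.
            target = 1.0 if helped else 0.0
            self.helpfulness[chunk_id] += lr * (target - self.helpfulness[chunk_id])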

That's a different race entirely. And the industry isn't funding it.