Mechanistic interpretability posits that neural networks use superposition to represent over-complete feature bases, an idea that has inspired sparse autoencoders; however, superposition has mainly been studied in idealized settings with sparse, uncorrelated features, where it introduces interference.
Superposition in neural networks allows for the representation of more features than available dimensions in activation space, achieved through non-orthogonal, overcomplete feature bases that leverage sparsity to minimize interference. This phenomenon enables models to efficiently encode vast amounts of information but introduces polysemanticity, where individual neurons respond to multiple semantically distinct features, complicating interpretability. Sparse autoencoders (SAEs) have emerged as a key tool to disentangle these superposed representations by learning sparse, overcomplete latent codes that are more interpretable and monosemantic than individual neurons.
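The core mechanism can be seen in a minimal sketch (not the paper's setup; the dimensions, sparsity level, and random basis are illustrative assumptions): packing 100 sparse features into a 20-dimensional space with a random non-orthogonal basis still lets a naive linear readout separate active from inactive features, at the cost of small interference terms.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_feats = 20, 100           # many more features than dimensions

# Over-complete, non-orthogonal basis: random unit-norm columns
W = rng.normal(size=(n_dims, n_feats))
W /= np.linalg.norm(W, axis=0)

# Sparse input: only k of the 100 features are active
k = 3
x = np.zeros(n_feats)
active = rng.choice(n_feats, size=k, replace=False)
x[active] = 1.0

h = W @ x                           # superposed 20-dim activation
x_hat = W.T @ h                     # naive linear readout

inactive = np.setdiff1d(np.arange(n_feats), active)
# Active features read out near 1; inactive features pick up only
# small interference, roughly of order sqrt(k / n_dims)
print(x_hat[active])
print(np.abs(x_hat[inactive]).mean())
```

The gap between active readouts and interference noise shrinks as more features co-activate, which is why sparsity is the usual precondition for superposition to work.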
Data statistics—particularly input correlations—play a critical role in shaping the geometry of these latent feature representations. When training data contains structured correlations, the learned feature geometries become increasingly constrained. In uncorrelated settings, feature arrangements vary significantly across random seeds, resulting in high geometric variability. However, introducing paired correlations—where features in one class co-activate with those in another—partially constrains the geometric configuration. Under global correlations, where activation patterns follow a cyclic structure based on class similarity, the geometry becomes nearly deterministic across seeds, with cosine similarity matrices showing strong alignment (0.88 ± 0.06).
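The data-statistics side of this can be sketched concretely (a toy pairing scheme of my own, not the paper's exact construction): drawing paired features from one shared Bernoulli variable makes their co-occurrence probability p rather than the p² expected under independence, which is exactly the kind of structure the learned geometry must accommodate.

```python
import numpy as np

rng = np.random.default_rng(1)
n_feats, n_samples, p = 8, 50_000, 0.05

# Uncorrelated baseline: each feature active independently with prob p
uncorr = (rng.random((n_samples, n_feats)) < p).astype(float)

# Paired correlations (toy scheme): features 2i and 2i+1 share one
# Bernoulli draw, so they always co-activate
shared = (rng.random((n_samples, n_feats // 2)) < p).astype(float)
paired = np.repeat(shared, 2, axis=1)

def cooccurrence(X):
    # Empirical P(feature i and feature j both active)
    return (X.T @ X) / len(X)

C_u, C_p = cooccurrence(uncorr), cooccurrence(paired)
# Off-diagonal entries: ~p^2 when independent, ~p for a correlated pair
print(C_u[0, 1], C_p[0, 1])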
This geometric organization reflects an interference-avoidance mechanism: frequently co-activating features are arranged to minimize mutual interference, leading to orthogonalization or clustering in structured subspaces. Correlated features tend to be more orthogonal, while anti-correlated or sparsely co-occurring features may reside within the same tegum product or form antipodal pairs (e.g., one direction encoding two opposing features). These configurations suggest that the network’s representational geometry is not arbitrary but shaped by the statistical structure of the input data.
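The antipodal-pair idea is easy to make concrete with a toy sketch (the vectors and decoder here are illustrative, not the paper's model): two features that are perfectly anti-correlated can share a single direction with opposite signs, and a sign readout disambiguates them for free.

```python
import numpy as np

# Toy antipodal pair: two features that never co-occur can share one
# direction, one encoded at +w and the other at -w, halving the
# dimensions they consume
w = np.array([1.0, 0.0])
feat_A, feat_B = +w, -w             # antipodal encodings

def decode(h):
    """Sign of the projection onto w disambiguates the pair."""
    return "A" if h @ w > 0 else "B"

print(decode(feat_A))               # A
print(decode(feat_B))               # B
# Because the two features are never active together, sharing a
# direction introduces no interference between them
```

This is the limiting case of the interference-avoidance logic: the less two features co-occur, the more cheaply they can share representational capacity.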
While much of this understanding comes from idealized models with sparse, uncorrelated inputs, recent work extends these principles to more complex architectures like Mixture-of-Experts (MoE), showing that network-level sparsity—not just feature sparsity—governs representational strategies. Moreover, adversarial vulnerabilities can be predicted from superposition geometry, as interference patterns induced by data correlations create consistent failure modes across models. Thus, the transition from data statistics to feature geometry reveals that superposition is not merely a representational quirk but a structured, efficiency-driven mechanism shaped by input structure and computational constraints.
This paper bridges the gap between idealized theoretical models of superposition and the complex statistical realities of natural data. While foundational work in mechanistic interpretability typically assumes features are sparse and independent, this study investigates how feature correlations reshape the geometry of neural representations. The authors formalize how the covariance structure of the dataset dictates the arrangement of feature vectors in a network’s activation space, demonstrating that the "over-complete" basis paradigm behaves fundamentally differently when data statistics deviate from the independent prior. The work provides a rigorous analysis of how these correlations introduce structured interference, diverging from the uniform noise patterns found in uncorrelated settings.
A key contribution of the research is its geometric characterization of feature representation under correlation constraints. The authors show that rather than arranging features isotropically to minimize interference, networks utilize specific geometric configurations—such as forming tight clusters or utilizing antipodal arrangements—to represent correlated features efficiently. This insight reveals that the capacity of a neural network to store features is not merely a function of sparsity but is critically dependent on the correlation matrix of the underlying data distribution. Furthermore, the paper analyzes the implications for Sparse Autoencoders (SAEs), suggesting that current architectures, which often rely on uncorrelated priors, may struggle to recover faithful feature geometry in real-world applications where correlations are pervasive.
This research matters significantly because it moves the field of mechanistic interpretability toward more realistic models of neural computation. As Large Language Models (LLMs) process highly correlated natural language, understanding superposition through the lens of data statistics is crucial for accurate interpretation. By clarifying how correlations shape feature geometry, this work paves the way for developing next-generation dictionary learning tools that can robustly disentangle features in practical scenarios, ultimately enhancing our ability to audit and understand complex AI systems.
---
Summary
This paper investigates how correlations in input data influence the effectiveness of superposition in neural networks, bridging the gap between statistical properties of data and the geometric structure of learned features. Mechanistic interpretability studies have shown that neural networks can leverage superposition—where multiple features share directions in activation space—to represent over-complete feature bases efficiently. However, most prior work assumes sparse, uncorrelated data, where such representations introduce interference that degrades performance. This work relaxes this assumption by analyzing how correlations in real-world data shape the feasibility and utility of superposition. The authors demonstrate that correlated features can either mitigate or exacerbate interference, depending on their alignment with the network's learned basis. By formalizing these interactions, the paper provides insights into when and why superposition is beneficial, offering a more nuanced understanding of its role in neural representations.
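One way to see how correlation and alignment interact is a deliberately simplified toy expression (my own illustration, not a formula from the paper): the interference feature j injects into feature i's readout grows with both their co-activation probability and the squared cosine of their directions.

```python
def expected_interference(p_co, cos_ij):
    """Toy proxy for the interference feature j injects into the
    linear readout of feature i: it scales with how often the two
    co-activate (p_co) and how aligned their directions are (cos_ij)."""
    return p_co * cos_ij ** 2

# Aligned directions + frequent co-activation: interference is worst
print(expected_interference(0.5, 0.9))   # 0.405
# Orthogonal directions: no interference even under heavy correlation
print(expected_interference(0.5, 0.0))   # 0.0
```

Under this proxy, correlated features "want" orthogonal directions while anti-correlated ones can safely share them, matching the qualitative picture above.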
The key contributions include a theoretical framework linking data statistics (e.g., covariance structure) to feature geometry, empirical validation on synthetic and real datasets, and implications for sparse autoencoders and other models relying on superposition. The findings suggest that correlations in data can be harnessed to reduce the dimensions needed for representation, challenging the prevailing view that superposition is primarily advantageous in low-interference regimes. This work matters because it refines our understanding of neural efficiency, with potential applications in designing more robust and interpretable models, optimizing feature extraction in autoencoders, and improving generalization in downstream tasks.
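The dimension-reduction claim has a simple linear-algebra intuition (a generic low-rank example, not the paper's experiment): when many observed features are driven by a few latent factors, the covariance spectrum concentrates in a few directions, so far fewer dimensions suffice to represent the data.

```python
import numpy as np

rng = np.random.default_rng(2)
n_feats, n_samples = 10, 20_000

# Ten observed features driven by only two latent factors plus a
# little noise: heavy correlation, low effective dimensionality
Z = rng.normal(size=(n_samples, 2))
M = rng.normal(size=(2, n_feats))
X = Z @ M + 0.01 * rng.normal(size=(n_samples, n_feats))

cov = np.cov(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
explained = np.cumsum(eigvals) / eigvals.sum()
print(explained[1])                 # two directions capture ~all variance
```

Superposition in correlated regimes exploits exactly this kind of spectral concentration, which is why correlation structure, not just sparsity, governs representational capacity.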