DLMs suffer high inference costs from iterative denoising, and unlike AR LLMs, their attention-sink positions show high variance across generation, invalidating inherited pruning heuristics.
Diffusion Language Models (DLMs) suffer high inference costs due to their iterative denoising process, which repeatedly updates the entire token sequence across multiple timesteps, in contrast to the single forward pass per token in autoregressive (AR) LLMs. This iterative mechanism increases computational and memory demands, making efficient inference techniques like pruning critical for deployment.
A key challenge in pruning DLMs arises from the behavior of attention-sink tokens—positions that consistently attract high attention mass. In AR LLMs, these sinks are typically stable, often located at early positions like the BOS or system prompt, and serve as global anchors that help propagate conditioning information across the causal computation graph. Consequently, standard pruning heuristics for AR models preserve these sink tokens to avoid performance collapse.
However, this heuristic does not transfer effectively to DLMs. Analysis shows that attention-sink positions in DLMs exhibit substantially higher variance across denoising timesteps, meaning their location shifts dynamically rather than remaining fixed. This temporal instability arises because DLMs operate under bidirectional attention and evolve the full sequence iteratively: early steps focus on global structure under high noise, while later steps refine local details, leading to shifting attention demands. As a result, many sinks in DLMs are transient—functionally relevant at certain timesteps but not persistently essential.
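The sink-position variance described above can be measured directly from attention maps. The sketch below is my own illustration (the function name, shapes, and the "column-mass argmax" definition of a sink are assumptions, not the paper's): it locates the key position attracting the most attention at each denoising step and reports how much that position moves across steps.

```python
import numpy as np

def sink_position_variance(attn_maps):
    """Variance of the attention-sink position across denoising timesteps.

    attn_maps: array of shape (T, L, L) -- one attention matrix per timestep
    (rows = queries, columns = keys). The "sink" at a timestep is taken to be
    the key position receiving the most total attention mass.
    """
    attn_maps = np.asarray(attn_maps, dtype=float)
    mass = attn_maps.sum(axis=1)   # (T, L): total attention each key attracts
    sinks = mass.argmax(axis=1)    # sink position at each timestep
    return sinks, sinks.var()

# An AR-like stable sink: attention mass always peaks at position 0,
# so the sink position never moves and its variance is zero.
T, L = 4, 6
stable = np.full((T, L, L), 1.0 / L)
stable[:, :, 0] += 1.0             # key 0 dominates at every timestep
sinks, var = sink_position_variance(stable)
print(sinks, var)                  # -> [0 0 0 0] 0.0
```

A DLM-like head would instead produce a different argmax at different timesteps, yielding a nonzero variance—the quantity the analysis above uses to distinguish stable from transient sinks.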
This divergence invalidates the AR-inspired "always keep sinks" rule when applied to DLM pruning. Instead, Sink-Aware Pruning has been proposed as a diffusion-specific strategy that identifies and prunes unstable sinks based on their variance across the denoising trajectory. By discounting these transient sinks during importance estimation, the method achieves a better quality-efficiency trade-off without retraining. Experiments show that Sink-Aware Pruning consistently matches or outperforms strong baselines across sparsity levels and model families, with the most significant gains observed at moderate-to-high sparsity (50%–75%), where aggressive compression amplifies the cost of preserving unstable components.
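One way to "discount transient sinks during importance estimation" is to penalize positions whose attracted attention mass fluctuates across the denoising trajectory. The sketch below is one plausible formulation under my own assumptions (the mean/std combination and the `alpha` weight are not the paper's exact scoring rule):

```python
import numpy as np

def sink_aware_importance(attn_maps, alpha=1.0):
    """Down-weight key positions whose attention mass is unstable over time.

    attn_maps: (T, L, L) attention matrices across T denoising timesteps.
    Illustrative formulation; `alpha` controls how strongly transient
    (high-variance) sinks are discounted.
    """
    mass = np.asarray(attn_maps, dtype=float).sum(axis=1)  # (T, L) mass per key per step
    mean_mass = mass.mean(axis=0)                          # average attracted attention
    instability = mass.std(axis=0)                         # large for transient sinks
    return mean_mass / (1.0 + alpha * instability)

# Two key positions with equal mean mass: position 0 is steady,
# position 1 spikes at one timestep and vanishes at the other.
attn = np.array([[[1.0, 2.0], [1.0, 2.0]],   # step 0: key 1 dominates
                 [[1.0, 0.0], [1.0, 0.0]]])  # step 1: key 1 attracts nothing
imp = sink_aware_importance(attn)
print(imp)  # the steady position 0 now outranks the transient sink at position 1
```

Under a plain mean-attention criterion both positions would tie; the instability discount is what breaks the tie in favor of the persistently useful position.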
Moreover, DLMs demonstrate greater robustness to sink removal compared to AR models, likely due to their bidirectional context and iterative refinement process, which provide alternative pathways for information flow. This resilience further supports the feasibility of pruning unstable sinks in DLMs, in contrast to the catastrophic failures seen in AR models when sinks are removed.
In summary, the high temporal variance of sink positions in DLMs challenges the direct transfer of AR-based pruning heuristics, necessitating paradigm-specific strategies like Sink-Aware Pruning that account for the dynamic nature of attention during iterative denoising.
This paper addresses the prohibitive inference costs associated with Diffusion Language Models (DLMs), which require multiple iterative denoising steps to generate text. While autoregressive (AR) Large Language Models have benefited significantly from KV-cache pruning strategies to accelerate inference, the authors demonstrate that these heuristics fail when applied to DLMs. The core issue identified is the phenomenon of "attention sinks"—tokens that absorb disproportionate attention scores to preserve model stability. Unlike AR models, where attention sinks are static (typically the first BOS token), DLMs exhibit high variance in sink positions across different diffusion timesteps, rendering static pruning methods ineffective and potentially degrading generation quality.
To overcome this limitation, the authors introduce "Sink-Aware Pruning," a novel optimization framework designed specifically for the dynamic attention patterns of diffusion models. This method adaptively identifies the current attention sink positions at each denoising step and preserves the key-value (KV) states for these critical tokens while aggressively pruning less relevant entries. By dynamically adjusting the retention policy based on the evolving attention landscape, the approach maintains the stability provided by sinks without retaining the full context window.
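A minimal sketch of such a per-step retention policy, assuming attention scores are available at each denoising step (the budget and sink-count parameters, and the policy itself, are hypothetical illustrations rather than the authors' implementation):

```python
import numpy as np

def step_adaptive_kv_mask(attn_step, budget, n_sinks=2):
    """Choose which KV entries to retain at one denoising step.

    attn_step: (L, L) attention matrix for the current step. The most-attended
    key positions (this step's sinks) are always kept; the rest of the budget
    goes to the next-highest-attention keys.
    """
    mass = np.asarray(attn_step, dtype=float).sum(axis=0)  # attention each key attracts
    order = np.argsort(mass)[::-1]                         # most-attended keys first
    sinks = order[:n_sinks]                                # current sink positions
    keep = order[:max(budget, n_sinks)]                    # sinks + top remaining keys
    mask = np.zeros(len(mass), dtype=bool)
    mask[keep] = True
    return mask, sinks

# A step where key 3 is the dominant sink and key 0 a secondary one.
attn = np.zeros((5, 5))
attn[:, 3], attn[:, 0], attn[:, 1] = 0.5, 0.3, 0.1
mask, sinks = step_adaptive_kv_mask(attn, budget=3)
print(mask, sinks)  # retains keys 3, 0, 1; sinks identified as [3, 0]
```

Because the mask is recomputed from the current step's attention, a sink that migrates to a new position at a later step is automatically re-protected, which is exactly what a static AR-style policy cannot do.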
The significance of this work lies in its potential to bridge the efficiency gap between diffusion and autoregressive architectures. By validating that DLMs possess a unique attention mechanism that evolves over time, the research not only provides a practical tool for accelerating DLM inference—reducing memory bandwidth and latency—but also offers deeper architectural insights into how diffusion models process sequence information compared to their AR counterparts.
Source: [arXiv:2602.17664](https://arxiv.org/abs/2602.17664)
Diffusion language models (DLMs) have emerged as a powerful alternative to autoregressive (AR) large language models (LLMs), offering competitive performance while avoiding the sequential constraints of token-by-token generation. However, their iterative denoising process—where noise is progressively removed from latent representations over multiple steps—introduces significant computational overhead during inference. Unlike AR LLMs, which have relatively stable attention patterns (e.g., fixed positional biases or "attention sinks"), DLMs exhibit high variance in their attention dynamics across different generation steps. This variability undermines traditional pruning heuristics, which often rely on identifying static or slowly evolving attention patterns (e.g., head pruning or token pruning based on attention entropy).
This paper introduces sink-aware pruning, a novel approach to accelerate DLM inference by dynamically identifying and pruning attention heads or tokens that contribute minimally to the denoising process at each step. The key insight is that while attention sinks in DLMs are not fixed, they can be detected locally during generation using a sink score—a metric that quantifies the stability and importance of attention patterns. By pruning low-sink-score components (e.g., heads or tokens) at each denoising step, the method achieves substantial speedups with minimal impact on generation quality. Experiments demonstrate that sink-aware pruning can reduce compute costs by up to 40% while maintaining perplexity and output fidelity comparable to unpruned DLMs. The work also highlights the importance of adaptive pruning strategies for diffusion-based generative models, where static pruning (as used in AR LLMs) fails due to the stochastic nature of the denoising process.
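The "sink score" is described only at a high level here, so the following is one plausible instantiation for head-level pruning: score each head by how concentrated its attention peak is (importance) and how little that peak drifts across steps (stability), then drop low-scoring heads. Every formula detail below is an assumption, not the paper's metric.

```python
import numpy as np

def head_sink_score(attn):
    """Illustrative sink score for one attention head across denoising steps.

    attn: (T, L, L) the head's attention at T steps. Combines peak
    concentration with positional stability of the peak.
    """
    mass = np.asarray(attn, dtype=float).sum(axis=1)  # (T, L) mass per key
    frac = mass.max(axis=1) / mass.sum(axis=1)        # peak concentration per step
    drift = mass.argmax(axis=1).std()                 # how far the peak wanders
    return frac.mean() / (1.0 + drift)

# A stable head (peak always at key 0) vs. one whose peak drifts each step.
stable = np.full((3, 4, 4), 0.1)
stable[:, :, 0] = 0.7
drifty = np.full((3, 4, 4), 0.1)
for t, pos in enumerate([0, 2, 3]):
    drifty[t, :, pos] = 0.7
print(head_sink_score(stable), head_sink_score(drifty))
```

Both heads are equally concentrated at every individual step, so a single-step criterion could not separate them; only the across-step drift term identifies the second head's sink as unstable.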