Extends the Muon optimizer with tensorized orthonormalization for more efficient large language model pre-training.

Figure: topological visualization of TEON (Tensorized Orthonormalization for LLM Pre-Training).

TEON is a tensorized generalization of the Muon optimizer designed to improve the pre-training efficiency and performance of large language models (LLMs) by extending gradient orthonormalization beyond individual layers. While Muon applies matrix-level orthogonalization independently within each layer, TEON models the gradients of a neural network as a structured higher-order tensor, enabling the optimization process to capture inter-layer dependencies. This approach prevents gradient rank collapse more effectively and leads to improved convergence properties compared to layer-wise methods.
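The layer-wise baseline that TEON generalizes can be made concrete. Below is a minimal NumPy sketch of Muon-style per-layer orthogonalization using the cubic Newton–Schulz iteration; Muon's released implementation uses a tuned quintic with different coefficients, so the cubic form here is a simplification for illustration.

```python
import numpy as np

def orthogonalize(G, steps=20):
    """Approximate the semi-orthogonal polar factor of G with the cubic
    Newton-Schulz iteration X <- 1.5*X - 0.5*(X X^T) X.  Muon itself uses
    a tuned quintic; the cubic variant is used here for simplicity."""
    X = G / (np.linalg.norm(G) + 1e-7)  # spectral norm <= Frobenius norm <= 1
    flip = X.shape[0] > X.shape[1]
    if flip:                            # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = 1.5 * X - 0.5 * A @ X
    return X.T if flip else X

# Layer-wise (Muon-style): each layer's gradient is orthogonalized
# independently, with no coupling between layers.
rng = np.random.default_rng(0)
grads = [rng.standard_normal((8, 16)) for _ in range(3)]
updates = [orthogonalize(G) for G in grads]
s = np.linalg.svd(updates[0], compute_uv=False)
print(np.round(s, 3))  # all singular values driven to ~1
```

TEON's departure is precisely that the loop above treats each gradient in isolation; the tensorized view replaces it with a joint operation over the stacked gradients.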

Theoretical analysis shows that TEON provides stronger convergence guarantees than Muon, with practical implementations validated through extensive ablation studies. Experiments were conducted on both GPT-style models (ranging from 130M to 774M parameters) and LLaMA-style models (from 60M to 1B parameters), demonstrating that TEON consistently improves training and validation perplexity across model scales. For instance, when pre-training GPT-Small on 10 billion FineWeb tokens, TEON achieved lower validation perplexity than Muon across several orthogonalization schemes, including Polar Express, Jordan, and You.

TEON adapts the Newton-Schulz iteration for tensor operations and employs mode-1 matricization to process column fibers of the gradient tensor. A dimensional pre-factor of $$\sqrt{m/n}$$ is incorporated to enhance scalability, where $$m$$ and $$n$$ are the row and column dimensions of the matricized tensor. The optimizer remains robust under various approximate singular value decomposition (SVD) schemes, although performance can degrade with certain approximations, particularly when larger stacking group sizes are used with the Polar Express method.
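One plausible reading of this scheme can be sketched as follows. The exact unfolding layout, the stand-in cubic iteration, and the placement of the $$\sqrt{m/n}$$ pre-factor are all assumptions for illustration, not the paper's exact construction: stack same-shaped layer gradients into an order-3 tensor, orthogonalize one matricization jointly so the layers are coupled, refold, and rescale.

```python
import numpy as np

def ns_orth(X, steps=20):
    """Cubic Newton-Schulz orthogonalization (stand-in for the paper's
    tensor-adapted iteration)."""
    X = X / (np.linalg.norm(X) + 1e-7)
    flip = X.shape[0] > X.shape[1]
    if flip:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = 1.5 * X - 0.5 * A @ X
    return X.T if flip else X

def teon_like_update(grads):
    """Jointly orthogonalize stacked layer gradients.  The (L*m, n)
    unfolding and the pre-factor placement are illustrative choices."""
    T = np.stack(grads)                    # (L, m, n) gradient tensor
    L, m, n = T.shape
    M = T.reshape(L * m, n)                # unfold: all layers share the column axis
    O = ns_orth(M)                         # one joint orthogonalization
    O *= np.sqrt(M.shape[0] / M.shape[1])  # dimensional pre-factor sqrt(m/n)
    return list(O.reshape(L, m, n))        # refold into per-layer updates

rng = np.random.default_rng(1)
grads = [rng.standard_normal((8, 16)) for _ in range(3)]
updates = teon_like_update(grads)
```

Because the orthogonalization acts on the unfolded matrix, the update for each layer depends on every other layer's gradient, which is the qualitative difference from Muon's independent per-layer step.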

Hyperparameters for TEON are consistent with those of Muon across different model sizes, including learning rates (0.02 or 0.005 depending on model scale), cosine learning rate scheduling, weight decay of 0.1, and a warmup ratio of 0.1. These results establish TEON as a principled advancement in optimizer design for LLMs, offering a pathway toward more efficient and stable pre-training of large-scale models.
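The reported schedule can be written down directly. This sketch matches the stated hyperparameters (cosine decay, warmup ratio 0.1, peak learning rate 0.02 or 0.005 by model scale); the per-step linear warmup and the decay floor of zero are assumptions.

```python
import math

def teon_lr(step, total_steps, peak_lr=0.02, warmup_ratio=0.1):
    """Cosine learning-rate schedule with linear warmup.  peak_lr is
    0.02 or 0.005 depending on model scale; a zero floor is assumed."""
    warmup_steps = max(1, int(warmup_ratio * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps      # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

print(round(teon_lr(99, 1000), 4))   # end of warmup: 0.02
print(round(teon_lr(550, 1000), 4))  # halfway through decay: 0.01
```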

Generated Feb 22, 2026
Cerebras Thinking

TEON addresses the computational bottlenecks associated with pre-training Large Language Models (LLMs) by introducing an advanced optimization framework built upon the Muon optimizer. While standard optimizers like AdamW often struggle with the stability and efficiency required for massive-scale training, TEON proposes a "tensorized orthonormalization" strategy. This approach extends the concept of matrix orthonormalization—which constrains weight matrices to have singular values of 1—to higher-dimensional tensors. By doing so, TEON effectively manages the conditioning of the optimization landscape, preventing the exploding or vanishing gradients that typically necessitate conservative hyperparameter settings.

The key technical contribution of TEON lies in its efficient algorithm for applying these orthonormal constraints to tensor operations common in modern transformer architectures. Unlike naive implementations that might incur heavy computational overhead via repeated Singular Value Decompositions (SVDs), TEON optimizes this process, allowing for significantly larger learning rates without sacrificing training stability. This results in faster convergence and reduced overall compute costs. The research demonstrates that by maintaining orthonormality throughout the network's weights, the model can sustain a more efficient optimization trajectory, leading to comparable or superior performance relative to baseline methods in fewer training steps.
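The claim about avoiding repeated SVDs can be checked concretely: a Newton–Schulz iteration reproduces the SVD-defined polar factor $$UV^\top$$ using only matrix multiplications. The cubic variant below is a simplification; the paper's methods (and Muon's production coefficients) differ in the exact polynomial used.

```python
import numpy as np

def ns_polar(G, steps=30):
    """Matmul-only approximation of the polar factor U @ Vt of G,
    via the cubic Newton-Schulz iteration; no SVD calls."""
    X = G / (np.linalg.norm(G) + 1e-7)
    flip = X.shape[0] > X.shape[1]
    if flip:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = 1.5 * X - 0.5 * A @ X
    return X.T if flip else X

rng = np.random.default_rng(2)
G = rng.standard_normal((8, 20))
U, _, Vt = np.linalg.svd(G, full_matrices=False)  # reference, for checking only
print(np.allclose(ns_polar(G), U @ Vt, atol=1e-4))  # True
```

Since each iteration is a handful of matrix multiplications, the whole orthogonalization maps onto GPU-friendly GEMMs, which is what makes the larger learning rates affordable in practice.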

This work matters because it represents a meaningful shift away from the ubiquitous AdamW optimizer toward geometry-aware optimization techniques for LLMs. As model sizes continue to grow, the efficiency of the pre-training phase becomes a critical constraint on research and development. TEON offers a viable path to reduce both the time and financial cost of training state-of-the-art models by ensuring that the optimization process is numerically stable and computationally efficient. It validates the hypothesis that enforcing structural constraints on weights, particularly through tensorized operations, is a scalable solution for the next generation of foundation models.

Generated Mar 4, 2026
Open-Weights Reasoning

Summary: TEON – Tensorized Orthonormalization for LLM Pre-Training

The paper TEON: Tensorized Orthonormalization for LLM Pre-Training introduces an enhanced optimization framework that extends the Muon optimizer with tensorized orthonormalization to improve the efficiency and stability of large language model (LLM) pre-training. Traditional optimization techniques for LLMs often struggle with issues such as slow convergence, numerical instability, and high memory overhead, particularly as model sizes grow. TEON addresses these challenges by incorporating orthonormal constraints at the tensor level, ensuring that weight updates preserve desirable geometric properties (e.g., orthogonal embeddings) while reducing computational redundancies. This approach leverages structured matrix decompositions (e.g., low-rank approximations) to maintain orthonormality without excessive memory or compute costs, making it scalable for modern LLMs.
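The integration with Muon described above can be sketched as a single optimizer step: momentum accumulation followed by orthogonalization of the update direction. The momentum value 0.95 follows Muon's common default; the cubic iteration, the scaling placement, and all names here are illustrative assumptions rather than the paper's exact algorithm.

```python
import numpy as np

def ns_orth(X, steps=20):
    # Cubic Newton-Schulz orthogonalization (stand-in for the paper's scheme).
    X = X / (np.linalg.norm(X) + 1e-7)
    flip = X.shape[0] > X.shape[1]
    if flip:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = 1.5 * X - 0.5 * A @ X
    return X.T if flip else X

def muon_like_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    """One Muon-style update that TEON builds on: accumulate momentum,
    orthogonalize the buffered direction, then apply a scaled step.
    Hyperparameters and the sqrt(m/n) scaling are assumptions."""
    momentum_buf = beta * momentum_buf + grad
    update = ns_orth(momentum_buf)
    m, n = param.shape
    param = param - lr * np.sqrt(max(1.0, m / n)) * update
    return param, momentum_buf

rng = np.random.default_rng(3)
W = rng.standard_normal((16, 8))
buf = np.zeros_like(W)
W, buf = muon_like_step(W, rng.standard_normal((16, 8)), buf)
```

TEON's contribution, per the summary, is to replace the per-matrix `ns_orth` call with a joint, tensor-level orthogonalization across layers while keeping the surrounding momentum machinery intact.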

The key contributions of TEON include:

1. Tensorized orthonormalization: a novel optimization strategy that enforces orthonormality constraints in a structured, low-rank manner, reducing the computational burden compared to full-matrix orthonormalization.
2. Compatibility with Muon: by integrating with the Muon optimizer, TEON retains advantages like adaptive learning rates while introducing orthonormality for better gradient flow and generalization.
3. Empirical validation: the paper demonstrates improved pre-training efficiency, measured in terms of convergence speed, memory usage, and downstream task performance, across benchmarks, particularly for models with billions of parameters.

Why It Matters: Efficient LLM pre-training is critical as model sizes continue to grow, straining computational resources and training budgets. TEON’s tensorized orthonormalization offers a practical middle ground between unconstrained optimization and rigid orthonormalization schemes, making it a promising direction for scaling LLMs while maintaining training stability. The work aligns with broader trends in optimization research, where structured constraints (e.g., low-rank updates, orthogonality) are increasingly used to balance efficiency and performance.

For researchers and practitioners working on LLM optimization, this paper provides a concrete method to mitigate common training bottlenecks, particularly in settings where memory and compute are constrained. Future work could explore extending TEON to other model architectures or fine-tuning scenarios where orthonormality constraints may also prove beneficial.

Source: [arXiv:2501.00005](https://arxiv.org/abs/2501.00005)

Generated Mar 12, 2026