Tilde Research Introduces Aurora: A Leverage-Aware Optimizer That Fixes a Hidden Neuron Death Problem in Muon

Researchers at Tilde Research have released Aurora, a new optimizer for training neural networks that addresses a structural flaw in the widely-used Muon optimizer. The flaw quietly kills off a significant fraction of MLP neurons during training and keeps them permanently dead. Aurora comes with a 1.1B-parameter pretraining experiment, a new state-of-the-art result on the modded-nanoGPT speedrun benchmark, and open-source code.

What is Muon?

To understand Aurora, it helps to first understand Muon. The Muon optimizer attracted attention in the ML community after outperforming AdamW in wall-clock time to convergence on the nanoGPT speedrun competition — a community benchmark that measures how fast you can train a GPT-style model to a target validation loss. Since then, Muon has been adopted in frontier-scale model training by several research groups.

Muon’s key algorithmic step is computing the polar factor of the gradient matrix. For a gradient matrix G with thin Singular Value Decomposition (SVD) G = UΣVᵀ, Muon computes polar(G) = UVᵀ, which is the closest semi-orthogonal matrix to G in the Frobenius norm. This orthogonalized gradient is then used to update the weights: W ← W − η UVᵀ for a learning rate η. The use of matmul-only iterative algorithms to compute the polar factor is what makes Muon practical at scale.
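
As a minimal sketch of this update rule (assuming PyTorch, with a direct SVD standing in for the Newton-Schulz iteration, and with illustrative shapes and learning rate):

```python
import torch

def polar_factor(G: torch.Tensor) -> torch.Tensor:
    # Reference polar factor via thin SVD: G = U S V^T  ->  polar(G) = U V^T.
    # Muon approximates this with a matmul-only Newton-Schulz iteration;
    # the SVD route is used here purely for clarity.
    U, _, Vh = torch.linalg.svd(G, full_matrices=False)
    return U @ Vh

# Muon-style weight update: W <- W - lr * polar(G)
W = torch.randn(4096, 1024)   # a tall (m x n, m > n) weight matrix
G = torch.randn_like(W)       # stand-in for the gradient of the loss w.r.t. W
lr = 0.02                     # illustrative learning rate
W = W - lr * polar_factor(G)
```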

The NorMuon Puzzle: Row Normalization Helps, But Why?

Before Aurora, NorMuon led the modded-nanoGPT speedrun. It introduced a row-normalization step, similar in spirit to Adam’s per-parameter scaling, that rescales each row of the polar factor by its inverse RMS norm. While this often pulls the update away from a strictly orthogonal matrix, NorMuon still delivered impressive results. The Tilde team set out to understand exactly what gap in Muon’s formulation NorMuon was addressing.

The Core Problem: Row-Norm Anisotropy and Neuron Death in Tall Matrices

The research team discovered that the Muon optimizer unintentionally “kills” a large portion of neurons in tall weight matrices, such as those found in SwiGLU-based MLP layers. The polar factor of a tall gradient matrix does not, in general, spread its row norms evenly, so the optimizer gives massive updates to some neurons while virtually ignoring others. The result is a “death spiral” in which under-performing neurons receive less and less signal over time and eventually become permanently inactive.

The study revealed that by the 500th training step, more than one in four neurons are effectively dead. This is not just a local issue: the inactivity of these neurons starves subsequent layers of signal, spreading the inefficiency throughout the model. Aurora solves this with a new mathematical formulation that enforces uniform updates across all neurons without sacrificing the benefits of orthogonalization.
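
To make the phenomenon concrete, here is an illustrative diagnostic (the metric and threshold are our assumptions, not the paper’s exact methodology): compute the row leverage scores of a tall weight matrix and compare them against the uniform target n/m.

```python
import torch

def row_leverage_scores(W: torch.Tensor) -> torch.Tensor:
    # Leverage score of row i of a tall (m x n) matrix W = U S V^T:
    # the squared norm of the i-th row of U. The scores sum to n, so a
    # perfectly isotropic profile assigns n/m to every row (neuron).
    U, _, _ = torch.linalg.svd(W, full_matrices=False)
    return (U ** 2).sum(dim=1)

def fraction_collapsed(W: torch.Tensor, rel_tol: float = 1e-3) -> float:
    # Hypothetical "dead neuron" check: the fraction of rows whose
    # leverage has fallen far below the isotropic target n/m.
    m, n = W.shape
    return (row_leverage_scores(W) < rel_tol * n / m).float().mean().item()

W = torch.randn(4096, 1024)
print(fraction_collapsed(W))  # near 0.0 for a random matrix; grows as neurons die
```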

Before arriving at Aurora, the paper introduces an intermediate fix called U-NorMuon. The key observation is that NorMuon normalizes each row to unit norm, but this is the wrong target for a tall matrix: for an m × n matrix with orthonormal columns, the uniform row norm consistent with orthogonality is √(n/m), not 1. U-NorMuon therefore normalizes tall-matrix rows to norm √(n/m) instead of 1.
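
A minimal sketch of the two normalization targets (again using an SVD in place of Newton-Schulz; `normalized_polar` is a hypothetical helper, not the released implementation):

```python
import torch

def normalized_polar(G: torch.Tensor, target: float) -> torch.Tensor:
    # Orthogonalize as in Muon, then rescale every row to `target` norm.
    U, _, Vh = torch.linalg.svd(G, full_matrices=False)
    P = U @ Vh
    row_norms = P.norm(dim=1, keepdim=True).clamp_min(1e-12)
    return P * (target / row_norms)

G = torch.randn(4096, 1024)                              # tall (m x n) gradient
m, n = G.shape
normuon_update   = normalized_polar(G, 1.0)              # NorMuon: unit row norms
u_normuon_update = normalized_polar(G, (n / m) ** 0.5)   # U-NorMuon: sqrt(n/m) row norms
```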

In experiments at 340M scale, U-NorMuon outperforms both Muon and standard NorMuon and completely eliminates the neuron death phenomenon — leverage scores become approximately isotropic throughout training. Crucially, U-NorMuon propagates this benefit to layers it doesn’t directly touch: keeping up/gate rows alive ensures isotropic gradient flow into the down-projection, stabilizing its column leverage without any direct intervention.

However, U-NorMuon still has a problem: it forcefully overrides the polar factor with uniform row norms, sacrificing polar factor precision, which is both theoretically undesirable and empirically costly in the Muon framework (the paper shows that Muon achieves monotonically lower loss with more precise orthogonalization). This is the motivation for Aurora.

Aurora: Steepest Descent Under Two Joint Constraints

Aurora reformulates the update-selection problem from scratch. Rather than running orthogonalization and then patching it with row normalization, Aurora asks: what is the optimal update under the joint constraint of left semi-orthogonality and uniform row norms?

Formally, for tall matrices, Aurora solves:

U* = argmax_U Tr(GᵀU)   s.t.   UᵀU = Iₙ,   ‖Uᵢ:‖ = √(n/m) for all i

The research shows that these two constraints together force all singular values of U to exactly equal 1. This means the joint constraint still produces a valid left semi-orthogonal update, not a compromised one. This is the key insight that separates Aurora from NorMuon and U-NorMuon: it achieves row-norm uniformity and orthogonality simultaneously rather than trading one off against the other.
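
A quick numerical check (this construction is ours, for illustration only, and is not the paper’s algorithm) shows that the two constraints are simultaneously satisfiable: the first n columns of a scaled Hadamard matrix are column-orthonormal, every squared row norm equals n/m, and all singular values are exactly 1.

```python
import torch

def hadamard(k: int) -> torch.Tensor:
    # Sylvester construction of a 2^k x 2^k Hadamard matrix (entries +-1).
    H = torch.tensor([[1.0]])
    for _ in range(k):
        H = torch.cat([torch.cat([H,  H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H

m, n = 16, 4
U = hadamard(4)[:, :n] / m ** 0.5                 # m x n with orthonormal columns

print(torch.allclose(U.T @ U, torch.eye(n)))      # True: U^T U = I_n
print((U ** 2).sum(dim=1))                        # every squared row norm is n/m = 0.25
print(torch.linalg.svdvals(U))                    # all singular values equal 1
```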

The paper also provides two algorithmic implementations of Aurora’s solution. Riemannian Aurora uses a gradient-projection approach restricted to the joint Stiefel/equal-row-leverage manifold, while vanilla Aurora is a simpler, more practical implementation. Both are open-sourced. For non-tall (wide and square) matrices, row-norm uniformity is already implied by orthogonality, so Aurora leaves those parameters unchanged.

Results

Aurora was used to train a 1.1B model that achieves 100x data efficiency on open-source internet data and outperforms larger models on general evals like HellaSwag. At 1B scale, Aurora achieves large gains over both Muon and NorMuon. On the modded-nanoGPT optimization speedrun, Aurora’s submitted run outperforms the prior state-of-the-art (which was NorMuon). Untuned Aurora carries only a 6% compute overhead over traditional Muon and is designed as a drop-in replacement.

The research team also found that Aurora’s performance gains scale with MLP width, suggesting it is particularly effective for networks with large MLP expansion factors — which is consistent with the neuron death hypothesis, since wider MLPs have more tall matrices and more opportunity for leverage anisotropy to compound.

Key Takeaways

  • Muon’s polar factor update inherits row-norm anisotropy on tall matrices, causing over 25% of MLP neurons to permanently die as early as step 500 of training.
  • Aurora solves this by finding the optimal update under a joint constraint of left semi-orthogonality and uniform row norms — achieving both simultaneously rather than trading one off against the other.
  • At 1.1B scale, Aurora achieves 100x data efficiency on open-source internet data, outperforms larger models on HellaSwag, and sets a new SoTA on the modded-nanoGPT speedrun.
  • Aurora is a near-drop-in replacement for Muon with only 6% compute overhead, and its gains scale with MLP width.

Check out the Paper and GitHub Repo.


