this post was submitted on 13 Jun 2026

technology


There’s a really interesting quirk in modern LLM architecture that a lot of people have been noticing lately, which the paper calls the Curse of Depth. Basically, if you look at popular models like Llama, Qwen, or DeepSeek, you will find that the deeper layers are surprisingly useless. You can prune away huge chunks of the later transformer blocks without really hurting the performance of the model. The representations in these deep layers end up looking practically identical to each other, and it’s a massive waste of GPU hours because we are training billions of parameters that end up doing almost nothing.

The authors trace the root cause directly to Pre-Layer Normalization (Pre-LN). Pre-LN makes training massive transformers way more stable than the old Post-LN setups, but the catch is that as you pass data through more and more Pre-LN layers, the output variance grows exponentially with depth. Because of how the math works out, this exploding variance pushes the derivatives of the deep blocks toward an identity matrix, which turns each of those layers into a pass-through that barely learns any new transformation.
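To get a feel for the pass-through effect, here is a toy sketch in plain Python. It is not the paper's derivation: the weights `W`, the `block` function, and the tanh sublayer are all made up for illustration. It only relies on the fact that layer norm ignores the overall scale of its input, so the bigger the residual stream gets, the closer the block's Jacobian sits to the identity.

```python
import math

# Hypothetical fixed weights for a toy sublayer; not taken from any real model.
W = [[0.3, -0.2, 0.5],
     [0.1, 0.4, -0.3],
     [-0.6, 0.2, 0.1]]

def layer_norm(x, eps=1e-5):
    """Plain layer norm over a list of floats (no learnable gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def block(x):
    """Toy Pre-LN residual block: x + tanh(W @ LN(x))."""
    h = layer_norm(x)
    f = [math.tanh(sum(w * v for w, v in zip(row, h))) for row in W]
    return [xi + fi for xi, fi in zip(x, f)]

def gap_from_identity(x, rel_step=1e-4):
    """Largest entry of |J - I|, with the Jacobian J estimated by central differences."""
    n = len(x)
    d = rel_step * max(abs(v) for v in x)
    worst = 0.0
    for j in range(n):
        up = list(x)
        up[j] += d
        down = list(x)
        down[j] -= d
        hi, lo = block(up), block(down)
        for i in range(n):
            deriv = (hi[i] - lo[i]) / (2 * d)
            target = 1.0 if i == j else 0.0
            worst = max(worst, abs(deriv - target))
    return worst

# Same input direction at three magnitudes, standing in for growing variance.
base = [0.5, -1.0, 2.0]
gaps = {c: gap_from_identity([c * v for v in base]) for c in (1, 10, 100)}
```

Printing `gaps` should show the distance from the identity shrinking roughly in proportion to the input scale, which is the same mechanism the paper blames for the dead deep layers, just in three dimensions instead of thousands.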

It turns out the problem can be fixed with a remarkably simple tweak called Layer Norm Scaling. They literally just scale the output of the layer norm by the inverse square root of the layer’s depth, so layer l gets its layer norm output divided by sqrt(l). This keeps the variance from blowing up as you go deeper into the network. Because the variance stays under control, the deep layers actually wake up and start contributing to the representation learning.
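A minimal sketch of the idea in plain Python, assuming layers are counted from 1 and ignoring the learnable gain and bias a real layer norm would have. The function names here are my own, not the paper's code.

```python
import math

def layer_norm(x, eps=1e-5):
    """Plain layer norm over a list of floats (no learnable gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def scaled_layer_norm(x, layer_index, eps=1e-5):
    """Layer norm output divided by sqrt(layer_index); layer_index starts at 1 (assumption)."""
    if layer_index < 1:
        raise ValueError("layer_index must be >= 1")
    scale = 1.0 / math.sqrt(layer_index)
    return [v * scale for v in layer_norm(x, eps)]

def variance(x):
    """Population variance of a list of floats."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)

x = [1.0, 2.0, 3.0, 4.0]
var_layer_1 = variance(scaled_layer_norm(x, 1))
var_layer_4 = variance(scaled_layer_norm(x, 4))
```

At layer 1 nothing changes, and at layer 4 the normalized output has about a quarter of the variance, so each deeper layer feeds a progressively smaller signal into its sublayer. In a real model this would be one extra multiply per layer norm, with no new parameters.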

They tested this trick on models ranging from tiny 130M parameter setups all the way to 7B parameter models. In every case, Layer Norm Scaling beat out standard Pre-LN and the other normalization tricks they compared against. The pre-training loss drops significantly, and those gains carry right over into supervised fine-tuning tasks. Best of all, it requires zero new hyperparameters or learnable weights. It is just a clean mathematical fix to a fundamental architectural flaw.

[–] yogthos@lemmygrad.ml 5 points 1 day ago (1 children)

For existing models, the paper is basically handing us a free lunch for inference. Because they showed that the deep layers in standard models are doing almost nothing, you can literally just chop those layers off. You can take a fully trained model like DeepSeek or Qwen and drop a bunch of the later transformer blocks, and the paper shows that doing this without any fine-tuning barely hurts benchmark performance. Deleting the useless layers means we can immediately save memory and speed up inference, which is great news for running local models. For new models it works the other way around: since the scaling forces every single layer to pull its weight, you get way more capability out of the exact same parameter count, so you can train a significantly smaller model that matches the performance of a much larger baseline.
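Here is a toy sketch of what chopping off the later blocks looks like, in plain Python. The "model" is just a list of functions with hand-picked numbers where the late blocks barely change their input, so it says nothing about how a real checkpoint behaves. `drop_last_blocks` and `run` are hypothetical helpers, not any library's API.

```python
def drop_last_blocks(blocks, n_drop):
    """Return a new list of blocks with the last n_drop removed."""
    if not 0 <= n_drop < len(blocks):
        raise ValueError("n_drop must leave at least one block")
    return list(blocks[: len(blocks) - n_drop])

def run(blocks, x):
    """Apply the blocks in order, like a forward pass through a stack of layers."""
    for block in blocks:
        x = block(x)
    return x

# Toy stand-in: early blocks change the input a lot, late ones barely at all.
shifts = (1.0, 0.5, 0.25, 0.001, 0.001, 0.001)
blocks = [lambda x, s=s: x + s for s in shifts]

full_output = run(blocks, 0.0)
pruned_blocks = drop_last_blocks(blocks, 3)
pruned_output = run(pruned_blocks, 0.0)
kept_fraction = len(pruned_blocks) / len(blocks)
```

In this toy setup half the blocks are gone and the output hardly moves. With a real model you would do the same kind of slicing on its list of transformer blocks and then check the benchmarks yourself, since how many layers you can drop will depend on the model.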

[–] RNAi@hexbear.net 1 points 1 day ago

Shit, that's good news for AI stonks