The Curse of Depth in Large Language Models (arxiv.org)

submitted 2 days ago by yogthos@lemmygrad.ml to c/technology@hexbear.net


There’s a really interesting quirk in modern LLM architectures that a lot of people have been noticing lately, and the paper calls it the Curse of Depth. Basically, if you look at popular models like Llama, Qwen, or DeepSeek, you will find that the deeper layers contribute surprisingly little. You can prune away large chunks of the later transformer blocks without hurting the model's performance much. The representations in these deep layers end up looking nearly identical to each other, and it’s a massive waste of GPU hours because we are training billions of parameters that end up doing almost nothing.

The authors trace the root cause to Pre-Layer Normalization (Pre-LN). Pre-LN makes training large transformers far more stable than the older Post-LN setups, but the catch is that as data passes through more and more Pre-LN layers, the output variance grows exponentially with depth. That growing variance pushes the derivative of each deep block toward the identity matrix, which turns the block into a pass-through that barely learns any new transformation.
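One way to see why growing variance pushes a block toward the identity is the simplified sketch below. It is not the paper's full derivation. The notation is assumed here: $F$ stands in for an attention or MLP sublayer, $\mathrm{LN}$ is layer normalization, and $\sigma_{x_\ell}$ is the standard deviation of the hidden state entering layer $\ell$.

```latex
x_{\ell+1} = x_\ell + F\big(\mathrm{LN}(x_\ell)\big)
\qquad\Longrightarrow\qquad
\frac{\partial x_{\ell+1}}{\partial x_\ell}
  = I + \frac{\partial F}{\partial\,\mathrm{LN}(x_\ell)}\,
        \frac{\partial\,\mathrm{LN}(x_\ell)}{\partial x_\ell},
\qquad
\frac{\partial\,\mathrm{LN}(x_\ell)}{\partial x_\ell}
  = \mathcal{O}\!\left(\frac{1}{\sigma_{x_\ell}}\right)
```

Layer normalization divides by $\sigma_{x_\ell}$, so its Jacobian shrinks as the hidden-state variance grows. Assuming the Jacobian of $F$ stays bounded, the second term fades with depth and the whole block's derivative approaches $I$.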

It turns out the problem can be fixed with a remarkably simple tweak called Layer Norm Scaling. They literally just scale the output of each layer norm by the inverse square root of that layer's depth, so layer l gets multiplied by 1/√l. This keeps the variance from blowing up as you go deeper into the network. Because the variance stays under control, the deep layers actually wake up and start contributing to representation learning.
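Here is a minimal pure-Python sketch of the scaling described above, plus a toy residual stack to watch the variance. Everything in it is an illustrative assumption rather than the paper's code: the function names are made up, the layer norm has no learnable gain or bias, and the sublayer is a random, untrained linear map. With random weights the toy only shows variance piling up block after block, not the exponential growth reported for real models.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (almost exactly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def scaled_layer_norm(x, layer_index):
    """Layer Norm Scaling: multiply the LN output by 1/sqrt(layer_index), 1-based."""
    scale = 1.0 / math.sqrt(layer_index)
    return [scale * v for v in layer_norm(x)]


def variance(x):
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


def run_stack(depth, dim, use_scaling, seed=0):
    """Push one random vector through `depth` toy Pre-LN residual blocks."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for layer in range(1, depth + 1):
        # Hypothetical stand-in for attention/MLP: a random linear map.
        w = [[rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)]
             for _ in range(dim)]
        h = scaled_layer_norm(x, layer) if use_scaling else layer_norm(x)
        f = [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) for row in w]
        x = [x_i + f_i for x_i, f_i in zip(x, f)]  # residual connection
    return variance(x)


if __name__ == "__main__":
    print("plain Pre-LN variance:", run_stack(48, 64, use_scaling=False))
    print("with scaling variance:", run_stack(48, 64, use_scaling=True))
```

In this toy, each plain Pre-LN block adds roughly a constant amount of variance to the residual stream, while the scaled version adds about 1/l at layer l, so the final variance stays much smaller.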

They tested this trick on models ranging from small 130M-parameter setups all the way up to 7B parameters. In their experiments, Layer Norm Scaling consistently beat standard Pre-LN and the other normalization tricks they compared against. The pre-training loss drops, and those gains carry over into supervised fine-tuning tasks. Best of all, it requires zero new hyperparameters or learnable weights. It is just a clean mathematical fix to a fundamental architectural flaw.


[–] yogthos@lemmygrad.ml 5 points 1 day ago (1 children)

For existing models, the paper is basically handing us a free lunch for inference. Because it shows that the deep layers in standard Pre-LN models are doing almost nothing, you can literally just chop those layers off. You can take a fully trained model like DeepSeek or Qwen and drop a bunch of the later transformer blocks, and the paper shows that doing this without any fine-tuning barely hurts benchmark performance. That means we can immediately save memory and speed up inference, which is great news for running local models.

For new models, since the scaling forces every single layer to pull its weight, you get more capability out of the exact same parameter count. So you should also be able to train a significantly smaller model that matches the performance of a much larger baseline.
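To make the "chop those layers off" idea concrete, here is a toy sketch in plain Python. It is not how you would prune a real checkpoint: the blocks are hypothetical stand-in functions, the late ones are hard-coded to be near-identity to mimic the behavior described above, and `prune_deep_blocks`, `make_block`, and `forward` are names invented for this example.

```python
import math
from typing import Callable, List, Sequence

Block = Callable[[List[float]], List[float]]


def make_block(update: float) -> Block:
    """Toy residual block: x + update * tanh(x). A tiny `update` is near-identity."""
    return lambda x: [v + update * math.tanh(v) for v in x]


def prune_deep_blocks(blocks: Sequence[Block], num_to_drop: int) -> List[Block]:
    """Return a copy of the stack with the last `num_to_drop` blocks removed."""
    if not 0 <= num_to_drop <= len(blocks):
        raise ValueError("num_to_drop must be between 0 and len(blocks)")
    return list(blocks[: len(blocks) - num_to_drop])


def forward(blocks: Sequence[Block], x: List[float]) -> List[float]:
    """Run the input through every block in order."""
    for block in blocks:
        x = block(x)
    return x


# Four early blocks that change the input a lot, four late ones that barely do.
full_model = [make_block(0.5)] * 4 + [make_block(0.001)] * 4
pruned_model = prune_deep_blocks(full_model, num_to_drop=4)

inputs = [0.3, -1.2, 2.0]
full_out = forward(full_model, inputs)
pruned_out = forward(pruned_model, inputs)
max_diff = max(abs(a - b) for a, b in zip(full_out, pruned_out))

if __name__ == "__main__":
    print("blocks kept:", len(pruned_model), "of", len(full_model))
    print("largest output difference:", max_diff)
```

With a real model the mechanics depend on the framework, and you would still want to rerun your own benchmarks after dropping layers to see how much quality you actually lose.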

[–] RNAi@hexbear.net 1 points 1 day ago

Shit, that's good news for AI stonks sicko-wistful