Technology


This is the official technology community of Lemmy.ml for all news related to creation and use of technology, and to facilitate civil, meaningful discussion around it.


Ask in DM before posting product reviews or ads. All such posts otherwise are subject to removal.


Rules:

1: All Lemmy rules apply

2: Do not post low effort posts

3: NEVER post naziped*gore stuff

4: Always post article URLs or their archived version URLs as sources, NOT screenshots. Help the blind users.

5: personal rants of Big Tech CEOs like Elon Musk are unwelcome (does not include posts about their companies affecting wide range of people)

6: no advertisement posts unless verified as legitimate and non-exploitative/non-consumerist

7: crypto related posts, unless essential, are disallowed

founded 7 years ago

There's a really interesting quirk in modern LLM architectures that a lot of people have been noticing lately, which the paper calls the Curse of Depth. Basically, if you look at popular models like Llama, Qwen, or DeepSeek, you will find that the deeper layers are surprisingly useless. You can prune away large chunks of the later transformer blocks without really hurting the model's performance. The representations in these deep layers end up looking practically identical to each other, and it's a massive waste of GPU hours because we are training billions of parameters that end up doing almost nothing.

The authors trace the root cause directly to Pre-Layer Normalization. Pre-LN makes training massive transformers way more stable than the old Post-LN setups, but the catch is that as you pass data through more and more Pre-LN layers, the output variance explodes exponentially. Because of how the math works out, this exploding variance pushes the derivatives of the deep blocks toward an identity matrix, turning each layer into a pass-through that cannot learn any meaningful new transformation.
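To see why the derivative drifts toward identity, here is a rough sketch in standard Pre-LN notation. The symbols $x_\ell$ (input to block $\ell$), $F_\ell$ (the attention or MLP sublayer) and $\sigma_\ell$ (standard deviation of $x_\ell$) are my own shorthand and not necessarily the paper's:

```latex
x_{\ell+1} = x_\ell + F_\ell\big(\mathrm{LN}(x_\ell)\big),
\qquad
\frac{\partial x_{\ell+1}}{\partial x_\ell}
  = I + \frac{\partial F_\ell\big(\mathrm{LN}(x_\ell)\big)}{\partial x_\ell}
```

Since LN divides by $\sigma_\ell$, the second term scales roughly like $1/\sigma_\ell$. As the variance grows with depth, that term fades and what's left is close to $I$.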

And it turns out the problem can be fixed with a remarkably simple tweak called Layer Norm Scaling. They literally just scale the output of the layer norm by the inverse square root of the layer depth. This stops the variance from blowing up as you go deeper into the network. Because the variance stays under control, the deep layers actually wake up and start contributing to representation learning.
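The scaling described above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions, not the authors' code: the function names are made up, the layer norm has no learnable gain or bias, and I'm assuming the layer index starts at 1.

```python
import math


def layer_norm(x, eps=1e-5):
    """Plain layer norm over a list of floats (no learnable gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def scaled_layer_norm(x, layer_index, eps=1e-5):
    """Layer norm output multiplied by 1/sqrt(layer_index).

    layer_index is assumed to be 1-based (hypothetical convention),
    so the first layer is left unscaled.
    """
    if layer_index < 1:
        raise ValueError("layer_index must be >= 1")
    scale = 1.0 / math.sqrt(layer_index)
    return [scale * v for v in layer_norm(x, eps)]


def variance(x):
    """Population variance, used here only to inspect the output."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)
```

With this, the normalized output at layer `l` has variance of roughly `1/l` instead of 1, so what each block feeds into the residual stream shrinks with depth. In a real model you would apply the same factor to the framework's layer norm output inside each block.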

They tested this trick on models ranging from tiny 130M-parameter setups all the way up to 7B-parameter models. In every case, Layer Norm Scaling beat out standard Pre-LN and other normalization tricks. The pre-training loss drops significantly, and those gains carry right over into supervised fine-tuning tasks. Best of all, it requires zero new hyperparameters or learnable weights. It is just a clean mathematical fix to a fundamental architectural flaw.


Anthropic said it will “abruptly disable” its most advanced AI models for all users after the US government ordered it to suspend access to the models for foreign nationals, citing national security concerns.

The company received the export control directive to suspend access to Fable 5 and Mythos 5 for all foreign nationals, without being given specific details of the national security concern, Anthropic said in a statement.

It is Anthropic’s understanding that the government believes there is a method of bypassing, or “jailbreaking”, a safeguard that would prevent Fable 5 from being used in identifying software vulnerabilities, the company said.

Opensource AI Must Win (opensourceaimustwin.com)
submitted 21 hours ago by yogthos@lemmy.ml to c/technology@lemmy.ml

cross-posted from: https://hexbear.net/post/8729236

A German court has ruled that Google is directly liable for what its AI search overviews say. Previous case law shielding search engine operators from liability doesn't apply to AI overviews.

