A new paper from Moonshot AI tackles a key bottleneck in how language models handle depth. Standard residual connections just add up the outputs of all previous layers using fixed uniform weights, and uniform addition creates a problem where hidden states grow uncontrollably as the network gets deeper. As a result, the contributions of early layers end up getting completely buried and diluted by the time the data reaches the end of the model.
This happens to be the exact same issue older recurrent neural networks faced over time before attention mechanisms came along. Naturally, they tackle the problem in a similar way using attention residuals instead of a fixed accumulation and applying a softmax attention mechanism over the outputs of preceding layers. Now, every single layer gets a learned pseudo query vector that lets it selectively pick and choose which earlier representations it actually needs to look at. This allows the network to naturally retrieve information from anywhere in its depth depending on the specific input.
However, applying this over every individual layer is called Full AttnRes and it comes with a massive catch which is that saving all those individual layer outputs creates memory and communication bottlenecks during large scale distributed training because the overhead scales linearly with the number of layers. So, in order to make the architecture actually usable they grouped the layers into chunks and summed up the outputs inside each block. The cross layer attention is then only applied over these compressed block level summaries rather than every single layer drastically reducing the memory and communication footprint.
By combining a block structure with a smart cross stage caching system and a two phase computation strategy the setup becomes a practical drop in replacement with practically zero training overhead. Their experimental results show that the performance boost holds up consistently across different model sizes.