Stability and Advances in Transformer Architecture

The field of Transformer research is moving toward a deeper understanding of the stability and training dynamics of these models. Recent studies have focused on how layer normalization placement affects training stability and on new architectural components that improve performance. One key line of work integrates self-attention with convolutional mechanisms, yielding more adaptive and effective models. Researchers are also exploring new positional encoding methods, which are critical for modeling complex structural relationships in data. Noteworthy papers in this area include:

Stability of Transformers under Layer Normalization: a principled study of the forward and backward stability of Transformers under different layer normalization placements.

Translution: a new operation that unifies self-attention and convolution for adaptive and relative modeling.

Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation: a head-wise adaptive extension of Rotary Position Embedding for fine-grained image generation.

Discursive Circuits: an investigation of which components of transformer language models are responsible for discourse understanding.

Deconstructing Attention: a systematic deconstruction of attention via controlled variants that selectively relax key design principles.

Chinese ModernBERT with Whole-Word Masking: a from-scratch Chinese encoder that couples a hardware-aware vocabulary, whole-word masking, and a two-stage pre-training pipeline.
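
To make the layer normalization discussion concrete, here is a minimal PyTorch sketch contrasting the two common placements: Post-LN (normalize after the residual addition) and Pre-LN (normalize before each sublayer). The module sizes and the GELU feed-forward are illustrative defaults, not details taken from the paper.

```python
# Minimal sketch of Post-LN vs Pre-LN Transformer blocks.
# Dimensions and sublayer choices are illustrative, not from the paper.
import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    """Original placement: LayerNorm is applied after each residual addition."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        return self.norm2(x + self.ffn(x))

class PreLNBlock(nn.Module):
    """Pre-LN placement: LayerNorm is applied before each sublayer, leaving
    the residual path itself unnormalized."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))

x = torch.randn(2, 16, 512)  # (batch, sequence, d_model)
print(PostLNBlock()(x).shape, PreLNBlock()(x).shape)
```

The stability paper analyzes how such placement choices affect the forward and backward behavior of deep Transformer stacks.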
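Rotary Position Embedding (RoPE) rotates each (even, odd) channel pair of the queries and keys by a position-dependent angle. The sketch below adds a learnable per-head frequency scale as one hypothetical way to make the rotation head-wise adaptive; it illustrates the general idea only and is not the mechanism proposed in the paper.

```python
# Sketch of RoPE with a per-head frequency scale.
# `head_scale` is a hypothetical illustration of head-wise adaptation,
# not the mechanism from the paper.
import torch
import torch.nn as nn

class HeadwiseRoPE(nn.Module):
    def __init__(self, n_heads=8, head_dim=64, base=10000.0):
        super().__init__()
        exponent = torch.arange(0, head_dim, 2).float() / head_dim
        self.register_buffer("inv_freq", 1.0 / (base ** exponent))  # (head_dim/2,)
        # Hypothetical: one learnable frequency multiplier per attention head.
        self.head_scale = nn.Parameter(torch.ones(n_heads))

    def forward(self, x):
        # x: (batch, n_heads, seq_len, head_dim) -- queries or keys.
        seq_len = x.shape[2]
        pos = torch.arange(seq_len, device=x.device).float()
        # Rotation angles per head: (n_heads, seq_len, head_dim/2).
        ang = torch.einsum("h,s,d->hsd", self.head_scale, pos, self.inv_freq)
        cos, sin = ang.cos()[None], ang.sin()[None]  # broadcast over batch
        x1, x2 = x[..., 0::2], x[..., 1::2]
        # Standard 2-D rotation applied to each (even, odd) channel pair.
        rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
        return rotated.flatten(-2)

rope = HeadwiseRoPE()
q = torch.randn(2, 8, 16, 64)
print(rope(q).shape)  # torch.Size([2, 8, 16, 64])
```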
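Whole-word masking selects whole words for the masked language modeling objective and masks every subword piece of a chosen word together, rather than masking pieces independently. The sketch below illustrates the idea on WordPiece-style tokens with '##' continuations; the masking rate and grouping rule are illustrative assumptions, and a Chinese setup would typically group characters with a word segmenter instead.

```python
# Sketch of whole-word masking: when a word is chosen, all of its subword
# pieces are masked together. Rate and tokenization are illustrative, not
# the exact recipe from the Chinese ModernBERT paper.
import random

def whole_word_mask(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """`tokens` are WordPiece-style pieces where '##' marks a continuation."""
    rng = random.Random(seed)
    # Group token indices into whole words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    labels = [None] * len(tokens)
    for word in words:
        if rng.random() < mask_rate:
            for i in word:  # mask all pieces of the word at once
                labels[i] = tokens[i]
                masked[i] = mask_token
    return masked, labels

pieces = ["the", "trans", "##form", "##er", "encodes", "sub", "##words"]
print(whole_word_mask(pieces, mask_rate=0.5))
```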

Sources

Stability of Transformers under Layer Normalization

Translution: Unifying Self-attention and Convolution for Adaptive and Relative Modeling

Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation

Discursive Circuits: How Do Language Models Understand Discourse Relations?

Deconstructing Attention: Investigating Design Principles for Effective Language Modeling

Chinese ModernBERT with Whole-Word Masking
