The field of multimodal learning is advancing rapidly, with a focus on improving the performance and efficiency of large language models and vision transformers. Recent work highlights the importance of aligning visual and language representations, and several papers propose methods for achieving this alignment. One key direction is the development of more effective tokenization strategies, such as Subpixel Placement of Tokens (SPoT), which allows tokens to be positioned continuously within an image rather than on a fixed patch grid (a sketch of this idea appears below). Another active area is the refinement of attention mechanisms, including attention ablation techniques that suppress detrimental attention heads (also sketched below). There is also growing interest in alternative pretraining objectives such as Causal Language Modeling (CLM), which has been shown to be more data-efficient and stable than traditional Masked Language Modeling (MLM) approaches.

Noteworthy papers include Grounding-Aware Token Pruning, which proposes a simple yet effective adjustment to position IDs to recover from drastic performance drops in visual grounding; VisionDrop, which introduces a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal attention (a combined pruning sketch follows below); and LaCo, a framework for layer-wise compression of visual tokens that reduces computational cost while maintaining strong performance.
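To make the tokenization idea concrete, here is a minimal sketch of continuous token placement: instead of slicing an image into a fixed patch grid, patch embeddings are sampled at fractional (subpixel) coordinates with bilinear interpolation. This is an illustrative reading of the idea, not the SPoT implementation; the function name, tensor shapes, and use of torch.nn.functional.grid_sample are assumptions.

```python
import torch
import torch.nn.functional as F

def sample_tokens_at_subpixel_positions(feature_map: torch.Tensor,
                                         positions: torch.Tensor) -> torch.Tensor:
    """Sample token embeddings at continuous (subpixel) image positions.

    feature_map: (B, C, H, W) dense features, e.g. from a conv stem.
    positions:   (B, N, 2) continuous (x, y) coordinates in [0, 1].
    Returns:     (B, N, C) token embeddings, one per requested position.
    """
    # grid_sample expects coordinates in [-1, 1], shaped (B, N, 1, 2).
    grid = positions * 2.0 - 1.0
    grid = grid.unsqueeze(2)                       # (B, N, 1, 2)
    sampled = F.grid_sample(feature_map, grid,
                            mode="bilinear", align_corners=False)
    return sampled.squeeze(-1).permute(0, 2, 1)    # (B, N, C)

# Example: 64 tokens placed at arbitrary fractional positions,
# rather than on a fixed patch grid.
feats = torch.randn(2, 256, 32, 32)
pos = torch.rand(2, 64, 2)
tokens = sample_tokens_at_subpixel_positions(feats, pos)
print(tokens.shape)  # torch.Size([2, 64, 256])
```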
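The attention-ablation direction can likewise be illustrated with a small sketch: a self-attention layer whose per-head outputs are multiplied by a binary mask, zeroing heads deemed detrimental. How such heads are identified is out of scope here; the module below is a generic head-masking sketch, not the specific procedure from the cited work.

```python
import torch
import torch.nn as nn

class HeadAblatedSelfAttention(nn.Module):
    """Multi-head self-attention with a per-head ablation mask.

    Outputs of heads listed in `ablated_heads` are zeroed before the
    output projection, a minimal way to suppress heads identified as
    detrimental (the head-selection criterion is not shown).
    """
    def __init__(self, dim: int, num_heads: int, ablated_heads=()):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        mask = torch.ones(num_heads)
        mask[list(ablated_heads)] = 0.0
        # Broadcastable over (B, H, N, head_dim).
        self.register_buffer("head_mask", mask.view(1, num_heads, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)         # each (B, H, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)
        out = attn @ v                                # (B, H, N, head_dim)
        out = out * self.head_mask                    # zero out ablated heads
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Example: suppress heads 2 and 5 out of 8.
layer = HeadAblatedSelfAttention(dim=256, num_heads=8, ablated_heads=(2, 5))
y = layer(torch.randn(4, 196, 256))
```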
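Finally, a minimal sketch in the spirit of training-free visual token pruning (as in VisionDrop) combined with the position-ID adjustment emphasized by Grounding-Aware Token Pruning: tokens are scored by the intra-modal attention they receive, the top fraction is kept, and the survivors retain their original position IDs rather than being renumbered. The scoring rule, keep ratio, and helper names are assumptions for illustration, not the papers' exact methods.

```python
import torch

def prune_visual_tokens(tokens: torch.Tensor,
                        attn: torch.Tensor,
                        keep_ratio: float = 0.5):
    """Keep the most-attended visual tokens, preserving their position IDs.

    tokens: (B, N, C) visual token embeddings.
    attn:   (B, H, N, N) intra-modal (vision-to-vision) attention weights.
    Returns (kept_tokens, kept_position_ids).
    """
    B, N, C = tokens.shape
    # Average attention received by each token over heads and queries:
    # one simple, training-free importance criterion.
    scores = attn.mean(dim=1).mean(dim=1)                            # (B, N)
    k = max(1, int(N * keep_ratio))
    keep_idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values    # (B, k)
    kept_tokens = torch.gather(tokens, 1,
                               keep_idx.unsqueeze(-1).expand(-1, -1, C))
    # Reuse the original position IDs of the surviving tokens instead of
    # renumbering them 0..k-1, keeping their spatial layout consistent
    # for downstream grounding.
    kept_position_ids = keep_idx
    return kept_tokens, kept_position_ids

# Example: prune half of 196 visual tokens.
toks = torch.randn(2, 196, 256)
attn = torch.softmax(torch.randn(2, 8, 196, 196), dim=-1)
kept, pos_ids = prune_visual_tokens(toks, attn, keep_ratio=0.5)
print(kept.shape, pos_ids.shape)  # (2, 98, 256) (2, 98)
```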