Natural language processing and multimodal learning are moving toward efficient model compression and scalable architectures. Recent work focuses on reducing the computational cost and memory footprint of large language models while preserving their performance, using techniques such as layer concatenation, token pruning, and knowledge distillation. There is also growing interest in multimodal learning, where models process and integrate multiple forms of input, such as text and images. Noteworthy papers in this area include Layer as Puzzle Pieces, which proposes a progressive layer pruning framework, and ParaFormer, which introduces a shallow parallel Transformer architecture. Other notable works include FrugalPrompt, which reduces contextual overhead in large language models, and VisionSelector, which optimizes embedding efficiency for scalable ID-based models.
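
To make one of the named compression techniques concrete, the following is a minimal, generic sketch of knowledge distillation, not the method of any paper cited above: the student is trained on a blend of the usual task loss and a KL-divergence term that pulls its predictions toward the teacher's softened output distribution. The temperature, the loss weighting `alpha`, and the toy 10-class setup are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Combine cross-entropy on labels with KL divergence to the teacher."""
    # Soften both distributions with the temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale the KL term by T^2 so its gradient magnitude matches the hard loss.
    kd = F.kl_div(log_soft_student, soft_targets,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage with random logits for a 10-class problem.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

The temperature softens the teacher's distribution so the student can learn from the relative similarities the teacher assigns to non-target classes; in practice both the temperature and `alpha` are tuned per task.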