The field of large language models is moving toward more efficient multimodal processing, with a focus on reducing computational cost and improving performance on long-context tasks. Recent work has introduced novel frameworks for compressing and selecting tokens, including adaptive token compression, shot-aware token compression, and hierarchical token prepending; these methods deliver substantial efficiency gains, enabling models to handle longer inputs and more complex tasks. Notable papers include Virtual Width Networks, which decouples representational width from backbone width, and TimeAudio, which adds unique temporal markers to improve time-sensitive reasoning. In the same direction, OmniSparse introduces training-aware fine-grained sparse attention and CORE introduces compact object-centric representations, both targeting further gains in efficiency and performance.
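The token compression and selection methods mentioned above share a common core: score each token by some importance measure, keep only the highest-scoring fraction, and pass the reduced sequence downstream. The sketch below is a minimal, generic illustration of that idea, not the algorithm from any of the cited papers; the function name, the score values, and the `keep_ratio` parameter are all assumptions for demonstration.

```python
def compress_tokens(tokens, scores, keep_ratio=0.25):
    # Generic top-k token selection (illustrative only, not a specific
    # paper's method): rank positions by an importance score, keep the
    # top fraction, and restore the original sequence order so that
    # positional structure survives compression.
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep = sorted(top)  # original order, not score order
    return [tokens[i] for i in keep], keep

# Toy example: 16 tokens with made-up importance scores
# (in practice, scores might come from attention mass or feature norms).
tokens = [f"t{i}" for i in range(16)]
scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6,
          0.15, 0.95, 0.25, 0.85, 0.05, 0.75, 0.35, 0.65]
kept, idx = compress_tokens(tokens, scores, keep_ratio=0.25)
print(kept)  # ['t1', 't3', 't9', 't11']
```

The design choice worth noting is the final `sorted(top)`: selection happens in score space, but the surviving tokens are emitted in their original positions, which is what lets a downstream model still interpret the compressed sequence as ordered context.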