The field of multimodal representation learning is moving toward more efficient and effective methods for learning semantic embeddings. Recent work focuses on novel optimization frameworks, pre-training paradigms, and demonstration selection methods that improve the performance of vision-language models (VLMs). Notably, researchers are exploring ways to decouple complementary objectives in contrastive learning, enabling simultaneous optimization of multiple tasks, and there is growing interest in more accurate and efficient estimation of the normalization term in contrastive loss functions. Together, these advances yield improved performance across a range of vision-language tasks, including cross-modal retrieval, clustering, and classification.

Noteworthy papers include NeuCLIP, which proposes a novel optimization framework for efficient large-scale CLIP training; CoMa (Compression then Matching), which introduces a compressed pre-training phase for transforming VLMs into competitive embedding models and achieves new state-of-the-art results among VLMs of comparable size on the MMEB benchmark; and Efficient and Effective In-context Demonstration Selection with Coreset, which significantly improves in-context learning (ICL) performance compared with existing selection strategies.
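To make the "normalization term" concrete, the sketch below writes a CLIP-style symmetric InfoNCE loss so that the per-sample log-partition is explicit, and pairs it with a toy moving-average estimator of that partition maintained across training steps. This is a minimal illustration of the general idea of estimating the normalization term rather than NeuCLIP's actual algorithm; the function and class names (`clip_loss_with_explicit_normalization`, `RunningNormalizationEstimate`) and the simple momentum update are assumptions made for clarity.

```python
# Illustrative sketch (not NeuCLIP's algorithm): CLIP-style InfoNCE loss with the
# normalization (log-partition) term made explicit, plus a toy running estimator.
import torch
import torch.nn.functional as F


def clip_loss_with_explicit_normalization(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (image, text) embeddings."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix

    pos = logits.diag()                           # positive pairs on the diagonal

    # Normalization (log-partition) terms: one per image row and per text column.
    log_z_img = torch.logsumexp(logits, dim=1)    # normalize over all texts
    log_z_txt = torch.logsumexp(logits, dim=0)    # normalize over all images

    # InfoNCE in each direction = -(positive similarity) + (normalization term).
    loss_i2t = (-pos + log_z_img).mean()
    loss_t2i = (-pos + log_z_txt).mean()
    return 0.5 * (loss_i2t + loss_t2i)


class RunningNormalizationEstimate:
    """Toy per-sample moving-average estimate of the partition function.

    The exact log-partition above is computed only over the current mini-batch,
    so it is biased when batches are small; estimation-based methods instead
    maintain a per-sample estimate refined across steps. This class is a generic
    moving-average illustration, not a specific published estimator.
    """

    def __init__(self, num_samples, momentum=0.9):
        self.z = torch.ones(num_samples)          # estimated partition per sample
        self.momentum = momentum

    def update(self, sample_ids, batch_partition):
        # batch_partition: exp(logsumexp) computed on the current mini-batch rows.
        self.z[sample_ids] = (
            self.momentum * self.z[sample_ids]
            + (1.0 - self.momentum) * batch_partition.detach()
        )
        return self.z[sample_ids]
```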
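Similarly, the following sketch illustrates one generic way a coreset could be used for demonstration selection: a greedy k-center pass picks a small, diverse pool of candidate demonstrations, and in-context examples for each query are then retrieved from that pool. It is an assumption-based illustration, not the selection procedure of the cited paper; the function names, the cosine-distance choice, and the nearest-neighbor retrieval step are all hypothetical.

```python
# Illustrative sketch (not the cited paper's method): greedy k-center coreset
# over demonstration embeddings, then per-query retrieval from the coreset.
import numpy as np


def greedy_kcenter_coreset(embeddings, k, seed=0):
    """Pick k demonstration indices that cover the embedding space."""
    rng = np.random.default_rng(seed)
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    selected = [int(rng.integers(len(emb)))]       # arbitrary first center
    # Cosine distance from every candidate to its nearest selected center.
    dist = 1.0 - emb @ emb[selected[0]]

    for _ in range(k - 1):
        nxt = int(np.argmax(dist))                 # farthest-point heuristic
        selected.append(nxt)
        dist = np.minimum(dist, 1.0 - emb @ emb[nxt])
    return selected


def pick_demonstrations(query_emb, embeddings, coreset, n_shots=4):
    """Return the n_shots coreset members most similar to the query."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = emb[coreset] @ q
    order = np.argsort(-sims)[:n_shots]
    return [coreset[i] for i in order]
```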