The field of cross-modal research is moving toward a deeper understanding of the relationships between modalities such as text, images, and music. Recent studies focus on developing more robust and effective methods for cross-modal retrieval, generation, and editing, where a key challenge is maintaining semantic consistency and understanding across modalities. Researchers are exploring new approaches, including multimodal learning, generative models, and semantic-enhanced frameworks. These innovations have the potential to improve the accuracy and effectiveness of cross-modal applications, enabling more intuitive and immersive interactions, and the development of frameworks like SemCORE and methods like SteerMusic is pushing the boundaries of what is possible in this area. Noteworthy papers include: SteerMusic, which proposes a novel approach to zero-shot text-guided music editing that enhances consistency between the original and edited music, and SemCORE, which introduces a semantic-enhanced generative cross-modal retrieval framework that achieves substantial improvements on benchmark datasets.
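To make the idea of cross-modal retrieval concrete, the sketch below shows the common embedding-space formulation: text and images are mapped into a shared space and ranked by cosine similarity. This is a minimal illustrative example using a CLIP-style model from the sentence-transformers library, not the SemCORE or SteerMusic method; the model name, file names, and query are assumptions for demonstration only.

```python
# Minimal illustration of embedding-based cross-modal retrieval:
# encode a text query and a set of images into a shared space,
# then rank the images by cosine similarity to the query.
# Generic sketch only; not the method of any specific paper cited above.

from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP-style model that maps both images and text into one embedding space.
model = SentenceTransformer("clip-ViT-B-32")

# Hypothetical local files standing in for an image collection.
image_paths = ["cat.jpg", "beach.jpg", "concert.jpg"]
image_embeddings = model.encode([Image.open(p) for p in image_paths])

# A free-form text query from the other modality.
query_embedding = model.encode("a crowd enjoying live music at night")

# Cosine similarity scores; higher means the image better matches the text.
scores = util.cos_sim(query_embedding, image_embeddings)[0].tolist()
for path, score in sorted(zip(image_paths, scores), key=lambda x: -x[1]):
    print(f"{path}: {score:.3f}")
```

Semantic-enhanced frameworks in this space typically build on such a shared embedding, adding mechanisms that preserve fine-grained semantics across modalities rather than relying on raw similarity alone.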