The field of multimodal learning is moving toward a more unified and coherent understanding of relationships across modalities. Recent studies have examined the cross-modal consistency of any-to-any models and the difficulty of injecting evolving knowledge into large language and multimodal models. The findings show that any-to-any models do not consistently achieve greater cross-modal consistency than specialized models, although structured analyses of their intermediate latent spaces reveal weak but observable consistency. Injecting evolving knowledge into multimodal models also remains difficult: existing methods perform poorly, and supervised fine-tuning causes catastrophic forgetting. Text knowledge augmentation and continual learning methods, however, help mitigate these problems. Notable papers include Seeing What Tastes Good, which investigates how well large-scale models represent semantic feature norms of concrete object concepts, and Quantifying Cross-Modality Memorization in Vision-Language Models, which conducts a systematic study of cross-modality memorization in vision-language models and proposes a baseline method to mitigate it.
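To make the notion of latent-space cross-modal consistency concrete, here is a minimal illustrative sketch, not the method used in the cited work: it assumes the two modality branches of a model produce embeddings of the same concepts in a shared latent space (stand-in random vectors below) and measures whether same-concept pairs are more similar than mismatched pairs.

```python
# Illustrative sketch only: a simple proxy for cross-modal consistency.
# The "text" and "image" latents here are placeholder vectors, not outputs
# of any real any-to-any model.
import numpy as np

rng = np.random.default_rng(0)
n_concepts, dim = 5, 512

# Placeholder latent representations of the same concepts from two modalities.
text_latents = rng.normal(size=(n_concepts, dim))
image_latents = text_latents + 0.5 * rng.normal(size=(n_concepts, dim))  # noisy "image" view

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Same-concept similarity vs. mismatched-concept similarity.
matched = np.mean([cosine(text_latents[i], image_latents[i]) for i in range(n_concepts)])
mismatched = np.mean([
    cosine(text_latents[i], image_latents[j])
    for i in range(n_concepts) for j in range(n_concepts) if i != j
])

print(f"matched-concept similarity:    {matched:.3f}")
print(f"mismatched-concept similarity: {mismatched:.3f}")
# A clear gap between the two values is the kind of signal that would count as
# weak but observable cross-modal consistency in a shared latent space.
```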