The field of multimodal language models is moving toward improving the consistency and coherence of text-image plans and dialogue responses. Researchers are exploring new frameworks to ensure consistent alignment between modalities and to maintain coherence across visual steps. One line of research focuses on architectures that generate and refine text-image plans step by step, while another integrates multiple modalities, such as text and images, into dialogue response retrieval systems. There is also growing interest in using large language models to represent multimodal information in the acoustic domain and to improve the naturalness of generative spoken language models.

Noteworthy papers include Vela, which introduces a framework for generating universal multimodal embeddings using voice large language models; A Variational Framework for Improving Naturalness in Generative Spoken Language Models, which proposes an end-to-end variational approach to automatically learn continuous speech attributes; and Enhance Multimodal Consistency and Coherence for Text-Image Plan Generation, which presents a framework offering a plug-and-play improvement to various backbone models.