The field of multimodal speech interaction is moving toward seamless, intelligent spoken interaction built on adaptive, modality-specific approaches. Researchers are developing frameworks that integrate speech and text generation while preserving richer paralinguistic features such as emotion and prosody. Noteworthy papers include DeepTalk, which proposes adaptive modality expert learning based on a Mixture-of-Experts (MoE) architecture, and WildSpeech-Bench, which presents a benchmark for thoroughly evaluating LLMs in practical speech conversations. Other notable works include RELATE, a subjective evaluation dataset for automatically assessing the relevance between text and audio, and JoyTTS, an end-to-end spoken chatbot that combines large language models with text-to-speech and supports voice cloning.
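DeepTalk's adaptive modality expert learning rests on a Mixture-of-Experts layer. As a rough illustration of the general MoE routing pattern only (not DeepTalk's actual design), the sketch below gates each hidden state to a small set of experts via a learned router; the class name, dimensions, and top-k routing choice are assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityMoE(nn.Module):
    """Minimal MoE layer routing hidden states to experts (illustrative sketch,
    not the DeepTalk implementation)."""

    def __init__(self, d_model: int = 256, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        # Each expert is a small feed-forward block; in a modality-aware model
        # individual experts would specialize in e.g. speech or text features.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        gate_logits = self.router(x)                       # (B, T, n_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)               # renormalize over top-k
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] = out[mask] + weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

if __name__ == "__main__":
    layer = ModalityMoE()
    hidden = torch.randn(2, 10, 256)   # stand-in for fused speech/text hidden states
    print(layer(hidden).shape)         # torch.Size([2, 10, 256])
```

The top-k gating shown here is the common sparse-routing variant; how many experts are active per token and how they are specialized per modality is a design choice of each framework.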