The field of multimodal interaction and generation is evolving rapidly, with a focus on creating more natural and engaging human-computer interactions. Recent work has centered on improving models' ability to understand and generate multimodal content, such as speech, text, and visuals, in a way that is contextually relevant and responsive to user needs. A key direction is the development of frameworks and models that dynamically adapt to changing user contexts and preferences, enabling more effective and personalized interactions.

Notable papers in this area include Sensible Agent, which introduces a framework for unobtrusive interaction with proactive AR agents that reduces perceived interaction effort while maintaining high usability, and Kling-Avatar, which presents a cascaded framework for grounding multimodal instructions in avatar animation synthesis and achieves superior performance in lip-synchronization accuracy and emotion expressiveness.