The field of multimodal intelligence is advancing rapidly, with a focus on developing models that can perceive, understand, and generate across multiple modalities, including vision, text, speech, and action. Researchers are working to improve the consistency and accuracy of these models, particularly on tasks that require modality-invariant reasoning and an understanding of complex relationships between modalities. A key challenge in this area is the shortage of high-quality benchmarks and evaluation tools, which is being addressed through the development of new datasets and metrics. Notable papers include XModBench, which introduces a large-scale tri-modal benchmark for evaluating cross-modal consistency; OmniVinci, which presents a strong, open-source, omni-modal LLM with improved architecture and data curation; PRISMM-Bench, which introduces a benchmark for detecting and resolving inconsistencies across text, figures, tables, and equations; and ELLSA, which presents an end-to-end model that simultaneously perceives and generates across vision, text, speech, and action.