Multimodal Reasoning and Large Language Models

The field of large language models (LLMs) is advancing rapidly, with growing attention to multimodal reasoning and to generalizability across domains and languages. Recent research has emphasized benchmarks and evaluation frameworks that assess LLM capabilities more comprehensively and at finer granularity. One key direction is the development of multimodal large language models (MLLMs) that integrate modalities such as text, images, and audio to support complex reasoning. A complementary direction is the design of benchmarks that measure LLM and MLLM performance accurately across disciplines, languages, and difficulty levels. Noteworthy papers include HKMMLU, a multi-task language understanding benchmark that evaluates LLMs on Hong Kong's linguistic landscape; R-Bench, a graduate-level, multi-disciplinary benchmark for assessing the complex reasoning capability of LLMs and MLLMs; and X-Reasoner, a vision-language model post-trained solely on general-domain text for generalizable reasoning, which demonstrates strong performance across a range of benchmarks.
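Multi-task benchmarks like HKMMLU and R-Bench are typically scored as multiple-choice accuracy broken down by subject or discipline. The sketch below shows what such an evaluation harness might look like; the item format, the `model_answer` callable, and all names here are illustrative assumptions, not code from any of the papers listed under Sources.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical item format: (subject, question, choices, answer_index).
Item = tuple[str, str, list[str], int]

def evaluate_multiple_choice(
    model_answer: Callable[[str, list[str]], int],
    items: list[Item],
) -> dict[str, float]:
    """Compute per-subject accuracy for a multiple-choice benchmark.

    `model_answer` maps (question, choices) to a predicted choice
    index; a real harness would wrap an LLM or MLLM call here.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, question, choices, answer in items:
        total[subject] += 1
        if model_answer(question, choices) == answer:
            correct[subject] += 1
    return {s: correct[s] / total[s] for s in total}

# Toy usage with a trivial "model" that always picks the first choice.
items: list[Item] = [
    ("law", "Which body enacts ordinances in Hong Kong?",
     ["Legislative Council", "Executive Council"], 0),
    ("math", "What is 2 + 2?", ["3", "4"], 1),
]
print(evaluate_multiple_choice(lambda q, c: 0, items))
```

Reporting per-subject rather than aggregate accuracy is what lets these benchmarks expose uneven capability profiles, such as a model that is strong in mathematics but weak on localized legal or linguistic knowledge.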

Sources

Measuring Hong Kong Massive Multi-Task Language Understanding

R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation

Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality

X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains

OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning

On Path to Multimodal Generalist: General-Level and General-Bench

Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation

Crosslingual Reasoning through Test-Time Scaling

Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
