The field of large language models (LLMs) is advancing rapidly, with growing attention to multimodal reasoning and to generalization across domains and languages. Recent work has emphasized the need for benchmarks and evaluation frameworks that assess LLM capabilities more comprehensively and with finer nuance. One key direction is the development of multimodal large language models (MLLMs) that integrate modalities such as text, images, and audio to support complex reasoning. A complementary direction is the design of benchmarks that measure LLM and MLLM performance more accurately across tasks and disciplines. Noteworthy papers include HKMMLU, a multi-task language understanding benchmark that evaluates LLMs within Hong Kong's linguistic landscape; R-Bench, a graduate-level, multi-disciplinary benchmark for assessing the reasoning capability of LLMs and MLLMs; and X-Reasoner, a vision-language model post-trained solely on general-domain text to elicit generalizable reasoning, which demonstrates strong performance across a range of benchmarks.
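To make concrete what evaluation on a multi-task, multiple-choice benchmark typically involves, the sketch below shows a minimal per-subject accuracy harness. It is purely illustrative: the `query_model` stub, the item format, and the sample data are hypothetical and are not drawn from HKMMLU, R-Bench, or any specific benchmark.

```python
# Minimal, illustrative sketch of multiple-choice benchmark scoring.
# All names and data here are hypothetical; real benchmarks such as
# HKMMLU or R-Bench define their own formats, subjects, and metrics.
from collections import defaultdict

def query_model(question: str, choices: list[str]) -> str:
    """Stand-in for a call to an LLM; returns a choice letter (A/B/C/...)."""
    return "A"  # placeholder prediction

def evaluate(items: list[dict]) -> dict[str, float]:
    """Compute per-subject accuracy over multiple-choice items."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        pred = query_model(item["question"], item["choices"])
        total[item["subject"]] += 1
        if pred == item["answer"]:
            correct[item["subject"]] += 1
    return {subject: correct[subject] / total[subject] for subject in total}

if __name__ == "__main__":
    sample = [
        {"subject": "history", "question": "...?", "choices": ["A", "B", "C", "D"], "answer": "A"},
        {"subject": "physics", "question": "...?", "choices": ["A", "B", "C", "D"], "answer": "C"},
    ]
    print(evaluate(sample))  # e.g. {'history': 1.0, 'physics': 0.0}
```

In practice, benchmarks of this kind report aggregate and per-subject (or per-discipline) accuracy, which is what allows comparisons such as those between LLMs and MLLMs described above.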