The field of multimodal research is increasingly focused on addressing language bias and on improving performance in specific domains. Large-scale datasets, including those for under-served languages, are being developed to support inclusive vision-language systems. There is also a growing emphasis on datasets that integrate multiple tasks and modalities to enable comprehensive cross-modal reasoning. Noteworthy papers include COCO-Urdu, which introduces a large-scale Urdu image-caption dataset to reduce language bias in multimodal research; MITS, which presents a large-scale multimodal benchmark dataset for Intelligent Traffic Surveillance and substantially improves the performance of large multimodal models in this domain; and UnifiedVisual, which introduces a framework for constructing unified vision-language datasets that enables mutual enhancement between multimodal understanding and generation.