The field of multimodal models for face understanding and deepfake detection is advancing rapidly, with a focus on improving both the performance and the interpretability of these models. Recent research has explored large language models, vision-language models, and meta-domain strategies to enhance the perception of visual input and improve generalization across multiple domains. Notable developments include novel frameworks and datasets that leverage weakly supervised pipelines, attribute-driven hybrid strategies, and multi-granularity prompt learning to detect and analyze face forgeries. These advances pave the way for more accurate and transparent face understanding and deepfake detection systems. Noteworthy papers include FaceLLM, a multimodal large language model trained specifically for facial image understanding that achieves state-of-the-art performance on various face-centric tasks, and InstructFLIP, an instruction-tuned framework that leverages vision-language models to enhance generalization via textual guidance, outperforming existing models in accuracy while reducing training redundancy across diverse domains.