The field of multimodal reasoning and document understanding is moving toward more structured and adaptive approaches. Researchers are working to improve the accuracy and efficiency of multimodal models, particularly in handling complex document structures and diverse input modalities. One notable direction is the integration of neuro-symbolic reasoning, which supports more robust, structured inference over multimodal data. There is also growing interest in frameworks that dynamically select and aggregate multiple expert models to support multimodal reasoning across diverse domains. Noteworthy papers in this area include:

- MEXA, which introduces a training-free framework for modality- and task-aware aggregation of multiple expert models.
- TableMoE, which proposes a neuro-symbolic Mixture-of-Connector-Experts architecture for robust, structured reasoning over multimodal table data.
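To make the idea of modality- and task-aware expert aggregation concrete, here is a minimal sketch of how such a training-free pipeline might be organized. All names here (`Expert`, `select_experts`, `aggregate`) are hypothetical illustrations, not the actual MEXA interface; the real system presumably routes to large pretrained models rather than the stand-in functions used below.

```python
# Illustrative sketch only: Expert, select_experts, and aggregate are
# hypothetical names, not APIs from the MEXA paper.
from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    modalities: frozenset  # input modalities this expert can read
    tasks: frozenset       # tasks this expert declares support for

    def run(self, inputs: dict) -> str:
        # Stand-in for a real model call: report what the expert consumed.
        seen = sorted(m for m in inputs if m in self.modalities)
        return f"{self.name}: processed {', '.join(seen)}"

def select_experts(experts, input_modalities, task):
    """Modality- and task-aware selection: keep experts that cover at
    least one of the input modalities and support the requested task."""
    return [e for e in experts
            if e.modalities & input_modalities and task in e.tasks]

def aggregate(experts, inputs, task):
    """Training-free aggregation: run each selected expert and join the
    textual outputs for a downstream reasoner to consume."""
    selected = select_experts(experts, frozenset(inputs), task)
    return "\n".join(e.run(inputs) for e in selected)

experts = [
    Expert("table_reader", frozenset({"table"}), frozenset({"qa", "summarize"})),
    Expert("ocr_model", frozenset({"image"}), frozenset({"qa"})),
    Expert("audio_model", frozenset({"audio"}), frozenset({"transcribe"})),
]
inputs = {"table": "...", "image": "..."}
print(aggregate(experts, inputs, "qa"))
```

The design choice worth noting is that selection happens at inference time from declared capabilities, so no joint training is required to add or remove an expert.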