The field of object detection and tracking is advancing rapidly, with a growing focus on multimodal approaches that leverage complementary data, such as visible and infrared imagery, to improve robustness. Recent work addresses challenges including spatial misalignment between modalities, semantic inconsistency, and modality conflict during feature fusion. Notable papers include Cross-modal Offset-guided Dynamic Alignment and Fusion, which proposes a unified framework for weakly aligned UAV-based object detection, and Lightweight RGB-T Tracking with Mobile Vision Transformers, which introduces a progressive fusion framework for RGB-T tracking. Exploiting Lightweight Hierarchical ViT and Dynamic Framework for Efficient Visual Tracking and LASFNet: A Lightweight Attention-Guided Self-Modulation Feature Fusion Network for Multimodal Object Detection likewise advance efficient tracking models and multimodal feature fusion.
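To make the two recurring ideas concrete, the sketch below shows a minimal offset-guided alignment and gated-fusion layer in PyTorch: a small head predicts per-pixel offsets to warp weakly aligned infrared features onto the RGB grid, and a learned per-pixel gate arbitrates between the modalities during fusion. This is an illustrative assumption about how such components are commonly built, not the exact mechanism of any paper cited above; the module name, layer choices, and offset parameterization are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OffsetAlignGatedFusion(nn.Module):
    """Illustrative sketch: warp weakly aligned IR features onto the RGB
    grid with predicted offsets, then fuse the modalities with a learned
    per-pixel gate. Hypothetical; not the exact method of any cited paper."""

    def __init__(self, channels: int):
        super().__init__()
        # Predict a 2-channel (dx, dy) offset field from both modalities.
        self.offset = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)
        # Predict a per-pixel gate in [0, 1] to arbitrate modality conflict.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        # rgb, ir: (B, C, H, W) feature maps from two backbones.
        b, _, h, w = rgb.shape
        # Offsets are expressed in normalized [-1, 1] grid coordinates.
        off = self.offset(torch.cat([rgb, ir], dim=1))            # (B, 2, H, W)
        # Base sampling grid covering the feature map.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=rgb.device),
            torch.linspace(-1, 1, w, device=rgb.device),
            indexing="ij",
        )
        grid = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2)   # (B, H, W, 2)
        grid = grid + off.permute(0, 2, 3, 1)                     # add (dx, dy)
        # Bilinearly resample IR features at the offset locations.
        ir_aligned = F.grid_sample(ir, grid, align_corners=True)
        # Gated convex combination: g -> 1 trusts RGB, g -> 0 trusts IR.
        g = self.gate(torch.cat([rgb, ir_aligned], dim=1))
        return g * rgb + (1.0 - g) * ir_aligned


if __name__ == "__main__":
    fuse = OffsetAlignGatedFusion(channels=64)
    rgb = torch.randn(2, 64, 32, 32)  # e.g. visible-light backbone features
    ir = torch.randn(2, 64, 32, 32)   # weakly aligned infrared features
    print(fuse(rgb, ir).shape)        # torch.Size([2, 64, 32, 32])
```

The gate makes the fusion a convex combination of the two streams, so the layer can smoothly fall back to whichever modality is reliable (e.g., infrared at night), which is one common way to mitigate modality conflict.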