The field of vision-language navigation is moving towards more robust and interpretable methods. Recent developments have focused on improving agents' ability to imagine and predict future states, with an emphasis on multimodal perception and planning. This includes integrating audio and visual cues to enhance spatial and temporal understanding. New frameworks and techniques have also been proposed to address perception, reasoning, and planning errors in vision-language navigation agents. Noteworthy papers include VISTAv2, which proposes a generative world model for online value map planning; Audio-Visual World Models, which presents a formal framework for multimodal environment simulation; and SeeNav-Agent, which introduces a dual-view Visual Prompt technique and a novel step-level Reinforcement Fine-Tuning method.
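
To make the value-map planning idea more concrete, here is a minimal, heavily simplified sketch. It is not the VISTAv2 method: the world model, value map, action set, and the `plan_step` / `value_lookup` helpers are all hypothetical placeholders, intended only to illustrate how imagined rollouts from a world model can be scored against a value map to choose the next action.

```python
# Illustrative sketch (not the VISTAv2 implementation): choose an action by
# imagining future states with a toy world model and scoring them on a value map.
# All names here are hypothetical placeholders for this example.
import numpy as np


class ToyWorldModel:
    """Hypothetical world model: predicts the next grid position for an action."""
    MOVES = {"forward": (0, 1), "left": (-1, 0), "right": (1, 0)}

    def imagine(self, pos, action):
        dx, dy = self.MOVES[action]
        return (pos[0] + dx, pos[1] + dy)


def value_lookup(value_map, pos):
    """Read the value map at a position; out-of-bounds cells get a very low value."""
    x, y = pos
    if 0 <= x < value_map.shape[0] and 0 <= y < value_map.shape[1]:
        return value_map[x, y]
    return -1e9


def plan_step(world_model, value_map, pos, actions, horizon=3):
    """Greedy lookahead: roll out each candidate first action through the world
    model, continue greedily for the remaining horizon, and return the first
    action of the highest-scoring imagined trajectory."""
    best_action, best_value = None, -np.inf
    for first in actions:
        cur = world_model.imagine(pos, first)
        total = value_lookup(value_map, cur)
        for _ in range(horizon - 1):
            nxt = max(actions,
                      key=lambda a: value_lookup(value_map, world_model.imagine(cur, a)))
            cur = world_model.imagine(cur, nxt)
            total += value_lookup(value_map, cur)
        if total > best_value:
            best_action, best_value = first, total
    return best_action


if __name__ == "__main__":
    # Toy value map whose values increase toward the top-right "goal" corner.
    vmap = np.fromfunction(lambda i, j: i + j, (8, 8))
    action = plan_step(ToyWorldModel(), vmap, pos=(0, 0),
                       actions=["forward", "left", "right"], horizon=3)
    print("chosen action:", action)
```

In an actual navigation agent the world model would generate imagined observations from images and instructions, and the value map would be updated online from those predictions; the grid and greedy rollout above only stand in for that loop.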