The field of vision-and-language navigation is advancing rapidly, with a focus on better grounding language instructions in visual observations. Researchers are working to address the limitations of current methods, notably the coarse extraction of detail from language instructions and the neglect of object relationships across modalities. Recent papers propose approaches such as dual object perception-enhancement networks and 2D-assisted cross-modal understanding, which have shown promising gains in decision-making accuracy and robustness. Noteworthy papers include DOPE, which introduces a dual object perception-enhancement network to improve navigation performance; AS3D, which presents a 2D-assisted cross-modal understanding framework for 3D visual grounding; and DenseGrounding, which improves dense language-vision semantics for ego-centric 3D visual grounding.