Advances in Vision-and-Language Navigation

Vision-and-language navigation is advancing rapidly, with current work focused on deeper understanding of language instructions and visual cues. Researchers are addressing two recognized limitations of existing methods: insufficient extraction of fine-grained information from language instructions, and the neglect of object relationships across modalities. Recent papers propose approaches such as dual object perception-enhancement networks and 2D-assisted cross-modal understanding, with promising gains in decision-making accuracy and robustness. Noteworthy papers include DOPE, a dual object perception-enhancement network for vision-and-language navigation; AS3D, a 2D-assisted cross-modal understanding framework for 3D visual grounding; and DenseGrounding, which improves dense language-vision semantics for ego-centric 3D visual grounding.
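The common thread in these works is relating language features to visual object features. As a purely conceptual sketch (not the architecture of any cited paper), cross-modal attention lets each instruction token weigh candidate object features; all names and dimensions here are illustrative:

```python
import math

def cross_modal_attention(text_feats, obj_feats):
    """Toy scaled dot-product attention: each instruction-token vector
    attends over candidate object vectors and returns a fused feature.
    Illustrative only; the cited papers use far richer architectures."""
    d = len(obj_feats[0])
    fused_all = []
    for q in text_feats:
        # Similarity of this token to every object, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in obj_feats]
        # Softmax over objects (numerically stabilized)
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Fused feature: attention-weighted sum of object features
        fused = [sum(w * k[j] for w, k in zip(weights, obj_feats))
                 for j in range(d)]
        fused_all.append(fused)
    return fused_all
```

A token aligned with one object (e.g. a query close to that object's feature) yields a fused vector dominated by that object, which is the intuition behind grounding instructions in a visual scene.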

Sources

DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation

3DWG: 3D Weakly Supervised Visual Grounding via Category and Instance-Level Alignment

Visually-Guided Linguistic Disambiguation for Monocular Depth Scale Recovery

AS3D: 2D-Assisted Cross-Modal Understanding with Semantic-Spatial Scene Graphs for 3D Visual Grounding

DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding
