Autonomous systems research is advancing rapidly in multi-modal perception and generation, driven by the need for accurate and robust sensing and understanding of complex environments. Researchers are fusing data from modalities such as vision, lidar, and language to improve scene understanding, object detection, and navigation. In parallel, frameworks and models for controllable generation of realistic scenes, layouts, and sensor data are gaining traction, with applications in autonomous driving, robotics, and simulation. Together, these advances point toward more scalable, accurate, and efficient data generation and processing for training and evaluating autonomous systems.
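As a simplified illustration of the camera-lidar fusion mentioned above, the sketch below projects lidar points into a camera image with a standard pinhole model so that geometric and visual cues can be associated. The function name, calibration matrices, and toy values are illustrative assumptions and are not drawn from any of the papers cited below.

```python
import numpy as np


def project_lidar_to_image(points_lidar, T_cam_from_lidar, K):
    """Project lidar points (N, 3) into camera pixel coordinates.

    points_lidar     : (N, 3) points in the lidar frame.
    T_cam_from_lidar : (4, 4) extrinsic transform from lidar to camera frame.
    K                : (3, 3) camera intrinsic matrix.
    Returns pixel coordinates and depths for points in front of the camera.
    """
    # Homogenize and move points into the camera frame.
    n = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])      # (N, 4)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]          # (N, 3)

    # Discard points behind (or too close to) the image plane.
    in_front = pts_cam[:, 2] > 0.1
    pts_cam = pts_cam[in_front]

    # Pinhole projection: u = fx * X/Z + cx, v = fy * Y/Z + cy.
    proj = (K @ pts_cam.T).T
    pixels = proj[:, :2] / proj[:, 2:3]
    return pixels, pts_cam[:, 2]


if __name__ == "__main__":
    # Toy example with made-up calibration: camera and lidar frames coincide.
    rng = np.random.default_rng(0)
    points = rng.uniform([-10.0, -2.0, 2.0], [10.0, 2.0, 40.0], size=(1000, 3))
    T = np.eye(4)
    K = np.array([[720.0, 0.0, 640.0],
                  [0.0, 720.0, 360.0],
                  [0.0, 0.0, 1.0]])
    px, depth = project_lidar_to_image(points, T, K)
    print(px.shape, depth.min(), depth.max())
```

Once lidar points are expressed in pixel coordinates, per-point depth can be attached to image features (or vice versa), which is the basic building block behind many of the camera-conditioned lidar generation and scene-understanding methods listed below.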
Some noteworthy papers in this area include: Opti-Acoustic Scene Reconstruction, a real-time method for scene reconstruction in turbid underwater environments; Veila, a conditional diffusion framework for generating panoramic lidar from monocular RGB images; La La LiDAR, a layout-guided generative framework for controllable lidar scene generation; LiDARCrafter, a unified framework for 4D lidar generation and editing; B4DL, a benchmark for spatio-temporal reasoning over 4D lidar data; and Follow-Your-Instruction, an MLLM-driven framework for automatically synthesizing high-quality 2D, 3D, and 4D data.