Advancements in Visual Perception and 3D Scene Understanding

The field of visual perception and 3D scene understanding is advancing rapidly, with innovative approaches to display assessment, video understanding, and object detection. Recent work has focused on improving the accuracy and efficiency of these systems, enabling more realistic and immersive experiences. Notable directions include camera-based reconstruction pipelines, visual difference predictors, and new evaluation metrics such as Objectness SIMilarity (OSIM). Significant progress has also been made in swept volume computation, video diffusion transformer training, and scalable training of vector-quantized networks, while large-scale resources such as the SpatialVID video dataset and the Australian Supermarket Object Set (ASOS) benchmark are opening new avenues for research. Among the noteworthy papers, CameraVDP proposes a camera-based reconstruction pipeline coupled with a visual difference predictor; Objectness SIMilarity introduces an object-level evaluation metric for 3D scenes; Swept Volume Computation with Enhanced Geometric Detail Preservation presents a new approach to swept volume computation; and Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders proposes a new training method for video diffusion models.
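To make one of these evaluation ideas concrete: "codebook utilization" for a vector-quantized network is commonly measured as the fraction of codebook entries that are actually assigned at least once during encoding. The sketch below is a minimal, generic illustration of that quantity, not the method of the cited paper; the function name and toy data are assumptions for the example.

```python
import numpy as np

def codebook_utilization(codes: np.ndarray, codebook_size: int) -> float:
    """Fraction of codebook entries assigned at least once.

    `codes` holds the integer codebook indices produced when encoding a
    batch of inputs; `codebook_size` is the total number of entries.
    A value well below 1.0 indicates codebook collapse.
    """
    return np.unique(codes).size / codebook_size

# Toy example (hypothetical data): a 512-entry codebook where the
# encoder has collapsed onto the first 64 entries.
rng = np.random.default_rng(0)
codes = rng.integers(0, 64, size=10_000)
print(codebook_utilization(codes, 512))  # ~0.125: only 64 of 512 entries used
```

Scalable-training approaches such as the one listed below aim to drive this ratio to 100%, i.e. every codebook entry receives assignments.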

Sources

CameraVDP: Perceptual Display Assessment with Uncertainty Estimation via Camera and Visual Difference Prediction

Video Understanding by Design: How Datasets Shape Architectures and Insights

Objectness Similarity: Capturing Object-Level Fidelity in 3D Scene Evaluation

Swept Volume Computation with Enhanced Geometric Detail Preservation

Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders

SpatialVID: A Large-Scale Video Dataset with Spatial Annotations

Australian Supermarket Object Set (ASOS): A Benchmark Dataset of Physical Objects and 3D Models for Robotics and Computer Vision

Scalable Training for Vector-Quantized Networks with 100% Codebook Utilization

On the Geometric Accuracy of Implicit and Primitive-based Representations Derived from View Rendering Constraints

Compressed Video Quality Enhancement: Classifying and Benchmarking over Standards

Image Tokenizer Needs Post-Training

Exploring Metric Fusion for Evaluation of NeRFs

Cumulative Consensus Score: Label-Free and Model-Agnostic Evaluation of Object Detectors in Deployment

Beyond Averages: Open-Vocabulary 3D Scene Understanding with Gaussian Splatting and Bag of Embeddings

The CCF AATC 2025: Speech Restoration Challenge

PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era

Evaluation of Objective Image Quality Metrics for High-Fidelity Image Compression

Temporally Smooth Mesh Extraction for Procedural Scenes with Long-Range Camera Trajectories using Spacetime Octrees

3D Aware Region Prompted Vision Language Model

AToken: A Unified Tokenizer for Vision

Efficient 3D Perception on Embedded Systems via Interpolation-Free Tri-Plane Lifting and Volume Fusion

Realizing Metric Spaces with Convex Obstacles

A Real-Time Multi-Model Parametric Representation of Point Clouds

NeRF-based Visualization of 3D Cues Supporting Data-Driven Spacecraft Pose Estimation

RoboEye: Enhancing 2D Robotic Object Identification with Selective 3D Geometric Keypoint Matching

RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes

WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model via Training-Free Guidance
