The field of audio and visual processing is moving toward more robust and efficient methods for tasks such as music analysis, audio fingerprinting, and person re-identification. Recent work has focused on improving the accuracy and scalability of these methods, with particular emphasis on leveraging pre-trained models and novel architectures to achieve state-of-the-art performance. Notable papers in this area include:

- Perceptually Aligning Representations of Music via Noise-Augmented Autoencoders, which demonstrates the emergence of a perceptual hierarchy in music representations.
- Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis, which proposes a text-to-talking-face synthesis framework built on latent speech representations.
- Robust Neural Audio Fingerprinting using Music Foundation Models, which develops new neural audio fingerprinting techniques on top of pre-trained music foundation models (see the embedding-matching sketch after this list).
- ReIDMamba: Learning Discriminative Features with Visual State Space Model for Person Re-Identification, which proposes a pure Mamba-based person re-identification framework.
- LatentPrintFormer: A Hybrid CNN-Transformer with Spatial Attention for Latent Fingerprint identification, which integrates a CNN backbone and a Transformer backbone to extract both local and global features from latent fingerprints (sketched in the second example below).
- HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios, which proposes an efficient framework for high-quality zero-shot singing voice conversion.
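The fingerprinting line of work rests on a simple retrieval pattern: embed short audio segments with a pre-trained encoder, then match a query against a database by nearest-neighbour search over those embeddings. The sketch below illustrates only that general pattern, not the paper's method; the `embed` function here is a toy spectral feature standing in for a music foundation model, and the function names, window sizes, and scoring rule are all illustrative assumptions.

```python
# Sketch of embedding-based audio fingerprinting. The encoder below is a
# placeholder (log-magnitude spectrum); in practice embeddings would come
# from a pre-trained music foundation model.
import numpy as np

def embed(segment: np.ndarray) -> np.ndarray:
    """Placeholder encoder: L2-normalized log-magnitude spectrum.
    Stand-in for a pre-trained music foundation model."""
    feat = np.log1p(np.abs(np.fft.rfft(segment)))
    return feat / (np.linalg.norm(feat) + 1e-8)

def fingerprint(audio: np.ndarray, win: int = 4096, hop: int = 2048) -> np.ndarray:
    """Fingerprint = sequence of unit-norm segment embeddings."""
    segs = [audio[i:i + win] for i in range(0, len(audio) - win + 1, hop)]
    return np.stack([embed(s) for s in segs])

def match(query: np.ndarray, database: dict) -> str:
    """Nearest-neighbour lookup: best mean cosine similarity per segment."""
    q = fingerprint(query)
    def score(ref: np.ndarray) -> float:
        sims = q @ ref.T                        # cosine sims (unit vectors)
        return float(sims.max(axis=1).mean())   # best db segment per query segment
    return max(database, key=lambda k: score(database[k]))

# Toy demo: three random "tracks", query is a noisy snippet of track_1.
rng = np.random.default_rng(0)
tracks = {f"track_{i}": rng.standard_normal(44100) for i in range(3)}
db = {name: fingerprint(a) for name, a in tracks.items()}
snippet = tracks["track_1"][10000:30000] + 0.1 * rng.standard_normal(20000)
print(match(snippet, db))  # should print: track_1
```

The max-then-mean scoring makes the lookup tolerant of the query being a short, misaligned excerpt, which is the usual robustness requirement in fingerprinting.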
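LatentPrintFormer's description points to a common hybrid pattern: a convolutional branch captures local ridge detail, a transformer branch captures global structure over the same feature map, and a spatial attention weight fuses the two before pooling. The PyTorch sketch below shows that general pattern only; `HybridEncoder` and every layer size are assumptions for illustration, not the paper's actual architecture.

```python
# Minimal sketch of a hybrid CNN + Transformer encoder with spatial
# attention. All dimensions are illustrative.
import torch
import torch.nn as nn

class HybridEncoder(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        # Local branch: small CNN (stand-in for a full CNN backbone).
        self.cnn = nn.Sequential(
            nn.Conv2d(1, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Global branch: transformer encoder over the CNN feature tokens.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # Spatial attention: 1x1 conv giving a per-location fusion weight.
        self.attn = nn.Sequential(nn.Conv2d(2 * dim, 1, 1), nn.Sigmoid())
        self.head = nn.Linear(2 * dim, 128)  # fixed-size identity embedding

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = self.cnn(x)                          # (B, D, H, W)
        b, d, h, w = local.shape
        tokens = local.flatten(2).transpose(1, 2)    # (B, H*W, D)
        glob = self.transformer(tokens)              # (B, H*W, D)
        glob = glob.transpose(1, 2).reshape(b, d, h, w)
        fused = torch.cat([local, glob], dim=1)      # (B, 2D, H, W)
        weight = self.attn(fused)                    # (B, 1, H, W)
        pooled = (fused * weight).mean(dim=(2, 3))   # attention-weighted pool
        return nn.functional.normalize(self.head(pooled), dim=-1)

model = HybridEncoder()
emb = model(torch.randn(2, 1, 64, 64))  # two grayscale fingerprint crops
print(emb.shape)  # torch.Size([2, 128])
```

Feeding the CNN feature map into the transformer as tokens is one standard way to fuse local and global cues; the unit-norm output embedding supports the cosine-distance matching typical of identification pipelines.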