Advancements in Audio Representation Learning

The field of audio representation learning is moving toward more biologically inspired and self-supervised approaches, with new architectures and training techniques aimed at making audio processing models both more efficient and more effective. One notable trend is the use of autoregressive sequence models and conformer-based encoders to generate discriminative embeddings for audio segments. Another area of focus is the development of general-purpose bioacoustic encoders that extract representations useful across diverse downstream tasks.
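Several of the self-supervised approaches surveyed here rest on a contrastive objective: two augmented views of the same audio segment should map to nearby embeddings, while embeddings of different segments should repel each other. A minimal NumPy sketch of the widely used InfoNCE loss is below; the function name, toy data, and temperature value are illustrative assumptions, not code from any of the cited papers.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE: each anchor embedding should match its own augmented
    view (the diagonal) and repel every other segment in the batch."""
    # L2-normalize so dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # (batch, batch) similarity matrix
    # Row-wise log-softmax; the diagonal holds the matching pairs
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
noisy = emb + 0.01 * rng.normal(size=emb.shape)  # stand-in for an augmented view
aligned = info_nce_loss(emb, noisy)
shuffled = info_nce_loss(emb, rng.permutation(noisy))
assert aligned < shuffled  # matched pairs yield a lower loss than mismatched ones
```

In practice the anchors and positives would come from a trainable encoder (e.g. a conformer) applied to two augmentations of the same clip, and the loss would be backpropagated through it; the sketch only demonstrates the objective itself.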

Noteworthy papers include:

  • AuriStream, which introduces a two-stage framework for speech representation learning that achieves state-of-the-art results on diverse downstream speech tasks.
  • Pretrained Conformers for Audio Fingerprinting and Retrieval, which uses a self-supervised contrastive learning framework to train conformer-based encoders that achieve state-of-the-art results on audio retrieval tasks.
  • MATPAC++, which proposes a novel enhancement to masked latent prediction by integrating Multiple Choice Learning to explicitly model prediction ambiguity and improve representation quality.
  • ECHO, which introduces a novel foundation model that integrates an advanced band-split architecture with relative frequency positional embeddings, enabling precise spectral localization across arbitrary sampling configurations.
  • CUPE, which develops a lightweight model that captures key phoneme features in just 120 milliseconds, achieving competitive cross-lingual performance by learning fundamental acoustic patterns common to all languages.
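The Multiple Choice Learning idea behind MATPAC++ can be illustrated with a winner-takes-all loss: several prediction heads each propose a latent target, and only the closest head is penalized, so the heads specialize on different plausible outcomes of an ambiguous prediction. The sketch below is a toy illustration under that assumption; the function name and vectors are invented for the example, not taken from the paper.

```python
import numpy as np

def wta_loss(predictions, target):
    """Winner-takes-all loss for Multiple Choice Learning: only the
    head closest to the target receives the loss (and its gradient),
    letting different heads cover different modes of an ambiguous target."""
    # predictions: (num_heads, dim), target: (dim,)
    errors = np.sum((predictions - target) ** 2, axis=1)  # per-head squared error
    winner = int(np.argmin(errors))
    return errors[winner], winner

heads = np.array([[1.0, 0.0],   # head 0 proposes one plausible latent
                  [0.0, 1.0]])  # head 1 proposes another
loss_a, winner_a = wta_loss(heads, np.array([0.9, 0.1]))
loss_b, winner_b = wta_loss(heads, np.array([0.1, 0.9]))
assert winner_a == 0 and winner_b == 1  # the heads specialize on different modes
```

A single-head regressor would be pulled toward the average of the two targets; the winner-takes-all rule instead lets each head commit to one mode, which is how Multiple Choice Learning models prediction ambiguity explicitly.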

Sources

Representing Speech Through Autoregressive Prediction of Cochlear Tokens

Pretrained Conformers for Audio Fingerprinting and Retrieval

What Matters for Bioacoustic Encoding

Exploring Self-Supervised Audio Models for Generalized Anomalous Sound Detection

MATPAC++: Enhanced Masked Latent Prediction for Self-Supervised Audio Representation Learning

Leveraging Mamba with Full-Face Vision for Audio-Visual Speech Enhancement

Mamba2 Meets Silence: Robust Vocal Source Separation for Sparse Regions

ECHO: Frequency-aware Hierarchical Encoding for Variable-length Signal

CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing

An Enhanced Audio Feature Tailored for Anomalous Sound Detection Based on Pre-trained Models

ASCMamba: Multimodal Time-Frequency Mamba for Acoustic Scene Classification
