The field of audio representation learning is moving toward more biologically inspired and self-supervised approaches, with new architectures and training techniques aimed at making audio models both more efficient and more effective. One notable trend is the use of autoregressive sequence models and conformer-based encoders to produce discriminative embeddings for audio segments. Another area of focus is the development of general-purpose bioacoustic encoders that extract representations useful across diverse downstream tasks.
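As a concrete illustration of the autoregressive trend, the minimal sketch below trains a small causal transformer to predict the next token in a discrete audio-token sequence; its hidden states can then serve as segment embeddings. All class names, dimensions, and the vocabulary size are illustrative assumptions, not taken from any of the papers below.

```python
# Minimal sketch of autoregressive audio-token modeling (illustrative only;
# vocabulary size, dimensions, and names are assumptions, not any paper's code).
import torch
import torch.nn as nn

class CausalAudioLM(nn.Module):
    def __init__(self, vocab_size=1024, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, time) integer ids from some audio tokenizer.
        seq_len = tokens.size(1)
        causal_mask = torch.triu(                       # block attention to future
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.encoder(self.embed(tokens), mask=causal_mask)
        return self.head(h), h  # logits for next-token loss; h doubles as embeddings

model = CausalAudioLM()
tokens = torch.randint(0, 1024, (2, 100))               # stand-in for tokenized audio
logits, hidden = model(tokens)
loss = nn.functional.cross_entropy(                     # next-token prediction
    logits[:, :-1].reshape(-1, logits.size(-1)),
    tokens[:, 1:].reshape(-1))
```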
Noteworthy papers include:
- AuriStream, which introduces a two-stage framework for speech representation learning (an audio tokenizer followed by an autoregressive sequence model over the resulting tokens, in the spirit of the sketch above) and achieves state-of-the-art results across diverse downstream speech tasks.
- Pretrained Conformers for Audio Fingerprinting and Retrieval, which trains conformer-based encoders with a self-supervised contrastive learning framework, yielding state-of-the-art performance on audio retrieval tasks (a sketch of a typical contrastive objective follows this list).
- MATPAC++, which enhances masked latent prediction by integrating Multiple Choice Learning to explicitly model prediction ambiguity and improve representation quality (see the winner-takes-all sketch after this list).
- ECHO, which introduces a foundation model combining a band-split architecture with relative frequency positional embeddings, enabling precise spectral localization across arbitrary sampling configurations (see the band-split sketch after this list).
- CUPE, which develops a lightweight model that captures key phoneme features from windows of just 120 milliseconds, achieving competitive cross-lingual performance by learning fundamental acoustic patterns common to all languages.
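For the contrastive fingerprinting approach, a common formulation treats two augmented views of the same audio segment as a positive pair and all other segments in the batch as negatives. The sketch below implements a standard NT-Xent (InfoNCE) loss on precomputed embeddings; the encoder itself (a conformer in the paper) is omitted, and the temperature value is an assumption.

```python
# Minimal sketch of a self-supervised contrastive (NT-Xent / InfoNCE) objective
# for audio fingerprinting; the temperature and batch shapes are assumptions.
import torch
import torch.nn.functional as F

def ntxent_loss(z1, z2, temperature=0.1):
    # z1, z2: (batch, dim) embeddings of two augmented views of the same
    # audio segments; matching rows form the positive pairs.
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2B, D), unit norm
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    n = z1.size(0)
    sim.fill_diagonal_(float("-inf"))                    # exclude self-similarity
    # For row i, the positive sits at i + n (first half) or i - n (second half).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

z_clean = torch.randn(8, 128)   # embeddings of clean segments
z_aug = torch.randn(8, 128)     # embeddings of distorted views of the same segments
loss = ntxent_loss(z_clean, z_aug)
```

In fingerprinting, the augmented view typically simulates the distortions a query must survive (added noise, reverberation, small time offsets), so that a short degraded query still retrieves its source segment.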
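Multiple Choice Learning addresses the fact that a masked audio region can have several plausible continuations: a single regression head is forced to average over them, while multiple heads trained with a winner-takes-all loss can each specialize on one mode. A minimal sketch, assuming simple linear heads and a mean-squared error on latent targets (MATPAC++'s actual heads and targets may differ):

```python
# Minimal sketch of Multiple Choice Learning for masked latent prediction:
# K heads each propose a target latent, and only the closest ("winning") head
# per masked position receives gradient. Shapes and K are assumptions.
import torch
import torch.nn as nn

class MCLPredictor(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_heads))

    def forward(self, h, target):
        # h: (batch, time, d) encoder states at masked positions;
        # target: (batch, time, d) latents to predict (e.g. from a teacher).
        preds = torch.stack([head(h) for head in self.heads])  # (K, B, T, D)
        errs = ((preds - target) ** 2).mean(dim=-1)            # (K, B, T)
        winner = errs.min(dim=0).values                        # best head per position
        return winner.mean()                                   # winner-takes-all loss

predictor = MCLPredictor()
h = torch.randn(2, 50, 256)
target = torch.randn(2, 50, 256)
loss = predictor(h, target)
```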
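Finally, a heavily hedged sketch of the band-split idea behind ECHO: the spectrogram is cut into frequency bands, and each band is tagged with a positional embedding derived from its absolute center frequency in Hz rather than its bin index, so the same physical frequency receives the same code at any sampling rate. The sinusoidal encoding and band layout below are assumptions, not ECHO's published design.

```python
# Hedged sketch of band-split input with frequency-based positional codes.
# The encoding scheme, band width, and constants are illustrative assumptions.
import torch

def frequency_positional_embedding(center_hz, d_model=64, max_hz=20000.0):
    # Sinusoidal code over normalized absolute frequency in [0, 1].
    pos = center_hz / max_hz
    i = torch.arange(d_model // 2)
    freqs = pos.unsqueeze(-1) * (100.0 ** (i / (d_model // 2)))
    return torch.cat([torch.sin(freqs), torch.cos(freqs)], dim=-1)

sample_rate, n_fft, band_width = 32000, 1024, 4   # 4 FFT bins per band
bin_hz = sample_rate / n_fft                      # Hz covered by one FFT bin
n_bands = (n_fft // 2) // band_width
centers = (torch.arange(n_bands) * band_width + band_width / 2) * bin_hz
band_pe = frequency_positional_embedding(centers) # (n_bands, d_model)
# Each band's spectrogram slice would be projected to d_model and summed with
# band_pe before entering the transformer, keeping codes tied to absolute Hz.
```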