Advances in Audio-Based Machine Learning for Real-World Applications

The field of audio-based machine learning is advancing rapidly, driven by the need for models that hold up in real-world conditions. Recent research has explored few-shot learning, meta-learning, and multimodal fusion to improve the accuracy and robustness of audio classification, with promising results in domains including speech emotion recognition, audio event detection, and medical diagnosis. Notably, integrating audio and visual features has been found to boost performance in tasks such as micro-expression analysis and audiovisual emotion recognition. Parameter-efficient adaptation methods have likewise enabled effective transfer of knowledge from pre-trained models to new tasks and datasets. Overall, the field is moving toward audio-based models that are more robust, scalable, and adaptable across a wide range of real-world scenarios.

Noteworthy papers include:

Unsupervised Multi-Attention Meta Transformer for Rotating Machinery Fault Diagnosis, which achieved 99% fault diagnosis accuracy with only 1% of the samples labeled.

AI-enabled tuberculosis screening in a high-burden setting using cough sound analysis and speech foundation models, which demonstrated strong potential as a TB triage tool with 92.1% accuracy.

DyKen-Hyena: Dynamic Kernel Generation via Cross-Modal Attention for Multimodal Intent Recognition, which achieved state-of-the-art results on the MIntRec and MIntRec2.0 benchmarks, including a +10.46% F1-score improvement in out-of-scope detection.
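
To make the few-shot setup several of these papers build on concrete, here is a minimal prototypical-classification sketch. It is an illustration only, not any paper's actual method: the embeddings, shapes, and function names are hypothetical, and a fixed-length embedding per audio clip is assumed to have been extracted already.

```python
# Minimal sketch of prototypical few-shot classification over
# pre-extracted audio embeddings (all names/shapes hypothetical).
import numpy as np

def classify_episode(support, support_labels, query):
    """Assign each query embedding to the nearest class prototype.

    support: (n_support, d) embeddings of the labeled support set
    support_labels: (n_support,) integer class ids
    query: (n_query, d) embeddings to classify
    """
    classes = np.unique(support_labels)
    # Prototype = mean embedding of each class's support examples.
    prototypes = np.stack(
        [support[support_labels == c].mean(axis=0) for c in classes]
    )
    # Squared Euclidean distance from every query to every prototype.
    dists = ((query[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]

# Toy usage: a 3-way 5-shot episode with 16-dim embeddings.
rng = np.random.default_rng(0)
support = rng.normal(size=(15, 16))
labels = np.repeat(np.arange(3), 5)       # 5 support examples per class
query = rng.normal(size=(6, 16))
print(classify_episode(support, labels, query))
```

The appeal of this setup for audio is that only the embedding model needs to be good; adapting to a new class requires a handful of labeled clips rather than retraining.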

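Similarly, the cross-modal attention that underpins audiovisual fusion work such as DyKen-Hyena can be sketched in a few lines. This is a generic formulation (assumed shapes and class names, not the paper's architecture): audio tokens attend over another modality's tokens and the result is merged through a residual connection.

```python
# Generic cross-modal attention fusion sketch (PyTorch);
# dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # Audio tokens query the other modality's tokens.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, other):
        # audio: (batch, t_audio, dim); other: (batch, t_other, dim)
        fused, _ = self.attn(query=audio, key=other, value=other)
        return self.norm(audio + fused)  # residual keeps the audio stream intact

audio = torch.randn(2, 50, 256)   # e.g. frame-level audio embeddings
visual = torch.randn(2, 20, 256)  # e.g. frame-level visual embeddings
out = CrossModalFusion()(audio, visual)
print(out.shape)  # torch.Size([2, 50, 256])
```
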
Sources

Unsupervised Multi-Attention Meta Transformer for Rotating Machinery Fault Diagnosis

Cough Classification using Few-Shot Learning

AI-enabled tuberculosis screening in a high-burden setting using cough sound analysis and speech foundation models

Combining Textual and Spectral Features for Robust Classification of Pilot Communications

Distinguishing Startle from Surprise Events Based on Physiological Signals

DyKen-Hyena: Dynamic Kernel Generation via Cross-Modal Attention for Multimodal Intent Recognition

Prototypical Contrastive Learning For Improved Few-Shot Audio Classification

More Similar than Dissimilar: Modeling Annotators for Cross-Corpus Speech Emotion Recognition

MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models

Improving Anomalous Sound Detection with Attribute-aware Representation from Domain-adaptive Pre-training

Is Meta-Learning Out? Rethinking Unsupervised Few-Shot Classification with Limited Entropy

CLAIP-Emo: Parameter-Efficient Adaptation of Language-supervised models for In-the-Wild Audiovisual Emotion Recognition

MMED: A Multimodal Micro-Expression Dataset based on Audio-Visual Fusion

Estimating Respiratory Effort from Nocturnal Breathing Sounds for Obstructive Sleep Apnoea Screening
