Advances in Music Generation and Transcription

The field of music generation and transcription is rapidly evolving, with a focus on developing models that produce high-quality, coherent music and transcribe musical pieces accurately. Recent research has explored diffusion-based models, transformer architectures, and multi-agent systems to improve the quality and controllability of music generation. There has also been a push to incorporate more expressive and nuanced aspects of music, such as playing technique and performance style, into transcription and generation models. Noteworthy papers in this area include:

- MusicWeaver, which presents a music generation model conditioned on a beat-aligned structural plan, enabling professional and localized edits.
- Noise-to-Notes, which introduces a diffusion-based framework for automatic drum transcription, offering a flexible speed-accuracy trade-off and strong inpainting capabilities.
- VioPTT, which proposes a lightweight model for transcribing violin playing technique in addition to pitch onset and offset, demonstrating strong generalization to real-world note-level violin technique recordings.
- Disentangling Score Content and Performance Style for Joint Piano Rendering and Transcription, which presents a unified framework for jointly modeling expressive performance rendering and automatic piano transcription, achieving competitive performance on both tasks.
- An Agent-Based Framework for Automated Higher-Voice Harmony Generation, which introduces a multi-agent system that generates harmony in a collaborative, modular fashion, mimicking the collaborative process of human musicians.
- Discovering "Words" in Music, which presents an unsupervised machine learning algorithm for identifying recurring patterns in symbolic music data, extracting basic building blocks that support structural analysis and sparse encoding.
- SAGE-Music, which proposes a low-latency symbolic music generation model via attribute-specialized key-value head sharing, achieving a 30% inference speedup with only a negligible quality drop.
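SAGE-Music's attribute-specialized grouping scheme is detailed in the paper itself; as a generic sketch of the underlying idea, key-value head sharing lets several query heads attend through one shared K/V head (as in grouped-query attention), shrinking the KV cache and speeding up decoding. The function name, shapes, and group assignment below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_kv_attention(q, k, v):
    """Attention where consecutive groups of query heads share one K/V head.

    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d), with
    n_q_heads a multiple of n_kv_heads. The KV cache holds only
    n_kv_heads heads instead of n_q_heads.
    """
    n_q_heads, _, d = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0
    group = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group  # query head h reads the shared KV head kv
        scores = q[h] @ k[kv].T / np.sqrt(d)
        out[h] = softmax(scores) @ v[kv]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16, 32))   # 8 query heads
k = rng.standard_normal((2, 16, 32))   # only 2 shared K/V heads
v = rng.standard_normal((2, 16, 32))
out = grouped_kv_attention(q, k, v)    # shape (8, 16, 32)
```

Here the KV cache is 4x smaller than with one K/V head per query head; an attribute-specialized variant would presumably assign the shared heads by musical attribute rather than by position.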

Sources

MusicWeaver: Coherent Long-Range and Editable Music Generation from a Beat-Aligned Structural Plan

Noise-to-Notes: Diffusion-based Generation and Refinement for Automatic Drum Transcription

Time-Shifted Token Scheduling for Symbolic Music Generation

VioPTT: Violin Technique-Aware Transcription from Synthetic Data Augmentation

Disentangling Score Content and Performance Style for Joint Piano Rendering and Transcription

An Agent-Based Framework for Automated Higher-Voice Harmony Generation

Discovering "Words" in Music: Unsupervised Learning of Compositional Sparse Code for Symbolic Music

Enhanced Automatic Drum Transcription via Drum Stem Source Separation

Learning Relationships Between Separate Audio Tracks for Creative Applications

HNote: Extending YNote with Hexadecimal Encoding for Fine-Tuning LLMs in Music Modeling

SAGE-Music: Low-Latency Symbolic Music Generation via Attribute-Specialized Key-Value Head Sharing
