Advancements in Speech Processing and Synthesis

The field of speech processing and synthesis is seeing significant developments, with a focus on improving the ability of speech language models to capture the intricate interdependence between content and prosody. Researchers are exploring new tokenization schemes, such as those that incorporate word-level prosody tokens, to improve how prosody information is learned. There is also growing interest in more efficient and effective text-to-speech models, including those that use transformer architectures and speculative decoding techniques. These advances have the potential to improve the quality and expressiveness of generated speech, as well as enable more robust and accurate speech recognition systems.

Noteworthy papers include ProsodyLM, which introduces a novel tokenization scheme for learning prosody information, and the TTS-1 Technical Report, which presents a set of transformer-based autoregressive text-to-speech models that achieve state-of-the-art performance on several benchmarks. Model-free Speculative Decoding for Transformer-based ASR with Token Map Drafting is also notable: its proposed technique eliminates the need for a separate draft model, enabling efficient speculative decoding with minimal overhead. Adaptive Duration Model for Text Speech Alignment and Next Tokens Denoising for Speech Synthesis are likewise noteworthy for their contributions to text-speech alignment and generative modeling, respectively.
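
To make the word-level prosody tokenization idea concrete, here is a minimal sketch in the spirit of ProsodyLM. The token names, prosody buckets, thresholds, and interleaving order are illustrative assumptions, not the paper's actual scheme:

```python
# Toy word-level prosody tokenization: interleave each word token with
# discrete prosody tokens so a speech LM models content and prosody jointly.
# All names and thresholds below are hypothetical illustrations.

def quantize_prosody(pitch_hz, duration_ms):
    """Map continuous prosody measurements to coarse discrete buckets."""
    pitch = "P_HIGH" if pitch_hz > 200 else "P_LOW"
    dur = "D_LONG" if duration_ms > 300 else "D_SHORT"
    return [pitch, dur]

def tokenize(words_with_prosody):
    """Emit one sequence of interleaved word and prosody tokens."""
    tokens = []
    for word, pitch_hz, duration_ms in words_with_prosody:
        tokens.append(word)                                   # content token
        tokens.extend(quantize_prosody(pitch_hz, duration_ms))  # prosody tokens
    return tokens

print(tokenize([("hello", 220.0, 350.0), ("world", 180.0, 250.0)]))
# → ['hello', 'P_HIGH', 'D_LONG', 'world', 'P_LOW', 'D_SHORT']
```

The point of the interleaving is that a standard next-token language model then learns prosody transitions for free, without a separate prosody predictor.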
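
The model-free speculative decoding idea can be sketched as follows. This is a toy illustration under stated assumptions: the draft comes from a static n-gram lookup table ("token map") rather than a draft model, and the expensive target model is a deterministic stub; none of these names reflect the paper's actual implementation:

```python
# Toy speculative decoding with a model-free "token map" draft:
# draft a continuation from a lookup table, then verify it token by token
# with the target model, accepting the longest agreeing prefix.

TOKEN_MAP = {  # hypothetical n-gram continuation table from training text
    ("the",): ["quick", "brown", "fox"],
}

def target_model_next(context):
    """Stub for the expensive autoregressive model: one next token."""
    canned = {
        ("the",): "quick",
        ("the", "quick"): "brown",
        ("the", "quick", "brown"): "dog",  # diverges from the draft here
    }
    return canned.get(tuple(context), "<eos>")

def speculative_step(context):
    """Accept verified draft tokens for free; on the first disagreement,
    keep the target model's token and stop."""
    draft = TOKEN_MAP.get(tuple(context[-1:]), [])
    accepted = []
    for tok in draft:
        expected = target_model_next(context + accepted)
        if tok == expected:
            accepted.append(tok)        # draft token verified
        else:
            accepted.append(expected)   # fall back to the target's token
            break
    if not draft:
        accepted.append(target_model_next(context))  # no draft available
    return accepted

print(speculative_step(["the"]))
# → ['quick', 'brown', 'dog']
```

Because the draft table needs no forward passes, every verified draft token saves one sequential step of the target model, which is where the speedup with minimal overhead comes from.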

Sources

ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models

TTS-1 Technical Report

Model-free Speculative Decoding for Transformer-based ASR with Token Map Drafting

Adaptive Duration Model for Text Speech Alignment

Next Tokens Denoising for Speech Synthesis
