The field of human-computer interaction is moving toward a deeper understanding of the role of prosody and voice in communication. Researchers are examining how prosodic features such as pitch, timing, and intonation convey emotion, intent, and discourse structure in spoken language, and how these cues can be replicated or replaced in text-based settings. Emojis are being investigated as visual surrogates for prosodic cues, and the manipulative potential of voice characteristics in synthetic speech is receiving attention. In parallel, work on automatic speech recognition (ASR) is targeting efficiency and performance, spanning new training strategies as well as end-to-end voice agent pipelines that integrate streaming ASR, quantized language models, and real-time text-to-speech synthesis (illustrative sketches of a prosodic feature extractor and of such a pipeline follow the paper list below).

Notable papers in this area include:

- The Prosody of Emojis, which examines how emojis influence prosodic realization in speech and how listeners interpret prosodic cues to recover emoji meanings.
- The Manipulative Power of Voice Characteristics, which investigates deceptive patterns in Mandarin Chinese female synthetic speech.
- Efficient Scaling for LLM-based ASR, which proposes a multi-stage LLM-ASR training strategy for improved performance and efficiency.
- Toward Low-Latency End-to-End Voice Agents for Telecommunications, which introduces a low-latency telecom AI voice agent pipeline for real-time, interactive telecommunications use.
- Pitch Accent Detection improves Pretrained Automatic Speech Recognition, which shows that pitch accent detection can improve the performance of pretrained ASR systems.
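To make the prosodic features discussed above concrete, the following is a minimal sketch of the kind of pitch, timing, and energy extraction such studies rely on, assuming librosa is available; the file path, frequency bounds, and summary statistics are illustrative assumptions rather than any particular paper's method.

```python
import numpy as np
import librosa

# Illustrative input; the path is a placeholder for a real recording.
AUDIO_PATH = "utterance.wav"

# Load audio at its native sampling rate.
y, sr = librosa.load(AUDIO_PATH, sr=None)

# Pitch contour via probabilistic YIN; bounds roughly cover adult speech.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

# Frame-level RMS energy as a crude loudness correlate.
rms = librosa.feature.rms(y=y)[0]

# Timestamps for each analysis frame (timing information).
times = librosa.times_like(f0, sr=sr)

# Simple intonation summaries over voiced frames only.
voiced_f0 = f0[voiced_flag]
print(f"Duration: {len(y) / sr:.2f} s over {len(times)} frames")
print(f"Mean pitch: {np.nanmean(voiced_f0):.1f} Hz")
print(f"Pitch range: {np.nanmax(voiced_f0) - np.nanmin(voiced_f0):.1f} Hz")
print(f"Mean energy: {rms.mean():.4f}")
```

Features of this kind (pitch contours, pitch range, energy, frame timing) are the raw material for studying how prosody carries emotion, intent, and discourse structure.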
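The low-latency voice agent direction can be pictured as three streaming stages connected by queues. The skeleton below is a hypothetical asyncio arrangement, not the pipeline from the cited paper: transcribe_chunk, generate_reply, and synthesize are placeholders standing in for a streaming ASR model, a quantized language model, and a real-time TTS engine.

```python
import asyncio

# Placeholder stage implementations; a real system would wrap a streaming
# ASR model, a quantized LLM, and a real-time TTS engine here.
async def transcribe_chunk(audio_chunk: bytes) -> str:
    return f"<partial transcript of {len(audio_chunk)} bytes>"

async def generate_reply(transcript: str) -> str:
    return f"Reply to: {transcript}"

async def synthesize(text: str) -> bytes:
    return text.encode("utf-8")  # stand-in for synthesized audio

async def asr_stage(audio_in: asyncio.Queue, text_out: asyncio.Queue) -> None:
    # Consume audio chunks as they arrive and emit partial transcripts.
    while (chunk := await audio_in.get()) is not None:
        await text_out.put(await transcribe_chunk(chunk))
    await text_out.put(None)  # propagate end-of-stream

async def llm_stage(text_in: asyncio.Queue, reply_out: asyncio.Queue) -> None:
    while (transcript := await text_in.get()) is not None:
        await reply_out.put(await generate_reply(transcript))
    await reply_out.put(None)

async def tts_stage(reply_in: asyncio.Queue) -> None:
    while (reply := await reply_in.get()) is not None:
        audio = await synthesize(reply)
        print(f"Playing {len(audio)} bytes of audio for: {reply!r}")

async def main() -> None:
    audio_q, text_q, reply_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    stages = asyncio.gather(
        asr_stage(audio_q, text_q), llm_stage(text_q, reply_q), tts_stage(reply_q)
    )
    # Feed a few fake microphone chunks, then signal end-of-stream.
    for chunk in (b"\x00" * 3200, b"\x00" * 3200):
        await audio_q.put(chunk)
    await audio_q.put(None)
    await stages

if __name__ == "__main__":
    asyncio.run(main())
```

Running the stages concurrently and passing small chunks between them is what keeps end-to-end latency low: each component starts working as soon as the previous one emits a partial result, rather than waiting for a full utterance.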