Advances in Event-based Vision and Multimodal Interaction

The field of event-based vision and multimodal interaction is advancing rapidly, with a focus on more efficient and effective methods for processing and understanding complex visual and auditory data. One key area of research is the development of novel frameworks and architectures for event-based neural networks, which are designed to handle the unique challenges of asynchronous, sparse event data (a generic preprocessing example is sketched below). Another is the integration of multimodal inputs such as vision, audio, and text to enable more natural and engaging human-computer interaction. Notable papers in this area include:

EVA, a novel A2S framework that generates highly expressive and generalizable event-by-event representations, outperforming prior A2S methods on recognition tasks.

AW-GATCN, an Adaptive Weighted Graph Attention Convolutional Network that achieves superior recognition accuracy on joint event-camera denoising and object recognition tasks.
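As context for why event data needs special handling, the following is a minimal sketch (not the method of EVA or AW-GATCN, and all names in it are illustrative) of a common preprocessing step: each camera event is an (x, y, t, polarity) tuple, and one widely used dense representation bins these events into a spatio-temporal voxel grid before a network consumes them.

```python
# Minimal sketch: convert a raw event stream into a voxel grid.
# Assumes a generic (x, y, t, polarity) event layout; this is an
# illustrative example, not the representation used by the cited papers.
import numpy as np

def events_to_voxel_grid(events, height, width, num_bins):
    """Accumulate events into a (num_bins, height, width) voxel grid.

    events: float array of shape (N, 4) with columns (x, y, t, polarity),
            polarity in {-1, +1}.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if len(events) == 0:
        return grid
    x, y, t, p = events[:, 0], events[:, 1], events[:, 2], events[:, 3]
    # Normalize timestamps into [0, num_bins) and assign each event to a bin.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1e-6)
    bins = t_norm.astype(int)
    # Signed accumulation: positive and negative polarities add and subtract.
    np.add.at(grid, (bins, y.astype(int), x.astype(int)), p)
    return grid

# Example: 1000 synthetic events on a 64x64 sensor, 5 temporal bins.
rng = np.random.default_rng(0)
ev = np.column_stack([
    rng.integers(0, 64, 1000),      # x coordinate
    rng.integers(0, 64, 1000),      # y coordinate
    np.sort(rng.random(1000)),      # timestamp
    rng.choice([-1.0, 1.0], 1000),  # polarity
]).astype(np.float32)
voxels = events_to_voxel_grid(ev, height=64, width=64, num_bins=5)
print(voxels.shape)  # (5, 64, 64)
```

Graph-based approaches such as AW-GATCN instead treat individual events (or clusters of them) as graph nodes, which preserves the sparse, asynchronous structure that a dense voxel grid discards.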

Sources

Maximizing Asynchronicity in Event-based Neural Networks

AW-GATCN: Adaptive Weighted Graph Attention Convolutional Network for Event Camera Data Joint Denoising and Object Recognition

Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models

Transfer Learning from Visual Speech Recognition to Mouthing Recognition in German Sign Language

Beyond Words: Multimodal LLM Knows When to Speak

Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation

Temporal Object Captioning for Street Scene Videos from LiDAR Tracks
