The field of recommendation systems is moving toward incorporating multimodal information to improve recommendation quality. Recent research leverages rich item-side modality signals, such as images, text, and audio, to enhance user modeling and recommendation accuracy. One notable direction is the use of graph-based methods to capture complex relationships between users and items, along with novel attention mechanisms that integrate multimodal information more effectively. There is also growing interest in addressing challenges such as over-reliance on particular modalities, semantic drift, and noise in user behavior data.

Noteworthy papers in this area include EGRA, which proposes a bi-level dynamic alignment weighting mechanism to improve modality-behavior representation alignment, and VQL, which introduces a context-aware Vector Quantization Attention framework for ultra-long user behavior modeling. ORCA is notable for its causal-decoupling approach to mitigating over-reliance on certain modalities, while PCR-CA and Progressive Semantic Residual Quantization demonstrate the effectiveness of parallel codebook representations and multimodal interest modeling in app and music recommendation, respectively. Rethinking Purity and Diversity and Deep Multiple Quantization Network also show promising results in multi-behavior sequential recommendation and click-through rate prediction, respectively.
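Several of the quantization-based papers above (VQL, Progressive Semantic Residual Quantization, Deep Multiple Quantization Network) rest on the idea of encoding a dense item embedding as a short sequence of discrete codeword ids, with each level quantizing the residual left by the previous one. The following NumPy sketch illustrates that general residual-quantization idea; it is a minimal toy implementation, and all function names, hyperparameters, and the simple k-means routine are illustrative assumptions, not taken from any of the cited papers.

```python
import numpy as np

def train_codebooks(embeddings, num_levels=3, codebook_size=8, seed=0):
    """Greedy residual quantization: at each level, fit a small k-means
    codebook to the current residuals, then subtract the assigned centroid."""
    rng = np.random.default_rng(seed)
    residual = embeddings.copy()
    codebooks = []
    for _ in range(num_levels):
        # Initialize centroids from random residuals, then run a few
        # k-means iterations (enough for a sketch, not production use).
        centroids = residual[rng.choice(len(residual), codebook_size, replace=False)]
        for _ in range(10):
            dists = np.linalg.norm(residual[:, None] - centroids[None], axis=-1)
            assign = dists.argmin(axis=1)
            for k in range(codebook_size):
                members = residual[assign == k]
                if len(members):
                    centroids[k] = members.mean(axis=0)
        codebooks.append(centroids)
        # Subtract each point's nearest centroid; the leftover residual
        # is what the next level's codebook will model.
        dists = np.linalg.norm(residual[:, None] - centroids[None], axis=-1)
        residual = residual - centroids[dists.argmin(axis=1)]
    return codebooks

def encode(x, codebooks):
    """Map one vector to a tuple of codeword ids, one id per level."""
    ids, residual = [], x.copy()
    for cb in codebooks:
        i = int(np.linalg.norm(cb - residual, axis=1).argmin())
        ids.append(i)
        residual = residual - cb[i]
    return ids

def decode(ids, codebooks):
    """Reconstruct an approximate vector by summing the chosen codewords."""
    return sum(cb[i] for i, cb in zip(ids, codebooks))
```

With `num_levels` codebooks of size `codebook_size`, each embedding compresses to `num_levels` small integers, while deeper levels progressively reduce reconstruction error on the training set; the semantic-id representations in these papers build on the same principle with learned, semantics-aware codebooks.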
Advances in Multimodal Recommendation Systems
Sources
VQL: An End-to-End Context-Aware Vector Quantization Attention for Ultra-Long User Behavior Modeling
PCR-CA: Parallel Codebook Representations with Contrastive Alignment for Multiple-Category App Recommendation
Progressive Semantic Residual Quantization for Multimodal-Joint Interest Modeling in Music Recommendation