Multilingual Large Language Models for E-commerce and Beyond

The field of large language models (LLMs) is moving toward multilingual capabilities, both to serve emerging e-commerce markets and to improve alignment across languages. Researchers are exploring methods to enhance query understanding, mitigate training-label errors, and construct high-quality multilingual preference data. Scalable methods for building web-based corpora for LLMs are also gaining attention, with a focus on language-specific filtering pipelines and on adapting models to target languages.

Noteworthy papers include:

- CSRM-LLM presents a framework for cold-start relevance matching in emerging e-commerce markets using multilingual LLMs, reporting significant online gains.
- CM-Align proposes a consistency-based method for improving multilingual alignment, outperforming existing approaches.
- Building High-Quality Datasets for Portuguese LLMs demonstrates the importance of language-specific data and preprocessing strategies for LLM performance.
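To make the idea of a language-specific filtering pipeline concrete, here is a minimal sketch of one filtering stage: keeping only documents that look like the target language (Portuguese, in this example). The stopword list and threshold are hypothetical illustrative choices; production pipelines typically rely on trained language-identification models (e.g., fastText) plus additional quality filters.

```python
# Illustrative sketch of one stage in a language-specific corpus filter.
# The stopword set and threshold below are hypothetical; real pipelines
# use trained language-ID models and further quality heuristics.

PT_STOPWORDS = {
    "de", "que", "não", "uma", "para", "com",
    "os", "as", "um", "mas", "são", "também",
}

def pt_stopword_ratio(text: str) -> float:
    """Fraction of whitespace tokens that are common Portuguese stopwords."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in PT_STOPWORDS for t in tokens) / len(tokens)

def filter_corpus(docs: list[str], threshold: float = 0.15) -> list[str]:
    """Keep documents that look Portuguese under the stopword heuristic."""
    return [d for d in docs if pt_stopword_ratio(d) >= threshold]

docs = [
    "Este é um texto de exemplo para o corpus, com palavras comuns.",
    "This is an English document that should be filtered out.",
]
kept = filter_corpus(docs)  # only the Portuguese document survives
```

In a full pipeline this stage would sit after deduplication and boilerplate removal, with the threshold tuned per language against a held-out labeled sample.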

Sources

CSRM-LLM: Embracing Multilingual LLMs for Cold-Start Relevance Matching in Emerging E-commerce Markets

CM-Align: Consistency-based Multilingual Alignment for Large Language Models

Building High-Quality Datasets for Portuguese LLMs: From Common Crawl Snapshots to Industrial-Grade Corpora
