Advances in Low-Resource Machine Translation and Multilingual Language Modeling

The field of natural language processing is moving toward improving machine translation and language modeling in low-resource settings. Recent studies show that backtranslation, a widely used technique for generating synthetic training data, may not always be effective in high-quality, low-resource settings. Instead, researchers are exploring new methods for adapting pre-trained language models to low-resource languages and developing more effective techniques for translating style and cultural nuances. Noteworthy papers include AdaptGOT, a pre-trained model for adaptive contextual point-of-interest (POI) representation learning, and LIGHT, a novel multi-modal approach for linking text on historical maps. Additionally, the Translation Barrier Hypothesis highlights the importance of addressing implicit translation failure in multilingual generation with large language models.
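Backtranslation, as mentioned above, pairs real target-language text with machine-generated source-language text to create synthetic parallel data. A minimal sketch of that pairing step is shown below; `reverse_translate` is a hypothetical stand-in for a trained target-to-source MT model, not a real API.

```python
# Sketch of backtranslation for synthetic parallel data.
# Assumption: `reverse_translate` is a placeholder for a real
# target->source translation model (e.g. a trained NMT system).

def reverse_translate(sentence: str) -> str:
    # Placeholder: a real model would produce a source-language
    # translation of the target-language sentence.
    return "<synthetic-source> " + sentence

def backtranslate(monolingual_target: list[str]) -> list[tuple[str, str]]:
    """Pair each real target sentence with its machine-generated
    source sentence, yielding synthetic (source, target) pairs."""
    return [(reverse_translate(t), t) for t in monolingual_target]

pairs = backtranslate(["First target sentence.", "Second target sentence."])
for src, tgt in pairs:
    print(src, "->", tgt)
```

The real target side is kept intact, so only the synthetic source side carries translation noise; the saturation result in the first paper below concerns how much of such data still helps.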

Sources

The Saturation Point of Backtranslation in High Quality Low Resource English Gujarati Machine Translation

AdaptGOT: A Pre-trained Model for Adaptive Contextual POI Representation Learning

Can Peter Pan Survive MT? A Stylometric Study of LLMs, NMTs, and HTs in Children's Literature Translation

Decoding Machine Translationese in English-Chinese News: LLMs vs. NMTs

LIGHT: Multi-Modal Text Linking on Historical Maps

The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure

MariNER: A Dataset for Historical Brazilian Portuguese Named Entity Recognition

Two Spelling Normalization Approaches Based on Large Language Models

Information Loss in LLMs' Multilingual Translation: The Role of Training Data, Language Proximity, and Language Family

Towards Style Alignment in Cross-Cultural Translation

Natural language processing for African languages

The Cognate Data Bottleneck in Language Phylogenetics

Matching and Linking Entries in Historical Swedish Encyclopedias

MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining
