Advances in Multilingual NLP and Low-Resource Languages

The field of natural language processing (NLP) is moving toward greater inclusivity and support for low-resource languages. Recent research has developed large-scale language models for languages such as Turkish and Hebrew, which have traditionally been underrepresented in NLP research; these models perform competitively with existing multilingual models and underscore the importance of corpus quality and diversity for strong results. Another line of work investigates cross-lingual tokenizer inequities and methods to mitigate them, yielding more efficient and effective tokenizers that handle a wide range of languages. Research has also explored automated quality control for language documentation, such as detecting phonotactic inconsistencies in wordlists. Noteworthy papers include SindBERT, a large-scale RoBERTa-based encoder for Turkish, and HalleluBERT, which sets a new state of the art for Hebrew. The paper on Explaining and Mitigating Crosslingual Tokenizer Inequities analyzes the causes of token premiums and proposes methods to reduce them, while the paper on Model-Aware Tokenizer Transfer presents a tokenizer-transfer approach that incorporates model internals into the transfer process.
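To make the "token premium" idea concrete: it can be measured as the ratio of token counts a tokenizer spends on parallel (same-meaning) sentences in two languages. The sketch below is a minimal illustration with a toy fixed-width subword tokenizer and invented example sentences; it is not the method from the cited paper, which works with real model tokenizers and parallel corpora.

```python
# Illustrative sketch of measuring a cross-lingual "token premium":
# the ratio of token counts needed to encode comparable content in
# one language versus a reference language. The tokenizer here is a
# deliberately crude stand-in (whitespace split, then 4-character
# chunks), not a real subword tokenizer.

import re


def toy_tokenize(text):
    # Split on whitespace, then break each "word" into 4-char chunks,
    # mimicking how subword tokenizers fragment unfamiliar words.
    pieces = []
    for word in re.findall(r"\S+", text):
        pieces.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return pieces


def token_premium(text, reference_text):
    # >1.0 means `text` costs more tokens than the reference for
    # parallel content; <1.0 means it costs fewer.
    return len(toy_tokenize(text)) / len(toy_tokenize(reference_text))


# Hypothetical parallel pair (English reference vs. Turkish), for
# illustration only; real measurements average over a parallel corpus.
en = "The weather is very nice today"
tr = "Bugün hava çok güzel"
premium = token_premium(tr, en)
```

With a real subword tokenizer, averaging such ratios over a parallel corpus quantifies how much more compute and context length some languages pay for the same content, which is the inequity the tokenizer-transfer work aims to reduce.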
Sources
Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist
ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
Irony Detection in Urdu Text: A Comparative Study Using Machine Learning Models and Large Language Models
A Sociophonetic Analysis of Racial Bias in Commercial ASR Systems Using the Pacific Northwest English Corpus
PTPP-Aware Adaptation Scaling Laws: Predicting Domain-Adaptation Performance at Unseen Pre-Training Budgets
Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?