Advances in Multilingual NLP and Low-Resource Languages

The field of natural language processing (NLP) is moving toward greater inclusivity and support for low-resource languages. Recent research has developed large-scale language models for languages such as Turkish and Hebrew, which have traditionally been underrepresented in NLP research; these models perform competitively with existing multilingual models and underscore the importance of corpus quality and diversity for strong results. Another line of work investigates cross-lingual tokenizer inequities and methods to mitigate them, yielding more efficient and effective tokenizers that handle a wide range of languages. Research has also explored automated quality control for language documentation, such as detecting phonotactic inconsistencies in wordlists. Noteworthy papers include SindBERT, a large-scale RoBERTa-based encoder for Turkish, and HalleluBERT, which sets a new state of the art for Hebrew. The paper on Explaining and Mitigating Crosslingual Tokenizer Inequities analyzes the causes of token premiums and proposes methods to reduce them, while the paper on Model-Aware Tokenizer Transfer presents a tokenizer-transfer approach that incorporates model internals into the transfer process.
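To make the "token premium" idea concrete: it can be measured as the ratio of token counts a tokenizer spends on parallel (same-meaning) sentences in two languages. The sketch below is a minimal illustration with a toy fixed-width subword tokenizer and invented example sentences; it is not the method from the cited paper, which works with real model tokenizers and parallel corpora.

```python
# Illustrative sketch of measuring a cross-lingual "token premium":
# the ratio of token counts needed to encode comparable content in
# one language versus a reference language. The tokenizer here is a
# deliberately crude stand-in (whitespace split, then 4-character
# chunks), not a real subword tokenizer.

import re


def toy_tokenize(text):
    # Split on whitespace, then break each "word" into 4-char chunks,
    # mimicking how subword tokenizers fragment unfamiliar words.
    pieces = []
    for word in re.findall(r"\S+", text):
        pieces.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return pieces


def token_premium(text, reference_text):
    # >1.0 means `text` costs more tokens than the reference for
    # parallel content; <1.0 means it costs fewer.
    return len(toy_tokenize(text)) / len(toy_tokenize(reference_text))


# Hypothetical parallel pair (English reference vs. Turkish), for
# illustration only; real measurements average over a parallel corpus.
en = "The weather is very nice today"
tr = "Bugün hava çok güzel"
premium = token_premium(tr, en)
```

With a real subword tokenizer, averaging such ratios over a parallel corpus quantifies how much more compute and context length some languages pay for the same content, which is the inequity the tokenizer-transfer work aims to reduce.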
Sources
Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist
ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
Irony Detection in Urdu Text: A Comparative Study Using Machine Learning Models and Large Language Models
A Sociophonetic Analysis of Racial Bias in Commercial ASR Systems Using the Pacific Northwest English Corpus
PTPP-Aware Adaptation Scaling Laws: Predicting Domain-Adaptation Performance at Unseen Pre-Training Budgets
Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?