Advances in Language Model Interpretability and Multilingual Capabilities

The field of natural language processing is moving toward more interpretable and more multilingual large language models. Recent research focuses on understanding how individual training samples influence model decisions and on auditing large-scale datasets. There is also growing interest in language-aware tokenization for morphologically rich scripts and in the role of multi-head self-attention in supporting multilingual processing. Researchers are further examining the interplay between positional encodings, morphological complexity, and word order flexibility to better understand how these factors shape language modeling. Noteworthy papers include Evaluating Subword Tokenization Techniques for Bengali, which presents a Byte Pair Encoding tokenizer developed specifically for the Bengali script, and Focusing on Language, which proposes a method for identifying the attention heads most important for multilingual capabilities in large language models. Work on low-resource languages such as Bangla is also gaining attention, with new datasets and models for tasks like authorship attribution and sign language translation. Overall, the field is advancing toward more transparent, efficient, and inclusive language models.
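
To make the tokenization trend concrete, here is a minimal sketch of training a generic subword BPE tokenizer on Bengali text with the Hugging Face tokenizers library. This is not the BengaliBPE method from the paper listed below; the corpus file name and vocabulary size are placeholder assumptions for illustration only.

```python
# Minimal sketch: train a generic BPE tokenizer on Bengali text.
# Assumptions: "bengali_corpus.txt" is a hypothetical plain-text corpus,
# and the 32k vocabulary size is an arbitrary illustrative choice.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()  # split on whitespace/punctuation before learning merges

trainer = BpeTrainer(vocab_size=32000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["bengali_corpus.txt"], trainer=trainer)

# Encode a Bengali sentence and inspect the learned subword segmentation.
encoding = tokenizer.encode("আমি বাংলায় গান গাই")
print(encoding.tokens)
```

A script-aware tokenizer such as the one proposed for Bengali would replace the simple whitespace pre-tokenization above with rules tailored to the script's conjuncts and diacritics; the sketch only shows the baseline BPE pipeline such work builds on.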

Sources

First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation

Order-Level Attention Similarity Across Language Models: A Latent Commonality

Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE

Focusing on Language: Revealing and Exploiting Language Attention Heads in Multilingual Large Language Models

BARD10: A New Benchmark Reveals Significance of Bangla Stop-Words in Authorship Attribution

On the Interplay between Positional Encodings, Morphological Complexity, and Word Order Flexibility

Introducing A Bangla Sentence - Gloss Pair Dataset for Bangla Sign Language Translation and Research

Evaluating DisCoCirc in Translation Tasks & its Limitations: A Comparative Study Between Bengali & English

BNLI: A Linguistically-Refined Bengali Dataset for Natural Language Inference

Boosting Adversarial Transferability via Ensemble Non-Attention

The Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages
