Advances in Language Model Training and Evaluation

The field of natural language processing is seeing significant developments in language model training and evaluation, with researchers exploring new approaches to improve the performance and efficiency of large language models (LLMs). One notable direction is the study of memorization in LLMs, which raises critical questions about model behavior, privacy risks, and the boundary between learning and memorization. A related line of work develops methods for measuring and mitigating memorization, including data cleaning, differential privacy, and post-training unlearning. There is also growing emphasis on cross-lingual learning, with studies examining retrieval biases in retrieval-augmented generation over mixed-language corpora and proposing approaches for improving multilingual retrieval.
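
One common way to operationalize memorization measurement is an extraction-style probe: prompt the model with a prefix drawn from (suspected) training data and check whether greedy decoding reproduces the held-out suffix verbatim. The sketch below assumes a HuggingFace-style causal LM; the model name, helper name, and token budget are illustrative assumptions, not details taken from the papers listed here.

```python
# Minimal sketch of an extraction-style memorization probe (illustrative only).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def is_memorized(prefix: str, true_suffix: str, max_new_tokens: int = 32) -> bool:
    """Return True if greedy decoding reproduces the reference suffix verbatim."""
    inputs = tokenizer(prefix, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding keeps the check deterministic
    )
    # Decode only the newly generated continuation, not the prompt.
    generated = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )
    return generated.strip().startswith(true_suffix.strip())
```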

Notable papers in this area include 'PhoniTale: Phonologically Grounded Mnemonic Generation for Typologically Distant Language Pairs', which presents a cross-lingual mnemonic generation system that retrieves L1 keyword sequences based on phonological similarity and uses LLMs to generate mnemonics, and 'Dynamic Chunking for End-to-End Hierarchical Sequence Modeling', which introduces a dynamic chunking mechanism that learns content- and context-dependent segmentation strategies jointly with the rest of the model.
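
As a rough illustration of the phonological-similarity retrieval step (not PhoniTale's actual pipeline), the sketch below ranks candidate L1 keywords by the similarity of their phoneme strings to the phonemes of a target L2 word. The toy lexicon and the placeholder phoneme transcriptions are assumptions for demonstration.

```python
# Toy sketch of phonological-similarity retrieval over a tiny L1 lexicon.
from difflib import SequenceMatcher

# Hypothetical L1 keyword lexicon mapping words to space-separated phonemes.
L1_LEXICON = {
    "banana": "b a n a n a",
    "nabi": "n a b i",
    "badak": "b a d a k",
}

def phonological_similarity(phones_a: str, phones_b: str) -> float:
    """Similarity in [0, 1] between two space-separated phoneme strings."""
    return SequenceMatcher(None, phones_a.split(), phones_b.split()).ratio()

def retrieve_keywords(l2_phones: str, top_k: int = 2):
    """Return the top-k L1 keywords whose pronunciation best matches the L2 word."""
    scored = [
        (word, phonological_similarity(l2_phones, phones))
        for word, phones in L1_LEXICON.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]

# Example query with illustrative phonemes for an L2 word.
print(retrieve_keywords("b a n d"))
```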

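For the dynamic chunking idea described above, the following is a minimal sketch, assuming PyTorch, of one way a learnable boundary scorer could make segmentation decisions trainable jointly with the rest of a model; it is not the paper's actual architecture, and the module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class BoundaryScorer(nn.Module):
    """Toy learnable chunker: assigns each position a probability of starting
    a new chunk, so segmentation can be trained jointly with the main model."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, hidden: torch.Tensor, threshold: float = 0.5):
        # hidden: (batch, seq_len, d_model) token or byte representations.
        probs = torch.sigmoid(self.proj(hidden)).squeeze(-1)  # (batch, seq_len)
        boundaries = probs > threshold  # hard chunk starts at inference time
        return probs, boundaries

# Usage: score one random sequence of 16 positions with 32-dim states.
scorer = BoundaryScorer(d_model=32)
probs, boundaries = scorer(torch.randn(1, 16, 32))
```
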
Sources

PhoniTale: Phonologically Grounded Mnemonic Generation for Typologically Distant Language Pairs

The Landscape of Memorization in LLMs: Mechanisms, Measurement, and Mitigation

Evolution without Large Models: Training Language Model with Task Principles

Entropy-Memorization Law: Evaluating Memorization Difficulty of Data in LLMs

ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time

Evaluating Morphological Alignment of Tokenizers in 70 Languages

Pun Intended: Multi-Agent Translation of Wordplay with Contrastive Learning and Phonetic-Semantic Embeddings

The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora

Understanding and Controlling Repetition Neurons and Induction Heads in In-Context Learning

Conditional Unigram Tokenization with Parallel Data

Dynamic Chunking for End-to-End Hierarchical Sequence Modeling
