The field of language models is moving towards more sophisticated and human-like learning capabilities, with a focus on continual learning and multilingual support. Researchers are exploring new methods for adapting language models to new tasks while avoiding catastrophic forgetting of existing knowledge, including studies of the balance between synthetic data generation and replay of prior-task data, where tuning the replay ratio yields strong task adaptation at reduced training cost (a minimal replay-mixing sketch follows the paper list below). There is also growing emphasis on developmentally plausible training data and benchmarks that mirror human learning patterns, enabling finer-grained evaluation of how models progressively acquire new skills. Finally, large-scale, openly licensed text corpora for non-English languages are addressing a critical gap in language model development. Noteworthy papers include:
- BabyBabelLM, which presents a multilingual collection of datasets modeling language acquisition from birth to native language proficiency.
- CurLL, which introduces a comprehensive continual learning dataset and benchmark grounded in human developmental trajectories.
- The German Commons, which compiles 154 billion tokens of openly licensed German text for language model training.
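To make the replay-ratio idea concrete, here is a minimal sketch of mixing new-task examples with replayed examples from earlier tasks when building a training batch. The `mix_training_batch` helper and the `replay_ratio` parameter are illustrative assumptions for this digest, not an API or procedure taken from any of the cited papers.

```python
# Minimal sketch of replay-based data mixing for continual adaptation.
# mix_training_batch and replay_ratio are hypothetical names used for
# illustration; they do not come from the papers summarized above.
import random
from typing import Dict, List


def mix_training_batch(
    new_task_examples: List[Dict],
    replay_buffer: List[Dict],
    batch_size: int = 32,
    replay_ratio: float = 0.25,
) -> List[Dict]:
    """Build one training batch that mixes new-task data with replayed
    examples from earlier tasks, to mitigate catastrophic forgetting."""
    n_replay = int(batch_size * replay_ratio)
    n_new = batch_size - n_replay
    batch = random.sample(new_task_examples, min(n_new, len(new_task_examples)))
    if replay_buffer:
        batch += random.sample(replay_buffer, min(n_replay, len(replay_buffer)))
    random.shuffle(batch)
    return batch


# Example: 25% of each batch is drawn from previously seen tasks.
new_data = [{"text": f"new-task example {i}"} for i in range(100)]
old_data = [{"text": f"prior-task example {i}"} for i in range(100)]
batch = mix_training_batch(new_data, old_data, batch_size=8, replay_ratio=0.25)
```

Raising the replay ratio generally trades some new-task adaptation speed for better retention of earlier skills; the papers above study where that trade-off is most favorable.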