Low-Resource Language Processing

Natural language processing research is paying increasing attention to the challenges of low-resource languages. Current work includes new dictionary tools, such as conceptual dictionaries, and better fine-tuning procedures for language models in these languages. Active learning and data clustering are being explored as ways to improve model performance when training data is limited. There is also a focus on building multilingual speech datasets and instruction datasets for under-resourced languages, to support more accurate automatic speech recognition and text generation systems.

Noteworthy papers include the Slovak Conceptual Dictionary, the first linguistic tool of its kind for the Slovak language. Enhancing BERT Fine-Tuning for Sentiment Analysis in Lower-Resourced Languages proposes a method that can yield annotation savings of up to 30% and performance improvements of up to four F1 points. InstructLR is a framework for generating high-quality instruction datasets for low-resource languages. TriLex is a scalable framework for multilingual sentiment analysis in low-resource South African languages, achieving F1 scores above 80% for certain languages.
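To make the combination of active learning and data clustering more concrete, here is a minimal sketch of one common pattern: pick the examples the current model is least sure about, while spreading the annotation budget across clusters so the selected batch stays diverse. This is an illustration under stated assumptions, not the method of any paper listed here. The function names (`uncertainty`, `select_for_annotation`), the binary sentiment setup, and the round-robin selection rule are all hypothetical, and the cluster ids are assumed to come from some earlier clustering step over sentence embeddings.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence


def uncertainty(prob_positive: float) -> float:
    """Binary uncertainty score: 1.0 at p = 0.5, 0.0 at p = 0 or p = 1."""
    return 1.0 - abs(prob_positive - 0.5) * 2.0


def select_for_annotation(
    probs: Sequence[float],
    cluster_ids: Sequence[Hashable],
    budget: int,
) -> List[int]:
    """Return indices of unlabeled examples to send to annotators.

    Hypothetical helper: takes the most uncertain example from each
    cluster in turn (round-robin) until the budget is used up or no
    examples remain.
    """
    if len(probs) != len(cluster_ids):
        raise ValueError("probs and cluster_ids must have the same length")

    # Group example indices by their (assumed precomputed) cluster id.
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)

    # Within each cluster, order examples from most to least uncertain.
    queues = [
        sorted(by_cluster[cid], key=lambda i: (-uncertainty(probs[i]), i))
        for cid in sorted(by_cluster, key=str)
    ]

    selected: List[int] = []
    rank = 0
    while len(selected) < budget:
        added = False
        for queue in queues:
            if rank < len(queue) and len(selected) < budget:
                selected.append(queue[rank])
                added = True
        if not added:  # every cluster is exhausted
            break
        rank += 1
    return selected


if __name__ == "__main__":
    # Toy model outputs: P(positive) for six unlabeled sentences.
    probs = [0.51, 0.95, 0.48, 0.10, 0.55, 0.99]
    clusters = [0, 0, 1, 1, 2, 2]
    print(select_for_annotation(probs, clusters, budget=4))
```

The round-robin rule is a deliberately simple design choice: pure uncertainty sampling can draw the whole batch from one region of the data, which wastes annotation effort when labeled examples are scarce.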

Sources

Slovak Conceptual Dictionary

Enhancing BERT Fine-Tuning for Sentiment Analysis in Lower-Resourced Languages

Swivuriso: The South African Next Voices Multilingual Speech Dataset

InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages

TriLex: A Framework for Multilingual Sentiment Analysis in Low-Resource South African Languages
