Advances in Specialized Language Models and Retrieval Systems

The field of natural language processing is moving toward specialized language models and retrieval systems that can handle domain-specific terminology and semantics. Recent work trains models on large-scale corpora curated for particular domains, such as manufacturing and biomedicine, and reports substantial gains on tasks including natural language inference, semantic textual similarity, and retrieval. Training techniques such as contrastive learning and many-to-many InfoNCE objectives have been explored to further improve these models, and the development of unified evaluation suites and benchmarks now allows direct comparison of models and techniques, driving progress in the field. Notable papers include ManufactuBERT, which establishes a new state of the art on manufacturing-related NLP tasks, and BiCA, which proposes a novel approach to hard-negative mining using citation links between biomedical articles. In addition, TurkEmbed4Retrieval reports state-of-the-art performance on Turkish retrieval tasks, and Unified Work Embeddings demonstrates zero-shot ranking on unseen target spaces in the work domain.
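To make the contrastive objectives mentioned above concrete, the sketch below implements a standard in-batch InfoNCE loss alongside a many-to-many variant in which each query may have several valid positives in the batch. This is a minimal illustration rather than the exact objective from any paper listed under Sources; the function names, the `pos_mask` convention, and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(queries: torch.Tensor, docs: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    """One-to-one InfoNCE: doc i is the positive for query i,
    and all other in-batch docs act as negatives."""
    q = F.normalize(queries, dim=-1)
    d = F.normalize(docs, dim=-1)
    logits = q @ d.T / temperature                     # (B, B) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)  # positive sits on the diagonal
    return F.cross_entropy(logits, labels)

def many_to_many_info_nce(queries: torch.Tensor, docs: torch.Tensor,
                          pos_mask: torch.Tensor,
                          temperature: float = 0.05) -> torch.Tensor:
    """Many-to-many variant (assumed form): pos_mask[i, j] is True when doc j
    is a valid positive for query i, so one query can have several targets."""
    q = F.normalize(queries, dim=-1)
    d = F.normalize(docs, dim=-1)
    logits = q @ d.T / temperature
    log_probs = F.log_softmax(logits, dim=-1)
    # Average the log-likelihood over each query's set of positives.
    pos_counts = pos_mask.sum(dim=-1).clamp(min=1)
    per_query = -(log_probs * pos_mask).sum(dim=-1) / pos_counts
    return per_query.mean()
```

The idea behind citation-aware hard negatives can be illustrated the same way: documents connected to a relevant article by citation links tend to be topically close, so those not themselves labeled relevant make challenging negatives. The graph representation and selection rule below are assumptions for illustration only; BiCA's actual mining procedure is defined in the paper.

```python
from typing import Dict, List, Set

def citation_hard_negatives(positive_id: str,
                            citation_graph: Dict[str, Set[str]],
                            relevant_ids: Set[str],
                            k: int = 4) -> List[str]:
    # Citation neighbors of a relevant article are topically close; keeping
    # the ones not labeled relevant yields hard negatives (an illustrative
    # rule, not necessarily BiCA's exact one).
    neighbors = citation_graph.get(positive_id, set())
    candidates = sorted(d for d in neighbors if d not in relevant_ids)
    return candidates[:k]
```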

Sources

ManufactuBERT: Efficient Continual Pretraining for Manufacturing

TurkEmbed4Retrieval: Turkish Embedding Model for Retrieval Task

Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker

BiCA: Effective Biomedical Dense Retrieval with Citation-Aware Hard Negatives

From IDs to Semantics: A Generative Framework for Cross-Domain Recommendation with Adaptive Semantic Tokenization

Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning

TurkEmbed: Turkish Embedding Model on NLI & STS Tasks

Pretraining Finnish ModernBERTs
