Advances in Large Language Models

The field of large language models (LLMs) is rapidly advancing, with a focus on improving their performance, interpretability, and reliability. Recent research has explored the use of LLMs in applications ranging from natural language processing to materials science and biomedical research. One notable direction is the development of methods to align LLMs with specialized knowledge, such as Balanced Fine-Tuning, which has shown promising results on biomedical tasks. Another is the identification of error slices in LLMs, which is crucial for understanding and improving their performance; Active Slice Discovery has been proposed as an approach that reduces the amount of manual annotation required. There is also growing interest in evaluating the reliability and trustworthiness of LLMs, with studies investigating paired bootstrap protocols and bias-correction methods. Noteworthy papers include 'Vector Arithmetic in Concept and Token Subspaces', which demonstrates that vector arithmetic recovers coherent semantic structure in LLM representations, and 'Toward Trustworthy Difficulty Assessments', which highlights the challenges of using LLMs as judges in programming and synthetic tasks. In addition, 'CoreEval' proposes a contamination-resilient evaluation strategy for LLMs, and 'Auxiliary Metrics Help Decoding Skill Neurons in the Wild' introduces a method for isolating neurons that encode specific skills.
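To make the paired bootstrap idea concrete, the sketch below shows one common variant: resampling paired per-example scores with replacement and counting how often system A's mean exceeds system B's. This is a generic illustration, not the specific protocol of any paper listed here; the function name and toy scores are invented for the example.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Resample paired per-example scores with replacement and return
    the fraction of resamples in which system A's total score exceeds
    system B's -- an approximate confidence that A's improvement is real."""
    assert len(scores_a) == len(scores_b), "scores must be paired per example"
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw n example indices with replacement (one bootstrap sample).
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_a[i] - scores_b[i] for i in idx)
        if diff > 0:
            wins += 1
    return wins / n_resamples

# Toy per-example scores: A edges out B on most items.
a = [0.82, 0.78, 0.91, 0.66, 0.74, 0.85, 0.80, 0.77]
b = [0.80, 0.75, 0.90, 0.68, 0.70, 0.83, 0.79, 0.76]
print(paired_bootstrap(a, b))
```

Because the resampling is paired, per-example difficulty cancels out, which is what makes small improvements (the "+1%" regime) detectable at all; a value near 1.0 suggests the improvement survives resampling, while a value near 0.5 suggests it does not.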

Sources

Vector Arithmetic in Concept and Token Subspaces

Toward Trustworthy Difficulty Assessments: Large Language Models as Judges in Programming and Synthetic Tasks

From Reviewers' Lens: Understanding Bug Bounty Report Invalid Reasons with LLMs

CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation

Efficient Inference Using Large Language Models with Limited Human Data: Fine-Tuning then Rectification

Training-Free Active Learning Framework in Materials Science with Large Language Models

When +1% Is Not Enough: A Paired Bootstrap Protocol for Evaluating Small Improvements

LLMs-Powered Accurate Extraction, Querying and Intelligent Management of Literature derived 2D Materials Data

Active Slice Discovery in Large Language Models

Aligning LLMs with Biomedical Knowledge using Balanced Fine-Tuning

How to Correctly Report LLM-as-a-Judge Evaluations

Can Finetuning LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?

Best Practices for Machine Learning Experimentation in Scientific Applications

Auxiliary Metrics Help Decoding Skill Neurons in the Wild

Revisiting Generalization Across Difficulty Levels: It's Not So Easy
