Advancements in Large Language Models and Human Collaboration

The field of natural language processing is witnessing a significant shift toward integrating large language models (LLMs) with human collaboration. Recent studies demonstrate the potential of LLMs to accelerate the delivery of context and to improve accuracy on tasks such as annotation and stance detection. At the same time, these studies underscore the importance of human feedback and oversight in ensuring that LLM outputs are reliable and trustworthy. Using confidence thresholds and inter-model disagreement to selectively trigger human review has been shown to improve annotation reliability while reducing human effort (a minimal sketch of such a routing rule follows the list below). Furthermore, developing leaderboards and evaluation standards for LLMs is crucial for establishing a community-driven approach to advancing the field. Noteworthy papers include:

  • Scaling Human Judgment in Community Notes with LLMs, which proposes a new paradigm for community notes that leverages the strengths of both humans and LLMs.
  • Reliable Annotations with Less Effort, which demonstrates the effectiveness of a human-in-the-loop workflow in improving annotation reliability.
  • VERBA, which introduces a protocol for verbalizing model differences with LLMs, enabling fine-grained pairwise comparisons between models.
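
To make the selective-review idea mentioned above concrete, the sketch below shows one way such a routing rule could look. It is a minimal illustration, not an implementation from any of the cited papers; the Annotation class, the needs_human_review function, and the 0.8 confidence threshold are assumptions chosen for the example.

```python
# Minimal sketch of a selective human-review policy. The names and the
# threshold value are hypothetical, not taken from the cited papers.
from dataclasses import dataclass


@dataclass
class Annotation:
    label: str
    confidence: float  # model's calibrated confidence in [0, 1]


def needs_human_review(
    annotations: list[Annotation],
    confidence_threshold: float = 0.8,
) -> bool:
    """Route an item to a human reviewer when the models disagree on the
    label or when any model falls below the confidence threshold."""
    labels = {a.label for a in annotations}
    if len(labels) > 1:  # inter-model disagreement
        return True
    if any(a.confidence < confidence_threshold for a in annotations):
        return True  # low confidence from at least one model
    return False


# Example: two LLM annotations for the same item
item = [Annotation("relevant", 0.92), Annotation("not_relevant", 0.88)]
print(needs_human_review(item))  # True: the models disagree, so a human reviews it
```

In practice the threshold would be tuned against held-out human labels, trading off annotation reliability against reviewer workload, which is the balance the human-in-the-loop studies above aim to strike.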

Sources

Shifting Narratives: A Longitudinal Analysis of Media Trends and Public Attitudes on Homelessness

Scaling Human Judgment in Community Notes with LLMs

Reliable Annotations with Less Effort: Evaluating LLM-Human Collaboration in Search Clarifications

La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America

Is External Information Useful for Stance Detection with LLMs?

When LLMs Disagree: Diagnosing Relevance Filtering Bias and Retrieval Divergence in SDG Search

VERBA: Verbalizing Model Differences Using Large Language Models

Synthetic Heuristic Evaluation: A Comparison between AI- and Human-Powered Usability Evaluation
