Advancements in Large Language Models

The field of large language models (LLMs) is evolving rapidly, with a focus on improving their ability to use external tools, navigate complex websites, and understand nuanced language. Recent work has centered on building more comprehensive benchmarks that evaluate the memory, forecasting, and tool-use capabilities of LLMs. These benchmarks aim to provide realistic yet hermetic testing environments, enabling more accurate assessments of LLM performance. There is also growing emphasis on adapting LLMs for robust tool use in non-English languages and on continual-learning methodologies that help agents accumulate experience and transfer knowledge across tasks. Noteworthy papers include Bench to the Future, which introduces a pastcasting benchmark for forecasting agents, and MemBench, which presents a comprehensive dataset and benchmark for evaluating the memory capability of LLM-based agents. Other notable works include DICE-BENCH, which evaluates the tool-use capabilities of LLMs in multi-round, multi-party dialogues, and WebSailor, which introduces a post-training methodology to instill superhuman reasoning in web agents.

Sources

Bench to the Future: A Pastcasting Benchmark for Forecasting Agents

MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues

Decoding Memes: Benchmarking Narrative Role Classification across Multilingual and Multimodal Models

Teaching a Language Model to Speak the Language of Tools

SWE-Bench-CL: Continual Learning for Coding Agents

LineRetriever: Planning-Aware Observation Reduction for Web Agents

MassTool: A Multi-Task Search-Based Tool Retrieval Framework for Large Language Models

Capsule Network-Based Semantic Intent Modeling for Human-Computer Interaction

MemeCMD: An Automatically Generated Chinese Multi-turn Dialogue Dataset with Contextually Retrieved Memes

WebArXiv: Evaluating Multimodal Agents on Time-Invariant arXiv Tasks

AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

WebSailor: Navigating Super-human Reasoning for Web Agent
