Advances in Large Language Model Reasoning

The field of large language models is increasingly focused on improving reasoning capabilities, with particular attention to distinguishing genuine capability acquisition from superficial memorization. Researchers are proposing new evaluation frameworks, such as TrinEval, and benchmarks of latent-space reasoning that quantify reasoning performed inside the model rather than in generated text. Weight-of-Thought reasoning is a new approach that examines neural network weights to identify reasoning pathways, demonstrating superior performance on diverse reasoning tasks. Another line of work develops formal reasoning-driven exploration paradigms, such as Kimina-Prover, which shows strong performance in formal theorem proving. Noteworthy papers include:

  • Beyond Chains of Thought, which introduces a benchmark for quantifying model-internal (latent-space) reasoning and finds significant performance variation across LLMs; a minimal hidden-state probing sketch follows this list.
  • Weight-of-Thought Reasoning, which achieves superior performance on diverse reasoning tasks compared to traditional methods.
  • Kimina-Prover Preview, which pioneers a reasoning-driven exploration paradigm for formal theorem proving and sets a new state of the art on the miniF2F benchmark; an illustrative Lean-style goal is sketched below.
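
Latent-space benchmarks of this kind operate on a model's intermediate hidden states rather than its emitted tokens. The sketch below is illustrative only, not the paper's protocol: it assumes a HuggingFace-style causal LM (the model name is a placeholder) and shows how to extract the per-layer representations that a probe for model-internal reasoning would typically be trained on.

```python
# Minimal sketch of extracting per-layer hidden states, the raw material
# that latent-space reasoning probes operate on. Illustrative only: the
# model choice and prompt are placeholders, not the benchmark's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

prompt = "If all bloops are razzies and all razzies are lazzies, are all bloops lazzies?"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple of (num_layers + 1) tensors, each of
# shape [batch, seq_len, hidden_dim]. A linear probe trained on these
# representations can test whether the answer is already decodable from
# intermediate layers, before any chain-of-thought tokens are emitted.
for layer_idx, h in enumerate(outputs.hidden_states):
    print(f"layer {layer_idx}: {tuple(h.shape)}")
```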

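For the formal theorem-proving setting, miniF2F consists of competition-style problems formalized as theorem statements, and the prover must produce a machine-checkable proof. The Lean 4 snippet below is a hypothetical example in that style, not a problem drawn from the benchmark, and assumes Mathlib is available.

```lean
import Mathlib

-- A miniF2F-style goal: the statement is given, and a prover such as
-- Kimina-Prover must generate the tactic proof that Lean then checks.
theorem sq_sum_nonneg (a b : ℤ) : 0 ≤ a ^ 2 + b ^ 2 := by
  positivity
```
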
Sources

Large language models could be rote learners

Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models

Weight-of-Thought Reasoning: Exploring Neural Network Weights for Enhanced LLM Reasoning

Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

Teaching Large Language Models to Reason through Learning and Forgetting

Replicating ReLM Results: Validating Large Language Models with ReLM

Memorization vs. Reasoning: Updating LLMs with New Knowledge

Sleep-time Compute: Beyond Inference Scaling at Test-time