The field of large language models is moving toward stronger reasoning capabilities, with a focus on distinguishing genuine capability acquisition from superficial memorization. Researchers are exploring novel evaluation frameworks, such as TrinEval, and benchmarking latent-space reasoning abilities to quantify model-internal reasoning. Weight-of-Thought reasoning is a new approach that examines neural network weights, rather than token-level outputs, to identify reasoning pathways. Another active direction is formal reasoning-driven exploration, exemplified by Kimina-Prover, which shows strong performance in formal theorem proving. Noteworthy papers include:
- Beyond Chains of Thought, which introduces a benchmark for quantifying model-internal (latent-space) reasoning and finds significant performance variation across LLMs.
- Weight-of-Thought Reasoning, which outperforms traditional methods on diverse reasoning tasks by examining network weights directly; a minimal sketch of the weight-inspection idea appears after this list.
- Kimina-Prover Preview, which pioneers a reasoning-driven exploration paradigm for formal theorem proving and sets a new state of the art on the miniF2F benchmark; an illustrative example of the task format follows below.
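
To make the Weight-of-Thought idea concrete, here is a minimal, hypothetical sketch: reasoning is modeled as message passing over a small graph of reasoning nodes, and the learned edge-weight matrix is then inspected to surface the strongest "reasoning pathways". The shapes, names, and scoring rule below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Hypothetical sketch of Weight-of-Thought-style pathway inspection:
# reasoning is modeled as message passing over a small graph of nodes,
# and the learned edge-weight matrix is examined to rank "pathways".
# All shapes, names, and the scoring rule are assumptions.

rng = np.random.default_rng(0)

n_nodes, dim = 4, 8                           # reasoning nodes, hidden size
W_edge = rng.normal(size=(n_nodes, n_nodes))  # stand-in for learned edge weights
node_states = rng.normal(size=(n_nodes, dim)) # per-node hidden states

def message_passing_step(states, edges):
    """One round of message passing: each node aggregates its neighbors,
    weighted by the softmax-normalized edge weights."""
    attn = np.exp(edges) / np.exp(edges).sum(axis=1, keepdims=True)
    return np.tanh(attn @ states)

def top_pathways(edges, k=3):
    """Inspect the weight matrix: return the k strongest directed edges,
    interpreted here as the dominant reasoning pathways."""
    flat = [(abs(edges[i, j]), i, j)
            for i in range(edges.shape[0])
            for j in range(edges.shape[1]) if i != j]
    return sorted(flat, reverse=True)[:k]

node_states = message_passing_step(node_states, W_edge)
for weight, src, dst in top_pathways(W_edge):
    print(f"node {src} -> node {dst}: |w| = {weight:.3f}")
```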
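
For context on the Kimina-Prover setting, miniF2F poses competition-style math problems as formal theorem statements that a prover must close with a machine-checked proof. The toy Lean 4 statement below illustrates the task format only; it is not drawn from the benchmark or from Kimina-Prover's output.

```lean
-- Toy illustration of the formal-theorem-proving task format:
-- the prover receives a statement like this and must produce a tactic
-- script (or proof term) that the Lean kernel accepts.
theorem toy_add_comm (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```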