Advances in Reasoning Capabilities of Large Language Models

The field of large language models (LLMs) is moving toward stronger reasoning capabilities for solving complex problems. Recent studies have highlighted the limitations of existing methods such as Chain-of-Thought (CoT) reasoning, and have proposed alternative approaches, including explicit high-level plan generation and bi-level frameworks for structured reasoning. These methods have shown significant improvements in accuracy and generalizability across domains including mathematical reasoning, code generation, and financial question answering. Notably, multi-domain datasets such as CRISP have enabled fine-tuning small models that generate higher-quality plans than much larger models prompted few-shot. Furthermore, the introduction of cache steering methods has improved both the qualitative structure of model reasoning and quantitative task performance. Noteworthy papers include:
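The "explicit high-level plan generation" approach above can be illustrated as a two-stage prompting loop: first elicit a plan, then condition the final answer on it. This is a minimal sketch, not the CRISP method itself; `plan_then_solve` and the toy `toy_llm` stub are hypothetical stand-ins for a real LLM call.

```python
def plan_then_solve(llm, question):
    """Two-stage prompting: generate a high-level plan, then answer
    conditioned on that plan. `llm` is any text-in/text-out callable."""
    plan = llm(f"Outline a step-by-step plan to solve: {question}")
    answer = llm(
        f"Question: {question}\nPlan: {plan}\n"
        "Follow the plan and give only the final answer."
    )
    return plan, answer

# Deterministic toy "LLM" so the sketch runs without a model.
def toy_llm(prompt):
    if prompt.startswith("Outline"):
        return "1) Add the numbers. 2) Report the sum."
    return "5"

plan, answer = plan_then_solve(toy_llm, "What is 2 + 3?")
```

Separating planning from execution is what allows a small fine-tuned planner to be paired with any downstream solver.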

  • CRISP, which introduces a multi-domain dataset for high-level plan generation and demonstrates its effectiveness in improving plan quality.
  • From Language to Logic, which proposes a bi-level framework for structured reasoning and achieves significant accuracy gains on realistic reasoning benchmarks.
  • KV Cache Steering, which presents a lightweight method for implicit steering of language models and improves both the qualitative structure of model reasoning and quantitative task performance.
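One way to picture cache steering: rather than editing the prompt, a steering vector (e.g., extracted from contrastive reasoning vs. non-reasoning examples) is added directly to the cached key/value activations. The sketch below is an assumed, simplified version of that idea using NumPy arrays in place of real transformer caches; the function name and the fixed scale `alpha` are illustrative, not taken from the paper.

```python
import numpy as np

def steer_kv_cache(keys, values, k_vec, v_vec, alpha=0.3):
    """Apply a one-shot steering intervention to a KV cache by adding
    scaled steering vectors to every cached position (broadcast over
    the sequence dimension)."""
    return keys + alpha * k_vec, values + alpha * v_vec

# Toy cache of shape (seq_len, head_dim).
rng = np.random.default_rng(0)
keys = rng.standard_normal((4, 8))
values = rng.standard_normal((4, 8))
k_vec = rng.standard_normal(8)   # hypothetical key steering vector
v_vec = rng.standard_normal(8)   # hypothetical value steering vector

new_k, new_v = steer_kv_cache(keys, values, k_vec, v_vec)
```

Because the intervention touches only the cache, it adds no per-token overhead at decode time, which is what makes the method lightweight.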

Sources

CRISP: Complex Reasoning with Interpretable Step-based Plans

What Factors Affect LLMs and RLLMs in Financial Question Answering?

From Language to Logic: A Bi-Level Framework for Structured Reasoning

KV Cache Steering for Inducing Reasoning in Small Language Models

Is Human-Written Data Enough? The Challenge of Teaching Reasoning to LLMs Without RL or Distillation

Opus: A Prompt Intention Framework for Complex Workflow Generation

Reasoning Strategies in Large Language Models: Can They Follow, Prefer, and Optimize?

Evaluating the Ability of Large Language Models to Reason about Cardinal Directions, Revisited

Agentar-DeepFinance-300K: A Large-Scale Financial Dataset via Systematic Chain-of-Thought Synthesis Optimization
