Scaling Large Language Models with Inference-Time Compute

The field of large language models (LLMs) is advancing rapidly, with a growing focus on scaling inference-time compute to improve performance. Recent work has highlighted efficient inference-time methods, such as Best-of-N sampling and generative reward models, for enhancing the reasoning capabilities of LLMs. Notably, adaptive layer-skipping methods such as FlexiDepth, together with new algorithms such as Entropy-Guided Sequence Weighting and PromptDistill, have delivered significant gains in efficiency and performance. The study of inference-time scaling laws and the development of scalable reward modeling approaches such as DeepSeek-GRM likewise show promise for optimizing test-time compute. Overall, the field is moving toward more efficient and effective inference-time methods that unlock the full potential of LLMs.

Noteworthy papers include:
Is Best-of-N the Best of Them?, which introduces InferenceTimePessimism, a new algorithm that mitigates reward hacking through deliberate use of inference-time compute.
GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning, which proposes a generative process reward model that performs explicit Chain-of-Thought reasoning with code verification.
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model, which introduces the first open-source implementation of large-scale reasoning-oriented RL training, focusing on scalability, simplicity, and accessibility.
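To make the Best-of-N sampling mentioned above concrete, here is a minimal sketch in Python: sample N candidate responses and keep the one an external reward model scores highest. The `generate_candidates` and `score_with_reward_model` functions are hypothetical placeholders standing in for a real LLM sampling call and a reward-model call; this illustrates the general technique, not code from any of the cited papers.

```python
# Minimal sketch of Best-of-N sampling (illustrative; not from the cited papers).
# generate_candidates and score_with_reward_model are hypothetical placeholders.
import random
from typing import Callable, List


def generate_candidates(prompt: str, n: int) -> List[str]:
    # Placeholder: a real system would sample n completions from an LLM
    # with nonzero temperature so the candidates differ.
    return [f"{prompt} -> candidate {i}" for i in range(n)]


def score_with_reward_model(prompt: str, response: str) -> float:
    # Placeholder: a real reward model would score how well `response`
    # answers `prompt`; here we return a random score.
    return random.random()


def best_of_n(
    prompt: str,
    n: int,
    scorer: Callable[[str, str], float] = score_with_reward_model,
) -> str:
    # Sample n candidates and return the one the reward model ranks highest.
    # Inference-time compute scales linearly with n.
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=lambda response: scorer(prompt, response))


if __name__ == "__main__":
    print(best_of_n("Explain why the sky is blue.", n=8))
```

Because the selection step optimizes directly against the reward model, increasing N can also amplify reward-model errors; that reward-hacking failure mode is what InferenceTimePessimism is designed to mitigate.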
Sources
PromptDistill: Query-based Selective Token Retention in Intermediate Layers for Efficient Large Language Model Inference
When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning