The field of large language models is increasingly exploring test-time scaling (TTS) as a way to improve reasoning capabilities. Researchers are investigating several aspects of TTS, including the role of temperature sampling, inference scaling strategies, and the impact of training data on TTS performance. A key finding is that TTS can unlock the latent potential of base models, enabling them to reach performance comparable to their reinforcement learning-trained counterparts. Studies have also shown that TTS can be effective in specific applications such as machine translation, particularly when combined with task-specialized models. Noteworthy papers include:
- One paper proposes test-time scaling along the temperature dimension, which enlarges the reasoning boundary of large language models and yields an additional 7.3 points over single-temperature TTS.
- Another paper introduces the Best-of-Majority strategy, a minimax-optimal inference scaling approach that outperforms majority voting and Best-of-N strategies.
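To make the selection strategies concrete, the following is a minimal sketch of a Best-of-Majority-style selection rule, assuming a simple formulation: generate N candidate answers, restrict attention to the answer(s) with the highest vote count, and break ties among them with a reward score. The function name `best_of_majority` and the exact tie-breaking rule are illustrative assumptions, not the paper's specification.

```python
from collections import Counter

def best_of_majority(candidates, rewards):
    """Hedged sketch of a Best-of-Majority selection rule.

    candidates: list of N answer strings sampled from a model.
    rewards: list of N scores (e.g. from a reward model), aligned
             with candidates.
    Returns the answer that is among the most frequent candidates,
    preferring the one whose sample received the highest reward.
    """
    counts = Counter(candidates)
    top_count = max(counts.values())
    # Keep only answers tied for the highest vote count (the "majority" set).
    majority = {ans for ans, c in counts.items() if c == top_count}
    # Among samples whose answer is in the majority set, pick the
    # index with the best reward ("best of" the majority).
    best_idx = max(
        (i for i, ans in enumerate(candidates) if ans in majority),
        key=lambda i: rewards[i],
    )
    return candidates[best_idx]

answers = ["42", "42", "41", "42", "40", "41"]
scores = [0.2, 0.9, 0.95, 0.5, 0.1, 0.3]
# "42" has 3 votes (the majority), so it is selected even though a
# "41" sample carries the single highest reward score.
print(best_of_majority(answers, scores))
```

Plain majority voting would ignore `scores` entirely, while Best-of-N would return "41" here (highest single reward); restricting the reward-based choice to the majority set is what distinguishes this hybrid rule in the sketch above.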