The field of large language models (LLMs) is advancing rapidly, with a strong focus on improving reasoning and mathematical capabilities. Recent research highlights how auxiliary information shapes LLM reasoning, and the need for models to critically evaluate the information their reasoning is based on. Another area of focus is the development of new training methods, such as diffusion-based approaches and reinforcement learning, which have shown promise on mathematical and logical tasks. Additionally, there is growing interest in having LLMs learn from pre-training data and develop more generalizable reasoning skills.

Noteworthy papers in this area include:

- Thinking in a Crowd: How Auxiliary Information Shapes LLM Reasoning, which introduces the SciAux dataset to test the robustness of LLMs against misleading information.
- DSFT: Inspiring Diffusion Large Language Models to Comprehend Mathematical and Logical Patterns, which proposes a simple yet effective diffusion strategy to guide models in understanding mathematical and logical patterns.
- Reinforcement Learning on Pre-Training Data, which introduces a training-time scaling paradigm for optimizing LLMs with reinforcement learning directly on pre-training data.
- VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models, which proposes a curriculum reinforcement learning framework that dynamically controls the difficulty of training samples based on the variance of group rewards.
- Future Policy Aware Preference Learning for Mathematical Reasoning, which proposes a preference learning method that enables safer training by preemptively regularizing potentially problematic gradients.
- Thinking Augmented Pre-training, which introduces a simple and scalable approach to improving the data efficiency of LLM training by augmenting existing text with thinking trajectories.
- Language Models that Think, Chat Better, which introduces a reinforcement learning paradigm that requires LMs to generate long chains of thought before responding, and optimizes them with online RL against a preference-based reward model.
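The variance-based selection idea behind VCRL can be illustrated with a minimal sketch. The setup assumes GRPO-style training, where each prompt gets a group of sampled rollouts and each rollout a scalar reward; the variance band (`low`, `high`) and the function name are illustrative assumptions, not the paper's exact recipe:

```python
import statistics

def select_curriculum(batch, low=0.05, high=0.45):
    """Variance-based curriculum filter (a sketch of the VCRL idea).

    `batch` maps each prompt to the list of scalar rewards earned by a
    group of sampled rollouts. Prompts with zero group-reward variance
    are either trivially easy (all rollouts succeed) or currently
    hopeless (all fail); prompts with variance inside [low, high] sit at
    a learnable difficulty and are kept for the next training step.
    """
    kept = []
    for prompt, rewards in batch.items():
        var = statistics.pvariance(rewards)
        if low <= var <= high:
            kept.append((prompt, var))
    # Most informative (highest-variance) prompts first.
    kept.sort(key=lambda pv: pv[1], reverse=True)
    return [prompt for prompt, _ in kept]

batch = {
    "p_easy":  [1.0, 1.0, 1.0, 1.0],  # always solved  -> variance 0, dropped
    "p_hard":  [0.0, 0.0, 0.0, 0.0],  # never solved   -> variance 0, dropped
    "p_mixed": [1.0, 0.0, 1.0, 0.0],  # sometimes solved -> variance 0.25, kept
}
print(select_curriculum(batch))  # -> ['p_mixed']
```

With binary rewards, group-reward variance peaks when roughly half the rollouts succeed, so this filter naturally concentrates training on problems the model can sometimes, but not reliably, solve.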
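The thinking-trajectory augmentation can likewise be sketched in a few lines; the `think` callback (e.g. a teacher model producing a reasoning trace) and the `<think>` delimiter are hypothetical stand-ins, not the paper's actual format:

```python
def augment_with_thinking(docs, think):
    """Attach a thinking trajectory to each pre-training document.

    `think(doc)` is an assumed hook returning a reasoning trace for the
    document (for example, generated by a teacher LLM); the `<think>`
    tags are an illustrative delimiter for where the trace is spliced in.
    """
    return [f"{doc}\n<think>{think(doc)}</think>" for doc in docs]

# Toy usage with a stub "teacher" that just restates the document.
docs = ["2 + 2 = 4"]
augmented = augment_with_thinking(docs, lambda d: f"verify that {d}")
print(augmented[0])
# -> 2 + 2 = 4
#    <think>verify that 2 + 2 = 4</think>
```

The point of the approach is that the augmented corpus carries explicit reasoning alongside raw text, so a model pre-trained on it sees more learning signal per token than one trained on the text alone.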