The field of large language models (LLMs) is evolving rapidly, with a focus on improving their ability to reason and to interact with external tools. Recent developments have highlighted the potential of LLMs to leverage tools to enhance their problem-solving capabilities, but have also raised concerns about the reliability and trustworthiness of their outputs. Researchers are addressing these challenges with new approaches, including frameworks that let LLMs select the most reliable and easiest-to-troubleshoot solution paths, and datasets that support the evaluation of LLMs' tool-based reasoning abilities.

Notable papers in this area include:

- From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models, which introduces the concept of Tool-Induced Myopia (TIM) and proposes a framework to realign tool-augmented language models (TaLMs) to use tools as assistive evidence.
- Conformal Constrained Policy Optimization for Cost-Effective LLM Agents, which presents a novel strategy for combining multiple LLMs with varying cost/accuracy tradeoffs to minimize cost subject to a user-specified level of reliability (see the cascade sketch after this list).
- InData: Towards Secure Multi-Step, Tool-Based Data Analysis, which proposes a security-motivated alternative that restricts LLMs from direct code generation and data access, requiring them to interact with data exclusively through a predefined set of secure, verified tools (see the dispatcher sketch after this list).
- Genomic Next-Token Predictors are In-Context Learners, which provides evidence of organically emergent in-context learning in genomic sequences, supporting the hypothesis that in-context learning arises as a consequence of large-scale predictive modeling over rich data.
- Cost-Driven Synthesis of Sound Abstract Interpreters, which investigates whether modern LLMs can synthesize sound, non-trivial abstract interpreters, reducing the burden of constructing analyzers with global soundness guarantees (an interval-domain example follows this list).
- Weight-sparse transformers have interpretable circuits, which trains models to have more understandable circuits by constraining most of their weights to be zero, so that each neuron has only a few connections.
- nnterp: A Standardized Interface for Mechanistic Interpretability of Transformers, which develops a lightweight wrapper around NNsight that provides a unified interface for transformer analysis while preserving the original HuggingFace implementations.
- SkillGen: Learning Domain Skills for In-Context Sequential Decision Making, which introduces a skill-based in-context learning framework for structured sequential reasoning.
- It's LIT! Reliability-Optimized LLMs with Inspectable Tools, which presents a framework built on the tool-calling capabilities of existing LLMs that enables them to select the most reliable and easiest-to-troubleshoot solution path.
- ToolMind Technical Report: A Large-Scale, Reasoning-Enhanced Tool-Use Dataset, which introduces a large-scale, high-quality tool-agentic dataset comprising 160k synthetic instances generated with over 20k tools, plus 200k augmented open-source instances.
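To make the cost/reliability tradeoff behind Conformal Constrained Policy Optimization concrete, here is a minimal Python sketch of a two-model cascade: a cheap model answers whenever its confidence clears a threshold calibrated on held-out data so that kept answers meet a user-specified accuracy target, and everything else escalates to the expensive model. The model callables, confidence scores, and calibration data are hypothetical placeholders, not the paper's actual method.

```python
def calibrate_threshold(scores, correct, target_reliability):
    """Find the lowest confidence threshold such that, on held-out
    calibration data, the answers the cheap model keeps (confidence
    >= threshold) are right at least `target_reliability` of the time."""
    ranked = sorted(zip(scores, correct), reverse=True)  # most confident first
    best = float("inf")  # inf => escalate everything if no threshold is safe
    right = total = 0
    for score, ok in ranked:
        total += 1
        right += ok
        if right / total >= target_reliability:
            best = score  # keeping all answers down to this score meets the target
    return best

def cascade(query, cheap_model, expensive_model, threshold):
    """Answer with the cheap model when it is confident enough,
    otherwise pay for the expensive model."""
    answer, confidence = cheap_model(query)  # hypothetical: returns (answer, confidence)
    if confidence >= threshold:
        return answer, "cheap"
    return expensive_model(query), "expensive"

# Calibration on held-out data (scores and correctness are illustrative):
threshold = calibrate_threshold(
    scores=[0.95, 0.90, 0.80, 0.60, 0.40],
    correct=[True, True, True, False, False],
    target_reliability=0.9,
)  # -> 0.80: keeping the top three answers still gives 100% accuracy
```

The threshold is fit once on calibration data; at inference time routing is a single comparison, so the reliability constraint adds no per-query overhead beyond the cheap model's confidence score.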
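InData's idea of mediating all data access through verified tools can be illustrated with a small dispatcher: the model may only emit calls into a fixed registry of audited functions, and any other request, including arbitrary code execution, is rejected. The registry and tool names below are illustrative assumptions, not InData's actual interface.

```python
from typing import Any, Callable

# Registry of pre-verified tools; the model cannot run code outside it.
SECURE_TOOLS: dict[str, Callable[..., Any]] = {}

def secure_tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register an audited function as a callable tool."""
    SECURE_TOOLS[fn.__name__] = fn
    return fn

@secure_tool
def count_rows(table: list[dict]) -> int:
    return len(table)

@secure_tool
def select_column(table: list[dict], column: str) -> list:
    return [row[column] for row in table]

def dispatch(tool_name: str, **kwargs) -> Any:
    """Execute a model-requested tool call, refusing anything
    outside the verified registry."""
    if tool_name not in SECURE_TOOLS:
        raise PermissionError(f"tool {tool_name!r} is not a verified tool")
    return SECURE_TOOLS[tool_name](**kwargs)

# A model asking to run arbitrary code is refused:
# dispatch("exec_python", code="...")  -> PermissionError
table = [{"city": "Oslo"}, {"city": "Lima"}]
assert dispatch("count_rows", table=table) == 2
```

The security argument rests on the registry being closed: every tool is audited before registration, so the model's action space is bounded by construction rather than by output filtering.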
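For readers unfamiliar with the target artifact in Cost-Driven Synthesis of Sound Abstract Interpreters, the sketch below shows a textbook instance of what is being synthesized: an interval abstract domain whose transfer function for addition over-approximates every concrete behavior, which is what "sound" means here. This is a standard example for illustration, not code from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    """Abstract value: all numbers in [lo, hi]."""
    lo: float
    hi: float

def add(a: Interval, b: Interval) -> Interval:
    """Abstract transfer function for '+'. Sound: for any x in a and
    any y in b, x + y is contained in the result."""
    return Interval(a.lo + b.lo, a.hi + b.hi)

# Soundness check on one concrete pair: 3 is in [1,5], -2 is in [-4,0],
# and 3 + (-2) = 1 lies in [1,5] + [-4,0] = [-3,5].
r = add(Interval(1, 5), Interval(-4, 0))
assert r.lo <= 3 + (-2) <= r.hi
```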