The field of large language models (LLMs) is advancing rapidly, with growing emphasis on their ability to understand and reason about complex scientific concepts. Recent work has introduced new benchmarks and evaluation methods, including live benchmarks designed to evolve continuously alongside both scientific advancement and model progress. These benchmarks probe the limits of LLMs across domains such as condensed matter physics, biology, and finance. Noteworthy papers include MAC, which introduces a live benchmark for scientific understanding in multimodal large language models, and OwkinZero, which develops specialized models that substantially outperform larger, state-of-the-art commercial LLMs on biological benchmarks. XFinBench benchmarks LLMs on complex financial problem solving and reasoning, while Pandora proposes a novel framework for unified structured knowledge reasoning. Together, these advances stand to accelerate AI-driven biological discovery and to improve the accuracy of LLMs across a range of scientific domains.