The field of software engineering is seeing rapid change from the integration of large language models (LLMs). Recent work signals a shift toward evaluating LLMs' practical capabilities in real-world scenarios, such as bootstrapping development environments, optimizing code performance, and generating code that complies with specific library versions. Researchers are introducing new benchmarks and evaluation frameworks for these tasks, and the results highlight substantial gaps between current models and expert-level performance. Noteworthy papers include SetupBench, a rigorous benchmark for evaluating LLMs' ability to bootstrap development environments, and SWE-Perf, which systematically evaluates LLMs on code performance optimization tasks in authentic repository contexts. Overall, the field is moving toward more realistic and comprehensive evaluations of LLMs in software engineering, paving the way for more dependable and adaptable AI-powered tools.