Advancements in Software Engineering with Large Language Models

Software engineering research is advancing rapidly through the integration of large language models (LLMs). Recent work shifts the focus toward evaluating LLMs' practical capabilities in realistic scenarios, such as bootstrapping development environments, optimizing code performance, and generating code that is compatible with specific library versions. Researchers are introducing new benchmarks and evaluation frameworks for these settings, and the results consistently reveal substantial gaps between current models and expert-level performance. Noteworthy papers include SetupBench, which rigorously benchmarks LLMs' ability to bootstrap development environments, and SWE-Perf, which systematically evaluates LLMs on code performance optimization within authentic repository contexts. Overall, the field is moving toward more realistic and comprehensive evaluations of LLMs in software engineering, paving the way for more dependable and adaptable AI-powered tools.
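To make the evaluation style concrete, the sketch below shows one way a version-pinned check could work, in the spirit of benchmarks like GitChameleon that test whether generated code runs against a specific library version. It is a minimal illustration only, not the actual harness of any of the cited papers; the task structure, field names, and helper function are hypothetical assumptions.

```python
# Illustrative sketch: run model-generated code plus a unit test, but only
# after confirming the environment exposes the library versions the task pins.
# The task/candidate structure here is hypothetical, not a benchmark's real API.
import importlib.metadata
import subprocess
import sys
import tempfile
import textwrap


def run_candidate(candidate_code: str, test_code: str, pinned: dict[str, str]) -> bool:
    """Return True if the candidate passes its test under the pinned versions."""
    # Verify the installed library versions match the versions the task requires.
    for package, version in pinned.items():
        try:
            installed = importlib.metadata.version(package)
        except importlib.metadata.PackageNotFoundError:
            return False
        if installed != version:
            return False  # wrong environment; the result would be meaningless

    # Execute candidate + test in a fresh interpreter so failures stay isolated.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
        handle.write(textwrap.dedent(candidate_code) + "\n" + textwrap.dedent(test_code))
        path = handle.name
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=60)
    return result.returncode == 0


# Hypothetical task: the benchmark pins numpy 1.24.4 and supplies a small test.
task = {
    "pinned": {"numpy": "1.24.4"},
    "test": "import numpy as np\nassert np.array([1, 2, 3]).sum() == 6",
}
candidate = "import numpy as np\nvalue = np.array([1, 2, 3]).sum()"
print(run_candidate(candidate, task["test"], task["pinned"]))
```

Benchmarks of this kind typically isolate execution per task (for example, in a container) so that a model's output is judged against the exact dependency set the repository expects rather than whatever happens to be installed locally.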

Sources

SetupBench: Assessing Software Engineering Agents' Ability to Bootstrap Development Environments

SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation

Enhancing NeuroEvolution-Based Game Testing: A Branch Coverage Approach for Scratch Programs

CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance

SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks

GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities

SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?
