Advances in Code Generation and Editing

The field of code generation and editing is evolving rapidly, with a focus on improving the quality and diversity of training data. Researchers are exploring new ways to build high-quality datasets, such as leveraging open-source language models and software engineering agents, to address the limitations of traditional commit-based datasets, which are often noisy and lack diversity.

Several notable papers illustrate these directions. AgentPack presents a large corpus of code edits co-authored by humans and agents. Bridging Developer Instructions and Code Completion introduces an instruction-aware fill-in-the-middle paradigm that conditions code completion models on developer intent (a sketch of the prompt format appears below). Generating High-Quality Datasets for Code Editing via Open-Source Language Models contributes an open-source pipeline for synthesizing realistic code-edit triplets. A Multi-Language Object-Oriented Programming Benchmark for Large Language Models proposes a benchmark for evaluating intelligent code generation across six popular programming languages. Finally, CodeChemist introduces functional knowledge transfer for low-resource code generation via test-time scaling, and Beyond Single LLMs proposes multi-stage, performance-guided orchestration of multiple LLMs for code generation.
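
To make the instruction-aware fill-in-the-middle idea concrete, here is a minimal sketch of how such a prompt might be assembled. The sentinel tokens (<INSTR>, <PRE>, <SUF>, <MID>) are illustrative placeholders modeled on common FIM formats, not the paper's actual vocabulary:

    def build_instruction_fim_prompt(instruction: str, prefix: str, suffix: str) -> str:
        # Classic FIM presents the code before (prefix) and after (suffix)
        # the gap, and the model generates the middle. An instruction-aware
        # variant additionally conditions on the developer's instruction.
        return (
            f"<INSTR>{instruction}</INSTR>"
            f"<PRE>{prefix}"
            f"<SUF>{suffix}"
            f"<MID>"  # the model completes the span between prefix and suffix
        )

    prompt = build_instruction_fim_prompt(
        instruction="Validate the input before dividing.",
        prefix="def safe_div(a, b):\n",
        suffix="    return a / b\n",
    )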
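
The code-edit triplets produced by such synthesis pipelines pair a natural-language instruction with the code before and after the edit. A minimal sketch of one record, with hypothetical field names, might look like this:

    from dataclasses import dataclass

    @dataclass
    class CodeEditTriplet:
        instruction: str    # natural-language description of the desired change
        original_code: str  # code before the edit
        edited_code: str    # code after applying the instruction

    example = CodeEditTriplet(
        instruction="Rename the variable `tmp` to `total` for clarity.",
        original_code="def add(xs):\n    tmp = 0\n    for x in xs:\n        tmp += x\n    return tmp\n",
        edited_code="def add(xs):\n    total = 0\n    for x in xs:\n        total += x\n    return total\n",
    )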
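
Performance-guided orchestration can be pictured as routing each stage of a generation pipeline to whichever model has scored best on that stage. The stage names, scores, and call_llm helper below are all hypothetical stand-ins, not the paper's actual design:

    # Hypothetical per-stage scores, e.g. measured on a validation set.
    STAGE_SCORES = {
        "plan":     {"model_a": 0.71, "model_b": 0.64},
        "generate": {"model_a": 0.58, "model_b": 0.66},
        "repair":   {"model_a": 0.49, "model_b": 0.62},
    }

    def pick_model(stage: str) -> str:
        # Route each stage to the model with the best recorded score.
        scores = STAGE_SCORES[stage]
        return max(scores, key=scores.get)

    def orchestrate(task: str, call_llm) -> str:
        # call_llm(model, stage, payload) is a stand-in for an actual API call.
        plan = call_llm(pick_model("plan"), "plan", task)
        code = call_llm(pick_model("generate"), "generate", plan)
        return call_llm(pick_model("repair"), "repair", code)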

Sources

AgentPack: A Dataset of Code Changes, Co-Authored by Agents and Humans

Bridging Developer Instructions and Code Completion Through Instruction-Aware Fill-in-the-Middle Paradigm

Generating High-Quality Datasets for Code Editing via Open-Source Language Models

A Multi-Language Object-Oriented Programming Benchmark for Large Language Models

CodeChemist: Functional Knowledge Transfer for Low-Resource Code Generation via Test-Time Scaling

Beyond Single LLMs: Enhanced Code Generation via Multi-Stage Performance-Guided LLM Orchestration
