Advances in Large Language Models for Code Intelligence

The field of code intelligence is advancing rapidly alongside the development of large language models (LLMs). Recent research focuses on improving LLM performance on code-related tasks such as code generation, code repair, and compliance checking. A key challenge is evaluating that performance comprehensively and reliably. To address it, several benchmarks have been proposed, including CodeAlignBench, GDPR-Bench-Android, and CompliBench, which respectively assess instruction following, GDPR compliance detection, and the detection of real-world compliance violations; one common way such benchmarks score generated code is sketched below.
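
As a concrete illustration of that scoring style, the sketch below implements a minimal pass@1 harness: generated code is executed against hidden unit tests in a subprocess, and a task counts as solved only if every test passes. The task format, the `generate_code` placeholder, and the sandboxing choices here are illustrative assumptions, not the actual interface of any benchmark named above.

```python
import subprocess
import sys
import tempfile

# Hypothetical task format: each task pairs a prompt with hidden unit
# tests. This mirrors common execution-based scoring, but it is NOT the
# actual CodeAlignBench or CompliBench data format.
TASKS = [
    {
        "prompt": "Write a function add(a, b) that returns a + b.",
        "tests": "assert add(2, 3) == 5\nassert add(-1, 1) == 0",
    },
]

def generate_code(prompt: str) -> str:
    """Placeholder for a call to the LLM under evaluation."""
    return "def add(a, b):\n    return a + b"

def passes_tests(code: str, tests: str) -> bool:
    """Run candidate code plus its tests in a subprocess so a crash or
    hang in generated code cannot take down the harness."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=10)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# pass@1: fraction of tasks solved by a single sampled completion.
solved = sum(passes_tests(generate_code(t["prompt"]), t["tests"])
             for t in TASKS)
print(f"pass@1 = {solved / len(TASKS):.2f}")
```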

Notable papers include CodeAlignBench, which introduces a multi-language benchmark for evaluating LLM instruction-following capabilities; CompliBench, which proposes an evaluation framework for assessing LLMs' ability to detect compliance violations; and EDIT-Bench, a benchmark for evaluating LLM code editing capabilities grounded in real-world usage (a sketch of this edit-scoring setting follows).
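
To make the edit-scoring setting concrete, the sketch below evaluates one instructed code edit in the spirit of EDIT-Bench: the model rewrites a snippet according to a natural-language instruction, and the edit counts as correct only if hidden assertions pass. The instance format and the `edit_code` placeholder are assumptions for illustration, not EDIT-Bench's actual interface.

```python
# A hypothetical edit instance: original code, an instruction, and
# hidden tests that encode the intended behavior after the edit.
INSTANCE = {
    "original": "def greet(name):\n    return 'Hello ' + name",
    "instruction": "Make greet return the greeting in uppercase.",
    "tests": "assert greet('ada') == 'HELLO ADA'",
}

def edit_code(original: str, instruction: str) -> str:
    """Placeholder for the LLM edit call under evaluation."""
    return "def greet(name):\n    return ('Hello ' + name).upper()"

def edit_passes(instance: dict) -> bool:
    """Execute the edited code together with the hidden tests; the
    edit counts as correct only if every assertion holds."""
    edited = edit_code(instance["original"], instance["instruction"])
    namespace: dict = {}
    try:
        exec(edited, namespace)             # define the edited function
        exec(instance["tests"], namespace)  # run the hidden tests
        return True
    except Exception:
        return False

print("edit correct:", edit_passes(INSTANCE))
```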

Overall, the field of code intelligence is moving towards more comprehensive and reliable evaluation of LLMs, with a focus on real-world applications and tasks; new benchmarks and evaluation frameworks are expected to drive further advances.

Sources

CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

DocPrism: Local Categorization and External Filtering to Identify Relevant Code-Documentation Inconsistencies

MH-1M: A 1.34 Million-Sample Comprehensive Multi-Feature Android Malware Dataset for Machine Learning, Deep Learning, Large Language Models, and Threat Intelligence Research

GDPR-Bench-Android: A Benchmark for Evaluating Automated GDPR Compliance Detection in Android

Can Large Language Models Detect Real-World Android Software Compliance Violations?

A Systematic Literature Review of Code Hallucinations in LLMs: Characterization, Mitigation Methods, Challenges, and Future Directions for Reliable AI

DPO-F+: Aligning Code Repair Feedback with Developers' Preferences

An Empirical Study of LLM-Based Code Clone Detection

Hidden in Plain Sight: Where Developers Confess Self-Admitted Technical Debt

Open the Oyster: Empirical Evaluation and Improvement of Code Reasoning Confidence in LLMs

Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks

EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits
