arXiv NEWS (2024-06-19) #142

github-actions · 2024-06-19T09:05:42Z

cs.SE updates on arXiv.org, Wed, 19 Jun 2024 04:00:36 +0000

miniCodeProps: a Minimal Benchmark for Proving Code Properties [pdf]

Authors: Evan Lohn, Sean Welleck

Tags: LLM

Neural networks have shown initial promise in automating mathematical theorem proving in proof assistants such as Lean. The same proof assistants can be used to verify the correctness of code by pairing code with specifications and proofs that the specifications hold. Automating the writing of code, specifications, and proofs could lower the cost of verification, or, ambitiously, enable a machine learning system to output provably correct code. However, it remains unclear whether current neural theorem provers can automatically verify even relatively simple programs. We present miniCodeProps, a benchmark of 177 program specifications in the Lean proof assistant, aimed at the subproblem of automatically generating a proof for a provided program and specification. miniCodeProps contains specifications about simple, self-contained programs (e.g., lists, natural numbers, binary trees) with varied proof difficulty. Despite its simplicity, miniCodeProps is challenging for current LLM-based provers, which succeed in proving about 25 percent of the specifications. We publicly release miniCodeProps as a benchmark for furthering automated theorem proving in the context of formally verified code.

DocCGen: Document-based Controlled Code Generation [pdf]

Authors: Sameer Pimparkhede, Mehant Kammakomati, Srikanth G. Tamilselvam, Prince Kumar, Ashok Pon Kumar, Pushpak Bhattacharyya

Tags: LLM

Recent developments show that Large Language Models (LLMs) produce state-of-the-art performance on natural language (NL) to code generation for resource-rich general-purpose languages like C++, Java, and Python. However, their practical usage for structured domain-specific languages (DSLs) such as YAML, JSON is limited due to domain-specific schema, grammar, and customizations generally unseen by LLMs during pre-training. Efforts have been made to mitigate this challenge via in-context learning through relevant examples or by fine-tuning. However, it suffers from problems, such as limited DSL samples and prompt sensitivity but enterprises maintain good documentation of the DSLs. Therefore, we propose DocCGen, a framework that can leverage such rich knowledge by breaking the NL-to-Code generation task for structured code languages into a two-step process. First, it detects the correct libraries using the library documentation that best matches the NL query. Then, it utilizes schema rules extracted from the documentation of these libraries to constrain the decoding. We evaluate our framework for two complex structured languages, Ansible YAML and Bash command, consisting of two settings: Out-of-domain (OOD) and In-domain (ID). Our extensive experiments show that DocCGen consistently improves different-sized language models across all six evaluation metrics, reducing syntactic and semantic errors in structured code. We plan to open-source the datasets and code to motivate research in constrained code generation.

REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark [pdf]

Authors: Nam Le Hai, Dung Manh Nguyen, Nghi D. Q. Bui

Tags: LLM

The ability of CodeLLMs to generate executable and functionally correct code at the \textit{repository-level scale }remains largely unexplored. We introduce \methodnamews, a novel benchmark for evaluating code generation at the repository-level scale, emphasizing executability and correctness. \methodnamews provides an automated system that verifies requirements and incorporates a mechanism for dynamically generating high-coverage test cases to assess the functionality of generated code. Our work explores a controlled scenario where developers specify necessary code dependencies, challenging the model to integrate these accurately. Experiments show that while pretrained LLMs outperform instruction-tuning models in correctness, the latter excel in utilizing provided dependencies and demonstrating debugging capabilities. \methodnamews aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios.

A Critical Study of What Code-LLMs (Do Not) Learn [pdf]

Authors: Abhinav Anand, Shweta Verma, Krishna Narasimhan, Mira Mezini

Tags: LLM

Large Language Models trained on code corpora (code-LLMs) have demonstrated impressive performance in various coding assistance tasks. However, despite their increased size and training dataset, code-LLMs still have limitations such as suggesting codes with syntactic errors, variable misuse etc. Some studies argue that code-LLMs perform well on coding tasks because they use self-attention and hidden representations to encode relations among input tokens. However, previous works have not studied what code properties are not encoded by code-LLMs. In this paper, we conduct a fine-grained analysis of attention maps and hidden representations of code-LLMs. Our study indicates that code-LLMs only encode relations among specific subsets of input tokens. Specifically, by categorizing input tokens into syntactic tokens and identifiers, we found that models encode relations among syntactic tokens and among identifiers, but they fail to encode relations between syntactic tokens and identifiers. We also found that fine-tuned models encode these relations poorly compared to their pre-trained counterparts. Additionally, larger models with billions of parameters encode significantly less information about code than models with only a few hundred million parameters.

Identifying Performance-Sensitive Configurations in Software Systems through Code Analysis with LLM Agents [pdf]

Authors: Zehao Wang, Dong Jae Kim, Tse-Hsun Chen

Tags: LLM

Configuration settings are essential for tailoring software behavior to meet specific performance requirements. However, incorrect configurations are widespread, and identifying those that impact system performance is challenging due to the vast number and complexity of possible settings. In this work, we present PerfSense, a lightweight framework that leverages Large Language Models (LLMs) to efficiently identify performance-sensitive configurations with minimal overhead. PerfSense employs LLM agents to simulate interactions between developers and performance engineers using advanced prompting techniques such as prompt chaining and retrieval-augmented generation (RAG). Our evaluation of seven open-source Java systems demonstrates that PerfSense achieves an average accuracy of 64.77% in classifying performance-sensitive configurations, outperforming both our LLM baseline (50.36%) and the previous state-of-the-art method (61.75%). Notably, our prompt chaining technique improves recall by 10% to 30% while maintaining similar precision levels. Additionally, a manual analysis of 362 misclassifications reveals common issues, including LLMs' misunderstandings of requirements (26.8%). In summary, PerfSense significantly reduces manual effort in classifying performance-sensitive configurations and offers valuable insights for future LLM-based code analysis research.

Iterative or Innovative? A Problem-Oriented Perspective for Code Optimization [pdf]

Authors: Tong Ye, Tengfei Ma, Lingfei Wu, Xuhong Zhang, Shouling Ji, Wenhai Wang

Tags: LLM

Large language models (LLMs) have demonstrated strong capabilities in solving a wide range of programming tasks. However, LLMs have rarely been explored for code optimization. In this paper, we explore code optimization with a focus on performance enhancement, specifically aiming to optimize code for minimal execution time. The recently proposed first PIE dataset for performance optimization constructs program optimization pairs based on iterative submissions from the same programmer for the same problem. However, this approach restricts LLMs to local performance improvements, neglecting global algorithmic innovation. Therefore, we adopt a completely different perspective by reconstructing the optimization pairs into a problem-oriented approach. This allows for the integration of various ingenious ideas from different programmers tackling the same problem. Experimental results demonstrate that adapting LLMs to problem-oriented optimization pairs significantly enhances their optimization capabilities. Meanwhile, we identified performance bottlenecks within the problem-oriented perspective. By employing model merge, we further overcame bottlenecks and ultimately elevated the program optimization ratio ($51.76%\rightarrow76.65%$) and speedup ($2.65\times\rightarrow5.09\times$) to new levels.

Should AI Optimize Your Code? A Comparative Study of Current Large Language Models Versus Classical Optimizing Compilers [pdf]

Authors: Miguel Romero Rosas, Miguel Torres Sanchez, Rudolf Eigenmann

Tags: LLM

In the contemporary landscape of computer architecture, the demand for efficient parallel programming persists, needing robust optimization techniques. Traditional optimizing compilers have historically been pivotal in this endeavor, adapting to the evolving complexities of modern software systems. The emergence of Large Language Models (LLMs) raises intriguing questions about the potential for AI-driven approaches to revolutionize code optimization methodologies.
This paper presents a comparative analysis between two state-of-the-art Large Language Models, GPT-4.0 and CodeLlama-70B, and traditional optimizing compilers, assessing their respective abilities and limitations in optimizing code for maximum efficiency. Additionally, we introduce a benchmark suite of challenging optimization patterns and an automatic mechanism for evaluating performance and correctness of the code generated by such tools. We used two different prompting methodologies to assess the performance of the LLMs -- Chain of Thought (CoT) and Instruction Prompting (IP). We then compared these results with three traditional optimizing compilers, CETUS, PLUTO and ROSE, across a range of real-world use cases.
A key finding is that while LLMs have the potential to outperform current optimizing compilers, they often generate incorrect code on large code sizes, calling for automated verification methods. Our extensive evaluation across 3 different benchmarks suites shows CodeLlama-70B as the superior optimizer among the two LLMs, capable of achieving speedups of up to 2.1x. Additionally, CETUS is the best among the optimizing compilers, achieving a maximum speedup of 1.9x. We also found no significant difference between the two prompting methods: Chain of Thought (Cot) and Instructing prompting (IP).

CodeNav: Beyond tool-use to using real-world codebases with LLM agents [pdf]

Authors: Tanmay Gupta, Luca Weihs, Aniruddha Kembhavi

Tags: LLM

We present CodeNav, an LLM agent that navigates and leverages previously unseen code repositories to solve user queries. In contrast to tool-use LLM agents that require ``registration'' of all relevant tools via manual descriptions within the LLM context, CodeNav automatically indexes and searches over code blocks in the target codebase, finds relevant code snippets, imports them, and uses them to iteratively generate a solution with execution feedback. To highlight the core-capabilities of CodeNav, we first showcase three case studies where we use CodeNav for solving complex user queries using three diverse codebases. Next, on three benchmarks, we quantitatively compare the effectiveness of code-use (which only has access to the target codebase) to tool-use (which has privileged access to all tool names and descriptions). Finally, we study the effect of varying kinds of tool and library descriptions on code-use performance, as well as investigate the advantage of the agent seeing source code as opposed to natural descriptions of code. All code will be made open source under a permissive license.

What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering [pdf]

Authors: Federico Errica, Giuseppe Siracusano, Davide Sanvito, Roberto Bifulco

Tags: LLM

Large Language Models (LLMs) changed the way we design and interact with software systems. Their ability to process and extract information from text has drastically improved productivity in a number of routine tasks. Developers that want to include these models in their software stack, however, face a dreadful challenge: debugging their inconsistent behavior across minor variations of the prompt. We therefore introduce two metrics for classification tasks, namely sensitivity and consistency, which are complementary to task performance. First, sensitivity measures changes of predictions across rephrasings of the prompt, and does not require access to ground truth labels. Instead, consistency measures how predictions vary across rephrasings for elements of the same class. We perform an empirical comparison of these metrics on text classification tasks, using them as guideline for understanding failure modes of the LLM. Our hope is that sensitivity and consistency will be powerful allies in automatic prompt engineering frameworks to obtain LLMs that balance robustness with performance.

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review [pdf]

Authors: Debalina Ghosh Paul, Hong Zhu, Ian Bayley

Tags: LLM

With the rapid development of Large Language Models (LLMs), a large number of machine learning models have been developed to assist programming tasks including the generation of program code from natural language input. However, how to evaluate such LLMs for this task is still an open problem despite of the great amount of research efforts that have been made and reported to evaluate and compare them. This paper provides a critical review of the existing work on the testing and evaluation of these tools with a focus on two key aspects: the benchmarks and the metrics used in the evaluations. Based on the review, further research directions are discussed.

Nova: Generative Language Models for Assembly Code with Hierarchical Attention and Contrastive Learning [pdf]

Authors: Nan Jiang, Chengxiao Wang, Kevin Liu, Xiangzhe Xu, Lin Tan, Xiangyu Zhang

Tags: LLM

Binary code analysis is the foundation of crucial tasks in the security domain; thus building effective binary analysis techniques is more important than ever. Large language models (LLMs) although have brought impressive improvement to source code tasks, do not directly generalize to assembly code due to the unique challenges of assembly: (1) the low information density of assembly and (2) the diverse optimizations in assembly code. To overcome these challenges, this work proposes a hierarchical attention mechanism that builds attention summaries to capture the semantics more effectively, and designs contrastive learning objectives to train LLMs to learn assembly optimization. Equipped with these techniques, this work develops Nova, a generative LLM for assembly code. Nova outperforms existing techniques on binary code decompilation by up to 146.54%, and outperforms the latest binary code similarity detection techniques by up to 6.17%, showing promising abilities on both assembly generation and understanding tasks.

User Centric Evaluation of Code Generation Tools [pdf]

Authors: Tanha Miah, Hong Zhu

Tags: LLM

With the rapid advance of machine learning (ML) technology, large language models (LLMs) are increasingly explored as an intelligent tool to generate program code from natural language specifications. However, existing evaluations of LLMs have focused on their capabilities in comparison with humans. It is desirable to evaluate their usability when deciding on whether to use a LLM in software production. This paper proposes a user centric method for this purpose. It includes metadata in the test cases of a benchmark to describe their usages, conducts testing in a multi-attempt process that mimics the uses of LLMs, measures LLM generated solutions on a set of quality attributes that reflect usability, and evaluates the performance based on user experiences in the uses of LLMs as a tool.
The paper also reports a case study with the method in the evaluation of ChatGPT's usability as a code generation tool for the R programming language. Our experiments demonstrated that ChatGPT is highly useful for generating R program code although it may fail on hard programming tasks. The user experiences are good with overall average number of attempts being 1.61 and the average time of completion being 47.02 seconds. Our experiments also found that the weakest aspect of usability is conciseness, which has a score of 3.80 out of 5.

From Image to UML: First Results of Image Based UML Diagram Generation Using LLMs [pdf]

Authors: Aaron Conrardy, Jordi Cabot

Tags: LLM

In software engineering processes, systems are first specified using a modeling language such as UML. These initial designs are often collaboratively created, many times in meetings where different domain experts use whiteboards, paper or other types of quick supports to create drawings and blueprints that then will need to be formalized. These proper, machine-readable, models are key to ensure models can be part of automated processes (e.g. input of a low-code generation pipeline, a model-based testing system, ...). But going from hand-drawn diagrams to actual models is a time-consuming process that sometimes ends up with such drawings just added as informal images to the software documentation, reducing their value a lot. To avoid this tedious task, we explore the usage of Large Language Models (LLM) to generate the formal representation of (UML) models from a given drawing. More specifically, we have evaluated the capabilities of different LLMs to convert images of UML class diagrams into the actual models represented in the images. While the results are good enough to use such an approach as part of a model-driven engineering pipeline we also highlight some of their current limitations and the need to keep the human in the loop to overcome those limitations.

Development of a Real-Time Simulator Using EMTP-ATP Foreign models for Testing Relays [pdf]

Authors: Renzo Fabian, Rommel Romero

This paper reports the PC implementation of a real-time simulator for testing protective relays, based on the widely used EMTP-ATP software. The proposed simulator was implemented using the GNU/Linux OS with a real-time kernel. In order to generate the waveforms corresponding to simulated voltages and currents, a PCI card was used. This card also includes digital I/O interface. Via foreign models programmed in standard C, ATP was recompiled to include waveform generation at each simulation time step and digital I/O. Additionally, an IEC-61850 open source library was used, in order to use Sampled Values and GOOSE protocols. The resulting tool is a real-time simulator that can interact with protective relays by means of HiL tests. The performance of the simulator was analyzed via an interaction with an actual relay.

AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology [pdf]

Authors: Minh Huynh Nguyen, Thang Phan Chau, Phong X. Nguyen, Nghi D. Q. Bui

Software agents have emerged as promising tools for addressing complex software engineering tasks. However, existing works oversimplify software development workflows by following the waterfall model. Thus, we propose AgileCoder, a multi-agent system that integrates Agile Methodology (AM) into the framework. This system assigns specific AM roles such as Product Manager, Developer, and Tester to different agents, who then collaboratively develop software based on user inputs. AgileCoder enhances development efficiency by organizing work into sprints, focusing on incrementally developing software through sprints. Additionally, we introduce Dynamic Code Graph Generator, a module that creates a Code Dependency Graph dynamically as updates are made to the codebase. This allows agents to better comprehend the codebase, leading to more precise code generation and modifications throughout the software development process. AgileCoder surpasses existing benchmarks, like ChatDev and MetaGPT, establishing a new standard and showcasing the capabilities of multi-agent systems in advanced software engineering environments. Our source code can be found at https://github.com/FSoft-AI4Code/AgileCoder.

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence [pdf]

Authors: DeepSeek-AI, Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bingxuan Wang, Junxiao Song, Deli Chen, Xin Xie, Kang Guan, Yuxiang You, Aixin Liu, Qiushi Du, Wenjun Gao, Xuan Lu, Qinyu Chen, Yaohui Wang, Chengqi Deng, Jiashi Li, Chenggang Zhao, Chong Ruan, Fuli Luo, Wenfeng Liang

We present DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT4-Turbo in code-specific tasks. Specifically, DeepSeek-Coder-V2 is further pre-trained from an intermediate checkpoint of DeepSeek-V2 with additional 6 trillion tokens. Through this continued pre-training, DeepSeek-Coder-V2 substantially enhances the coding and mathematical reasoning capabilities of DeepSeek-V2, while maintaining comparable performance in general language tasks. Compared to DeepSeek-Coder-33B, DeepSeek-Coder-V2 demonstrates significant advancements in various aspects of code-related tasks, as well as reasoning and general capabilities. Additionally, DeepSeek-Coder-V2 expands its support for programming languages from 86 to 338, while extending the context length from 16K to 128K. In standard benchmark evaluations, DeepSeek-Coder-V2 achieves superior performance compared to closed-source models such as GPT4-Turbo, Claude 3 Opus, and Gemini 1.5 Pro in coding and math benchmarks.

Systematic literature review on forecasting and prediction of technical debt evolution [pdf]

Authors: Adekunle Ajibode, Yvon Apedo, Temitope Ajibode

Context: Technical debt (TD) refers to the additional costs incurred due to compromises in software quality, providing short-term advantages during development but potentially compromising long-term quality. Accurate TD forecasting and prediction are vital for informed software maintenance and proactive management. However, this research area lacks comprehensive documentation on the available forecasting techniques. Objective: This study aims to explore existing knowledge in software engineering to gain insights into approaches proposed in research and industry for forecasting TD evolution. Methods: To achieve this objective, we conducted a Systematic Literature Review encompassing 646 distinct papers published until 2023. Following established methodology in software engineering, we identified and included 14 primary studies for analysis. Result: Our analysis unveiled various approaches for TD evolution forecasting. Notably, random forest and temporal convolutional networks demonstrated superior performance compared to other methods based on the result from the primary studies. However, these approaches only address two of the fifteen identified TD types, specifically Code debt and Architecture debt, while disregarding the remaining types. Conclusion: Our findings indicate that research on TD evolution forecasting is still in its early stages, leaving numerous challenges unaddressed. Therefore, we propose several research directions that require further investigation to bridge the existing gaps. Keywords: Systematic literature review, Technical debt, Technical debt prediction, Technical debt forecasting, Technical debt metrics

CITADEL: Context Similarity Based Deep Learning Framework Bug Finding [pdf]

Authors: Xiaoyu Zhang, Juan Zhai, Shiqing Ma, Shiwei Wang, Chao Shen

With deep learning (DL) technology becoming an integral part of the new intelligent software, tools of DL framework testing and bug-finding are in high demand. Existing DL framework testing tools have limited coverage on bug types. For example, they lack the capability of finding performance bugs, which are critical for DL model training and inference regarding performance, economics, and the environment. This problem is challenging due to the difficulty of getting test oracles of performance bugs. Moreover, existing tools are inefficient, generating hundreds of test cases with few trigger bugs. In this paper, we propose CITADEL, a method that accelerates the finding of bugs in terms of efficiency and effectiveness. We observe that many DL framework bugs are similar due to the similarity of operators and algorithms belonging to the same family (e.g., Conv2D and Conv3D). Orthogonal to existing bug-finding tools, CITADEL aims to find new bugs that are similar to reported ones that have known test oracles. It works by first collecting existing bug reports and identifying problematic APIs. CITADEL defines context similarity to measure the similarity of DL framework API pairs and automatically generates test cases with oracles for APIs that are similar to the problematic APIs in existing bug reports. CITADEL respectively covers 1,436 PyTorch and 5,380 TensorFlow APIs and effectively detects 79 and 80 API bugs, among which 58 and 68 are new, and 36 and 58 have been confirmed, many of which, e.g., the 11 performance bugs cannot be detected by existing tools. Moreover, a remarkable 35.40% of the test cases generated by CITADEL can trigger bugs, which significantly transcends the ratios of 0.74%, 1.23%, and 3.90% exhibited by the state-of-the-art methods, DocTer, DeepREL, and TitanFuzz.

W2E (Workout to Earn): A Low Cost DApp based on ERC-20 and ERC-721 standards [pdf]

Authors: Do Hai Son, Nguyen Danh Hao, Tran Thi Thuy Quynh, Le Quang Minh

Decentralized applications (DApps) have gained prominence with the advent of blockchain technology, particularly Ethereum, providing trust, transparency, and traceability. However, challenges such as rising transaction costs and block confirmation delays hinder their widespread adoption. In this paper, we present our DApp named W2E - Workout to Earn, a mobile DApp incentivizing exercise through tokens and NFT awards. This application leverages the well-known ERC-20 and ERC-721 token standards of Ethereum. Additionally, we deploy W2E into various Ethereum-based networks, including Ethereum testnets, Layer 2 networks, and private networks, to survey gas efficiency and execution time. Our findings highlight the importance of network selection for DApp deployment, offering insights for developers and businesses seeking efficient blockchain solutions. This is because our experimental results are not only specific for W2E but also for other ERC-20 and ERC-721-based DApps.

A Step Towards a Universal Method for Modeling and Implementing Cross-Organizational Business Processes [pdf]

Authors: Gerhard Zeisler, Tim Tobias Braunauer, Albert Fleischmann, Robert Singer

The widely adopted Business Process Model and Notation (BPMN) is a cornerstone of industry standards for business process modeling. However, its ambiguous execution semantics often result in inconsistent interpretations, depending on the software used for implementation. In response, the Process Specification Language (PASS) provides formally defined semantics to overcome these interpretational challenges. Despite its clear advantages, PASS has not reached the same level of industry penetration as BPMN.
This feasibility study proposes using PASS as an intermediary framework to translate and execute BPMN models. It describes the development of a prototype translator that converts specific BPMN elements into a format compatible with PASS. These models are then transformed into source code and executed in a bespoke workflow environment, marking a departure from traditional BPMN implementations.
Our findings suggest that integrating PASS enhances compatibility across different modeling and execution tools and offers a more robust methodology for implementing business processes across organizations. This study lays the groundwork for more accurate and unified business process model executions, potentially transforming industry standards for process modeling and execution.

Toward Exploring the Code Understanding Capabilities of Pre-trained Code Generation Models [pdf]

Authors: Jiayi Lin, Yutao Xie, Yue Yu, Yibiao Yang, Lei Zhang

Recently, large code generation models trained in a self-supervised manner on extensive unlabeled programming language data have achieved remarkable success. While these models acquire vast amounts of code knowledge, they perform poorly on code understanding tasks, such as code search and clone detection, as they are specifically trained for generation. Pre-training a larger encoder-only architecture model from scratch on massive code data can improve understanding performance. However, this approach is costly and time-consuming, making it suboptimal. In this paper, we pioneer the transfer of knowledge from pre-trained code generation models to code understanding tasks, significantly reducing training costs. We examine effective strategies for enabling decoder-only models to acquire robust code representations. Furthermore, we introduce CL4D, a contrastive learning method designed to enhance the representation capabilities of decoder-only models. Comprehensive experiments demonstrate that our approach achieves state-of-the-art performance in understanding tasks such as code search and clone detection. Our analysis shows that our method effectively reduces the distance between semantically identical samples in the representation space. These findings suggest the potential for unifying code understanding and generation tasks using a decoder-only structured model.

Measuring Information Diffusion in Code Review at Spotify [pdf]

Authors: Michael Dorner, Daniel Mendez, Ehsan Zabardast, Nicole Valdez, Marcin Floryan

Background: As a core practice in software engineering, the nature of code review has been frequently subject to research. Prior exploratory studies found that code review, the discussion around a code change among humans, forms a communication network that enables its participants to exchange and spread information. Although popular in software engineering, there is no confirmatory research corroborating this theory and the actual extent of information diffusion in code review is not well understood.
Objective: In this registered report, we propose an observational study to measure information diffusion in code review to test the theory of code review as communication network.
Method: We approximate the information diffusion in code review through the frequency and the similarity between (1) human participants, (2) affected components, and (3) involved teams of linked code reviews. The measurements approximating the information diffusion in code review serve as a foundation for falsifying the theory of code review as communication network.

ScenEval: A Benchmark for Scenario-Based Evaluation of Code Generation [pdf]

Authors: Debalina Ghosh Paul, Hong Zhu, Ian Bayley

In the scenario-based evaluation of machine learning models, a key problem is how to construct test datasets that represent various scenarios. The methodology proposed in this paper is to construct a benchmark and attach metadata to each test case. Then a test system can be constructed with test morphisms that filter the test cases based on metadata to form a dataset.
The paper demonstrates this methodology with large language models for code generation. A benchmark called ScenEval is constructed from problems in textbooks, an online tutorial website and Stack Overflow. Filtering by scenario is demonstrated and the test sets are used to evaluate ChatGPT for Java code generation.
Our experiments found that the performance of ChatGPT decreases with the complexity of the coding task. It is weakest for advanced topics like multi-threading, data structure algorithms and recursive methods. The Java code generated by ChatGPT tends to be much shorter than reference solution in terms of number of lines, while it is more likely to be more complex in both cyclomatic and cognitive complexity metrics, if the generated code is correct. However, the generated code is more likely to be less complex than the reference solution if the code is incorrect.

Runtime Verification on Abstract Finite State Models [pdf]

Authors: KP Jevitha, Bharat Jayaraman, M Sethumadhavan

Finite-state models are ubiquitous in the study of concurrent systems, especially controllers and servers that operate in a repetitive cycle. In this paper, we show how to extract finite state models from a run of a multi-threaded Java program and carry out runtime verification of correctness properties. These properties include data-oriented and control-oriented properties; the former express correctness conditions over the data fields of objects, while the latter are concerned with the correct flow of control among the modules of larger software. As the extracted models can become very large for long runs, the focus of this paper is on constructing reduced models with user-defined abstraction functions that map a larger domain space to a smaller one. The abstraction functions should be chosen so that the resulting model is property preserving, i.e., proving a property on the abstract model carries over to the concrete model. The main contribution of this paper is in showing how runtime verification can be made efficient through online property checking on property-preserving abstract models. The property specification language resembles a propositional linear temporal logic augmented with simple datatypes and operators. Classic concurrency examples and larger case studies (Multi-rotor Drone Controller, OAuth Protocol) are presented in order to demonstrate the usefulness of our proposed techniques, which are incorporated in an Eclipse plug-in for runtime visualization and verification of Java programs.

MegaVul: A C/C++ Vulnerability Dataset with Comprehensive Code Representation [pdf]

Authors: Chao Ni, Liyu Shen, Xiaohu Yang, Yan Zhu, Shaohua Wang

We constructed a newly large-scale and comprehensive C/C++ vulnerability dataset named MegaVul by crawling the Common Vulnerabilities and Exposures (CVE) database and CVE-related open-source projects. Specifically, we collected all crawlable descriptive information of the vulnerabilities from the CVE database and extracted all vulnerability-related code changes from 28 Git-based websites. We adopt advanced tools to ensure the extracted code integrality and enrich the code with four different transformed representations. In total, MegaVul contains 17,380 vulnerabilities collected from 992 open-source repositories spanning 169 different vulnerability types disclosed from January 2006 to October 2023. Thus, MegaVul can be used for a variety of software security-related tasks including detecting vulnerabilities and assessing vulnerability severity. All information is stored in the JSON format for easy usage. MegaVul is publicly available on GitHub and will be continuously updated. It can be easily extended to other programming languages.

The text was updated successfully, but these errors were encountered:

github-actions bot added the news label Jun 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

arXiv NEWS (2024-06-19) #142

arXiv NEWS (2024-06-19) #142

github-actions bot commented Jun 19, 2024

arXiv NEWS (2024-06-19) #142

arXiv NEWS (2024-06-19) #142

Comments

github-actions bot commented Jun 19, 2024

cs.SE updates on arXiv.org, Wed, 19 Jun 2024 04:00:36 +0000

miniCodeProps: a Minimal Benchmark for Proving Code Properties [pdf]

DocCGen: Document-based Controlled Code Generation [pdf]

REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark [pdf]

A Critical Study of What Code-LLMs (Do Not) Learn [pdf]

Identifying Performance-Sensitive Configurations in Software Systems through Code Analysis with LLM Agents [pdf]

Iterative or Innovative? A Problem-Oriented Perspective for Code Optimization [pdf]

Should AI Optimize Your Code? A Comparative Study of Current Large Language Models Versus Classical Optimizing Compilers [pdf]

CodeNav: Beyond tool-use to using real-world codebases with LLM agents [pdf]

What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering [pdf]

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review [pdf]

Nova: Generative Language Models for Assembly Code with Hierarchical Attention and Contrastive Learning [pdf]

User Centric Evaluation of Code Generation Tools [pdf]

From Image to UML: First Results of Image Based UML Diagram Generation Using LLMs [pdf]

Development of a Real-Time Simulator Using EMTP-ATP Foreign models for Testing Relays [pdf]

AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology [pdf]

DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence [pdf]

Systematic literature review on forecasting and prediction of technical debt evolution [pdf]

CITADEL: Context Similarity Based Deep Learning Framework Bug Finding [pdf]

W2E (Workout to Earn): A Low Cost DApp based on ERC-20 and ERC-721 standards [pdf]

A Step Towards a Universal Method for Modeling and Implementing Cross-Organizational Business Processes [pdf]

Toward Exploring the Code Understanding Capabilities of Pre-trained Code Generation Models [pdf]

Measuring Information Diffusion in Code Review at Spotify [pdf]

ScenEval: A Benchmark for Scenario-Based Evaluation of Code Generation [pdf]

Runtime Verification on Abstract Finite State Models [pdf]

MegaVul: A C/C++ Vulnerability Dataset with Comprehensive Code Representation [pdf]