Recent work shows that Large Language Models (LLMs) struggle to understand natural language constraints for various text generation tasks in zero- and few-shot settings. In the code domain, constraints expressed as code are widely used to maintain the integrity of code written in Domain-Specific Languages (DSLs), yet no work has evaluated LLMs with these constraints. We propose two novel tasks to assess the controllability of LLMs using hard and soft constraints represented as code across five representations. Our findings suggest that LLMs struggle to comprehend constraints in all representations, irrespective of how large a portion of the pre-training data each representation makes up. While models are better at comprehending constraints in JSON, YAML, and natural language representations, they struggle with constraints represented in XML and the resource-rich language Python.
Developing Building Management Systems (BMS) with data-driven methods always faces data and model scalability issues. We propose a methodology to tackle the scalability challenges associated with the development of data-driven models for BMS by using Large Language Models (LLMs). LLMs' code generation adaptability can enable broader adoption of BMS by "automating the automation," particularly the data handling and data-driven modeling processes. In this paper, we use LLMs to generate code that processes structured data from BMS and builds data-driven models for a BMS's specific requirements. This eliminates the need for manual data and model development, reducing the time, effort, and cost associated with this process. Our hypothesis is that LLMs can incorporate domain knowledge about data science and BMS into data processing and modeling, ensuring that the data-driven modeling is automated for the specific requirements of different building types and control objectives, which also improves accuracy and scalability. We generate a prompt template following the framework of Machine Learning Operations so that the prompts are designed to systematically generate Python code for data-driven modeling. Our case study indicates that bi-sequential prompting under the prompt template can achieve a high success rate of code generation and code accuracy, and significantly reduce human labor costs.
Authors: Thu-Trang Nguyen, Thanh Trong Vu, Hieu Dinh Vo, Son Nguyen
Tags: LLM
Large Language Models for Code (code LLMs) have demonstrated remarkable performance across various software engineering (SE) tasks, increasing the application of code LLMs in software development. Despite the success of code LLMs, there remain significant concerns about the actual capabilities and reliability of these models, "whether these models really learn the semantics of code from the training data and leverage the learned knowledge to perform the SE tasks". In this paper, we introduce EMPICA, a comprehensive framework designed to systematically and empirically evaluate the capabilities of code LLMs in understanding code semantics. Specifically, EMPICA systematically introduces controlled modifications/transformations into the input code and examines the models' responses. Generally, code LLMs must be robust to semantically equivalent code inputs and be sensitive to non-equivalent ones for all SE tasks. Specifically, for every SE task, given an input code snippet c and its semantic equivalent variants, code LLMs must robustly produce consistent/equivalent outputs while they are expected to generate different outputs for c and its semantic non-equivalent variants. Our experimental results on three representative code understanding tasks, including code summarization, method name prediction, and output prediction, reveal that the robustness and sensitivity of the state-of-the-art code LLMs to code transformations vary significantly across tasks and transformation operators. In addition, the code LLMs exhibit better robustness to the semantic preserving transformations than their sensitivity to the semantic non-preserving transformations. These results highlight a need to enhance the model's capabilities of understanding code semantics, especially the sensitivity property.
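The robustness and sensitivity properties described above can be illustrated with a minimal sketch. It is not EMPICA's implementation: the two transformers, the `total` function, and the `run_total` helper are hypothetical, and actually executing the code stands in for a model's output prediction so that the example is self-contained. It requires Python 3.9 or later for `ast.unparse`.

```python
import ast

SOURCE = """
def total(values):
    acc = 0
    for v in values:
        acc = acc + v
    return acc
"""

class RenameVariable(ast.NodeTransformer):
    """Semantic-preserving transformation: rename one local variable."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Name(self, node):
        if node.id == self.old:
            node.id = self.new
        return node

class SwapAddForSub(ast.NodeTransformer):
    """Semantic non-preserving transformation: replace every + with -."""

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node

def transform(source, transformer):
    """Parse the source, apply one transformation, and return new source."""
    tree = transformer.visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)

def run_total(source, arg):
    """Hypothetical stand-in for a model's output prediction: run the code."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"](arg)

equivalent = transform(SOURCE, RenameVariable("acc", "running_sum"))
non_equivalent = transform(SOURCE, SwapAddForSub())

baseline = run_total(SOURCE, [1, 2, 3])
# Robustness: the equivalent variant should yield the same output.
robust = run_total(equivalent, [1, 2, 3]) == baseline
# Sensitivity: the non-equivalent variant should yield a different output.
sensitive = run_total(non_equivalent, [1, 2, 3]) != baseline
print(robust, sensitive)
```

In an actual evaluation, `run_total` would be replaced by a query to the code LLM under test, and the two checks would be aggregated over many snippets and transformation operators.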
Authors: Jun Liu, Jiwei Yan, Yuanyuan Xie, Jun Yan, Jian Zhang
Tags: LLM
During software evolution, it is advocated that test code should co-evolve with production code. In real development scenarios, test updates may lag behind production code changes, which may cause the project to fail to compile or bring other troubles. Existing techniques based on pre-trained language models can be adopted to repair obsolete tests caused by such unsynchronized code changes, especially syntactic-related ones. However, the lack of target-oriented contextual information affects repair accuracy on large-scale projects. Starting from an obsolete test, the key challenge is precisely identifying and constructing Test-Repair-Oriented Contexts (TROCtx) from the whole repository within a limited token size.
In this paper, we propose SynBCIATR (Syntactic-Breaking-Change-Induced Automated Test Repair), a novel approach to automatically repair obsolete test cases via precise and concise TROCtx construction. Inspired by developers' programming practices of the task, we design three types of TROCtx: class contexts, usage contexts, and environment contexts. For every type of TROCtx, SynBCIATR automatically collects the changed-token-related code information through static analysis techniques. Then it generates reranking queries to identify the most relevant TROCtxs, which will be taken as the repair-required key context and be input to the Large Language Model for the final test repair.
To evaluate the effectiveness of SynBCIATR, we construct a benchmark dataset that contains diverse syntactic breaking changes. The experimental results show that SynBCIATR outperforms baseline approaches both on textual- and intent-matching metrics. With the augmentation of TROCtx constructed by SynBCIATR, hallucinations are reduced by 57.1%.
The rapid rise of Large Language Models (LLMs) has changed software development, with tools like Copilot, JetBrains AI Assistant, and others boosting developers' productivity. However, developers now spend more time reviewing code than writing it, highlighting the importance of Code Readability for code comprehension. Our previous research found that existing Code Readability models were inaccurate in representing developers' notions and revealed a low consensus among developers, highlighting a need for further investigations in this field.
Building on this, we surveyed 10 Java developers with similar coding experience to evaluate their consensus on Code Readability assessments and related aspects. We found significant agreement among developers on Code Readability evaluations and identified specific code aspects strongly correlated with Code Readability. Overall, our study sheds light on Code Readability within LLM contexts, offering insights into how these models can align with developers' perceptions of Code Readability, enhancing software development in the AI era.
Authors: Ruidi Qiu, Grace Li Zhang, Rolf Drechsler, Ulf Schlichtmann, Bing Li
Tags: LLM
In digital circuit design, testbenches constitute the cornerstone of simulation-based hardware verification. Traditional methodologies for testbench generation during simulation-based hardware verification remain partially manual, resulting in inefficiencies in testing various scenarios and requiring expensive time from designers. Large Language Models (LLMs) have demonstrated their potential in automating the circuit design flow. However, directly applying LLMs to generate testbenches suffers from a low pass rate. To address this challenge, we introduce AutoBench, the first LLM-based testbench generator for digital circuit design, which requires only the description of the design under test (DUT) to automatically generate comprehensive testbenches. In AutoBench, a hybrid testbench structure and a self-checking system are realized using LLMs. To validate the generated testbenches, we also introduce an automated testbench evaluation framework to evaluate the quality of generated testbenches from multiple perspectives. Experimental results demonstrate that AutoBench achieves a 57% improvement in the testbench pass@1 ratio compared with the baseline that directly generates testbenches using LLMs. For 75 sequential circuits, AutoBench achieves a testbench pass@1 ratio 3.36 times that of the baseline. The source code and experimental results are open-sourced at this link: https://github.com/AutoBench/AutoBench
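For readers unfamiliar with the pass@1 metric, the sketch below shows the unbiased pass@k estimator commonly used in code generation evaluation. The abstract does not state how AutoBench computes its pass@1 ratio, so treating it as this estimator is an assumption, and the sample counts are made up for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n generated samples, c of which pass.

    Uses 1 - C(n - c, k) / C(n, k), the probability that at least one of
    k samples drawn without replacement from the n is a passing one.
    """
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: 10 testbenches generated, 3 of them pass.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

For k = 1 the estimator reduces to the plain fraction of passing samples, c / n.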
Authors: Zhimin Zhao, Abdul Ali Bangash, Filipe Roseiro Côgo, Bram Adams, Ahmed E. Hassan
Tags: LLM
Foundation models (FMs), such as large language models (LLMs), which are large-scale machine learning (ML) models, have demonstrated remarkable adaptability in various downstream software engineering (SE) tasks, such as code completion, code understanding, and software development. As a result, FM leaderboards, especially those hosted on cloud platforms, have become essential tools for SE teams to compare and select the best third-party FMs for their specific products and purposes. However, the lack of standardized guidelines for FM evaluation and comparison threatens the transparency of FM leaderboards and limits stakeholders' ability to perform effective FM selection. As a first step towards addressing this challenge, our research focuses on understanding how these FM leaderboards operate in real-world scenarios ("leaderboard operations", or LBOps) and identifying potential leaderboard pitfalls and areas for improvement ("leaderboard smells"). In this regard, we perform a multivocal literature review to collect up to 721 FM leaderboards, after which we examine their documentation and engage in direct communication with leaderboard operators to understand their workflow patterns. Using card sorting and negotiated agreement, we identify 5 unique workflow patterns and develop a domain model that outlines the essential components and their interaction within FM leaderboards. We then identify 8 unique types of leaderboard smells in LBOps. By mitigating these smells, SE teams can improve transparency, accountability, and collaboration in current LBOps practices, fostering a more robust and responsible ecosystem for FM comparison and selection.
Language models of code have demonstrated state-of-the-art performance across various software engineering and source code analysis tasks. However, their demanding computational resource requirements and consequential environmental footprint remain significant challenges. This work introduces ALPINE, an adaptive programming language-agnostic pruning technique designed to substantially reduce these models' computational overhead. The proposed method offers a pluggable layer that can be integrated with all Transformer-based models. With ALPINE, input sequences undergo adaptive compression throughout the pipeline, reaching a size up to 3× smaller than their initial size, resulting in significantly reduced computational load. Our experiments on two software engineering tasks, defect prediction and code clone detection, across three language models, CodeBERT, GraphCodeBERT, and UniXCoder, show that ALPINE achieves up to a 50% reduction in FLOPs, a 58.1% decrease in memory footprint, and a 28.1% improvement in throughput on average. This led to a reduction in CO2 emissions of up to 44.85%. Importantly, it achieves the reduction in computation resources while maintaining up to 98.1% of the original predictive performance. These findings highlight the potential of ALPINE in making language models of code more resource-efficient and accessible while preserving their performance, contributing to the overall sustainability of adopting language models in software development. They also shed light on redundant and noisy information in source code analysis corpora, as shown by the substantial sequence compression achieved by ALPINE.
Authors: Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, Jie M. Zhang
Tags: LLM
Code generation models have increasingly become integral to aiding software development. Although current research has thoroughly examined the correctness of the code produced by code generation models, a vital aspect that plays a pivotal role in green computing and sustainability efforts, namely the efficiency of the generated code, has often been neglected. This paper presents EffiBench, a benchmark with 1,000 efficiency-critical coding problems to assess the efficiency of code generated by code generation models. EffiBench contains a diverse set of LeetCode coding problems. Each problem is paired with an executable human-written canonical solution, which obtains the SOTA efficiency on the LeetCode solution leaderboard. With EffiBench, we empirically examine the ability of 42 large language models (35 open-source and 7 closed-source) to generate efficient code. Our evaluation results demonstrate that the efficiency of the code generated by LLMs is generally worse than the efficiency of human-written canonical solutions. For example, GPT-4 generated code has an average execution time 3.12 times that of the human-written canonical solutions. In the most extreme cases, the execution time and total memory usage of GPT-4 generated code are 13.89 and 43.92 times that of the canonical solutions. The source code of EffiBench is released on https://github.com/huangd1999/EffiBench. We also provide the LeaderBoard at https://huggingface.co/spaces/EffiBench/effibench-leaderboard.
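The "times the execution time of the canonical solution" comparison can be illustrated with a small sketch. It does not reproduce EffiBench's harness or metric definitions, which the abstract does not describe. The two-sum problem, both solutions, and the `time_ratio` helper are hypothetical; the quadratic solution merely stands in for less efficient generated code.

```python
import timeit

def canonical_two_sum(nums, target):
    """Efficient reference solution: one pass with a hash map, O(n)."""
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
    return []

def generated_two_sum(nums, target):
    """Stand-in for a less efficient generated solution: O(n^2) search."""
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return [i, j]
    return []

def time_ratio(candidate, reference, args, repeat=5):
    """Best-of-`repeat` wall-clock time of candidate divided by reference."""
    t_candidate = min(timeit.repeat(lambda: candidate(*args), number=1, repeat=repeat))
    t_reference = min(timeit.repeat(lambda: reference(*args), number=1, repeat=repeat))
    # Guard against a zero reading from a coarse timer.
    return t_candidate / max(t_reference, 1e-9)

nums = list(range(2000))
target = 3997  # only the last two elements (1998 + 1999) sum to this

# Both solutions must agree before their running times are compared.
assert generated_two_sum(nums, target) == canonical_two_sum(nums, target)
ratio = time_ratio(generated_two_sum, canonical_two_sum, (nums, target))
print(f"generated / canonical execution time: {ratio:.1f}x")
```

The measured ratio depends on the machine and input size, so no specific value is claimed here; a memory comparison would need a separate tool such as `tracemalloc`.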
Authors: Jiexin Wang, Xitong Luo, Liuwen Cao, Hongkui He, Hailin Huang, Jiayuan Xie, Adam Jatowt, Yi Cai
Tags: LLM
Large language models (LLMs) have brought significant advancements to code generation and code repair, benefiting both novice and experienced developers. However, their training using unsanitized data from open-source repositories, like GitHub, raises the risk of inadvertently propagating security vulnerabilities. Despite numerous studies investigating the safety of code LLMs, there remains a gap in comprehensively addressing their security features. In this work, we present a comprehensive study aimed at precisely evaluating and enhancing the security aspects of code LLMs. To support our research, we introduce CodeSecEval, a meticulously curated dataset designed to address 44 critical vulnerability types with 180 distinct samples. CodeSecEval serves as the foundation for the automatic evaluation of code models in two crucial tasks: code generation and code repair, with a strong emphasis on security. Our experimental results reveal that current models frequently overlook security issues during both code generation and repair processes, resulting in the creation of vulnerable code. In response, we propose different strategies that leverage vulnerability-aware information and insecure code explanations to mitigate these security vulnerabilities. Furthermore, our findings highlight that certain vulnerability types particularly challenge model performance, influencing their effectiveness in real-world applications. Based on these findings, we believe our study will have a positive impact on the software engineering community, inspiring the development of improved methods for training and utilizing LLMs, thereby leading to safer and more trustworthy model deployment.
Authors: Alex Mathai, Chenxi Huang, Petros Maniatis, Aleksandr Nogikh, Franjo Ivancic, Junfeng Yang, Baishakhi Ray
Tags: LLM
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks. In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel. Unlike application-level software, a systems codebase like Linux is multilingual (low-level C/Assembly/Bash/Rust); gigantic (>20 million lines); critical (impacting billions of devices worldwide), and highly concurrent (involving complex multi-threading). To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym (a platform) and kBench (a dataset). The kGym platform provides a SE environment for large-scale experiments on the Linux kernel, including compiling and running kernels in parallel across several virtual machines, detecting operations and crashes, inspecting logs, and querying and patching the code base. We use kGym to facilitate evaluation on kBench, a crash resolution benchmark drawn from real-world Linux kernel bugs. An example bug in kBench contains crashing stack traces, a bug-reproducer file, a developer-written fix, and other associated data. To understand current performance, we conduct baseline experiments by prompting LLMs to resolve Linux kernel crashes. Our initial evaluations reveal that the best performing LLM achieves 0.72% and 5.38% in the unassisted and assisted (i.e., buggy files disclosed to the model) settings, respectively. These results highlight the need for further research to enhance model performance in SE tasks. Improving performance on kBench requires models to master new learning skills, including understanding the cause of crashes and repairing faults, writing memory-safe and hardware-aware code, and understanding concurrency. As a result, this work opens up multiple avenues of research at the intersection of machine learning and systems software.
Despite deep learning's transformative impact on various domains, the reliability of Deep Neural Networks (DNNs) is still a pressing concern due to their complexity and data dependency. Traditional software fault localization techniques, such as Spectrum-based Fault Localization (SBFL), have been adapted to DNNs with limited success. Existing methods like DeepFault utilize SBFL measures but fail to account for fault propagation across neural pathways, leading to suboptimal fault detection. Addressing this gap, we propose the NP-SBFL method, leveraging Layer-wise Relevance Propagation (LRP) to identify and verify critical neural pathways. Our innovative multi-stage gradient ascent (MGA) technique, an extension of gradient ascent (GA), activates neurons sequentially, enhancing fault detection efficacy. We evaluated the effectiveness of our method, i.e. NP-SBFL-MGA, on two commonly used datasets, MNIST and CIFAR-10, two baselines, DeepFault and NP-SBFL-GA, and three suspicious neuron measures, Tarantula, Ochiai, and Barinel. The empirical results showed that NP-SBFL-MGA is statistically more effective than the baselines at identifying suspicious paths and synthesizing adversarial inputs. Particularly, Tarantula on NP-SBFL-MGA had the highest fault detection rate at 96.75%, surpassing DeepFault on Ochiai (89.90%) and NP-SBFL-GA on Ochiai (60.61%). Our approach also yielded results comparable to those of the baselines in synthesizing naturalness inputs, and we found a positive correlation between the coverage of critical paths and the number of failed tests in DNN fault localization.
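Two of the suspiciousness measures named above, Tarantula and Ochiai, have standard closed forms in spectrum-based fault localization. The sketch below computes them from four counts for a single element (a statement in classic SBFL, a neuron or pathway in the DNN setting). How NP-SBFL derives these counts from neural pathways is not described in the abstract, so the counts and parameter names here are illustrative assumptions.

```python
from math import sqrt

def tarantula(ef, ep, nf, np_):
    """Tarantula suspiciousness.

    ef / ep: failed / passed tests in which the element was covered (active).
    nf / np_: failed / passed tests in which it was not.
    """
    fail_ratio = ef / (ef + nf) if ef + nf else 0.0
    pass_ratio = ep / (ep + np_) if ep + np_ else 0.0
    denom = fail_ratio + pass_ratio
    return fail_ratio / denom if denom else 0.0

def ochiai(ef, ep, nf, np_):
    """Ochiai suspiciousness: ef / sqrt((ef + nf) * (ef + ep))."""
    denom = sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

# Hypothetical element: active in 8 of 10 failing tests and 2 of 10 passing tests.
counts = dict(ef=8, ep=2, nf=2, np_=8)
print(tarantula(**counts), ochiai(**counts))
```

Both measures rank an element higher the more its activity concentrates in failing tests; an element never active in a failing test scores 0 under either measure.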
Authors: Bruno Farias, Rafael Menezes, Eddie B. de Lima Filho, Youcheng Sun, Lucas C. Cordeiro
This paper introduces a tool for verifying Python programs, which, using type annotation and front-end processing, can harness the capabilities of a bounded model-checking (BMC) pipeline. It transforms an input program into an abstract syntax tree to infer and add type information. Then, it translates Python expressions and statements into an intermediate representation. Finally, it converts this description into formulae evaluated with satisfiability modulo theories (SMT) solvers. The proposed approach was realized with the efficient SMT-based bounded model checker (ESBMC), which resulted in a tool called ESBMC-Python, the first BMC-based Python-code verifier. Experimental results, with a test suite specifically developed for this purpose, showed its effectiveness, where successful and failed tests were correctly evaluated. Moreover, it found a real problem in the Ethereum Consensus Specification.
Authors: Frank Reyes, Benoit Baudry, Martin Monperrus
Dependency updates often cause compilation errors when new dependency versions introduce changes that are incompatible with existing client code. Fixing breaking dependency updates is notoriously hard, as their root cause can be hidden deep in the dependency tree. We present Breaking-Good, a tool that automatically generates explanations for breaking updates. Breaking-Good provides a detailed categorization of compilation errors, identifying several factors related to changes in direct and indirect dependencies, incompatibilities between Java versions, and client-specific configuration. With a blended analysis of log and dependency trees, Breaking-Good generates detailed explanations for each breaking update. These explanations help developers understand the causes of the breaking update, and suggest possible actions to fix the breakage. We evaluate Breaking-Good on 243 real-world breaking dependency updates. Our results indicate that Breaking-Good accurately identifies root causes and generates automatic explanations for 70% of these breaking updates. Our user study demonstrates that the generated explanations help developers. Breaking-Good is the first technique that automatically identifies causes of a breaking dependency update and explains the breakage accordingly.
Authors: Johannes Stümpfle, Nasser Jazdi, Michael Weyrich
Software product lines (SPL) have emerged as a pivotal paradigm in software engineering, enabling the efficient development of variant-rich software systems. Consistently updating these systems, often through over-the-air updates, enables the continuous integration of new features and bug fixes, ensuring the system remains up to date throughout its entire lifecycle. However, evolving such complex systems is an error prone task, leading to a phenomenon known as erosion. This phenomenon significantly impacts the efficiency and longevity of software systems, presenting a formidable challenge for manufacturers of variant-rich software systems, such as in the automotive domain. While existing studies concentrate on the evolutionary planning of variant-rich software systems, there is a noticeable lack of research addressing the problem of erosion. In this paper, we conduct an in-depth exploration of the erosion phenomena within variant-rich software systems. We begin by highlighting the significance of controlling erosion in extensive variant-rich software systems. Subsequently, we address the current challenges regarding tackling erosion, including issues such as the lack of a consensus on understanding and defining erosion, as well as the early detection and elimination. Finally, we outline a first approach aimed at tackling erosion in variant-rich software systems.
Authors: Kamalkumar Rathinasamy, Balaji A J, Ankush Kumar, Gagan Gayari, Harshini K, Rajab Ali Mondal, Sreenivasa Raghavan K S, Swayam Singh
This paper presents NT-Java-1.1B, an open-source specialized code language model built on StarCoderBase-1.1B, designed for coding tasks in Java programming. NT-Java-1.1B achieves state-of-the-art performance, surpassing its base model and the majority of other models of similar size on the MultiPL-E Java code benchmark. While there have been studies on extending large, generic pre-trained models to improve proficiency in specific programming languages like Python, similar investigations on small code models for other programming languages are lacking. Large code models require specialized hardware like GPUs for inference, highlighting the need for research into building small code models that can be deployed on developer desktops. This paper addresses this research gap by focusing on the development of a small Java code model, NT-Java-1.1B, and its quantized versions, which perform comparably to open models of around 1.1B parameters on MultiPL-E Java code benchmarks, making them ideal for desktop deployment. This paper establishes the foundation for specialized models across languages and sizes for a family of NT Models.
Authors: Tim Kräuter, Patrick Stünkel, Adrian Rutle, Harald König, Yngve Lamo
Many business process models have control-flow errors, such as deadlocks, which can hinder proper execution. In this paper, we introduce our new soundness-checking tool that can instantaneously identify errors in BPMN models, make them comprehensible for modelers, and even suggest corrections to resolve them automatically. We demonstrate that our tool's soundness checking is instantaneous, i.e., it takes less than 500ms, by benchmarking our tool against synthetic BPMN models with increasing size and state space complexity, as well as realistic models provided in the literature. Moreover, the tool directly displays possible soundness violations in the model and provides an interactive counterexample visualization of each violation. Additionally, it provides fixes to resolve the violations found, which are not currently available in other tools. The tool is open-source, modular, extensible, and integrated into a popular BPMN modeling tool.
Authors: Wesley K. G. Assunção, Luciano Marchezan, Alexander Egyed, Rudolf Ramler
Software modernization is an inherent activity of software engineering, as technology advances and systems inevitably become outdated. The term "software modernization" emerged as a research topic in the early 2000s, with a differentiation from traditional software evolution. Studies on this topic became popular due to new programming paradigms, technologies, and architectural styles. Given the pervasive nature of software today, modernizing legacy systems is paramount to provide users with competitive and innovative products and services. Despite the large amount of work available in the literature, there are significant limitations: (i) proposed approaches are strictly specific to one scenario or technology, lacking flexibility; (ii) most of the proposed approaches are not aligned with the current modern software development scenario; and (iii) due to a myriad of proposed modernization approaches, practitioners may be misguided on how to modernize legacy systems. In this work, our goal is to call attention to the need for advances in research and practices toward a well-defined software modernization domain. The focus is on enabling organizations to preserve the knowledge represented in legacy systems while taking advantage of disruptive and emerging technologies. Based on this goal, we put the different perspectives of software modernization in the context of contemporary software development. We also present a research agenda with 10 challenges to motivate new studies.
Authors: Sean Kauffman, Carlos Moreno, Sebastian Fischmeister
Control flow coverage criteria are an important part of the process of qualifying embedded software for safety-critical systems. Criteria such as modified condition/decision coverage (MC/DC) as defined by DO-178B are used by regulators to judge the adequacy of testing and by QA engineers to design tests when full path coverage is impossible.
Despite their importance, these coverage criteria are often misunderstood. One problem is that their definitions are typically written in natural language specification documents, making them imprecise. Other works have proposed formal definitions using binary predicate logic, but these definitions are difficult to apply to the analysis of real programs. Control-Flow Graphs (CFGs) are the most common model for analyzing program logic in compilers, and seem to be a good fit for defining and analyzing coverage criteria. However, CFGs discard the explicit concept of a decision, making their use for this task seem impossible.
In this paper, we show how to annotate a CFG with decision information inferred from the graph itself. We call this annotated model a Control-Flow Decision Graph (CFDG) and we use it to formally define several common coverage criteria. We have implemented our algorithms in a tool which we show can be applied to automatically annotate CFGs output from popular compilers.
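To make the MC/DC criterion concrete, the sketch below checks a test set against the unique-cause reading of MC/DC: each condition must have a pair of tests that differ only in that condition and produce different decision outcomes. This works directly on truth vectors and is not the paper's CFDG-based formulation; the example decision and the `mcdc_covered_conditions` helper are hypothetical.

```python
from itertools import combinations

def decision(a, b, c):
    """Example decision with three conditions: a and (b or c)."""
    return a and (b or c)

def mcdc_covered_conditions(decision, tests):
    """Return indices of conditions shown to independently affect the outcome.

    A condition counts as covered when two tests differ only in that
    condition and the decision outcome differs between them.
    """
    covered = set()
    for t1, t2 in combinations(tests, 2):
        differing = [i for i, (x, y) in enumerate(zip(t1, t2)) if x != y]
        if len(differing) == 1 and decision(*t1) != decision(*t2):
            covered.add(differing[0])
    return covered

tests = [
    (True, True, False),   # decision is True
    (False, True, False),  # flips a only -> False
    (True, False, False),  # flips b only -> False
    (True, False, True),   # flips c relative to the previous test -> True
]
covered = mcdc_covered_conditions(decision, tests)
print(sorted(covered))  # [0, 1, 2]: all three conditions are covered
```

Four tests suffice here for three conditions, whereas exhaustive testing of the decision would need all eight truth vectors.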
Authors: Asif Kamal Turzo, Sayma Sultana, Amiangshu Bosu
Attracting and retaining a steady stream of new contributors is crucial to ensuring the long-term survival of open-source software (OSS) projects. However, there are two key research gaps regarding recommendations for onboarding new contributors to OSS projects. First, most of the existing recommendations are based on a limited number of projects, which raises concerns about their generalizability. If a recommendation yields conflicting results in a different context, it could hinder a newcomer's onboarding process rather than help them. Second, it's unclear whether these recommendations also apply to experienced contributors. If certain recommendations are specific to newcomers, continuing to follow them after their initial contributions are accepted could hinder their chances of becoming long-term contributors. To address these gaps, we conducted a two-stage mixed-method study. In the first stage, we conducted a Systematic Literature Review (SLR) and identified 15 task-related actionable recommendations that newcomers to OSS projects can follow to improve their odds of successful onboarding. In the second stage, we conduct a large-scale empirical study of five Gerrit-based projects and 1,155 OSS projects from GitHub to assess whether those recommendations assist newcomers' successful onboarding. Our results suggest that four recommendations positively correlate with newcomers' first patch acceptance in most contexts. Four recommendations are context-dependent, and four indicate significant negative associations for most projects. Our results also found three newcomer-specific recommendations, which OSS joiners should abandon at non-newcomer status to increase their odds of becoming long-term contributors.
Smart contracts are autonomous and immutable pieces of code that are deployed on blockchain networks and run by miners. They were first introduced by Ethereum in 2014 and have since been used for various applications such as security tokens, voting, gambling, non-fungible tokens, self-sovereign identities, stock taking, decentralized finances, decentralized exchanges, and atomic swaps. Since smart contracts are immutable, their bugs cannot be fixed, which may lead to significant monetary losses. While many researchers have focused on testing smart contracts, our recent work has highlighted a gap between test adequacy and test data generation, despite numerous efforts in both fields. Our framework, Griffin, tackles this deficiency by employing a targeted symbolic execution technique for generating test data. This tool can be used in diverse applications, such as killing the survived mutants in mutation testing, validating static analysis alarms, creating counter-examples for safety conditions, and reaching manually selected lines of code. This paper discusses how smart contracts differ from legacy software in targeted symbolic execution and how these differences can affect the tool structure, leading us to propose an enhanced version of the control-flow graph for Solidity smart contracts called CFG+. We also discuss how Griffin can utilize custom heuristics to explore the program space and find the test data that reaches a target line while considering a safety condition in a reasonable execution time. We conducted experiments involving an extensive set of smart contracts, target lines, and safety conditions based on real-world faults and test suites from related tools. The results of our evaluation demonstrate that Griffin can effectively identify the required test data within a reasonable timeframe.
Authors: Vittunyuta Maeprasart, Ali Ouni, Raula Gaikovina Kula
Although using third-party libraries has become prevalent in contemporary software development, developers often struggle to update their dependencies. Prior work acknowledges that migration effort, priority, and other issues cause lags in the migration process. The common assumption is that developers should drop all other activities and prioritize fixing the vulnerability. Our objective is to understand developer behavior when facing high-risk vulnerabilities in their code. We explore the prolific, and possibly one-of-a-kind, case of Log4JShell, a vulnerability with the highest severity rating ever, which received widespread media attention. Using a mixed-method approach, we analyze 219 GitHub Pull Requests (PRs) and 354 issues belonging to 53 Maven projects affected by the Log4JShell vulnerability. Our study confirms that developers respond quickly, taking from 5 to 6 days. However, instead of dropping everything, developer activities surprisingly tend to increase for all pending issues and PRs. Developer discussions involved either giving information (29.3%) or seeking information (20.6%), which is missing in existing support tools. Leveraging this possibly one-of-a-kind event, our insights open up a new line of research, causing us to rethink best practices and what developers need in order to efficiently fix vulnerabilities.
Authors: Robert Müller, Mathis Weiß, Malte Lochau
Cardinality-based feature models permit selecting multiple copies of the same feature, thus generalizing the notion of product configurations from subsets of Boolean features to multisets of feature instances. This increased expressiveness shapes a-priori infinite and non-convex configuration spaces, which renders established solution-space mappings based on Boolean presence conditions insufficient for cardinality-based feature models. To address this issue, we propose weighted automata over featured multiset semirings as a novel behavioral variability modeling formalism for cardinality-based feature models. The formalism uses multisets over features as a predefined semantic domain for transition weights. It permits the use of any algebraic structure forming a proper semiring on multisets to aggregate the weights traversed along paths to map accepted words to multiset configurations. In particular, tropical semirings constitute a promising sub-class with a reasonable trade-off between expressiveness and computational tractability of canonical analysis problems. The formalism is strictly more expressive than featured transition systems, as it enables upper-bound multiplicity constraints depending on the length of words. We provide a tool implementation of the behavioral variability model and present preliminary experimental results showing applicability and computational feasibility of the proposed approach.
Authors: Rodrigo Falcão, Andreas Jedlitschka, Frank Elberzhager, Dieter Rombach
The pervasive role played by software in virtually all industries has fostered ever-increasing development of applied research in software engineering. In this chapter, we contribute our experience in using the V-Model as a framework for teaching how to conduct applied research in empirical software engineering. The foundational idea of using the V-Model is presented, and guidance for using it to frame the research is provided. Furthermore, we show how the framework has been instantiated throughout nearly two decades of PhD theses done at the University of Kaiserslautern (RPTU Kaiserslautern) in partnership with Fraunhofer IESE, including the most frequent usage patterns, how the different empirical methods fit into the framework, and the lessons we have learned from this experience.
Authors: Yvonne Dittrich, Helen Sharp, Cleidson de Souza
Ethnography has become one of the established methods for empirical research on software engineering. Although there is a wide variety of introductory books available, there has been no material targeting software engineering students particularly, until now. In this chapter we provide an introduction to teaching and learning ethnography for faculty teaching ethnography to software engineering graduate students and for the students themselves of such courses.
The contents of the chapter focus on what we think is the core basic knowledge for newcomers to ethnography as a research method. We complement the text with proposals for exercises, tips for teaching, and pitfalls that we and our students have experienced.
The chapter is designed to support part of a course on empirical software engineering and provides pointers and literature for further reading.
Authors: Yvonne Dittrich, Johan Bolmsten, Catherine Seidelin
Action research provides the opportunity to explore the usefulness and usability of software engineering methods in industrial settings, and makes it possible to develop methods, tools and techniques with software engineering practitioners. However, as the research moves beyond the observational approach, it requires a different kind of interaction with the software development organisation. This makes action research a challenging endeavour, and it makes it difficult to teach action research through a course that goes beyond explaining the principles.
This chapter is intended to support learning and teaching action research, by providing a rich set of examples, and identifying tools that we found helpful in our action research projects. The core of this chapter focusses on our interaction with the participating developers and domain experts, and the organisational setting.
This chapter is structured around a set of challenges that reoccurred in the action research projects in which the authors participated. Each section is accompanied by a toolkit that presents related techniques and tools. The exercises are designed to explore the topics, and practise using the tools and techniques presented. We hope the material in this chapter encourages researchers who are new to action research to further explore this promising opportunity.
In this chapter, we share an experience report of teaching a master's course on empirical research methods at Eindhoven University of Technology in the Netherlands. The course is taught for ten weeks to a mix of students from different study programs and combines practical assignments with a closed-book exam. We discuss the challenges of teaching a course on research methods and explain how we address these challenges in the course design. Additionally, we share the lessons and the do's and don'ts we learned over several iterations of teaching the course.
Authors: Italo Santos, Katia Romero Felizardo, Marco A. Gerosa, Igor Steinmacher
Contributing to OSS projects can help students to enhance their skills and expand their professional networks. However, novice contributors often feel discouraged due to various barriers. Gamification techniques hold the potential to foster engagement and facilitate the learning process. Nevertheless, it is unknown which game elements are effective in this context. This study explores students' perceptions of gamification elements to inform the design of a gamified learning environment. We surveyed 115 students and segmented the analysis from three perspectives: (1) cognitive styles, (2) gender, and (3) ethnicity (Hispanic/LatinX and Non-Hispanic/LatinX). The results showed that Quest, Point, Stats, and Badge are favored elements, while competition and pressure-related elements are less preferred. Across cognitive styles (persona), gender, and ethnicity, we could not observe any statistical differences, except for Tim's GenderMag persona, which demonstrated a higher preference for storytelling. Conversely, Hispanic/LatinX participants showed a preference for the Choice element. These results can guide tool builders in designing effective gamified learning environments focused on the OSS contribution process.
Authors: Ponkoj Chandra Shill, Md. Azizul Hakim, Muhammad Jahanzeb Khan, Bashira Akter Anima
As robots grow more and more integrated into numerous industries, it is critical to comprehend how humans respond to their failures. This paper systematically studies how trust dynamics and system design are affected by human responses to robot failures. The three-stage survey used in the study provides a thorough understanding of human-robot interactions. While the second stage concentrates on interaction details, such as robot precision and error acknowledgment, the first stage collects demographic data and initial levels of trust. In the last phase, participants' perceptions are examined after the encounter, and trust dynamics, forgiveness, and propensity to suggest robotic technologies are evaluated. Results show that participants' trust in robotic technologies increased significantly when robots acknowledged their errors or limitations to participants and their willingness to suggest robots for activities in the future points to a favorable change in perception, emphasizing the role that direct engagement has in influencing trust dynamics. By providing useful advice for creating more sympathetic, responsive, and reliable robotic systems, the study advances the science of human-robot interaction and promotes a wider adoption of robotic technologies.
Prognosis and Health Management (PHM), critical for ensuring task completion by complex systems and preventing unexpected failures, is widely adopted in aerospace, manufacturing, maritime, rail, energy, and other sectors. However, PHM's development is constrained by bottlenecks such as limited generalization, interpretation, and verification abilities. Presently, generative artificial intelligence (AI), represented by the Large Model, heralds a technological revolution with the potential to fundamentally reshape traditional technological fields and human production methods. Its capabilities, including strong generalization, reasoning, and generative attributes, present opportunities to address PHM's bottlenecks. To this end, based on a systematic analysis of the current challenges and bottlenecks in PHM, as well as the research status and advantages of the Large Model, we propose a novel concept and three progressive paradigms of the Prognosis and Health Management Large Model (PHM-LM) through the integration of the Large Model with PHM. Subsequently, we provide feasible technical approaches for PHM-LM to bolster PHM's core capabilities within the framework of the three paradigms. Moreover, to address core issues confronting PHM, we discuss a series of technical challenges of PHM-LM throughout the entire process of construction and application. This comprehensive effort offers a holistic PHM-LM technical framework and provides avenues for new PHM technologies, methodologies, tools, platforms, and applications, which may also transform the design, research and development, verification, and application modes of PHM. Furthermore, it points toward a new generation of AI-enabled PHM that moves from custom to generalized, from discriminative to generative, and from theoretical conditions to practical applications.
Authors: Xiaokun Luan, Xiyue Zhang, Jingyi Wang, Meng Sun
Model reuse techniques can reduce the resource requirements for training high-performance deep neural networks (DNNs) by leveraging existing models. However, unauthorized reuse and replication of DNNs can lead to copyright infringement and economic loss to the model owner. This underscores the need to analyze the reuse relation between DNNs and develop copyright protection techniques to safeguard intellectual property rights. Existing white-box testing-based approaches cannot address the common heterogeneous reuse case where the model architecture is changed, and DNN fingerprinting approaches heavily rely on generating adversarial examples with good transferability, which is known to be challenging in the black-box setting. To bridge the gap, we propose NFARD, a Neuron Functionality Analysis-based Reuse Detector, which only requires normal test samples to detect reuse relations by measuring the models' differences on a newly proposed model characterization, i.e., neuron functionality (NF). A set of NF-based distance metrics is designed to make NFARD applicable to both white-box and black-box settings. Moreover, we devise a linear transformation method to handle heterogeneous reuse cases by constructing the optimal projection matrix for dimension consistency, significantly extending the application scope of NFARD. To the best of our knowledge, this is the first adversarial example-free method that exploits neuron functionality for DNN copyright protection. As a side contribution, we constructed a reuse detection benchmark named Reuse Zoo that covers various practical reuse techniques and popular datasets. Extensive evaluations on this comprehensive benchmark show that NFARD achieves F1 scores of 0.984 and 1.0 for detecting reuse relationships in black-box and white-box settings, respectively, while generating test suites 2 ~ 99 times faster than previous methods.
Authors: Taylor R. Schorlemmer, Ethan H. Burmane, Kelechi G. Kalu, Santiago Torres-Arias, James C. Davis
Software engineers integrate third-party components into their applications. The resulting software supply chain is vulnerable. To reduce the attack surface, we can verify the origin of components (provenance) before adding them. Cryptographic signatures enable this. This article describes traditional signing, its challenges, and changes introduced by next-generation signing platforms.
Authors: Vishnu Asutosh Dasu, Ashish Kumar, Saeid Tizpaz-Niari, Gang Tan
This paper investigates the neural dropout method as a post-processing bias mitigation for deep neural networks (DNNs). Neural-driven software solutions are increasingly applied in socially critical domains with significant fairness implications. While neural networks are exceptionally good at finding statistical patterns from data, they are notorious for overfitting to the training datasets that may encode and amplify existing biases from the historical data. Existing bias mitigation algorithms often require either modifying the input dataset or modifying the learning algorithms. We posit that the prevalent dropout methods that prevent over-fitting during training by randomly dropping neurons may be an effective and less intrusive approach to improve fairness of pre-trained DNNs. However, finding the ideal set of neurons to drop is a combinatorial problem. We propose NeuFair, a family of post-processing randomized algorithms that mitigate unfairness in pre-trained DNNs. Our randomized search is guided by an objective to minimize discrimination while maintaining the model utility. We show that our design of randomized algorithms provides statistical guarantees on finding optimal solutions, and we empirically evaluate the efficacy and efficiency of NeuFair in improving fairness, with minimal or no performance degradation. Our results show that NeuFair improves fairness by up to 69% and outperforms state-of-the-art post-processing bias techniques.
Authors: Tong Wang, Taotao Gu, Huan Deng, Hu Li, Xiaohui Kuang, Gang Zhao
As autonomous driving systems (ADS) advance towards higher levels of autonomy, orchestrating their safety verification becomes increasingly intricate. This paper unveils ScenarioFuzz, a pioneering scenario-based fuzz testing methodology. Designed like a choreographer who understands the past performances, it uncovers vulnerabilities in ADS without the crutch of predefined scenarios. Leveraging map road networks, such as OPENDRIVE, we extract essential data to form a foundational scenario seed corpus. This corpus, enriched with pertinent information, provides the necessary boundaries for fuzz testing in the absence of starting scenarios. Our approach integrates specialized mutators and mutation techniques, combined with a graph neural network model, to predict and filter out high-risk scenario seeds, optimizing the fuzzing process using historical test data. Compared to other methods, our approach reduces the time cost by an average of 60.3%, while the number of error scenarios discovered per unit of time increases by 103%. Furthermore, we propose a self-supervised collision trajectory clustering method, which aids in identifying and summarizing 54 high-risk scenario categories prone to inducing ADS faults. Our experiments have successfully uncovered 58 bugs across six tested systems, emphasizing the critical safety concerns of ADS.
This research investigates the potential use of a blockchain-based Public Key Infrastructure (PKI) within an organization and compares it to conventional PKI systems. The goal is to assess the advantages and disadvantages of both approaches in order to determine the feasibility of employing blockchain technology for a decentralized PKI. The study will also evaluate the impact of current legal frameworks, such as the Cyber Resilience Act (CRA) and NIS-2 Directive. The study will examine various implementations of blockchain PKIs based on factors such as security, performance, and platform. The results indicate that blockchain-based PKIs can overcome the limitations of conventional PKIs by decentralizing the trust anchor, providing greater security. Blockchain technology allows for the immutable and transparent management of certificates, making tampering significantly more challenging. Additionally, blockchain-based PKIs offer enhanced mechanisms for identifying and addressing certificate misconduct.
Authors: Francisco Zanartu, Christoph Treude, Markus Wagner
Online social networks have become an integral aspect of our daily lives and play a crucial role in shaping our relationships with others. However, bugs and glitches, even minor ones, can cause anything from frustrating problems to serious data leaks that can have far-reaching impacts on millions of users. To mitigate these risks, fuzz testing, a method of testing with randomised inputs, can provide increased confidence in the correct functioning of a social network. However, implementing traditional fuzz testing methods can be prohibitively difficult or impractical for programmers outside of the social network's development team. To tackle this challenge, we present Socialz, a novel approach to social fuzz testing that (1) characterises real users of a social network, (2) diversifies their interaction using evolutionary computation across multiple, non-trivial features, and (3) collects performance data as these interactions are executed. With Socialz, we aim to put social testing tools in everybody's hands, thereby improving the reliability and security of social networks used worldwide. In our study, we came across (1) one known limitation of the current GitLab CE and (2) 6,907 errors, of which 40.16% are beyond our debugging skills.
Since Google introduced Kotlin as an official programming language for developing Android apps in 2017, Kotlin has gained widespread adoption in Android development. However, compared to Java, there is limited support for Kotlin code dependency analysis, which is the foundation of software analysis. To bridge this gap, we develop Depends-Kotlin to extract entities and their dependencies in Kotlin source code. Not only does Depends-Kotlin support extracting entities' dependencies in Kotlin code, but it can also extract dependency relations between Kotlin and Java. The extraction of such cross-language dependencies can help developers understand the migration process from Java to Kotlin. Using three open-source mixed Kotlin-Java projects as our subjects, Depends-Kotlin demonstrates high accuracy and performance in resolving Kotlin-Kotlin and Kotlin-Java dependency relations. The source code of Depends-Kotlin and the dataset used have been made available at https://github.com/XYZboom/depends-kotlin. We also provide a screencast presenting Depends-Kotlin at https://youtu.be/ZPq8SRhgXzM.
Context: The growing size of graph-based modeling artifacts in model-driven engineering calls for techniques that enable efficient execution of graph queries. Incremental approaches based on the RETE algorithm provide an adequate solution in many scenarios, but are generally designed to search for query results over the entire graph. However, in certain situations, a user may only be interested in query results for a subgraph, for instance when a developer is working on a large model of which only a part is loaded into their workspace. In this case, the global execution semantics can result in significant computational overhead.
Contribution: To mitigate the outlined shortcoming, in this paper we propose an extension of the RETE approach that enables local, yet fully incremental execution of graph queries, while still guaranteeing completeness of results with respect to the relevant subgraph.
Results: We empirically evaluate the presented approach via experiments inspired by a scenario from software development and an independent social network benchmark. The experimental results indicate that the proposed technique can significantly improve performance regarding memory consumption and execution time in favorable cases, but may incur a noticeable linear overhead in unfavorable cases.
The verification of Multi-Agent Systems (MAS) poses a significant challenge. Various approaches and methodologies exist to address this challenge; however, tools that support them are not always readily available. Even when such tools are accessible, they tend to be hard-coded, lacking in compositionality, and challenging to use due to a steep learning curve. In this paper, we introduce a methodology designed for the formal verification of MAS in a modular and versatile manner, along with an initial prototype, that we named VITAMIN. Unlike existing verification methodologies and frameworks for MAS, VITAMIN is constructed for easy extension to accommodate various logics (for specifying the properties to verify) and models (for determining on what to verify such properties).
cs.SE updates on arXiv.org, Mon, 08 Jul 2024 04:00:09 +0000
ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages [pdf]
Authors: Mehant Kammakomati, Sameer Pimparkhede, Srikanth Tamilselvam, Prince Kumar, Pushpak Bhattacharyya
Tags:
LLM
Recent work shows Large Language Models (LLMs) struggle to understand natural language constraints for various text generation tasks in zero- and few-shot settings. In the code domain, constraints in code format are widely used to maintain the integrity of code written in Domain-Specific Languages (DSLs), yet there has been no work evaluating LLMs with these constraints. We propose two novel tasks to assess the controllability of LLMs using hard and soft constraints represented as code across five representations. Our findings suggest that LLMs struggle to comprehend constraints in all representations irrespective of their portions in the pre-training data. While models are better at comprehending constraints in JSON, YAML, and natural language representations, they struggle with constraints represented in XML and the resource-rich language Python.
Scaling Data-Driven Building Energy Modelling using Large Language Models [pdf]
Authors: Sunil Khadka, Liang Zhang
Tags:
LLM
Developing a Building Management System (BMS) through a data-driven method always faces data and model scalability issues. We propose a methodology to tackle the scalability challenges associated with the development of data-driven models for BMS by using Large Language Models (LLMs). LLMs' code generation adaptability can enable broader adoption of BMS by "automating the automation," particularly the data handling and data-driven modeling processes. In this paper, we use LLMs to generate code that processes structured data from BMS and builds data-driven models for BMS's specific requirements. This eliminates the need for manual data and model development, reducing the time, effort, and cost associated with this process. Our hypothesis is that LLMs can incorporate domain knowledge about data science and BMS into data processing and modeling, ensuring that the data-driven modeling is automated for specific requirements of different building types and control objectives, which also improves accuracy and scalability. We generate a prompt template following the framework of Machine Learning Operations so that the prompts are designed to systematically generate Python code for data-driven modeling. Our case study indicates that bi-sequential prompting under the prompt template can achieve a high success rate of code generation and code accuracy, and significantly reduce human labor costs.
An Empirical Study on Capability of Large Language Models in Understanding Code Semantics [pdf]
Authors: Thu-Trang Nguyen, Thanh Trong Vu, Hieu Dinh Vo, Son Nguyen
Tags:
LLM
Large Language Models for Code (code LLMs) have demonstrated remarkable performance across various software engineering (SE) tasks, increasing the application of code LLMs in software development. Despite the success of code LLMs, there remain significant concerns about the actual capabilities and reliability of these models, "whether these models really learn the semantics of code from the training data and leverage the learned knowledge to perform the SE tasks". In this paper, we introduce EMPICA, a comprehensive framework designed to systematically and empirically evaluate the capabilities of code LLMs in understanding code semantics. Specifically, EMPICA systematically introduces controlled modifications/transformations into the input code and examines the models' responses. Generally, code LLMs must be robust to semantically equivalent code inputs and be sensitive to non-equivalent ones for all SE tasks. Specifically, for every SE task, given an input code snippet c and its semantic equivalent variants, code LLMs must robustly produce consistent/equivalent outputs while they are expected to generate different outputs for c and its semantic non-equivalent variants. Our experimental results on three representative code understanding tasks, including code summarization, method name prediction, and output prediction, reveal that the robustness and sensitivity of the state-of-the-art code LLMs to code transformations vary significantly across tasks and transformation operators. In addition, the code LLMs exhibit better robustness to the semantic preserving transformations than their sensitivity to the semantic non-preserving transformations. These results highlight a need to enhance the model's capabilities of understanding code semantics, especially the sensitivity property.
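The robustness and sensitivity properties described in the abstract above can be illustrated with a minimal sketch. This is not EMPICA's implementation: the function names, the `same_output` comparison, and the whitespace-normalising `toy_model` are all hypothetical stand-ins, chosen only to show how the two rates could be computed for one input snippet and its variants.

```python
from typing import Callable, Sequence

Model = Callable[[str], str]          # hypothetical: maps code to a task output
Same = Callable[[str, str], bool]     # hypothetical: are two outputs equivalent?


def robustness(model: Model, code: str, equivalent: Sequence[str], same: Same) -> float:
    """Share of semantically equivalent variants whose output matches the original's."""
    if not equivalent:
        raise ValueError("need at least one equivalent variant")
    base = model(code)
    return sum(same(base, model(v)) for v in equivalent) / len(equivalent)


def sensitivity(model: Model, code: str, non_equivalent: Sequence[str], same: Same) -> float:
    """Share of semantically non-equivalent variants whose output differs from the original's."""
    if not non_equivalent:
        raise ValueError("need at least one non-equivalent variant")
    base = model(code)
    return sum(not same(base, model(v)) for v in non_equivalent) / len(non_equivalent)


def toy_model(code: str) -> str:
    """Stand-in 'model' that only normalises whitespace, so it reacts to any token change."""
    return " ".join(code.split())


original = "x = a + b"
equivalent_variants = ["x  =  a + b", "x = b + a"]   # same meaning for integers
non_equivalent_variants = ["x = a - b"]              # different meaning

rob = robustness(toy_model, original, equivalent_variants, lambda p, q: p == q)
sen = sensitivity(toy_model, original, non_equivalent_variants, lambda p, q: p == q)
print(rob, sen)  # 0.5 1.0
```

The toy model is fully sensitive but only half robust: it treats the operand swap as a change even though the meaning is preserved. A real evaluation would replace `toy_model` with calls to a code LLM and `same_output` with a task-specific equivalence check.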
Augmenting LLMs to Repair Obsolete Test Cases with Static Collector and Neural Reranker [pdf]
Authors: Jun Liu, Jiwei Yan, Yuanyuan Xie, Jun Yan, Jian Zhang
Tags:
LLM
During software evolution, it is advocated that test code should co-evolve with production code. In real development scenarios, test updates may lag behind production code changes, which may cause the project to fail to compile or cause other problems. Existing techniques based on pre-trained language models can be adopted to repair obsolete tests caused by such unsynchronized code changes, especially syntax-related ones. However, the lack of target-oriented contextual information affects repair accuracy on large-scale projects. Starting from an obsolete test, the key challenge is precisely identifying and constructing Test-Repair-Oriented Contexts (TROCtx) from the whole repository within a limited token size.
In this paper, we propose SynBCIATR (Syntactic-Breaking-Change-Induced Automated Test Repair), a novel approach to automatically repair obsolete test cases via precise and concise TROCtx construction. Inspired by developers' programming practices of the task, we design three types of TROCtx: class contexts, usage contexts, and environment contexts. For every type of TROCtx, SynBCIATR automatically collects the changed-token-related code information through static analysis techniques. Then it generates reranking queries to identify the most relevant TROCtxs, which will be taken as the repair-required key context and be input to the Large Language Model for the final test repair.
To evaluate the effectiveness of SynBCIATR, we construct a benchmark dataset that contains diverse syntactic breaking changes. The experimental results show that SynBCIATR outperforms baseline approaches both on textual- and intent-matching metrics. With the augmentation of TROCtx constructed by SynBCIATR, hallucinations are reduced by 57.1%.
Assessing Consensus of Developers' Views on Code Readability [pdf]
Authors: Agnia Sergeyuk, Olga Lvova, Sergey Titov, Anastasiia Serova, Farid Bagirov, Timofey Bryksin
Tags:
LLM
The rapid rise of Large Language Models (LLMs) has changed software development, with tools like Copilot, JetBrains AI Assistant, and others boosting developers' productivity. However, developers now spend more time reviewing code than writing it, highlighting the importance of Code Readability for code comprehension. Our previous research found that existing Code Readability models were inaccurate in representing developers' notions and revealed a low consensus among developers, highlighting a need for further investigations in this field.
Building on this, we surveyed 10 Java developers with similar coding experience to evaluate their consensus on Code Readability assessments and related aspects. We found significant agreement among developers on Code Readability evaluations and identified specific code aspects strongly correlated with Code Readability. Overall, our study sheds light on Code Readability within LLM contexts, offering insights into how these models can align with developers' perceptions of Code Readability, enhancing software development in the AI era.
AutoBench: Automatic Testbench Generation and Evaluation Using LLMs for HDL Design [pdf]
Authors: Ruidi Qiu, Grace Li Zhang, Rolf Drechsler, Ulf Schlichtmann, Bing Li
Tags:
LLM
In digital circuit design, testbenches constitute the cornerstone of simulation-based hardware verification. Traditional methodologies for testbench generation during simulation-based hardware verification remain partially manual, resulting in inefficiencies in testing various scenarios and requiring expensive time from designers. Large Language Models (LLMs) have demonstrated their potential in automating the circuit design flow. However, directly applying LLMs to generate testbenches suffers from a low pass rate. To address this challenge, we introduce AutoBench, the first LLM-based testbench generator for digital circuit design, which requires only the description of the design under test (DUT) to automatically generate comprehensive testbenches. In AutoBench, a hybrid testbench structure and a self-checking system are realized using LLMs. To validate the generated testbenches, we also introduce an automated testbench evaluation framework to evaluate the quality of generated testbenches from multiple perspectives. Experimental results demonstrate that AutoBench achieves a 57% improvement in the testbench pass@1 ratio compared with the baseline that directly generates testbenches using LLMs. For 75 sequential circuits, AutoBench achieves a testbench pass@1 ratio 3.36 times that of the baseline. The source code and experimental results are open-sourced at this link: https://github.com/AutoBench/AutoBench
On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards [pdf]
Authors: Zhimin Zhao, Abdul Ali Bangash, Filipe Roseiro Côgo, Bram Adams, Ahmed E. Hassan
Tags:
LLM
Foundation models (FM), such as large language models (LLMs), which are large-scale machine learning (ML) models, have demonstrated remarkable adaptability in various downstream software engineering (SE) tasks, such as code completion, code understanding, and software development. As a result, FM leaderboards, especially those hosted on cloud platforms, have become essential tools for SE teams to compare and select the best third-party FMs for their specific products and purposes. However, the lack of standardized guidelines for FM evaluation and comparison threatens the transparency of FM leaderboards and limits stakeholders' ability to perform effective FM selection. As a first step towards addressing this challenge, our research focuses on understanding how these FM leaderboards operate in real-world scenarios ("leaderboard operations") and identifying potential leaderboard pitfalls and areas for improvement ("leaderboard smells"). In this regard, we perform a multivocal literature review to collect up to 721 FM leaderboards, after which we examine their documentation and engage in direct communication with leaderboard operators to understand their workflow patterns. Using card sorting and negotiated agreement, we identify 5 unique workflow patterns and develop a domain model that outlines the essential components and their interaction within FM leaderboards. We then identify 8 unique types of leaderboard smells in LBOps. By mitigating these smells, SE teams can improve transparency, accountability, and collaboration in current LBOps practices, fostering a more robust and responsible ecosystem for FM comparison and selection.
ALPINE: An adaptive language-agnostic pruning method for language models for code [pdf]
Authors: Mootez Saad, José Antonio Hernández López, Boqi Chen, Dániel Varró, Tushar Sharma
Tags:
FL
Language models of code have demonstrated state-of-the-art performance across various software engineering and source code analysis tasks. However, their demanding computational resource requirements and consequential environmental footprint remain significant challenges. This work introduces ALPINE, an adaptive programming language-agnostic pruning technique designed to substantially reduce these models' computational overhead. The proposed method offers a pluggable layer that can be integrated with all Transformer-based models. With ALPINE, input sequences undergo adaptive compression throughout the pipeline, reaching a size up to 3× smaller than their initial size, resulting in significantly reduced computational load. Our experiments on two software engineering tasks, defect prediction and code clone detection, across three language models (CodeBERT, GraphCodeBERT, and UniXCoder) show that ALPINE achieves up to a 50% reduction in FLOPs, a 58.1% decrease in memory footprint, and a 28.1% improvement in throughput on average. This leads to a reduction in CO2 emissions of up to 44.85%. Importantly, it achieves the reduction in computation resources while maintaining up to 98.1% of the original predictive performance. These findings highlight the potential of ALPINE in making language models of code more resource-efficient and accessible while preserving their performance, contributing to the overall sustainability of adopting language models in software development. It also sheds light on redundant and noisy information in source code analysis corpora, as shown by the substantial sequence compression achieved by ALPINE.
EffiBench: Benchmarking the Efficiency of Automatically Generated Code [pdf]
Authors: Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, Jie M. Zhang
Tags:
LLM
Code generation models have increasingly become integral to aiding software development. Although current research has thoroughly examined the correctness of the code produced by code generation models, a vital aspect that plays a pivotal role in green computing and sustainability efforts has often been neglected. This paper presents EffiBench, a benchmark with 1,000 efficiency-critical coding problems to assess the efficiency of code generated by code generation models. EffiBench contains a diverse set of LeetCode coding problems. Each problem is paired with an executable human-written canonical solution, which obtains the SOTA efficiency on the LeetCode solution leaderboard. With EffiBench, we empirically examine the ability of 42 large language models (35 open-source and 7 closed-source) to generate efficient code. Our evaluation results demonstrate that the efficiency of the code generated by LLMs is generally worse than the efficiency of human-written canonical solutions. For example, GPT-4 generated code has an average execution time 3.12 times that of the human-written canonical solutions. In the most extreme cases, the execution time and total memory usage of GPT-4 generated code are 13.89 and 43.92 times those of the canonical solutions. The source code of EffiBench is released on https://github.com/huangd1999/EffiBench. We also provide the LeaderBoard at https://huggingface.co/spaces/EffiBench/effibench-leaderboard.
Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval [pdf]
Authors: Jiexin Wang, Xitong Luo, Liuwen Cao, Hongkui He, Hailin Huang, Jiayuan Xie, Adam Jatowt, Yi Cai
Tags:
LLM
Large language models (LLMs) have brought significant advancements to code generation and code repair, benefiting both novice and experienced developers. However, their training using unsanitized data from open-source repositories, like GitHub, raises the risk of inadvertently propagating security vulnerabilities. Despite numerous studies investigating the safety of code LLMs, there remains a gap in comprehensively addressing their security features. In this work, we aim to present a comprehensive study aimed at precisely evaluating and enhancing the security aspects of code LLMs. To support our research, we introduce CodeSecEval, a meticulously curated dataset designed to address 44 critical vulnerability types with 180 distinct samples. CodeSecEval serves as the foundation for the automatic evaluation of code models in two crucial tasks: code generation and code repair, with a strong emphasis on security. Our experimental results reveal that current models frequently overlook security issues during both code generation and repair processes, resulting in the creation of vulnerable code. In response, we propose different strategies that leverage vulnerability-aware information and insecure code explanations to mitigate these security vulnerabilities. Furthermore, our findings highlight that certain vulnerability types particularly challenge model performance, influencing their effectiveness in real-world applications. Based on these findings, we believe our study will have a positive impact on the software engineering community, inspiring the development of improved methods for training and utilizing LLMs, thereby leading to safer and more trustworthy model deployment.
KGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution [pdf]
Authors: Alex Mathai, Chenxi Huang, Petros Maniatis, Aleksandr Nogikh, Franjo Ivancic, Junfeng Yang, Baishakhi Ray
Tags:
LLM
Large Language Models (LLMs) are consistently improving at increasingly realistic software engineering (SE) tasks. In real-world software stacks, significant SE effort is spent developing foundational system software like the Linux kernel. Unlike application-level software, a systems codebase like Linux is multilingual (low-level C/Assembly/Bash/Rust); gigantic (>20 million lines); critical (impacting billions of devices worldwide), and highly concurrent (involving complex multi-threading). To evaluate if ML models are useful while developing such large-scale systems-level software, we introduce kGym (a platform) and kBench (a dataset). The kGym platform provides a SE environment for large-scale experiments on the Linux kernel, including compiling and running kernels in parallel across several virtual machines, detecting operations and crashes, inspecting logs, and querying and patching the code base. We use kGym to facilitate evaluation on kBench, a crash resolution benchmark drawn from real-world Linux kernel bugs. An example bug in kBench contains crashing stack traces, a bug-reproducer file, a developer-written fix, and other associated data. To understand current performance, we conduct baseline experiments by prompting LLMs to resolve Linux kernel crashes. Our initial evaluations reveal that the best performing LLM achieves 0.72% and 5.38% in the unassisted and assisted (i.e., buggy files disclosed to the model) settings, respectively. These results highlight the need for further research to enhance model performance in SE tasks. Improving performance on kBench requires models to master new learning skills, including understanding the cause of crashes and repairing faults, writing memory-safe and hardware-aware code, and understanding concurrency. As a result, this work opens up multiple avenues of research at the intersection of machine learning and systems software.
Path Analysis for Effective Fault Localization in Deep Neural Networks [pdf]
Authors: Soroush Hashemifar, Saeed Parsa, Akram Kalaee
Tags:
FL
Despite deep learning's transformative impact on various domains, the reliability of Deep Neural Networks (DNNs) is still a pressing concern due to their complexity and data dependency. Traditional software fault localization techniques, such as Spectrum-based Fault Localization (SBFL), have been adapted to DNNs with limited success. Existing methods like DeepFault utilize SBFL measures but fail to account for fault propagation across neural pathways, leading to suboptimal fault detection. Addressing this gap, we propose the NP-SBFL method, leveraging Layer-wise Relevance Propagation (LRP) to identify and verify critical neural pathways. Our innovative multi-stage gradient ascent (MGA) technique, an extension of gradient ascent (GA), activates neurons sequentially, enhancing fault detection efficacy. We evaluated the effectiveness of our method, i.e., NP-SBFL-MGA, on two commonly used datasets (MNIST and CIFAR-10), against two baselines (DeepFault and NP-SBFL-GA), and with three suspicious neuron measures (Tarantula, Ochiai, and Barinel). The empirical results showed that NP-SBFL-MGA is statistically more effective than the baselines at identifying suspicious paths and synthesizing adversarial inputs. In particular, Tarantula on NP-SBFL-MGA had the highest fault detection rate at 96.75%, surpassing DeepFault on Ochiai (89.90%) and NP-SBFL-GA on Ochiai (60.61%). Our approach also yielded results comparable to those of the baselines in the naturalness of the synthesized inputs, and we found a positive correlation between the coverage of critical paths and the number of failed tests in DNN fault localization.
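For readers unfamiliar with the suspiciousness measures named above, here is a minimal sketch of the standard Tarantula and Ochiai formulas from spectrum-based fault localization. Treating "covered" as "neuron active for the input" follows the general DNN adaptation; the neuron names and counts are hypothetical, and this is not the NP-SBFL pipeline itself.

```python
import math

def tarantula(ef: int, ep: int, nf: int, np_: int) -> float:
    """ef/ep: failing/passing inputs where the element is covered (active);
    nf/np_: failing/passing inputs where it is not."""
    fail_ratio = ef / (ef + nf) if ef + nf else 0.0
    pass_ratio = ep / (ep + np_) if ep + np_ else 0.0
    denom = fail_ratio + pass_ratio
    return fail_ratio / denom if denom else 0.0

def ochiai(ef: int, ep: int, nf: int, np_: int) -> float:
    """Ochiai suspiciousness: ef / sqrt((ef + nf) * (ef + ep))."""
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

# Hypothetical activation spectra: (ef, ep, nf, np) per neuron.
spectra = {
    "neuron_a": (8, 2, 2, 8),   # mostly active on failing inputs
    "neuron_b": (5, 5, 5, 5),   # equally active on both
    "neuron_c": (1, 9, 9, 1),   # mostly active on passing inputs
}

# Rank neurons from most to least suspicious under Ochiai.
ranking = sorted(spectra, key=lambda n: ochiai(*spectra[n]), reverse=True)
```

Both measures score an element higher the more its coverage concentrates on failing inputs; they differ in how they normalize by the passing ones.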
ESBMC-Python: A Bounded Model Checker for Python Programs [pdf]
Authors: Bruno Farias, Rafael Menezes, Eddie B. de Lima Filho, Youcheng Sun, Lucas C. Cordeiro
This paper introduces a tool for verifying Python programs, which, using type annotation and front-end processing, can harness the capabilities of a bounded model-checking (BMC) pipeline. It transforms an input program into an abstract syntax tree to infer and add type information. Then, it translates Python expressions and statements into an intermediate representation. Finally, it converts this description into formulae evaluated with satisfiability modulo theories (SMT) solvers. The proposed approach was realized with the efficient SMT-based bounded model checker (ESBMC), which resulted in a tool called ESBMC-Python, the first BMC-based Python-code verifier. Experimental results, with a test suite specifically developed for this purpose, showed its effectiveness, where successful and failed tests were correctly evaluated. Moreover, it found a real problem in the Ethereum Consensus Specification.
Breaking-Good: Explaining Breaking Dependency Updates with Build Analysis [pdf]
Authors: Frank Reyes, Benoit Baudry, Martin Monperrus
Dependency updates often cause compilation errors when new dependency versions introduce changes that are incompatible with existing client code. Fixing breaking dependency updates is notoriously hard, as their root cause can be hidden deep in the dependency tree. We present Breaking-Good, a tool that automatically generates explanations for breaking updates. Breaking-Good provides a detailed categorization of compilation errors, identifying several factors related to changes in direct and indirect dependencies, incompatibilities between Java versions, and client-specific configuration. With a blended analysis of log and dependency trees, Breaking-Good generates detailed explanations for each breaking update. These explanations help developers understand the causes of the breaking update, and suggest possible actions to fix the breakage. We evaluate Breaking-Good on 243 real-world breaking dependency updates. Our results indicate that Breaking-Good accurately identifies root causes and generates automatic explanations for 70% of these breaking updates. Our user study demonstrates that the generated explanations help developers. Breaking-Good is the first technique that automatically identifies causes of a breaking dependency update and explains the breakage accordingly.
Tackling Erosion in Variant-Rich Software Systems: Challenges and Approaches [pdf]
Authors: Johannes Stümpfle, Nasser Jazdi, Michael Weyrich
Software product lines (SPL) have emerged as a pivotal paradigm in software engineering, enabling the efficient development of variant-rich software systems. Consistently updating these systems, often through over-the-air updates, enables the continuous integration of new features and bug fixes, ensuring the system remains up to date throughout its entire lifecycle. However, evolving such complex systems is an error prone task, leading to a phenomenon known as erosion. This phenomenon significantly impacts the efficiency and longevity of software systems, presenting a formidable challenge for manufacturers of variant-rich software systems, such as in the automotive domain. While existing studies concentrate on the evolutionary planning of variant-rich software systems, there is a noticeable lack of research addressing the problem of erosion. In this paper, we conduct an in-depth exploration of the erosion phenomena within variant-rich software systems. We begin by highlighting the significance of controlling erosion in extensive variant-rich software systems. Subsequently, we address the current challenges regarding tackling erosion, including issues such as the lack of a consensus on understanding and defining erosion, as well as the early detection and elimination. Finally, we outline a first approach aimed at tackling erosion in variant-rich software systems.
Narrow Transformer: Starcoder-Based Java-LM For Desktop [pdf]
Authors: Kamalkumar Rathinasamy, Balaji A J, Ankush Kumar, Gagan Gayari, Harshini K, Rajab Ali Mondal, Sreenivasa Raghavan K S, Swayam Singh
This paper presents NT-Java-1.1B, an open-source specialized code language model built on StarCoderBase-1.1B, designed for coding tasks in Java programming. NT-Java-1.1B achieves state-of-the-art performance, surpassing its base model and the majority of other models of similar size on the MultiPL-E Java code benchmark. While there have been studies on extending large, generic pre-trained models to improve proficiency in specific programming languages like Python, similar investigations on small code models for other programming languages are lacking. Large code models require specialized hardware like GPUs for inference, highlighting the need for research into building small code models that can be deployed on developer desktops. This paper addresses this research gap by focusing on the development of a small Java code model, NT-Java-1.1B, and its quantized versions, which perform comparably to open models around 1.1B on MultiPL-E Java code benchmarks, making them ideal for desktop deployment. This paper establishes the foundation for specialized models across languages and sizes for a family of NT Models.
Instantaneous, Comprehensible, and Fixable Soundness Checking of Realistic BPMN Models [pdf]
Authors: Tim Kräuter, Patrick Stünkel, Adrian Rutle, Harald König, Yngve Lamo
Many business process models have control-flow errors, such as deadlocks, which can hinder proper execution. In this paper, we introduce our new soundness-checking tool that can instantaneously identify errors in BPMN models, make them comprehensible for modelers, and even suggest corrections to resolve them automatically. We demonstrate that our tool's soundness checking is instantaneous, i.e., it takes less than 500ms, by benchmarking our tool against synthetic BPMN models with increasing size and state space complexity, as well as realistic models provided in the literature. Moreover, the tool directly displays possible soundness violations in the model and provides an interactive counterexample visualization of each violation. Additionally, it provides fixes to resolve the violations found, which are not currently available in other tools. The tool is open-source, modular, extensible, and integrated into a popular BPMN modeling tool.
Contemporary Software Modernization: Perspectives and Challenges to Deal with Legacy Systems [pdf]
Authors: Wesley K. G. Assunção, Luciano Marchezan, Alexander Egyed, Rudolf Ramler
Software modernization is an inherent activity of software engineering, as technology advances and systems inevitably become outdated. The term "software modernization" emerged as a research topic in the early 2000s, with a differentiation from traditional software evolution. Studies on this topic became popular due to new programming paradigms, technologies, and architectural styles. Given the pervasive nature of software today, modernizing legacy systems is paramount to provide users with competitive and innovative products and services. Despite the large amount of work available in the literature, there are significant limitations: (i) proposed approaches are strictly specific to one scenario or technology, lacking flexibility; (ii) most of the proposed approaches are not aligned with the current modern software development scenario; and (iii) due to a myriad of proposed modernization approaches, practitioners may be misguided on how to modernize legacy systems. In this work, our goal is to call attention to the need for advances in research and practices toward a well-defined software modernization domain. The focus is on enabling organizations to preserve the knowledge represented in legacy systems while taking advantage of disruptive and emerging technologies. Based on this goal, we put the different perspectives of software modernization in the context of contemporary software development. We also present a research agenda with 10 challenges to motivate new studies.
Annotating Control-Flow Graphs for Formalized Test Coverage Criteria [pdf]
Authors: Sean Kauffman, Carlos Moreno, Sebastian Fischmeister
Control flow coverage criteria are an important part of the process of qualifying embedded software for safety-critical systems. Criteria such as modified condition/decision coverage (MC/DC) as defined by DO-178B are used by regulators to judge the adequacy of testing and by QA engineers to design tests when full path coverage is impossible.
Despite their importance, these coverage criteria are often misunderstood. One problem is that their definitions are typically written in natural language specification documents, making them imprecise. Other works have proposed formal definitions using binary predicate logic, but these definitions are difficult to apply to the analysis of real programs. Control-Flow Graphs (CFGs) are the most common model for analyzing program logic in compilers, and seem to be a good fit for defining and analyzing coverage criteria. However, CFGs discard the explicit concept of a decision, making their use for this task seem impossible.
In this paper, we show how to annotate a CFG with decision information inferred from the graph itself. We call this annotated model a Control-Flow Decision Graph (CFDG) and we use it to formally define several common coverage criteria. We have implemented our algorithms in a tool which we show can be applied to automatically annotate CFGs output from popular compilers.
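To make the MC/DC criterion mentioned above concrete, the sketch below enumerates, for a single decision, the pairs of test vectors that show each condition independently affecting the outcome (the unique-cause reading of MC/DC). It works on a plain Boolean function rather than on a control-flow graph, so it illustrates the criterion only and is not the CFDG annotation algorithm from the paper; the example decision is hypothetical.

```python
from itertools import product

def mcdc_pairs(decision, n_conditions):
    """For each condition index, list the pairs of test vectors that differ
    only in that condition and produce different decision outcomes."""
    pairs = {i: [] for i in range(n_conditions)}
    for vec in product([False, True], repeat=n_conditions):
        for i in range(n_conditions):
            if vec[i]:
                continue  # visit each pair once, starting from its False side
            flipped = vec[:i] + (True,) + vec[i + 1:]
            if decision(*vec) != decision(*flipped):
                pairs[i].append((vec, flipped))
    return pairs

# Hypothetical decision with three conditions: a and (b or c)
def decision(a, b, c):
    return a and (b or c)

pairs = mcdc_pairs(decision, 3)
# A test suite satisfies unique-cause MC/DC for this decision if it contains
# at least one pair from pairs[i] for every condition i.
```

For this decision, conditions b and c each have exactly one independence pair, which is why MC/DC test sets are often much smaller than the full truth table yet still constrain which vectors must be chosen.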
From First Patch to Long-Term Contributor: Evaluating Onboarding Recommendations for OSS Newcomers [pdf]
Authors: Asif Kamal Turzo, Sayma Sultana, Amiangshu Bosu
Attracting and retaining a steady stream of new contributors is crucial to ensuring the long-term survival of open-source software (OSS) projects. However, there are two key research gaps regarding recommendations for onboarding new contributors to OSS projects. First, most of the existing recommendations are based on a limited number of projects, which raises concerns about their generalizability. If a recommendation yields conflicting results in a different context, it could hinder a newcomer's onboarding process rather than help them. Second, it's unclear whether these recommendations also apply to experienced contributors. If certain recommendations are specific to newcomers, continuing to follow them after their initial contributions are accepted could hinder their chances of becoming long-term contributors. To address these gaps, we conducted a two-stage mixed-method study. In the first stage, we conducted a Systematic Literature Review (SLR) and identified 15 task-related actionable recommendations that newcomers to OSS projects can follow to improve their odds of successful onboarding. In the second stage, we conduct a large-scale empirical study of five Gerrit-based projects and 1,155 OSS projects from GitHub to assess whether those recommendations assist newcomers' successful onboarding. Our results suggest that four recommendations positively correlate with newcomers' first patch acceptance in most contexts. Four recommendations are context-dependent, and four indicate significant negative associations for most projects. Our results also found three newcomer-specific recommendations, which OSS joiners should abandon at non-newcomer status to increase their odds of becoming long-term contributors.
Effective Targeted Testing of Smart Contracts [pdf]
Authors: Mahdi Fooladgar, Fathiyeh Faghih
Smart contracts are autonomous and immutable pieces of code that are deployed on blockchain networks and run by miners. They were first introduced by Ethereum in 2014 and have since been used for various applications such as security tokens, voting, gambling, non-fungible tokens, self-sovereign identities, stock taking, decentralized finances, decentralized exchanges, and atomic swaps. Since smart contracts are immutable, their bugs cannot be fixed, which may lead to significant monetary losses. While many researchers have focused on testing smart contracts, our recent work has highlighted a gap between test adequacy and test data generation, despite numerous efforts in both fields. Our framework, Griffin, tackles this deficiency by employing a targeted symbolic execution technique for generating test data. This tool can be used in diverse applications, such as killing the survived mutants in mutation testing, validating static analysis alarms, creating counter-examples for safety conditions, and reaching manually selected lines of code. This paper discusses how smart contracts differ from legacy software in targeted symbolic execution and how these differences can affect the tool structure, leading us to propose an enhanced version of the control-flow graph for Solidity smart contracts called CFG+. We also discuss how Griffin can utilize custom heuristics to explore the program space and find the test data that reaches a target line while considering a safety condition in a reasonable execution time. We conducted experiments involving an extensive set of smart contracts, target lines, and safety conditions based on real-world faults and test suites from related tools. The results of our evaluation demonstrate that Griffin can effectively identify the required test data within a reasonable timeframe.
Drop it All or Pick it Up? How Developers Responded to the Log4JShell Vulnerability [pdf]
Authors: Vittunyuta Maeprasart, Ali Ouni, Raula Gaikovina Kula
Although using third-party libraries has become prevalent in contemporary software development, developers often struggle to update their dependencies. Prior work acknowledges that migration effort, priority, and other issues cause lags in the migration process. The common assumption is that developers should drop all other activities and prioritize fixing the vulnerability. Our objective is to understand developer behavior when facing high-risk vulnerabilities in their code. We explore the prolific, and possibly one-of-a-kind, case of Log4JShell, a vulnerability with the highest severity rating ever, which received widespread media attention. Using a mixed-method approach, we analyze 219 GitHub Pull Requests (PRs) and 354 issues belonging to 53 Maven projects affected by the Log4JShell vulnerability. Our study confirms that developers show a quick response, taking from 5 to 6 days. However, instead of dropping everything, developer activities surprisingly tend to increase for all pending issues and PRs. Developer discussions involved either giving information (29.3%) or seeking information (20.6%), which is missing in existing support tools. The insights from this possibly one-of-a-kind event open up a new line of research, causing us to rethink best practices and what developers need in order to efficiently fix vulnerabilities.
Mapping Cardinality-based Feature Models to Weighted Automata over Featured Multiset Semirings (Extended Version) [pdf]
Authors: Robert Müller, Mathis Weiß, Malte Lochau
Cardinality-based feature models permit selecting multiple copies of the same feature, thus generalizing the notion of product configurations from subsets of Boolean features to multisets of feature instances. This increased expressiveness shapes a-priori infinite and non-convex configuration spaces, which renders established solution-space mappings based on Boolean presence conditions insufficient for cardinality-based feature models. To address this issue, we propose weighted automata over featured multiset semirings as a novel behavioral variability modeling formalism for cardinality-based feature models. The formalism uses multisets over features as a predefined semantic domain for transition weights. It permits the use of any algebraic structure forming a proper semiring on multisets to aggregate the weights traversed along paths in order to map accepted words to multiset configurations. In particular, tropical semirings constitute a promising sub-class with a reasonable trade-off between expressiveness and computational tractability of canonical analysis problems. The formalism is strictly more expressive than featured transition systems, as it enables upper-bound multiplicity constraints depending on the length of words. We provide a tool implementation of the behavioral variability model and present preliminary experimental results showing applicability and computational feasibility of the proposed approach.
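The idea of aggregating transition weights along paths with a tropical semiring can be sketched as follows. This is a simplified illustration under stated assumptions: weights are plain numbers (for instance, the instance count of a single feature) rather than the multisets over features used in the paper, and the automaton, its states, and its weights are hypothetical.

```python
import math
from collections import defaultdict

# Tropical (min-plus) semiring: semiring "addition" is min, "multiplication" is +.
ZERO = math.inf  # neutral element of min (weight of "no run")
ONE = 0          # neutral element of + (weight of the empty run)

def word_weight(transitions, initial, accepting, word):
    """Weight of a word: the minimum, over all accepting runs, of the sum of
    the transition weights along the run."""
    current = {initial: ONE}  # best weight found so far for each reachable state
    for symbol in word:
        nxt = defaultdict(lambda: ZERO)
        for state, weight in current.items():
            for target, w in transitions.get((state, symbol), []):
                nxt[target] = min(nxt[target], weight + w)
        current = dict(nxt)
    return min((w for s, w in current.items() if s in accepting), default=ZERO)

# Hypothetical nondeterministic automaton: (state, symbol) -> [(target, weight)]
transitions = {
    ("q0", "a"): [("q0", 1), ("q1", 3)],
    ("q0", "b"): [("q1", 2)],
    ("q1", "b"): [("q1", 0)],
}

weight_ab = word_weight(transitions, "q0", {"q1"}, "ab")
```

Swapping `min` and `+` for another pair of operations that form a semiring changes how path weights are combined without touching the traversal, which is the flexibility the abstract attributes to the formalism.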
Experiences in Using the V-Model as a Framework for Applied Doctoral Research [pdf]
Authors: Rodrigo Falcão, Andreas Jedlitschka, Frank Elberzhager, Dieter Rombach
The pervasive role played by software in virtually all industries has fostered ever-increasing development of applied research in software engineering. In this chapter, we contribute our experience in using the V-Model as a framework for teaching how to conduct applied research in empirical software engineering. The foundational idea of using the V-Model is presented, and guidance for using it to frame the research is provided. Furthermore, we show how the framework has been instantiated throughout nearly two decades of PhD theses done at the University of Kaiserslautern (RPTU Kaiserslautern) in partnership with Fraunhofer IESE, including the most frequent usage patterns, how the different empirical methods fit into the framework, and the lessons we have learned from this experience.
Teaching and Learning Ethnography for Software Engineering Contexts [pdf]
Authors: Yvonne Dittrich, Helen Sharp, Cleidson de Souza
Ethnography has become one of the established methods for empirical research on software engineering. Although there is a wide variety of introductory books available, there has been no material targeting software engineering students particularly, until now. In this chapter we provide an introduction to teaching and learning ethnography for faculty teaching ethnography to software engineering graduate students and for the students themselves of such courses.
The content of the chapter focuses on what we think is the core basic knowledge for newbies to ethnography as a research method. We complement the text with proposals for exercises, tips for teaching, and pitfalls that we and our students have experienced.
The chapter is designed to support part of a course on empirical software engineering and provides pointers and literature for further reading.
Action Research with Industrial Software Engineering -- An Educational Perspective [pdf]
Authors: Yvonne Dittrich, Johan Bolmsten, Catherine Seidelin
Action research provides the opportunity to explore the usefulness and usability of software engineering methods in industrial settings, and makes it possible to develop methods, tools and techniques with software engineering practitioners. However, as the research moves beyond the observational approach, it requires a different kind of interaction with the software development organisation. This makes action research a challenging endeavour, and it makes it difficult to teach action research through a course that goes beyond explaining the principles.
This chapter is intended to support learning and teaching action research, by providing a rich set of examples, and identifying tools that we found helpful in our action research projects. The core of this chapter focusses on our interaction with the participating developers and domain experts, and the organisational setting.
This chapter is structured around a set of challenges that reoccurred in the action research projects in which the authors participated. Each section is accompanied by a toolkit that presents related techniques and tools. The exercises are designed to explore the topics, and practise using the tools and techniques presented. We hope the material in this chapter encourages researchers who are new to action research to further explore this promising opportunity.
Teaching Empirical Methods at Eindhoven University of Technology [pdf]
Authors: Alexander Serebrenik, Nathan Cassee
In this chapter, we share an experience report of teaching a master course on empirical research methods at Eindhoven University of Technology in the Netherlands. The course is taught for ten weeks to a mix of students from different study programs and combines both practical assignments with a closed-book exam. We discuss the challenges of teaching a course on research methods and explain how we address these challenges in the course design. Additionally, we share our lessons learned and the do's and don'ts we learned over several iterations of teaching the course.
Game Elements to Engage Students Learning the Open Source Software Contribution Process [pdf]
Authors: Italo Santos, Katia Romero Felizardo, Marco A. Gerosa, Igor Steinmacher
Contributing to OSS projects can help students to enhance their skills and expand their professional networks. However, novice contributors often feel discouraged due to various barriers. Gamification techniques hold the potential to foster engagement and facilitate the learning process. Nevertheless, it is unknown which game elements are effective in this context. This study explores students' perceptions of gamification elements to inform the design of a gamified learning environment. We surveyed 115 students and segmented the analysis from three perspectives: (1) cognitive styles, (2) gender, and (3) ethnicity (Hispanic/LatinX and Non-Hispanic/LatinX). The results showed that Quest, Point, Stats, and Badge are favored elements, while competition and pressure-related are less preferred. Across cognitive styles (persona), gender, and ethnicity, we could not observe any statistical differences, except for Tim's GenderMag persona, which demonstrated a higher preference for storytelling. Conversely, Hispanic/LatinX participants showed a preference for the Choice element. These results can guide tool builders in designing effective gamified learning environments focused on the OSS contributions process.
Human Reactions to Incorrect Answers from Robots [pdf]
Authors: Ponkoj Chandra Shill, Md. Azizul Hakim, Muhammad Jahanzeb Khan, Bashira Akter Anima
As robots grow more and more integrated into numerous industries, it is critical to comprehend how humans respond to their failures. This paper systematically studies how trust dynamics and system design are affected by human responses to robot failures. The three-stage survey used in the study provides a thorough understanding of human-robot interactions. The first stage collects demographic data and initial levels of trust, while the second stage concentrates on interaction details, such as robot precision and error acknowledgment. In the last stage, participants' perceptions are examined after the encounter, and trust dynamics, forgiveness, and propensity to suggest robotic technologies are evaluated. Results show that participants' trust in robotic technologies increased significantly when robots acknowledged their errors or limitations to participants. Their willingness to suggest robots for future activities points to a favorable change in perception, emphasizing the role that direct engagement has in influencing trust dynamics. By providing useful advice for creating more sympathetic, responsive, and reliable robotic systems, the study advances the science of human-robot interaction and promotes a wider adoption of robotic technologies.
An Outline of Prognostics and Health Management Large Model: Concepts, Paradigms, and Challenges [pdf]
Authors: Laifa Tao, Shangyu Li, Haifei Liu, Qixuan Huang, Liang Ma, Guoao Ning, Yiling Chen, Yunlong Wu, Bin Li, Weiwei Zhang, Zhengduo Zhao, Wenchao Zhan, Wenyan Cao, Chao Wang, Hongmei Liu, Jian Ma, Mingliang Suo, Yujie Cheng, Yu Ding, Dengwei Song, Chen Lu
Prognosis and Health Management (PHM), critical for ensuring task completion by complex systems and preventing unexpected failures, is widely adopted in aerospace, manufacturing, maritime, rail, energy, and other sectors. However, PHM's development is constrained by bottlenecks such as limited generalization, interpretation, and verification abilities. Presently, generative artificial intelligence (AI), represented by the Large Model, heralds a technological revolution with the potential to fundamentally reshape traditional technological fields and human production methods. Its capabilities, including strong generalization, reasoning, and generative attributes, present opportunities to address PHM's bottlenecks. To this end, based on a systematic analysis of the current challenges and bottlenecks in PHM, as well as the research status and advantages of the Large Model, we propose a novel concept and three progressive paradigms of the Prognosis and Health Management Large Model (PHM-LM) through the integration of the Large Model with PHM. Subsequently, we provide feasible technical approaches for PHM-LM to bolster PHM's core capabilities within the framework of the three paradigms. Moreover, to address core issues confronting PHM, we discuss a series of technical challenges of PHM-LM throughout the entire process of construction and application. This comprehensive effort offers a holistic PHM-LM technical framework and provides avenues for new PHM technologies, methodologies, tools, platforms, and applications, which may also transform the design, research and development, verification, and application modes of PHM. Furthermore, it points toward a new generation of AI-enabled PHM that moves from custom to generalized, from discriminative to generative, and from theoretical conditions to practical applications.
Protecting Deep Learning Model Copyrights with Adversarial Example-Free Reuse Detection [pdf]
Authors: Xiaokun Luan, Xiyue Zhang, Jingyi Wang, Meng Sun
Model reuse techniques can reduce the resource requirements for training high-performance deep neural networks (DNNs) by leveraging existing models. However, unauthorized reuse and replication of DNNs can lead to copyright infringement and economic loss to the model owner. This underscores the need to analyze the reuse relation between DNNs and develop copyright protection techniques to safeguard intellectual property rights. Existing white-box testing-based approaches cannot address the common heterogeneous reuse case where the model architecture is changed, and DNN fingerprinting approaches heavily rely on generating adversarial examples with good transferability, which is known to be challenging in the black-box setting. To bridge the gap, we propose NFARD, a Neuron Functionality Analysis-based Reuse Detector, which only requires normal test samples to detect reuse relations by measuring the models' differences on a newly proposed model characterization, i.e., neuron functionality (NF). A set of NF-based distance metrics is designed to make NFARD applicable to both white-box and black-box settings. Moreover, we devise a linear transformation method to handle heterogeneous reuse cases by constructing the optimal projection matrix for dimension consistency, significantly extending the application scope of NFARD. To the best of our knowledge, this is the first adversarial example-free method that exploits neuron functionality for DNN copyright protection. As a side contribution, we constructed a reuse detection benchmark named Reuse Zoo that covers various practical reuse techniques and popular datasets. Extensive evaluations on this comprehensive benchmark show that NFARD achieves F1 scores of 0.984 and 1.0 for detecting reuse relationships in black-box and white-box settings, respectively, while generating test suites 2 ~ 99 times faster than previous methods.
Establishing Provenance Before Coding: Traditional and Next-Gen Signing [pdf]
Authors: Taylor R. Schorlemmer, Ethan H. Burmane, Kelechi G. Kalu, Santiago Torres-Arias, James C. Davis
Software engineers integrate third-party components into their applications. The resulting software supply chain is vulnerable. To reduce the attack surface, we can verify the origin of components (provenance) before adding them. Cryptographic signatures enable this. This article describes traditional signing, its challenges, and the changes introduced by next-generation signing platforms.
NeuFair: Neural Network Fairness Repair with Dropout [pdf]
Authors: Vishnu Asutosh Dasu, Ashish Kumar, Saeid Tizpaz-Niari, Gang Tan
This paper investigates the neural dropout method as a post-processing bias mitigation for deep neural networks (DNNs). Neural-driven software solutions are increasingly applied in socially critical domains with significant fairness implications. While neural networks are exceptionally good at finding statistical patterns from data, they are notorious for overfitting to the training datasets that may encode and amplify existing biases from the historical data. Existing bias mitigation algorithms often require either modifying the input dataset or modifying the learning algorithms. We posit that the prevalent dropout methods that prevent over-fitting during training by randomly dropping neurons may be an effective and less intrusive approach to improve fairness of pre-trained DNNs. However, finding the ideal set of neurons to drop is a combinatorial problem. We propose NeuFair, a family of post-processing randomized algorithms that mitigate unfairness in pre-trained DNNs. Our randomized search is guided by an objective to minimize discrimination while maintaining the model utility. We show that our design of randomized algorithms provides statistical guarantees on finding optimal solutions, and we empirically evaluate the efficacy and efficiency of NeuFair in improving fairness, with minimal or no performance degradation. Our results show that NeuFair improves fairness by up to 69% and outperforms state-of-the-art post-processing bias techniques.
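To make the search problem in the abstract above concrete, here is a minimal sketch of a randomized search over dropout masks. It is not the NeuFair algorithm itself: the function `random_search_dropout`, the scoring functions, the acceptance rule, and the per-neuron `BIAS` and `USEFULNESS` numbers are all hypothetical stand-ins for evaluating a real pre-trained network on validation data.

```python
import random

def random_search_dropout(n_neurons, unfairness, utility, min_utility,
                          steps=2000, seed=0):
    """Randomized local search over dropout masks (illustrative only).

    mask[i] == True means neuron i is dropped. `unfairness` and `utility`
    are caller-supplied scoring functions (assumed, not from the paper).
    """
    rng = random.Random(seed)
    mask = [False] * n_neurons            # start from the unmodified network
    best_mask, best_score = mask[:], unfairness(mask)
    for _ in range(steps):
        candidate = mask[:]
        i = rng.randrange(n_neurons)
        candidate[i] = not candidate[i]   # flip one neuron's dropped state
        if utility(candidate) < min_utility:
            continue                      # reject masks that cost too much utility
        score = unfairness(candidate)
        if score <= unfairness(mask):     # accept moves that do not worsen fairness
            mask = candidate
            if score < best_score:
                best_mask, best_score = candidate[:], score
    return best_mask, best_score

# Toy per-neuron contributions (made-up numbers for illustration).
BIAS = [0.30, 0.05, 0.20, 0.00, 0.10]
USEFULNESS = [0.02, 0.20, 0.03, 0.25, 0.01]

def toy_unfairness(mask):
    # Bias contributed by the neurons that are still active.
    return sum(b for b, dropped in zip(BIAS, mask) if not dropped)

def toy_utility(mask):
    # Utility lost by dropping useful neurons.
    return 1.0 - sum(u for u, dropped in zip(USEFULNESS, mask) if dropped)

mask, score = random_search_dropout(5, toy_unfairness, toy_utility,
                                    min_utility=0.9)
```

In this toy setting the utility floor prevents the search from dropping the two most useful neurons, so only the biased but low-utility neurons end up in the mask. With a real network, each call to the scoring functions would involve a forward pass over a validation set.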
Dance of the ADS: Orchestrating Failures through Historically-Informed Scenario Fuzzing [pdf]
Authors: Tong Wang, Taotao Gu, Huan Deng, Hu Li, Xiaohui Kuang, Gang Zhao
As autonomous driving systems (ADS) advance towards higher levels of autonomy, orchestrating their safety verification becomes increasingly intricate. This paper unveils ScenarioFuzz, a pioneering scenario-based fuzz testing methodology. Designed like a choreographer who understands the past performances, it uncovers vulnerabilities in ADS without the crutch of predefined scenarios. Leveraging map road networks, such as OPENDRIVE, we extract essential data to form a foundational scenario seed corpus. This corpus, enriched with pertinent information, provides the necessary boundaries for fuzz testing in the absence of starting scenarios. Our approach integrates specialized mutators and mutation techniques, combined with a graph neural network model, to predict and filter out high-risk scenario seeds, optimizing the fuzzing process using historical test data. Compared to other methods, our approach reduces the time cost by an average of 60.3%, while the number of error scenarios discovered per unit of time increases by 103%. Furthermore, we propose a self-supervised collision trajectory clustering method, which aids in identifying and summarizing 54 high-risk scenario categories prone to inducing ADS faults. Our experiments have successfully uncovered 58 bugs across six tested systems, emphasizing the critical safety concerns of ADS.
Blockchain-based PKI within a Corporate Organization: Advantages and Challenges [pdf]
Authors: Julian Springer, Philipp Haindl
This research investigates the potential use of a blockchain-based Public Key Infrastructure (PKI) within an organization and compares it to conventional PKI systems. The goal is to assess the advantages and disadvantages of both approaches in order to determine the feasibility of employing blockchain technology for a decentralized PKI. The study will also evaluate the impact of current legal frameworks, such as the Cyber Resilience Act (CRA) and NIS-2 Directive. The study will examine various implementations of blockchain PKIs based on factors such as security, performance, and platform. The results indicate that blockchain-based PKIs can overcome the limitations of conventional PKIs by decentralizing the trust anchor, providing greater security. Blockchain technology allows for the immutable and transparent management of certificates, making tampering significantly more challenging. Additionally, blockchain-based PKIs offer enhanced mechanisms for identifying and addressing certificate misconduct.
Socialz: Multi-Feature Social Fuzz Testing [pdf]
Authors: Francisco Zanartu, Christoph Treude, Markus Wagner
Online social networks have become an integral aspect of our daily lives and play a crucial role in shaping our relationships with others. However, bugs and glitches, even minor ones, can cause anything from frustrating problems to serious data leaks that can have far-reaching impacts on millions of users. To mitigate these risks, fuzz testing, a method of testing with randomised inputs, can provide increased confidence in the correct functioning of a social network. However, implementing traditional fuzz testing methods can be prohibitively difficult or impractical for programmers outside of the social network's development team. To tackle this challenge, we present Socialz, a novel approach to social fuzz testing that (1) characterises real users of a social network, (2) diversifies their interaction using evolutionary computation across multiple, non-trivial features, and (3) collects performance data as these interactions are executed. With Socialz, we aim to put social testing tools in everybody's hands, thereby improving the reliability and security of social networks used worldwide. In our study, we came across (1) one known limitation of the current GitLab CE and (2) 6,907 errors, of which 40.16% are beyond our debugging skills.
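As a reminder of what "testing with randomised inputs" means at its simplest, the sketch below fuzzes a single function with random strings. It does not reproduce Socialz, which works at the level of user interactions and uses evolutionary computation; `parse_handle`, its deliberate bug, and the input alphabet are hypothetical.

```python
import random

def parse_handle(text):
    """Toy function under test (hypothetical): normalise a '@user' handle."""
    if not text.startswith("@"):
        raise ValueError("handle must start with '@'")
    # Deliberate bug: indexing fails when the input is exactly "@".
    return text[1].lower() + text[2:]

def fuzz(target, trials=5000, seed=0):
    """Call `target` with short random strings and record unexpected crashes."""
    rng = random.Random(seed)
    alphabet = "@aB_ 1"
    crashes = []
    for _ in range(trials):
        length = rng.randint(0, 4)
        text = "".join(rng.choice(alphabet) for _ in range(length))
        try:
            target(text)
        except ValueError:
            pass                          # documented rejection, not a finding
        except Exception as exc:          # any other exception is a finding
            crashes.append((text, type(exc).__name__))
    return crashes

crashes = fuzz(parse_handle)
```

Every recorded crash here comes from the single input `"@"`, which triggers an `IndexError`. A social fuzzer would replace the random strings with generated sequences of user actions and would collect performance data in addition to failures.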
Depends-Kotlin: A Cross-Language Kotlin Dependency Extractor [pdf]
Authors: Qiong Feng, Xiaotian Ma, Huan Ji, Wei Song, Peng Liang
Since Google introduced Kotlin as an official programming language for developing Android apps in 2017, Kotlin has gained widespread adoption in Android development. However, compared to Java, there is limited support for Kotlin code dependency analysis, which is the foundation of software analysis. To bridge this gap, we develop Depends-Kotlin to extract entities and their dependencies in Kotlin source code. Not only does Depends-Kotlin support extracting entities' dependencies in Kotlin code, but it can also extract dependency relations between Kotlin and Java. The extraction of such cross-language dependencies can help developers understand the migration process from Java to Kotlin. Using three open-source mixed Kotlin-Java projects as our subjects, Depends-Kotlin demonstrates high accuracy and performance in resolving Kotlin-Kotlin and Kotlin-Java dependency relations. The source code of Depends-Kotlin and the dataset used have been made available at https://github.com/XYZboom/depends-kotlin. We also provide a screencast presenting Depends-Kotlin at https://youtu.be/ZPq8SRhgXzM.
Localized RETE for Incremental Graph Queries [pdf]
Authors: Matthias Barkowsky, Holger Giese
Context: The growing size of graph-based modeling artifacts in model-driven engineering calls for techniques that enable efficient execution of graph queries. Incremental approaches based on the RETE algorithm provide an adequate solution in many scenarios, but are generally designed to search for query results over the entire graph. However, in certain situations, a user may only be interested in query results for a subgraph, for instance when a developer is working on a large model of which only a part is loaded into their workspace. In this case, the global execution semantics can result in significant computational overhead.
Contribution: To mitigate the outlined shortcoming, in this paper we propose an extension of the RETE approach that enables local, yet fully incremental execution of graph queries, while still guaranteeing completeness of results with respect to the relevant subgraph.
Results: We empirically evaluate the presented approach via experiments inspired by a scenario from software development and an independent social network benchmark. The experimental results indicate that the proposed technique can significantly improve performance regarding memory consumption and execution time in favorable cases, but may incur a noticeable linear overhead in unfavorable cases.
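The idea of incremental query execution referred to in this abstract can be illustrated with a much simpler case than a RETE network. The sketch below is neither RETE nor the paper's localized extension: it maintains the results of one fixed query (all two-hop paths a -> b -> c) as edges are added, updating only the matches that involve the new edge. The class name `TwoHopQuery` and the example graph are hypothetical.

```python
from collections import defaultdict

class TwoHopQuery:
    """Incrementally maintains all paths a -> b -> c in a directed graph."""

    def __init__(self):
        self.out_edges = defaultdict(set)   # node -> set of successors
        self.in_edges = defaultdict(set)    # node -> set of predecessors
        self.matches = set()                # current query results (a, b, c)

    def add_edge(self, src, dst):
        if dst in self.out_edges[src]:
            return                          # edge already known, nothing changes
        self.out_edges[src].add(dst)
        self.in_edges[dst].add(src)
        # The new edge as the first hop: src -> dst -> c
        for c in self.out_edges[dst]:
            self.matches.add((src, dst, c))
        # The new edge as the second hop: a -> src -> dst
        for a in self.in_edges[src]:
            self.matches.add((a, src, dst))

q = TwoHopQuery()
q.add_edge("a", "b")    # no two-hop path yet
q.add_edge("b", "c")    # completes a -> b -> c
q.add_edge("b", "d")    # completes a -> b -> d
```

Each `add_edge` call touches only the neighbours of the new edge instead of re-evaluating the query over the whole graph. This sketch still tracks matches globally, which is the behaviour the paper aims to restrict to a relevant subgraph.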
VITAMIN: A Compositional Framework for Model Checking of Multi-Agent Systems [pdf]
Authors: Angelo Ferrando, Vadim Malvone
The verification of Multi-Agent Systems (MAS) poses a significant challenge. Various approaches and methodologies exist to address this challenge; however, tools that support them are not always readily available. Even when such tools are accessible, they tend to be hard-coded, lacking in compositionality, and challenging to use due to a steep learning curve. In this paper, we introduce a methodology designed for the formal verification of MAS in a modular and versatile manner, along with an initial prototype, which we named VITAMIN. Unlike existing verification methodologies and frameworks for MAS, VITAMIN is constructed for easy extension to accommodate various logics (for specifying the properties to verify) and models (for determining what such properties are verified against).