arXiv NEWS (2024-06-27) #150

github-actions · 2024-06-27T09:04:55Z

cs.SE updates on arXiv.org, Thu, 27 Jun 2024 04:00:37 +0000

A Context-Driven Approach for Co-Auditing Smart Contracts with The Support of GPT-4 code interpreter [pdf]

Authors: Mohamed Salah Bouafif, Chen Zheng, Ilham Ahmed Qasse, Ed Zulkoski, Mohammad Hamdaqa, Foutse Khomh

Tags: LLM

The surge in the adoption of smart contracts necessitates rigorous auditing to ensure their security and reliability. Manual auditing, although comprehensive, is time-consuming and heavily reliant on the auditor's expertise. With the rise of Large Language Models (LLMs), there is growing interest in leveraging them to assist auditors in the auditing process (co-auditing). However, the effectiveness of LLMs in smart contract co-auditing is contingent upon the design of the input prompts, especially in terms of context description and code length. This paper introduces a novel context-driven prompting technique for smart contract co-auditing. Our approach employs three techniques for context scoping and augmentation, encompassing code scoping to chunk long code into self-contained code segments based on code inter-dependencies, assessment scoping to enhance context description based on the target assessment goal, thereby limiting the search space, and reporting scoping to force a specific format for the generated response. Through empirical evaluations on publicly available vulnerable contracts, our method demonstrated a detection rate of 96% for vulnerable functions, outperforming the native prompting approach, which detected only 53%. To assess the reliability of our prompting approach, manual analysis of the results was conducted by expert auditors from our partner, Quantstamp, a world-leading smart contract auditing company. The experts' analysis indicates that, in unlabeled datasets, our proposed approach enhances the proficiency of the GPT-4 code interpreter in detecting vulnerabilities.

EHR-Based Mobile and Web Platform for Chronic Disease Risk Prediction Using Large Language Multimodal Models [pdf]

Authors: Chun-Chieh Liao, Wei-Ting Kuo, I-Hsuan Hu, Yen-Chen Shih, Jun-En Ding, Feng Liu, Fang-Ming Hung

Tags: LLM

Traditional diagnosis of chronic diseases involves in-person consultations with physicians to identify the disease. However, there is a lack of research focused on predicting and developing application systems using clinical notes and blood test values. We collected five years of Electronic Health Records (EHRs) from Taiwan's hospital database between 2017 and 2021 as an AI database. Furthermore, we developed an EHR-based chronic disease prediction platform utilizing Large Language Multimodal Models (LLMMs), successfully integrating with frontend web and mobile applications for prediction. This prediction platform can also connect to the hospital's backend database, providing physicians with real-time risk assessment diagnostics. The demonstration link can be found at https://www.youtube.com/watch?v=oqmL9DEDFgA.

An Empirical Study of Unit Test Generation with Large Language Models [pdf]

Authors: Lin Yang, Chen Yang, Shutao Gao, Weijing Wang, Bo Wang, Qihao Zhu, Xiao Chu, Jianyi Zhou, Guangtai Liang, Qianxiang Wang, Junjie Chen

Tags: LLM

Unit testing is an essential activity in software development for verifying the correctness of software components. However, manually writing unit tests is challenging and time-consuming. The emergence of Large Language Models (LLMs) offers a new direction for automating unit test generation. Existing research primarily focuses on closed-source LLMs (e.g., ChatGPT and CodeX) with fixed prompting strategies, leaving the capabilities of advanced open-source LLMs with various prompting settings unexplored. Particularly, open-source LLMs offer advantages in data privacy protection and have demonstrated superior performance in some tasks. Moreover, effective prompting is crucial for maximizing LLMs' capabilities. In this paper, we conduct the first empirical study to fill this gap, based on 17 Java projects, five widely-used open-source LLMs with different structures and parameter sizes, and comprehensive evaluation metrics. Our findings highlight the significant influence of various prompt factors, show the performance of open-source LLMs compared to the commercial GPT-4 and the traditional Evosuite, and identify limitations in LLM-based unit test generation. We then derive a series of implications from our study to guide future research and practical use of LLM-based unit test generation.

MALSIGHT: Exploring Malicious Source Code and Benign Pseudocode for Iterative Binary Malware Summarization [pdf]

Authors: Haolang Lu, Hongrui Peng, Guoshun Nan, Jiaoyang Cui, Cheng Wang, Weifei Jin

Tags: LLM

Binary malware summarization aims to automatically generate human-readable descriptions of malware behaviors from executable files, facilitating tasks like malware cracking and detection. Previous methods based on Large Language Models (LLMs) have shown great promise. However, they still face significant issues, including poor usability, inaccurate explanations, and incomplete summaries, primarily due to the obscure pseudocode structure and the lack of malware training summaries. Further, calling relationships between functions, which involve the rich interactions within a binary malware, remain largely underexplored. To this end, we propose MALSIGHT, a novel code summarization framework that can iteratively generate descriptions of binary malware by exploring malicious source code and benign pseudocode. Specifically, we construct the first malware summaries, MalS and MalP, using an LLM and manually refine this dataset with human effort. At the training stage, we tune our proposed MalT5, a novel LLM-based code model, on the MalS dataset and a benign pseudocode dataset. Then, at the test stage, we iteratively feed the pseudocode functions into MalT5 to obtain the summary. Such a procedure facilitates the understanding of pseudocode structure and captures the intricate interactions between functions, thereby benefiting the usability, accuracy, and completeness of summaries. Additionally, we propose a novel evaluation benchmark, BLEURT-sum, to measure the quality of summaries. Experiments on three datasets show the effectiveness of the proposed MALSIGHT. Notably, our proposed MalT5, with only 0.77B parameters, delivers comparable performance to much larger ChatGPT3.5.

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [pdf]

Authors: Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, David Lo, Binyuan Hui, Niklas Muennighoff, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro Von Werra

Tags: LLM

Automated software engineering has been greatly empowered by the recent advances in Large Language Models (LLMs) for programming. While current benchmarks have shown that LLMs can perform various software engineering tasks like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks. Solving challenging and practical programming tasks requires the capability of utilizing diverse function calls as tools to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task needs compositional reasoning by accurately understanding complex instructions. Fulfilling both of these characteristics can pose a great challenge for LLMs. To assess how well LLMs can solve challenging and practical programming tasks, we introduce Bench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained programming tasks. To evaluate LLMs rigorously, each programming task encompasses 5.6 test cases with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of Bench, Benchi, that automatically transforms the original docstrings into short instructions only with essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%. The results underscore the need for further advancements in this area.

A Large-scale Investigation of Semantically Incompatible APIs behind Compatibility Issues in Android Apps [pdf]

Authors: Shidong Pan, Tianchen Guo, Lihong Zhang, Pei Liu, Zhenchang Xing, Xiaoyu Sun

Tags: LLM

Application Programming Interface (API) incompatibility is a long-standing issue in Android application development. The rapid evolution of Android APIs results in a significant number of API additions, removals, and changes between adjacent versions. Unfortunately, this high frequency of alterations may lead to compatibility issues, often without adequate notification to developers regarding these changes. Although researchers have proposed some work on detecting compatibility issues caused by changes in API signatures, they often overlook compatibility issues stemming from sophisticated semantic changes. In response to this challenge, we conducted a large-scale discovery of incompatible APIs in the Android Open Source Project (AOSP) by leveraging static analysis and pre-trained Large Language Models (LLMs) across adjacent versions. We systematically formulate the problem and propose a unified framework to detect incompatible APIs, especially for semantic changes. It's worth highlighting that our approach achieves a 0.83 F1-score in identifying semantically incompatible APIs in the Android framework. Ultimately, our approach detects 5,481 incompatible APIs spanning from version 4 to version 33. We further demonstrate its effectiveness in supplementing the state-of-the-art methods in detecting a broader spectrum of compatibility issues (+92.3%) that have been previously overlooked.

Transforming Software Development: Evaluating the Efficiency and Challenges of GitHub Copilot in Real-World Projects [pdf]

Authors: Ruchika Pandey, Prabhat Singh, Raymond Wei, Shaila Shankar

Generative AI technologies promise to transform the product development lifecycle. This study evaluates the efficiency gains, areas for improvement, and emerging challenges of using GitHub Copilot, an AI-powered coding assistant. We identified 15 software development tasks and assessed Copilot's benefits through real-world projects on large proprietary code bases. Our findings indicate significant reductions in developer toil, with up to 50% time saved in code documentation and autocompletion, and 30-40% in repetitive coding tasks, unit test generation, debugging, and pair programming. However, Copilot struggles with complex tasks, large functions, multiple files, and proprietary contexts, particularly with C/C++ code. We project a 33-36% time reduction for coding-related tasks in a cloud-first software development lifecycle. This study aims to quantify productivity improvements, identify underperforming scenarios, examine practical benefits and challenges, investigate performance variations across programming languages, and discuss emerging issues related to code quality, security, and developer experience.

Documenting Ethical Considerations in Open Source AI Models [pdf]

Authors: Haoyu Gao, Mansooreh Zahedi, Christoph Treude, Sarita Rosenstock, Marc Cheong

Background: The development of AI-enabled software heavily depends on AI model documentation, such as model cards, due to different domain expertise between software engineers and model developers. From an ethical standpoint, AI model documentation conveys critical information on ethical considerations along with mitigation strategies for downstream developers to ensure the delivery of ethically compliant software. However, knowledge on such documentation practice remains scarce. Aims: The objective of our study is to investigate how developers document ethical aspects of open source AI models in practice, aiming at providing recommendations for future documentation endeavours. Method: We selected three sources of documentation on GitHub and Hugging Face, and developed a keyword set to identify ethics-related documents systematically. After filtering an initial set of 2,347 documents, we identified 265 relevant ones and performed thematic analysis to derive the themes of ethical considerations. Results: Six themes emerge, with the three largest ones being model behavioural risks, model use cases, and model risk mitigation. Conclusions: Our findings reveal that open source AI model documentation focuses on articulating ethical problem statements and use case restrictions. We further provide suggestions to various stakeholders for improving documentation practice regarding ethical considerations.

Innovating for Tomorrow: The Convergence of SE and Green AI [pdf]

Authors: Lu'is Cruz, Xavier Franch Gutierrez, Silverio Mart'inez-Fern'andez

The latest advancements in machine learning, specifically in foundation models, are revolutionizing the frontiers of existing software engineering (SE) processes. This is a bi-directional phenomona, where 1) software systems are now challenged to provide AI-enabled features to their users, and 2) AI is used to automate tasks within the software development lifecycle. In an era where sustainability is a pressing societal concern, our community needs to adopt a long-term plan enabling a conscious transformation that aligns with environmental sustainability values. In this paper, we reflect on the impact of adopting environmentally friendly practices to create AI-enabled software systems and make considerations on the environmental impact of using foundation models for software development.

Sanskrit Knowledge-based Systems: Annotation and Computational Tools [pdf]

Authors: Hrishikesh Terdalkar

We address the challenges and opportunities in the development of knowledge systems for Sanskrit, with a focus on question answering. By proposing a framework for the automated construction of knowledge graphs, introducing annotation tools for ontology-driven and general-purpose tasks, and offering a diverse collection of web-interfaces, tools, and software libraries, we have made significant contributions to the field of computational Sanskrit. These contributions not only enhance the accessibility and accuracy of Sanskrit text analysis but also pave the way for further advancements in knowledge representation and language processing. Ultimately, this research contributes to the preservation, understanding, and utilization of the rich linguistic information embodied in Sanskrit texts.

Towards View-based Development of Quantum Software [pdf]

Authors: Joshua Ammermann, Wolfgang Mauerer, Ina Schaefer

Quantum computing is an interdisciplinary field that relies on the expertise of many different stakeholders. The views of various stakeholders on the subject of quantum computing may differ, thereby complicating communication. To address this, we propose a view-based quantum development approach based on a Single Underlying Model (SUM) and a supporting quantum Integrated Development Environment (IDE). We highlight emerging challenges for future research.

APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets [pdf]

Authors: Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, Rithesh Murthy, Liangwei Yang, Silvio Savarese, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, Caiming Xiong

The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scalable and structured manner. Each data in our dataset is verified through three hierarchical stages: format checking, actual function executions, and semantic verification, ensuring its reliability and correctness. We demonstrate that models trained with our curated datasets, even with only 7B parameters, can achieve state-of-the-art performance on the Berkeley Function-Calling Benchmark, outperforming multiple GPT-4 models. Moreover, our 1B model achieves exceptional performance, surpassing GPT-3.5-Turbo and Claude-3 Haiku. We release a dataset containing 60,000 high-quality entries, aiming to advance the field of function-calling agent domains. The dataset is available on Huggingface: https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k and the project homepage: https://apigen-pipeline.github.io/

On the Contents and Utility of IoT Cybersecurity Guidelines [pdf]

Authors: Jesse Chen, Dharun Anandayuvaraj, James C Davis, Sazzadur Rahaman

Cybersecurity concerns of Internet of Things (IoT) devices and infrastructure are growing each year. In response, organizations worldwide have published IoT security guidelines to protect their citizens and customers by providing recommendations on the development and operation of IoT systems. While these guidelines are being adopted, e.g. by US federal contractors, their content and merits have not been critically examined. Specifically, we do not know what topics and recommendations they cover and their effectiveness at preventing real-world IoT failures. In this paper, we address these gaps through a qualitative study of guidelines. We collect 142 IoT cybersecurity guidelines and sample them for recommendations until reaching saturation at 25 guidelines. From the resulting 958 unique recommendations, we iteratively develop a hierarchical taxonomy following grounded theory coding principles and study the guidelines' comprehensiveness. In addition, we evaluate the actionability and specificity of each recommendation and match recommendations to CVEs and security failures in the news they can prevent. We report that: (1) Each guideline has gaps in its topic coverage and comprehensiveness; (2) 87.2% recommendations are actionable and 38.7% recommendations can prevent specific threats; and (3) although the union of the guidelines mitigates all 17 of the failures from our news stories corpus, 21% of the CVEs evade the guidelines. In summary, we report shortcomings in each guideline's depth and breadth, but as a whole they address major security issues.

Guarantees in Security: A Philosophical Perspective [pdf]

Authors: Marcel B"ohme

Research in cybersecurity may seem reactive, specific, ephemeral, and indeed ineffective. Despite decades of innovation in defense, even the most critical software systems turn out to be vulnerable to attacks. Time and again. Offense and defense forever on repeat. Even provable security, meant to provide an indubitable guarantee of security, does not stop attackers from finding security flaws. As we reflect on our achievements, we are left wondering: Can security be solved once and for all?
In this paper, we take a philosophical perspective and develop the first theory of cybersecurity that explains what fundamentally prevents us from making reliable statements about the security of a software system. We substantiate each argument by demonstrating how the corresponding challenge is routinely exploited to attack a system despite credible assurances about the absence of security flaws. To make meaningful progress in the presence of these challenges, we introduce a philosophy of cybersecurity.

The text was updated successfully, but these errors were encountered:

github-actions bot added the news label Jun 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

arXiv NEWS (2024-06-27) #150

arXiv NEWS (2024-06-27) #150

github-actions bot commented Jun 27, 2024

arXiv NEWS (2024-06-27) #150

arXiv NEWS (2024-06-27) #150

Comments

github-actions bot commented Jun 27, 2024

cs.SE updates on arXiv.org, Thu, 27 Jun 2024 04:00:37 +0000

A Context-Driven Approach for Co-Auditing Smart Contracts with The Support of GPT-4 code interpreter [pdf]

EHR-Based Mobile and Web Platform for Chronic Disease Risk Prediction Using Large Language Multimodal Models [pdf]

An Empirical Study of Unit Test Generation with Large Language Models [pdf]

MALSIGHT: Exploring Malicious Source Code and Benign Pseudocode for Iterative Binary Malware Summarization [pdf]

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [pdf]

A Large-scale Investigation of Semantically Incompatible APIs behind Compatibility Issues in Android Apps [pdf]

Transforming Software Development: Evaluating the Efficiency and Challenges of GitHub Copilot in Real-World Projects [pdf]

Documenting Ethical Considerations in Open Source AI Models [pdf]

Innovating for Tomorrow: The Convergence of SE and Green AI [pdf]

Sanskrit Knowledge-based Systems: Annotation and Computational Tools [pdf]

Towards View-based Development of Quantum Software [pdf]

APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets [pdf]

On the Contents and Utility of IoT Cybersecurity Guidelines [pdf]

Guarantees in Security: A Philosophical Perspective [pdf]