International Society of Science and Applied Technologies |
On the Consistency and Variety of Automated C Source Code Generation for Training Software Defect Detectors |
Author | Mamoru Ohara |
Co-Author(s) | Yuto Sugawa; Chisato Murakami |
Abstract | In recent years, machine learning (ML) techniques have become popular for detecting software bugs. A common challenge in ML-based bug detection is class imbalance in the training data: incorrect samples (those containing bugs) are scarce compared with the abundance of correct ones, which degrades model performance. To address this issue, researchers have proposed artificially injecting bugs into correct samples. Beyond class balance, the diversity of the training data also significantly affects ML performance. Our work focuses on generating varied incorrect samples that stem from a single root cause (bug). Specifically, we plan to inject bugs into LLVM IR code and translate it into source code written in a high-level programming language. To obtain diversity, we use probabilistic language models in the translator. In this paper, we present an IR-to-C translator based on seq2seq, trained on real-world software code, and explore the consistency and diversity of the generated samples. |
Keywords | Machine learning, Software defect detection, Software defect injection, C language, LLVM IR, Sequence to sequence |
Article #: RQD2025-241 |
Proceedings of the 30th ISSAT International Conference on Reliability & Quality in Design |