On the Consistency and Variety of Automated C Source Code Generation for Training Software Defect Detectors

Author	Mamoru Ohara
Co-Author(s)	Yuto Sugawa; Chisato Murakami
Abstract	In recent years, machine learning (ML) techniques have become popular for detecting software bugs. A common challenge in ML-based bug detection, however, is class imbalance in the training data: incorrect samples (those containing bugs) are scarce compared with the abundant correct samples, which degrades ML model performance. To address this issue, researchers have proposed artificially injecting bugs into correct samples. Beyond class balance, the diversity of the training data also significantly affects ML performance. Our work focuses on generating varied incorrect samples that stem from a single root cause (bug). Specifically, we plan to inject bugs into LLVM IR code and translate the result into source code written in a high-level programming language. To obtain diversity, we use probabilistic language models in the translator. In this paper, we present an IR-to-C translator based on seq2seq, trained on real-world software code, and examine the consistency and diversity of the samples it generates.
Keywords	Machine learning, Software defect detection, Software defect injection, C language, LLVM IR, Sequence to sequence

		Article #: RQD2025-241

Proceedings of 30th ISSAT International Conference on Reliability & Quality in Design
August 6-8, 2025

	International Society of Science and Applied Technologies