On the Consistency and Variety of Automated C Source Code Generation for Training Software Defect Detectors  
Author Mamoru Ohara

 

Co-Author(s) Yuto Sugawa; Chisato Murakami

 

Abstract In recent years, machine learning (ML) techniques have become popular for detecting software bugs. However, a common challenge in ML-based bug detection is class imbalance in the training data: incorrect samples (those containing bugs) are scarce compared with the abundance of correct samples, which degrades ML model performance. To address this issue, researchers have proposed artificially injecting bugs into correct samples. Beyond class balance, the diversity of the training data also significantly affects ML performance. Our work focuses on generating a variety of incorrect samples that stem from a single root cause (bug). Specifically, we plan to inject bugs into LLVM IR code and translate the result into source code written in a high-level programming language. To obtain diversity, we use probabilistic language models in the translator. In this paper, we present an IR-to-C translator based on seq2seq, trained on real-world software code, and explore the consistency and diversity of the samples it generates.

 

Keywords Machine learning, Software defect detection, Software defect injection, C language, LLVM IR, Sequence to sequence
   
Article #: RQD2025-241
 

Proceedings of 30th ISSAT International Conference on Reliability & Quality in Design
August 6-8, 2025