Filter-Based Feature Subset Selection and Data Sampling for Fault-Prone Program Module Prediction  
Author Kehan Gao


Co-Author(s) Taghi M. Khoshgoftaar; Amri Napolitano


Abstract Classification models are effective tools for software quality prediction, helping practitioners identify potentially problematic modules and assign limited project resources intelligently. However, two problems, high dimensionality and class imbalance, may degrade classifier performance. In this study, we propose a data pre-processing approach in which feature selection is combined with data sampling to overcome these problems. More specifically, we investigate two filter-based feature subset selection techniques, i.e., correlation-based and consistency-based subset evaluation, and three data sampling methods, i.e., random undersampling, random oversampling, and synthetic minority oversampling. We explore the effect of the feature selection techniques, the sampling methods, and their interactions on the performance of classification models. The empirical study was carried out on three datasets from a real-world software system. The results demonstrate that the correlation-based subset evaluation technique outperforms the consistency-based method when both are used with a random sampling method; however, when synthetic minority oversampling is employed, the consistency-based technique may be a better choice.
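As an illustration of the two random sampling methods named in the abstract, the following is a minimal sketch and not the authors' implementation. The function names, the toy data, and the choice to balance the classes to equal size are all assumptions made for this example; the paper may use different target class ratios.

```python
import random


def random_undersample(majority, minority, seed=0):
    """Randomly discard majority-class rows until both classes have equal size.

    Assumes len(majority) >= len(minority); equal-size balancing is an
    assumption of this sketch, not a setting taken from the paper.
    """
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)


def random_oversample(majority, minority, seed=0):
    """Randomly duplicate minority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(majority), list(minority) + extra


# Hypothetical toy data: 8 not-fault-prone (nfp) and 2 fault-prone (fp) modules.
nfp = [("nfp", i) for i in range(8)]
fp = [("fp", i) for i in range(2)]

under_nfp, under_fp = random_undersample(nfp, fp)
over_nfp, over_fp = random_oversample(nfp, fp)
```

Undersampling shrinks the training set and can lose information from the majority class, while oversampling keeps every original row at the cost of exact duplicates. Synthetic minority oversampling instead generates new minority examples by interpolating between existing ones and is not shown here.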


Keywords software quality prediction, feature selection, data sampling, subset selection
Article #: 21114
Proceedings of the 21st ISSAT International Conference on Reliability and Quality in Design
August 6-8, 2015 - Philadelphia, Pennsylvania, U.S.A.