Filter-Based Feature Subset Selection and Data Sampling for Fault-Prone Program Module Prediction

Kehan Gao

Author	Kehan Gao
Co-Author(s)	Taghi M. Khoshgoftaar; Amri Napolitano
Abstract	Classification models are effective tools for software quality prediction, helping practitioners identify potentially problematic modules and assign limited project resources intelligently. However, two problems, high dimensionality and class imbalance, may degrade a classifier's performance. In this study, we propose a data pre-processing approach in which feature selection is combined with data sampling to overcome these problems. More specifically, we investigate two filter-based feature subset selection techniques, i.e., correlation-based and consistency-based subset evaluation methods, and three data sampling methods, i.e., random undersampling, random oversampling, and synthetic minority oversampling. We are interested in exploring the effect of the various feature selection techniques, sampling methods, and their interactions on the performance of classification models. The empirical study was carried out on three datasets from a real-world software system. The results demonstrate that the correlation-based subset evaluation technique outperforms the consistency-based method when both are used with a random sampling method; however, when synthetic minority oversampling is employed, the consistency-based technique may be a better choice.
Keywords	software quality prediction, feature selection, data sampling, subset selection
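
As an illustration of the two random sampling methods named in the abstract, the following is a minimal Python sketch, not the authors' implementation. The function names, the toy data, the 50:50 target class ratio, and the convention that label 1 marks the fault-prone (minority) class are all assumptions made for this example; the abstract does not state the settings used in the study.

```python
import random

def split_indices(labels, minority):
    """Return (minority indices, majority indices) for a binary label list."""
    min_idx = [i for i, y in enumerate(labels) if y == minority]
    maj_idx = [i for i, y in enumerate(labels) if y != minority]
    return min_idx, maj_idx

def random_undersample(rows, labels, minority=1, seed=0):
    """Randomly discard majority-class rows until both classes have equal size.

    Assumes the minority class is not larger than the majority class.
    """
    rng = random.Random(seed)
    min_idx, maj_idx = split_indices(labels, minority)
    kept = sorted(min_idx + rng.sample(maj_idx, len(min_idx)))
    return [rows[i] for i in kept], [labels[i] for i in kept]

def random_oversample(rows, labels, minority=1, seed=0):
    """Randomly duplicate minority-class rows until both classes have equal size.

    Assumes at least one minority-class row exists.
    """
    rng = random.Random(seed)
    min_idx, maj_idx = split_indices(labels, minority)
    extra = [rng.choice(min_idx) for _ in range(len(maj_idx) - len(min_idx))]
    idx = list(range(len(labels))) + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]

# Hypothetical toy data: 10 modules, 2 of them fault-prone (label 1).
rows = [[i, i * 2] for i in range(10)]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

under_rows, under_labels = random_undersample(rows, labels)
over_rows, over_labels = random_oversample(rows, labels)
print(len(under_labels), len(over_labels))
```

In a pipeline like the one the abstract describes, a sampling step of this kind would be combined with a feature subset selection step before a classifier is trained; the order and parameters of those steps are not specified here.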

		Article #: 21114

Proceedings of the 21st ISSAT International Conference on Reliability and Quality in Design
August 6-8, 2015 - Philadelphia, Pennsylvania, U.S.A.

	International Society of Science and Applied Technologies