International Society of Science and Applied Technologies
Filter-Based Feature Subset Selection and Data Sampling for Fault-Prone Program Module Prediction
Author | Kehan Gao
Co-Author(s) | Taghi M. Khoshgoftaar; Amri Napolitano
Abstract | Classification models are effective tools for software quality prediction, helping practitioners identify potentially problematic modules and allocate limited project resources intelligently. However, two problems, high dimensionality and class imbalance, may degrade classifier performance. In this study, we propose a data pre-processing approach that combines feature selection with data sampling to overcome these problems. More specifically, we investigate two filter-based feature subset selection techniques (correlation-based and consistency-based subset evaluation) and three data sampling methods (random undersampling, random oversampling, and synthetic minority oversampling). We explore the effect of the feature selection techniques, the sampling methods, and their interactions on the performance of classification models. The empirical study was carried out on three datasets from a real-world software system. The results demonstrate that the correlation-based subset evaluation technique outperforms the consistency-based method when both are used along with a random sampling method; however, when synthetic minority oversampling is employed, the consistency-based technique may be a better choice.
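As a rough illustration of the two random sampling methods named in the abstract, the sketch below balances a two-class dataset by either discarding majority-class rows or duplicating minority-class rows. It is not the authors' implementation: the function names, the toy data, the "fp"/"nfp" labels, and the 50:50 target ratio are all assumptions made for this example, and it assumes the minority class is non-empty and no larger than the majority class.

```python
import random
from collections import Counter


def split_indices(labels, minority):
    """Return (minority, majority) index lists for a two-class label list."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return minority_idx, majority_idx


def random_undersample(rows, labels, minority, seed=0):
    """Randomly discard majority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority_idx, majority_idx = split_indices(labels, minority)
    # Keep every minority row plus an equally sized random majority subset.
    kept = sorted(minority_idx + rng.sample(majority_idx, len(minority_idx)))
    return [rows[i] for i in kept], [labels[i] for i in kept]


def random_oversample(rows, labels, minority, seed=0):
    """Randomly duplicate minority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority_idx, majority_idx = split_indices(labels, minority)
    shortfall = len(majority_idx) - len(minority_idx)
    # Draw minority rows with replacement to make up the shortfall.
    extra = [rng.choice(minority_idx) for _ in range(shortfall)]
    idx = list(range(len(labels))) + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Hypothetical toy data: 8 not-fault-prone ("nfp") and 2 fault-prone ("fp") modules.
rows = [[i, i * 2] for i in range(10)]
labels = ["nfp"] * 8 + ["fp"] * 2

_, under_labels = random_undersample(rows, labels, minority="fp")
_, over_labels = random_oversample(rows, labels, minority="fp")
under_counts = Counter(under_labels)
over_counts = Counter(over_labels)
```

Undersampling shrinks the training set to twice the minority count, while oversampling grows it to twice the majority count. Synthetic minority oversampling differs from both in that it creates new minority instances by interpolating between existing ones rather than copying them, and it is not shown here.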
Keywords | software quality prediction, feature selection, data sampling, subset selection
Article #: 21114
August 6-8, 2015 - Philadelphia, Pennsylvania, U.S.A.