Impact of Data Sampling on Feature Selection Techniques for Software Defect Prediction  
Author Kehan Gao


Co-Author(s) Taghi M. Khoshgoftaar; Amri Napolitano


Abstract In software quality modeling, training datasets often suffer from two problems: (1) high dimensionality and (2) an imbalanced distribution between the two classes (fault-prone and not-fault-prone modules). An effective way to overcome these problems is to perform feature selection and data sampling before building classifiers for software quality prediction. In this study, we investigate 18 filter-based feature ranking techniques and three data sampling approaches, and compare the similarity between each pair of filters under the different sampling techniques. We also compare the prediction performance of every combination of filter and sampling method. The experimental results demonstrate that data sampling increases, on average, the similarity between pairs of feature ranking techniques and improves classification performance when combined with feature selection approaches.
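
The pipeline described in the abstract (sample the data, rank the features with two filters, then measure how similar the two rankings are) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the 18 filters, the three sampling approaches, or the similarity measure, so the code below assumes random undersampling as the sampling step, two stand-in filters (absolute Pearson correlation and an AUC-based score), and top-k overlap as the similarity between two rankings. All function names are hypothetical, and the demo data is synthetic.

```python
import random
from statistics import mean, pstdev


def random_undersample(X, y, seed=0):
    """Assumed sampling step: randomly drop majority-class rows until balanced."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in kept], [y[i] for i in kept]


def correlation_score(values, y):
    """Stand-in filter 1: absolute Pearson correlation with the class label."""
    sx, sy = pstdev(values), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0  # a constant feature (or a single class) carries no signal
    mx, my = mean(values), mean(y)
    cov = mean([(v - mx) * (t - my) for v, t in zip(values, y)])
    return abs(cov / (sx * sy))


def auc_score(values, y):
    """Stand-in filter 2: distance of the single-feature AUC from 0.5."""
    pos = [v for v, t in zip(values, y) if t == 1]
    neg = [v for v, t in zip(values, y) if t == 0]
    if not pos or not neg:
        return 0.0
    # Count (fault-prone, not-fault-prone) pairs ordered correctly; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return abs(wins / (len(pos) * len(neg)) - 0.5)


def rank_features(X, y, score):
    """Return feature indices ordered from highest to lowest filter score."""
    n_features = len(X[0])
    scores = [score([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: -scores[j])


def top_k_overlap(rank_a, rank_b, k):
    """Assumed similarity measure: fraction of shared features in the two top-k lists."""
    return len(set(rank_a[:k]) & set(rank_b[:k])) / k


if __name__ == "__main__":
    # Synthetic, imbalanced toy data: 10 features, about 10% fault-prone modules.
    rng = random.Random(42)
    y = [1 if rng.random() < 0.1 else 0 for _ in range(300)]
    X = [[rng.gauss(label * (j % 3), 1.0) for j in range(10)] for label in y]

    for name, (Xd, yd) in {"original": (X, y),
                           "undersampled": random_undersample(X, y)}.items():
        r1 = rank_features(Xd, yd, correlation_score)
        r2 = rank_features(Xd, yd, auc_score)
        print(name, "top-3 overlap:", top_k_overlap(r1, r2, 3))
```

Comparing the overlap on the original and the undersampled data mirrors, in miniature, the kind of comparison the study reports; the numbers from this toy run say nothing about the paper's actual results.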


Article #: 1844
Proceedings of the 18th ISSAT International Conference on Reliability and Quality in Design
July 26-28, 2012 - Boston, Massachusetts, U.S.A.