Impact of Data Sampling on Feature Selection Techniques for Software Defect Prediction

Author	Kehan Gao
Co-Author(s)	Taghi M. Khoshgoftaar; Amri Napolitano
Abstract	In software quality modeling, training datasets often exhibit two problems: (1) high dimensionality and (2) an imbalanced distribution between the two classes (fault-prone and not-fault-prone modules). An effective way to overcome these problems is to perform feature selection and data sampling before building classifiers for software quality prediction. In this study, we investigate 18 filter-based feature ranking techniques and three data sampling approaches, and compare the similarity between each pair of filters under the different sampling techniques. We also compare the prediction performance of every combination of filter and sampling method. The experimental results demonstrate that, on average, data sampling increases the similarity between two feature ranking techniques, and that it improves classification performance when combined with feature selection.
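
The workflow the abstract describes (sample the training data, rank features with two filters, then measure how similar the rankings are) can be illustrated with a minimal Python sketch. The abstract does not name the 18 filters, the three sampling approaches, or the similarity measure, so everything here is an assumption chosen for illustration: random undersampling stands in for the sampling step, a class-mean gap and an absolute Pearson correlation stand in for the filters, and top-k overlap stands in for the similarity measure. All function names and the toy dataset are hypothetical.

```python
import random
from statistics import mean, pstdev


def undersample(X, y, seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in keep], [y[i] for i in keep]


def rank_by_mean_gap(X, y):
    """Illustrative filter 1: rank features by the gap between class means."""
    scores = []
    for j in range(len(X[0])):
        fp = [row[j] for row, label in zip(X, y) if label == 1]
        nfp = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(fp) - mean(nfp)))
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def rank_by_correlation(X, y):
    """Illustrative filter 2: rank features by |Pearson correlation| with the label."""
    label_mean, label_sd = mean(y), pstdev(y)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        col_mean, col_sd = mean(col), pstdev(col)
        if col_sd == 0 or label_sd == 0:
            scores.append(0.0)  # a constant column carries no signal
            continue
        cov = mean([(c - col_mean) * (l - label_mean) for c, l in zip(col, y)])
        scores.append(abs(cov / (col_sd * label_sd)))
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def top_k_overlap(rank_a, rank_b, k):
    """Assumed similarity measure: fraction of shared features in the two top-k lists."""
    return len(set(rank_a[:k]) & set(rank_b[:k])) / k


# Toy imbalanced dataset (hypothetical): 10 fault-prone vs. 90 not-fault-prone
# modules, 5 features, only feature 0 shifted for the fault-prone class.
rng = random.Random(42)
y = [1] * 10 + [0] * 90
X = [[rng.gauss(2.0 * label, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(4)]
     for label in y]

Xs, ys = undersample(X, y)
similarity_raw = top_k_overlap(rank_by_mean_gap(X, y), rank_by_correlation(X, y), k=3)
similarity_sampled = top_k_overlap(rank_by_mean_gap(Xs, ys), rank_by_correlation(Xs, ys), k=3)
```

Comparing `similarity_raw` with `similarity_sampled` mirrors, on a toy scale, the paper's comparison of filter similarity with and without sampling. It says nothing about the paper's actual results.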

Article #	1844

Proceedings of the 18th ISSAT International Conference on Reliability and Quality in Design
July 26-28, 2012 - Boston, Massachusetts, U.S.A.

	International Society of Science and Applied Technologies