Assessments of Feature Selection with Respect to Data Sampling for Highly Imbalanced Software Measurement Data  
Author Kehan Gao


Co-Author(s) Taghi M. Khoshgoftaar; Lofton A. Bullard


Abstract Classification models are effective tools for software defect prediction. The predictive power of a classification model constructed from a given dataset is affected by a number of factors. In this paper, we are interested in two problems that often arise in software measurement data: high dimensionality and class imbalance (e.g., many more not-fault-prone modules than fault-prone modules in a dataset). We consider using data sampling followed by feature selection to deal with these problems. Six data sampling approaches (three sampling techniques, each combined with two post-sampling proportion ratios) and six commonly used feature ranking methods are employed in this study. We evaluate the feature selection techniques in two ways: (1) a general method, i.e., assessing the classification performance after the training data is modified, and (2) studying the stability of each feature selection method, specifically with the goal of understanding the effect of data sampling techniques on that stability. The experiments were performed on three datasets from a real-world software project. The results demonstrate that the feature selection techniques that most enhance the models’ classification performance do not also show the best stability, and vice versa. In addition, the classification performance is affected more by the sampling techniques themselves than by the post-sampling proportions, whereas the opposite holds for stability.


Keywords software defect prediction, feature selection, data sampling, stability of feature selection
Article #: 20233
Proceedings of the 20th ISSAT International Conference on Reliability and Quality in Design
August 7-9, 2014 - Seattle, Washington, U.S.A.