The Impact of Feature Selection Techniques on a Hybrid Boosting and Data Sampling Approach for Software Quality Estimation  
Author Taghi M. Khoshgoftaar


Co-Author(s) Kehan Gao; Ye Chen


Abstract Two factors that may lower the quality of training data, and therefore degrade classification models, are high dimensionality and class imbalance. Feature (software metric) selection and data sampling are often used to cope with these problems. Feature selection (FS) is the process of choosing, from the original dataset, the most important and relevant attributes that can significantly contribute to the modeling process. Data sampling alters the dataset to change its balance level. A recent study shows that another method, boosting (building multiple models, with each model tuned to work better on instances misclassified by previous models), is also effective for addressing the class imbalance problem. In this paper, we present a technique that uses feature selection followed by a boosting algorithm in the context of software quality estimation. We investigate four FS approaches: individual FS, repetitive sampled FS, sampled ensemble FS, and repetitive sampled ensemble FS, and study their impact on software quality prediction. Ten feature ranking techniques are examined in the case study. We also employ the boosting algorithm to construct classification models without performing FS and use the results as the baseline for comparison. The empirical results demonstrate that 1) FS is important and needed prior to the learning process; 2) the repetitive sampled FS method generally performs similarly to the individual FS technique; and 3) the ensemble filter performs better than or similarly to the average of the individual base rankers.
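The pipeline the abstract describes (filter-based feature ranking, then data sampling, then boosting) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses synthetic data in place of real software metrics, an ANOVA F-score filter as a stand-in for the paper's ten rankers, plain random undersampling, and scikit-learn's AdaBoost rather than RUSBoost.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic imbalanced dataset standing in for software metrics data:
# ~90% negative (not fault-prone) vs ~10% positive (fault-prone) modules.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

# Step 1: feature (metric) selection with a filter-based ranker,
# keeping the top 5 attributes by ANOVA F-score.
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
X_sel = selector.transform(X)

# Step 2: random undersampling of the majority class to balance classes.
rng = np.random.default_rng(0)
majority = np.where(y == 0)[0]
minority = np.where(y == 1)[0]
kept = rng.choice(majority, size=len(minority), replace=False)
idx = np.concatenate([kept, minority])
X_bal, y_bal = X_sel[idx], y[idx]

# Step 3: boosting on the selected, balanced training data.
model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_bal, y_bal)
print("training-set accuracy:", model.score(X_sel, y))
```

RUSBoost itself interleaves the undersampling inside each boosting round instead of doing it once up front; the one-shot version above is kept only for brevity.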


Keywords software quality estimation, feature selection, data sampling, RUSBoost
    Article #:  20228
Proceedings of the 20th ISSAT International Conference on Reliability and Quality in Design
August 7-9, 2014 - Seattle, Washington, U.S.A.