Exploring Area Under the Precision Recall Curve and Random Undersampling to Classify Imbalanced Big Data  
Author: John Hancock

 

Co-Author(s): Taghi M. Khoshgoftaar; Justin M. Johnson

 

Abstract According to our study, Random Undersampling provides no benefit to Area Under the Precision Recall Curve scores in experiments involving the classification of imbalanced Big Data. Many works report a positive impact of Random Undersampling on classification results in terms of other metrics; we provide a counterexample showing that this positive impact does not extend to Area Under the Precision Recall Curve. We perform experiments with XGBoost and Extremely Randomized Trees, two popular open-source classifiers, at six levels of Random Undersampling, on the classification of a dataset with approximately 175 million instances. The outcomes of these experiments enable us to determine the effect of Random Undersampling on Area Under the Precision Recall Curve scores. Our contribution is to execute Machine Learning experiments on a large-scale dataset showing that Random Undersampling does not enhance the performance, in terms of Area Under the Precision Recall Curve, of certain classifiers. To the best of our knowledge, we are the first to report such results for experiments involving a dataset of this scale, and the first to investigate the effect of Random Undersampling on Area Under the Precision Recall Curve in Big Data classification experiments.
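For readers unfamiliar with the technique, Random Undersampling balances a dataset by randomly discarding majority-class instances until a chosen majority-to-minority ratio is reached. The sketch below is a minimal, standalone illustration of that idea; the function name, the `majority_ratio` parameter, and the toy data are our own assumptions for exposition and do not reproduce the paper's experimental setup.

```python
import random
from collections import Counter

def random_undersample(X, y, majority_ratio, seed=0):
    """Randomly discard majority-class instances so the resulting
    majority:minority ratio equals `majority_ratio`.
    (Hypothetical helper for illustration, not the paper's code.)"""
    rng = random.Random(seed)
    counts = Counter(y)
    minority_label, minority_n = min(counts.items(), key=lambda kv: kv[1])
    majority_label = next(lbl for lbl in counts if lbl != minority_label)
    # Number of majority instances to keep at the requested ratio.
    keep_majority = int(minority_n * majority_ratio)
    majority_idx = [i for i, lbl in enumerate(y) if lbl == majority_label]
    kept = set(rng.sample(majority_idx, keep_majority))
    idx = [i for i, lbl in enumerate(y) if lbl == minority_label or i in kept]
    return [X[i] for i in idx], [y[i] for i in idx]

# Example: 95 negatives and 5 positives, undersampled to a 3:1 ratio.
X = list(range(100))
y = [0] * 95 + [1] * 5
X_res, y_res = random_undersample(X, y, majority_ratio=3)
print(Counter(y_res))  # 15 negatives, 5 positives
```

In practice, libraries such as imbalanced-learn offer an equivalent `RandomUnderSampler`; the point here is only that the transformation is a random subsample of the majority class, which motivates studying whether it actually helps metrics like Area Under the Precision Recall Curve.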

 

Keywords Extremely Randomized Trees, XGBoost, Class Imbalance, Big Data, Undersampling, AUPRC
   
Article #: RQD27-71
 

Proceedings of 27th ISSAT International Conference on Reliability & Quality in Design
Virtual Event

August 4-6, 2022