Article, 2016

High dimensional classifiers in the imbalanced case

Computational Statistics & Data Analysis, ISSN 1872-7352, 0167-9473, Volume 98, Pages 46-59, DOI 10.1016/j.csda.2015.12.009

Contributors

Bak, Britta Anker [1]
Jensen, Jens Ledet, ORCID 0000-0002-8776-5764 (Corresponding author) [1]

Affiliations

  1. [1] Aarhus University
     NORA names: AU Aarhus University; University; Denmark; Europe, EU; Nordic; OECD

Abstract

A binary classification problem is imbalanced when the number of samples from the two groups differs. In the high dimensional case, where the number of variables is much larger than the number of samples, imbalance leads to a bias in the classification. The independence classifier is studied theoretically, and based on this analysis, two new classifiers are suggested that can handle any imbalance ratio. The analytical results are supplemented by a simulation study in which the suggested classifiers outperform multiple undersampling in some aspects. For correlated data, the ROAD classifier is considered, and a modification is suggested that handles the bias from imbalanced group sizes.
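
The bias described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's method or its proposed classifiers; it is a minimal illustration under assumptions made here: independent Gaussian variables with known unit variance, a group 2 mean of zero, and a group 1 mean of `delta` in the first `n_signal` coordinates. The function name `simulate` and all parameter names are hypothetical. Under these assumptions, each coordinate shifts the expected score of the independence classifier by -(1/n1 - 1/n2)/2, so with n1 < n2 the plain rule drifts toward the larger group, and adding back p(1/n1 - 1/n2)/2 is one simple way to offset the drift.

```python
import random


def simulate(n1=10, n2=40, p=1000, n_signal=20, delta=1.0, n_test=100, seed=0):
    """Per-group test error of a plain and a shift-corrected independence rule.

    Assumptions (illustrative, not taken from the paper): independent
    Gaussian variables with known unit variance; group 2 has mean zero,
    group 1 has mean `delta` in the first `n_signal` coordinates.
    """
    rng = random.Random(seed)
    mu1 = [delta if j < n_signal else 0.0 for j in range(p)]
    mu2 = [0.0] * p

    def sample(mu):
        # One observation: mean vector plus independent N(0, 1) noise.
        return [m + rng.gauss(0.0, 1.0) for m in mu]

    def mean_vec(rows):
        # Coordinate-wise average of a list of observations.
        return [sum(col) / len(rows) for col in zip(*rows)]

    # Training: estimate the two group means from n1 and n2 samples.
    xbar1 = mean_vec([sample(mu1) for _ in range(n1)])
    xbar2 = mean_vec([sample(mu2) for _ in range(n2)])

    def score(x):
        # Independence (diagonal) rule with unit variances:
        # a positive score assigns x to group 1.
        return sum((xj - 0.5 * (a + b)) * (a - b)
                   for xj, a, b in zip(x, xbar1, xbar2))

    # Expected downward shift of the score caused by the unequal noise
    # in the two estimated means (zero when n1 == n2).
    shift = 0.5 * p * (1.0 / n1 - 1.0 / n2)

    s1 = [score(sample(mu1)) for _ in range(n_test)]  # true group 1
    s2 = [score(sample(mu2)) for _ in range(n_test)]  # true group 2

    def errors(offset):
        e1 = sum(s + offset <= 0 for s in s1) / n_test  # group 1 misclassified
        e2 = sum(s + offset > 0 for s in s2) / n_test   # group 2 misclassified
        return e1, e2

    return {"plain": errors(0.0), "corrected": errors(shift)}


if __name__ == "__main__":
    for name, (e1, e2) in simulate().items():
        print(f"{name}: group 1 error = {e1:.2f}, group 2 error = {e2:.2f}")
```

With the default settings, the plain rule should misclassify most of the minority group while the corrected rule spreads the error more evenly across the two groups. The exact figures depend on the seed and on the assumed signal strength.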

Keywords

analysis, analytical results, bias, binary classification problem, cases, classification, classification problem, classifier, correlated data, data, dimensional case, group, group size, high-dimensional cases, high-dimensional classifiers, imbalance, imbalance ratio, imbalanced case, imbalanced group sizes, independence, independent classifiers, problem, ratio, results, road, road classifier, samples, simulation, simulation study, size, study, undersampling, variables

Data Provider: Digital Science