The title of the article is 'Dealing with Imbalanced Credit Data: A Bayesian Optimization-Based Ensemble Learning Approach'.

Personal credit scores have become an increasingly important factor for financial institutions and a hot topic of research. One of the biggest challenges in this research is the problem of data imbalance. Based on publicly available data from the financial industry, the number of creditworthy customers is far greater than the number of credit-risky customers[1]. Classification models trained on imbalanced data are often dominated by the majority class label. Such models struggle to identify the potential credit risks of new customers, resulting in unqualified customers being approved for loans, which can lead to unrecoverable losses for banks or financial institutions. Therefore, dealing with and modeling imbalanced credit data is crucial.

Existing research on data imbalance primarily follows two approaches: algorithm-level and data-level. At the algorithm level, penalty mechanisms are introduced to optimize model performance and reduce incorrect predictions, thereby enhancing classification accuracy. Examples include cost-sensitive learning[2], ensemble learning[3], and support vector machines[4]. Handling imbalanced datasets through the classification algorithm itself avoids disrupting the original sample space by generating new samples or deleting existing ones. However, the resulting classification models are prone to overfitting. At the data level, resampling is performed on the data before it is fed into the classification model to achieve balance. Resampling algorithms include oversampling methods that increase the number of minority class samples, undersampling methods that decrease the number of majority class samples, and hybrid sampling methods that combine both mechanisms, generating new minority class samples while removing majority class samples.
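As a concrete illustration of the two basic data-level strategies described above, the following is a minimal sketch of random oversampling and random undersampling. The function names and the toy class sizes are illustrative, not from the paper:

```python
import random

def random_oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority samples until both classes
    have the same size (the majority class is left untouched)."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

def random_undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class with the same size
    as the minority class (the minority class is left untouched)."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

# Toy imbalanced data: 100 creditworthy vs. 10 credit-risky customers.
majority = [[float(x), 0.0] for x in range(100)]
minority = [[float(x), 1.0] for x in range(10)]

maj, mino = random_oversample(majority, minority)
print(len(maj), len(mino))  # 100 100
maj, mino = random_undersample(majority, minority)
print(len(maj), len(mino))  # 10 10
```

Oversampling here produces exact duplicates of minority samples, and undersampling discards 90% of the majority class, which previews the drawbacks discussed below.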

Random oversampling increases the number of samples by randomly replicating minority class samples. This method generates duplicate data, and the number of minority class samples it produces is uncertain. He Haibo et al. proposed the Adaptive Synthetic Sampling (ADASYN) algorithm[5], which automatically determines the number of new samples generated for each minority class sample based on the sample distribution, addressing the instability in the number of samples generated by random sampling. However, this algorithm is susceptible to outliers. Undersampling methods balance the class sizes by reducing the number of majority class samples. Models trained on undersampled data often achieve better prediction results than those trained on oversampled data[6]. Jianping Zhang et al. proposed the NearMiss sampling algorithm[7], which addresses imbalanced data distribution using the KNN method. This algorithm selectively removes majority class samples by computing distances between samples, achieving undersampling but at the cost of high computation time.
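The distance-based selection in NearMiss can be sketched as follows. This is a simplified NearMiss-1-style rule (keep the majority samples whose mean distance to their k nearest minority samples is smallest), not the reference implementation; the function name and toy data are illustrative:

```python
import math

def nearmiss1(majority, minority, k=3):
    """NearMiss-1-style undersampling: score each majority sample by its
    mean distance to its k nearest minority samples, then keep only the
    lowest-scoring ones until the classes are balanced."""
    scored = []
    for m in majority:
        nearest = sorted(math.dist(m, p) for p in minority)[:k]
        scored.append((sum(nearest) / len(nearest), m))
    scored.sort(key=lambda t: t[0])  # closest-to-minority first
    return [m for _, m in scored[:len(minority)]]

minority = [[0.0, 0.0], [1.0, 1.0]]
majority = [[0.5, 0.5], [5.0, 5.0], [10.0, 10.0], [0.0, 1.0]]
kept = nearmiss1(majority, minority, k=2)
print(kept)  # the two majority points nearest the minority class
```

The nested distance loop makes the cost O(|majority| x |minority|), which illustrates the high computation time noted above.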

The EasyEnsemble algorithm[8] applies manually configured undersampling to the majority class samples and combines the resulting subsets with the minority class samples to build an ensemble learning model. This addresses data imbalance to some extent, but when minority class samples are scarce, the manually set positive-to-negative sample ratio can leave the classifier insufficiently trained on the majority samples, reducing its reliability. Hybrid sampling combines the advantages of oversampling and undersampling. For instance, Batista et al. proposed the hybrid SMOTETomek algorithm[9] and the SMOTEENN algorithm[10], both built on the classic SMOTE algorithm. The key difference is that the former removes Tomek links among the newly generated samples, while the latter applies KNN prediction to each newly generated sample and, based on the result, deletes or retains it, improving the quality and usability of the new data.
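The SMOTE interpolation step underlying both hybrid methods can be sketched in a few lines: each synthetic sample is a random point on the segment between a minority sample and one of its k nearest minority neighbours. This is a minimal sketch, not the full algorithm (no Tomek-link or ENN cleaning step), and the function name and toy data are illustrative:

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """SMOTE-style oversampling: for each synthetic sample, pick a random
    minority sample x, pick one of its k nearest minority neighbours nb,
    and interpolate x + lam * (nb - x) with lam drawn uniformly from [0, 1]."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        lam = rng.random()
        synthetic.append([xi + lam * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(smote(minority, 5))  # 5 new points inside the minority region
```

Because every synthetic point lies on a segment between existing minority samples, SMOTE avoids exact duplicates but can still place artificial points in ambiguous regions, which is exactly what the Tomek-link and ENN cleaning steps are meant to repair.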

While existing methods for handling imbalanced samples perform well in certain scenarios, they have shortcomings. Oversampling and hybrid sampling algorithms increase the number of minority class samples, potentially introducing duplicate and artificial data. Such data can distort the true sample space and cause the classification algorithm to overfit, decreasing classification accuracy. Random undersampling, by manually adjusting the positive-to-negative sample ratio, not only disrupts the sample space but also risks discarding crucial sample information, lowering the algorithm's classification performance.

To address these issues, this paper proposes a Bayesian Optimization-based Ensemble Learning algorithm (Bys-SMOTE). The Bys-SMOTE algorithm ensures that the number of minority class samples remains constant while using Bayesian optimization to divide majority class samples into multiple subsets. Each majority class sample subset is combined with the minority class samples to train AdaBoost classifiers. Finally, the individual classifiers are combined to form an ensemble model. The output of the ensemble model is then used to adjust the positive and negative sample ratios in each new sample space to achieve optimal proportions.
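The partition-and-ensemble step described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: scikit-learn's AdaBoostClassifier stands in for the sub-classifiers, a plain random partition replaces the Bayesian-optimized split, and the function name is hypothetical:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def bys_smote_ensemble(X_maj, X_min, n_subsets, seed=0):
    """Split the majority class into n_subsets disjoint subsets, pair each
    subset with the full minority class, train one AdaBoost sub-classifier
    per pair, and return a majority-vote predictor over the sub-classifiers."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_maj))
    subsets = np.array_split(X_maj[idx], n_subsets)
    models = []
    for sub in subsets:
        X = np.vstack([sub, X_min])
        y = np.concatenate([np.zeros(len(sub)), np.ones(len(X_min))])
        clf = AdaBoostClassifier(n_estimators=20, random_state=seed)
        models.append(clf.fit(X, y))
    def predict(X):
        votes = np.stack([m.predict(X) for m in models])
        return (votes.mean(axis=0) >= 0.5).astype(int)
    return predict

# Toy data: well-separated majority (label 0) and minority (label 1) classes.
rng = np.random.default_rng(1)
X_maj = rng.normal(0.0, 1.0, (60, 2))
X_min = rng.normal(5.0, 1.0, (12, 2))
predict = bys_smote_ensemble(X_maj, X_min, n_subsets=4)
```

Each sub-classifier sees a far less imbalanced training set (here roughly 15 majority vs. 12 minority samples), while the vote over all subsets retains the information of the full majority class.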

The Bys-SMOTE algorithm, although using training sample subsets smaller than the total samples, retains the overall information after integration. This overcomes the defect of information loss in traditional undersampling methods, thereby improving the classification performance of imbalanced credit data.

1 Bys-SMOTE Ensemble Learning Framework

The Bys-SMOTE ensemble learning framework proposed in this paper consists of the Bys-SMOTE sampling algorithm and the AdaBoost ensemble learning algorithm; its execution flow is depicted in Figure 1. First, the imbalanced financial credit data are divided into majority class samples (Majority Class N) and minority class samples (Minority Class M). The majority class is then divided into n subsets N1, N2, ..., Nn using Bys-SMOTE. Each of these n subsets is combined with Minority Class M to form n training sets, which are fed into classifiers for training, yielding n sub-classifiers Hi. Applying the ensemble learning principle of the AdaBoost algorithm, the n sub-classifiers Hi are combined to obtain the final prediction result Res. Based on Res, the system decides whether to output the ensemble model H*.

Figure 1 Bys-SMOTE Ensemble Learning Framework


1.1 Bayesian Optimization

Bayesian optimization first constructs a probabilistic surrogate model of the function to be optimized. It then uses this model to choose the next point to evaluate, balancing exploration of high-uncertainty regions against exploitation of promising ones. This transforms the problem of finding the optimal sample ratio into finding the global optimum x* of an unknown objective function:

x^{*} = \arg\max_{x \in X} f(x)    (1)

Where:

x represents the parameter to be optimized;

X represents the set of parameters to be optimized;

f(x) represents the objective function.

1.2 Gaussian Process

The positive-to-negative sample ratio can be treated as a continuous parameter. For continuous objectives, Bayesian optimization commonly assumes that the unknown function is drawn from a Gaussian process[12]. A Gaussian process is fully specified by a mean function and a positive semi-definite covariance function:

f(x) \sim \mathcal{GP}\left(m(x), k(x, x')\right)
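The Bayesian optimization loop of Section 1.1 combined with the Gaussian process surrogate above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: it assumes a zero prior mean, a squared-exponential covariance, an upper-confidence-bound acquisition over a grid, and a toy one-dimensional objective standing in for model quality as a function of the sampling ratio; all function names are illustrative:

```python
import numpy as np

def rbf(a, b, ls=0.3):
    """Squared-exponential covariance k(x, x') with length scale ls."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-5):
    """GP posterior mean and standard deviation at x_test, conditioning
    the zero-mean prior f ~ GP(0, k) on the observed (x_train, y_train)."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf(x_train, x_test)
    sol = np.linalg.solve(K, Ks)            # K^{-1} k(x_train, x_test)
    mu = sol.T @ y_train
    var = np.diag(rbf(x_test, x_test)) - np.sum(Ks * sol, axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

def bayes_opt(f, bounds=(0.0, 1.0), n_init=3, n_iter=10, kappa=2.0, seed=0):
    """Minimal Bayesian optimization: evaluate next wherever the
    upper confidence bound mu + kappa * sigma is largest."""
    rng = np.random.default_rng(seed)
    xs = list(rng.uniform(*bounds, n_init))   # random initial evaluations
    ys = [f(x) for x in xs]
    grid = np.linspace(*bounds, 200)
    for _ in range(n_iter):
        mu, sd = gp_posterior(np.array(xs), np.array(ys), grid)
        x_next = grid[np.argmax(mu + kappa * sd)]
        xs.append(x_next)
        ys.append(f(x_next))
    best = int(np.argmax(ys))
    return xs[best], ys[best]

# Toy objective: a score peaked at ratio = 0.6 (purely illustrative).
x_best, y_best = bayes_opt(lambda r: -(r - 0.6) ** 2)
```

The surrogate's uncertainty (sigma) drives exploration early on, while the posterior mean (mu) concentrates later evaluations near the optimum, so far fewer evaluations of f are needed than with a grid search.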
