Journal details
Format: Journal
eISSN: 2543-683X
First published: 30 Mar 2017
Publication frequency: 4 times per year
Languages: English
Open access

A Rebalancing Framework for Classification of Imbalanced Medical Appointment No-show Data

Published online: 27 Jan 2021
Volume & Issue: Volume 6 (2021) - Issue 1 (February 2021)
Pages: 178 - 192
Received: 29 Apr 2020
Accepted: 21 Dec 2020
Introduction

Many machine learning algorithms for classification assume that the classes are well balanced. In practice, datasets from applications such as credit card fraud detection, cancer detection, oil-spill detection in radar images, malicious app detection, and medical diagnosis are imbalanced, which violates the assumption made by traditional classification algorithms. When the class distribution is skewed, the classifier tends to predict the minority class poorly.

Imbalance arises when instances of one class are rare (Fotouhi, Asadi, & Kattan, 2019). In medical diagnosis, for example, cancerous patients are far fewer than non-cancerous patients. Because minority examples are scarce, the classifier has little to learn from and struggles to predict them. Many researchers suggest sampling methods as a remedy, and several studies report improved performance when the data is balanced by sampling before a classification algorithm is applied. In some rare cases, however, a classifier trained on the original, unbalanced data performs comparably to one trained on balanced data. Applications analyzed in this context include customer churn prediction (Amin et al., 2016), imbalanced autism data (El-Sayed et al., 2016), cancer diagnosis, and imbalanced health care data (Zhao, Wong, & Tsui, 2018). Medical appointment no-show data is another significant example. The dataset used in this experiment was extracted from the public dataset hosted by Kaggle (www.kaggle.com).

A hospital offers various services that require patients to book appointments for consultations, surgery, tests, and so on. Most patients attend their appointments, but about 20% do not. A scheduled appointment that is missed without cancellation is called a no-show. No-shows matter to the medical center because a reservation ties up health care providers' time, medical equipment, rooms, etc., and the blocked slot cannot be booked by other patients. The hospital would therefore like to predict no-shows early so that other patients can use those slots. The literature (Kheirkhah et al., 2016) reports that more than 20% of patients do not show up, which causes two problems: a financial loss for the doctor and reduced access for patients who need an appointment. Since no-shows make up only about a fifth of all appointments, the data is imbalanced. Several rebalancing algorithms are available as preprocessing techniques that balance the data before a classifier is applied.

This research aims to build a classifier that predicts (Mohammadi et al., 2018) whether a scheduled appointment will be attended. In hospital data the no-show class is small: most training instances belong to one class (patients who do not miss their appointments), while few belong to the other (no-shows). This situation is called imbalanced data. Applying classification algorithms directly to such data can lead to inaccurate predictions, because the resulting models are biased towards the majority class (patients who do not miss their appointments).

Hence, various undersampling and oversampling techniques were applied to the imbalanced Medical Appointment No-Show dataset to balance the classes before prediction. This paper also compares the performance metrics of widely used oversampling and undersampling methods applied before classification. Random Over Sampling (ROS), Random Under Sampling (RUS), Synthetic Minority Oversampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), Edited Nearest Neighbor (ENN), and Condensed Nearest Neighbor (CNN) are investigated using the medical appointment no-show dataset and the Decision Tree classifier. The sizes of the majority and minority classes before and after sampling were recorded for every technique. The performance of the classifier after sampling was assessed using precision, recall, and Area Under the Curve (AUC). ENN with the decision tree classifier outperforms the other techniques on the chosen medical appointment dataset, and CNN and ADASYN perform comparably well on the imbalanced data.

The purpose of this paper is to balance the classes of the medical appointment no-show data using different sampling methods and to identify which gives the best results. The paper is organized as follows: Section 2 presents the methodology of the rebalancing framework for classifying imbalanced medical appointment no-show data. Section 3 describes the sampling techniques used in the experimental analysis. Section 4 gives the details of the experimental setup, and Section 5 analyzes the results. The conclusion is provided in Section 6.

Methodology

This paper defines a rebalancing framework for the classification of imbalanced medical appointment no-show data. The dataset is split into k folds, each containing 20% of the records of the dataset. For each fold, the data is split into a training set and a test set. The training data is either undersampled or oversampled to generate a balanced training sample. The decision tree classifier is trained on the balanced training data, and its performance is evaluated against the test set. This process is repeated for the different sampling techniques and evaluated using performance metrics for the medical no-show dataset, as shown in Figure 1. The objective is to build a prediction model capable of predicting whether an appointment will be a no-show.
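
The per-fold procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch and not the authors' implementation: the function name `evaluate_fold` and the `sampler`, `fit`, and `score` callables are hypothetical placeholders for a sampling method, a classifier, and a performance metric. The point it illustrates is that rebalancing touches the training split only, while the test split keeps its original class distribution.

```python
def evaluate_fold(X, y, train_idx, test_idx, sampler, fit, score):
    """Rebalance the training split of one fold, train on it, score on the test split."""
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    # Under- or oversampling is applied to the training data only.
    X_bal, y_bal = sampler(X_train, y_train)
    model = fit(X_bal, y_bal)
    # The test split is left untouched, so it keeps the original imbalance.
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return score(model, X_test, y_test)


# Toy usage with stand-in components (assumptions, not the paper's classifier):
# an identity "sampler", a majority-label "model", and accuracy as the score.
X = [[v] for v in range(10)]
y = [0] * 8 + [1] * 2
identity = lambda X_tr, y_tr: (X_tr, y_tr)
majority = lambda X_tr, y_tr: max(set(y_tr), key=y_tr.count)
accuracy = lambda label, X_te, y_te: sum(t == label for t in y_te) / len(y_te)
result = evaluate_fold(X, y, [0, 1, 2, 3, 4, 5, 8], [6, 7, 9], identity, majority, accuracy)
```

Passing a different `sampler` (for example one of the sampling techniques of the next section) is all that changes between the compared configurations.
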

Sampling techniques

Data sampling provides a collection of techniques that transform a training dataset in order to balance the class distribution. Once the data is balanced, standard machine learning algorithms can be trained directly on the transformed dataset without any modification. This allows imbalanced classification, even with severely imbalanced class distributions, to be addressed as a data preparation step. Many different data sampling methods exist, and no single method is best for all classification problems and all classification models. Sampling techniques follow one of three strategies:

Oversampling methods, which increase the number of minority class instances to match the majority class, using techniques such as ROS, SMOTE, and ADASYN.

Undersampling methods, which reduce the majority class to the size of the minority class, using techniques such as RUS, ENN, and CNN.

Hybrid methods, which combine undersampling and oversampling.

Random over sampling

Studies (Lemnaru & Potolea, 2012) mention that ROS adds examples to the minority class, increasing its size until the data is balanced. The minority class is sampled with replacement, and the samples are combined with the original data. This can lead to overfitting, since it makes exact copies of existing instances. The ROS algorithm is described below.

ROS

Step 1: Let D be the original dataset, with Dmin and Dmaj denoting its minority and majority subsets respectively.
Step 2: Create a new set E by appending randomly selected examples from the minority class (with replacement).
Step 3: The balanced dataset is D = Dmin + Dmaj + E.

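
The three ROS steps can be sketched in plain Python as follows. This is a minimal illustration for a binary problem, not the implementation used in the study; the function name `random_oversample` and its parameters are assumptions made for the example.

```python
import random


def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    n_majority = len(y) - len(minority_idx)
    # Step 2: E holds minority examples drawn with replacement.
    n_extra = max(0, n_majority - len(minority_idx))
    extra_idx = [rng.choice(minority_idx) for _ in range(n_extra)]
    # Step 3: D = Dmin + Dmaj + E.
    keep = list(range(len(y))) + extra_idx
    return [X[i] for i in keep], [y[i] for i in keep]


X = [[v] for v in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_oversample(X, y, minority_label=1)
print(y_bal.count(0), y_bal.count(1))  # 8 8
```

Every added row is an exact copy of an existing minority row, which is why the text above warns about overfitting.
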
Random under sampling (RUS)

RUS removes examples from the majority class (Galar et al., 2012), which might discard important information. It keeps all the minority data and randomly selects from the majority class a number of examples equal to the size of the minority class. However, in cases where each example of the majority class is near other examples of the same class, this method might yield better results. The RUS algorithm is given below.

RUS

Step 1: Let D be the original dataset, with Dmin and Dmaj denoting its minority and majority subsets respectively.
Step 2: Create a set E of examples to remove by randomly selecting examples from Dmaj, with or without replacement.
Step 3: The balanced dataset is D = Dmaj + Dmin - E.

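
A comparable sketch for RUS is given below, again for a binary problem and under the assumption that the majority rows are drawn without replacement. The function name `random_undersample` is hypothetical.

```python
import random


def random_undersample(X, y, minority_label, seed=0):
    """Keep every minority row and an equally sized random subset of majority rows."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    majority_idx = [i for i, label in enumerate(y) if label != minority_label]
    # The majority rows that are NOT sampled here play the role of the removed set E.
    n_keep = min(len(minority_idx), len(majority_idx))
    kept_majority = rng.sample(majority_idx, n_keep)
    keep = sorted(minority_idx + kept_majority)
    return [X[i] for i in keep], [y[i] for i in keep]


X = [[v] for v in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_undersample(X, y, minority_label=1)
print(y_bal.count(0), y_bal.count(1))  # 2 2
```
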
Synthetic minority oversampling technique (SMOTE)

ROS duplicates minority instances, whereas SMOTE generates synthetic ones. Each synthetic instance is created on the line segment joining a minority instance to one of its nearest minority-class neighbors, which are identified with the k-nearest-neighbor rule. This reduces overfitting (López et al., 2013) and causes the decision boundaries for the minority class to spread further into the majority class space. Concretely, the difference between the feature vector (sample) under consideration and its nearest neighbor is taken, multiplied by a random number between 0 and 1, and added to the feature vector under consideration. This selects a random point along the line segment between the two feature vectors. For nominal attributes, a random instance is chosen from the k nearest neighbors and its value is assigned to the synthetic instance; this is repeated for each nominal attribute. The disadvantage of SMOTE is that it generates synthetic data without considering nearby majority-class examples, which can lead to overlapping classes.

SMOTE

Step 1: Let D be the original dataset.
Step 2: Create the set $I \subset D$ of minority observations.
Step 3: Choose K, the number of nearest neighbors within the minority class.
Step 4: Choose N, the number of synthetic examples to be created.
Step 5: Draw a random sample of size N from I. For each sampled example $x$, pick one of its K nearest minority neighbors $x_k \in I$ and add to D' the synthetic example
        $x_{new} = x + \mathrm{rand}(0,1) \times (x_k - x)$

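
The interpolation step can be illustrated with the following sketch for purely numeric features. It is a simplified illustration rather than a reference implementation: nominal attributes are not handled, neighbors are found by brute force, and the function name `smote` and its parameters are assumptions. At least two minority points are required.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points on segments between minority points and their neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbours of x (brute force, x itself excluded).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: sq_dist(x, p))[:k]
        x_k = rng.choice(neighbours)
        gap = rng.random()  # random number in [0, 1)
        # x_new = x + gap * (x_k - x) lies on the segment from x to x_k.
        synthetic.append([a + gap * (b - a) for a, b in zip(x, x_k)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_synthetic=10, k=2)
```

Because each new point is a convex combination of two existing minority points, all generated points in this toy example stay inside the unit square spanned by the four originals.
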
Adaptive synthetic sampling (ADASYN)

ADASYN also addresses class imbalance by generating synthetic data, like SMOTE. ADASYN concentrates on the minority samples that are difficult to classify (those with many majority-class examples among their nearest neighbors), whereas SMOTE makes no such distinction. SMOTE generates the same number of synthetic samples for each minority instance, whereas ADASYN uses a density distribution to decide how many synthetic samples to generate for each one. ADASYN does not generate synthetic instances for minority examples that have no majority cases among their k nearest neighbors (He et al., 2008).

ADASYN

Step 1: Let D be the dataset of n labeled examples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$.
Step 2: Calculate the imbalance ratio $IR = |D_{maj}| / |D_{min}|$. If IR exceeds the threshold value, the data is treated as imbalanced.
Step 3: Identify G, the number of synthetic examples to be generated for the minority class, where $\beta \in [0, 1]$ sets the desired balance level:
        $G = (|D_{maj}| - |D_{min}|) \times \beta$
Step 4: For each minority instance $x_i$, find its K nearest neighbors using Euclidean distance and calculate the ratio $r_i$, where $\Delta_i$ is the number of those neighbors that belong to the majority class:
        $r_i = \Delta_i / K$
Step 5: Normalize the ratios: $\hat{r}_i = r_i / \sum_j r_j$
Step 6: Identify the number of synthetic examples $g_i$ to generate for each minority instance:
        $g_i = \hat{r}_i \times G$
Step 7: For each minority instance $x_i$, repeat $g_i$ times: choose a minority neighbor $x_k$ among its K nearest and generate the synthetic example
        $s_i = x_i + \mathrm{rand}(0,1) \times (x_k - x_i)$

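
The ADASYN steps can be sketched as below for numeric features and a binary problem. This is an illustrative sketch only: the function name `adasyn`, the brute-force neighbor search, and the rounding of each $g_i$ to an integer are assumptions of the example, and at least two minority points are required.

```python
import random


def adasyn(X, y, minority_label, k=5, beta=1.0, seed=0):
    """Generate synthetic minority points, favouring those near the majority class."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    def nearest(i, candidates):
        """Indices of the k candidates closest to X[i], excluding i itself."""
        pool = [j for j in candidates if j != i]
        return sorted(pool, key=lambda j: sq_dist(X[i], X[j]))[:k]

    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    n_majority = len(y) - len(minority_idx)
    G = int((n_majority - len(minority_idx)) * beta)            # Step 3

    # Step 4: share of majority-class points among the k nearest neighbours.
    r = [sum(y[j] != minority_label for j in nearest(i, range(len(X)))) / k
         for i in minority_idx]
    total = sum(r)
    if total == 0:  # no minority point has a majority neighbour
        return []

    synthetic = []
    for i, r_i in zip(minority_idx, r):
        g_i = round(r_i / total * G)                            # Steps 5 and 6
        minority_neighbours = nearest(i, minority_idx)
        for _ in range(g_i):                                    # Step 7
            x_k = X[rng.choice(minority_neighbours)]
            lam = rng.random()
            synthetic.append([a + lam * (b - a) for a, b in zip(X[i], x_k)])
    return synthetic


# Three minority points sit far from the majority class; one (5.0) sits inside it.
X = [[0.0], [0.1], [0.2], [5.0], [4.8], [4.9], [5.1], [5.2], [5.3], [5.4], [10.0], [11.0]]
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
new_points = adasyn(X, y, minority_label=1, k=2)
```

In this toy example only the minority point at 5.0 has majority neighbors, so all four synthetic points are generated from it, while the three isolated minority points receive none.
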
Edited nearest neighbour (ENN)

ENN is another undersampling method that works by deleting selected samples. It uses the k-nearest-neighbor rule to locate the samples to delete: every instance in the training set that is misclassified by the k-NN rule is removed.

ENN

Step 1: Let D be the original training set and T the edited set, initially T = D.
Step 2: For each $x_i$ in D, remove $x_i$ from T if it is misclassified by the k-NN rule applied to the other instances of D.

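
The editing rule can be sketched as follows. The sketch follows the description above literally and removes every misclassified instance regardless of its class; the function name `edited_nearest_neighbours`, the majority-vote rule, and the brute-force neighbor search are assumptions of the example.

```python
from collections import Counter


def edited_nearest_neighbours(X, y, k=3):
    """Drop every instance whose label disagrees with the vote of its k nearest neighbours."""

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    keep = []
    for i in range(len(X)):
        others = [j for j in range(len(X)) if j != i]
        neighbours = sorted(others, key=lambda j: sq_dist(X[i], X[j]))[:k]
        predicted = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        if predicted == y[i]:  # correctly classified instances stay in T
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# The last point carries label 0 but lies in the middle of the label-1 cluster.
X = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [11.5]]
y = [0, 0, 0, 0, 1, 1, 1, 0]
X_edit, y_edit = edited_nearest_neighbours(X, y, k=3)
print(y_edit)  # [0, 0, 0, 0, 1, 1, 1]
```
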
Condensed nearest neighbour (CNN)

CNN is an undersampling technique that keeps only a subset of the data able to classify the original dataset correctly using the one-nearest-neighbor rule. A random instance $x_i$ is taken from D to create the subset T. CNN then scans all members of D and adds to T every x whose nearest prototype in T does not match x in label. The algorithm scans D as many times as necessary, until all members have been absorbed or, equivalently, no more prototypes can be added to T.

CNN

Step 1: Let D be the original dataset $(x_1, x_2, x_3, \ldots, x_n)$.
Step 2: Take a random instance $x_i$ from D and create the subset $T = \{x_i\}$.
Step 3: Scan all members of D and add $x \in D$ to T if the class of its nearest neighbor in T does not match the class of x.
Step 4: Repeat Step 3 until a full scan of D adds no new member to T.

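
The condensing loop can be sketched as below. It follows the four steps literally, so instances of either class may be dropped; the function name `condensed_nearest_neighbour` and the brute-force one-nearest-neighbor search are assumptions of the example.

```python
import random


def condensed_nearest_neighbour(X, y, seed=0):
    """Grow a subset T until 1-NN on T classifies every instance of D correctly."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    store = [rng.randrange(len(X))]          # Step 2: T starts with one random instance
    changed = True
    while changed:                           # Step 4: rescan until nothing is added
        changed = False
        for i in range(len(X)):              # Step 3: one scan over D
            if i in store:
                continue
            nearest = min(store, key=lambda j: sq_dist(X[i], X[j]))
            if y[nearest] != y[i]:           # misclassified by T, so absorb it
                store.append(i)
                changed = True
    keep = sorted(store)
    return [X[i] for i in keep], [y[i] for i in keep]


# Two well separated clusters: one prototype per class is enough.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [10.0], [11.0], [12.0]]
y = [0, 0, 0, 0, 0, 1, 1, 1]
X_cond, y_cond = condensed_nearest_neighbour(X, y)
print(sorted(y_cond))  # [0, 1]
```

For clusters that overlap, more prototypes near the class boundary are retained, which is what makes the condensed set consistent with the original data under the one-nearest-neighbor rule.
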
Experimental setup

Imbalanced dataset: The medical appointment dataset, taken from the Kaggle website, is an imbalanced binary classification problem. A sample of the no-show dataset is shown in Figure 2. The dataset contains the appointment ID, gender, scheduled day, appointment day, age, medical details, and whether the patient turned up or not (no-show).

Figure 1

Rebalancing framework for imbalanced data classification.

Figure 2

Sample records of medical appointment no-show dataset.

The most important feature in the dataset is whether the patient showed up for the appointment or was a no-show. The structured dataset used for modeling and evaluation has 110,527 rows and 14 columns, with 20% minority cases (no-show) and 80% majority cases (show). For this experimental study, 8,000 records were taken and the resampling framework was applied to them to identify the best-performing sampling strategy.

Choice of classifier: Several conventional classification algorithms are available in machine learning, such as logistic regression, support vector machines, random forests, and decision trees. The choice of classifier plays a crucial role in the classification and prediction of imbalanced data. Many healthcare applications choose a classifier for its simplicity, interpretability, and computational efficiency. In this study the decision tree, a non-linear algorithm, is chosen as the classifier. It is applied to the data rebalanced by each strategy to predict appointment no-shows. The decision tree was identified as the appropriate classifier for the following reasons:

Successive splitting of the data space leaves very few minority class observations in each partition, so fewer leaves describe the minority class and their confidence estimates are weaker.

Decision trees support weighted samples.

A decision tree does not require parameter tuning.

Decision trees perform well on both numerical and categorical data.

The height of the tree grows logarithmically, so the cost of querying a tree is very low compared to other models (Mehndiratta & Soni, 2019).

Performance metrics: The choice of metrics plays a vital role in evaluating the sampling models for the medical appointment no-show dataset. The chosen metrics should capture the aspects of a model and its predictions that matter most to the experiment. This study concentrates on the following metrics for skewed class distributions:

Recall = True Positive/(True Positive+False Negative)

Precision = True Positive/(True Positive+False Positive)

F1 Score = 2 × Precision × Recall/(Precision + Recall), the harmonic mean of precision and recall

AUCROC indicates how well the predicted probabilities separate the positive class from the negative class.

In assessing skewed distributions, recall denotes the percentage of minority instances that are correctly classified. Precision is the percentage of instances predicted as positive that are truly positive, whereas recall is the percentage of all truly positive instances that the algorithm classifies correctly. The F1 score is the harmonic mean of recall and precision. In many imbalanced scenarios, recall is the primary measure because it captures the rare but significant minority cases. As there is always a tradeoff between recall and specificity, indiscriminately improving recall can result in low specificity and classification accuracy.
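
The metric definitions above translate directly into code. The following sketch is illustrative only; the function names `precision_recall_f1` and `auc_roc` are assumptions, and the AUC is computed with the standard rank-based formulation (the fraction of positive-negative pairs in which the positive instance receives the higher score, counting ties as one half).

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 score for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def auc_roc(y_true, scores, positive):
    """Probability that a random positive instance is scored above a random negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=1)
auc = auc_roc(y_true, scores, positive=1)
```

In this toy example there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3, and five of the six positive-negative pairs are ranked correctly, giving an AUC of 5/6.
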

Results and discussion

Five-fold cross validation was repeated four times to obtain performance metrics for 20 runs. For every fold, the data is split into a training set and a test set. The training data is either undersampled or oversampled, and the balanced data is used to train the decision tree classifier. The performance of the classifier is evaluated against the test set in each fold. The performance measures Recall, Precision, F1 Score, and AUCROC are calculated for each sampling method: RUS, ROS, ADASYN, SMOTE, CNN, and ENN.
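
The 20 runs can be enumerated with a short generator such as the one below. This is a sketch under the assumption that the records are reshuffled before each of the four repetitions; the function name `repeated_kfold` is hypothetical and the splits are not stratified by class.

```python
import random


def repeated_kfold(n, k=5, repeats=4, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for k-fold CV repeated several times."""
    rng = random.Random(seed)
    for repeat in range(1, repeats + 1):
        order = list(range(n))
        rng.shuffle(order)                  # a fresh shuffle for every repetition
        for fold in range(k):
            test_idx = order[fold::k]       # every k-th record forms one test fold
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield repeat, fold + 1, train_idx, test_idx


runs = list(repeated_kfold(n=8000))
print(len(runs))                            # 20
print(len(runs[0][2]), len(runs[0][3]))     # 6400 1600
```

With the 8,000 records used in this study, each run trains on 6,400 records and tests on the remaining 1,600.
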

Table 1 records the 20 trial runs (RUS_TRIAL1, RUS_TRIAL2, …, RUS_TRIAL20) of the RUS sampling method with the decision tree classifier, together with the Recall, Precision, F1 Score, and AUCROC of each trial. Similarly, Tables 2 to 6 give the 20 trial runs of the other sampling methods (ROS, ADASYN, SMOTE, CNN, ENN) with the decision tree classifier and the same performance measures. The sampling techniques are compared on these metrics in Table 7. The experiment was carried out in R with the UBL package.

Random under sampling performance with decision tree.

| Sampling method | Precision | Recall | AUCROC | F1 Score |
|---|---|---|---|---|
| RUS_TRIAL1 | 0.81934847 | 0.666132 | 0.574591 | 0.734838 |
| RUS_TRIAL2 | 0.28668942 | 0.501493 | 0.585529 | 0.364821 |
| RUS_TRIAL3 | 0.295886076 | 0.586207 | 0.619411 | 0.39327 |
| RUS_TRIAL4 | 0.851084813 | 0.677394 | 0.607102 | 0.754371 |
| RUS_TRIAL5 | 0.810810811 | 0.655226 | 0.519437 | 0.724763 |
| RUS_TRIAL6 | 0.826679649 | 0.673275 | 0.574101 | 0.742133 |
| RUS_TRIAL7 | 0.851485149 | 0.667702 | 0.593466 | 0.748477 |
| RUS_TRIAL8 | 0.866866867 | 0.679749 | 0.635887 | 0.761989 |
| RUS_TRIAL9 | 0.830491474 | 0.669361 | 0.601898 | 0.741271 |
| RUS_TRIAL10 | 0.841451767 | 0.684006 | 0.575977 | 0.754604 |
| RUS_TRIAL11 | 0.295492487 | 0.55836 | 0.614722 | 0.386463 |
| RUS_TRIAL12 | 0.273703041 | 0.496753 | 0.591256 | 0.352941 |
| RUS_TRIAL13 | 0.827420901 | 0.693173 | 0.593065 | 0.754371 |
| RUS_TRIAL14 | 0.83960396 | 0.667191 | 0.587395 | 0.743534 |
| RUS_TRIAL15 | 0.840352595 | 0.682578 | 0.603679 | 0.753292 |
| RUS_TRIAL16 | 0.281350482 | 0.589226 | 0.623086 | 0.380849 |
| RUS_TRIAL17 | 0.859683794 | 0.671815 | 0.60312 | 0.754226 |
| RUS_TRIAL18 | 0.843062201 | 0.691523 | 0.594228 | 0.75981 |
| RUS_TRIAL19 | 0.832358674 | 0.697143 | 0.619238 | 0.758774 |
| RUS_TRIAL20 | 0.81237525 | 0.650679 | 0.555999 | 0.722592 |
| RUS TRIAL MEAN | 0.699309894 | 0.642949 | 0.593659 | 0.669946 |

Random oversampling performance with decision tree.

| Sampling method | Precision | Recall | AUCROC | F1 Score |
|---|---|---|---|---|
| ROS_TRIAL1 | 0.853307766 | 0.696401 | 0.610623 | 0.766911 |
| ROS_TRIAL2 | 0.298611111 | 0.508876 | 0.594374 | 0.376368 |
| ROS_TRIAL3 | 0.854684512 | 0.696804 | 0.608655 | 0.767711 |
| ROS_TRIAL4 | 0.83109405 | 0.687847 | 0.585859 | 0.752716 |
| ROS_TRIAL5 | 0.847152847 | 0.669826 | 0.605871 | 0.748125 |
| ROS_TRIAL6 | 0.838461538 | 0.689873 | 0.594937 | 0.756944 |
| ROS_TRIAL7 | 0.854917235 | 0.682737 | 0.604107 | 0.759187 |
| ROS_TRIAL8 | 0.840770791 | 0.657937 | 0.598086 | 0.738201 |
| ROS_TRIAL9 | 0.853472883 | 0.700781 | 0.609766 | 0.769627 |
| ROS_TRIAL10 | 0.304878049 | 0.511696 | 0.597263 | 0.382096 |
| ROS_TRIAL11 | 0.293165468 | 0.506211 | 0.59935 | 0.371298 |
| ROS_TRIAL12 | 0.83030303 | 0.662903 | 0.598118 | 0.73722 |
| ROS_TRIAL13 | 0.850943396 | 0.70304 | 0.602309 | 0.769953 |
| ROS_TRIAL14 | 0.313559322 | 0.534682 | 0.605858 | 0.395299 |
| ROS_TRIAL15 | 0.861248761 | 0.67208 | 0.608027 | 0.754996 |
| ROS_TRIAL16 | 0.298181818 | 0.514107 | 0.60639 | 0.377445 |
| ROS_TRIAL17 | 0.844054581 | 0.687847 | 0.60932 | 0.757987 |
| ROS_TRIAL18 | 0.850551655 | 0.663537 | 0.600402 | 0.745495 |
| ROS_TRIAL19 | 0.843444227 | 0.67874 | 0.596946 | 0.752182 |
| ROS_TRIAL20 | 0.838150289 | 0.690476 | 0.598179 | 0.75718 |
| ROS TRIAL MEAN | 0.710047667 | 0.64082 | 0.601722 | 0.67366 |

Adaptive synthetic sampling performance with decision tree.

| Sampling method | Precision | Recall | AUCROC | F1 Score |
|---|---|---|---|---|
| ADASYN_TRIAL1 | 0.790373654 | 0.991263 | 0.510294 | 0.879493 |
| ADASYN_TRIAL2 | 0.806615776 | 0.98677 | 0.510846 | 0.887644 |
| ADASYN_TRIAL3 | 0.800127714 | 0.989731 | 0.526303 | 0.884887 |
| ADASYN_TRIAL4 | 0.804692454 | 0.987549 | 0.504885 | 0.886792 |
| ADASYN_TRIAL5 | 0.788973384 | 0.993615 | 0.516981 | 0.879548 |
| ADASYN_TRIAL6 | 0.789574062 | 0.987281 | 0.509723 | 0.877428 |
| ADASYN_TRIAL7 | 0.815974441 | 0.98534 | 0.518986 | 0.892695 |
| ADASYN_TRIAL8 | 0.805590851 | 0.987539 | 0.509592 | 0.887334 |
| ADASYN_TRIAL9 | 0.799492386 | 0.990566 | 0.513576 | 0.884831 |
| ADASYN_TRIAL10 | 0.778693722 | 0.991922 | 0.513917 | 0.872469 |
| ADASYN_TRIAL11 | 0.80393401 | 0.989071 | 0.51021 | 0.886944 |
| ADASYN_TRIAL12 | 0.790343075 | 0.992817 | 0.520904 | 0.880085 |
| ADASYN_TRIAL13 | 0.806883365 | 0.984448 | 0.50974 | 0.886865 |
| ADASYN_TRIAL14 | 0.78643853 | 0.989633 | 0.507822 | 0.876412 |
| ADASYN_TRIAL15 | 0.79949077 | 0.985871 | 0.509807 | 0.882953 |
| ADASYN_TRIAL16 | 0.792141952 | 0.993641 | 0.517288 | 0.881523 |
| ADASYN_TRIAL17 | 0.813095995 | 0.986122 | 0.507912 | 0.891289 |
| ADASYN_TRIAL18 | 0.812903226 | 0.971473 | 0.507188 | 0.885142 |
| ADASYN_TRIAL19 | 0.778059607 | 0.991115 | 0.512132 | 0.871758 |
| ADASYN_TRIAL20 | 0.789240506 | 0.991256 | 0.508786 | 0.878788 |
| ADASYN TRIAL MEAN | 0.793672332 | 0.970829 | 0.516991 | 0.873358 |

Synthetic minority oversampling performance with decision tree.

| Sampling method | Precision | Recall | AUCROC | F1 Score |
|---|---|---|---|---|
| SMOTE_TRIAL1 | 0.298013245 | 0.555556 | 0.611634 | 0.387931 |
| SMOTE_TRIAL2 | 0.857976654 | 0.694488 | 0.626032 | 0.767624 |
| SMOTE_TRIAL3 | 0.273462783 | 0.53481 | 0.592561 | 0.361884 |
| SMOTE_TRIAL4 | 0.8382643 | 0.669291 | 0.586161 | 0.744308 |
| SMOTE_TRIAL5 | 0.827884615 | 0.689904 | 0.590691 | 0.752622 |
| SMOTE_TRIAL6 | 0.290552585 | 0.512579 | 0.601063 | 0.370876 |
| SMOTE_TRIAL7 | 0.851669941 | 0.674708 | 0.597672 | 0.752931 |
| SMOTE_TRIAL8 | 0.842105263 | 0.666667 | 0.611742 | 0.744186 |
| SMOTE_TRIAL9 | 0.841359773 | 0.699372 | 0.592017 | 0.763823 |
| SMOTE_TRIAL10 | 0.841112214 | 0.672756 | 0.601774 | 0.747573 |
| SMOTE_TRIAL11 | 0.85257032 | 0.688332 | 0.608872 | 0.761698 |
| SMOTE_TRIAL12 | 0.310344828 | 0.49422 | 0.595595 | 0.381271 |
| SMOTE_TRIAL13 | 0.855327468 | 0.680934 | 0.605546 | 0.758232 |
| SMOTE_TRIAL14 | 0.834834835 | 0.661905 | 0.588305 | 0.73838 |
| SMOTE_TRIAL15 | 0.848393574 | 0.664308 | 0.601971 | 0.74515 |
| SMOTE_TRIAL16 | 0.303664921 | 0.494318 | 0.587303 | 0.376216 |
| SMOTE_TRIAL17 | 0.290718039 | 0.533762 | 0.609783 | 0.376417 |
| SMOTE_TRIAL18 | 0.844984802 | 0.664013 | 0.609623 | 0.743647 |
| SMOTE_TRIAL19 | 0.851887706 | 0.684825 | 0.599555 | 0.759275 |
| SMOTE_TRIAL20 | 0.846989141 | 0.675591 | 0.602947 | 0.751643 |
| SMOTE MEAN | 0.68010585 | 0.630617 | 0.601042 | 0.654427 |

Edited nearest neighbor performance with decision tree.

| Sampling method | Precision | Recall | AUCROC | F1 Score |
|---|---|---|---|---|
| ENN_TRIAL1 | 0.791353383 | 1 | 0.505935 | 0.883526 |
| ENN_TRIAL2 | 0.805886036 | 0.999224 | 0.502817 | 0.892201 |
| ENN_TRIAL3 | 0.796493425 | 1 | 0.504573 | 0.88672 |
| ENN_TRIAL4 | 0.799249531 | 1 | 0.501553 | 0.888425 |
| ENN_TRIAL5 | 0.780839073 | 1 | 0.504249 | 0.876934 |
| ENN_TRIAL6 | 0.793621013 | 1 | 0.501511 | 0.884937 |
| ENN_TRIAL7 | 0.787593985 | 0.998411 | 0.502138 | 0.88056 |
| ENN_TRIAL8 | 0.811166876 | 0.999227 | 0.507784 | 0.895429 |
| ENN_TRIAL9 | 0.79111945 | 1 | 0.501493 | 0.88338 |
| ENN_TRIAL10 | 0.790100251 | 1 | 0.5059 | 0.882744 |
| ENN_TRIAL11 | 0.793988729 | 1 | 0.504518 | 0.885166 |
| ENN_TRIAL12 | 0.808630394 | 1 | 0.501629 | 0.894191 |
| ENN_TRIAL13 | 0.78334377 | 1 | 0.504298 | 0.878511 |
| ENN_TRIAL14 | 0.794855709 | 0.998424 | 0.505254 | 0.885086 |
| ENN_TRIAL15 | 0.7922403 | 0.999211 | 0.501107 | 0.88377 |
| ENN_TRIAL16 | 0.805764411 | 1 | 0.506369 | 0.892436 |
| ENN_TRIAL17 | 0.778473091 | 1 | 0.502809 | 0.87544 |
| ENN_TRIAL18 | 0.80075188 | 0.998438 | 0.502344 | 0.888734 |
| ENN_TRIAL19 | 0.796875 | 1 | 0.5 | 0.886957 |
| ENN_TRIAL20 | 0.790362954 | 1 | 0.502967 | 0.882908 |
| ENN TRIAL MEAN | 0.794635463 | 0.999647 | 0.503462 | 0.885429 |

Condensed nearest neighbor performance with decision tree.

| Sampling method | Precision | Recall | AUCROC | F1 Score |
|---|---|---|---|---|
| CNN_TRIAL1 | 0.803154574 | 0.995309 | 0.511673 | 0.888966 |
| CNN_TRIAL2 | 0.784770296 | 1 | 0.515581 | 0.879408 |
| CNN_TRIAL3 | 0.802639849 | 0.996877 | 0.506276 | 0.889276 |
| CNN_TRIAL4 | 0.802639849 | 0.998436 | 0.510122 | 0.889895 |
| CNN_TRIAL5 | 0.791457286 | 0.998415 | 0.508083 | 0.882971 |
| CNN_TRIAL6 | 0.798865069 | 0.995287 | 0.509876 | 0.886324 |
| CNN_TRIAL7 | 0.803650094 | 0.999218 | 0.515137 | 0.890827 |
| CNN_TRIAL8 | 0.797604035 | 0.998421 | 0.517229 | 0.886786 |
| CNN_TRIAL9 | 0.792085427 | 1 | 0.511799 | 0.883982 |
| CNN_TRIAL10 | 0.797356828 | 0.998424 | 0.512807 | 0.886634 |
| CNN_TRIAL11 | 0.79886148 | 0.995272 | 0.517273 | 0.886316 |
| CNN_TRIAL12 | 0.797986155 | 0.998425 | 0.512849 | 0.887023 |
| CNN_TRIAL13 | 0.804770873 | 0.999221 | 0.509074 | 0.891516 |
| CNN_TRIAL14 | 0.789077213 | 0.998411 | 0.506537 | 0.881487 |
| CNN_TRIAL15 | 0.795725959 | 0.999211 | 0.511617 | 0.885934 |
| CNN_TRIAL16 | 0.805782527 | 1 | 0.514151 | 0.892447 |
| CNN_TRIAL17 | 0.784722222 | 0.995994 | 0.513622 | 0.877825 |
| CNN_TRIAL18 | 0.795725959 | 0.998423 | 0.509754 | 0.885624 |
| CNN_TRIAL19 | 0.811439346 | 1 | 0.514563 | 0.895906 |
| CNN_TRIAL20 | 0.790302267 | 0.996823 | 0.510142 | 0.88163 |
| CNN TRIAL MEAN | 0.797430865 | 0.998108 | 0.511908 | 0.886555 |

Comparative study of different sampling performances with decision tree.

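| Sampling method | Precision | Recall | AUCROC | F1 Score |
|---|---|---|---|---|
| RUS | 0.69931 | 0.642949 | 0.593659 | 0.669946 |
| ROS | 0.710048 | 0.64082 | 0.601722 | 0.67366 |
| ADASYN | 0.789241 | 0.991256 | 0.508786 | 0.873358 |
| SMOTE | 0.680106 | 0.630617 | 0.601042 | 0.654427 |
| ENN | 0.794635 | 0.999647 | 0.503462 | 0.885429 |
| CNN | 0.797431 | 0.998108 | 0.511908 | 0.886555 |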

Before analyzing the performance of the various sampling methods, it should be noted that no single algorithm consistently beats all the others. As recall is the key measure in an imbalanced scenario, ENN outperforms the other methods, and CNN and ADASYN also achieve high recall. When false negatives and false positives are equally costly, the F1 score is the relevant metric, and here too ENN, CNN, and ADASYN outperform the others.

High recall, low precision: ADASYN, CNN, and ENN achieve high recall with lower precision, meaning the classifier was able to predict the no-shows, the minority class, more correctly.

High recall, high precision: The classifier predicts most of the positive samples correctly, labeling kept appointments as attended and missed appointments as no-shows, so that both positive and negative cases are classified correctly. The RUS and SMOTE techniques perform well with high recall and high precision.

Moreover, the F1 score, which reflects the prediction performance on both attended appointments and no-shows, is remarkably high when the ENN, CNN, or ADASYN sampling method is used with the decision tree classifier.

Overall, recall reflects how few positive cases are missed as false negatives, and by this measure the no-shows were predicted most accurately when ENN was used as the sampling method.

A limitation is that the testing was carried out on a limited dataset and needs to be repeated on a larger one. The implication is that this framework will be useful whenever real-world data is imbalanced, where it can ultimately improve classification performance.

Conclusion

This experimental study applied various rebalancing strategies to appointment no-show data in healthcare and provided some guidelines for tackling similar imbalanced problems. The performance of each sampling method with the decision tree classifier was compared using Precision, Recall, AUCROC, and F1 score. In imbalanced classification problems, recall is usually more important, since false negatives are frequently more costly than false positives. With a recall of 0.999, the ENN sampling method predicts the no-show minority class more accurately than the other sampling methods such as RUS, ROS, ADASYN, and CNN. For future work, the sampling methods could also be tested on larger datasets with higher imbalance ratios. Similarly, this framework could be tried with cost-sensitive approaches and ensemble algorithms.

Figure 3

Comparative study of different sampling performances.

Figure 1

Rebalancing framework for imbalanced data classification.
Rebalancing framework for imbalanced data classification.

Figure 2

Sample records of medical appointment no-show dataset.
Sample records of medical appointment no-show dataset.

Figure 3

Comparative study of different sampling performances.
Comparative study of different sampling performances.

Synthetic minority oversampling performance with decision tree.

Sampling methodPrecisionRecallAUCROCF1 Score
SMOTE_TRAIL10.2980132450.5555560.6116340.387931
SMOTE_TRAIL20.8579766540.6944880.6260320.767624
SMOTE_TRAIL30.2734627830.534810.5925610.361884
SMOTE_TRAIL40.83826430.6692910.5861610.744308
SMOTE_TRAIL50.8278846150.6899040.5906910.752622
SMOTE_TRAIL60.2905525850.5125790.6010630.370876
SMOTE_TRAIL70.8516699410.6747080.5976720.752931
SMOTE_TRAIL80.8421052630.6666670.6117420.744186
SMOTE_TRAIL90.8413597730.6993720.5920170.763823
SMOTE_TRAIL100.8411122140.6727560.6017740.747573
SMOTE_TRAIL110.852570320.6883320.6088720.761698
SMOTE_TRAIL120.3103448280.494220.5955950.381271
SMOTE_TRAIL130.8553274680.6809340.6055460.758232
SMOTE_TRAIL140.8348348350.6619050.5883050.73838
SMOTE_TRAIL150.8483935740.6643080.6019710.74515
SMOTE_TRAIL160.3036649210.4943180.5873030.376216
SMOTE_TRAIL170.2907180390.5337620.6097830.376417
SMOTE_TRAIL180.8449848020.6640130.6096230.743647
SMOTE_TRAIL190.8518877060.6848250.5995550.759275
SMOTE_TRAIL200.8469891410.6755910.6029470.751643
SMOTE MEAN0.680105850.6306170.6010420.654427

ADASYN

Step 1: D is the dataset with m class examples {((x1,y1)(,(x2,y2)…..(xn,yn)}
Step 2: Calculate the Imbalance ratio IR = |Dmaj / |Dmin|. If IR is less than the threshold value then the data is Imbalanced
Step 3: Identify G the no of synthetic examples which need to be generated for the minority class
        G = (|Dmaj| – |DMin|) × β
Step 4: For each instance in minority class identify the K nearest neighbor using Euclidean distance (Δi) and calculate the ration Ri which is defined as
        Ri = Δi /K
Step 5: Normalize Ri
Step 6: Identify the number of synthetic data (gi) example for each minority instance
        gi = r̂i × G
Step 7: For each minority xi choose the nearest neighbor xk and generate the synthetic example as follows
        si = xi + rand(0,1) * |xi – xk|

CNN

Step 1: Let D be the original dataset(x1,x2,x3….xn)
Step 2: Take a random instance xi from D and create a Subset T
Step 3: Scan all members of D and add to T where x € T does not match class using KNN Rule and add it to T.
Step 4: Repeat step 3 until all members of xi have been checked

Adaptive synthetic sampling performance with decision tree.

Sampling methodPrecisionRecallAUCROCF1 Score
ADAYSYN_TRAIL10.7903736540.9912630.5102940.879493
ADAYSYN_TRAIL20.8066157760.986770.5108460.887644
ADAYSYN_TRAIL30.8001277140.9897310.5263030.884887
ADAYSYN_TRAIL40.8046924540.9875490.5048850.886792
ADAYSYN_TRAIL50.7889733840.9936150.5169810.879548
ADAYSYN_TRAIL60.7895740620.9872810.5097230.877428
ADAYSYN_TRAIL70.8159744410.985340.5189860.892695
ADAYSYN_TRAIL80.8055908510.9875390.5095920.887334
ADAYSYN_TRAIL90.7994923860.9905660.5135760.884831
ADAYSYN_TRAIL100.7786937220.9919220.5139170.872469
ADAYSYN_TRAIL110.803934010.9890710.510210.886944
ADAYSYN_TRAIL120.7903430750.9928170.5209040.880085
ADAYSYN_TRAIL130.8068833650.9844480.509740.886865
ADAYSYN_TRAIL140.786438530.9896330.5078220.876412
ADAYSYN_TRAIL150.799490770.9858710.5098070.882953
ADAYSYN_TRAIL160.7921419520.9936410.5172880.881523
ADAYSYN_TRAIL170.8130959950.9861220.5079120.891289
ADAYSYN_TRAIL180.8129032260.9714730.5071880.885142
ADAYSYN_TRAIL190.7780596070.9911150.5121320.871758
ADAYSYN_TRAIL200.7892405060.9912560.5087860.878788
ADASYN TRIAL MEAN0.7936723320.9708290.5169910.873358

ENN

Step 1: Let D is the original training set, and T is the edited set
Step 2: For each xi in D remove xk if it is misclassified using the k-NN rule

Edited nearest neighbor performance with decision tree.

| Sampling method | Precision | Recall | AUC ROC | F1 Score |
|---|---|---|---|---|
| ENN_TRIAL1 | 0.791353383 | 1 | 0.505935 | 0.883526 |
| ENN_TRIAL2 | 0.805886036 | 0.999224 | 0.502817 | 0.892201 |
| ENN_TRIAL3 | 0.796493425 | 1 | 0.504573 | 0.88672 |
| ENN_TRIAL4 | 0.799249531 | 1 | 0.501553 | 0.888425 |
| ENN_TRIAL5 | 0.780839073 | 1 | 0.504249 | 0.876934 |
| ENN_TRIAL6 | 0.793621013 | 1 | 0.501511 | 0.884937 |
| ENN_TRIAL7 | 0.787593985 | 0.998411 | 0.502138 | 0.88056 |
| ENN_TRIAL8 | 0.811166876 | 0.999227 | 0.507784 | 0.895429 |
| ENN_TRIAL9 | 0.79111945 | 1 | 0.501493 | 0.88338 |
| ENN_TRIAL10 | 0.790100251 | 1 | 0.5059 | 0.882744 |
| ENN_TRIAL11 | 0.793988729 | 1 | 0.504518 | 0.885166 |
| ENN_TRIAL12 | 0.808630394 | 1 | 0.501629 | 0.894191 |
| ENN_TRIAL13 | 0.78334377 | 1 | 0.504298 | 0.878511 |
| ENN_TRIAL14 | 0.794855709 | 0.998424 | 0.505254 | 0.885086 |
| ENN_TRIAL15 | 0.7922403 | 0.999211 | 0.501107 | 0.88377 |
| ENN_TRIAL16 | 0.805764411 | 1 | 0.506369 | 0.892436 |
| ENN_TRIAL17 | 0.778473091 | 1 | 0.502809 | 0.87544 |
| ENN_TRIAL18 | 0.80075188 | 0.998438 | 0.502344 | 0.888734 |
| ENN_TRIAL19 | 0.796875 | 1 | 0.5 | 0.886957 |
| ENN_TRIAL20 | 0.790362954 | 1 | 0.502967 | 0.882908 |
| ENN TRIAL MEAN | 0.794635463 | 0.999647 | 0.503462 | 0.885429 |

Random under sampling performance with decision tree.

| Sampling method | Precision | Recall | AUC ROC | F1 Score |
|---|---|---|---|---|
| RUS_TRIAL1 | 0.81934847 | 0.666132 | 0.574591 | 0.734838 |
| RUS_TRIAL2 | 0.28668942 | 0.501493 | 0.585529 | 0.364821 |
| RUS_TRIAL3 | 0.295886076 | 0.586207 | 0.619411 | 0.39327 |
| RUS_TRIAL4 | 0.851084813 | 0.677394 | 0.607102 | 0.754371 |
| RUS_TRIAL5 | 0.810810811 | 0.655226 | 0.519437 | 0.724763 |
| RUS_TRIAL6 | 0.826679649 | 0.673275 | 0.574101 | 0.742133 |
| RUS_TRIAL7 | 0.851485149 | 0.667702 | 0.593466 | 0.748477 |
| RUS_TRIAL8 | 0.866866867 | 0.679749 | 0.635887 | 0.761989 |
| RUS_TRIAL9 | 0.830491474 | 0.669361 | 0.601898 | 0.741271 |
| RUS_TRIAL10 | 0.841451767 | 0.684006 | 0.575977 | 0.754604 |
| RUS_TRIAL11 | 0.295492487 | 0.55836 | 0.614722 | 0.386463 |
| RUS_TRIAL12 | 0.273703041 | 0.496753 | 0.591256 | 0.352941 |
| RUS_TRIAL13 | 0.827420901 | 0.693173 | 0.593065 | 0.754371 |
| RUS_TRIAL14 | 0.83960396 | 0.667191 | 0.587395 | 0.743534 |
| RUS_TRIAL15 | 0.840352595 | 0.682578 | 0.603679 | 0.753292 |
| RUS_TRIAL16 | 0.281350482 | 0.589226 | 0.623086 | 0.380849 |
| RUS_TRIAL17 | 0.859683794 | 0.671815 | 0.60312 | 0.754226 |
| RUS_TRIAL18 | 0.843062201 | 0.691523 | 0.594228 | 0.75981 |
| RUS_TRIAL19 | 0.832358674 | 0.697143 | 0.619238 | 0.758774 |
| RUS_TRIAL20 | 0.81237525 | 0.650679 | 0.555999 | 0.722592 |
| RUS TRIAL MEAN | 0.699309894 | 0.642949 | 0.593659 | 0.669946 |

Comparative study of different sampling performances with decision tree.

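| Sampling method | Precision | Recall | AUC ROC | F1 Score |
|---|---|---|---|---|
| RUS | 0.69931 | 0.642949 | 0.593659 | 0.669946 |
| ROS | 0.710048 | 0.64082 | 0.601722 | 0.67366 |
| ADASYN | 0.789241 | 0.991256 | 0.508786 | 0.873358 |
| SMOTE | 0.680106 | 0.630617 | 0.601042 | 0.654427 |
| ENN | 0.794635 | 0.999647 | 0.503462 | 0.885429 |
| CNN | 0.797431 | 0.998108 | 0.511908 | 0.886555 |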
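
As a reading aid for the F1 column, the short Python sketch below recomputes F1 as the harmonic mean of precision and recall, F1 = 2PR / (P + R), for five rows of the comparative table. The helper `f1_score` and the `rows` dictionary are illustrative and only restate values already listed. The ADASYN row is left out because its listed precision and recall coincide with trial 20 of the ADASYN table rather than with the mean row.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the comparative table
rows = {
    "RUS": (0.699310, 0.642949, 0.669946),
    "ROS": (0.710048, 0.640820, 0.673660),
    "SMOTE": (0.680106, 0.630617, 0.654427),
    "ENN": (0.794635, 0.999647, 0.885429),
    "CNN": (0.797431, 0.998108, 0.886555),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: computed {f1_score(p, r):.6f}, reported {reported:.6f}")
```

For these rows the recomputed values agree with the reported ones to within rounding of the inputs.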

RUS

Step 1: D is the original dataset.
Step 2: E is a new set, a subset of D created by randomly selecting majority-class examples, with or without replacement.
Step 3: D = Dmaj + Dmin − E is the balanced dataset, where Dmin and Dmaj refer to the minority and majority subsets, respectively.
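
A minimal Python sketch of random undersampling for a two-class dataset follows. It assumes sampling without replacement and a target of equal class sizes; the function name `random_undersample` and the `seed` parameter are hypothetical. Rather than building E explicitly, it keeps a random majority subset of the same size as the minority class, which is equivalent to removing E.

```python
import random


def random_undersample(data, seed=0):
    """data: list of (features, label) pairs with two classes.

    Returns Dmin plus a random subset of Dmaj of the same size as Dmin.
    """
    rng = random.Random(seed)
    labels = [y for _, y in data]
    classes = sorted(set(labels), key=labels.count)   # rarest class first
    minority, majority = classes[0], classes[-1]
    d_min = [row for row in data if row[1] == minority]
    d_maj = [row for row in data if row[1] == majority]
    kept = rng.sample(d_maj, len(d_min))   # the rows not kept play the role of E
    return d_min + kept
```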

SMOTE

Step 1: D is the original dataset.
Step 2: Create a dataset I containing the minority-class observations, I ⊂ D.
Step 3: Choose K, the number of nearest neighbors considered for each minority-class example.
Step 4: Choose N, the number of synthetic examples to be created.
Step 5: Create a dataset D’ that is a random sample of I of size N.
For each example xk ∈ I
        D’ = x + rand(0,1) * | x – xk |
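
The interpolation step can be sketched in Python as below. The sketch assumes that x is a randomly chosen minority example and xk is one of its K nearest minority neighbors. It uses the signed difference (xk − x), as in the usual SMOTE formulation, so that each synthetic point lies on the segment between x and xk. The function name `smote` and its parameters are illustrative, not taken from the paper.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """minority: list of feature tuples. Returns n_new synthetic feature tuples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # indices of the k nearest minority neighbours of x, excluding x itself
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(x, minority[j])),
        )[:k]
        xk = minority[rng.choice(neighbours)]
        gap = rng.random()                      # rand(0,1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, xk)))
    return synthetic
```

Because every synthetic point is a convex combination of two minority examples, it stays inside the region spanned by the minority class.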

ROS

Step 1: D is the original dataset.
Step 2: E is a new set formed by randomly selecting examples from the minority class (with replacement).
Step 3: D = Dmin + Dmaj + E is the balanced dataset, where Dmin and Dmaj refer to the minority and majority subsets, respectively.
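
A minimal Python sketch of random oversampling for a two-class dataset follows. It assumes the target is equal class sizes; the function name `random_oversample` and the `seed` parameter are hypothetical.

```python
import random


def random_oversample(data, seed=0):
    """data: list of (features, label) pairs with two classes.

    Returns Dmin + Dmaj + E, where E holds minority rows drawn with replacement.
    """
    rng = random.Random(seed)
    labels = [y for _, y in data]
    classes = sorted(set(labels), key=labels.count)   # rarest class first
    minority, majority = classes[0], classes[-1]
    d_min = [row for row in data if row[1] == minority]
    d_maj = [row for row in data if row[1] == majority]
    extra = rng.choices(d_min, k=len(d_maj) - len(d_min))   # E, with replacement
    return d_min + d_maj + extra
```

Since E only repeats existing minority rows, the balanced dataset contains no new feature values, unlike SMOTE.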

Random oversampling performance with decision tree.

| Sampling method | Precision | Recall | AUC ROC | F1 Score |
|---|---|---|---|---|
| ROS_TRIAL1 | 0.853307766 | 0.696401 | 0.610623 | 0.766911 |
| ROS_TRIAL2 | 0.298611111 | 0.508876 | 0.594374 | 0.376368 |
| ROS_TRIAL3 | 0.854684512 | 0.696804 | 0.608655 | 0.767711 |
| ROS_TRIAL4 | 0.83109405 | 0.687847 | 0.585859 | 0.752716 |
| ROS_TRIAL5 | 0.847152847 | 0.669826 | 0.605871 | 0.748125 |
| ROS_TRIAL6 | 0.838461538 | 0.689873 | 0.594937 | 0.756944 |
| ROS_TRIAL7 | 0.854917235 | 0.682737 | 0.604107 | 0.759187 |
| ROS_TRIAL8 | 0.840770791 | 0.657937 | 0.598086 | 0.738201 |
| ROS_TRIAL9 | 0.853472883 | 0.700781 | 0.609766 | 0.769627 |
| ROS_TRIAL10 | 0.304878049 | 0.511696 | 0.597263 | 0.382096 |
| ROS_TRIAL11 | 0.293165468 | 0.506211 | 0.59935 | 0.371298 |
| ROS_TRIAL12 | 0.83030303 | 0.662903 | 0.598118 | 0.73722 |
| ROS_TRIAL13 | 0.850943396 | 0.70304 | 0.602309 | 0.769953 |
| ROS_TRIAL14 | 0.313559322 | 0.534682 | 0.605858 | 0.395299 |
| ROS_TRIAL15 | 0.861248761 | 0.67208 | 0.608027 | 0.754996 |
| ROS_TRIAL16 | 0.298181818 | 0.514107 | 0.60639 | 0.377445 |
| ROS_TRIAL17 | 0.844054581 | 0.687847 | 0.60932 | 0.757987 |
| ROS_TRIAL18 | 0.850551655 | 0.663537 | 0.600402 | 0.745495 |
| ROS_TRIAL19 | 0.843444227 | 0.67874 | 0.596946 | 0.752182 |
| ROS_TRIAL20 | 0.838150289 | 0.690476 | 0.598179 | 0.75718 |
| ROS TRIAL MEAN | 0.710047667 | 0.64082 | 0.601722 | 0.67366 |

Condensed nearest neighbor performance with decision tree.

| Sampling method | Precision | Recall | AUC ROC | F1 Score |
|---|---|---|---|---|
| CNN_TRIAL1 | 0.803154574 | 0.995309 | 0.511673 | 0.888966 |
| CNN_TRIAL2 | 0.784770296 | 1 | 0.515581 | 0.879408 |
| CNN_TRIAL3 | 0.802639849 | 0.996877 | 0.506276 | 0.889276 |
| CNN_TRIAL4 | 0.802639849 | 0.998436 | 0.510122 | 0.889895 |
| CNN_TRIAL5 | 0.791457286 | 0.998415 | 0.508083 | 0.882971 |
| CNN_TRIAL6 | 0.798865069 | 0.995287 | 0.509876 | 0.886324 |
| CNN_TRIAL7 | 0.803650094 | 0.999218 | 0.515137 | 0.890827 |
| CNN_TRIAL8 | 0.797604035 | 0.998421 | 0.517229 | 0.886786 |
| CNN_TRIAL9 | 0.792085427 | 1 | 0.511799 | 0.883982 |
| CNN_TRIAL10 | 0.797356828 | 0.998424 | 0.512807 | 0.886634 |
| CNN_TRIAL11 | 0.79886148 | 0.995272 | 0.517273 | 0.886316 |
| CNN_TRIAL12 | 0.797986155 | 0.998425 | 0.512849 | 0.887023 |
| CNN_TRIAL13 | 0.804770873 | 0.999221 | 0.509074 | 0.891516 |
| CNN_TRIAL14 | 0.789077213 | 0.998411 | 0.506537 | 0.881487 |
| CNN_TRIAL15 | 0.795725959 | 0.999211 | 0.511617 | 0.885934 |
| CNN_TRIAL16 | 0.805782527 | 1 | 0.514151 | 0.892447 |
| CNN_TRIAL17 | 0.784722222 | 0.995994 | 0.513622 | 0.877825 |
| CNN_TRIAL18 | 0.795725959 | 0.998423 | 0.509754 | 0.885624 |
| CNN_TRIAL19 | 0.811439346 | 1 | 0.514563 | 0.895906 |
| CNN_TRIAL20 | 0.790302267 | 0.996823 | 0.510142 | 0.88163 |
| CNN TRIAL MEAN | 0.797430865 | 0.998108 | 0.511908 | 0.886555 |

Amin, A., Anwar, S., Adnan, A., Nawaz, M., Howard, N., Qadir, J., …, & Hussain, A. (2016). Comparing oversampling techniques to handle the class imbalance problem: A customer churn prediction case study. IEEE Access, 4(Ml), 7940–7957. https://doi.org/10.1109/ACCESS.2016.2619719

El-Sayed, A.A., Mahmood, M.A.M., Meguid, N.A., & Hefny, H.A. (2016). Handling autism imbalanced data using synthetic minority over-sampling technique (SMOTE). In Proceedings of 2015 IEEE World Conference on Complex Systems, WCCS 2015, (November). https://doi.org/10.1109/ICoCS.2015.7483267

Fotouhi, S., Asadi, S., & Kattan, M.W. (2019). A comprehensive data level analysis for cancer diagnosis on imbalanced data. Journal of Biomedical Informatics, 90. https://doi.org/10.1016/j.jbi.2018.12.003

Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., & Herrera, F. (2012). A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews, 42(4), 463–484. https://doi.org/10.1109/TSMCC.2011.2161285

He, H., Bai, Y., Garcia, E., & Li, S. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In IEEE International Joint Conference on Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence), Hong Kong, 2008, pp. 1322–1328. https://doi.org/10.1109/IJCNN.2008.4633969

Kheirkhah, P., Feng, Q., Travis, L.M., Tavakoli-Tabasi, S., & Sharafkhaneh, A. (2016). Prevalence, predictors and economic consequences of no-shows. BMC Health Services Research, 16(1), 1–6. https://doi.org/10.1186/s12913-015-1243-z

Lemnaru, C., & Potolea, R. (2012). Imbalanced classification problems: Systematic study, issues and best practices. Lecture Notes in Business Information Processing, 102 LNBIP(1), 35–50. https://doi.org/10.1007/978-3-642-29958-2_3

López, V., Fernández, A., García, S., Palade, V., & Herrera, F. (2013). An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences, 250, 113–141. https://doi.org/10.1016/j.ins.2013.07.007

Mehndiratta, P., & Soni, D. (2019). Identification of sarcasm in textual data: A comparative study. Journal of Data and Information Science, 4(4), 56–83. https://doi.org/10.2478/jdis-2019-0021

Mohammadi, I., Wu, H., Turkcan, A., Toscos, T., & Doebbeling, B.N. (2018). Data analytics and modeling for appointment no-show in community health centers. Journal of Primary Care and Community Health, 9. https://doi.org/10.1177/2150132718811692

Zhao, Y., Wong, Z.S.Y., & Tsui, K.L. (2018). A framework of rebalancing imbalanced healthcare data for rare events’ classification: A case of look-alike sound-alike mix-up incident detection. Journal of Healthcare Engineering, 2018(2010), 1–11. https://doi.org/10.1155/2018/6275435
