Open Access

A Rebalancing Framework for Classification of Imbalanced Medical Appointment No-show Data


Figure 1

Rebalancing framework for imbalanced data classification.

Figure 2

Sample records of medical appointment no-show dataset.

Figure 3

Comparative study of different sampling performances.

Synthetic minority oversampling performance with decision tree.

Sampling method | Precision | Recall | AUC-ROC | F1 Score
SMOTE_TRIAL1 | 0.298013245 | 0.555556 | 0.611634 | 0.387931
SMOTE_TRIAL2 | 0.857976654 | 0.694488 | 0.626032 | 0.767624
SMOTE_TRIAL3 | 0.273462783 | 0.53481 | 0.592561 | 0.361884
SMOTE_TRIAL4 | 0.8382643 | 0.669291 | 0.586161 | 0.744308
SMOTE_TRIAL5 | 0.827884615 | 0.689904 | 0.590691 | 0.752622
SMOTE_TRIAL6 | 0.290552585 | 0.512579 | 0.601063 | 0.370876
SMOTE_TRIAL7 | 0.851669941 | 0.674708 | 0.597672 | 0.752931
SMOTE_TRIAL8 | 0.842105263 | 0.666667 | 0.611742 | 0.744186
SMOTE_TRIAL9 | 0.841359773 | 0.699372 | 0.592017 | 0.763823
SMOTE_TRIAL10 | 0.841112214 | 0.672756 | 0.601774 | 0.747573
SMOTE_TRIAL11 | 0.85257032 | 0.688332 | 0.608872 | 0.761698
SMOTE_TRIAL12 | 0.310344828 | 0.49422 | 0.595595 | 0.381271
SMOTE_TRIAL13 | 0.855327468 | 0.680934 | 0.605546 | 0.758232
SMOTE_TRIAL14 | 0.834834835 | 0.661905 | 0.588305 | 0.73838
SMOTE_TRIAL15 | 0.848393574 | 0.664308 | 0.601971 | 0.74515
SMOTE_TRIAL16 | 0.303664921 | 0.494318 | 0.587303 | 0.376216
SMOTE_TRIAL17 | 0.290718039 | 0.533762 | 0.609783 | 0.376417
SMOTE_TRIAL18 | 0.844984802 | 0.664013 | 0.609623 | 0.743647
SMOTE_TRIAL19 | 0.851887706 | 0.684825 | 0.599555 | 0.759275
SMOTE_TRIAL20 | 0.846989141 | 0.675591 | 0.602947 | 0.751643
SMOTE MEAN | 0.68010585 | 0.630617 | 0.601042 | 0.654427

ADASYN

Step 1: Let D be the dataset of n labelled examples {(x1,y1), (x2,y2), …, (xn,yn)}
Step 2: Calculate the imbalance ratio IR = |Dmaj| / |Dmin|. If IR exceeds the threshold value, the data is imbalanced
Step 3: Compute G, the number of synthetic examples to be generated for the minority class:
        G = (|Dmaj| − |Dmin|) × β
Step 4: For each minority instance xi, find its K nearest neighbours using Euclidean distance. Let Δi be the number of majority examples among them and compute the ratio
        Ri = Δi / K
Step 5: Normalize Ri to r̂i so that the r̂i sum to 1
Step 6: Compute the number of synthetic examples gi for each minority instance:
        gi = r̂i × G
Step 7: For each minority instance xi, choose a minority nearest neighbour xk and generate each synthetic example as
        si = xi + rand(0,1) × (xk − xi)
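The steps above can be sketched in pure Python. This is an illustrative toy implementation under our own assumptions (the `adasyn` helper, the rounding of gi, and the toy data are ours, not from the paper):

```python
import math
import random

def adasyn(majority, minority, k=3, beta=1.0, seed=0):
    """Toy ADASYN sketch following the steps above: density ratio r_i
    from majority neighbours, then per-instance synthetic counts g_i."""
    rng = random.Random(seed)
    G = int((len(majority) - len(minority)) * beta)        # Step 3
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    ratios = []
    for x in minority:                                     # Step 4
        nbrs = sorted((d for d in labelled if d[0] is not x),
                      key=lambda d: math.dist(x, d[0]))[:k]
        ratios.append(sum(lbl for _, lbl in nbrs) / k)     # majority share
    total = sum(ratios) or 1.0
    weights = [r / total for r in ratios]                  # Step 5: normalize
    synthetic = []
    for x, w in zip(minority, weights):
        g = round(w * G)                                   # Step 6
        min_nbrs = sorted((p for p in minority if p is not x),
                          key=lambda p: math.dist(x, p))[:k]
        for _ in range(g):                                 # Step 7: interpolate
            nb = rng.choice(min_nbrs)
            lam = rng.random()
            synthetic.append([a + lam * (b - a) for a, b in zip(x, nb)])
    return synthetic

minority = [[0.0], [0.5], [1.0], [1.5]]      # 4 minority examples (toy data)
majority = [[0.2 * i] for i in range(12)]    # 12 majority examples
synthetic = adasyn(majority, minority)
```

Because the gi are individually rounded, the total number of synthetic points is close to, but not always exactly, G.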

CNN

Step 1: Let D be the original dataset {x1, x2, x3, …, xn}
Step 2: Take a random instance xi from D and create a subset T
Step 3: Scan all members of D; whenever an instance x ∈ D is misclassified by the 1-NN rule applied to T, add it to T
Step 4: Repeat Step 3 until no further instances are added to T
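A minimal sketch of these steps in Python (the `cnn` helper and the toy two-cluster data are ours; the consistency check uses the classic 1-NN rule):

```python
import math

def cnn(data, labels):
    """Condensed nearest neighbour: grow a subset T until every
    instance in D is classified correctly by 1-NN against T."""
    T = [0]                                    # Step 2: seed the subset
    changed = True
    while changed:                             # Step 4: repeat until stable
        changed = False
        for i, x in enumerate(data):           # Step 3: scan all of D
            nearest = min(T, key=lambda j: math.dist(x, data[j]))
            if labels[nearest] != labels[i]:   # misclassified -> condense
                T.append(i)
                changed = True
    return [data[i] for i in T], [labels[i] for i in T]

# Two well-separated clusters condense down to one prototype each
data = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
labels = [0, 0, 0, 1, 1, 1]
sub_data, sub_labels = cnn(data, labels)
```

Because every remaining instance is already classified correctly by the condensed set, CNN keeps the majority-class size down while preserving the decision boundary.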

Adaptive synthetic sampling performance with decision tree.

Sampling method | Precision | Recall | AUC-ROC | F1 Score
ADASYN_TRIAL1 | 0.790373654 | 0.991263 | 0.510294 | 0.879493
ADASYN_TRIAL2 | 0.806615776 | 0.98677 | 0.510846 | 0.887644
ADASYN_TRIAL3 | 0.800127714 | 0.989731 | 0.526303 | 0.884887
ADASYN_TRIAL4 | 0.804692454 | 0.987549 | 0.504885 | 0.886792
ADASYN_TRIAL5 | 0.788973384 | 0.993615 | 0.516981 | 0.879548
ADASYN_TRIAL6 | 0.789574062 | 0.987281 | 0.509723 | 0.877428
ADASYN_TRIAL7 | 0.815974441 | 0.98534 | 0.518986 | 0.892695
ADASYN_TRIAL8 | 0.805590851 | 0.987539 | 0.509592 | 0.887334
ADASYN_TRIAL9 | 0.799492386 | 0.990566 | 0.513576 | 0.884831
ADASYN_TRIAL10 | 0.778693722 | 0.991922 | 0.513917 | 0.872469
ADASYN_TRIAL11 | 0.80393401 | 0.989071 | 0.51021 | 0.886944
ADASYN_TRIAL12 | 0.790343075 | 0.992817 | 0.520904 | 0.880085
ADASYN_TRIAL13 | 0.806883365 | 0.984448 | 0.50974 | 0.886865
ADASYN_TRIAL14 | 0.78643853 | 0.989633 | 0.507822 | 0.876412
ADASYN_TRIAL15 | 0.79949077 | 0.985871 | 0.509807 | 0.882953
ADASYN_TRIAL16 | 0.792141952 | 0.993641 | 0.517288 | 0.881523
ADASYN_TRIAL17 | 0.813095995 | 0.986122 | 0.507912 | 0.891289
ADASYN_TRIAL18 | 0.812903226 | 0.971473 | 0.507188 | 0.885142
ADASYN_TRIAL19 | 0.778059607 | 0.991115 | 0.512132 | 0.871758
ADASYN_TRIAL20 | 0.789240506 | 0.991256 | 0.508786 | 0.878788
ADASYN TRIAL MEAN | 0.793672332 | 0.970829 | 0.516991 | 0.873358

ENN

Step 1: Let D be the original training set and T the edited set
Step 2: For each xi in D, remove xi from T if it is misclassified by the k-NN rule applied to its neighbours
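These two steps can be sketched in pure Python (the `enn` helper and the toy data are ours; the k-NN decision is a simple majority vote over Euclidean neighbours):

```python
import math
from collections import Counter

def enn(data, labels, k=3):
    """Edited nearest neighbour: drop every instance whose label
    disagrees with the majority vote of its k nearest neighbours."""
    keep = []
    for i, x in enumerate(data):
        nbrs = sorted((j for j in range(len(data)) if j != i),
                      key=lambda j: math.dist(x, data[j]))[:k]
        vote = Counter(labels[j] for j in nbrs).most_common(1)[0][0]
        if vote == labels[i]:
            keep.append(i)
    return [data[i] for i in keep], [labels[i] for i in keep]

# A point labelled 1 sitting inside the 0-cluster gets edited out
data = [[0.0], [0.1], [0.2], [0.05], [5.0], [5.1], [5.2]]
labels = [0, 0, 0, 1, 1, 1, 1]
ed_data, ed_labels = enn(data, labels)
```

Unlike CNN, which condenses the interior of each class, ENN cleans the class boundary by removing instances that look like noise to their neighbours.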

Edited nearest neighbor performance with decision tree.

Sampling method | Precision | Recall | AUC-ROC | F1 Score
ENN_TRIAL1 | 0.791353383 | 1 | 0.505935 | 0.883526
ENN_TRIAL2 | 0.805886036 | 0.999224 | 0.502817 | 0.892201
ENN_TRIAL3 | 0.796493425 | 1 | 0.504573 | 0.88672
ENN_TRIAL4 | 0.799249531 | 1 | 0.501553 | 0.888425
ENN_TRIAL5 | 0.780839073 | 1 | 0.504249 | 0.876934
ENN_TRIAL6 | 0.793621013 | 1 | 0.501511 | 0.884937
ENN_TRIAL7 | 0.787593985 | 0.998411 | 0.502138 | 0.88056
ENN_TRIAL8 | 0.811166876 | 0.999227 | 0.507784 | 0.895429
ENN_TRIAL9 | 0.79111945 | 1 | 0.501493 | 0.88338
ENN_TRIAL10 | 0.790100251 | 1 | 0.5059 | 0.882744
ENN_TRIAL11 | 0.793988729 | 1 | 0.504518 | 0.885166
ENN_TRIAL12 | 0.808630394 | 1 | 0.501629 | 0.894191
ENN_TRIAL13 | 0.78334377 | 1 | 0.504298 | 0.878511
ENN_TRIAL14 | 0.794855709 | 0.998424 | 0.505254 | 0.885086
ENN_TRIAL15 | 0.7922403 | 0.999211 | 0.501107 | 0.88377
ENN_TRIAL16 | 0.805764411 | 1 | 0.506369 | 0.892436
ENN_TRIAL17 | 0.778473091 | 1 | 0.502809 | 0.87544
ENN_TRIAL18 | 0.80075188 | 0.998438 | 0.502344 | 0.888734
ENN_TRIAL19 | 0.796875 | 1 | 0.5 | 0.886957
ENN_TRIAL20 | 0.790362954 | 1 | 0.502967 | 0.882908
ENN TRIAL MEAN | 0.794635463 | 0.999647 | 0.503462 | 0.885429

Random under sampling performance with decision tree.

Sampling method | Precision | Recall | AUC-ROC | F1 Score
RUS_TRIAL1 | 0.81934847 | 0.666132 | 0.574591 | 0.734838
RUS_TRIAL2 | 0.28668942 | 0.501493 | 0.585529 | 0.364821
RUS_TRIAL3 | 0.295886076 | 0.586207 | 0.619411 | 0.39327
RUS_TRIAL4 | 0.851084813 | 0.677394 | 0.607102 | 0.754371
RUS_TRIAL5 | 0.810810811 | 0.655226 | 0.519437 | 0.724763
RUS_TRIAL6 | 0.826679649 | 0.673275 | 0.574101 | 0.742133
RUS_TRIAL7 | 0.851485149 | 0.667702 | 0.593466 | 0.748477
RUS_TRIAL8 | 0.866866867 | 0.679749 | 0.635887 | 0.761989
RUS_TRIAL9 | 0.830491474 | 0.669361 | 0.601898 | 0.741271
RUS_TRIAL10 | 0.841451767 | 0.684006 | 0.575977 | 0.754604
RUS_TRIAL11 | 0.295492487 | 0.55836 | 0.614722 | 0.386463
RUS_TRIAL12 | 0.273703041 | 0.496753 | 0.591256 | 0.352941
RUS_TRIAL13 | 0.827420901 | 0.693173 | 0.593065 | 0.754371
RUS_TRIAL14 | 0.83960396 | 0.667191 | 0.587395 | 0.743534
RUS_TRIAL15 | 0.840352595 | 0.682578 | 0.603679 | 0.753292
RUS_TRIAL16 | 0.281350482 | 0.589226 | 0.623086 | 0.380849
RUS_TRIAL17 | 0.859683794 | 0.671815 | 0.60312 | 0.754226
RUS_TRIAL18 | 0.843062201 | 0.691523 | 0.594228 | 0.75981
RUS_TRIAL19 | 0.832358674 | 0.697143 | 0.619238 | 0.758774
RUS_TRIAL20 | 0.81237525 | 0.650679 | 0.555999 | 0.722592
RUS TRIAL MEAN | 0.699309894 | 0.642949 | 0.593659 | 0.669946

Comparative study of different sampling performances with decision tree.

Sampling method | Precision | Recall | AUC-ROC | F1 Score
RUS | 0.69931 | 0.642949 | 0.593659 | 0.669946
ROS | 0.710048 | 0.64082 | 0.601722 | 0.67366
ADASYN | 0.789241 | 0.991256 | 0.508786 | 0.873358
SMOTE | 0.680106 | 0.630617 | 0.601042 | 0.654427
ENN | 0.794635 | 0.999647 | 0.503462 | 0.885429
CNN | 0.797431 | 0.998108 | 0.511908 | 0.886555

RUS

Step 1: Let D be the original dataset
Step 2: Let E be a subset of majority-class examples selected at random, with or without replacement
Step 3: D′ = Dmaj + Dmin − E is the balanced dataset, where Dmin and Dmaj denote the minority and majority subsets respectively
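The steps above can be sketched in a few lines of Python (the `random_undersample` helper and the toy class sizes are ours; here E is drawn without replacement so the majority class shrinks to the minority size):

```python
import random

def random_undersample(majority, minority, seed=0):
    """Drop a random subset E of majority examples (sampling without
    replacement) so both classes end up the same size."""
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, len(minority))  # Dmaj minus E
    return kept_majority + minority

majority = [[float(i)] for i in range(100)]   # 100 majority examples
minority = [[float(i)] for i in range(10)]    # 10 minority examples
balanced = random_undersample(majority, minority)
```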

SMOTE

Step 1: Let D be the original dataset
Step 2: Create the set I of minority-class observations, I ⊆ D
Step 3: Choose K, the number of nearest neighbours considered for each minority instance
Step 4: Choose N, the number of synthetic examples to be created
Step 5: Create a dataset D′ of N synthetic examples from random samples of I.
For each sampled x ∈ I with chosen neighbour xk ∈ I:
        s = x + rand(0,1) × (xk − x)
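The steps above can be sketched in pure Python (the `smote` helper and the toy one-dimensional data are ours; each synthetic point is an interpolation between a minority point and one of its k nearest minority neighbours):

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points: pick a random minority point,
    then interpolate toward one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)                        # random sample of I
        nbrs = sorted((p for p in minority if p is not x),
                      key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(nbrs)
        lam = rng.random()                              # rand(0,1)
        synthetic.append([a + lam * (b - a) for a, b in zip(x, nb)])
    return synthetic

minority = [[0.0], [1.0], [2.0], [3.0]]
synthetic = smote(minority, n_new=5)
```

Because each synthetic point lies on a segment between two minority points, all generated values stay inside the convex hull of the minority class.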

ROS

Step 1: Let D be the original dataset
Step 2: Build a set E by appending randomly selected examples from the minority class (with replacement)
Step 3: D′ = Dmin + Dmaj + E is the balanced dataset, where Dmin and Dmaj denote the minority and majority subsets respectively
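A minimal sketch of these steps (the `random_oversample` helper and the toy class sizes are ours; E is drawn with replacement until the classes match):

```python
import random

def random_oversample(majority, minority, seed=0):
    """Append a set E of minority examples drawn with replacement
    until both classes are the same size."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    E = [rng.choice(minority) for _ in range(n_extra)]
    return majority + minority + E

majority = [[float(i)] for i in range(20)]   # 20 majority examples
minority = [[float(i)] for i in range(5)]    # 5 minority examples
balanced = random_oversample(majority, minority)
```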

Random oversampling performance with decision tree.

Sampling method | Precision | Recall | AUC-ROC | F1 Score
ROS_TRIAL1 | 0.853307766 | 0.696401 | 0.610623 | 0.766911
ROS_TRIAL2 | 0.298611111 | 0.508876 | 0.594374 | 0.376368
ROS_TRIAL3 | 0.854684512 | 0.696804 | 0.608655 | 0.767711
ROS_TRIAL4 | 0.83109405 | 0.687847 | 0.585859 | 0.752716
ROS_TRIAL5 | 0.847152847 | 0.669826 | 0.605871 | 0.748125
ROS_TRIAL6 | 0.838461538 | 0.689873 | 0.594937 | 0.756944
ROS_TRIAL7 | 0.854917235 | 0.682737 | 0.604107 | 0.759187
ROS_TRIAL8 | 0.840770791 | 0.657937 | 0.598086 | 0.738201
ROS_TRIAL9 | 0.853472883 | 0.700781 | 0.609766 | 0.769627
ROS_TRIAL10 | 0.304878049 | 0.511696 | 0.597263 | 0.382096
ROS_TRIAL11 | 0.293165468 | 0.506211 | 0.59935 | 0.371298
ROS_TRIAL12 | 0.83030303 | 0.662903 | 0.598118 | 0.73722
ROS_TRIAL13 | 0.850943396 | 0.70304 | 0.602309 | 0.769953
ROS_TRIAL14 | 0.313559322 | 0.534682 | 0.605858 | 0.395299
ROS_TRIAL15 | 0.861248761 | 0.67208 | 0.608027 | 0.754996
ROS_TRIAL16 | 0.298181818 | 0.514107 | 0.60639 | 0.377445
ROS_TRIAL17 | 0.844054581 | 0.687847 | 0.60932 | 0.757987
ROS_TRIAL18 | 0.850551655 | 0.663537 | 0.600402 | 0.745495
ROS_TRIAL19 | 0.843444227 | 0.67874 | 0.596946 | 0.752182
ROS_TRIAL20 | 0.838150289 | 0.690476 | 0.598179 | 0.75718
ROS TRIAL MEAN | 0.710047667 | 0.64082 | 0.601722 | 0.67366

Condensed nearest neighbor performance with decision tree.

Sampling method | Precision | Recall | AUC-ROC | F1 Score
CNN_TRIAL1 | 0.803154574 | 0.995309 | 0.511673 | 0.888966
CNN_TRIAL2 | 0.784770296 | 1 | 0.515581 | 0.879408
CNN_TRIAL3 | 0.802639849 | 0.996877 | 0.506276 | 0.889276
CNN_TRIAL4 | 0.802639849 | 0.998436 | 0.510122 | 0.889895
CNN_TRIAL5 | 0.791457286 | 0.998415 | 0.508083 | 0.882971
CNN_TRIAL6 | 0.798865069 | 0.995287 | 0.509876 | 0.886324
CNN_TRIAL7 | 0.803650094 | 0.999218 | 0.515137 | 0.890827
CNN_TRIAL8 | 0.797604035 | 0.998421 | 0.517229 | 0.886786
CNN_TRIAL9 | 0.792085427 | 1 | 0.511799 | 0.883982
CNN_TRIAL10 | 0.797356828 | 0.998424 | 0.512807 | 0.886634
CNN_TRIAL11 | 0.79886148 | 0.995272 | 0.517273 | 0.886316
CNN_TRIAL12 | 0.797986155 | 0.998425 | 0.512849 | 0.887023
CNN_TRIAL13 | 0.804770873 | 0.999221 | 0.509074 | 0.891516
CNN_TRIAL14 | 0.789077213 | 0.998411 | 0.506537 | 0.881487
CNN_TRIAL15 | 0.795725959 | 0.999211 | 0.511617 | 0.885934
CNN_TRIAL16 | 0.805782527 | 1 | 0.514151 | 0.892447
CNN_TRIAL17 | 0.784722222 | 0.995994 | 0.513622 | 0.877825
CNN_TRIAL18 | 0.795725959 | 0.998423 | 0.509754 | 0.885624
CNN_TRIAL19 | 0.811439346 | 1 | 0.514563 | 0.895906
CNN_TRIAL20 | 0.790302267 | 0.996823 | 0.510142 | 0.88163
CNN TRIAL MEAN | 0.797430865 | 0.998108 | 0.511908 | 0.886555
eISSN:
2543-683X
Language:
English
Publication frequency:
4 times per year
Journal subjects:
Computer Sciences, Information Technology, Project Management, Databases and Data Mining