Open Access

A Rebalancing Framework for Classification of Imbalanced Medical Appointment No-show Data


Figure 1

Rebalancing framework for imbalanced data classification.

Figure 2

Sample records of medical appointment no-show dataset.

Figure 3

Comparative study of different sampling performances.

Synthetic minority oversampling performance with decision tree.

Sampling method | Precision | Recall | AUC-ROC | F1 Score
SMOTE_TRIAL1 | 0.298013245 | 0.555556 | 0.611634 | 0.387931
SMOTE_TRIAL2 | 0.857976654 | 0.694488 | 0.626032 | 0.767624
SMOTE_TRIAL3 | 0.273462783 | 0.53481 | 0.592561 | 0.361884
SMOTE_TRIAL4 | 0.8382643 | 0.669291 | 0.586161 | 0.744308
SMOTE_TRIAL5 | 0.827884615 | 0.689904 | 0.590691 | 0.752622
SMOTE_TRIAL6 | 0.290552585 | 0.512579 | 0.601063 | 0.370876
SMOTE_TRIAL7 | 0.851669941 | 0.674708 | 0.597672 | 0.752931
SMOTE_TRIAL8 | 0.842105263 | 0.666667 | 0.611742 | 0.744186
SMOTE_TRIAL9 | 0.841359773 | 0.699372 | 0.592017 | 0.763823
SMOTE_TRIAL10 | 0.841112214 | 0.672756 | 0.601774 | 0.747573
SMOTE_TRIAL11 | 0.85257032 | 0.688332 | 0.608872 | 0.761698
SMOTE_TRIAL12 | 0.310344828 | 0.49422 | 0.595595 | 0.381271
SMOTE_TRIAL13 | 0.855327468 | 0.680934 | 0.605546 | 0.758232
SMOTE_TRIAL14 | 0.834834835 | 0.661905 | 0.588305 | 0.73838
SMOTE_TRIAL15 | 0.848393574 | 0.664308 | 0.601971 | 0.74515
SMOTE_TRIAL16 | 0.303664921 | 0.494318 | 0.587303 | 0.376216
SMOTE_TRIAL17 | 0.290718039 | 0.533762 | 0.609783 | 0.376417
SMOTE_TRIAL18 | 0.844984802 | 0.664013 | 0.609623 | 0.743647
SMOTE_TRIAL19 | 0.851887706 | 0.684825 | 0.599555 | 0.759275
SMOTE_TRIAL20 | 0.846989141 | 0.675591 | 0.602947 | 0.751643
SMOTE MEAN | 0.68010585 | 0.630617 | 0.601042 | 0.654427

ADASYN

Step 1: Let D be the dataset of n labelled examples {(x1,y1), (x2,y2), …, (xn,yn)}
Step 2: Calculate the imbalance ratio IR = |Dmaj| / |Dmin|. If IR exceeds the threshold value, the data is imbalanced
Step 3: Compute G, the number of synthetic examples to be generated for the minority class:
        G = (|Dmaj| − |Dmin|) × β
Step 4: For each minority instance xi, find its K nearest neighbours using Euclidean distance. Let Δi be the number of majority examples among them and compute the ratio
        Ri = Δi / K
Step 5: Normalize Ri to r̂i so that the r̂i sum to 1
Step 6: Compute the number of synthetic examples gi for each minority instance:
        gi = r̂i × G
Step 7: For each minority instance xi, choose a minority nearest neighbour xk and generate each synthetic example as
        si = xi + rand(0,1) × (xk − xi)
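The steps above can be sketched in pure Python. This is an illustrative toy implementation under our own assumptions (the `adasyn` helper, the rounding of gi, and the toy data are ours, not from the paper):

```python
import math
import random

def adasyn(majority, minority, k=3, beta=1.0, seed=0):
    """Toy ADASYN sketch following the steps above: density ratio r_i
    from majority neighbours, then per-instance synthetic counts g_i."""
    rng = random.Random(seed)
    G = int((len(majority) - len(minority)) * beta)        # Step 3
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    ratios = []
    for x in minority:                                     # Step 4
        nbrs = sorted((d for d in labelled if d[0] is not x),
                      key=lambda d: math.dist(x, d[0]))[:k]
        ratios.append(sum(lbl for _, lbl in nbrs) / k)     # majority share
    total = sum(ratios) or 1.0
    weights = [r / total for r in ratios]                  # Step 5: normalize
    synthetic = []
    for x, w in zip(minority, weights):
        g = round(w * G)                                   # Step 6
        min_nbrs = sorted((p for p in minority if p is not x),
                          key=lambda p: math.dist(x, p))[:k]
        for _ in range(g):                                 # Step 7: interpolate
            nb = rng.choice(min_nbrs)
            lam = rng.random()
            synthetic.append([a + lam * (b - a) for a, b in zip(x, nb)])
    return synthetic

minority = [[0.0], [0.5], [1.0], [1.5]]      # 4 minority examples (toy data)
majority = [[0.2 * i] for i in range(12)]    # 12 majority examples
synthetic = adasyn(majority, minority)
```

Because the gi are individually rounded, the total number of synthetic points is close to, but not always exactly, G.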

CNN

Step 1: Let D be the original dataset {x1, x2, x3, …, xn}
Step 2: Take a random instance xi from D and create a subset T
Step 3: Scan all members of D; whenever an instance x ∈ D is misclassified by the 1-NN rule applied to T, add it to T
Step 4: Repeat Step 3 until no further instances are added to T
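A minimal sketch of these steps in Python (the `cnn` helper and the toy two-cluster data are ours; the consistency check uses the classic 1-NN rule):

```python
import math

def cnn(data, labels):
    """Condensed nearest neighbour: grow a subset T until every
    instance in D is classified correctly by 1-NN against T."""
    T = [0]                                    # Step 2: seed the subset
    changed = True
    while changed:                             # Step 4: repeat until stable
        changed = False
        for i, x in enumerate(data):           # Step 3: scan all of D
            nearest = min(T, key=lambda j: math.dist(x, data[j]))
            if labels[nearest] != labels[i]:   # misclassified -> condense
                T.append(i)
                changed = True
    return [data[i] for i in T], [labels[i] for i in T]

# Two well-separated clusters condense down to one prototype each
data = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
labels = [0, 0, 0, 1, 1, 1]
sub_data, sub_labels = cnn(data, labels)
```

Because every remaining instance is already classified correctly by the condensed set, CNN keeps the majority-class size down while preserving the decision boundary.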

Adaptive synthetic sampling performance with decision tree.

Sampling method | Precision | Recall | AUC-ROC | F1 Score
ADASYN_TRIAL1 | 0.790373654 | 0.991263 | 0.510294 | 0.879493
ADASYN_TRIAL2 | 0.806615776 | 0.98677 | 0.510846 | 0.887644
ADASYN_TRIAL3 | 0.800127714 | 0.989731 | 0.526303 | 0.884887
ADASYN_TRIAL4 | 0.804692454 | 0.987549 | 0.504885 | 0.886792
ADASYN_TRIAL5 | 0.788973384 | 0.993615 | 0.516981 | 0.879548
ADASYN_TRIAL6 | 0.789574062 | 0.987281 | 0.509723 | 0.877428
ADASYN_TRIAL7 | 0.815974441 | 0.98534 | 0.518986 | 0.892695
ADASYN_TRIAL8 | 0.805590851 | 0.987539 | 0.509592 | 0.887334
ADASYN_TRIAL9 | 0.799492386 | 0.990566 | 0.513576 | 0.884831
ADASYN_TRIAL10 | 0.778693722 | 0.991922 | 0.513917 | 0.872469
ADASYN_TRIAL11 | 0.80393401 | 0.989071 | 0.51021 | 0.886944
ADASYN_TRIAL12 | 0.790343075 | 0.992817 | 0.520904 | 0.880085
ADASYN_TRIAL13 | 0.806883365 | 0.984448 | 0.50974 | 0.886865
ADASYN_TRIAL14 | 0.78643853 | 0.989633 | 0.507822 | 0.876412
ADASYN_TRIAL15 | 0.79949077 | 0.985871 | 0.509807 | 0.882953
ADASYN_TRIAL16 | 0.792141952 | 0.993641 | 0.517288 | 0.881523
ADASYN_TRIAL17 | 0.813095995 | 0.986122 | 0.507912 | 0.891289
ADASYN_TRIAL18 | 0.812903226 | 0.971473 | 0.507188 | 0.885142
ADASYN_TRIAL19 | 0.778059607 | 0.991115 | 0.512132 | 0.871758
ADASYN_TRIAL20 | 0.789240506 | 0.991256 | 0.508786 | 0.878788
ADASYN TRIAL MEAN | 0.793672332 | 0.970829 | 0.516991 | 0.873358

ENN

Step 1: Let D be the original training set and T the edited set
Step 2: For each xi in D, remove xi from T if it is misclassified by the k-NN rule applied to its neighbours
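These two steps can be sketched in pure Python (the `enn` helper and the toy data are ours; the k-NN decision is a simple majority vote over Euclidean neighbours):

```python
import math
from collections import Counter

def enn(data, labels, k=3):
    """Edited nearest neighbour: drop every instance whose label
    disagrees with the majority vote of its k nearest neighbours."""
    keep = []
    for i, x in enumerate(data):
        nbrs = sorted((j for j in range(len(data)) if j != i),
                      key=lambda j: math.dist(x, data[j]))[:k]
        vote = Counter(labels[j] for j in nbrs).most_common(1)[0][0]
        if vote == labels[i]:
            keep.append(i)
    return [data[i] for i in keep], [labels[i] for i in keep]

# A point labelled 1 sitting inside the 0-cluster gets edited out
data = [[0.0], [0.1], [0.2], [0.05], [5.0], [5.1], [5.2]]
labels = [0, 0, 0, 1, 1, 1, 1]
ed_data, ed_labels = enn(data, labels)
```

Unlike CNN, which condenses the interior of each class, ENN cleans the class boundary by removing instances that look like noise to their neighbours.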

Edited nearest neighbor performance with decision tree.

Sampling method | Precision | Recall | AUC-ROC | F1 Score
ENN_TRIAL1 | 0.791353383 | 1 | 0.505935 | 0.883526
ENN_TRIAL2 | 0.805886036 | 0.999224 | 0.502817 | 0.892201
ENN_TRIAL3 | 0.796493425 | 1 | 0.504573 | 0.88672
ENN_TRIAL4 | 0.799249531 | 1 | 0.501553 | 0.888425
ENN_TRIAL5 | 0.780839073 | 1 | 0.504249 | 0.876934
ENN_TRIAL6 | 0.793621013 | 1 | 0.501511 | 0.884937
ENN_TRIAL7 | 0.787593985 | 0.998411 | 0.502138 | 0.88056
ENN_TRIAL8 | 0.811166876 | 0.999227 | 0.507784 | 0.895429
ENN_TRIAL9 | 0.79111945 | 1 | 0.501493 | 0.88338
ENN_TRIAL10 | 0.790100251 | 1 | 0.5059 | 0.882744
ENN_TRIAL11 | 0.793988729 | 1 | 0.504518 | 0.885166
ENN_TRIAL12 | 0.808630394 | 1 | 0.501629 | 0.894191
ENN_TRIAL13 | 0.78334377 | 1 | 0.504298 | 0.878511
ENN_TRIAL14 | 0.794855709 | 0.998424 | 0.505254 | 0.885086
ENN_TRIAL15 | 0.7922403 | 0.999211 | 0.501107 | 0.88377
ENN_TRIAL16 | 0.805764411 | 1 | 0.506369 | 0.892436
ENN_TRIAL17 | 0.778473091 | 1 | 0.502809 | 0.87544
ENN_TRIAL18 | 0.80075188 | 0.998438 | 0.502344 | 0.888734
ENN_TRIAL19 | 0.796875 | 1 | 0.5 | 0.886957
ENN_TRIAL20 | 0.790362954 | 1 | 0.502967 | 0.882908
ENN TRIAL MEAN | 0.794635463 | 0.999647 | 0.503462 | 0.885429

Random under sampling performance with decision tree.

Sampling method | Precision | Recall | AUC-ROC | F1 Score
RUS_TRIAL1 | 0.81934847 | 0.666132 | 0.574591 | 0.734838
RUS_TRIAL2 | 0.28668942 | 0.501493 | 0.585529 | 0.364821
RUS_TRIAL3 | 0.295886076 | 0.586207 | 0.619411 | 0.39327
RUS_TRIAL4 | 0.851084813 | 0.677394 | 0.607102 | 0.754371
RUS_TRIAL5 | 0.810810811 | 0.655226 | 0.519437 | 0.724763
RUS_TRIAL6 | 0.826679649 | 0.673275 | 0.574101 | 0.742133
RUS_TRIAL7 | 0.851485149 | 0.667702 | 0.593466 | 0.748477
RUS_TRIAL8 | 0.866866867 | 0.679749 | 0.635887 | 0.761989
RUS_TRIAL9 | 0.830491474 | 0.669361 | 0.601898 | 0.741271
RUS_TRIAL10 | 0.841451767 | 0.684006 | 0.575977 | 0.754604
RUS_TRIAL11 | 0.295492487 | 0.55836 | 0.614722 | 0.386463
RUS_TRIAL12 | 0.273703041 | 0.496753 | 0.591256 | 0.352941
RUS_TRIAL13 | 0.827420901 | 0.693173 | 0.593065 | 0.754371
RUS_TRIAL14 | 0.83960396 | 0.667191 | 0.587395 | 0.743534
RUS_TRIAL15 | 0.840352595 | 0.682578 | 0.603679 | 0.753292
RUS_TRIAL16 | 0.281350482 | 0.589226 | 0.623086 | 0.380849
RUS_TRIAL17 | 0.859683794 | 0.671815 | 0.60312 | 0.754226
RUS_TRIAL18 | 0.843062201 | 0.691523 | 0.594228 | 0.75981
RUS_TRIAL19 | 0.832358674 | 0.697143 | 0.619238 | 0.758774
RUS_TRIAL20 | 0.81237525 | 0.650679 | 0.555999 | 0.722592
RUS TRIAL MEAN | 0.699309894 | 0.642949 | 0.593659 | 0.669946

Comparative study of different sampling performances with decision tree.

Sampling method | Precision | Recall | AUC-ROC | F1 Score
RUS | 0.69931 | 0.642949 | 0.593659 | 0.669946
ROS | 0.710048 | 0.64082 | 0.601722 | 0.67366
ADASYN | 0.789241 | 0.991256 | 0.508786 | 0.873358
SMOTE | 0.680106 | 0.630617 | 0.601042 | 0.654427
ENN | 0.794635 | 0.999647 | 0.503462 | 0.885429
CNN | 0.797431 | 0.998108 | 0.511908 | 0.886555

RUS

Step 1: Let D be the original dataset
Step 2: Let E be a subset of majority-class examples selected at random, with or without replacement
Step 3: D′ = Dmaj + Dmin − E is the balanced dataset, where Dmin and Dmaj denote the minority and majority subsets respectively
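The steps above can be sketched in a few lines of Python (the `random_undersample` helper and the toy class sizes are ours; here E is drawn without replacement so the majority class shrinks to the minority size):

```python
import random

def random_undersample(majority, minority, seed=0):
    """Drop a random subset E of majority examples (sampling without
    replacement) so both classes end up the same size."""
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, len(minority))  # Dmaj minus E
    return kept_majority + minority

majority = [[float(i)] for i in range(100)]   # 100 majority examples
minority = [[float(i)] for i in range(10)]    # 10 minority examples
balanced = random_undersample(majority, minority)
```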

SMOTE

Step 1: Let D be the original dataset
Step 2: Create the set I of minority-class observations, I ⊆ D
Step 3: Choose K, the number of nearest neighbours considered for each minority instance
Step 4: Choose N, the number of synthetic examples to be created
Step 5: Create a dataset D′ of N synthetic examples from random samples of I.
For each sampled x ∈ I with chosen neighbour xk ∈ I:
        s = x + rand(0,1) × (xk − x)
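The steps above can be sketched in pure Python (the `smote` helper and the toy one-dimensional data are ours; each synthetic point is an interpolation between a minority point and one of its k nearest minority neighbours):

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points: pick a random minority point,
    then interpolate toward one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)                        # random sample of I
        nbrs = sorted((p for p in minority if p is not x),
                      key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(nbrs)
        lam = rng.random()                              # rand(0,1)
        synthetic.append([a + lam * (b - a) for a, b in zip(x, nb)])
    return synthetic

minority = [[0.0], [1.0], [2.0], [3.0]]
synthetic = smote(minority, n_new=5)
```

Because each synthetic point lies on a segment between two minority points, all generated values stay inside the convex hull of the minority class.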

ROS

Step 1: Let D be the original dataset
Step 2: Build a set E by appending randomly selected examples from the minority class (with replacement)
Step 3: D′ = Dmin + Dmaj + E is the balanced dataset, where Dmin and Dmaj denote the minority and majority subsets respectively
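A minimal sketch of these steps (the `random_oversample` helper and the toy class sizes are ours; E is drawn with replacement until the classes match):

```python
import random

def random_oversample(majority, minority, seed=0):
    """Append a set E of minority examples drawn with replacement
    until both classes are the same size."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    E = [rng.choice(minority) for _ in range(n_extra)]
    return majority + minority + E

majority = [[float(i)] for i in range(20)]   # 20 majority examples
minority = [[float(i)] for i in range(5)]    # 5 minority examples
balanced = random_oversample(majority, minority)
```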

Random oversampling performance with decision tree.

Sampling method | Precision | Recall | AUC-ROC | F1 Score
ROS_TRIAL1 | 0.853307766 | 0.696401 | 0.610623 | 0.766911
ROS_TRIAL2 | 0.298611111 | 0.508876 | 0.594374 | 0.376368
ROS_TRIAL3 | 0.854684512 | 0.696804 | 0.608655 | 0.767711
ROS_TRIAL4 | 0.83109405 | 0.687847 | 0.585859 | 0.752716
ROS_TRIAL5 | 0.847152847 | 0.669826 | 0.605871 | 0.748125
ROS_TRIAL6 | 0.838461538 | 0.689873 | 0.594937 | 0.756944
ROS_TRIAL7 | 0.854917235 | 0.682737 | 0.604107 | 0.759187
ROS_TRIAL8 | 0.840770791 | 0.657937 | 0.598086 | 0.738201
ROS_TRIAL9 | 0.853472883 | 0.700781 | 0.609766 | 0.769627
ROS_TRIAL10 | 0.304878049 | 0.511696 | 0.597263 | 0.382096
ROS_TRIAL11 | 0.293165468 | 0.506211 | 0.59935 | 0.371298
ROS_TRIAL12 | 0.83030303 | 0.662903 | 0.598118 | 0.73722
ROS_TRIAL13 | 0.850943396 | 0.70304 | 0.602309 | 0.769953
ROS_TRIAL14 | 0.313559322 | 0.534682 | 0.605858 | 0.395299
ROS_TRIAL15 | 0.861248761 | 0.67208 | 0.608027 | 0.754996
ROS_TRIAL16 | 0.298181818 | 0.514107 | 0.60639 | 0.377445
ROS_TRIAL17 | 0.844054581 | 0.687847 | 0.60932 | 0.757987
ROS_TRIAL18 | 0.850551655 | 0.663537 | 0.600402 | 0.745495
ROS_TRIAL19 | 0.843444227 | 0.67874 | 0.596946 | 0.752182
ROS_TRIAL20 | 0.838150289 | 0.690476 | 0.598179 | 0.75718
ROS TRIAL MEAN | 0.710047667 | 0.64082 | 0.601722 | 0.67366

Condensed nearest neighbor performance with decision tree.

Sampling method | Precision | Recall | AUC-ROC | F1 Score
CNN_TRIAL1 | 0.803154574 | 0.995309 | 0.511673 | 0.888966
CNN_TRIAL2 | 0.784770296 | 1 | 0.515581 | 0.879408
CNN_TRIAL3 | 0.802639849 | 0.996877 | 0.506276 | 0.889276
CNN_TRIAL4 | 0.802639849 | 0.998436 | 0.510122 | 0.889895
CNN_TRIAL5 | 0.791457286 | 0.998415 | 0.508083 | 0.882971
CNN_TRIAL6 | 0.798865069 | 0.995287 | 0.509876 | 0.886324
CNN_TRIAL7 | 0.803650094 | 0.999218 | 0.515137 | 0.890827
CNN_TRIAL8 | 0.797604035 | 0.998421 | 0.517229 | 0.886786
CNN_TRIAL9 | 0.792085427 | 1 | 0.511799 | 0.883982
CNN_TRIAL10 | 0.797356828 | 0.998424 | 0.512807 | 0.886634
CNN_TRIAL11 | 0.79886148 | 0.995272 | 0.517273 | 0.886316
CNN_TRIAL12 | 0.797986155 | 0.998425 | 0.512849 | 0.887023
CNN_TRIAL13 | 0.804770873 | 0.999221 | 0.509074 | 0.891516
CNN_TRIAL14 | 0.789077213 | 0.998411 | 0.506537 | 0.881487
CNN_TRIAL15 | 0.795725959 | 0.999211 | 0.511617 | 0.885934
CNN_TRIAL16 | 0.805782527 | 1 | 0.514151 | 0.892447
CNN_TRIAL17 | 0.784722222 | 0.995994 | 0.513622 | 0.877825
CNN_TRIAL18 | 0.795725959 | 0.998423 | 0.509754 | 0.885624
CNN_TRIAL19 | 0.811439346 | 1 | 0.514563 | 0.895906
CNN_TRIAL20 | 0.790302267 | 0.996823 | 0.510142 | 0.88163
CNN TRIAL MEAN | 0.797430865 | 0.998108 | 0.511908 | 0.886555
eISSN:
2543-683X
Language:
English
Publication frequency:
4 times per year
Journal subjects:
Computer Sciences, Information Technology, Project Management, Databases and Data Mining