A Gaussian–Based WGAN–GP Oversampling Approach for Solving the Class Imbalance Problem

eISSN: 2083-8492
Language: English

Publication frequency: 4 issues per year
Journal subjects: Mathematics, Applied Mathematics

Published online: 25 June 2024

Page range: 291–307

Submitted: 31 July 2023

Accepted: 21 Feb. 2024

DOI: https://doi.org/10.61822/amcs-2024-0021

Keywords: machine learning, class imbalance, generative adversarial networks, oversampling, data duplication

© 2024 Qian Zhou et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.
