Volume 12 (2021): Issue 1 (May 2021)
Journal Details
License
Format
Journal
First Published
19 Sep 2012
Publication timeframe
2 times per year
Languages
English
access type Open Access

The Proportion for Splitting Data into Training and Test Set for the Bootstrap in Classification Problems

Published Online: 04 Jun 2021
Page range: 228 - 242
Received: 12 Aug 2020
Accepted: 15 Mar 2021
Abstract

Background: The bootstrap can be an alternative to cross-validation as a training/test set splitting method, since it requires less computing time in classification problems than tenfold cross-validation.

Objectives: This research investigates what proportion should be used to split the dataset into training and test sets so that the bootstrap is competitive in accuracy with other resampling methods.

Methods/Approach: Different train/test split proportions are combined with the following resampling methods: the bootstrap, leave-one-out cross-validation, tenfold cross-validation, and repeated random train/test splitting. Their performance is tested on several classification methods: logistic regression, the decision tree, and k-nearest neighbours.

Results: The findings suggest that a different structure of the train/test split (e.g. 30/70, 20/80) can further improve the performance of the bootstrap when applied to logistic regression and the decision tree. For k-nearest neighbours, tenfold cross-validation with a 70/30 train/test split is recommended.

Conclusions: Depending on the characteristics of the variables and the preliminary transformations applied to them, the bootstrap can improve classification accuracy.
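The comparison described above can be sketched in code. This is an illustrative reconstruction, not the authors' implementation: the dataset (scikit-learn's breast-cancer data), the number of bootstrap rounds, and the use of out-of-bag observations as the test set are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.utils import resample

X, y = load_breast_cancer(return_X_y=True)


def bootstrap_accuracy(X, y, train_frac, n_rounds=20):
    """Mean accuracy over bootstrap rounds with a given training proportion."""
    n = len(y)
    n_train = int(train_frac * n)
    scores = []
    for i in range(n_rounds):
        # Draw the training set as a bootstrap sample (with replacement)...
        idx = resample(np.arange(n), n_samples=n_train, random_state=i)
        # ...and evaluate on the out-of-bag observations.
        oob = np.setdiff1d(np.arange(n), idx)
        model = LogisticRegression(max_iter=5000).fit(X[idx], y[idx])
        scores.append(model.score(X[oob], y[oob]))
    return float(np.mean(scores))


# Bootstrap accuracy at the 70/30, 30/70 and 20/80 train/test proportions
for frac in (0.7, 0.3, 0.2):
    print(f"bootstrap {int(frac * 100)}/{100 - int(frac * 100)}: "
          f"{bootstrap_accuracy(X, y, frac):.3f}")

# Tenfold cross-validation baseline for comparison
cv = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=10)
print(f"tenfold CV: {cv.mean():.3f}")
```

The bootstrap loop refits the model once per round rather than once per fold, which is why varying only the split proportion (the question the paper studies) changes both its accuracy and its computing cost.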


