Volume 38 (2022): Issue 2 (June 2022)
Journal details
Format: Journal
eISSN: 2001-7367
First published: 01 Oct 2013
Publication frequency: 4 issues per year
Languages: English
Access type: Open access

Improving the Output Quality of Official Statistics Based on Machine Learning Algorithms

Published online: 14 Jun 2022
Volume & Issue: Volume 38 (2022) - Issue 2 (June 2022)
Page range: 485 - 508
Received: 01 Dec 2020
Accepted: 01 Jun 2021
Abstract

National statistical institutes are currently investigating how to improve the output quality of official statistics based on machine learning algorithms. A key issue is concept drift, that is, a change over time in the joint distribution of the independent variables and a dependent (categorical) variable. Under concept drift, a statistical model requires regular updating to prevent it from becoming biased. However, updating a model requires additional data, which are not always available. An alternative is to reduce the bias by means of bias correction methods. In this article, we focus on estimating the proportion (base rate) of a category of interest and we compare two popular bias correction methods: the misclassification estimator and the calibration estimator. For prior probability shift (a specific type of concept drift), we investigate the two methods analytically as well as numerically. Our analytical results are expressions for the bias and variance of both methods. As a numerical result, we present a decision boundary for the relative performance of the two methods. Our results provide a better understanding of the effect of prior probability shift on output quality. Consequently, we may recommend a novel approach to using machine learning algorithms in the context of official statistics.
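To make the comparison concrete, the following is a minimal sketch of the two estimators for a binary classifier. The abstract does not state the formulas, so the definitions used here are the textbook forms from the misclassification literature and should be read as an assumption about what the article analyzes. All function names and the numeric values (sensitivity 0.9, specificity 0.8, base rates 0.5 and 0.2) are hypothetical. The sketch works with population quantities only, so it illustrates bias under prior probability shift and says nothing about variance.

```python
def observed_positive_rate(base_rate, sensitivity, specificity):
    """Share of units the classifier labels positive (classify and count)."""
    return base_rate * sensitivity + (1.0 - base_rate) * (1.0 - specificity)


def calibration_probabilities(base_rate, sensitivity, specificity):
    """P(true = 1 | predicted = 1) and P(true = 1 | predicted = 0) in a data set."""
    p_pos = observed_positive_rate(base_rate, sensitivity, specificity)
    p_true_given_pos = base_rate * sensitivity / p_pos
    p_true_given_neg = base_rate * (1.0 - sensitivity) / (1.0 - p_pos)
    return p_true_given_pos, p_true_given_neg


def misclassification_estimate(p_pos, sensitivity, specificity):
    """Invert p_pos = a * sensitivity + (1 - a) * (1 - specificity) for a."""
    denom = sensitivity + specificity - 1.0
    if denom == 0.0:
        raise ValueError("uninformative classifier: sensitivity + specificity = 1")
    return (p_pos + specificity - 1.0) / denom


def calibration_estimate(p_pos, p_true_given_pos, p_true_given_neg):
    """Weight each predicted class by its test-set probability of being a true 1."""
    return p_pos * p_true_given_pos + (1.0 - p_pos) * p_true_given_neg


# Hypothetical setting: classifier quality is fixed, only the base rate shifts.
SENS, SPEC = 0.9, 0.8
TEST_BASE_RATE, TARGET_BASE_RATE = 0.5, 0.2

# Quantities learned on the test set (base rate 0.5).
c_pos, c_neg = calibration_probabilities(TEST_BASE_RATE, SENS, SPEC)

# Quantity observed on the target population (base rate 0.2).
p_pos_target = observed_positive_rate(TARGET_BASE_RATE, SENS, SPEC)

misclassification_result = misclassification_estimate(p_pos_target, SENS, SPEC)
calibration_result = calibration_estimate(p_pos_target, c_pos, c_neg)

print(f"true base rate:             {TARGET_BASE_RATE:.4f}")
print(f"misclassification estimate: {misclassification_result:.4f}")
print(f"calibration estimate:       {calibration_result:.4f}")
```

In this hypothetical setting the misclassification estimate recovers the shifted base rate of 0.20, because sensitivity and specificity are unchanged under prior probability shift, whereas the calibration estimate is about 0.35, pulled toward the test-set base rate. With finite test sets the picture is less one-sided, since both estimators then also have sampling variance, which is the trade-off the article's decision boundary addresses.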

