Open Access

Improving the Output Quality of Official Statistics Based on Machine Learning Algorithms



National statistical institutes currently investigate how to improve the output quality of official statistics based on machine learning algorithms. A key issue is concept drift, that is, when the joint distribution of independent variables and a dependent (categorical) variable changes over time. Under concept drift, a statistical model requires regular updating to prevent it from becoming biased. However, updating a model requires additional data, which are not always available. An alternative is to reduce the bias by means of bias correction methods. In this article, we focus on estimating the proportion (base rate) of a category of interest and we compare two popular bias correction methods: the misclassification estimator and the calibration estimator. For prior probability shift (a specific type of concept drift), we investigate the two methods analytically as well as numerically. Our analytical results are expressions for the bias and variance of both methods. As a numerical result, we present a decision boundary for the relative performance of the two methods. Our results provide a better understanding of the effect of prior probability shift on output quality. Consequently, we recommend a novel approach to using machine learning algorithms in the context of official statistics.
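
The following Python sketch illustrates the two bias correction methods mentioned in the abstract for a binary category of interest. It is a minimal illustration under our own assumptions (the function names, the binary setting, the labeled audit sample, and the simulated classifier accuracies are not taken from the article): the misclassification estimator inverts the class-conditional misclassification probabilities, while the calibration estimator averages the calibrated probabilities given the predicted class. In the simulation, the audit sample and the target data have different base rates (prior probability shift), so the two corrections behave differently.

```python
import numpy as np


def misclassification_estimator(y_true_audit, y_pred_audit, y_pred_target):
    """Base-rate estimate corrected via misclassification probabilities (binary case).

    alpha_hat = (q_hat - p01) / (p11 - p01), where
      q_hat = share classified as positive in the target data,
      p11   = P(classified positive | truly positive), from the audit sample,
      p01   = P(classified positive | truly negative), from the audit sample.
    """
    p11 = y_pred_audit[y_true_audit == 1].mean()   # sensitivity
    p01 = y_pred_audit[y_true_audit == 0].mean()   # false-positive rate
    q_hat = y_pred_target.mean()
    return (q_hat - p01) / (p11 - p01)


def calibration_estimator(y_true_audit, y_pred_audit, y_pred_target):
    """Base-rate estimate corrected via calibration probabilities (binary case).

    alpha_hat = q_hat * c1 + (1 - q_hat) * c0, where
      c1 = P(truly positive | classified positive), from the audit sample,
      c0 = P(truly positive | classified negative), from the audit sample.
    """
    c1 = y_true_audit[y_pred_audit == 1].mean()
    c0 = y_true_audit[y_pred_audit == 0].mean()
    q_hat = y_pred_target.mean()
    return q_hat * c1 + (1 - q_hat) * c0


# Toy illustration (assumed numbers, not from the article): a classifier with
# 90% sensitivity and 80% specificity, an audit sample with base rate 0.5,
# and target data with base rate 0.2 (prior probability shift).
rng = np.random.default_rng(0)


def simulate(n, base_rate):
    y = rng.binomial(1, base_rate, n)
    pred = np.where(y == 1, rng.binomial(1, 0.9, n), rng.binomial(1, 0.2, n))
    return y, pred


y_audit, pred_audit = simulate(2_000, 0.5)
_, pred_target = simulate(50_000, 0.2)

print("naive classify-and-count:", pred_target.mean())
print("misclassification est.:  ", misclassification_estimator(y_audit, pred_audit, pred_target))
print("calibration est.:        ", calibration_estimator(y_audit, pred_audit, pred_target))
```

In this sketch the misclassification estimator relies only on class-conditional error rates, which remain valid under a pure shift of the base rate, whereas the calibration probabilities depend on the base rate of the audit sample and therefore drift with it; comparing the bias and variance of the two corrections is exactly the trade-off the article analyzes.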

eISSN:
2001-7367
Language:
English
Publication timeframe:
4 issues per year
Journal subjects:
Mathematics, Probability Theory and Statistics