Volume 38 (2022): Issue 2 (June 2022)
Journal details
First published: 01 Oct 2013
Publication frequency: 4 times per year
Access type: Open access

Data Fusion for Joining Income and Consumption Information using Different Donor-Recipient Distance Metrics

Published online: 14 Jun 2022
Volume & Issue: Volume 38 (2022) - Issue 2 (June 2022)
Pages: 509 - 532
Received: 01 Nov 2020
Accepted: 01 Sep 2021

Data fusion describes the method of combining data from (at least) two initially independent data sources to allow for joint analysis of variables that are not jointly observed. The fundamental idea is to base inference on identifying assumptions and on common variables, which provide information that is jointly observed in all data sources. A popular class of methods for this particular missing-data problem in practice is covariate-based nearest neighbour matching, whereas more flexible semi- or even fully parametric approaches seem underrepresented in applied data fusion. In this article we compare two approaches to nearest neighbour hot deck matching. The first, Random Hot Deck, is a variant of the covariate-based matching methods that was proposed by Eurostat and can be considered a ‘classical’ statistical matching method; the alternative approach is based on Predictive Mean Matching. We discuss results from a simulation study in which we deviate from previous analyses of marginal distributions and consider joint distributions of the fusion variables instead. Our findings suggest that Predictive Mean Matching tends to outperform Random Hot Deck.
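
To make the two matching ideas concrete, the following is a minimal Python sketch; it is not the authors' or Eurostat's implementation. It assumes one donor file that contains the fusion variable `y_don` and common variables observed in both files. The function names, the use of matching classes built from a categorical common variable for Random Hot Deck, and the plain least-squares predictor for Predictive Mean Matching are illustrative assumptions rather than details taken from the article.

```python
import numpy as np


def random_hot_deck_donors(class_rec, class_don, rng):
    """Random hot deck (sketch): for each recipient, draw one donor at
    random from the donors that share the recipient's matching class."""
    class_rec = np.asarray(class_rec)
    class_don = np.asarray(class_don)
    donors = np.empty(len(class_rec), dtype=int)
    for i, c in enumerate(class_rec):
        pool = np.flatnonzero(class_don == c)
        if pool.size == 0:
            # Assumed fallback: use the whole donor file if the class is empty.
            pool = np.arange(len(class_don))
        donors[i] = rng.choice(pool)
    return donors


def pmm_donors(x_rec, x_don, y_don):
    """Predictive mean matching (sketch): fit a linear model on the donor
    file, predict for donors and recipients, and pick for each recipient
    the donor whose predicted value is closest."""
    X_don = np.column_stack([np.ones(len(x_don)), x_don])
    X_rec = np.column_stack([np.ones(len(x_rec)), x_rec])
    beta, *_ = np.linalg.lstsq(X_don, np.asarray(y_don, dtype=float), rcond=None)
    pred_don = X_don @ beta
    pred_rec = X_rec @ beta
    # Distance is measured between predicted means, not between covariates.
    dist = np.abs(pred_rec[:, None] - pred_don[None, :])
    return dist.argmin(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Hypothetical toy data: one continuous and one binary common variable.
    x_don = rng.normal(size=200)
    g_don = rng.integers(0, 2, size=200)
    y_don = 1.0 + 2.0 * x_don + rng.normal(scale=0.5, size=200)
    x_rec = rng.normal(size=100)
    g_rec = rng.integers(0, 2, size=100)

    # In both methods the imputed value is an observed donor value.
    y_rhd = y_don[random_hot_deck_donors(g_rec, g_don, rng)]
    y_pmm = y_don[pmm_donors(x_rec, x_don, y_don)]
    print(y_rhd[:3], y_pmm[:3])
```

In both functions the recipient receives an actually observed donor value, which is what makes them hot deck methods; they differ only in the donor-recipient distance, which is class membership in the first case and the gap between predicted means in the second.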


