Volume 38 (2022): Issue 2 (June 2022)
Journal information
Format: Journal
eISSN: 2001-7367
First published: 01 Oct 2013
Publication frequency: 4 times per year
Languages: English
Access type: Open access

Improved Assessment of the Accuracy of Record Linkage via an Extended MaCSim Approach

Published: 14 Jun 2022
Volume & Issue: Volume 38 (2022) - Issue 2 (June 2022)
Page range: 429 - 451
Received: 01 May 2020
Accepted: 01 Nov 2021
Abstract

Record linkage is the process of bringing together records of the same entity from overlapping data sources while removing duplicates. Public and private organizations, as well as researchers and individuals, now collect huge amounts of data. Linking and analysing relevant information from this massive data reservoir can provide new insights into society, so effective and efficient methods for linking data from different sources have become increasingly important. It is therefore necessary to assess how accurately a linking method performs, or to compare methods with respect to accuracy. In this article, we improve on a Markov Chain based Monte Carlo simulation approach (MaCSim) for assessing a linking method. The improvement involves calculating a similarity weight for every linking variable value for each record pair, which allows partial agreement of the linking variable values. To assess the accuracy of the linking method, correctly linked proportions are investigated for each record. The extended MaCSim approach is illustrated using a synthetic data set provided by the Australian Bureau of Statistics based on realistic data settings. Test results show high accuracy of the assessment of the linkages.
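
As a rough illustration of the two ideas named in the abstract, the Python sketch below shows one way a similarity weight could allow partial agreement, and how a correctly linked proportion per record could be tallied over repeated simulated linkages. It is not the authors' implementation. The abstract does not specify the similarity measure or the weight formula, so the use of `difflib.SequenceMatcher`, the interpolation between Fellegi-Sunter style agreement and disagreement weights, and all function names and the `m`/`u` values are assumptions made only for this example.

```python
import math
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]. difflib's ratio is a stand-in (assumption);
    the article may use a different string comparator."""
    return SequenceMatcher(None, a, b).ratio()


def field_weight(a: str, b: str, m: float, u: float) -> float:
    """Weight for one linking variable, allowing partial agreement.

    m and u are the usual match / non-match agreement probabilities.
    The linear interpolation below is an illustrative assumption.
    """
    agree = math.log2(m / u)                  # weight for full agreement
    disagree = math.log2((1 - m) / (1 - u))   # weight for full disagreement
    s = similarity(a, b)
    return disagree + s * (agree - disagree)


def record_pair_weight(rec_a: dict, rec_b: dict, m_probs: dict, u_probs: dict) -> float:
    """Sum the per-variable weights over all linking variables."""
    return sum(
        field_weight(rec_a[k], rec_b[k], m_probs[k], u_probs[k])
        for k in m_probs
    )


def correct_link_proportions(simulated_links: list, true_links: dict) -> dict:
    """For each record, the share of simulated linkages in which it was
    linked to its true match. Each simulated linkage is a dict mapping a
    record id in one file to the id it was linked to in the other."""
    n = len(simulated_links)
    return {
        rec: sum(run.get(rec) == match for run in simulated_links) / n
        for rec, match in true_links.items()
    }
```

With hypothetical values `m = 0.9` and `u = 0.1`, a pair such as "smith" and "smyth" receives a weight strictly between the full-disagreement and full-agreement weights, which is the behaviour that partial agreement is meant to capture.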
