Volume 36 (2020): Issue 2 (June 2020)
Journal Details
First Published
01 Oct 2013
Publication timeframe
4 times per year
Access type: Open Access

Controlling for Selection Bias in Social Media Indicators through Official Statistics: a Proposal

Published Online: 15 Jun 2020
Volume & Issue: Volume 36 (2020) - Issue 2 (June 2020)
Page range: 315 - 338
Received: 01 Mar 2019
Accepted: 01 Jan 2020

With the increase of social media usage, a huge new source of data has become available. Despite the enthusiasm linked to this revolution, one of the main outstanding criticisms of using these data is selection bias: the reference population is unknown. Nevertheless, many studies show evidence that these data constitute a valuable source because they are more timely and have finer spatial granularity. We propose to adjust statistics based on Twitter data by anchoring them to reliable official statistics through a weighted, space-time, small area estimation model. As a by-product, the proposed method also stabilizes the social media indicators, a property required for official statistics. The method can be adapted whenever official statistics exist at the proper level of granularity and social media usage within the population is known. As an example, we adjust a subjective well-being indicator of “working conditions” in Italy and combine it with relevant official statistics. The weights depend on broadband coverage and the Twitter rate at the province level, while the analysis is performed at the regional level. The resulting statistics are then compared with survey statistics on the “quality of job” at the macro-economic regional level, showing evidence of similar paths.
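To give a rough feel for what anchoring a noisy indicator to an auxiliary statistic can look like, here is a minimal Python sketch of a basic area-level Fay-Herriot predictor. It is not the weighted space-time model proposed in the article. It is a simplified illustration that assumes a single covariate, an intercept, and a known random-effect variance (`model_var`). The function name, argument names, and all numbers are hypothetical.

```python
from typing import List, Sequence


def fay_herriot_eblup(
    direct: Sequence[float],
    sampling_var: Sequence[float],
    covariate: Sequence[float],
    model_var: float,
) -> List[float]:
    """Basic area-level Fay-Herriot predictor (one covariate plus intercept).

    Assumption: model_var, the variance of the area random effect, is
    treated as known. In practice it would be estimated (e.g. by REML).
    """
    # Generalized least squares weights: 1 / (model variance + sampling variance).
    w = [1.0 / (model_var + psi) for psi in sampling_var]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, covariate)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, direct)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, covariate))
    sxy = sum(
        wi * (xi - x_bar) * (yi - y_bar)
        for wi, xi, yi in zip(w, covariate, direct)
    )
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    predictions = []
    for yi, psi, xi in zip(direct, sampling_var, covariate):
        # Shrinkage factor: close to 1 when the direct estimate is precise.
        gamma = model_var / (model_var + psi)
        synthetic = intercept + slope * xi  # regression-based prediction
        predictions.append(gamma * yi + (1.0 - gamma) * synthetic)
    return predictions


# Hypothetical toy data: a noisy indicator for four areas and one auxiliary statistic.
indicator = [52.0, 47.5, 60.2, 55.1]
indicator_var = [4.0, 9.0, 1.0, 2.5]
auxiliary = [10.0, 8.0, 13.0, 11.0]
print(fay_herriot_eblup(indicator, indicator_var, auxiliary, model_var=2.0))
```

Areas with a large sampling variance are pulled more strongly toward the regression prediction. This is one way to read the stabilizing effect mentioned in the abstract, although the article's own weighting scheme differs.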


