Published Online: 15 Jun 2020 Page range: 237 - 249
Abstract
We propose an estimator for the Gini coefficient based on a ratio of means. We show how the bootstrap and empirical likelihood can be combined to construct confidence intervals. Our simulation study shows that the proposed estimator is usually less biased than customary estimators, and that the observed coverage of the proposed empirical likelihood confidence interval is closer to the nominal value.
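As background, one standard way to write the Gini coefficient as a ratio of two means is G = E|X1 − X2| / (2 E[X]). The sketch below shows a plug-in version of this ratio form together with a plain percentile bootstrap interval; it is an illustration only, not the authors' estimator, and in particular it does not reproduce the empirical likelihood component of their intervals.

```python
import numpy as np

def gini_ratio_of_means(x):
    """Plug-in Gini coefficient written as a ratio of two means:
    G = E|X1 - X2| / (2 * E[X]), with pairs drawn with replacement."""
    x = np.asarray(x, dtype=float)
    mad = np.abs(x[:, None] - x[None, :]).mean()  # mean absolute difference
    return mad / (2.0 * x.mean())

def bootstrap_ci(x, level=0.95, n_boot=1000, seed=0):
    """Simple percentile bootstrap interval for the Gini estimator
    (illustrative; the article combines the bootstrap with empirical
    likelihood, which is not reproduced here)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    stats = [gini_ratio_of_means(rng.choice(x, size=x.size, replace=True))
             for _ in range(n_boot)]
    alpha = (1.0 - level) / 2.0
    return tuple(np.quantile(stats, [alpha, 1.0 - alpha]))
```

For nonnegative data the statistic lies in [0, 1): it is 0 for perfectly equal incomes and grows with inequality.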
Published Online: 15 Jun 2020 Page range: 251 - 274
Abstract
Policy measures to combat low literacy are often targeted at municipalities or regions with low levels of literacy. However, current surveys on literacy do not contain enough observations at this level to allow for reliable estimates when using only direct estimation techniques. To provide more reliable results at a detailed regional level, alternative methods must be used.
The aim of this article is to obtain literacy estimates at the municipality level using model-based small area estimation techniques in a hierarchical Bayesian framework. To do so, we link Dutch Labour Force Survey data to the most recent literacy survey available, that of the Programme for the International Assessment of Adult Competencies (PIAAC). We estimate the average literacy score, as well as the percentage of people with a low literacy level. Variance estimators for our small area predictions explicitly account for the imputation uncertainty in the PIAAC estimates. The proposed estimation method improves the precision of the area estimates, making it possible to break down the national figures by municipality.
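The article fits a full hierarchical Bayesian model; as a simpler illustration of the idea behind model-based small area estimation, the classic Fay-Herriot composite predictor shrinks each area's direct survey estimate toward a synthetic, model-based value in proportion to how noisy the direct estimate is. All names below are illustrative and this is not the authors' specification.

```python
import numpy as np

def fay_herriot_predictor(direct, sampling_var, synthetic, sigma2_v):
    """Composite small area predictor: shrink each area's direct survey
    estimate toward a synthetic (regression-based) estimate.
    sigma2_v is the between-area model variance; sampling_var is the
    design variance of each direct estimate."""
    direct = np.asarray(direct, dtype=float)
    sampling_var = np.asarray(sampling_var, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    gamma = sigma2_v / (sigma2_v + sampling_var)  # shrinkage weight per area
    return gamma * direct + (1.0 - gamma) * synthetic
```

Areas with few observations (large sampling variance) lean on the model; well-observed areas keep close to their direct estimate, which is exactly why such predictors allow national figures to be broken down to small municipalities.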
Published Online: 15 Jun 2020 Page range: 275 - 296
Abstract
Contingency tables provide a convenient format to publish summary data from confidential survey and administrative records that capture a wide range of social and economic information. By their nature, contingency tables enable aggregation of potentially sensitive data, limiting disclosure of identifying information. Furthermore, censoring or perturbation can be used to desensitise low cell counts when they arise. However, access to detailed cross-classified tables for research is often restricted by data custodians when too many censored or perturbed cells are required to preserve privacy. In this article, we describe a framework for selecting and combining log-linear models when accessible data is restricted to overlapping marginal contingency tables. The approach is demonstrated through application to housing transition data from the Australian Census Longitudinal Data set provided by the Australian Bureau of Statistics.
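A building block behind fitting log-linear models when only marginal tables are accessible is iterative proportional fitting (IPF), which rescales a table until it reproduces the supplied margins. The minimal two-way sketch below illustrates that step only; it is not the authors' model selection and combination framework.

```python
import numpy as np

def ipf_two_way(seed, row_margin, col_margin, n_iter=100):
    """Iterative proportional fitting: rescale a seed table so that its
    row and column sums match the supplied margins (the margins must
    share the same grand total)."""
    table = np.asarray(seed, dtype=float).copy()
    row_margin = np.asarray(row_margin, dtype=float)
    col_margin = np.asarray(col_margin, dtype=float)
    for _ in range(n_iter):
        table *= (row_margin / table.sum(axis=1))[:, None]  # match row sums
        table *= (col_margin / table.sum(axis=0))[None, :]  # match column sums
    return table
```

Starting from a uniform seed, the fitted table is the independence table implied by the two margins; an informative seed preserves interaction structure while honouring the released margins.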
Published Online: 15 Jun 2020 Page range: 297 - 314
Abstract
The transformation of area aggregates between non-hierarchical area systems (administrative areas) is a standard problem in official statistics. For this problem, we present an approach based on kernel density estimates. The approach applies a modification of a stochastic expectation maximization algorithm proposed in the literature for the transformation of totals on rectangular areas to kernel density estimates. As a by-product of the routine, one obtains simulated geo-coordinates for each unit. With the help of these geo-coordinates, it is possible to calculate case numbers for any area system of interest. The proposed method is evaluated in a design-based simulation based on a close-to-reality simulated data set with known exact geo-coordinates. In the empirical part, the method is applied to figures on student residents in Berlin, Germany. These are known only at the level of ZIP codes, but they are needed for smaller administrative planning districts. Results for (a) student concentration areas and (b) temporal changes in the student residential areas between 2005 and 2015 are presented and discussed.
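The final step described above — once plausible geo-coordinates have been simulated for each unit, aggregates for any target area system follow by point-in-area counting — can be sketched as follows. For brevity the sketch draws uniform points inside rectangles rather than using the article's kernel-density-based SEM step, and all names are illustrative.

```python
import random

def simulate_coords(rect, n, rng):
    """Draw n uniform points inside an axis-aligned rectangle
    (x0, y0, x1, y1) -- a stand-in for the kernel-density-based
    simulation of geo-coordinates."""
    x0, y0, x1, y1 = rect
    return [(rng.uniform(x0, x1), rng.uniform(y0, y1)) for _ in range(n)]

def reaggregate(source_areas, target_areas, seed=1):
    """source_areas: list of (rect, count) pairs (e.g. ZIP codes).
    target_areas: dict mapping an area name to a contains(x, y)
    predicate (e.g. planning districts). Returns simulated case
    numbers for each target area."""
    rng = random.Random(seed)
    counts = {name: 0 for name in target_areas}
    for rect, n in source_areas:
        for x, y in simulate_coords(rect, n, rng):
            for name, contains in target_areas.items():
                if contains(x, y):
                    counts[name] += 1
                    break  # assign each unit to exactly one target area
    return counts
```

Because every simulated unit is assigned to exactly one target area, the grand total of the source system is preserved in the target system.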
Published Online: 15 Jun 2020 Page range: 315 - 338
Abstract
With the increase in social media usage, a huge new source of data has become available. Despite the enthusiasm linked to this revolution, one of the main outstanding criticisms of using these data is selection bias: the reference population is unknown. Nevertheless, many studies show evidence that these data constitute a valuable source because they are more timely and possess higher spatial granularity. We propose to adjust statistics based on Twitter data by anchoring them to reliable official statistics through a weighted, space-time, small area estimation model. As a by-product, the proposed method also stabilizes the social media indicators, a welcome property for official statistics. The method can be applied whenever official statistics exist at the proper level of granularity and social media usage within the population is known. As an example, we adjust a subjective well-being indicator of “working conditions” in Italy and combine it with relevant official statistics. The weights depend on broadband coverage and the Twitter rate at the province level, while the analysis is performed at the regional level. The resulting statistics are then compared with survey statistics on the “quality of job” at the macro-economic regional level, showing evidence of similar paths.
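The abstract does not give the model's exact form. Purely as an illustration of the anchoring idea, a coverage-based weight could blend the social-media signal with the official figure as sketched below; the weight construction is an assumption for illustration, not the authors' specification.

```python
def anchored_indicator(twitter_value, official_value, broadband_share, twitter_rate):
    """Hypothetical composite: trust the Twitter-based indicator in
    proportion to how well Twitter can cover the population, and fall
    back on the official statistic otherwise.
    broadband_share and twitter_rate are assumed shares in [0, 1]."""
    w = broadband_share * twitter_rate  # assumed coverage-based weight
    return w * twitter_value + (1.0 - w) * official_value
```

When coverage is complete the social-media value stands on its own; when coverage is zero the composite reduces to the official statistic, which is the stabilizing behaviour the abstract describes.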
Published Online: 15 Jun 2020 Page range: 339 - 360
Abstract
Respondent driven sampling (RDS) is a sampling method designed for hard-to-sample groups with strong social ties. RDS starts with a small number of arbitrarily selected participants (“seeds”). Seeds are issued recruitment coupons, which are used to recruit from their social networks. Waves of recruitment and data collection continue until a sufficient sample size is reached. Under the assumptions of random recruitment, with-replacement sampling, and a sufficient number of waves, the probability of selection for each participant converges to be proportional to their network size. With recruitment noncooperation, however, recruitment can end abruptly, causing operational difficulties with unstable sample sizes. Noncooperation may also void the Markovian recruitment assumptions, leading to selection bias. Here, we consider two RDS studies: one targeting Korean immigrants in Los Angeles and in Michigan, and another targeting persons who inject drugs in Southeast Michigan. We explore predictors of coupon redemption, associations between recruiters and recruits, and details of the recruitment dynamics. While no consistent predictors of noncooperation were found, there was evidence that coupon redemption by targeted recruits was more common among those who shared social bonds with their recruiters, suggesting that noncooperation is more a matter of recruits declining to participate than of recruiters failing to distribute coupons.
Published Online: 15 Jun 2020 Page range: 361 - 378
Abstract
In March 2017, the United Nations (UN) Statistical Commission adopted a measurement framework for the UN Agenda 2030 for Sustainable Development, comprising 232 indicators designed to measure the 17 Sustainable Development Goals (SDGs) and their respective 169 targets. The scope of this measurement framework is so ambitious it led Mogens Lykketoft, President of the seventieth session of the UN General Assembly, to describe it as an ‘unprecedented statistical challenge’.
Naturally, with a programme of this magnitude, there will be foreseen and unforeseen challenges and consequences. This article outlines some of the key differences between the Millennium Development Goals and the SDGs, details some of the measurement challenges involved in compiling the SDG indicators, and examines, from a broad political economy perspective, some of the unanticipated consequences arising from the mechanisms put in place to measure progress.
Published Online: 15 Jun 2020 Page range: 379 - 409
Abstract
Surveys measuring the same concept using the same measure on the same population at the same point in time should yield highly similar results. If this is not the case, it is a strong sign of a lack of reliability, resulting in non-comparable data across surveys. Looking at the education variable, previous research has identified inconsistencies in the distributions of harmonised education variables, coded using the International Standard Classification of Education (ISCED), across surveys within the same countries and years. These inconsistencies are commonly explained by differences in measurement, especially in the response categories of the education question, and in the harmonisation applied when classifying country-specific education categories into ISCED. However, other methodological characteristics of the surveys, which we regard as ‘containers’ for several characteristics, may also contribute to this finding. We compare the education distributions of nine cross-national surveys with those of the European Union Labour Force Survey (EU-LFS), which is used as a benchmark. The study analyses 15 survey characteristics to better explain the inconsistencies. The results confirm a predominant effect of the measurement instrument and harmonisation. Different sampling designs also explain inconsistencies, but to a lesser degree. Finally, we discuss the results and limitations of the study and provide ideas for improving data comparability.
Published Online: 15 Jun 2020 Page range: 411 - 434
Abstract
In 2014, many innovations were introduced in the Italian Household Budget Survey (HBS) in response to changes in European recommendations and purchasing behaviours, and to an increased demand for information in the context of social and economic research. New instruments and techniques have been introduced, together with more accurate methodologies, with the aim of improving the survey by both reducing the bias and variance of survey estimates and supplying estimates for additional subpopulations and variables. Because the former and the new HBS were conducted in parallel in 2013, it has been possible to evaluate the effects of the abovementioned changes on consumption expenditure and inequality estimates, and to compare the sample representativeness of selected subpopulations in both surveys.
Published Online: 15 Jun 2020 Page range: 435 - 458
Abstract
We consider longitudinal data and the problem of predicting subpopulation (domain) characteristics that can be written as a linear combination of the variable of interest, including cases of small or zero sample sizes in the domain and time period of interest. We consider the empirical version of the predictor proposed by Royall (1976), showing that it is a generalization of the empirical version of the predictor presented by Henderson (1950). We propose a parametric bootstrap MSE estimator of the predictor, prove its asymptotic unbiasedness, and derive the order of its bias. These considerations are supported by Monte Carlo simulation analyses comparing its accuracy (not only the bias) with that of other MSE estimators, including jackknife and weighted jackknife MSE estimators that we adapt for the considered predictor.
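The general parametric bootstrap MSE recipe — fit the model, simulate new data from the fitted model, refit, and average the squared gap between the re-estimated predictor and the bootstrap "truth" — can be sketched generically. The interface below is illustrative only; the article's estimator is tailored to the longitudinal mixed model and its bias order is established analytically.

```python
import numpy as np

def parametric_bootstrap_mse(fit, simulate, predict, data, n_boot=200, seed=0):
    """Generic parametric bootstrap MSE estimator for a predictor
    (hypothetical interface). fit(data) -> fitted parameters;
    simulate(params, rng) -> (bootstrap sample, true target value under
    the bootstrap model); predict(params) -> predictor of the target
    computed from fitted parameters."""
    rng = np.random.default_rng(seed)
    params = fit(data)
    sq_errors = []
    for _ in range(n_boot):
        boot_data, theta_true = simulate(params, rng)
        theta_hat = predict(fit(boot_data))  # predictor refitted on bootstrap data
        sq_errors.append((theta_hat - theta_true) ** 2)
    return float(np.mean(sq_errors))
```

For example, with a normal mean model (fit = sample mean, target = population mean), the bootstrap MSE estimate approximates the true MSE of the sample mean, sigma^2 / n.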