Published Online: 12 Sep 2022 Page range: 673 - 708
Abstract
The demand for small area estimates can conflict with the objective of producing a multi-purpose data set. We use donor imputation to construct a database that supports small area estimation. Appropriately weighted sums of observed and imputed values produce model-based small area estimates. We develop imputation procedures for both unit-level and area-level models. For area-level models, we restrict attention to linear models. We assume a single vector of covariates is used for a possibly multivariate response. Each record in the imputed data set has complete data, an estimation weight, and a set of replicate weights for mean square error (MSE) estimation. We compare imputation procedures based on area-level models to those based on unit-level models through simulation. We apply the methods to the Iowa Seat-Belt Use Survey, a survey designed to produce state-level estimates of the proportions of vehicle occupants who wear a seat-belt. We develop a bivariate unit-level model for prediction of county-level proportions of belted drivers and total occupants. We impute values for the proportions of belted drivers and vehicle occupants onto the full population of road segments in the sampling frame. The resulting imputed data set returns approximations for the county-level predictors based on the bivariate model.
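As an illustration of the weighted-sum idea, the following minimal Python sketch computes area-level estimates from an imputed data set; the column names and values are hypothetical placeholders, and this is not the authors' exact estimator.

```python
import pandas as pd

# Minimal sketch: a small area estimate as a weighted sum of observed and
# imputed values. Column names ("county", "y", "weight") are hypothetical;
# "y" holds the observed value for sampled units and the donor-imputed
# value for non-sampled units, and "weight" is the estimation weight.
frame = pd.DataFrame({
    "county": ["A", "A", "B", "B"],
    "y":      [0.91, 0.88, 0.79, 0.83],   # observed or imputed proportions
    "weight": [10.0, 12.0, 8.0, 9.0],
})

def area_estimate(df):
    # Weighted mean over all (observed + imputed) records in the area.
    return (df["y"] * df["weight"]).sum() / df["weight"].sum()

print(frame.groupby("county").apply(area_estimate))
```

MSE estimation would repeat the same computation with each set of replicate weights in place of the estimation weight.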
Published Online: 12 Sep 2022 Page range: 709 - 732
Abstract
In the production of US agricultural official statistics, certain inequality and benchmarking constraints must be satisfied. For example, available administrative data provide an accurate lower bound for the county-level estimates of planted acres produced by the U.S. Department of Agriculture’s (USDA) National Agricultural Statistics Service (NASS). In addition, the county-level estimates within a state need to add up to the state-level estimates. A sub-area hierarchical Bayesian model with inequality constraints is discussed that produces county-level estimates satisfying these important relationships, along with associated measures of uncertainty. This model combines County Agricultural Production Survey (CAPS) data with administrative data. Inequality constraints add complexity to fitting the model and present a computational challenge to a full Bayesian approach. To evaluate the inclusion of these constraints, models with and without inequality constraints were compared using 2014 corn planted acres estimates for three states. The model with inequality constraints improves the accuracy and precision of county-level estimates while preserving the required relationships.
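To make the two constraint types concrete, here is a minimal numerical sketch (not the authors' model) showing one way county values can respect administrative lower bounds while benchmarking to a state total; the bounds, total, and distribution used are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative sketch: enforce an administrative lower bound b_c on each
# county estimate and benchmark the county estimates to a known state
# total T. Writing theta_c = b_c + e_c with e_c >= 0, both constraints
# hold if the nonnegative excesses e_c are rescaled to sum to T - sum(b).
b = np.array([100.0, 250.0, 60.0])       # hypothetical admin lower bounds
T = 500.0                                 # hypothetical state total

e = rng.gamma(shape=2.0, scale=10.0, size=b.size)  # nonnegative draws
e *= (T - b.sum()) / e.sum()              # benchmark the excesses
theta = b + e

assert np.all(theta >= b) and np.isclose(theta.sum(), T)
print(theta)
```

In a full Bayesian treatment, draws like these would come from a constrained posterior rather than a fixed gamma distribution, which is where the computational challenge arises.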
Published Online: 12 Sep 2022 Page range: 733 - 765
Abstract
In this article, we present a new approach based on dynamic factor models (DFMs) to produce accurate nowcasts of the annual percentage variation of the Mexican Global Economic Activity Indicator (IGAE), the variable commonly used as an approximation of monthly GDP. The procedure exploits the contemporaneous relationship of timely traditional macroeconomic time series and nontraditional variables, such as Google Trends, with the IGAE. We evaluate the performance of the approach in a pseudo real-time framework, which includes the COVID-19 pandemic, and conclude that the procedure obtains accurate one- and two-step-ahead estimates, above all owing to the use of Google Trends. Another contribution for economic nowcasting is that the approach makes it possible to identify the key variables in the DFM by estimating confidence intervals for the factor loadings, and hence to evaluate the statistical significance of the variables in the model. The approach is used in official statistics to obtain preliminary and accurate estimates of the IGAE up to 40 days before the official data release.
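A minimal sketch of a DFM nowcast, using the statsmodels DynamicFactor class on simulated placeholder data; the series names and specification are illustrative assumptions, not the authors' exact setup.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.dynamic_factor import DynamicFactor

# Minimal DFM nowcasting sketch. The columns stand in for the target (an
# IGAE-like growth rate) and two standardized monthly indicators, e.g., a
# macro series and a Google Trends series; all data here are simulated.
rng = np.random.default_rng(0)
f = rng.standard_normal(120).cumsum() * 0.1          # latent factor path
data = pd.DataFrame({
    "igae_growth": f + rng.standard_normal(120) * 0.3,
    "macro_x":     f + rng.standard_normal(120) * 0.5,
    "trends_x":    f + rng.standard_normal(120) * 0.5,
}, index=pd.period_range("2010-01", periods=120, freq="M"))

mod = DynamicFactor(data, k_factors=1, factor_order=1)
res = mod.fit(disp=False)
print(res.forecast(steps=2)["igae_growth"])   # one- and two-step nowcasts
```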
Published Online: 12 Sep 2022 Page range: 767 - 792
Abstract
Many countries conduct a full census survey to report official population statistics. As no census survey ever achieves a 100% response rate, a post-enumeration survey (PES) is usually conducted and analysed to assess census coverage and produce official population estimates by geographic area and demographic attributes. Given the usually small size of a PES, direct estimation at the desired level of disaggregation is not feasible. Design-based estimation with sampling weight adjustment is a commonly used method but is difficult to implement when survey nonresponse patterns cannot be fully documented and population benchmarks are not available. We overcome these limitations with a fully model-based Bayesian approach applied to the New Zealand PES. Although theory for the Bayesian treatment of complex surveys has been described, published applications of individual-level Bayesian models for complex survey data remain scarce. We provide such an application through a case study of the 2018 census and PES surveys. We implement a multilevel model that accounts for the complex design of the PES. We then illustrate how mixed posterior predictive checking and cross-validation can assist with model building and model selection. Finally, we discuss potential methodological improvements to the model and potential solutions to mitigate dependence between the two surveys.
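For readers unfamiliar with individual-level Bayesian modeling of survey data, a schematic multilevel coverage model in PyMC might look as follows; the variables, priors, and simulated data are illustrative placeholders, not the authors' actual model.

```python
import numpy as np
import pymc as pm

# Schematic multilevel model in the spirit of a model-based PES analysis.
# y indicates whether a PES person was counted in the census; "area" is a
# grouping factor standing in for the survey's design structure.
rng = np.random.default_rng(1)
n, n_areas = 500, 10
area = rng.integers(0, n_areas, size=n)
y = rng.binomial(1, 0.9, size=n)

with pm.Model() as model:
    mu = pm.Normal("mu", 0.0, 1.5)                 # overall coverage (logit scale)
    sigma = pm.HalfNormal("sigma", 1.0)            # between-area spread
    u = pm.Normal("u", 0.0, sigma, shape=n_areas)  # area random effects
    p = pm.math.invlogit(mu + u[area])
    pm.Bernoulli("counted", p=p, observed=y)
    idata = pm.sample(1000, tune=1000, chains=2)
```

Posterior predictive checking and cross-validation, as discussed in the abstract, would then operate on `idata`.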
Published Online: 12 Sep 2022 Page range: 793 - 822
Abstract
In this article, we evaluate how the analysis of open-ended probes in an online cognitive interview can serve as a metric to identify cases that should be excluded due to disingenuous responses by ineligible respondents. We analyze data collected in 2019 via an online opt-in panel in English and Spanish to pretest a public opinion questionnaire (n = 265 in English and 199 in Spanish). We find that analyzing open-ended probes allowed us to flag cases completed by respondents who demonstrated problematic behaviors (e.g., answering many probes with repetitive textual patterns or by typing random characters), as well as to identify cases completed by ineligible respondents posing as eligible respondents (i.e., non-Spanish-speakers posing as Spanish-speakers). These findings indicate that data collected for multilingual pretesting research using online opt-in panels likely require additional evaluations of data quality. We find that open-ended probes can help determine which cases should be replaced when conducting pretesting using opt-in panels. We argue that open-ended probes in online cognitive interviews, while more time-consuming and expensive to analyze than closed-ended questions, serve as a valuable method of verifying response quality and respondent eligibility, particularly for researchers conducting multilingual surveys with online opt-in panels.
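Simple automated heuristics of the kind described (repetitive answers, random-character strings) can be sketched in a few lines of Python; the thresholds and rules below are hypothetical, not the authors' coding scheme.

```python
# Illustrative flags for low-quality open-ended probe responses: the same
# answer repeated across probes, or gibberish-like strings with few
# distinct characters relative to their length.
def looks_gibberish(text, max_distinct_ratio=0.4):
    cleaned = "".join(text.lower().split())
    return len(cleaned) >= 8 and len(set(cleaned)) / len(cleaned) < max_distinct_ratio

def flag_respondent(probe_answers):
    repeated = len(set(a.strip().lower() for a in probe_answers)) == 1
    gibberish = any(looks_gibberish(a) for a in probe_answers)
    return repeated or gibberish

print(flag_respondent(["asdfasdfasdf", "good question", "no comment"]))  # True
```

In practice such flags would only triage cases for the kind of human review the article describes, not replace it.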
Published Online: 12 Sep 2022 Page range: 823 - 845
Abstract
Earning inequality in India has hindered the underprivileged in accessing elementary needs such as health and education. The Periodic Labour Force Survey conducted by the National Statistical Office of India generates estimates of earning status at the national and state levels for rural and urban sectors separately. However, owing to small sample sizes, these surveys cannot generate reliable estimates at the micro level, viz. district or block. Thus, in the absence of district-level estimates, analysis of earning inequality is restricted to the national and state levels, and the existing variability in disaggregate-level earning distribution often goes unnoticed. This article describes a multivariate small area estimation method to generate precise and representative district-wise estimates of earning distribution in rural and urban areas of the Indian state of Bihar by linking Periodic Labour Force Survey data of 2018–2019 with 2011 Population Census data of India. These disaggregate-level estimates and the spatial mapping of earning distribution are essential for measuring and monitoring the goal of reduced inequalities under the 2030 Agenda for Sustainable Development. They are expected to offer insightful information to decision-makers and policy experts for identifying the areas demanding more attention.
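The abstract does not spell out the model, but a standard multivariate Fay-Herriot (area-level) formulation, of the kind typically used when linking survey estimates with census covariates at the district level, can be written as follows; the article's exact specification may differ.

```latex
% Multivariate Fay-Herriot sketch for districts d = 1, ..., D:
% a vector of direct survey estimates equals the true vector plus
% sampling error, and the true vector follows a linear model in
% census covariates with a district random effect.
\hat{\boldsymbol{\theta}}_d = \boldsymbol{\theta}_d + \mathbf{e}_d,
\qquad
\boldsymbol{\theta}_d = \mathbf{X}_d \boldsymbol{\beta} + \mathbf{v}_d,
\qquad
\mathbf{e}_d \sim N(\mathbf{0}, \boldsymbol{\Psi}_d), \quad
\mathbf{v}_d \sim N(\mathbf{0}, \boldsymbol{\Sigma}_v),
```

where the sampling covariance matrices $\boldsymbol{\Psi}_d$ are treated as known and $\boldsymbol{\Sigma}_v$ captures the correlation among the components of the earning distribution that the multivariate approach exploits.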
Published Online: 12 Sep 2022 Page range: 847 - 873
Abstract
Artificial neural networks (ANNs) have been the catalyst for numerous advances in a variety of fields and disciplines in recent years. Their impact on economics, however, has been comparatively muted. One type of ANN, the long short-term memory network (LSTM), is particularly well-suited to economic time series. Here, the architecture’s performance and characteristics are evaluated in comparison with the dynamic factor model (DFM), currently a popular choice in the field of economic nowcasting. LSTMs are found to produce superior results to DFMs in the nowcasting of three separate variables: global merchandise export values and volumes, and global services exports. Further advantages include their ability to handle large numbers of input features in a variety of time frequencies. A disadvantage is the stochastic nature of outputs, common to all ANNs. To facilitate continued applied research with the methodology by avoiding the need for any knowledge of deep-learning libraries, an accompanying Python library (Hopp 2021a) was developed using PyTorch. The library is also available in R, MATLAB, and Julia.
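As a sketch of the architecture type being compared, a minimal LSTM nowcaster in PyTorch is shown below; this is not the accompanying library's code, and the dimensions are arbitrary placeholders.

```python
import torch
from torch import nn

# Minimal LSTM nowcaster: consume a window of n_timesteps observations of
# n_features indicators and emit a single nowcast per window.
class LSTMNowcaster(nn.Module):
    def __init__(self, n_features, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, n_timesteps, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # read off the last time step

model = LSTMNowcaster(n_features=10)
x = torch.randn(8, 12, 10)                # 8 windows of 12 months, 10 features
print(model(x).shape)                     # torch.Size([8, 1])
```

The stochasticity the abstract mentions comes from random weight initialization and training, which is why repeated runs of such a model yield slightly different nowcasts.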
Published Online: 12 Sep 2022 Page range: 875 - 900
Abstract
Along with the rapid emergence of web surveys to address time-sensitive priority topics, various propensity score (PS)-based adjustment methods have been developed to improve population representativeness for nonprobability- or probability-sampled web surveys subject to selection bias. Conventional PS-based methods construct pseudo-weights for web samples using a higher-quality reference probability sample. The bias reduction, however, depends on the outcome and on the variables collected in both the web and reference samples. A central issue is identifying variables for inclusion in PS-adjustment. In this article, the directed acyclic graph (DAG), a common graphical tool in causal studies that is largely under-utilized in survey research, is used to examine and elucidate how different types of variables in the causal pathways impact the performance of PS-adjustment. While past literature generally recommends including all variables, our research demonstrates that only certain types of variables are needed in PS-adjustment. Our research is illustrated by NCHS’ Research and Development Survey, a probability-sampled web survey with potential selection bias, PS-adjusted to the National Health Interview Survey, to estimate U.S. asthma prevalence. Findings in this article can be used by National Statistics Offices to design questionnaires with variables that improve web samples’ population representativeness and to release more timely and accurate estimates for priority topics.
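One common pseudo-weight construction of the kind described fits a membership model on the stacked web and reference samples; this sketch is illustrative only, the article's estimator and covariates may differ, and the data are simulated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative pseudo-weighting: stack the web sample and the reference
# probability sample, model membership in the web sample, and weight web
# respondents by the inverse odds of membership. (The reference sample's
# design weights, omitted here, would enter the fit via sample_weight.)
rng = np.random.default_rng(2)
X_web = rng.standard_normal((300, 3))          # covariates, web sample
X_ref = rng.standard_normal((700, 3)) + 0.2    # covariates, reference sample

X = np.vstack([X_web, X_ref])
z = np.r_[np.ones(len(X_web)), np.zeros(len(X_ref))]  # 1 = web sample

ps = LogisticRegression().fit(X, z).predict_proba(X_web)[:, 1]
pseudo_weight = (1 - ps) / ps                  # inverse-odds pseudo-weights
print(pseudo_weight[:5])
```

The article's DAG analysis concerns which columns of X belong in this membership model in the first place.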
Published Online: 12 Sep 2022 Page range: 901 - 928
Abstract
When random effects are correlated with survey sample design variables, the usual approach of employing individual survey weights (constructed to be inversely proportional to the unit survey inclusion probabilities) to form a pseudo-likelihood no longer produces asymptotically unbiased inference. We construct a weight-exponentiated formulation for the random effects distribution that achieves approximately unbiased inference for the generating hyperparameters of the random effects. We contrast our approach with frequentist methods that rely on numerical integration to reveal that the pseudo-Bayesian method achieves both unbiased estimation with respect to the sampling design distribution and consistency with respect to the population generating distribution. Our simulations and a real-data example for a survey of business establishments demonstrate the utility of our approach across different modeling formulations and sampling designs. This work serves as a capstone for recent developmental efforts that combine traditional survey estimation approaches with the Bayesian modeling paradigm and provides a bridge across the two rich but disparate sub-fields.
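Schematically, a weight-exponentiated pseudo-posterior for a two-level model takes the following form; this is our notational sketch of the general technique, not the article's exact display.

```latex
% Pseudo-posterior for hyperparameters lambda and random effects u, with
% unit-level weights w_{i|j} exponentiating the data likelihood and
% group-level weights w_j exponentiating the random effects distribution:
\pi(\boldsymbol{\lambda}, \mathbf{u} \mid \mathbf{y})
\propto
\prod_{j=1}^{J}
\Bigl[ \prod_{i \in S_j} f(y_{ij} \mid u_j, \boldsymbol{\lambda})^{w_{i|j}} \Bigr]
\, f(u_j \mid \boldsymbol{\lambda})^{w_j}
\; \pi(\boldsymbol{\lambda}),
```

where $S_j$ is the set of sampled units in sampled group $j$; exponentiating $f(u_j \mid \boldsymbol{\lambda})$ by $w_j$ is the step that corrects for informative sampling of the groups.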