This research outlines the process of building a sample frame of US SMEs. The method starts with a list of patenting organizations and defines the boundaries of the population and subsequent frame using free or low-cost data sources, including search engines and websites. Generating high-quality data is of key importance throughout the process of building the frame and the subsequent data collection; at the same time, there is too much data to curate by hand. Consequently, we turn to machine learning and other computational methods to apply a number of data matching, filtering, and cleaning routines. The results show that it is possible to generate a sample frame of innovative SMEs with reasonable accuracy for use in subsequent research: our method provides data for 79% of the frame. We discuss implications for future work for researchers and national statistical institutes (NSIs) alike and contend that the challenges associated with big data collections require not only new skillsets but also a new mode of collaboration.
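To make the matching step concrete, the following is a minimal sketch of one plausible name-matching routine using Python's standard-library difflib; the normalization rules, similarity threshold, and example names are our own illustrative assumptions, not the authors' pipeline.

```python
# Illustrative fuzzy matching of patent-assignee names to candidate
# registry entries. Suffix list and threshold are arbitrary assumptions.
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase, drop trailing punctuation, and strip common legal suffixes."""
    name = name.lower().strip().rstrip(".")
    for suffix in (" inc", " llc", " corp", " co", " ltd"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name

def best_match(assignee: str, candidates: list[str], threshold: float = 0.85):
    """Return the highest-similarity candidate above the threshold, else None."""
    scored = [
        (SequenceMatcher(None, normalize(assignee), normalize(c)).ratio(), c)
        for c in candidates
    ]
    score, match = max(scored)
    return match if score >= threshold else None

print(best_match("Acme Robotics Inc.", ["ACME Robotics LLC", "Apex Robots Co"]))
```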
This article presents a new method to reconcile directly and indirectly deseasonalized economic time series. The proposed technique uses a Combining Rule to merge, in an optimal manner, the directly deseasonalized aggregated series with its indirectly deseasonalized counterpart. The latter series is obtained by aggregating the seasonally adjusted disaggregates that compose the aggregated series. This procedure leads to adjusted disaggregates that satisfy Denton’s movement preservation principle relative to the originally deseasonalized disaggregates. First, we use as preliminary estimates the directly deseasonalized economic time series obtained by applying the X-13ARIMA-SEATS program at all disaggregation levels. Second, we contemporaneously reconcile the aforementioned seasonally adjusted disaggregates with their seasonally adjusted aggregate, using Vector Autoregressive models. Then, we evaluate the finite-sample performance of our solution via a Monte Carlo experiment that considers six Data Generating Processes that may occur in practice when users apply seasonal adjustment techniques. Finally, we present an empirical application to the Mexican Global Economic Indicator and its components. The results allow us to conclude that the suggested technique is appropriate for indirectly deseasonalizing economic time series, mainly because we impose the movement preservation condition on the preliminary estimates produced by a reliable seasonal adjustment procedure.
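For readers unfamiliar with the movement preservation principle, the following numpy sketch solves a textbook additive first-difference Denton problem under a contemporaneous aggregation constraint; it is a generic illustration, not the article's Combining Rule or VAR-based reconciliation.

```python
import numpy as np

def denton_reconcile(p, a):
    """Additive first-difference Denton adjustment.

    p : (K, T) preliminary (directly adjusted) disaggregate series
    a : (T,)   seasonally adjusted aggregate the columns must sum to
    Minimizes the squared first differences of (x - p) per series,
    subject to x.sum(axis=0) == a (movement preservation).
    """
    K, T = p.shape
    n = K * T
    # Per-series difference operator with an identity first row
    # (Denton's original D, which is invertible).
    D1 = np.eye(T) - np.eye(T, k=-1)
    D = np.kron(np.eye(K), D1)
    Q = D.T @ D
    # Contemporaneous aggregation constraint: sum over series at each t.
    A = np.kron(np.ones((1, K)), np.eye(T))            # (T, n)
    # KKT system for  min (x-p)' Q (x-p)  s.t.  A x = a
    kkt = np.block([[2 * Q, A.T], [A, np.zeros((T, T))]])
    rhs = np.concatenate([2 * Q @ p.ravel(), a])
    sol = np.linalg.solve(kkt, rhs)
    return sol[:n].reshape(K, T)

# Toy check: two disaggregates whose sum misses the aggregate slightly.
rng = np.random.default_rng(0)
p = rng.normal(10, 1, size=(2, 8))
a = p.sum(axis=0) + 0.5
x = denton_reconcile(p, a)
print(np.allclose(x.sum(axis=0), a))  # True: constraint holds exactly
```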
The U.S. Consumer Expenditure Interview Survey asks many filter questions to identify the items that households purchase. Each reported purchase triggers follow-up questions about the amount spent and other details. We test the hypothesis that respondents learn how the questionnaire is structured and underreport purchases in later waves to reduce the length of the interview. We analyze data from 10,416 four-wave respondents over two years of data collection. We find no evidence of decreasing data quality over time; instead, panel respondents tend to give higher quality responses in later waves. The results also hold for a larger set of two-wave respondents.
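The kind of wave-over-wave comparison this hypothesis implies can be sketched as follows; the data frame and column names are hypothetical, not the Consumer Expenditure Interview Survey schema.

```python
# Hypothetical check for learning effects: does the mean number of reported
# purchases (filter-question "yes" answers) decline across panel waves?
import pandas as pd

df = pd.DataFrame({
    "household_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "wave":         [1, 2, 3, 4, 1, 2, 3, 4],
    "n_purchases":  [12, 11, 13, 12, 9, 10, 9, 11],
})

by_wave = df.groupby("wave")["n_purchases"].agg(["mean", "std"])
print(by_wave)  # underreporting would show up as a falling mean
```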
Standard randomization-based inference conditions on the data in the population and makes inference with respect to the repeated sampling properties of the sampling indicators. In some settings these estimators can be quite unstable; Bayesian model-based approaches focus on the posterior predictive distribution of population quantities, potentially providing a better balance between bias correction and efficiency. Previous work in this area has focused on estimation of means and of linear and generalized linear regression parameters; these methods do not allow for general estimation of distributional functions such as quantiles or quantile regression parameters. Here we adapt an extended Dirichlet Process (DP) mixture model that allows the DP prior to be a mixture of DP random basis measures that are a function of covariates. These models allow many mixture components when necessary to accommodate the sample design, but can shrink to few components for more efficient estimation when the data allow. We provide an application to the estimation of relationships between serum dioxin levels and age in the US population, either at the mean level (via linear regression) or across the dioxin distribution (via quantile regression), using the National Health and Nutrition Examination Survey.
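As background, the following sketch draws mixture weights from a truncated stick-breaking construction of a standard DP prior; the covariate-dependent extension the article develops is more elaborate.

```python
# Minimal stick-breaking draw from a Dirichlet Process mixture prior
# (a generic DP illustration, not the article's covariate-dependent model).
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking_weights(alpha, n_atoms):
    """Truncated stick-breaking construction of DP mixture weights."""
    betas = rng.beta(1.0, alpha, size=n_atoms)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining

alpha = 2.0                              # concentration: larger -> more components
w = stick_breaking_weights(alpha, n_atoms=20)
atoms = rng.normal(0.0, 3.0, size=20)    # atoms from a N(0, 3^2) base measure
# Draw observations: pick an atom, add within-component noise.
z = rng.choice(20, size=1000, p=w / w.sum())
y = atoms[z] + rng.normal(0.0, 0.5, size=1000)
print(np.bincount(z, minlength=20)[:5])  # a few components tend to dominate
```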
When monitoring industrial processes, a Statistical Process Control tool, such as a multivariate Hotelling T2 chart, is frequently used to evaluate multiple quality characteristics. However, research into the use of T2 charts for survey fieldwork (essentially a production process in which data sets are produced by means of interviews) has been scant to date. In this study, using data from the eighth round of the European Social Survey in Belgium, we present a procedure for simultaneously monitoring six response quality indicators and identifying outliers: interviews with anomalous results. The procedure integrates Kernel Density Estimation (KDE) with a T2 chart, so that neither historical “in-control” data nor the assumption of a parametric distribution of the indicators is required. In total, 75 outliers (4.25%) are iteratively removed, resulting in an in-control data set containing 1,691 interviews. The outliers are mainly characterized by longer sequences of identical answers, a greater number of extreme answers, and, against expectation, a lower item nonresponse rate. The procedure is validated by means of ten-fold cross-validation and by comparison with the minimum covariance determinant algorithm as the criterion. By providing a method of obtaining in-control data, the present findings go some way toward enabling response quality to be monitored, problems to be identified, and rapid feedback to be provided during survey fieldwork.
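A generic version of the screening loop can be sketched as follows; it uses a chi-square control limit in place of the article's KDE-based limit, so it is only an approximation for illustration.

```python
# Generic multivariate Hotelling T2 screening with iterative outlier removal.
# The chi-square limit assumes approximate normality, unlike the KDE approach.
import numpy as np
from scipy.stats import chi2

def iterative_t2_screen(X, alpha=0.01, max_iter=50):
    """Repeatedly flag points whose T2 exceeds the chi-square control limit."""
    keep = np.ones(len(X), dtype=bool)
    limit = chi2.ppf(1 - alpha, df=X.shape[1])
    for _ in range(max_iter):
        mu = X[keep].mean(axis=0)
        S_inv = np.linalg.inv(np.cov(X[keep], rowvar=False))
        d = X - mu
        t2 = np.einsum("ij,jk,ik->i", d, S_inv, d)   # per-row T2 statistic
        new_keep = t2 <= limit
        if np.array_equal(new_keep, keep):           # fixed point reached
            break
        keep = new_keep
    return keep, t2

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 6))                        # six quality indicators
X[:10] += 4.0                                        # plant a few anomalies
keep, t2 = iterative_t2_screen(X)
print((~keep).sum(), "interviews flagged as outliers")
```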
In this article, we evaluate the viability of using big data produced by smart city systems to create new official statistics. We assess sixteen sources of urban transportation and environmental big data for Dublin, Ireland, that are published as open data or were made available to the project. These data were systematically explored through a process of data checking and wrangling, building tools to display and analyse the data, and evaluating them against 16 measures of their suitability: access; sustainability and reliability; transparency and interpretability; privacy; fidelity; cleanliness; completeness; spatial granularity; temporal granularity; spatial coverage; coherence; metadata availability; changes over time; standardisation; methodological transparency; and relevance. We assessed how the data could be used to produce key performance indicators and potential new official statistics. Our analysis reveals that, at present, only a limited set of smart city data is suitable for creating new official statistics, though other sources could be made suitable with changes to data management. If these new official statistics are to be realised, National Statistical Institutes need to work closely with the organisations generating the data to implement a robust set of procedures and standards that will produce consistent, long-term data sets.
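One simple way to operationalise such an assessment is a criterion-by-criterion scoring checklist, sketched below with invented source names and scores.

```python
# Toy scoring of candidate data sources against a subset of the suitability
# criteria listed above (source names and scores are invented).
CRITERIA = ["access", "sustainability", "privacy", "fidelity",
            "completeness", "spatial granularity", "temporal granularity"]

sources = {
    "bike_share_feed":     {"access": 2, "sustainability": 1, "privacy": 2,
                            "fidelity": 2, "completeness": 1,
                            "spatial granularity": 2, "temporal granularity": 2},
    "air_quality_sensors": {"access": 2, "sustainability": 2, "privacy": 2,
                            "fidelity": 1, "completeness": 1,
                            "spatial granularity": 1, "temporal granularity": 2},
}

for name, scores in sources.items():    # 0 = unsuitable, 2 = suitable
    total = sum(scores[c] for c in CRITERIA)
    print(f"{name}: {total}/{2 * len(CRITERIA)}")
```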
Advances in smartphone technology have given individuals access to near-continuous location tracking at a very precise level. As the Travel Diary Study, the backbone of mobility research, has continued to suffer declining response rates over the years, researchers are looking to these mobile devices to bridge the gap between self-report recall studies and a person’s underlying travel behavior. This article details an open-source application that collects real-time location data, which respondents may then annotate to provide a detailed travel diary. Results of the field test involving 674 participants are discussed, including technical performance, data quality, and response rate.
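As an illustration of the kind of data such an application collects, here is a minimal ping record and a dwell-based trip segmentation; the field names and five-minute threshold are assumptions, not the app's actual schema.

```python
# Illustrative location-ping record and gap-based trip segmentation.
from dataclasses import dataclass

@dataclass
class Ping:
    timestamp: float   # Unix seconds
    lat: float         # arbitrary coordinates for the example
    lon: float

def split_trips(pings: list[Ping], dwell_gap_s: float = 300.0):
    """Start a new trip whenever the gap between pings exceeds dwell_gap_s."""
    trips, current = [], [pings[0]]
    for prev, cur in zip(pings, pings[1:]):
        if cur.timestamp - prev.timestamp > dwell_gap_s:
            trips.append(current)
            current = []
        current.append(cur)
    trips.append(current)
    return trips

pings = [Ping(t, 40.7, -74.0) for t in (0, 60, 120, 1000, 1060)]
print([len(t) for t in split_trips(pings)])   # [3, 2]: two trips detected
```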
Within the context of the Sustainable Development Goals, progress towards Target 12.3 can be measured and monitored with the Food Loss Index. A major challenge is the lack of data, which has dictated many methodological decisions. The objective of this work is therefore to present a possible improvement to the modeling approach used by the Food and Agriculture Organization in estimating the annual percentage of food losses by country and commodity. Our proposal combines robust statistical techniques with strict adherence to the rules of official statistics. In particular, the case study focuses on cereal crops, which currently have the highest (yet incomplete) data coverage and allow for more ambitious modeling choices. Cereal data are available for 66 countries and 14 cereal commodities from 1991 to 2014. We use annual food loss, expressed as a percentage of production, by country and cereal commodity, as the response variable. The estimation work is twofold: it aims first at selecting the most important factors explaining losses worldwide, comparing two Bayesian model selection approaches, and then at predicting losses with a Beta regression model in a fully Bayesian framework.
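To illustrate the final modeling step, the following sketch fits a beta regression with a logit link by maximum likelihood on simulated data; it is a frequentist stand-in for the article's fully Bayesian model, with invented coefficients.

```python
# Minimal maximum-likelihood beta regression for a loss proportion in (0, 1).
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, gammaln

rng = np.random.default_rng(3)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one factor
mu_true = expit(X @ np.array([-2.0, 0.5]))               # mean loss share
phi_true = 30.0                                          # precision parameter
y = rng.beta(mu_true * phi_true, (1 - mu_true) * phi_true)

def neg_loglik(theta):
    """Negative beta log-likelihood with logit link; phi on the log scale."""
    beta, log_phi = theta[:-1], theta[-1]
    mu, phi = expit(X @ beta), np.exp(log_phi)
    a, b = mu * phi, (1 - mu) * phi
    return -np.sum(gammaln(phi) - gammaln(a) - gammaln(b)
                   + (a - 1) * np.log(y) + (b - 1) * np.log(1 - y))

fit = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
print(fit.x[:2], np.exp(fit.x[2]))   # estimates near (-2.0, 0.5) and 30
```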
Web questionnaires are increasingly used to complement traditional data collection in mixed-mode surveys. However, the use of web data raises concerns about whether web questionnaires lead to mode-specific measurement bias. We argue that the magnitude of measurement bias depends strongly on the content of a variable. Based on the Luxembourgish Labour Force Survey, we investigate differences between web and telephone data for objective (i.e., Employment Status) and subjective (i.e., Wage Adequacy and Job Satisfaction) variables. To assess whether differences in outcome variables are caused by sample composition or by mode-specific measurement bias, we apply coarsened exact matching, which approximates randomized experiments by reducing dissimilarities between the web and telephone samples. We select matching variables with a combination of automatic variable selection via random forests and a literature-driven selection. The results show that the objective variables are not affected by mode-specific measurement bias, but web participants report lower satisfaction levels on the subjective variables than telephone participants. Extensive supplementary analyses confirm our results. The present study supports the view that the impact of survey mode depends on the content of a survey and its variables.
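The core of coarsened exact matching can be sketched in a few lines: coarsen the covariates into bins, then retain only strata containing both modes; the variables and bin edges below are illustrative, not the study's matching set.

```python
# Bare-bones coarsened exact matching on simulated respondents.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "mode": rng.choice(["web", "telephone"], size=1000),
    "age": rng.integers(18, 80, size=1000),
    "education": rng.choice(["low", "mid", "high"], size=1000),
})

# Coarsen the continuous covariate, then keep strata with both modes present.
df["age_bin"] = pd.cut(df["age"], bins=[17, 30, 45, 60, 80])
df["matched"] = (df.groupby(["age_bin", "education"], observed=True)["mode"]
                   .transform(lambda m: m.nunique() == 2))
matched = df[df["matched"]]
print(f"{len(matched)} of {len(df)} respondents fall in matched strata")
```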
Generalised regression estimation allows one to make use of available auxiliary information in survey sampling. We develop three types of generalised regression estimator for situations where the auxiliary data cannot be matched perfectly to the sample units, so that the standard estimator is inapplicable. The inference remains design-based. Consistency of the proposed estimators either holds by construction or can be tested given the observed sample and links. Mean square errors can be estimated. A simulation study is used to explore the potential of the proposed estimators.
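For context, the textbook GREG estimator of a population total under perfect linkage looks as follows; the article's contribution addresses precisely the case where this sample-to-register matching fails.

```python
# Textbook generalised regression (GREG) estimator of a population total,
# shown with simulated data and perfect sample-to-register linkage.
import numpy as np

rng = np.random.default_rng(5)
N, n = 10_000, 500
x = rng.uniform(1, 10, size=N)              # auxiliary variable, known for all N
y = 3.0 * x + rng.normal(0, 2, size=N)      # study variable, observed in sample
s = rng.choice(N, size=n, replace=False)    # simple random sample
d = np.full(n, N / n)                       # design weights

# Weighted least squares on the sample, then the GREG bias correction:
#   t_hat = sum_U x_i' beta + sum_s d_i (y_i - x_i' beta)
X = np.column_stack([np.ones(n), x[s]])
beta = np.linalg.solve(X.T @ (d[:, None] * X), X.T @ (d * y[s]))
X_pop = np.column_stack([np.ones(N), x])
t_greg = (X_pop @ beta).sum() + np.sum(d * (y[s] - X @ beta))
print(t_greg, y.sum())   # GREG estimate vs. the true population total
```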