History of ETER and its political importance for research on higher education

Studies, analyses, and policy investigations about the positioning and characterization of education and research systems require data. Whenever we need data, we need a method for managing them, and in the Big Data era a crucial role is played by data quality. Therefore, the development of higher education policies and indicators needs data quality techniques to increase the value of data and improve the exploitation of the available data.

The availability of data and information about Higher Education Institutions (HEIs) is thus the first requirement for the development of empirical studies. The second relevant aspect is the quality of the available data and information. In higher education, we observe a kind of paradox. While we are living in the Big Data era, in which huge amounts of data are produced, stored in NoSQL databases, and analyzed at large scale, in this field relational databases are still used to organize the existing data and information, and cases of “little data or no data” at all (Borgman, 2015) are the norm.

Higher Education Systems are complex systems, and their assessment is complex too. The development of models of indicators or metrics for a quantitative assessment requires a comprehensive framework, which should include the specification of the underlying theory, methodology, and data properties. Models of metrics are necessary to assess the meaning, validity, and robustness of metrics (Daraio, 2017). Daraio and Glanzel (2016) identified the following critical issues: i) data quality issues (OECD, 2011), including completeness, validity, accuracy, consistency, availability, and timeliness; ii) comparability problems related to heterogeneous definitions of the variables, data collection practices, and databases; iii) lack of standardization; iv) lack of interoperability; v) lack of modularization; vi) problems of classification; vii) difficulties in the creation of concordance tables among different classification schemes; viii) problems and costs of extending the system; ix) problems and costs of updating the system.

The development of the European Tertiary Education Register (ETER) grew out of the recognition that, beyond the aggregated data at the country and regional level provided by EUROSTAT, there is an urgent need for information on individual HEIs and their individual profiles. On the one hand, New Public Management approaches to higher education governance (Capano, 2011; Ferlie et al., 1996) focused on “steering at a distance” and on transforming HEIs into strategic actors capable of developing their own profile and strategy (Bonaccorsi & Daraio, 2007a). On the other hand, empirical studies have shown that higher education systems are highly heterogeneous in the type and characteristics of HEIs (Daraio et al., 2011) and, therefore, analyses based on such aggregates might lead to incorrect conclusions. Moreover, the emergence of international rankings emphasized the importance of comparing institutions rather than countries; in that respect, while comparative analyses of publication output have been available for many years (van Raan, 2013), until recently very few analyses included both inputs and outputs of universities. Daraio, Bonaccorsi, and Simar (2015) proposed a methodological contribution that overcomes four main criticisms of university rankings: monodimensionality, lack of statistical robustness, dependence on university size and subject mix, and lack of consideration of the input–output structure. They illustrated their method on European university data and pointed out the importance of investing in data collection and integration for research and policymaking. Daraio and Bonaccorsi (2017), after summarizing the main criticisms of rankings and recent trends in indicator development, proposed an approach to overcome rankings based on the integration of multidimensional data in open platforms. More recently, Lepori, Geuna, and Mira (2019) compared European and US universities.

The development of ETER grew out of this recognition and from the objective of the European Commission to improve the transparency and accountability of higher education in Europe (European Commission, 2011). From the beginning, ETER was entrusted with two main functions: first, establishing a register of higher education institutions and, accordingly, identifying them and locating them in the European space; second, collecting statistical data on relevant dimensions of HEIs as identified by the scholarly literature in the field (Huisman et al., 2015). The first function raised complex issues of delimiting higher education and defining inclusion and exclusion criteria, which turned out to be largely conventional (Lepori & Bonaccorsi, 2013). The second entailed complex work on addressing comparability problems between national systems (Bonaccorsi et al., 2007); while ETER could build on standardization work by EUROSTAT for what concerns students and graduates (UNESCO, OECD, Eurostat – UOE, 2013), the project had to work out its own definitions for what concerns finances and staff data, as well as suitable mappings from (heterogeneous) national classifications.

The establishment of ETER, however, took more than a decade because of the complexity of the European statistical system, which is by and large composed of different national statistical systems with their own specificities (Lepori & Bonaccorsi, 2013), and because of the lack of a suitable institutional framework, as the option of managing ETER within EUROSTAT was discarded because of practical and legal constraints.

An important role in the success of ETER was played by a pioneering European research project, called AQUAMETH, which integrated for the first time comparable data on six European countries (Italy, Norway, Portugal, Spain, Switzerland, and the UK), showing the feasibility and interest of this data integration for research analysis (Bonaccorsi & Daraio, 2007b). Williams (2008), in his review of Bonaccorsi and Daraio (2007c) in the London Review of Education, highlighted the importance of such data for econometric analysis and, in particular, for comparing the efficiency of universities across Europe. He wrote: “the main intention of the book is to use these data to undertake institution-level cross-national econometric analyses of the efficiency of universities”; “Their analyses will serve the purpose of showing serious mathematical economists in Europe that their higher education systems are potentially a fruitful subject of study and are beginning to produce data that are worth serious analytical attention”; “this book and the AQUAMETH studies that underpin it can serve an important proselytising function. It deserves to be widely read by serious higher education researchers” (Williams, 2008). Daraio (2018, 2019) recently considered the important role of the availability and quality of institution-level data for econometric analysis.

After the AQUAMETH project, the European Commission launched in 2011 a large-scale pilot, called EUMIDA, which provided for the first time a complete mapping of European higher education and proved the feasibility of a large-scale data collection (Niederl et al., 2014). From 2013 to 2019, ETER was established as a regular (yearly) data collection on European higher education, with the aim of reaching a level of quality and completeness comparable to the US Integrated Postsecondary Education Data System, IPEDS (https://nces.ed.gov/ipeds/). The development of ETER entailed the consolidation of the methodology and definitions inherited from the pilot (Lepori et al., 2015), the introduction of a set of procedures for collecting the data and verifying their quality and, finally, the programming of a database to manage the data collection and host the data, as well as a public website where users can search and download the data (www.eter-project.com).

Aim of the paper

The main objective of this paper is to describe the flexible approach developed for monitoring the data quality of ETER, to illustrate its functioning, and to highlight the main challenges that still have to be faced. More specifically, we focus on the data quality checks that help to identify outliers and extreme observations and to detect ontological inconsistencies not described in the available metadata. We also aim to raise awareness among users of institutional data about the importance of data quality issues for a correct interpretation of the results, and to show the functioning of the proposed approach, which can be easily adapted, mutatis mutandis, to other complex institutional databases characterized by a high heterogeneity of their units of analysis.

The paper is organized as follows. Section 3 provides an outline of the European Tertiary Education Register information system. Section 4 introduces the current data quality approach developed for ETER and its management, taking into account the peculiarities of the ETER data collection. The methodology developed for the multiannual and cross-sectional checks is then explained in Section 5. After that, in Section 6, we describe the results obtained by applying the proposed approach to the latest available version of the ETER database. Finally, in the concluding section, we outline the strengths of the proposed approach and its potential applicability to other databases, as well as existing challenges and possible extensions.

An outline of the European Tertiary Education Register

ETER is a database of microdata on Higher Education Institutions (HEIs) in Europe, concerning their basic characteristics and geographical location, staff, finances, education, and research activities. ETER includes the following main groups of variables:

Institutional descriptors and geographical information on the included HEIs.

Data on students and graduates, including breakdowns by International Standard Classification of Education (ISCED-2011) level, gender, citizenship, mobility, and field of education.

Data on research, including PhD students and graduates, as well as R&D expenditure and participation in the European Framework Programmes for Research and Innovation.

Financial data: expenditures and revenues of the HEI.

Staff data (academic and non-academic), including some breakdowns by gender, citizenship, and field.

When compared with the data provided by education and R&D statistics at EUROSTAT, ETER includes very similar variables and breakdowns for students and graduates, since ETER readily adopted the definitions from the UOE manual on education statistics; however, data are provided at the HEI level rather than as national aggregates.

ETER provides substantial additional information concerning the other dimensions: descriptors are of the utmost importance in order to characterize types of HEIs and their history, while geographical information allows for an analysis of the distribution of HEI activities across the European space. ETER also provides more detailed information on expenditures and revenues, including an important breakdown of revenues by core budget and third-party funds, which is not foreseen in education statistics. Additional data have also been collected concerning staff, including the number of full professors and breakdowns by gender and citizenship.

The ETER database is targeted to include 37 countries: the 27 EU Member States, plus the UK, the EFTA countries (CH, IS, LI, NO), and five EU candidate countries (AL, ME, MK, RS, TR). In principle, ETER data are provided by National Statistical Authorities (NSAs), Higher Education Ministries, or Higher Education Agencies, based on national statistical databases or higher education information systems, with few exceptions. Descriptors and geographical information are mostly collected by the ETER consortium. ETER data have been collected for six years (2011–2016).

The ETER database includes 3,198 unique HEIs over all years. For the academic year 2015/2016, 22.1 million undergraduate and graduate students and around 688 thousand PhD students are recorded in ETER.

Data quality is a relevant issue for any data collection and an even greater challenge for multi-source microdata collection processes such as ETER. A basic but very important dimension of data quality is completeness, which evaluates the share of missing values in the considered dataset. The current ETER dataset has an overall completeness index of 63%, meaning that around 37% of the values are missing or confidential.
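
As a minimal sketch (in Python) of how such a completeness index can be computed, the snippet below counts the share of cells that carry an actual value. The codes used for missing and confidential cells are illustrative assumptions, not ETER's official flags (see Appendix 1 for those).

```python
import pandas as pd

# Assumed encodings for unavailable cells ("m" = missing, "c" = confidential);
# the real ETER flags are documented in Appendix 1.
NON_AVAILABLE = {"m", "c"}

def completeness_index(df: pd.DataFrame) -> float:
    """Share of cells carrying an actual value (neither missing nor confidential)."""
    unavailable = df.isna().sum().sum() + df.isin(NON_AVAILABLE).sum().sum()
    return 1 - unavailable / df.size

# Toy example: two HEIs, three variables.
toy = pd.DataFrame({
    "students": [12000, 3500],
    "staff_fte": ["m", 420.0],       # missing value
    "expenditure": ["c", 58.2e6],    # confidential value
})
print(f"{completeness_index(toy):.0%}")  # 67% here; ETER reports 63% overall
```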

The lower completeness observed is due to the inclusion of some countries for which limited data have been collected, namely Albania (AL), Iceland (IS), the Republic of North Macedonia (MK), Montenegro (ME), and Turkey (TR), or for which only descriptors and geographical information are available, as is the case for the French part of Belgium (BE) and Romania (RO). In general terms, the level of completeness also varies largely by country (see Table 1). There are 10 countries for which completeness is 90% or more, including Austria (AT), Switzerland (CH), Cyprus (CY), Germany (DE), Ireland (IE), Liechtenstein (LI), Malta (MT), Portugal (PT), Sweden (SE), and the UK. For some countries, such as Italy (IT) and Poland (PL), data are largely complete for universities, but information is missing for other institutions (particularly staff and financial data).

Table 1. Completeness of data by country in the ETER Database.

Completeness (2011–2016)

| Country | Average Completeness | Min | Max | Range |
|---|---|---|---|---|
| High level of completeness | | | | |
| Switzerland (CH) | 0.99 | 0.93 | 1.00 | 0.06 |
| Liechtenstein (LI) | 0.98 | 0.98 | 0.99 | 0.01 |
| Germany (DE) | 0.97 | 0.23 | 0.99 | 0.76 |
| United Kingdom (UK) | 0.96 | 0.13 | 1.00 | 0.87 |
| Sweden (SE) | 0.95 | 0.29 | 1.00 | 0.71 |
| Portugal (PT) | 0.92 | 0.23 | 0.97 | 0.74 |
| Malta (MT) | 0.92 | 0.83 | 0.95 | 0.13 |
| Cyprus (CY) | 0.91 | 0.49 | 0.99 | 0.51 |
| Ireland (IE) | 0.90 | 0.80 | 0.95 | 0.15 |
| Austria (AT) | 0.90 | 0.85 | 0.96 | 0.11 |
| Medium-High level of completeness | | | | |
| Spain (ES) | 0.87 | 0.81 | 0.94 | 0.13 |
| Estonia (EE) | 0.86 | 0.85 | 0.99 | 0.14 |
| Finland (FI) | 0.85 | 0.60 | 0.95 | 0.35 |
| Norway (NO) | 0.84 | 0.13 | 0.93 | 0.80 |
| Bulgaria (BG) | 0.84 | 0.83 | 0.87 | 0.04 |
| Slovakia (SK) | 0.83 | 0.76 | 0.88 | 0.12 |
| Lithuania (LT) | 0.81 | 0.24 | 0.98 | 0.74 |
| Italy (IT) | 0.80 | 0.36 | 0.92 | 0.56 |
| Latvia (LV) | 0.79 | 0.12 | 0.90 | 0.79 |
| Czech Republic (CZ) | 0.78 | 0.23 | 0.90 | 0.67 |
| Poland (PL) | 0.77 | 0.24 | 0.89 | 0.65 |
| Medium level of completeness | | | | |
| Hungary (HU) | 0.74 | 0.12 | 0.92 | 0.80 |
| Netherlands (NL) | 0.74 | 0.35 | 0.83 | 0.48 |
| Greece (GR) | 0.74 | 0.34 | 0.90 | 0.55 |
| Croatia (HR) | 0.73 | 0.26 | 0.90 | 0.63 |
| Denmark (DK) | 0.65 | 0.10 | 0.94 | 0.84 |
| North Macedonia (MK) | 0.57 | 0.11 | 0.84 | 0.73 |
| Luxembourg (LU) | 0.53 | 0.29 | 0.93 | 0.63 |
| Low level of completeness | | | | |
| France (FR) | 0.45 | 0.06 | 0.96 | 0.89 |
| Slovenia (SI) | 0.45 | 0.11 | 0.85 | 0.74 |
| Belgium (BE) | 0.42 | 0.09 | 0.97 | 0.87 |
| Iceland (IS) | 0.41 | 0.13 | 0.55 | 0.42 |
| Serbia (RS) | 0.35 | 0.09 | 0.83 | 0.74 |
| Albania (AL) | 0.33 | 0.13 | 0.69 | 0.56 |
| Turkey (TR) | 0.26 | 0.11 | 0.64 | 0.53 |
| Montenegro (ME) | 0.12 | 0.11 | 0.12 | 0.01 |
| Romania (RO) | 0.10 | 0.06 | 0.12 | 0.06 |

The level of completeness varies largely by domain and variable. It is higher for data on students and graduates, although some breakdowns by field and by mobility are more problematic. Completeness is lower for financial data on income and expenditure (around 40% on average); the lack of this information is due to the absence of standardized collection procedures at the national level in some countries. R&D expenditure is available in around 33% of cases. Data on staff are in an intermediate position, around 50–55% for both Head Count (HC) and Full Time Equivalent (FTE), except for academic staff breakdowns.

ETER's current approach and management of data quality

Data quality is a relevant interdisciplinary issue, studied in statistics, management, and computer science. Poor data quality greatly reduces the value of data: inaccuracy, incompleteness, and out-of-dateness may render data useless (Batini & Scannapieco, 2016).

Some international standards for defining the data quality concepts and related dimensions have been proposed. ISO 25012 introduces and defines three possible levels (views) of data quality to be considered individually, namely:

Internal Data Quality, related to values and formats of data (e.g. consistency, completeness);

External Data Quality, related to characteristics of the software and hardware used to store and access data (e.g. response time, portability);

Data Quality in Use, related to the final user of data (e.g. effectiveness, level of satisfaction).

Validation and data quality controls are indeed central tasks in ETER, facing challenges raised by the specific nature of ETER data: i) microdata at the institutional level with a high level of heterogeneity, instead of aggregated data; ii) a secondary data collection based on data collected nationally, largely without a common reference framework. The latter is an alternative to a primary data collection, and hence implies limited control over the overall data collection process.

We hereby summarize the ETER data quality process, which is purposefully designed to address the specificities of the ETER data collection. This process combines different methods, including a systematic analysis of the internal quality of the data (format accuracy, completeness, consistency, and timeliness), advanced statistical methods for outlier detection and the analysis of comparability based on metadata, and checks of external validity comparing ETER data with other data sources.

More specifically, the ETER Quality Validation and Reporting process consists of the following phases:

Quality Metadata Collection, carried out alongside the data collection.

Quality indicator calculation and validation checks, performed within the data collection phase both on a country basis and on the whole dataset; they are described in the following.

Multi-annual checks.

Cross-sectional ratios to detect comparability problems.

Investigation of the comparability dimension on the basis of the previous analyses.

Checks with external data sources, either to assess the overall coverage against official statistics and national aggregates, or to explain/correct problems detected in the previous steps.

The overall ETER data quality process is depicted in Figure 1, where the darker green processes are the specific ones introduced to deal with data quality:

During the Data Collection phase, both manual and automated checks are performed.

Right after the Data Collection, a Pre-validation phase is carried out.

A dedicated Quality Review and Correction phase is later performed, based on the methodology illustrated in the next section, together with the related Quality Annotation and Reporting process, which also follows Eurostat indications (Eurostat, 2014, 2019). See Appendix 1 for the annotations and flags adopted.

Figure 1

Overall ETER data quality process.

The data quality indicators adopted within the ETER project belong to ISO 25012 and relate to internal data quality; they are:

Accuracy: to evaluate the conformity of the provided values to the format specified in the collected data sets.

Completeness: to evaluate the number and meaning of missing values in the collected data sets.

Consistency: to verify possible violations of semantic rules defined over the involved data, specifically between different variables.

Timeliness: to evaluate the lapse of time between the ETER collection date and the source release date.

Here we describe briefly the checks performed to investigate accuracy, completeness, and consistency (for further details, see Lepori et al., 2018).

Accuracy checks

Accuracy checks verify that the data entered have the format foreseen by the handbook and that no logically impossible values are present. These checks are performed in the data collection sheet and on the delivered data. Simple mistakes are corrected directly, whereas unclear cases are reported back to NSAs/NEs for clarification.
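
A toy Python illustration of such format checks follows; the two rules (a NUTS 2 code shaped as a two-letter country code plus two more characters, and a four-digit foundation year) are simplified stand-ins for the handbook's actual rules.

```python
import re

# Simplified format rules in the spirit of the accuracy checks; the real
# handbook rules are richer and these patterns are only an approximation.
FORMAT_RULES = {
    "nuts2": re.compile(r"[A-Z]{2}[0-9A-Z]{2}"),
    "foundation_year": re.compile(r"\d{4}"),
}

def accuracy_errors(record: dict) -> list:
    """List the fields of one record whose values violate their format rule."""
    errors = []
    for field, rule in FORMAT_RULES.items():
        value = str(record.get(field, ""))
        if not rule.fullmatch(value):
            errors.append(f"{field}: '{value}' has an unexpected format")
    return errors

print(accuracy_errors({"nuts2": "AT13", "foundation_year": "1365"}))   # []
print(accuracy_errors({"nuts2": "Wien", "foundation_year": "65"}))     # two issues
```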

Completeness checks

No blank cells are allowed in the dataset, except for remarks. Blanks should be recoded correctly as missing, confidential, not applicable, or “0”. This control is extremely important for the final quality of the database. Blank cells are highlighted automatically. Clear cases are recoded directly and ambiguous cases (for example between missing and not applicable) are reported back to national experts and NSAs for clarification.
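
The recoding decision itself can be sketched as a small function; the codes and context labels below are hypothetical placeholders, since the flags actually used in ETER are listed in Appendix 1.

```python
# Hypothetical codes: "m" = missing, "x" = confidential, "a" = not applicable.
def recode_blank(value, context="unknown"):
    """Map a blank cell to an explicit code; keep real values, including 0."""
    if value not in ("", None):
        return value                  # an actual value, including an explicit 0
    if context == "not_delivered":
        return "m"                    # data exist but were not provided
    if context == "disclosure_control":
        return "x"                    # value suppressed for confidentiality
    if context == "rule_excludes":
        return "a"                    # variable does not apply to this HEI
    return "UNRESOLVED"               # ambiguous: report back to NSAs/national experts
```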

Consistency checks

These checks control for logical consistency between different variables (for example when the highest degree delivered is at ISCED 7 level, all values for students and graduates at ISCED 8 level should be not applicable). See Appendix 2 which reports the list of consistency indicators.

Further, these checks control whether the sums of breakdowns by subcategories equal the totals, and verify numerical relationships between values (for example, R&D expenditure should be lower than total expenditure). Deviant values are identified and checked. Where there are specific reasons for a deviation, an explanation is added to the metadata for that specific HEI.
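
A minimal Python sketch of two consistency rules of this kind, with illustrative field names rather than the official ETER variable codes (Appendix 2 lists the actual consistency indicators):

```python
TOLERANCE = 1  # allow rounding differences of one unit in breakdown sums

def check_breakdown_sum(record, total_field, part_fields):
    """The sum of the breakdowns by subcategories should equal the total."""
    values = [record.get(f) for f in part_fields] + [record.get(total_field)]
    if any(v is None for v in values):
        return True  # rule cannot be evaluated on missing data
    return abs(sum(values[:-1]) - values[-1]) <= TOLERANCE

def check_rd_expenditure(record):
    """R&D expenditure should not exceed total expenditure."""
    rd, total = record.get("rd_expenditure"), record.get("total_expenditure")
    if rd is None or total is None:
        return True
    return rd <= total

hei = {"students_total": 1000, "students_male": 450, "students_female": 549,
       "rd_expenditure": 5.0e6, "total_expenditure": 4.0e6}
print(check_breakdown_sum(hei, "students_total",
                          ["students_male", "students_female"]))  # True (within tolerance)
print(check_rd_expenditure(hei))  # False: a deviant value to check with the NSA
```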

Check of missing data

An analysis of missing data is performed (including issues with breakdowns by subcategories). When it is expected that data should be available, possibly with some limitations, they are requested from NEs/NSAs.

Control of metadata completeness

Metadata are systematically controlled for completeness, taking into account issues emerging from the checks on the data. When metadata are missing or incomplete, further information is requested.

Expert checks

Expert checks, based on knowledge of national systems as well as on information available on the Web and in EUMIDA data, are performed in order to ensure that the provided data are realistic. Potentially problematic cases are notified back to national experts and NSAs. When these are related to methodological issues, the corresponding remarks are integrated into the metadata.

The data quality management system in the ETER project has been built to meet several challenges of large-scale international data collections. The first challenge was to use a reproducible, scalable system, which can be applied to 37 different countries. Secondly, the process needs to keep the workload for data deliverers as small as possible and to reduce the margin of error as much as possible. Thirdly, data from different departments are collected in one template, which demands extensive control mechanisms to ensure that no inconsistencies exist between the data. In order to meet these challenges, the following data quality management procedure has been developed within the ETER project (see Figure 2):

Because of its widespread usage, Microsoft Excel has been chosen as the tool for perimeter validation and data collection. In the first step, national authorities identify the higher education institutions for the respective data collection year. Any demographic events are tracked and added as variables to the dataset. After confirmation of the perimeter, data collection templates are sent out. These templates are prefilled with information from previous years that is not expected to change (e.g. foundation year, geographical information, etc.).

The data collection files already include a large number of control mechanisms designed to make the data deliverer aware of potential irregularities. These mechanisms screen the template for issues of completeness (for mandatory variables), accuracy (e.g. a NUTS 2 code must have four characters), and consistency (e.g. sums of breakdowns equal to the aggregate variable). Specifically, consistency checks verify possible violations of semantic rules defined over the involved data, and specifically between different variables; the list of indicators and involved variables is reported in Appendix 2. In order to prevent any overwriting of the automated checks, a macro has been implemented in the data collection file. This macro allows pasting values only and therefore ensures that the preprogrammed checks cannot be bypassed.

Clean data are then imported into the database, where the data collection is managed from then on. Data collection files can be produced by the ETER infrastructure if updates or changes are necessary.

Automated data validation and data quality checks are run on the imported data in the database. Data validation is an extended form of the control mechanisms already implemented in the data collection templates (completeness, accuracy, and consistency) and produces a PDF report per country and academic year. Then, an extensive automated data quality procedure is performed on the data. This internal data quality process includes a multiannual analysis as well as cross-sectional outlier detection. Suspect cases are either corrected (in cooperation with NSAs) or flagged. Additionally, the ETER data are subject to external data quality controls, where the data are compared to equivalent data such as EUROSTAT national aggregates or data from U-Multirank.

The final product of this procedure is a high-quality dataset, which is published in spring of each year on the ETER web interface. Because of continuous work on the dataset, updates in the form of additions or corrections are released regularly.

Figure 2

Current approach in data quality management.

The effectiveness of the presented approach has been validated by proof of concept and user experience. On the side of the data deliverers, the implemented process minimizes the additional quality control burden by focusing interactions on a limited number of selected cases, identified by statistical analysis and automated controls.

Methodology

The overall data quality management process, described in the previous section and based on the ETER Data Quality Report (Daraio et al., 2018), combines different approaches. It includes a systematic analysis of the internal quality of the data (format accuracy, completeness, consistency, and timeliness), the analysis of comparability based on metadata, and the check of external validity by comparing ETER data with other data sources. In this section, we describe the data quality checks developed to identify outliers and extreme observations and to detect ontological inconsistencies not described in the available metadata, which constitute the main objective of this paper.

Given the specificities of the ETER database, the methodology developed to check the consistency and stability of data over time is based on an empirically oriented approach, which analyzes the observed distributions of the relevant variables without referring to predefined theoretical data distributions. This differs from what was done previously within the ETER project, when outliers were identified by means of the approach implemented in the R package “extremevalues”, which compares the empirical data with theoretical distributions.

In the following, we describe the logic and functioning of the multiannual checks as well as the cross-sectional checks which complement them.

Multiannual checks

Each institution Ij has values vji for a number of variables, some of which usually change over the time horizon covered by the database (the years run from 2011 to 2016). Examples of such variables are “number of students” or “number of graduates”. To lighten the notation, when there is no ambiguity, we denote the values of the generic time series of one of these variables by v1, v2, …, vt (without explicit reference to the index j of the institution), and the set of year indices of the time series simply by T = {1, 2, …, t}.

The availability of data across different years raises the issue of the longitudinal consistency of the data collected (impact of demographic events, revision of variables' categories and definitions, etc.). At the same time, the availability of several yearly editions of the data offers an additional possibility for quality control. Indeed, multi-annual checks can help to detect suspect cases where the level of variation from year to year is very large or otherwise anomalous when compared with the average changes in the sample. This type of check is particularly useful in detecting and reporting mistakes by respondents and/or changes in the methodology of the data collection.

The availability of only six years of data, however, does not allow the use of methods specific to time series analysis, which require much longer time series. Moreover, the ETER dataset contains different typologies of variables (e.g. structural descriptors as opposed to quantitative variables) with different propensities to change over time.

For these reasons, the methodological approach developed for the multiannual checks consists of multiple procedures and is based on the use of different techniques:

manual check of the impact of demographic events (take-overs, spin-offs) on the concerned institutions' figures and corresponding flagging (the code “b” for a break in the time series was already foreseen);

analytic control of descriptors and status variables supposed to be stable over time, e.g. legal status, foundation year, geographical information, lowest/highest degree awarded, etc.;

comparison of national aggregates over time for a selected number of quantitative variables already during the validation phase, with an alarm if the variation is over a pre-defined threshold;

use of measures of statistical dispersion (interquartile range comparison over time) to assess the overall stability of the distribution of quantitative variables;

statistical analysis to highlight HEIs whose annual growth stands out from the overall distribution (outliers).

The approach proposed in this work for the checks of the described time series has been developed to be flexible and scalable. It is based on thresholds and parameters, which can be tuned taking into account expert knowledge or may be determined from the empirical distribution of the observed data. The approach is easy to implement and can be executed within the most common software tools used in data management, e.g. R, Matlab, or even MS Excel. It has been adopted in the data quality assessment of the ETER European research project, replacing the previous methodology based on outlier analysis.

The proposed approach relies on two types of controls to identify potentially erroneous time series in the HEIs:

Check of the discontinuity: this control is aimed at identifying large variations in the values of the variable under analysis, and therefore at capturing its volatility over time. It is based on the computation of the annual variations, called deltas, and on their possible normalization using a measure of the size of the institution. A scale invariance parameter is defined in order to choose the desired level of normalization.

Check of the variance of the deltas: this control is aimed at identifying fluctuations in the size of the deltas, i.e. second-order information with respect to the values appearing in the time series. This information allows identifying institutions that have an overall moderate range of variation, and are therefore not detected by the previous control, but that exhibit anomalous isolated “jumps”. Again, normalization is possible using a measure of the size of the institution, by setting a scale invariance parameter.

After their identification, the HEIs containing inconsistent values in their time series should be validated or corrected through subsequent procedures, depending on the specific case (for instance, by checking external sources for the same data).

In more detail, the methodology is composed of the following four steps.

Exclusion of irrelevant HEIs. This step is performed because very small institutions may exhibit very large percentage fluctuations in their values simply because of their small size, without necessarily revealing errors. For example, the number of students in a very small university may easily double or halve from one year to the next. An analysis of similar cases would be quite complex and, on the other hand, its impact on the global situation would be negligible. Thus, small institutions are generally excluded from data quality checks. To determine where to set the division between relevant and irrelevant HEIs, we compute the geometric mean μ of the whole time series v1, v2, …, vt for the variable under analysis, and consider μ a measure of the size of the institution for that variable:

$$\mu = \left( \prod_{i=1}^{t} v_i \right)^{1/t}$$

Then, we compute a threshold S1 such that μ ≥ S1 for a predetermined percentage of cases (e.g. 95% or 98%). Any institution with μ < S1 is considered irrelevant for the variable under analysis.

Computation of the Discontinuity Measure (DM) and Jump Variance (JV) for each HEI. We call delta the difference between any two consecutive values of the time series of the variable under analysis in a given institution. To lighten the notation, we do not explicitly write the indices of the variable and of the institution, obtaining δ1 = v2 − v1, δ2 = v3 − v2, and so on. The set of the delta values of an institution is denoted by Δ. Within Δ, we consider the sum of the deltas having positive values, denoted by Δ+, and the sum of the deltas having negative values, denoted by Δ−. Then, we compute the Discontinuity Value (DV) of the variable under analysis in the given institution as the absolute value of the product of the two sums:

$$\mathrm{DV} = \left| \Delta^{+} \cdot \Delta^{-} \right|$$

In other words, DV measures the amount of the “jumps” in the time series of the variable under analysis in the given institution, and it decreases when all jumps tend to be in the same direction. This evaluates the “discontinuity” of the time series. In order to introduce scale invariance at a controlled intensity, DV is divided by the geometric mean of the same variable for the given institution raised to a power σ, obtaining the Discontinuity Measure (DM):

$$\mathrm{DM} = \mathrm{DV} / \mu^{\sigma}$$

When σ = 1, DV is fully “normalized” by the size of the institution, thus obtaining a scale-invariant measure. At the other extreme, when σ = 0, the value of DV is fully dependent on the size of the institution. Any value of σ between 0 and 1 can also be selected, and it will determine the desired level of scale invariance.

Another measure computed from the set Δ is the Jump Variance (JV), that is, the variance of the elements of Δ, computed as follows, where |Δ| denotes the cardinality of the set Δ:

$$\mathrm{JV} = \sum_{i} \left( \delta_i - \delta_{\mathrm{mean}} \right)^2 / \left| \Delta \right|$$

The aim of the JV measure is to identify time series that have a non-excessive value of DV, and hence are not highlighted by the DV measure, but that contain some anomalous jumps, for example because one isolated value in the time series contains an error. This value can again be normalized by the power μ^σ of the geometric mean, with a technique very similar to the previous case, obtaining the Jump Diversification (JD):

$$\mathrm{JD} = \mathrm{JV} / \mu^{\sigma}$$

Issue of alarm flags. For each HEI in the relevant sample for the variable under analysis (that is, with μ ≥ S1), we determine whether to mark it with alarm flags by using the following criteria:

HEIs with the highest values of discontinuity measure DM for the variable under analysis (e.g. the top 5% or 10%) are flagged with Alarm 1. The demarcation value will be called S2.

HEIs with the highest values of jump diversification JD for the variable under analysis (e.g. the top 5% or 10%) are flagged with Alarm 2. The demarcation value will be called S3.

Check of the alarmed HEIs. Finally, we check the institutions that received alarm flags and therefore appear to contain one or more inconsistent series of values. Note that, due to the nature of the data, the presence of alarm flags does not guarantee the presence of errors, but only that the time series are “uncommon”. Depending on the specific case, correction or validation can then be performed by checking external sources for the same data, by inspection, etc.

To further clarify the described approach, we analyze in detail the case of the variable “Enrolled students ISCED 5–7”. The same approach has been applied to the other relevant variables listed in Table 2.

Step i) We compute the geometric mean μ over the years of the number of students enrolled for each single HEI, and we find the threshold S1 = 142 that excludes the 5% smallest HEIs.

Step ii) For each HEI not excluded in the previous step, we compute the set Δ of the yearly variations in the number of students enrolled, from which we compute DV = |Δ+ · Δ−| and JV = Σi (δi − δmean)² / |Δ| for each HEI. Then, we select the scale invariance parameter σ = 0.5, thus choosing a partial level of scale invariance. Finally, for each HEI, we compute DM = DV / μ^σ and JD = JV / μ^σ.

Step iii) We set alarm flags on the values of students enrolled for all the relevant HEIs in the top 5% of the values of DM and for all the relevant HEIs in the top 5% of the values of JD. Note that an institution can receive both alarm flags, but one is enough to require a check. The threshold identified on DM with this procedure is S2 = 2.08, while that on JD is S3 = 1.96. The total number of flagged HEIs is 285.

Step iv) The 285 HEIs flagged at Step iii) for their values of students enrolled have been checked. In particular, each country expert examined those belonging to his/her country and took appropriate action after consultation with national statistical offices and/or ministries. The actions were of three possible types: (a) confirming the suspect data; (b) flagging and explaining them; (c) correcting the data.
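
The four steps can be condensed into a short script. The following Python sketch is our reading of the procedure above rather than the project's actual implementation: it assumes strictly positive, gap-free yearly series (otherwise the geometric mean is undefined) and uses simple percentile cut-offs for S1, S2, and S3.

```python
import numpy as np

def multiannual_flags(series_by_hei, sigma=0.5, size_pct=5, alarm_pct=5):
    """Steps i)-iii): flag HEIs whose time series look anomalous.

    series_by_hei maps an HEI id to its list of yearly values.
    Returns a dict mapping flagged HEI ids to their set of alarms.
    """
    # Step i): geometric mean as the size measure; exclude the smallest HEIs.
    mu = {h: float(np.exp(np.mean(np.log(v))))
          for h, v in series_by_hei.items() if min(v) > 0}
    s1 = np.percentile(list(mu.values()), size_pct)
    relevant = [h for h in mu if mu[h] >= s1]

    # Step ii): deltas, Discontinuity Measure DM and Jump Diversification JD.
    dm, jd = {}, {}
    for h in relevant:
        deltas = np.diff(series_by_hei[h])
        dv = abs(deltas[deltas > 0].sum() * deltas[deltas < 0].sum())
        jv = np.var(deltas)  # sum of squared deviations divided by |delta set|
        dm[h] = dv / mu[h] ** sigma
        jd[h] = jv / mu[h] ** sigma

    # Step iii): alarm the top alarm_pct% of DM (Alarm 1) and of JD (Alarm 2).
    s2 = np.percentile(list(dm.values()), 100 - alarm_pct)
    s3 = np.percentile(list(jd.values()), 100 - alarm_pct)
    flags = {}
    for h in relevant:
        alarms = ({"Alarm 1"} if dm[h] >= s2 else set()) | \
                 ({"Alarm 2"} if jd[h] >= s3 else set())
        if alarms:
            flags[h] = alarms
    return flags

# Toy sample; on real data the percentile cut-offs span thousands of HEIs.
series = {
    "HEI_A": [1000, 1020, 980, 1010, 990, 1005],    # stable series
    "HEI_B": [1000, 2400, 950, 2300, 900, 2500],    # strongly oscillating series
    "HEI_C": [1000, 1010, 1020, 2900, 1030, 1040],  # one isolated jump
}
print(multiannual_flags(series))
# -> {'HEI_B': {'Alarm 1', 'Alarm 2'}} (HEI_A falls below the size cut-off
#    on this tiny sample; step iv) would then review the flagged series)
```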

Table 2. List of variables considered for the multiannual checks.

Variable
Total expenditure (PPP)
Total revenues (PPP)
Total academic staff (FTE)
Total academic staff (HC)
No. of administrative staff (FTE)
Total staff (FTE)
Total staff (HC)
Total students enrolled (by ISCED level)
Total graduates (by ISCED level)

Cross-sectional checks

The multiannual checks are complemented by a check of cross-sectional consistency. The method is based on the analysis of the distribution of the values of a ratio between two interrelated variables, e.g. the amount of personnel expenditure and the number of staff. To take into account specificities linked to the country setting and the type of institution, the method has been applied to sub-distributions of HEIs:

By country

By institutional category (university, university of applied sciences, other).

The analysis may exclude HEIs below a minimum size threshold, since experience tells us that very small institutions show a number of contextual peculiarities and are a sort of “outlier” by definition (in principle, very small institutions can be left outside the ETER perimeter, but many countries include them). The method is applied as follows (a sketch of the procedure is given after Table 3):

For each ratio Ri (the considered ratios are listed in Table 3 and indicated as R1, …, Ri, …, R8), start the analysis by computing its values for all the institutions and sorting these values in ascending order;

Identify the value of the ratio that leaves out, e.g., the 5% of cases with the lowest values, and call it Xl; likewise, identify the value of the ratio that cuts out the 5% of cases with the highest values, and call it Xu;

Create two sets of sub-distributions for the analysis: the first set is defined according to the institutional category and contains three sub-groups of records, comprising respectively universities, universities of applied sciences, and other institutions; in the second set, records are grouped by country;

For each sub-distribution, calculate the value of the aggregate ratio Rs, obtained by dividing the aggregate value of the variable at the numerator by the aggregate value of the variable at the denominator;

For each sub-distribution, calculate the ratio for each record, sort the ratios in ascending order, and identify the records with a value of the ratio below X_{l,Rs} or above X_{u,Rs}, where Xl and Xu are the parameters calculated on the overall distribution and Rs indicates the value of the aggregate ratio of the sub-distribution;

Flag the cases that are either below the lower-bound threshold (X_{l,Rs}) or above the upper-bound one (X_{u,Rs}).

Table 3. List of cross-sectional ratios for checks.

| Code | Name |
|---|---|
| R1 | Enrolled Students / Academic Staff |
| R2 | Academic staff / Total staff |
| R3 | Personnel expenditure / Total staff |
| R4 | Personnel expenditure / Total expenditure |
| R5 | Total expenditure / Total revenue |
| R6 | Basic Government funds / Total revenue |
| R7 | Graduates ISCED 5–7 / Enrolled students ISCED 5–7 |
| R8 | Graduates ISCED 8 / Enrolled students ISCED 8 |
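
The steps above can be sketched in a few lines of Python. Rescaling the global cut-offs by each sub-distribution's aggregate ratio is one possible reading of the X_{l,Rs}/X_{u,Rs} notation, and the field names are illustrative, so this is an interpretation of the procedure rather than the ETER implementation itself.

```python
import numpy as np
import pandas as pd

def cross_sectional_flags(df, num, den, pct=5, group="country"):
    """Flag records whose ratio num/den falls outside rescaled global cut-offs."""
    ratio = df[num] / df[den]
    xl, xu = np.nanpercentile(ratio, [pct, 100 - pct])  # steps 1-2: global cut-offs
    r_all = df[num].sum() / df[den].sum()               # overall aggregate ratio
    flagged = []
    for _, sub in df.groupby(group):                    # step 3: sub-distributions
        rs = sub[num].sum() / sub[den].sum()            # step 4: aggregate ratio Rs
        scale = rs / r_all                              # recentre cut-offs on Rs (assumed)
        sub_ratio = sub[num] / sub[den]
        out = sub[(sub_ratio < xl * scale) | (sub_ratio > xu * scale)]
        flagged.extend(out.index.tolist())              # steps 5-6: alarmed records
    return flagged

# Toy example for R1 = Enrolled Students / Academic Staff.
df = pd.DataFrame({
    "country": ["AT", "AT", "AT", "DE", "DE", "DE"],
    "students": [10000, 9000, 400, 20000, 21000, 30000],
    "academic_staff": [800, 750, 10, 1500, 1600, 2300],
})
print(cross_sectional_flags(df, "students", "academic_staff"))
# -> [2]: only the HEI with an anomalous ratio of 40 students per academic staff
```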

This multistep methodology allows treating in a simple way the heterogeneity of higher education systems both across countries and across categories of institutions.

The described controls can be applied in ETER in two different phases:

Preliminary validation checks, performed directly within the data collection phase on a country basis, in order to allow for easy feedback from the respondents and correction of the data before online integration;

Further in-depth quality checks and validation, to perform more accurate controls that can also provide indications about appropriate data usage and possible quality improvements for future data collections.

Results
Results from the multiannual checks

The methodology for multiannual checks described above has been applied to 19 of the ETER variables, considering all reference years (2011–2016) and keeping the year 2016 as a base. Around 2,800 cases, spread across 33 countries, have been highlighted and checked in detail (Tables 4a and 4b).

Table 4a. Outcome of the multiannual checks: number of cases detected by variable and country.

VariableATBEBGCZDEESGRHUITNLPLPTRSSITRUKTotal
Academic Staff FTE969101091291580
Academic Staff HC20185408141635121313176
Non-Academic Staff FTE4108244212295107
Total current expenditures (NC)11414323
Total current revenues (NC)21913640
Total Graduates ISCED 58141831
Total Graduates ISCED 5–71275352221114481335198204
Total Graduates ISCED 675321445131036594104176
Total Graduates ISCED 751511897178231771129140
Total Graduates ISCED 7 long degree4116394111572
Total Graduates ISCED 8315416136411162915105
Total Staff FTE4491241716361103
Total Staff HC437202111411344110
Total Students ISCED 5109661950
Total Students ISCED 5–7931031683511261361321517205
Total Students ISCED 6224311273273525123163
Total Students ISCED 732931113922671221951423223
Total Students ISCED 7 long degree2247312268
Total Students ISCED 822131457124121131617109
All variables10112100682792551281802481328517633101201772,185

Table 4b. Outcome of the multiannual checks: number of cases detected by variable and country.

VariableALCHCYDKEEFIHRIELILTLVMKMTNOSESKTotal
Academic Staff FTE12421541341
Academic Staff HC912451151341
Non-Academic Staff FTE13593122
Total current expenditures (NC)21153214
Total current revenues (NC)311216
Total Graduates ISCED 511552124
Total Graduates ISCED 5–7823821672119564
Total Graduates ISCED 641111273424
Total Graduates ISCED 75112141425430
Total Graduates ISCED 7 long degree116311
Total Graduates ISCED 8152422521
Total Staff FTE131531032
Total Staff HC735125932
Total Students ISCED 5351412227
Total Students ISCED 5–7181118161591514796
Total Students ISCED 68712641513451
Total Students ISCED 791121491313751
Total Students ISCED 7 long degree112
Total Students ISCED 82115431724
All variables181859362223358172372196171105623

The distribution by country generally follows the size of the country in terms of the number of institutions in ETER, with the six largest countries (DE, ES, IT, PL, TR, and the UK) accounting for around half of the cases.

The detected cases can be grouped into three categories:

Breaks in time series that were already known and flagged, as a consequence of either demographic events or methodological discontinuities. Examples are the change in the classification of curricula in Spain from 2013 onward, a different method for counting academic staff in Swiss UASs in 2013, etc.;

Country systemic issues, involving a large number of HEIs in one country, and therefore pointing to breaks in time series which had not been notified or flagged before. Examples are a generalized drop of ISCED 5 students and a parallel increase of ISCED 6 students in 2014 in Ireland, the sharp drop of academic staff (HC) registered in Italy in 2014, etc.;

Individual cases, which may be the consequence of errors in data reporting or of peculiarities of the institution (e.g. recently founded HEIs show tremendous growth in their first years).

All cases in categories 2 and 3 have been controlled individually by the consortium interacting with NSAs.

In terms of variables, more than two thirds of the cases concern the student population (students and graduates), but volatility emerges in all the variables considered.

Besides the identification of individual outlier cases and mistakes in reporting, which were revised with the data providers and corrected, the multiannual checks allowed highlighting problems of comparability across waves of data collection. In several cases, changes in curricula with an impact on their classification according to the ISCED system caused a sudden increase/decrease in the number of enrolled students and degrees at specific ISCED levels. The changes in the figures therefore did not reflect substantial changes in the pattern of enrollment in higher education, but a simple normative effect. Over the years, several countries (e.g. Spain) were affected by these changes due to the still ongoing adaptation to the Bologna process. Other artificial changes due to administrative rules were found concerning counting methods and rules for reporting staff figures, especially contract and part-time staff (e.g. Switzerland and Italy).

Cross-sectional ratios for consistency analysis

Financial data are gathered according to the ETER breakdowns. This may induce quality issues, since the ETER itemization may not match the categorization adopted by the national authorities that provide the data funneled into ETER.

Concerning revenues, national data are available with different levels of granularity; in Italy, for example, the content of some revenue breakdowns (i.e. basic core budget and other core budget) differs between state and non-state HEIs because of the different granularity of the national data available. In most cases, the match with ETER is made without significant problems of attribution. However, some categories may not match perfectly. For example, for non-state universities in the UK, data are available in domestic statistics with a lower level of detail; here too, the content of some revenue breakdowns (i.e. basic core budget and other core budget) differs between state and non-state universities because of the different granularity of the national data. Finally, on some occasions, non-recurring revenues are not distinguishable from the others, although regular funding is only a share of total current revenues.

Considering costs and expenses, national authorities provide data in different categories, sometimes including depreciation, depending on the accounting system adopted by HEIs. However, depreciation is not included in ETER reporting, since ETER considers capital expenses according to a cash accounting approach, as in the majority of countries. This may create mismatches, since the perimeter of capital expenses may differ for those HEIs that adopt an accrual accounting system and consider depreciation of fixed assets in their financial reports instead of registering capital expenses. For example, in Italy, the reform of university accounting rules was progressively implemented during the time span covered by ETER. As a result of these changes, universities have progressively adopted an accrual accounting system and have moved to the “single budget”, consolidating the data of all the centers with managerial and administrative autonomy which make up the organizational structure of the universities. In such cases, without expert-based control of the data, inconsistent information may be provided when comparing HEIs either within the same country or across different countries.

Considering the total employee counts, some data inconsistencies may arise since HEIs may use different methods for counting staff (Head Count vs FTE) or may include different categories of employees in the academic staff; for example, some university hospitals may include doctors in specialist medical training within the academic staff.

From the previous considerations, it emerges that detecting inconsistent data and comparability issues among different HEIs, within the same country and the same year, is a very important task.

Heterogeneity in the size and scope of HEIs (for example, polytechnics vs general-purpose universities) may hide some comparability issues. Indeed, comparing numbers of different orders of magnitude may not produce useful information, while, after a normalization process, comparisons may provide better insights. Thus, the approach of financial/managerial ratios (Woelfel, 1987) has been deemed suitable to compare HEIs of different orders of magnitude and scope. Following this approach, an ad hoc set of ratios was defined: each ratio is the relative magnitude of two selected numerical values taken from ETER, and the ratios were built mainly with the purpose of analyzing financial and staff data.

By comparing the ratios with national standards or relevant (expert-based) threshold values, data inconsistencies or comparability issues may be detected. Moreover, although ratios may not be directly comparable between HEIs that adopt different accounting methods, they can help in detecting such differences.

After several experiments and group discussions, a set of eight ratios was defined, mainly considering financial and staff data; thresholds were defined through an expert-based approach in order to spot data inconsistencies both within each country and between different countries. As a result, over 2,000 cases were detected in the first test on the last wave of data collection. Table 5 lists the set of ratios proposed and implemented to detect comparability issues, and it also reports the percentage of detected cases for each ratio. At first glance, it emerges that the majority of detected cases involve staff counts, through the ratios of enrolled students to academic staff (R1), of academic staff to total staff (R2), and of personnel expenditure to total staff (R3).

Table 5. Cross-sectional ratios for consistency analysis: percentage of detected cases by ratio.

| Description | Code | % |
|---|---|---|
| Enrolled Students / Academic Staff | R1 | 27.6% |
| Academic staff / Total staff | R2 | 18.7% |
| Personnel expenditure / Total staff | R3 | 15.1% |
| Personnel expenditure / Total expenditure | R4 | 5.2% |
| Total expenditure / Total revenue | R5 | 8.9% |
| Basic Government funds / Total revenue | R6 | 6.1% |
| Graduates 5–7 / Enrolled students 5–7 | R7 | 15.0% |
| Graduates 8 / Enrolled students 8 | R8 | 3.2% |

Table 6 reports the percentage of detected cases for each ratio and in total for each country involved in ETER.

Table 6. Cross-sectional ratios – a country-by-country reporting.

R1R2R3R4R5R6R7R8Total
Austria8.0%1.9%0.9%8.1%1.7%3.4%
Belgium0.5%0.2%0.2%
Bulgaria4.0%1.1%0.7%4.4%1.6%
Croatia2.1%2.0%
Cyprus1.3%1.7%8.9%10.2%6.3%5.0%3.9%
Czech Republic6.7%4.7%0.2%2.7%17.3%4.0%4.6%
Estonia4.1%2.4%0.6%1.2%31.4%
Finland0.8%1.7%2.3%
Germany21.3%26.5%53.5%55.8%70.6%1.2%16.1%22.0%3.2%
Greece7.3%1.1%6.4%3.3%
Hungary1.8%13.1%2.1%2.6%
Ireland0.1%0.8%0.9%1.6%0.2%1.3%
Italy6.8%4.2%4.5%16.5%0.8%
Latvia2.6%4.2%11.5%2.0%3.6%1.4%1.1%6.4%
Lithuania0.5%2.1%11.0%1.4%2.4%0.6%0.2%1.1%4.4%
Luxembourg0.1%0.6%1.1%
Malta0.5%0.7%0.4%0.6%1.2%
Netherlands0.8%0.6%3.8%0.8%0.6%9.9%2.7%
North Macedonia0.3%0.6%9.2%
Norway1.4%1.3%0.9%0.5%
Poland11.2%3.6%19.7%9.5%1.1%3.9%
Portugal2.6%16.3%0.2%2.0%1.6%1.2%1.9%0.1%
Serbia2.2%1.9%5.5%3.9%
Slovakia0.4%1.4%2.0%0.4%11.6%6.6%15.4%0.2%
Slovenia5.9%5.5%3.7%
Spain3.6%1.9%4.0%9.9%0.2%
Sweden1.0%0.2%0.5%1.4%8.7%1.7%0.6%
Switzerland2.1%0.2%3.5%11.6%1.6%1.2%0.5%0.2%
Turkey7.5%9.7%4.4%0.3%
UK6.0%13.8%3.1%6.1%7.5%26.6%11.6%3.3%1.1%
Discussion and conclusions

As recalled, the features of ETER lead to specific challenges for the data quality process, due to the nature of microdata, the extreme heterogeneity of rules and HEI categories across countries, and the lack of control over the complete data collection process.

The methodology developed to account for these specificities combines quantitative statistical checks with an expert-based interaction with the national data providers. This approach can be complemented with the results of imputation procedures to fill in missing data (Bruni, Daraio & Aureli, 2020).

A strength of the developed approach is its empirically oriented flexibility, which allows the user to tailor the quality investigation to the observed distribution of the variables considered instead of using theoretical distribution functions for the data analyzed.

Given this flexibility, the developed methodology could be extended, mutatis mutandis, to institutional data on the higher education systems of those countries for which much information is available in documents and public sources but has not yet been collected and integrated into a unique register for monitoring the system over time.

The current methodology, although consolidated and assessed, could be improved in different ways. It could be useful to invest in better combining the multiannual and cross-sectional checks, to further reduce the number of cases to inspect manually and to pre-identify the problems or possible explanations, moving towards the implementation of a fully automated control. This ambitious goal would require a revision of the current architecture of the data and of the overall data quality management process.

Another limitation of the current approach lies in the reporting of data quality information. Although flags are incorporated in the dataset for each variable, the information explaining the problems and the way they should be treated (i.e. whether the impact on the comparability of the data is high, low, or null) is fragmented across different sections of the dataset, including the notes available for groups of variables, the metadata at the variable level, and additional more in-depth information. This fragmentation of the relevant information, and the difficulty of reading the data and the metadata together, hamper a fully data-quality-aware use of the data, especially by policy makers or analysts who are not specialists in the field.

The main challenges in ETER data collection that remain open are:

Dealing with the heterogeneity of data sources through a formal and unambiguous way of representing metadata. Computational ontologies can be a possible solution in this direction, as they are able to provide a harmonized view of concepts expressed in a machine-readable way.

Dealing with “advanced” quality controls. In several cases, quality checks can go beyond the syntactic representation and, instead, be based on the semantics of the concepts. For instance, “total expenditures” should be properly specified in terms of mandatory and optional components, since, e.g., “R&D expenditures” are available only for a subset of countries.

Daraio et al. (2016a, 2016b) introduced the ontology-based data management (OBDM) approach to coordinate, integrate, and maintain the data needed for science, technology, and innovation policy, and illustrated its potential for specifying Science, Technology and Innovation indicators and developing science of science policies. They outlined the main advantages of OBDM, which are conceptual access to the data, re-usability, documentation and standardization, flexibility, extensibility, openness, interoperability, and data quality.

In the future, a possible development consists in addressing the above-cited challenges and moving to a “modernized” ETER data collection that could benefit from ontologies and Semantic Web models and languages.

In particular, the use of such an ontology-based approach could make it possible to achieve (i) a harmonized data collection, overcoming source heterogeneities, and (ii) richer quality controls, which could be specified in a declarative way and designed on the basis of an explicit semantic representation of the concepts involved in the ETER data collection.
