When doing measurements, statistics are needed if we want to describe the data (descriptive statistics) or if we want to draw conclusions based on the data (inferential statistics). There is a vast number of statistical methods in the literature, and the choice of method depends on what we want to know and what type of data we have. In this paper we give an overview of the most basic and most relevant methods for bioimpedance analysis, along with examples from the bioimpedance field. Because bioimpedance measurements are often done as frequency sweeps, producing large amounts of correlated and possibly redundant data, the implications for inferential statistics are discussed together with data reduction solutions. A goal of bioimpedance research is often to develop methods for prediction of a biological variable or state, and an overview is given of the most relevant methods for development and testing of a prediction model. Finally, the validation of a new measurement technology employs distinct statistical methods, and an overview is given of the concepts, terms and methods for evaluating performance.
Instead of beginning with the type of measurement as a basis for selecting the statistical method, we expand the perspective by beginning with the hypothesis and research design that should come before the measurements are acquired. The reason is that we generally have an idea about what we want to investigate with our bioimpedance measurement, and we do not perform measurements completely at random. To do our investigation properly, we begin with a research hypothesis where we formulate what we want to investigate in a testable way. For instance, if we want to find out whether gel electrodes provide lower bioimpedance measurements than textile electrodes, our hypothesis can be formulated as: “Bioimpedance is lower when using gel electrodes compared to using textile electrodes”. We now have a testable hypothesis, which can either be accepted or rejected by experiments. It is much easier to dismiss a hypothesis than to prove one, because it takes only one piece of solid evidence to reject it, but an endless amount to prove it correct. That is why the statistical methods are based on rejecting an opposite hypothesis, called a null hypothesis, which states that there is no effect or difference.
The hypothesis example above is very general, and the testability could be improved by making it more specific, e.g. “Trans-thoracic bioimpedance is lower when measured by gel electrodes than when measured by textile electrodes using a two-electrode setup”, if this is the relevant setup we want to test. This hypothesis is easier to test because it specifies only one particular type of measurement, which reduces the chance of an inconclusive result. A hypothesis should therefore be as specific and testable as possible.
It is a good idea to know how many units (i.e. items or subjects) are needed in order to test our hypothesis. Unless we test all the units in a population, we are only testing a sample of the whole population. In order to make a general conclusion about the population, we need to show that the effect that we observed was not likely due to chance from random variation in our sample. If we choose too few units, we may end up with an inconclusive result and a worthless study, and if we choose too many, we are wasting resources (e.g. sacrificing more animals than needed). Hence, sample size consideration is of ethical relevance [2].
In hypothesis testing, we want to reduce the chances of two types of errors: incorrectly rejecting a true null hypothesis (Type I error), and failing to reject a false null hypothesis (Type II error). The Type I error probability is determined by the significance level α (usually set to 5%). With a given α, sample size and effect size, we can also calculate the statistical power of the test (1 − β), i.e. the probability of detecting an effect that is truly present.
The estimation of the required sample size is based on the following inputs:
Desired α (probability of incorrectly rejecting a true null hypothesis)
Desired β (probability of failing to reject a false null hypothesis)
Expected sample distribution (type of distribution and variance)
Expected magnitude of difference or association.
Estimation of sample sizes is not an exact science, and often these inputs will be “a qualified guess”, but it is still important to assess whether we need something like 10 or 100 subjects/items. We have already decided our type of test based on the hypothesis, and our α will conventionally be set to 0.05 with a power (1 − β) of at least 0.8. What is left for us to provide is the effect size. If this is not known, the first place to look is in similar studies. Perhaps other investigators have published data with similar measurements on a similar sample. If no previous data are available, conducting a pilot study can give a good indication of these values. Perhaps we gather information which suggests that the sample variance may be somewhere between 100 and 500 Ohm, and that the difference between means is between 1 and 2 kOhm. In such cases it is best to account for the worst case (variance = 500 Ohm and difference between means = 1 kOhm) in the sample size determination.
In practice, the sample size calculation is not done by hand but with computer programs (such as the free G*Power), which let you choose a statistical test, ask for the necessary inputs (i.e. α, β, variance and effect size), and give you the minimum required sample size. They can also be used to determine the power of your test given the sample size, α and effect size. Because of all these unknowns, it is a good idea to consult a biostatistician on these matters if possible.
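As a rough illustration of how such a calculation can be scripted rather than done in a dedicated tool, the sketch below uses the statsmodels library in Python; the effect size and other inputs are hypothetical values chosen only to show the mechanics.

```python
# Power/sample-size sketch (Python, statsmodels). The effect size, alpha and
# power below are hypothetical and only meant for illustration.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Cohen's d = (difference between means) / (pooled standard deviation)
effect_size = 0.8   # assumed effect, e.g. a 20 Ohm difference with a 25 Ohm SD
alpha = 0.05        # significance level
power = 0.8         # desired power (1 - beta)

n_per_group = analysis.solve_power(effect_size=effect_size, alpha=alpha,
                                   power=power, alternative='two-sided')
print(f"Minimum sample size per group: {n_per_group:.1f}")

# The same object can instead solve for the power given a fixed sample size:
achieved_power = analysis.power(effect_size=effect_size, nobs1=30,
                                alpha=alpha, ratio=1.0)
print(f"Power with 30 subjects per group: {achieved_power:.2f}")
```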
Often in bioimpedance measurements, we want to examine more than one bioimpedance variable per sample or measurement. Sometimes, we have limited knowledge beforehand on the effects in bioimpedance in our study. In order to maximize the chances of a finding, we may acquire several bioimpedance parameters (e.g. |Z|, G, θ) at multiple frequencies from one measurement. Say this gives us a set of 100 variables for comparing two different tissue types; it is then very probable that we will find a significant difference in at least one of the variables purely due to chance. It is possible to adjust for such multiple comparisons by e.g. the Bonferroni correction method [3,4], which adjusts the significance level threshold by dividing the single-comparison level by the number of comparisons included in the analysis. For bioimpedance analysis, this approach is often too conservative due to the large number of comparisons, drastically reducing the statistical power. Unless the number of comparisons is small, a better approach is to reduce the number of variables by data reduction or model-based approaches. Quite often, and especially for bioimpedance frequency sweeps, the data will be highly correlated and can be reduced into a small set of variables which account for most of the information in the measurements. A common method in the bioimpedance field is to assume that the electrical properties of the sample can be described by an electrical equivalent model (see chapter 8 in [5]) such as the Cole model, and to estimate the component values by fitting the measurement to the mathematical expression of the model. With a good agreement between the model and the measurement, this approach reduces the measurement into a few uncorrelated parameters which are easier to handle statistically. Data reduction can also be done without equivalent-model assumptions. One such way is to computationally transform the data into a set of uncorrelated components using principal component analysis (PCA). The transformation works by constructing linear combinations of the data (components) which explain as much as possible of the variance in the data, with the constraint that all components must be uncorrelated. The PCA may provide a data subset by which almost all the information (e.g. 99% of the variance) is accounted for by just a few components. The disadvantage of PCA compared to the model-based approach is that the transformation is a “black-box” and the principal components have no direct physical meaning with respect to what we are measuring.
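A minimal sketch of such PCA-based data reduction, using scikit-learn in Python, is shown below; the impedance matrix is simulated (rows = measurements, columns = frequencies) and the 99% threshold is an illustrative assumption.

```python
# PCA data-reduction sketch (Python, scikit-learn). In practice each row of Z
# would be one measured frequency sweep.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_sweeps, n_freqs = 40, 100
Z = rng.normal(size=(n_sweeps, 3)) @ rng.normal(size=(3, n_freqs))  # highly correlated columns
Z += 0.01 * rng.normal(size=Z.shape)                                # small measurement noise

pca = PCA(n_components=0.99)      # keep enough components to explain 99% of the variance
scores = pca.fit_transform(Z)     # the reduced, uncorrelated variables

print("Components kept:", pca.n_components_)
print("Cumulative explained variance:", pca.explained_variance_ratio_.cumsum())
```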
After the data has been reduced to a practical set of parameters (if necessary), the next step is to perform a statistical analysis in order to test whether our null-hypothesis can be rejected or not. The choice of statistical method is mainly determined by the hypothesis, but the measurements may also influence the selection of the most appropriate method. Figure 2 provides a flowchart for selection of statistical method based on the type of study.
Let us go back to the example of the alpha parameter of two tissue types, with the hypothesis that the alpha is different between the two tissue types. Our natural choice of test is a two-sample Student’s t-test, which is designed to test whether the means of two sets of data are different. If, however, our hypothesis is also on the direction of the difference (e.g. that alpha is higher in one particular tissue type), a one-tailed version of the test can be used instead of the two-tailed test.
The t-test belongs to the family of parametric tests, which assume that the data follow a particular distribution (usually the normal distribution). If this assumption does not hold, a non-parametric alternative such as the Mann-Whitney U test can be used instead.
In general, parametric tests are the better choice when their assumptions are met, because they have higher statistical power. In bioimpedance analysis, we often apply mathematical transformations to our measurements in order to interpret or graph them differently. When doing statistical analysis, we need to keep in mind that such transformations may also change the distribution of the data. For instance, when transforming a normally distributed set of |Z| measurements to |Y|, the distribution is likely to become non-normal.
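A minimal sketch of such an unpaired comparison in Python with SciPy is shown below; the data are simulated, and the Shapiro-Wilk check is just one possible way to decide between the parametric and non-parametric test.

```python
# Two-sample comparison sketch (Python, SciPy) with a simple normality check.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
z_gel = rng.normal(loc=500, scale=50, size=20)       # simulated |Z| with gel electrodes (Ohm)
z_textile = rng.normal(loc=650, scale=60, size=20)   # simulated |Z| with textile electrodes (Ohm)

def is_normal(x, alpha=0.05):
    """Shapiro-Wilk test for approximate normality."""
    _, p = stats.shapiro(x)
    return p > alpha

if is_normal(z_gel) and is_normal(z_textile):
    res = stats.ttest_ind(z_gel, z_textile, equal_var=False)   # Welch's t-test
else:
    res = stats.mannwhitneyu(z_gel, z_textile, alternative='two-sided')

print(res.statistic, res.pvalue)
```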
In the previous example, the measurements came from independent samples. If we have pairs of tissue types with each pair coming from the same animal, we cannot consider the samples from the two tissue types independently, and we have to use statistical tests which account for the correlations within each pair, such as the paired t-test or the non-parametric Wilcoxon signed rank test. A typical situation where these tests are recommended is for testing the change in bioimpedance before versus after a treatment.
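A corresponding sketch for the paired case, again with simulated data, could look like this:

```python
# Paired comparison sketch (Python, SciPy): bioimpedance before vs. after treatment.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
before = rng.normal(500, 50, size=15)
after = before - rng.normal(30, 10, size=15)   # simulated decrease after treatment

t_res = stats.ttest_rel(before, after)         # paired t-test (parametric)
w_res = stats.wilcoxon(before, after)          # Wilcoxon signed-rank test (non-parametric)
print(t_res.pvalue, w_res.pvalue)
```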
If we want to statistically compare more than two groups of measurements, another type of test is better suited: the analysis of variance (ANOVA). This test compares the variance within each group to the variance between the groups, and also overcomes the problem of multiple pairwise comparisons (as described in chapter 3). The ANOVA has the following assumptions: independence between groups, normal distribution and equal variances within the groups. For more detail on the theory, testing and violations of these assumptions, see e.g. [7]. For non-normal data, most statistical packages offer rank-based ANOVAs, and the ordinary ANOVA is also regarded as robust against violations of the normality assumption [8]. The one-way ANOVA, which is used for comparing more than two independent groups, first calculates an F-statistic (the ratio of between-group variability to within-group variability) which, together with the degrees of freedom, determines a p-value for the null hypothesis that the data from all groups are drawn from populations with the same mean. Further on, the difference between each pairwise combination of the groups can be tested similarly to the t-test, but with correction for the multiple testing.
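The following sketch shows a one-way ANOVA followed by Tukey's HSD post-hoc test in Python; the three tissue groups and their values are simulated for illustration only.

```python
# One-way ANOVA sketch (Python) with Tukey's HSD for the corrected pairwise comparisons.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(3)
muscle = rng.normal(400, 40, 12)
fat = rng.normal(800, 60, 12)
skin = rng.normal(600, 50, 12)

f_stat, p_value = stats.f_oneway(muscle, fat, skin)
print(f"F = {f_stat:.2f}, p = {p_value:.2g}")

values = np.concatenate([muscle, fat, skin])
groups = ["muscle"] * 12 + ["fat"] * 12 + ["skin"] * 12
print(pairwise_tukeyhsd(values, groups, alpha=0.05).summary())
```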
The one-way ANOVA is useful when we study only one factor which groups the measurements (e.g. tissue type). If we, for example, want to study how electrode configuration in addition to tissue type affects the bioimpedance, we have a factorial design, and the appropriate method is the two-way ANOVA. In addition to the main effect of each factor, the two-way ANOVA also tests for an interaction between the factors, i.e. whether the effect of electrode configuration differs between tissue types.
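A sketch of such a two-way (factorial) ANOVA with the statsmodels formula interface is shown below; the factor names and the data frame are hypothetical.

```python
# Two-way (factorial) ANOVA sketch (Python, statsmodels) with simulated data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "tissue": np.repeat(["muscle", "fat"], 20),
    "electrode": np.tile(np.repeat(["gel", "textile"], 10), 2),
})
df["Z"] = (rng.normal(500, 40, len(df))
           + (df["tissue"] == "fat") * 300
           + (df["electrode"] == "textile") * 100)

model = smf.ols("Z ~ C(tissue) * C(electrode)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # main effects and interaction
```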
In the same way as the unpaired and paired t-tests are suited for comparing independent and dependent groups respectively, there are also ANOVA methods which are suitable for dependent groups, namely the repeated measures ANOVA.
In factorial repeated measures design, the effect of time (or the repeated experimental condition) can be investigated by including it as a factor in the two-way repeated measures ANOVA. It is important to know that the ANOVA does not consider the order of the time-points, only the difference between them, and if we want to evaluate a trend or relationship, it is better to use a regression approach.
In experiments, there may be other observable variables than the experimental factors which have an influence on the dependent variable. Such a variable may be continuous, and therefore problematic to add as a factor in the design. In this case, the variable can be added as a covariate. Let us use the example of measuring impedance in solutions during different chemical reactions. The temperature changes may be unknown and uncontrollable, but may influence the impedance. The temperature cannot then be added as a factor in the analysis, but it can be included as a covariate. The appropriate statistical method is the ANCOVA (analysis of covariance), which is a combination of ANOVA and regression. With this method, we will find out whether there is a significant difference between the impedance of the different chemical reactions while also controlling for the temperature effect. In some cases, we want to examine more than one dependent variable. If they are related, such as the |Z| and phase of the same measurement, both dependent variables can be studied in the same test while controlling for the correlation between them by the MANOVA (multivariate analysis of variance). If the dependent variables are not correlated, separate ANOVAs are appropriate. The MANOVA assesses the effect of each factor on each of the dependent variables (with p-values for each case), and also the interactions both among the independent variables and among the dependent variables. The advantages of this method are that several dependent variables can be studied in one test, which avoids the increased Type I error rate from multiple comparisons; that the correlations between the dependent variables are incorporated in the analysis; and that the test may even find a significant result for the combined effect of all dependent variables when the effect on each of them alone is not strong enough. For adding covariates to the MANOVA, the appropriate test is the MANCOVA (multivariate analysis of covariance), which is the same analysis as the MANOVA but adds control of one or more covariates that may influence the dependent variables.
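A minimal ANCOVA sketch in the same formula style, with temperature as a continuous covariate, might look like this; the column names and data are hypothetical.

```python
# ANCOVA sketch (Python, statsmodels): impedance vs. chemical reaction type,
# controlling for temperature as a covariate. Data are simulated.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "reaction": np.repeat(["A", "B"], 25),
    "temperature": rng.normal(25, 3, 50),
})
df["impedance"] = (1000 - 10 * df["temperature"]
                   + (df["reaction"] == "B") * 80
                   + rng.normal(0, 20, 50))

model = smf.ols("impedance ~ C(reaction) + temperature", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # reaction effect adjusted for temperature
```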
One type of model that incorporates all of the above (t-test, ANOVA, ANCOVA, MANOVA and MANCOVA, and also ordinary linear regression) is the general linear model (GLM), in which the dependent variable(s) are expressed as a linear combination of the independent variables plus an error term.
Until now, we have discussed group comparison or factor analysis with fixed factors, i.e. factors whose levels are specifically chosen or controlled by the experimenter. When the levels of a factor are instead a random sample from a larger population, for example the individual subjects or animals in a study, the factor is called a random factor and is modeled as a random effect.
In some cases, our experiment may include both fixed and random factors, and the analysis model is then called a mixed-effects model. Tremendous advances have been made in recent years in the methods for mixed model analysis, and the tools currently available offer many features and advantages over “traditional” methods. For instance, a mixed model analysis is not weakened by missing values in the same way as a repeated measures ANOVA. The mixed model can also deal with hierarchies in our data. For instance, we may study samples of different electrode types from different producers, and have different types of electrodes from each of the producers. In statistical terms, we have two factors (electrode type and producer) where different levels of one factor do not occur at all levels of the other factor, which is called a nested design.
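As a sketch of a simple mixed-effects analysis in Python with statsmodels, the example below fits a fixed effect of electrode type with a random intercept per subject; the data frame and variable names are hypothetical.

```python
# Linear mixed-effects model sketch (Python, statsmodels): fixed effect of
# electrode type, random intercept per subject. Data are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n_subjects, n_repeats = 10, 4
df = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subjects), n_repeats * 2),
    "electrode": np.tile(np.repeat(["gel", "textile"], n_repeats), n_subjects),
})
subject_offset = rng.normal(0, 50, n_subjects)[df["subject"]]   # between-subject variation
df["Z"] = (500 + subject_offset
           + (df["electrode"] == "textile") * 100
           + rng.normal(0, 20, len(df)))

model = smf.mixedlm("Z ~ C(electrode)", data=df, groups=df["subject"]).fit()
print(model.summary())
```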
Until now, we have been dealing with methods for investigating differences between groups and the effect that different factors have on these differences. Now we move on to methods that assess associations between variables. The most basic case is testing for a linear relationship between two variables (also called bivariate association) by the Pearson product-moment correlation coefficient. The output of this test is the r statistic, which indicates the strength (0-1 in absolute value) and direction (positive or negative depending on the sign of r) of the relationship between the two variables. The r does not say anything about the causality or dependency of the relationship: r will be the same whether X is dependent on Y or Y is dependent on X, or if they are independent of each other but dependent on another factor. The r also does not say anything about the agreement between X and Y; you can get r = 1 (perfect correlation) with paired observations on completely different scales. This makes r insufficient for testing agreement between two methods, but it is useful for exploring associations between two variables under a linearity assumption. Statistical inference (p-value and hypothesis test) on the correlation can be conducted (see e.g. Eq. 28.3 in Sheskin 2011). The squared r is also frequently used and is called the coefficient of determination (r²), which expresses the proportion of the variance in one variable that is explained by the other.
The Pearson product-moment correlation coefficient (and also the coefficient of determination) is based on an assumption of equal variance (in statistical terms called homoscedasticity) and of an approximately linear relationship. When these assumptions are violated, or when the data are ordinal, the non-parametric Spearman rank correlation coefficient, which only assumes a monotonic relationship, is a suitable alternative.
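Both coefficients are available in SciPy; the short sketch below uses simulated data with hypothetical variable names.

```python
# Correlation sketch (Python, SciPy): Pearson for a linear relationship,
# Spearman as the rank-based alternative. Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(50, 10, 30)          # e.g. a reference physiological variable
y = 2.0 * x + rng.normal(0, 5, 30)  # e.g. a related impedance-derived variable

r, p_r = stats.pearsonr(x, y)
rho, p_rho = stats.spearmanr(x, y)
print(f"Pearson r = {r:.2f} (p = {p_r:.2g}), r^2 = {r**2:.2f}")
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.2g})")
```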
The relationship between two variables can also be described mathematically by regression analysis, where one variable (the dependent or outcome variable) is modeled as a function of the other (the independent or predictor variable). In simple linear regression, a straight line is fitted to the data by minimizing the squared errors, and the resulting model can be used for prediction. A typical bioimpedance example is the prediction of total body water (TBW) from an impedance variable, where the prediction error can be summarized by the root mean square error (RMSE) between predicted and reference values.
Perhaps we have found that the RMSE of our TBW prediction based on bioimpedance was rather large, and that we need to reduce this error if we want to develop a TBW device. We might know of other factors which are also related to TBW, or factors which cause changes in bioimpedance but are not related to TBW (confounding variables). If these variables are independent, we may be able to reduce the prediction error by including them as predictor variables in a multiple regression model, where the outcome is modeled as a linear combination of several predictor variables.
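As a sketch of such a multiple regression in Python with statsmodels, the example below predicts TBW from a hypothetical impedance index plus two covariates; the variable names, coefficients and data are all invented for illustration.

```python
# Multiple linear regression sketch (Python, statsmodels): predicting TBW from
# an impedance index plus covariates, and summarizing the error as RMSE.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 60
df = pd.DataFrame({
    "ht2_over_R": rng.normal(60, 10, n),   # hypothetical impedance index (height^2 / resistance)
    "weight": rng.normal(75, 12, n),
    "age": rng.uniform(20, 70, n),
})
df["tbw"] = (0.4 * df["ht2_over_R"] + 0.1 * df["weight"] - 0.05 * df["age"]
             + rng.normal(0, 1.5, n))

model = smf.ols("tbw ~ ht2_over_R + weight + age", data=df).fit()
rmse = np.sqrt(np.mean(model.resid ** 2))
print(model.params)
print(f"RMSE = {rmse:.2f} L")
```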
Selection of the predictor variables may be done based on pre-existing empirical data or theory, but there are also semi-automatic methods which can assist in sifting out redundant predictors. One of these methods is called stepwise regression, where predictor variables are added (forward selection) or removed (backward elimination) one at a time according to their statistical contribution to the model.
For data such as bioimpedance frequency sweeps, where we may have many predictor variables compared to the number of observations, and/or highly correlated predictor variables, there are other types of regression methods which may be more suitable, such as partial least squares (PLS) regression or principal component regression, which regress the outcome on a small number of latent components instead of on the original, collinear variables.
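A short PLS regression sketch with scikit-learn is shown below; the frequency-sweep matrix and outcome are simulated, and the number of components is an arbitrary choice for illustration.

```python
# PLS regression sketch (Python, scikit-learn): many correlated predictors
# (e.g. a full frequency sweep) compressed into a few latent components.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(9)
n_obs, n_freqs = 30, 100
latent = rng.normal(size=(n_obs, 2))
X = latent @ rng.normal(size=(2, n_freqs)) + 0.05 * rng.normal(size=(n_obs, n_freqs))
y = 3 * latent[:, 0] - 2 * latent[:, 1] + 0.1 * rng.normal(size=n_obs)

pls = PLSRegression(n_components=2).fit(X, y)
print("R^2 on training data:", pls.score(X, y))   # should still be validated on new data
```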
Until now, we have been dealing with predictions of a continuous outcome as the dependent variable. Suppose we have investigated bioimpedance with respect to tissue type and found a significant difference between two types of tissue, and now we want to test how well bioimpedance could be used to discriminate between these tissue types. A common first choice for such a classification problem is logistic regression, where the probability of class membership is modeled as a function of one or more bioimpedance variables.
Another classification method is the
Other classification methods used in biomedical research include artificial neural networks (ANN), decision trees, support vector machines (SVM), naïve Bayes classification and the k-nearest neighbors (kNN) algorithm.
ANN is a classification method inspired by the workings of the central nervous system. By constructing a network of interconnected nodes (“neurons”) organized in layers, a classification algorithm is developed (also called trained) by optimizing the weights of each node-to-node connection, representing the connection strength between them. As new inputs of selected features are fed through the network, the output layer at the end of the network will provide the suggested classification.
Decision trees are based on an algorithm for splitting the input data in a way that maximizes the separation of the data, resulting in a tree-like structure [15]. There are algorithms that can suggest the structure of the tree, such as Hunt’s algorithm, but these algorithms usually employ a greedy strategy that grows the tree by making a series of locally optimal decisions. Another drawback of the method is that continuous variables are implicitly discretized by the splitting process, losing information along the way [16].
SVM has become popular due to the performance the method has demonstrated in problems such as handwriting recognition. The principle is based on representing the data as points in space and then finding an optimal surface, called a hyperplane, which maximizes the margin between the classes. If the classes are not linearly separable in the original data space, the data are mapped into a much higher-dimensional space (called a feature space) by means of a kernel function, in which a separating hyperplane may be found.
Naïve Bayes classification uses Bayes’ theorem together with a “naïve” independence assumption to calculate probabilities of class membership. Predicting class membership can be done directly using Bayes’ theorem with only one feature and the prior probability. As an example, consider that we are investigating bioimpedance as a marker for wound healing. Suppose we gathered 100 measurements after wounding in an experiment where 50 of the wounds healed by themselves. Among the wounds that healed, the impedance increased during the healing process in 35 of the 50 wounds, and in 5 out of the 50 wounds that did not heal. We can now calculate the conditional probability of a wound belonging to the “healing” class based on whether or not the impedance increases using Bayes’ theorem, giving us 88% if the impedance increases and 25% if it does not. When including several conditional features, the mathematics would normally become problematic due to the relations between the features, but the naïve Bayes classifier assumes that all features are independent, which allows for easy computation.
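Written out with the numbers from this example, Bayes’ theorem gives:

P(healing | increase) = P(increase | healing) · P(healing) / P(increase) = (35/50 · 0.5) / (40/100) = 0.35 / 0.40 ≈ 0.88

P(healing | no increase) = P(no increase | healing) · P(healing) / P(no increase) = (15/50 · 0.5) / (60/100) = 0.15 / 0.60 = 0.25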
The k-nearest neighbors (kNN) algorithm predicts the class of a point in a feature space based on the known classes of its k nearest neighboring points in this space. For instance, say we want to predict tissue status based on a set of independent bioimpedance features, such as the Cole parameters. Using a dataset of measurements with known tissue states, the kNN algorithm first constructs class-labeled vectors in a multidimensional space with one dimension for each feature. Class prediction of a new measurement is then made by a majority vote among the class memberships of the k nearest neighbors, according to the distance (usually Euclidean) to the new point.
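A small kNN sketch with scikit-learn is shown below; the Cole-parameter features, class labels and numbers are simulated purely for illustration.

```python
# kNN classification sketch (Python, scikit-learn): simulated Cole-parameter
# features with known tissue states used to classify a new measurement.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(10)
# Hypothetical features per measurement: [R0, R_inf, alpha]
healthy = rng.normal([800, 300, 0.75], [50, 30, 0.05], size=(25, 3))
damaged = rng.normal([600, 250, 0.60], [50, 30, 0.05], size=(25, 3))
X = np.vstack([healthy, damaged])
y = np.array(["healthy"] * 25 + ["damaged"] * 25)

knn = KNeighborsClassifier(n_neighbors=5)   # Euclidean distance by default
knn.fit(X, y)

new_measurement = np.array([[650, 260, 0.63]])
print(knn.predict(new_measurement), knn.predict_proba(new_measurement))
```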
These classification methods use different principles and rules for learning and prediction of class membership, but will usually produce comparable results. Some comparisons of the methods have been published [e.g. 17, 18]. Although modern methods such as SVM have demonstrated very good performance, the drawback is that the model becomes an incomprehensible “black-box”, which removes the explanatory information provided by e.g. a logistic regression model. In many applications, however, classification performance outweighs the need for a comprehensible model. Principal component analysis (PCA) has also been used for classification based on bioimpedance measurements. Technically, PCA is not a classification method but rather a data reduction method, and it is more suitable as a parameterization step before the classification analysis.
Until now, we have been dealing with exploratory methods, where the bioimpedance measurements have been used to explore differences between groups of measurements, effects of different factors, or associations between bioimpedance and other parameters. We have also been dealing with predictions of either continuous or discrete outcomes, but not with the validation of a measurement method, defined as the “provision of objective evidence that a given item fulfils specified requirements, where the specified requirements are adequate for an intended use” [19].
For bioimpedance measurements, the performance is in most cases determined by the agreement between a developed bioimpedance parameter and a reference (“gold standard”). For example, if we develop a probe to detect breast cancer, we need to find out how often it correctly detects cancerous tissue (the sensitivity) and how often it correctly detects healthy tissue (the specificity). These two statistics are what the potential users will mainly look for when considering the method. Our approach will then be as follows. We have already explored the difference in bioimpedance between healthy and cancerous tissue by the procedure shown in figure 1, and based on our previous results we have also selected which bioimpedance parameters and algorithm we will use for discriminating between the two tissue types. We now do a new study using the selected method on a new sample of subjects. The sample size should be adequate in order to obtain an estimate with acceptable precision, and can be estimated based on the prevalence and the anticipated sensitivity and specificity [20, 21]. Say we did 500 measurements, among which 100 were confirmed positive by a reference measurement. Among these 100, our method detected 85 as positive, and among the 400 negative, our method detected 350 as negative. We can now calculate the sensitivity and specificity by:
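Sensitivity = TP / (TP + FN)    (1)

Specificity = TN / (TN + FP)    (2)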
where TP is the number of true positives (85), FN is the number of false negatives (100 − 85 = 15), TN is the number of true negatives (350) and FP is the number of false positives (400 − 350 = 50). Our sensitivity and specificity then become 85% and 87.5%, respectively. It is also often of interest to see how the sensitivity and specificity depend on the decision threshold (i.e. the level of our bioimpedance parameter which separates healthy and cancerous tissue). The ROC (receiver operating characteristic) curve is constructed by plotting the sensitivity (the true positive rate) against 1 − specificity (the false positive rate) for the whole range of decision thresholds. This plot allows us to see what sensitivity and specificity we may obtain according to what we consider important for the application. The area under the curve (AUC) is usually reported together with the ROC curve as a measure of overall classification performance.
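A sketch of how these quantities can be computed in Python with scikit-learn is given below; the continuous marker and class labels are simulated, and the decision threshold is an arbitrary illustrative choice.

```python
# ROC/AUC sketch (Python, scikit-learn): sensitivity, specificity and the ROC
# curve for a simulated continuous bioimpedance marker.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

rng = np.random.default_rng(11)
y_true = np.array([1] * 100 + [0] * 400)                        # 100 positives, 400 negatives
marker = np.where(y_true == 1, rng.normal(2, 1, 500), rng.normal(0, 1, 500))

# Dichotomize at one chosen decision threshold
y_pred = (marker > 1.0).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Sensitivity:", tp / (tp + fn), "Specificity:", tn / (tn + fp))

# Performance over all possible thresholds
fpr, tpr, thresholds = roc_curve(y_true, marker)
print("AUC:", roc_auc_score(y_true, marker))
```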
A very relevant case in the field of bioimpedance is the validation of an estimate of a continuous physiological parameter, where we want to find out how well our estimate agrees with a reference measurement of this parameter. Among the methods used for evaluating agreement in medical instruments measuring continuous variables, the Bland-Altman method [22] is the most popular [23]. With this method, a plot is constructed with the means of all measurement pairs (estimate and reference) on the x-axis and the differences between them on the y-axis. In addition, the mean difference line is plotted along with two lines representing the 95% limits of agreement (LOA), given by the mean difference ± 1.96 standard deviations of the differences. With this simple method, the reader can easily see how much the two measuring methods differ according to the magnitude of the measurement, and can also inspect for systematic differences such as bias or trends. The LOA tell us that most (95%) of the measurements had a difference within the upper and lower LOA. It is not possible to give a general criterion for an acceptable LOA because it depends on the intended use of the proposed method. As an example, limits of agreement of up to ±30% have been recommended as acceptable for introducing new techniques within cardiac output measurements [24]. It is important to note that the correlation coefficient or the coefficient of determination is not sufficient for reporting agreement, as two variables may have a perfect linear relationship but at the same time be very different in magnitude. Another type of correlation which avoids this problem is the concordance correlation coefficient, which measures how well the paired measurements fall along the line of identity.
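A minimal Bland-Altman sketch in Python is shown below; the paired estimate and reference data are simulated, and the added bias and error are arbitrary illustration values.

```python
# Bland-Altman sketch (Python, matplotlib): mean difference and 95% limits of
# agreement between a new estimate and a reference measurement (simulated).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(12)
reference = rng.normal(40, 5, 50)                     # e.g. reference TBW in litres
estimate = reference + 1.0 + rng.normal(0, 1.5, 50)   # new method with bias and random error

mean_pair = (estimate + reference) / 2
diff = estimate - reference
bias = diff.mean()
loa = 1.96 * diff.std(ddof=1)

plt.scatter(mean_pair, diff)
plt.axhline(bias, color="k", label=f"bias = {bias:.2f}")
plt.axhline(bias + loa, color="k", linestyle="--", label="95% limits of agreement")
plt.axhline(bias - loa, color="k", linestyle="--")
plt.xlabel("Mean of the two methods")
plt.ylabel("Difference (estimate - reference)")
plt.legend()
plt.show()
```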
Models should always be validated in order to avoid overfitting and inflated performance results. When a predictive model, such as a regression model, is developed and tested on the same sample of measurements, there is a chance that the model parameters are optimized in a way that fits the sample better than the population it comes from, producing an overoptimistic result. The more complex the model is (e.g. the more independent variables it contains), the more relevant this becomes. Therefore, the model should always be tested against an independent sample in order to see how well it generalizes and will perform in practice. The model can be validated by replicating the results on one or more independent samples from the same population, but in most cases it is more practical to split the data into one part which is used to develop the model parameters (the training sample) and use the remaining data (the validation sample) to test the performance of the model. The validation data can then play the role of “new data” as long as the data are independent and identically distributed [27]. This is called the hold-out or split-sample validation method.
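The sketch below illustrates a hold-out split in Python with scikit-learn, followed by k-fold cross-validation, a commonly used alternative when the available data are limited; the data and choice of model are only illustrative.

```python
# Hold-out validation and k-fold cross-validation sketch (Python, scikit-learn)
# for a prediction model; data and model choice are hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.default_rng(13)
X = rng.normal(size=(80, 5))                              # e.g. impedance-derived predictors
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(0, 0.5, 80)

# Hold-out: fit on the training sample, evaluate on the untouched validation sample
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("Hold-out R^2:", model.score(X_val, y_val))

# k-fold cross-validation: every observation is used for validation exactly once
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("5-fold CV R^2:", scores.mean())
```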
There are several terms which are important in validating a new measurement method. A list of the most relevant aspects of validation is given in table 1, along with a definition of each term and how it is usually reported. The definitions of the terms vary among different fields and standards, sometimes giving an inconsistent meaning. The table is an attempt at giving an unambiguous overview of the terms based on the most common uses.
List of important terms in the validation of new measurement technology along with the most usual and recommended ways of reporting. 1There are numerous different definitions in the literature, which can be inconsistent and confusing. These definitions provide one version with the aim of reducing ambiguity. 2Accuracy has previously been defined as the same as trueness only, but with ISO 5725-1 [37], and as reflected in the JCGM 200:2012 [19], the definition of accuracy has for the most part changed to include both trueness and precision, as given here. The old definition is still in use in some areas.
Term | Definition1 | Reported as |
---|---|---|
Measurement error | Measured quantity value minus a reference quantity value [19] | Quantity on the same scale as the measurement scale, relative error, percentwise error, mean square error, root mean square error |
Sensitivity | The sensitivity of a clinical test refers to the ability of the test to correctly identify those patients with the disease [36] | Eq. (1) |
Specificity | The specificity of a clinical test refers to the ability of the test to correctly identify those patients without the disease [36] | Eq. (2) |
Agreement | The degree to which scores or ratings are identical [31] | Continuous: Bland-Altman plot |
Trueness | Closeness of agreement between the average value obtained from a large series of results of measurement and a true value [37] | Bias (i.e. the difference between the mean of the measurements and the true value) |
Precision | Closeness of agreement between independent results of measurements obtained under stipulated conditions [37] | Standard deviation, coefficient of variation |
Repeatability | Precision determined under conditions where independent test results are obtained with the same method on identical test items in the same laboratory by the same operator using the same equipment within short intervals of time [37] | Within-subject standard deviation [38] |
Reproducibility | Precision determined under conditions where test results are obtained with the same method on identical test items in different laboratories with different operators using different equipment [37] | Standard deviation, coefficient of variation |
Accuracy2 | Closeness of agreement between the result of a measurement and a true value (both trueness and precision) [37] | Bias (trueness) and standard deviation/coefficient of variation (precision) |
Reliability | Ratio of variability between subjects or objects to the total variability of all measurements in the sample [31] | Intraclass correlation coefficient |
The concept of reliability deserves some further explanation, as it is often confused with agreement. Agreement and reliability both describe how well measurements correspond, but agreement concerns how close repeated or paired measurements are on the original measurement scale, whereas reliability relates the measurement error to the variability between the subjects or objects being measured.
An advantage of using reliability to compare measurement methods is that it can be used to compare methods when their measurements are given on different scales or metrics [32]. For continuous variables, reliability is usually determined by the ICC. The ICC is a ratio of variances derived from ANOVA, with a maximum value of 1.0 indicating perfect reliability. There are different types of ICC, depending on e.g. one- or two-way model, fixed- or random-effects model, and single or average measures (see [33] for more on selection), and the type used should be reported in a reliability study [34]. For assessing reliability in categorical data, the kappa statistic (e.g. Cohen’s kappa) is commonly used.
Which of these measures to report should be chosen based on how the measurements are to be used in the future, and the same applies to how strict the performance requirements need to be. A certain degree of measurement error may be acceptable if the measurements are to be used as an outcome in a comparative study such as a clinical trial, but the same errors may be unacceptably large in individual patient management such as screening or risk prediction [32]. For some applications there are specific ways of reporting performance which have become standard, such as the Clarke error grid together with the MARD (mean absolute relative deviation) for blood glucose measurement.
Finally, it is important to also mention the concept of