Following the outbreak of the coronavirus disease 2019 (COVID-19) pandemic, design, development, validation, verification and implementation of diagnostic tests were actively addressed by a large number of diagnostic test manufacturers. No test is ideal and none are 100 per cent reliable. Diagnostic tests establish the presence or absence of disease in order to make treatment decisions. A diagnostic test is carried out on symptomatic individuals or after a screen-positive confirmatory test has been obtained [1].
A new medical test must first undergo a series of assessments before it can be introduced into general clinical use.
Is it effective? Does the test work in the laboratory? Is it clinically efficient? Does the test work in the patient population of interest? Will the test bring health-outcome benefits [2]?
Evaluation of a new test’s diagnostic accuracy is carried out to assess how well it discriminates between patients with or without the target disease.
The accuracy of an index test cannot be evaluated without a reference standard. At the commencement of a study, there should be consensus that the reference standard to be used is more accurate than the index test. More than one acceptable reference standard may be appropriate for use in a test accuracy study.
The test accuracy is defined as a comparison between the disease conditions (Target condition) estimated by a test of interest (Index test) and the best estimate of the actual disease state (Reference standard). It is an unequivocal acknowledgement that most tests make errors even if correctly performed.
A degree of pragmatism may be required when choosing an acceptable reference standard. The most accurate reference standard may not be feasible or ethical. Less accurate methods may have to be used. The reference standard may not always be a gold standard (vide infra); the use of a non-gold or imperfect standard may occur when there is no generally accepted reference standard for the target condition. However, using an imperfect reference standard produces reference standard bias [3].
Method of evaluating the diagnostic accuracy of a medical test with binary test results and dichotomised disease status.
| Index test | Reference test (Gold standard): Positive | Reference test (Gold standard): Negative |
|---|---|---|
| Positive | True positive | False positive |
| Negative | False negative | True negative |
Depending on the test’s resultant characteristics, including sensitivity and specificity standards, one may determine the role the new test can play in the diagnostic schema. It may be deemed better than any existing test, a possible replacement test or used as a triage test.
The basic measures of the diagnostic accuracy of a test are sensitivity and specificity. Other measures are predictive values, likelihood ratios, overall accuracy, the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUROC), the ROC surface, and the volume under the ROC surface (VUS) (vide infra).
Sensitivity (true-positive rate) is the proportion of subjects who have the disorder (by the gold standard) and give a positive result by the new test. Specificity (true-negative rate) is the proportion of subjects who do not have the disorder and give a negative test.
The positive predictive value (PPV) is the proportion of subjects who give a positive test result and have the disease.
The negative predictive value (NPV) is the proportion of subjects who give a negative test result and do not have the disease.
The likelihood ratio for a positive test result (LR+) is how much more likely a positive result is to be found in a person with, as opposed to without, the disease.
The likelihood ratio for a negative test result (LR−) is how much more likely a negative result is to be found in a person with the disease than in a person without the disease.
Accuracy of a test: This is the proportion of subjects who give the correct result.
A false positive is an error resulting from the incorrect indication of a disease’s presence, i.e. the result is positive when, in reality, the patient is disease-free.
A false negative is an error resulting from the incorrect indication that the patient does not have the disease, i.e. the result is negative when, in reality, the patient has the disease.
Information regarding test accuracy is useful in indicating screening, diagnosis, predisposition, monitoring, prognosis, and drug effectiveness.
Screening: Which patients have an asymptomatic disease?
Diagnosis: Which patients have a symptomatic disease?
Predisposition: Which patients could develop the disease?
Monitoring:
Is the disease controlled?
How advanced is the disease?
Has the disease recurred?
Prognosis: Will the disease progress over time?
Drug effectiveness: Is a drug effective?
Comparing the index test results with a reference standard for diagnosing the same target condition in the same participants allows the above-listed measures to be quantified.
The key terms are:
Index test: the test under evaluation for accuracy
Reference standard: the best available standard of identifying the target condition against which the index test results will be compared.
Target condition: the condition under detection
This can be a pathologically defined condition (e.g. fracture) OR a symptom requiring treatment (e.g. high blood pressure)
The population of interest must be clearly defined. It would be incorrect to appraise a diagnostic test using a population that does not represent the target population. e.g. using a population derived from a university student population to appraise a test to be used in care home patients. The ideal sample for a test accuracy study is a consecutive or randomly selected series of patients in whom the target condition is suspected, or for screening studies, the target population.
The Index test: The index test is the NEW test under evaluation.
The Reference standard: The reference standard is the standard against which the index test is compared. It is usually the best test currently available but may not necessarily be the test used routinely in practice.
Test accuracy is predicated on a one-sided comparison of the index test results with the reference standard. The reference standard is important in validating the study's accuracy, as it is assumed to be 100% accurate. This assumption is rarely correct and represents a fundamental limitation of test accuracy studies: any inconsistency is presumed to result from errors in the index test. Therefore, the selection of the reference standard is critical to the validity of a test accuracy study, and the definition of the diagnostic threshold forms part of that reference standard. Where there is no consensus on the best reference test, a composite reference standard, considered a better indicator of actual disease status, may be used.
Test accuracy compares the disease condition (target condition) estimated by the index test with the best estimate of the actual disease state given by the reference standard. It is indisputable that most tests produce errors, even when properly carried out.
The new test's characteristics can be computed from the values obtained for sensitivity and specificity, the positive predictive value, the negative predictive value, the likelihood ratios, the pre-test probability and odds, the post-test probability and odds, and the receiver operating characteristic (ROC) curve [4].
Each of these values should be calculated.
The four possible outcomes of cross-classification are represented in a diagnostic 2x2 contingency table.
| Patient number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Reference result | P | P | P | N | P | N | N | P | P | N |
| Index (new test) result | P | N | P | N | P | P | N | P | N | P |
| Classification | TP | FN | TP | TN | TP | FP | TN | TP | FN | FP |
Number of TP = 4
Number of FP = 2
Number of TN = 2
Number of FN = 2
Total TP+FP+TN+FN = 10
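The cross-classification above can be reproduced with a short script (Python is used here purely for illustration; the patient data are those of the worked example):

```python
# Reference-standard and index-test results for the ten patients above
# (P = positive, N = negative).
reference = ["P", "P", "P", "N", "P", "N", "N", "P", "P", "N"]
index     = ["P", "N", "P", "N", "P", "P", "N", "P", "N", "P"]

def classify(ref, idx):
    """Cross-classify one patient's index result against the reference result."""
    if idx == "P":
        return "TP" if ref == "P" else "FP"
    return "FN" if ref == "P" else "TN"

outcomes = [classify(r, i) for r, i in zip(reference, index)]
counts = {label: outcomes.count(label) for label in ("TP", "FP", "TN", "FN")}
print(counts)  # {'TP': 4, 'FP': 2, 'TN': 2, 'FN': 2}
```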
Note: See text. The reference test is always considered to be 100% accurate, though it may not be in reality.
It is against the Reference test results that the Index test results are compared.
Results of diagnostic tests
| Index test | Reference standard: Positive | Reference standard: Negative |
|---|---|---|
| Positive | TP | FP |
| Negative | FN | TN |
2 X 2 table of the results of diagnostic tests.
A false positive is an error in which a test result incorrectly indicates the presence of a disease, i.e. the result is positive when, in reality, the patient is disease-free.
A false negative is an error in which a test result incorrectly indicates the absence of a disease, i.e. the result is negative when, in reality, the patient has the disease.
Sensitivity (true-positive rate): the proportion of subjects with the disorder (by the reference test) who give a positive result by the index (new) test. TP/(TP + FN)
Specificity (true-negative rate): the proportion of subjects without the disorder who give a negative test. TN/(TN + FP)
Positive predictive value (PPV): the proportion of subjects with a positive test who do have the disease. TP/(TP + FP)
Negative predictive value (NPV): the proportion of subjects with a negative test who do not have the disease. TN/(TN + FN)
Likelihood ratio for a positive test result (LR+): how much more likely a positive test is to be found in a person with the disease than in a person without it. sensitivity/(1 − specificity)
Likelihood ratio for a negative test result (LR−): how much more likely a negative test is to be found in a person with the disease than in a person without it. (1 − sensitivity)/specificity
False positive rate: the rate at which the test incorrectly indicates the presence of the disease, i.e. positive results in patients who are in reality disease-free. FP/(FP + TN)
False negative rate: the rate at which the test incorrectly indicates the absence of the disease, i.e. negative results in patients who in reality have the disease. FN/(FN + TP)
Accuracy of a test: the proportion of subjects with the correct result. (TP + TN)/(TP + FP + FN + TN)
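Applied to the worked example (TP = 4, FP = 2, TN = 2, FN = 2), the formulas give the following values (a minimal Python sketch for illustration):

```python
# Counts from the ten-patient worked example.
TP, FP, TN, FN = 4, 2, 2, 2

sensitivity = TP / (TP + FN)                 # 4/6 ≈ 0.667
specificity = TN / (TN + FP)                 # 2/4 = 0.5
ppv = TP / (TP + FP)                         # 4/6 ≈ 0.667
npv = TN / (TN + FN)                         # 2/4 = 0.5
lr_pos = sensitivity / (1 - specificity)     # ≈ 1.33
lr_neg = (1 - sensitivity) / specificity     # ≈ 0.67
accuracy = (TP + TN) / (TP + FP + FN + TN)   # 6/10 = 0.6
```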
Predictive values depend on the prevalence of the disease in the population being tested.
An increase in prevalence increases the positive predictive value and decreases the negative predictive value.
The likelihood ratio is often more useful than predictive values and can be calculated from the sensitivity and specificity. The likelihood ratio remains constant even when the prevalence changes.
The likelihood ratio indicates how many times more likely patients with the disease are to have a particular test result than patients without the disease.
Prevalence is the proportion of a particular population affected by a medical condition or disease at a specific time.
The effect of prevalence on the Positive Predictive Value
Prevalence % | PPV % | Sensitivity % | Specificity % |
---|---|---|---|
0.1 | 1.8 | 90 | 95 |
1 | 15.4 | 90 | 95 |
5 | 48.6 | 90 | 95 |
50 | 94.7 | 90 | 95 |
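The table values follow from Bayes' theorem: PPV = sensitivity × prevalence / (sensitivity × prevalence + (1 − specificity) × (1 − prevalence)). A minimal Python sketch reproducing the rows:

```python
def ppv(prevalence, sensitivity, specificity):
    """Positive predictive value as a function of prevalence (Bayes' theorem)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Reproduce the table rows (sensitivity 90%, specificity 95%).
for prev in (0.001, 0.01, 0.05, 0.50):
    print(f"prevalence {prev:6.1%} -> PPV {ppv(prev, 0.90, 0.95):.1%}")
```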
A ROC curve is a performance measurement for classification problems at various threshold settings. The ROC is a probability curve, and the area under it characterises the degree of separability (Figure 1).
In medicine, a binary classification problem is determining from a test result whether a patient has or has not got a disease. This is a probability question that requires a threshold to be chosen in order to convert the probability into an actual prediction. The threshold should be chosen with care: in medical situations, a frequent and important consideration is whether a patient is declared to have a disease when he is disease-free. A probability of 0.5 is the commonly used threshold: when the probability is greater than 0.5, the prediction is 1 (the patient has the disease); otherwise it is 0 (the patient does not have the disease).
The probability threshold can be varied depending on the study: this produces different sets of 1s and 0s, and consequently a different set of predictions.
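This thresholding step can be sketched as follows; the predicted probabilities are hypothetical:

```python
# Hypothetical predicted disease probabilities for six patients.
probs = [0.1, 0.4, 0.35, 0.8, 0.65, 0.05]

def predict(probabilities, threshold=0.5):
    """Convert probabilities into binary predictions at a given threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

print(predict(probs))       # [0, 0, 0, 1, 1, 0]
print(predict(probs, 0.3))  # [0, 1, 1, 1, 1, 0]  lower threshold -> more positives
```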
The area under the ROC curve can be interpreted as follows [5]:
the average value of sensitivity for all possible values of specificity;
the average value of specificity for all possible values of sensitivity;
the probability that a randomly selected individual from the positive group has a test result indicating greater suspicion than that for a randomly chosen individual from the negative group.
When the variable under study cannot distinguish between the two groups, i.e. when there is no difference between the two distributions, the area will be equal to 0.5 (the ROC curve will coincide with the diagonal). When there is perfect separation of the values of the two groups, i.e. there is no overlapping of the distributions, the area under the ROC curve equals 1 (the ROC curve will reach the upper left corner of the plot).
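The third interpretation above (the probability that a randomly selected positive individual has a higher test value than a randomly selected negative one) can be computed directly by comparing every positive-negative pair. The test values below are hypothetical:

```python
# Hypothetical test values for diseased (positive) and healthy (negative) groups.
positives = [3.2, 4.1, 5.0, 2.8]
negatives = [1.5, 2.9, 2.0]

def auc_by_pairs(pos, neg):
    """AUC as the probability a random positive exceeds a random negative
    (ties count as half)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_by_pairs(positives, negatives))  # 11 of 12 pairs -> ~0.917
```

With identical distributions the estimate falls to 0.5, matching the "no separation" case described above.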
The 95% Confidence Interval is the interval in which the true (population) Area under the ROC curve lies with 95% confidence.
The P-value is the probability of finding the observed sample area under the ROC curve when, in fact, the true (population) area under the ROC curve is 0.5 (null hypothesis: area = 0.5). If P is low (P < 0.05), it can be concluded that the area under the ROC curve differs significantly from 0.5 and, therefore, that there is evidence the laboratory test can distinguish between the two groups.
| Predicted labels | Actual labels: True | Actual labels: False |
|---|---|---|
| Positive | TP | FP |
| Negative | FN | TN |
The following metrics can be extracted from a ROC curve.
Precision: TP/(TP + FP), calculated from the Positive row of the predicted labels in the table above. Precision is an important metric for avoiding mistaken positive predictions: among the patients predicted as having the disease, it measures the proportion who actually have it.
Sensitivity (true-positive rate): TP/(TP + FN), calculated using the True column of the actual labels. It indicates how many of the people who are actually sick are identified as such; it is a measure of the percentage of correctly classified true data.
Specificity (true-negative rate): TN/(TN + FP). It is important in identifying the patients who do not have the disease.
Having defined the metrics that can be used, the probability threshold that gives the best performance can be identified using the ROC curve.
Increasing the sensitivity of a test generally comes at the expense of specificity, and vice versa. It is acknowledged that a screening test for a particular condition should preferably be more sensitive than specific: only a small number of patients then go undiagnosed, and it is considered acceptable that a certain number of healthy subjects are declared to have the condition.
The ROC curve encapsulates, in a single, succinct format, all of the confusion matrices that would be obtained as the threshold varies from 0 to 1.
The ROC curve for a random model is frequently plotted alongside the curve under consideration to give a rapid comparison of how well the current model is doing. The further the ROC curve of the data under consideration is from the curve of the random model, the better; ideally, the curve should pass as close as possible to the top-left corner of the diagram.
Each point on the ROC curve corresponds to a particular probability threshold.
Pragmatically, the aim is to identify a point between B and C on the curve that achieves success on both the 0s and the 1s, and to pick the threshold related to that point.
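One common criterion for choosing such a point, offered here as an illustration rather than a method prescribed by the text, is Youden's J (sensitivity + specificity − 1), maximised over the candidate thresholds. The (threshold, sensitivity, specificity) points below are hypothetical:

```python
# Hypothetical (threshold, sensitivity, specificity) points along a ROC curve.
roc_points = [
    (0.2, 0.95, 0.55),
    (0.4, 0.90, 0.70),
    (0.5, 0.80, 0.85),
    (0.6, 0.65, 0.92),
]

def best_threshold(points):
    """Pick the point maximising Youden's J = sensitivity + specificity - 1."""
    return max(points, key=lambda t: t[1] + t[2] - 1)

threshold, sens, spec = best_threshold(roc_points)
print(threshold)  # 0.5 (J = 0.65, the largest of 0.50, 0.60, 0.65, 0.57)
```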
The area under the ROC curve (AUC) is a measurement ranging from 0.5 (random classifier) to 1 (perfect classifier). It signifies how well the model classifies the true and false data points. The greater the AUC, the closer the ROC curve approaches the desired top-left corner (vide supra).
Conclusion: the more area under the ROC curve, the better the test discriminates between diseased and disease-free subjects.
As most diagnostic tests are far from perfect, a single test is often insufficient. For this reason, clinicians use multiple diagnostic tests, administered either in parallel or in series. In the case of a patient with polyarthritis, for example, it can be said that she has lupus if she has a malar rash, or nephrotic syndrome, or thrombocytopenia, or pleural effusion, or antinuclear antibodies (ANA), etc. Applying the tests in parallel in this way increases sensitivity at the cost of specificity; applying tests in series has the opposite effect.
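Under the usual (and often only approximate) assumption that the tests are conditionally independent given disease status, the combined sensitivity and specificity of parallel and series strategies can be sketched as follows; the test characteristics used are hypothetical:

```python
# Combining two tests, assumed conditionally independent given disease status.
def parallel(sens_a, spec_a, sens_b, spec_b):
    """Positive if EITHER test is positive: sensitivity rises, specificity falls."""
    sens = 1 - (1 - sens_a) * (1 - sens_b)
    spec = spec_a * spec_b
    return sens, spec

def series(sens_a, spec_a, sens_b, spec_b):
    """Positive only if BOTH tests are positive: specificity rises, sensitivity falls."""
    sens = sens_a * sens_b
    spec = 1 - (1 - spec_a) * (1 - spec_b)
    return sens, spec

print(parallel(0.80, 0.90, 0.70, 0.85))  # ~ (0.94, 0.765)
print(series(0.80, 0.90, 0.70, 0.85))    # ~ (0.56, 0.985)
```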
The test accuracy is the comparison between the disease state estimated by a test of interest, the Index test, and the best estimate of the true disease state provided by the Reference standard.
Interpretation of numerical test accuracy metrics requires consideration of the number and consequences of test errors.
To decide which dimension of test accuracy is more important in a testing situation, the consequences of a positive or a negative index test result need to be considered.