Cervical cancer is a preventable disease. Effective measures are an organised cervical cancer screening programme in combination with vaccination against human papilloma virus (HPV) and treatment of precancerous lesions.1 Many risk factors can facilitate the development of cervical dysplasia and cancer. Among them are early onset of sexual activity, multiple sex partners, parity, marital status, socioeconomic status, factors that influence persistent infection (genetics, sex hormones, immunological impairment as in human immunodeficiency virus [HIV] infection), sexually transmitted diseases (HPV, HIV, herpes simplex virus [HSV], Chlamydia), factors related to HPV (genotype, number of viral copies), long-term use of hormonal contraception, smoking and obesity.2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
HPV is a very important risk factor, necessary for the development of cervical dysplasia and cancer.14, 15 After initiation of sexual activity, almost all women acquire HPV infection. The infection can be only transitory: it clears spontaneously and does not progress to dysplasia.16 Of patients aged 30–35 years, 13.5% test positive, compared with 5.4% of patients older than 35 years.17
In computer science, artificial neural networks (ANN) are part of artificial intelligence and machine learning, including deep learning. ANN are nonlinear computational models that can perform tasks in a way similar to the human brain. Just by analysing examples (a training set), they can perform classification, decision-making, prediction, visualisation, recognition and other tasks. The name neural networks comes from their similarity to the structure and behaviour of the human brain.18 There are many different types of ANN. They are a very important tool for processing large amounts of data, image processing, image recognition, computer vision and natural language processing. Their ability to learn and make predictions makes them a very useful tool in medicine.19, 20 They are used in everyday clinical practice in cancer diagnostics, where they help radiologists to recognise pathological features, help to predict the response of malignant tumours to treatment, help in triage and more.21, 22, 23, 24, 25
This study was designed to evaluate whether neural networks can help us to identify patients at higher risk of high grade squamous intraepithelial lesion (HSIL) and cervical cancer based only on the evaluation of their risk factors for cervical dysplasia and the result of the last Papanicolaou smear (PAP). If neural networks are successful in predicting high-risk patients, we could use them to identify such patients and take special measures when they become non-responders in the organised cervical cancer screening programme. With such special attention, we could prevent them from developing cervical cancer.
Our study was approved by the Medical Ethics Committee of the Republic of Slovenia on 10 November 2015 (No. 0120-553/2015-2 KME63/11/15). Data from patients who had conisation in the years 1993–2005 were collected in a database: age at the time of surgery, age at first intercourse, number of sexual partners, number of pregnancies (births, spontaneous and legal abortions), socio-economic status, marital status, type of contraception, smoking habits, menstrual pain, vaginal discharge, coagulopathy, colposcopic findings, result of last PAP smear, histopathology of cervical biopsy prior to conisation, indication for conisation, additional smears (HPV 16, 18, 31 and 33 and possible other pathogens), vaginal therapy before conisation, type of conisation, data regarding complications after conisation if present, final histopathology and whether the margins of the cone were free of disease. Records from the database were anonymised, and we used only data on suspected risk factors for HSIL: age at the time of surgery, age at menarche, age at first intercourse, number of sexual partners, number of deliveries, spontaneous and legal abortions, type of contraception, marital status, socioeconomic status, smoking habits, last PAP smear result and final histopathology of the cone. All patients with incomplete data were removed from the analysis.
The sample is relatively small and is not representative of the real-life situation, because most patients in it have dysplasia or carcinoma and only a smaller portion have low-grade squamous intraepithelial lesion (LSIL) or no dysplasia at all. In Slovenia, healthy women without dysplasia represent the majority of women who attend the organised cervical cancer screening programme ZORA. In 2019 in Slovenia, we diagnosed 105 new cases of cervical carcinoma and 1056 cases of HSIL. In the same period, we analysed 220301 PAP smears from 206323 women.26 First-line treatment for dysplastic changes of the uterine cervix is conisation or large loop excision of the transformation zone (LLETZ) in the majority of cases.27 In 2019, we performed 2017 conisation procedures: 1334 (66%) patients had conisation because of HSIL (cervical intraepithelial neoplasm [CIN]), 400 (20%) patients because of LSIL and 283 (14%) had no dysplasia.26 In Slovenia, the number of conisations is decreasing in favour of LLETZ.28
We constructed two basic settings of our database. In
Our database contains complete data for 1475 patients: 26 (1.8%) without dysplasia on the final histological result of conisation, 160 (10.8%) with LSIL and 1289 (87.4%) with HSIL. The last PAP smear was high risk in 16 patients (61.5%) without dysplasia, 127 patients (79.4%) with LSIL and 1169 patients (90.7%) with HSIL.
Mean age of patients without HSIL was 38.6 years (13–83 years, standard deviation 10.47) and 34.9 (13–81, standard deviation 8.98) in the group of patients with HSIL. Mean age at menarche was 13.7 (10–19, standard deviation 1.84) in the group of patients without HSIL and 13.5 (9–20, standard deviation 1.16) in HSIL patients. Mean age at first intercourse was 17.6 (13–25, standard deviation 1.59) in patients without HSIL and 17.4 (12–25, standard deviation 1.66) in patients with HSIL. The HSIL and NO-HSIL groups differed significantly in age (p < 0.01), age at first intercourse (p < 0.035), number of sex partners (p < 0.004) and high-risk PAP smear (p < 0.01).
In our group of patients without HSIL 57% tested HPV 16 negative and 27% positive (16% not tested) and in the group of patients with HSIL 54% tested negative and 33% positive (14% not tested). In the NO-HSIL group 65% tested HPV 18 negative, 21% positive (15% not tested) and in HSIL group 60% tested negative and 27% positive (13% not tested).
Because many patients did not have HPV testing, we decided to remove such patients from the analysis. When we analysed the patients removed for lack of HPV testing (HPV 16, 18, 31 or 33), we discovered that numerous patients with HSIL would be missed (Table 1).
Final histology of the cone in patients without human papilloma virus (HPV) testing
Frequency | Percent | |
---|---|---|
9 | 1.8 | |
26 | 5.3 | |
27 | 5.4 | |
90 | 18.1 | |
55 | 11.1 | |
223 | 45.0 | |
55 | 11.1 | |
11 | 2.2 | |
CIN = cervical intraepithelial neoplasm
The chi-square test (χ² = 1.631, p = 0.202) found no statistically significant association between HPV 16 and 18 status and the presence of HSIL in our group of patients. During this period, we did not routinely test for HPV infection. Because HPV testing may have detected transitory infection, and because over 400 patients with HSIL would be excluded from the analysis for not having been tested for HPV, we decided to exclude HPV status from further analysis. HPV 16 and 18 statuses in our patients are presented in Table 2.
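The reported p-value can be checked from the test statistic alone. The sketch below assumes the test had one degree of freedom (a 2 × 2 comparison), which is not stated in the text; under that assumption the upper-tail probability of the chi-square distribution reduces to the complementary error function. The function name `chi2_p_value_df1` is introduced here for illustration only.

```python
import math

def chi2_p_value_df1(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    The single degree of freedom is an assumption, not stated in the text.
    """
    return math.erfc(math.sqrt(chi2 / 2.0))

# Statistic reported in the text
p_value = chi2_p_value_df1(1.631)
print(f"p = {p_value:.3f}")
```

Under this assumption the result agrees with the reported p = 0.202 to three decimal places, well above the conventional 0.05 threshold.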
Number and percentage of patients according to human papilloma virus (HPV) 16 and 18 statuses in high grade squamous intraepithelial lesion (HSIL) and NO-HSIL group
HPV status | HPV 16, HSIL group: frequency | HPV 16, HSIL group: % | HPV 16, NO-HSIL group: frequency | HPV 16, NO-HSIL group: % | HPV 18, HSIL group: frequency | HPV 18, HSIL group: % | HPV 18, NO-HSIL group: frequency | HPV 18, NO-HSIL group: %
---|---|---|---|---|---|---|---|---
Not tested | 177 | 14 | 29 | 16 | 172 | 13 | 27 | 15
Negative | 693 | 54 | 106 | 57 | 775 | 60 | 120 | 65
Positive | 419 | 32 | 51 | 27 | 342 | 27 | 39 | 20
A human neuron, or nerve cell, is a cell that can be electrically or chemically excited. It has a body (soma), dendrites, which carry signals to the neuron, and a single axon, which carries signals away from the neuron and connects it with other nerve cells. Information is transferred by an electrical or chemical mechanism.29
An ANN contains different kinds of neurons. There are two main types. The input neuron, called a perceptron, receives information. The output neuron produces the final output. All neurons are arranged in layers. The first layer is the input layer with perceptrons, and the last layer is the layer with output neurons. In between there can be one or many hidden layers. Every neuron is connected to all neurons of the previous and the next layer.30 A diagram of a simple ANN is presented in Figure 1.
As in a nerve cell, artificial neurons in neural networks receive information and become excited. When the weighted sum of a neuron's inputs reaches the excitation threshold, the neuron passes a signal on to other neurons; below this level no output signal is produced. There are many different mathematical functions for neuron activation. The activation function of the output neurons can differ from that of the previous layers. The output of the last neuron is a numerical value in the range 0–1. The threshold for classification as positive or negative is 0.5 by default, meaning that cases with values > 0.5 are classified as positive and cases with values ≤ 0.5 as negative. The threshold value can be changed according to the performance of the algorithm and our goals.18
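The flow of a signal through such a network can be illustrated with a minimal sketch. It assumes a sigmoid activation function in every layer and uses hypothetical weights and biases chosen only for illustration; none of these values come from the study, and the function names are not part of any library.

```python
import math

def sigmoid(z):
    """Sigmoid activation: maps any real number to the range 0-1."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """Output of one fully connected layer.

    Each neuron sums its weighted inputs, adds its bias and applies
    the activation function.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]

def classify(score, threshold=0.5):
    """Values above the threshold are positive, all others negative."""
    return "positive" if score > threshold else "negative"

# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron
hidden = layer_forward([1.0, 0.0], [[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0])
output = layer_forward(hidden, [[1.0, -1.0]], [0.0])[0]
print(round(output, 3), classify(output))
```

Raising or lowering `threshold` trades sensitivity against specificity without retraining the network.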
The dataset must be split into two parts: a training set and a holdout set. The training set is used to build the model, to test relations between input variables and to determine the weights of the neurons. The algorithm is then tested on the holdout set, which contains instances unknown to the neural network. The training set must be larger than the holdout set.18
In every classification process, we have actual positive and negative cases, which can be classified correctly as positive or negative, or classified incorrectly. The best way to visualise the situation is to use a confusion matrix.
Effectiveness of ANN or any other classification system or algorithm can be measured. In our study we used p
It ranges between 0 (worst) and 1 (best). MCC ranges between −1 and +1: −1 means perfect misclassification, 0 means performance no better than random guessing and +1 means perfect classification.32 The precision-recall curve (PRC) is another measure of classification efficiency. Precision (PPV) is plotted on the y-axis and recall (TPR) on the x-axis. It is more informative than the ROC curve in imbalanced data settings because it analyses the fraction of true positives among all positive predictions.33
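As an illustration of how such measures follow from the confusion matrix, the sketch below computes precision, recall, F-measure and MCC from the four cell counts. The counts are an assumption: they were reconstructed by rounding from the per-class TP rates reported for the original (raw) dataset in Table 4 and the class sizes (1289 HSIL, 186 No-HSIL); they are not figures given directly in the text.

```python
import math

def classification_metrics(tp, fn, fp, tn):
    """Precision, recall, F-measure and MCC for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
    }

# Assumed counts, reconstructed from the reported rates (HSIL = positive)
metrics = classification_metrics(tp=1169, fn=120, fp=154, tn=32)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these reconstructed counts, precision and recall are high because the majority class dominates, while MCC stays close to 0, which is why MCC is the more honest summary for imbalanced data.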
Quality of data is of vital importance: a sufficient number of instances (collections of attributes in the database) and attributes of good quality (features that measure or describe different aspects of instances). Before running a classification algorithm, it is necessary to run a simulation of baseline classification. We can then compare the results derived from our model with the baseline results and decide how good (or bad) our model is at classification and prediction.
When we have imbalanced datasets, where one of the classes represents only a small proportion of the sample, the baseline prediction for the majority class is very high. For example, if the majority class represents 88% of instances, as in our case, the baseline prediction is 88%. If a prediction algorithm then predicts with 92% accuracy, the improvement over baseline is small and may not be statistically significant. There are several methods for dealing with imbalanced data:
Under-sampling: randomised reduction of majority class to match minority class
Over-sampling: n-fold replication of minority class to match majority class
SMOTE: synthetic minority over-sampling technique creates new synthetic instances, which have similar characteristics as original ones in minority class.34,35
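The three approaches can be sketched as follows. This is a simplified illustration, not the implementation used in the study: it assumes purely numeric attributes stored as tuples, the function names are hypothetical, and `smote_like` only mimics the core idea of SMOTE (interpolating between a minority instance and one of its k nearest minority neighbours). The toy data are generated only to match the class sizes of the original dataset.

```python
import math
import random

def under_sample(majority, minority, seed=0):
    """Randomly reduce the majority class to the size of the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority))

def over_sample(minority, n_fold):
    """Replicate every minority instance n_fold times."""
    return list(minority) * n_fold

def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points between minority instances and
    their k nearest minority neighbours (simplified SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the line base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Toy numeric data with the class sizes of the original dataset
majority = [(float(i), float(i % 5)) for i in range(1289)]
minority = [(float(i), float(i % 7)) for i in range(186)]

reduced = under_sample(majority, minority)
replicated = over_sample(minority, 3)
synthetic = smote_like(minority, 10)
print(len(reduced), len(replicated), len(synthetic))
```

Replication produces exact copies, whereas the interpolated points are new instances that lie between existing minority cases; attributes that are nominal rather than numeric would need a different treatment.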
Weka (1999–2020 The University of Waikato, Hamilton, New Zealand) is an open-source application for data mining that offers many other methods besides ANN, such as Bayesian networks, logistic regression, classification trees, k-nearest neighbours and others.36 It allows us to test a classification algorithm on the whole dataset, to split the dataset by percentage, to test against a separate dataset imported into Weka, or to use n-fold cross validation. When we manually or randomly split a dataset into a training and a holdout part, there is always a chance that we collect all the important instances in one of the sets, especially if one kind of instance represents a small proportion of all instances. N-fold cross validation is a powerful option that can minimise the chance of such a situation. It divides the entire dataset into n parts. In each of n rounds, n−1 parts are used as the training set and the remaining part as the holdout set, so that every part serves as the holdout set exactly once, and the reported result summarises the performance over all rounds. In our experiment, we used 10-fold cross validation.37
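The splitting step of n-fold cross validation can be sketched as below. This is a plain, non-stratified split written for illustration; it is not Weka's implementation, and whether the folds in the study were stratified by class is not stated here. The function name `k_fold_splits` is hypothetical.

```python
import random

def k_fold_splits(n_instances, k=10, seed=0):
    """Yield (training, holdout) index lists for k-fold cross validation.

    Every instance appears in exactly one holdout fold.
    """
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, holdout in enumerate(folds):
        training = [
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        ]
        yield training, holdout

# 1475 instances, as in the original dataset
splits = list(k_fold_splits(1475, k=10))
for training, holdout in splits[:2]:
    print(len(training), len(holdout))
```

A model would be trained on each `training` list and evaluated on the matching `holdout` list, with the ten evaluations combined into one overall result.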
We prepared eight data sets:
Raw set: we used as variable original risk factors and as output HSIL_Y/N.
Class set: same as raw set except age groups instead of age and PAP_HR_Y/N instead of last PAP.
Raw and class with under-sampling, over-sampling and SMOTE method for equalising imbalanced dataset.
The original dataset consisted of 186 No-HSIL and 1289 HSIL patients. To prepare the over-sampling dataset, we replicated the No-HSIL patients threefold to obtain 558 No-HSIL and the original 1289 HSIL patients. For under-sampling, we randomly selected and deleted HSIL patients to obtain 272 HSIL and the original 186 No-HSIL patients. With the SMOTE algorithm, we created a dataset with the original 1289 HSIL patients and 744 No-HSIL patients.
Baseline prediction was calculated for each set, and the results for the multi-layer perceptron with 10-fold cross validation were recorded. Results are presented in Table 4.
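The baseline (ZeroR) prediction for each of these datasets follows directly from the class counts, because ZeroR always predicts the majority class. A minimal sketch, using the counts given above (the function name is introduced here for illustration):

```python
def zero_r_accuracy(class_counts):
    """Percentage of correct predictions when always predicting
    the most frequent class."""
    return 100.0 * max(class_counts.values()) / sum(class_counts.values())

datasets = {
    "original": {"HSIL": 1289, "No-HSIL": 186},
    "over-sampling": {"HSIL": 1289, "No-HSIL": 558},
    "under-sampling": {"HSIL": 272, "No-HSIL": 186},
    "SMOTE": {"HSIL": 1289, "No-HSIL": 744},
}

baselines = {name: zero_r_accuracy(c) for name, c in datasets.items()}
for name, value in baselines.items():
    print(f"{name}: {value:.2f}%")
```

Any classifier has to exceed the corresponding value before it can be said to have learned anything beyond the class distribution.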
Confusion matrix for classification with all possible outcomes
Actual class | Predicted pos (PP) | Predicted neg (PN)
---|---|---
Actual pos | True positives (TP) | False negatives (FN)
Actual neg | False positives (FP) | True negatives (TN)
Neg = negatives; Pos = positives
Results of multi-layer perceptron (MLP) classification for different settings, with baseline prediction (ZeroR), percentage of correct classifications and Kappa statistic for all analyses. Results are given for prediction of high grade squamous intraepithelial lesion (HSIL)-Yes (Y), prediction of NO-HSIL (N) and the weighted average for the whole model (YES and NO combined; Weighted Avg). Results where prediction by MLP is better than the baseline prediction ZeroR are in bold type
Setting | TP Rate | FP Rate | Precision | Recall | F-Measure | MCC | ROC Area | PRC Area | Class | % Correct | Kappa | ZeroR %
---|---|---|---|---|---|---|---|---|---|---|---|---
Class | 0.751 | 0.634 | 0.739 | 0.751 | 0.745 | 0.118 | 0.567 | 0.735 | Yes | 82.10 | 0.0965 | 87.39
Class | 0.366 | 0.249 | 0.308 | 0.366 | 0.373 | 0.118 | 0.567 | 0.377 | No | | |
Class | 0.637 | 0.521 | 0.633 | 0.637 | 0.635 | 0.118 | 0.567 | 0.629 | Weighted Avg | | |
Class unders | 0.669 | 0.559 | 0.636 | 0.669 | 0.652 | 0.112 | 0.542 | 0.608 | Yes | 57.64 | 0.1113 | 59.39
Class unders | 0.441 | 0.331 | 0.477 | 0.441 | 0.458 | 0.112 | 0.542 | 0.448 | No | | |
Class unders | 0.576 | 0.466 | 0.572 | 0.576 | 0.573 | 0.112 | 0.542 | 0.543 | Weighted Avg | | |
Raw | 0.907 | 0.828 | 0.884 | 0.907 | 0.895 | 0.086 | 0.594 | 0.905 | Yes | 81.42 | 0.0856 | 87.39
Raw | 0.172 | 0.093 | 0.211 | 0.172 | 0.189 | 0.086 | 0.594 | 0.174 | No | | |
Raw | 0.814 | 0.735 | 0.799 | 0.814 | 0.806 | 0.086 | 0.594 | 0.813 | Weighted Avg | | |
Raw unders | 0.688 | 0.575 | 0.636 | 0.688 | 0.661 | 0.115 | 0.551 | 0.614 | Yes | 58.08 | |
Raw unders | 0.425 | 0.313 | 0.482 | 0.425 | 0.451 | 0.115 | 0.551 | 0.466 | No | | |
Raw unders | 0.581 | 0.469 | 0.573 | 0.581 | 0.576 | 0.115 | 0.551 | 0.554 | Weighted Avg | | |
Raw = original settings; Class= class setting; overs = oversampling; SMOTE = synthetic minority over-sampling technique; unders = undersampling
In the first part of the analysis, we analysed the original database with an artificial neural network, the multi-layer perceptron (MLP). We achieved 81.42% correct predictions, which is worse than the baseline ZeroR prediction of 87.39% (kappa = 0.08 showing no agreement between predicted and actual status, AUC 0.594, MCC 0.086, F-Measure 0.806, precision 0.799 and recall 0.814). When we corrected the minority class with the over-sampling method, the ZeroR prediction was 69.79% and MLP achieved 79.21% (kappa = 0.523 showing weak agreement between predicted and actual status, AUC 0.837, MCC 0.525, F-Measure 0.795, precision 0.800 and recall 0.792). SMOTE performed worse than over-sampling, with a baseline ZeroR of 63.40% and 77.87% correct predictions (kappa = 0.53 showing weak agreement between predicted and actual status, AUC 0.814, MCC 0.533, F-Measure 0.780, precision 0.784 and recall 0.779). The under-sampling method performed worse than the analysis on the original dataset, with a ZeroR prediction of 59.39% and 58.08% correct predictions (kappa = 0.11 showing no agreement between predicted and actual status, AUC 0.551, MCC 0.115, F-Measure 0.576, precision 0.573 and recall 0.581).
In the second part of the analysis, we grouped the data in classes as described previously. Analysis with MLP on the original data achieved 82.10% correct predictions, which is less than the baseline ZeroR prediction of 87.39% (kappa = 0.09 showing no agreement between predicted and actual status, AUC 0.567, MCC 0.118, F-Measure 0.635, precision 0.633 and recall 0.637). Performance of MLP was better with the over-sampling method, where the baseline ZeroR prediction was 69.79% and MLP achieved 84.19% correct predictions (kappa = 0.64 showing moderate agreement between predicted and actual status, AUC 0.870, MCC 0.640, F-Measure 0.844, precision 0.849 and recall 0.842). With the SMOTE method, the baseline ZeroR prediction was 63.40% and MLP achieved 77.08% correct predictions (kappa = 0.51 showing weak agreement between predicted and actual status, AUC 0.802, MCC 0.515, F-Measure 0.772, precision 0.775 and recall 0.771). The under-sampling method performed worse than the analysis on the original data, with a ZeroR prediction of 59.39% and 57.64% correct predictions (kappa = 0.11 showing no agreement between predicted and actual status, AUC 0.542, MCC 0.112, F-Measure 0.573, precision 0.572 and recall 0.576).
All results are presented in Table 4. MCC for all models is presented graphically in Figure 2 for prediction of HSIL-Yes and NO combined. True positive rate and false positive rate for all models are presented graphically in Figure 3. The ROC curve for the worst-performing model is shown in Figure 4 and for the best-performing model in Figure 5.
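The kappa statistic quoted with each result compares the observed proportion of correct classifications with the agreement expected by chance from the row and column totals of the confusion matrix. The sketch below shows the calculation; the four counts are an assumption, reconstructed by rounding from the per-class TP rates reported for the original (raw) dataset in Table 4 and the class sizes, and are not given directly in the text.

```python
def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a 2 x 2 confusion matrix."""
    total = tp + fn + fp + tn
    observed = (tp + tn) / total
    # Chance agreement from actual and predicted class totals
    expected = (
        (tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)
    ) / total ** 2
    return (observed - expected) / (1.0 - expected)

# Assumed counts, reconstructed from the reported rates (HSIL = positive)
tp, fn, fp, tn = 1169, 120, 154, 32
accuracy = 100.0 * (tp + tn) / (tp + fn + fp + tn)
kappa = cohen_kappa(tp, fn, fp, tn)
print(f"accuracy = {accuracy:.2f}%, kappa = {kappa:.4f}")
```

With these counts, more than 80% of cases are classified correctly, yet kappa remains close to 0 because almost the same agreement would be reached by chance in such an imbalanced sample.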
In medicine, we mostly deal with imbalanced classes. In such datasets the baseline prediction is high for the majority class. In most cases, we must precisely and accurately classify patients from the minority class.38 Misclassifying a patient with severe disease as negative means that we potentially endanger their health, and because of delayed diagnosis the disease can progress to a life-threatening situation or death. Such a situation endangers only the patient involved. If, for example, we classify patients who have a very contagious disease, misclassification as negative means that such false negative patients will spread the disease and endanger other healthy people. Misclassification of healthy patients as positive results in further diagnostic tests and eventually leads to the correct diagnosis. Unnecessary procedures result in greater stress for the patient, higher expenses and a bigger load on the health system. Good classification algorithms must therefore have very high sensitivity and specificity.
Cervical cancer is a preventable disease.1 Artificial intelligence (AI) and deep learning methods are used to optimise screening, diagnostic and treatment procedures and are also present in the field of cervical cancer. Cervical cytology is of vital importance in screening programmes. Mango
Sompawong
Holmström
Bao
Colposcopy is a very important diagnostic procedure. Clinical experience is important for an accurate colposcopic result.44 With AI, specifically deep convolutional networks, it is possible to analyse colposcopic images with higher accuracy than subjective human assessment. Chandran and colleagues reported 92.4% sensitivity, 96.2% specificity and kappa 0.88, which showed a strong association between predicted and actual status of colposcopic changes.45 It is important that women referred for colposcopy are correctly selected, to prevent overload in colposcopic clinics. Overload with improperly selected patients can result in misdiagnosis and unnecessary procedures and can be a threat to subsequent pregnancies.46 Karakitsos
In our study we used MLP, which is a back-propagation artificial neural network, on our dataset of patients who had conisation surgery at the University Gynaecologic Clinic Maribor in the years 1993–2005. As the input layer, we used known risk factors for the development of cervical dysplasia and carcinoma, and as the output layer high-risk dysplasia CIN2+ Yes/No. Risk factors are important and increase the risk of developing disease, but not all patients with risk factors develop disease.49 All patients with incomplete data were removed from the analysis, as in the majority of studies. The original dataset was imbalanced, and patients without HSIL represented the minority class. To our knowledge, this is the first study with such settings.
MLP performed worse on the original dataset than the baseline prediction. Such an outcome can be expected in a dataset where the data are imbalanced.36 There are several methods to equalise imbalanced data. We can reduce the majority class by randomly selecting and removing instances from it with the under-sampling method.34 When we balanced the dataset with the under-sampling method, prediction did not improve and stayed below baseline. The reason may be the removal of instances with important variables from the training and/or testing set. We prepared the dataset with the under-sampling method a few more times, but with none of the settings could we achieve better performance. MLP correctly classified 57.64% of cases, which is inferior to the baseline ZeroR of 59.39%, and the kappa statistic of 0.1113 also showed no agreement between real and predicted status.
SMOTE and over-sampling methods improved the performance of MLP.35 With the over-sampling method, we replicate instances from the minority class to match the size of the majority class. In this case there is always a chance that identical instances end up in both the training and the testing set.34 The SMOTE method uses the k-nearest neighbour algorithm to create new synthetic instances, which are all unique.35 In the best-performing model, where the baseline prediction ZeroR was 69.79%, MLP correctly classified 84.19% of cases and the kappa statistic of 0.64 showed moderate agreement between real and predicted status.
In real clinical practice, many patients have multiple risk factors but never develop disease, while many with only a few become ill. It is possible that patients do not tell the truth about risk factors because the questions are too intimate, because they are ashamed or because they do not remember. Collecting all risk factors from patients participating in screening or another programme in a nationwide database is also questionable because of ethical considerations.50 With our experiment we showed that, with the use of ANN, we can predict more patients who will develop HSIL than with baseline prediction, based only on the analysis of their risk factors for developing HSIL and the result of the last PAP smear. However, the performance and classification accuracy of ANN are not high enough for everyday clinical practice.