Identification of women with high grade histopathology results after conisation by artificial neural networks

Cervical cancer is a preventable disease. Effective measures are an organised cervical cancer screening programme in combination with vaccination against human papilloma virus (HPV) and treatment of precancerous lesions.¹ Many risk factors can facilitate the development of cervical dysplasia and cancer. Among them are early onset of sexual activity, multiple sex partners, parity, marital status, socioeconomic status, factors that influence persistent infection (genetics, sex hormones, immunological impairment as in human immunodeficiency virus [HIV] infection), sexually transmitted diseases (HPV, HIV, Herpes simplex virus [HSV], Chlamydia), factors related to HPV (genotype, number of viral copies), long-term use of hormonal contraception, smoking and obesity.^{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13}

HPV is a very important risk factor and is necessary for the development of cervical dysplasia and cancer.^{14, 15} After initiation of sexual activity, almost all women acquire an HPV infection. This infection can be merely transitory, clear spontaneously and not progress to dysplasia.¹⁶ Patients aged 30–35 years test positive in 13.5% of cases, compared with 5.4% of patients older than 35 years.¹⁷

In computer science, artificial neural networks (ANN) are a part of artificial intelligence and form the basis of deep machine learning. ANN are nonlinear computational models that can perform tasks in a way similar to the human brain. Just by analysing examples (a training set), they can perform classification, decision-making, prediction, visualisation, recognition and other tasks. The name neural networks comes from their similarity to the structure and behaviour of the human brain.¹⁸ There are many different types of ANN. They are a very important tool for processing large amounts of data, image processing, image recognition, computer vision and natural language processing. Their ability to learn and make predictions makes them a very useful tool in medicine.^{19, 20} They are used in everyday clinical practice in cancer diagnostics, where they help radiologists to recognise pathological features, help to predict the response of malignant tumours to treatment, help in triage and more.^{21, 22, 23, 24, 25}

This study was designed to evaluate whether neural networks can help us to identify patients at higher risk for high grade squamous intraepithelial lesion (HSIL) and cervical cancer based only on their risk factors for cervical dysplasia and the result of the last Papanicolaou smear (PAP). If neural networks are successful in predicting high-risk patients, we could use them to identify such patients and take special measures when they become non-responders in the organised cervical cancer screening programme. With such special attention, we could prevent them from developing cervical cancer.

Patients and methods

Our study was approved by the Medical Ethics Committee of the Republic of Slovenia on 10. 11. 2015, No.: 0120-553/2015-2 KME63/11/15. Data from patients who had conisation in the years 1993–2005 were collected in a database: age at the time of surgery, age at first intercourse, number of sexual partners, number of pregnancies (births, spontaneous and legal abortions), socio-economic status, marital status, type of contraception, smoking habits, menstrual pain, vaginal discharge, coagulopathy, colposcopic findings, result of the last PAP smear, histopathology of the cervical biopsy prior to conisation, indication for conisation, additional smears (HPV 16, 18, 31 and 33 and possible other pathogens), vaginal therapy before conisation, type of conisation, data regarding complications after conisation if present, final histopathology and whether the margins of the cone were free of disease. Records from the database were anonymised and we used only data on suspected risk factors for HSIL: age at the time of surgery, age at menarche, age at first intercourse, number of sexual partners, number of deliveries, spontaneous and legal abortions, type of contraception, marital status, socioeconomic status, smoking habits, last PAP smear result and final histopathology of the cone. All patients with incomplete data were removed from the analysis.

The sample is relatively small and is not representative of the real-life situation, because most patients in it have dysplasia or carcinoma and only a smaller portion have low-grade squamous intraepithelial lesion (LSIL) or no dysplasia at all. In Slovenia, healthy women without dysplasia represent the majority of women who attend the organised cervical cancer screening programme ZORA. In 2019 in Slovenia, we diagnosed 105 new cases of cervical carcinoma and 1056 cases of HSIL. In the same period, we analysed 220301 PAP smears from 206323 women.²⁶ First-line treatment for dysplastic changes of the uterine cervix is, in the majority of cases, conisation or large loop excision of the transformation zone (LLETZ).²⁷ In 2019, we performed 2017 conisation procedures: 1334 (66%) patients had conisation because of HSIL (cervical intraepithelial neoplasm [CIN]), 400 (20%) because of LSIL and 283 (14%) had no dysplasia.²⁶ In Slovenia, the number of conisations is decreasing in favour of LLETZ.²⁸

We constructed two basic settings of our database. In the Raw setting we used the previously mentioned risk factors with age in years and the last PAP result. For better classification performance we constructed a second, classification (Class) setting, in which we grouped patients by age at the time of surgery into 15 age groups of 5 years each and divided the last PAP smear result into two groups (high risk PAP smear Yes: PAP III–V and No: PAP I–II). In both settings we divided the final histopathology result of conisation into two groups (HSIL: CIN 2, 3, CIS [carcinoma in situ], Ca [carcinoma] and NO-HSIL: CIN 1, 1–2 and non-dysplastic changes).
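The recoding used for the Class setting can be sketched as follows. This is a minimal illustration, not the authors' actual preprocessing: the function names, the string labels and especially the lower bound of the first age group (10 years) are assumptions, because the text only states that there are 15 groups of 5 years each. Placing CIN 2–3 in the HSIL group is also an assumption.

```python
# Sketch of the "Class" recoding. All names and the age-group boundaries
# are hypothetical; only the groupings themselves come from the text.

HIGH_RISK_PAP = {"III", "IV", "V"}  # PAP III-V -> high risk "Yes"

# HSIL group as listed in the text; "CIN 2-3" is assumed to belong here.
HSIL_HISTOLOGY = {"CIN 2", "CIN 2-3", "CIN 3", "CIS", "Ca"}


def age_group(age_years, start=10, width=5, n_groups=15):
    """Return the 0-based index of the 5-year age group.

    With the assumed start of 10 years the groups are
    10-14, 15-19, ..., 80-84. Ages outside the range are clamped.
    """
    index = (age_years - start) // width
    return max(0, min(n_groups - 1, index))


def pap_high_risk(pap_class):
    """Map a PAP class (Roman numeral string) to 'Yes' or 'No'."""
    return "Yes" if pap_class in HIGH_RISK_PAP else "No"


def hsil_label(histology):
    """Map the final histopathology to the binary output class."""
    return "HSIL" if histology in HSIL_HISTOLOGY else "NO-HSIL"


example = (age_group(34), pap_high_risk("III"), hsil_label("CIN 1"))
```

With these assumed boundaries, the youngest (13 years) and oldest (83 years) patients in the dataset fall into the first and last of the 15 groups.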

Our database contains complete data for 1475 patients: 26 (1.8%) without dysplasia on the final histological result of conisation, 160 (10.8%) with LSIL and 1289 (87.4%) with HSIL. The last PAP smear was high risk in 16 patients (61.5%) without dysplasia, 127 patients (79.4%) with LSIL and 1169 patients (90.7%) with HSIL.

Mean age of patients without HSIL was 38.6 years (13–83 years, standard deviation 10.47) and 34.9 (13–81, standard deviation 8.98) in the group of patients with HSIL. Mean age at menarche was 13.7 (10–19, standard deviation 1.84) in the group of patients without HSIL and 13.5 (9–20, standard deviation 1.16) in HSIL patients. Mean age at first intercourse was 17.6 (13–25, standard deviation 1.59) in patients without HSIL and 17.4 (12–25, standard deviation 1.66) in patients with HSIL. The HSIL and NO-HSIL groups of patients differed statistically regarding age (p < 0.01), age at first intercourse (p < 0.035), number of sex partners (p < 0.004) and high risk PAP smear (p < 0.01).

In our group of patients without HSIL 57% tested HPV 16 negative and 27% positive (16% not tested) and in the group of patients with HSIL 54% tested negative and 33% positive (14% not tested). In the NO-HSIL group 65% tested HPV 18 negative, 21% positive (15% not tested) and in HSIL group 60% tested negative and 27% positive (13% not tested).

Because many patients did not have HPV testing, we first considered removing such patients from the analysis. When we analysed the patients who would be removed because of no HPV testing (HPV 16, 18, 31 or 33), we discovered that numerous patients with HSIL would be missed (Table 1).

Table 1

Final histology of the cone in patients without human papilloma virus (HPV) testing

	Frequency	Percent
NO dysplasia	9	1.8
CIN 1	26	5.3
CIN 1–2	27	5.4
CIN 2	90	18.1
CIN 2–3	55	11.1
CIN 3	223	45.0
CIS	55	11.1
invasive ca	11	2.2
Total	496	100.0

CIN = cervical intraepithelial neoplasm

The chi-square test (χ² = 1.631, p = 0.202) found no statistically significant association between HPV 16, 18 status and the presence of HSIL in our group of patients. In this time period we did not routinely test for the presence of HPV infection. Because HPV testing may have detected a transitory infection, and because over 400 patients with HSIL would be excluded from the analysis for not having been tested for HPV, we decided to exclude HPV from further analysis. HPV 16 and 18 statuses in our patients are presented in Table 2.

Table 2

Number and percentage of patients according to human papilloma virus (HPV) 16 and 18 statuses in high grade squamous intraepithelial lesion (HSIL) and NO-HSIL group

	HPV 16				HPV 18
	HSIL group		NO-HSIL group		HSIL group		NO-HSIL group
	Frequency	%	Frequency	%	Frequency	%	Frequency	%
not performed	177	14	29	16	172	13	27	15
negative	693	54	106	57	775	60	120	65
positive	419	32	51	27	342	27	39	20
Total	1289	100	186	100	1289	100	186	100
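The chi-square test of independence mentioned above can be illustrated on the HPV 16 counts from Table 2, restricted to tested patients. This is only a sketch of the standard Pearson statistic for a 2 × 2 table without continuity correction. The text does not specify exactly which contingency table was tested, so the value computed here should not be expected to reproduce the reported statistic. The function name is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
    [[a, b],
     [c, d]].
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    # Expected counts under independence: row total * column total / n
    expected = [
        row1 * col1 / n,
        row1 * col2 / n,
        row2 * col1 / n,
        row2 * col2 / n,
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# HPV 16 counts from Table 2 (patients with a test result only):
# rows = HSIL, NO-HSIL; columns = positive, negative
chi2 = chi_square_2x2(419, 693, 51, 106)

# Critical value of the chi-square distribution, 1 degree of freedom, alpha = 0.05
CRITICAL_VALUE_DF1 = 3.841
significant = chi2 > CRITICAL_VALUE_DF1
```

For these counts the statistic stays below the critical value of 3.841, which is in line with the non-significant result reported in the text.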

A human neuron, or nerve cell, is a cell that can be electrically or chemically excited. It has a body (soma), dendrites, which lead the signal to the neuron, and a single axon, which leads the signal away from the neuron and connects with other nerve cells. Information is transferred via an electrical or chemical mechanism.²⁹

In ANN we have different neurons, of which there are two main types. An input neuron, called a perceptron, receives information. An output neuron produces the final output. All neurons are arranged in layers. The first layer is the input layer with perceptrons, and the last layer is the layer with output neurons. In between there can be one or many hidden layers. Every neuron connects with all neurons from the previous and the next layer.³⁰ A diagram of a simple ANN is presented in Figure 1.

Figure 1

Schematic of a simple neural network with input, output and three hidden layers.

As in a nerve cell, artificial neurons in neural networks receive information and become excited. Each incoming signal is scaled by the weight of its connection, and these weights are what the network learns during training. When the required excitation level is reached, a neuron passes the signal on to other neurons; below that level no output signal is produced. There are many different mathematical functions for neuron activation. The activation function of the output neurons can differ from that of the previous layers. The output of the last neuron is a numerical value in the range 0–1. The default threshold for positive/negative classification is 0.5, meaning that cases with values > 0.5 are classified as positive and cases with values ≤ 0.5 as negative. The threshold value can be changed according to the performance of the algorithm and our goals.¹⁸
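The behaviour of a single artificial neuron and the thresholding of its output can be sketched as follows. The sigmoid activation is an assumption chosen because it produces values between 0 and 1; the text does not state which activation function was used. All function names, inputs and weights are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic activation: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)


def classify(score, threshold=0.5):
    """Values above the threshold are positive, all others negative."""
    return "positive" if score > threshold else "negative"


# Hypothetical example: two inputs with illustrative weights
score = neuron_output([1.0, 0.0], weights=[2.0, -1.0], bias=-0.5)
label_default = classify(score)               # default threshold 0.5
label_strict = classify(score, threshold=0.9)  # stricter threshold
```

Raising the threshold makes the classifier more reluctant to call a case positive, which trades sensitivity for specificity; lowering it does the opposite.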

The dataset must be split into two parts: a training set and a holdout set. The training set is used to build the model, to test relations between input variables and to determine the weights of the neurons. Algorithms are then tested on the holdout set, which contains instances unknown to the neural network. The training set must be larger than the holdout set.¹⁸

In every classification process, we have actual positive and negative cases, which can be classified correctly as positives or negatives or classified incorrectly. The best way to visualise the situation is to use a confusion matrix (Table 3).

The effectiveness of ANN, or of any other classification system or algorithm, can be measured. In our study we used precision (positive predictive value; PPV), recall (sensitivity, true positive rate; TPR), the receiver operating characteristic curve (ROC curve) and the area under the ROC curve (AUC).³¹ F-measure and Matthews correlation coefficient (MCC) are further measures of efficiency. F-measure is a combined measure of precision and recall:

$$\mathrm{F}=\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision}+\text{Recall}}.$$

It ranges between 0 (worst) and 1 (best). MCC ranges between −1 and +1, where −1 means perfect misclassification, 0 means classification as expected from random guessing and +1 means perfect classification.³² The precision-recall curve (PRC curve) is another measure of classification efficiency. Precision (PPV) is plotted on the y-axis and recall (TPR) on the x-axis. It is more informative than the ROC curve in imbalanced data settings because it analyses the fraction of true positives among all positive predictions.³³
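The measures described above can all be derived from the four cells of a confusion matrix. The following is a minimal sketch using the standard definitions of precision, recall, F-measure and MCC; the function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def classification_metrics(tp, fn, fp, tn):
    """Compute precision, recall, F-measure and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    # MCC uses all four cells, so it stays informative on imbalanced data
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Hypothetical counts: 100 actual positives and 100 actual negatives
metrics = classification_metrics(tp=90, fn=10, fp=30, tn=70)
```

For these illustrative counts precision is 0.75 and recall is 0.90. A classifier that labels every case as positive would reach a recall of 1.0 but an MCC of 0, which is why MCC is reported alongside the other measures.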

The quality of data is of vital importance: we need a sufficient number of instances (collections of attributes in the database) and good-quality attributes (features that measure or describe different aspects of the instances). Before running a classification algorithm, it is necessary to run a simulation of baseline classification. We can then compare the results derived from our model with the baseline results and decide how good (or bad) our model is in classification and prediction.

Dealing with unbalanced data

When we have imbalanced datasets, in which one of the classes represents only a small proportion of the sample, the baseline prediction for the majority class is very high. For example, if the majority class represents 88% of instances, as in our case, a baseline that always predicts the majority class is already 88% accurate. A prediction algorithm with 92% accuracy would then be only a small improvement over the baseline and may not be a meaningful one. There are several methods for dealing with unbalanced data:

Under-sampling: randomised reduction of majority class to match minority class

Over-sampling: n-fold replication of minority class to match majority class

SMOTE: synthetic minority over-sampling technique creates new synthetic instances, which have similar characteristics as original ones in minority class.^{34, 35}
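The three balancing methods listed above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation used in the study: the function names are hypothetical, and `smote_like` only handles numeric feature vectors and picks its base instances at random, whereas the published SMOTE algorithm iterates over the minority instances and has variants for nominal attributes.

```python
import random


def under_sample(majority, target_size, seed=0):
    """Randomly keep `target_size` instances of the majority class."""
    rng = random.Random(seed)
    return rng.sample(majority, target_size)


def over_sample(minority, factor):
    """n-fold replication of the minority class."""
    return minority * factor


def smote_like(minority, n_new, k=5, seed=0):
    """Create synthetic instances by interpolating between a minority
    instance and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority instances by squared Euclidean distance
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the line segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical two-feature minority class
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=6, k=2)
```

Because each synthetic point lies on a line segment between two real minority instances, it stays within the region already occupied by the minority class rather than being an exact copy of an existing instance.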

Experiment with WEKA

Weka (1999–2020 The University of Waikato, Hamilton, New Zealand) is an open-source application for data mining that offers many methods besides ANN, such as Bayesian networks, logistic regression, classification trees, k-nearest neighbours and others.³⁶ It enables us to test a classification algorithm on the whole dataset, to split the dataset by percentage, to test against a separate dataset which we import into Weka, and to use n-fold cross validation. When we manually or randomly split a dataset into training and holdout parts, there is always a chance that all important instances end up in one of the sets, especially if one kind of instance represents a small proportion of all instances. N-fold cross validation is a powerful option which can minimise the chance of such a situation. It divides the entire database into n parts. Each part is used once as the holdout set while the remaining n-1 parts serve as the training set, and the results of all n runs are combined into a single performance estimate. In our experiment, we used 10-fold cross validation.³⁷
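The rotation of folds in n-fold cross validation can be sketched as follows. This is a generic illustration with a hypothetical function name, not Weka's own code, and for simplicity it does not stratify the folds by class. The dataset size of 1475 matches the number of patients with complete data.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (training, holdout) index lists so that every instance
    is used for testing exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        holdout = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, holdout


splits = list(k_fold_indices(1475, k=10))
holdout_sizes = [len(holdout) for _, holdout in splits]
```

Each of the ten models is trained on roughly nine tenths of the data and evaluated on the remaining tenth, and the evaluation results from all folds are then combined.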

Preparation of datasets for analysis

We prepared eight data sets:

Raw set: the original risk factors were used as input variables and HSIL_Y/N as output.

Class set: same as the Raw set, except with age groups instead of age and PAP_HR_Y/N instead of last PAP.

Raw and Class sets, each with the under-sampling, over-sampling and SMOTE methods for equalising the imbalanced dataset.

The original dataset consisted of 186 No-HSIL and 1289 HSIL patients. To prepare the over-sampling dataset, we replicated the No-HSIL patients three-fold to get 558 No-HSIL patients alongside the original 1289 HSIL patients. For under-sampling, we randomly selected and deleted HSIL patients to get 272 HSIL patients alongside the original 186 No-HSIL patients. With the SMOTE algorithm, we created a dataset with the original 1289 HSIL patients and 744 No-HSIL patients.
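The baseline (ZeroR) prediction for each dataset follows directly from these class counts, because a classifier that always predicts the majority class is correct for exactly the share of instances in that class. The short sketch below works through this arithmetic; the function and variable names are hypothetical, while the counts are those given in the text.

```python
def zero_r_accuracy(class_counts):
    """Percentage of correct predictions when always predicting the majority class."""
    return 100.0 * max(class_counts.values()) / sum(class_counts.values())


datasets = {
    "original": {"HSIL": 1289, "NO-HSIL": 186},
    "over-sampling": {"HSIL": 1289, "NO-HSIL": 558},
    "SMOTE": {"HSIL": 1289, "NO-HSIL": 744},
    "under-sampling": {"HSIL": 272, "NO-HSIL": 186},
}

baselines = {name: round(zero_r_accuracy(counts), 2) for name, counts in datasets.items()}
```

Rounded to two decimals, this gives 87.39% for the original data, 69.79% for over-sampling, 63.40% for SMOTE and 59.39% for under-sampling, which are the ZeroR values listed in Table 4.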

Baseline prediction was calculated for each set, and the results for the multi-layer perceptron with 10-fold cross validation were recorded. Results are presented in Table 4.

Table 3

Confusion matrix for classification with all possible outcomes

	Predicted pos (PP)	Predicted neg (PN)
Actual pos (P)	True positives (TP)	False negatives (FN)
Actual neg (N)	False positives (FP)	True negatives (TN)

Neg = negatives; Pos = positives

Table 4

Results of multi-layer perceptron (MLP) classifications for different settings with baseline prediction (ZeroR), percentage of correct classification and Kappa statistic for all analyses. Results are for prediction of high grade squamous intraepithelial lesion (HSIL)-Yes (Y), prediction of NO-HSIL (N) and the weighted average for the whole model (YES and NO combined), Weighted average (AVG). Results in bold type are those where prediction by MLP is better than the baseline prediction ZeroR

	TP Rate	FP Rate	Precision	Recall	F-Measure	MCC	ROC Area	PRC Area	Class	% Correct	Kappa	ZeroR %
Class_orig–Y	0.751	0.634	0.739	0.751	0.745	0.118	0.567	0.735	Yes	82.10	0.0965	87.39
Class_orig–N	0.366	0.249	0.308	0.366	0.373	0.118	0.567	0.377	No
Class_orig–AVG	0.637	0.521	0.633	0.637	0.635	0.118	0.567	0.629	Weighted Avg
Class_overs–Y	0.860	0.201	0.908	0.860	0.884	0.640	0.870	0.920	Yes	84.19	0.6376	69.79
Class_overs–N	0.799	0.140	0.712	0.799	0.753	0.640	0.870	0.703	No
Class_overs–AVG	0.842	0.182	0.849	0.842	0.844	0.640	0.870	0.855	Weighted Avg
Class_SMOTE–Y	0.797	0.274	0.834	0.797	0.815	0.515	0.802	0.850	Yes	77.08	0.5141	63.40
Class_SMOTE–N	0.726	0.203	0.673	0.726	0.699	0.515	0.802	0.669	No
Class_SMOTE–AVG	0.771	0.248	0.775	0.771	0.772	0.515	0.802	0.784	Weighted Avg
Class_unders–Y	0.669	0.559	0.636	0.669	0.652	0.112	0.542	0.608	Yes	57.64	0.1113	59.39
Class_unders–N	0.441	0.331	0.477	0.441	0.458	0.112	0.542	0.448	No
Class_unders–AVG	0.576	0.466	0.572	0.576	0.573	0.112	0.542	0.543	Weighted Avg
RAW_orig–Y	0.907	0.828	0.884	0.907	0.895	0.086	0.594	0.905	Yes	81.42	0.0856	87.39
RAW_orig–N	0.172	0.093	0.211	0.172	0.189	0.086	0.594	0.174	No
RAW_orig–AVG	0.814	0.735	0.799	0.814	0.806	0.086	0.594	0.813	Weighted Avg
RAW_overs–Y	0.825	0.285	0.870	0.825	0.847	0.525	0.837	0.905	Yes	79.21	0.523	69.79
RAW_overs–N	0.715	0.175	0.639	0.715	0.675	0.525	0.837	0.661	No
RAW_overs–AVG	0.792	0.252	0.800	0.792	0.795	0.525	0.837	0.831	Weighted Avg
RAW_SMOTE–Y	0.800	0.258	0.843	0.800	0.821	0.533	0.814	0.867	Yes	77.87	0.5318	63.4
RAW_SMOTE–N	0.742	0.200	0.681	0.742	0.710	0.533	0.814	0.691	No
RAW_SMOTE–AVG	0.779	0.237	0.784	0.779	0.780	0.533	0.814	0.802	Weighted Avg
RAW_unders–Y	0.688	0.575	0.636	0.688	0.661	0.115	0.551	0.614	Yes	58.08	0.1144	59.39
RAW_unders–N	0.425	0.313	0.482	0.425	0.451	0.115	0.551	0.466	No
RAW_unders–AVG	0.581	0.469	0.573	0.581	0.576	0.115	0.551	0.554	Weighted Avg

Raw = original settings; Class= class setting; overs = oversampling; SMOTE = synthetic minority over-sampling technique; unders = undersampling

Results

In the first part of the analysis, we analysed the original database with an artificial neural network, the multi-layer perceptron (MLP). We achieved 81.42% correct predictions, which is worse than the baseline ZeroR prediction of 87.39% (kappa = 0.08 showing no level of agreement between predicted and actual status, AUC 0.594, MCC 0.086, F-Measure 0.806, precision 0.799 and recall 0.814). When we corrected the minority class with the over-sampling method, the ZeroR prediction was 69.79% and MLP achieved 79.21% (kappa = 0.523 showing weak level of agreement between predicted and actual status, AUC 0.837, MCC 0.525, F-Measure 0.795, precision 0.800 and recall 0.792). SMOTE achieved a lower percentage of correct predictions than over-sampling, with a baseline ZeroR of 63.40% and MLP achieving 77.87% (kappa = 0.53 showing weak level of agreement between predicted and actual status, AUC 0.814, MCC 0.533, F-Measure 0.780, precision 0.784 and recall 0.779). The under-sampling method performed worse than the analysis on the original dataset, with a ZeroR prediction of 59.39% and MLP achieving 58.08% (kappa = 0.11 showing no level of agreement between predicted and actual status, AUC 0.551, MCC 0.115, F-Measure 0.576, precision 0.573 and recall 0.581).

In the second part of the analysis, we grouped data in classes as described previously. Analysis with MLP on the original data achieved 82.10% correct predictions, which is less than the baseline ZeroR prediction of 87.39% (kappa = 0.09 showing no agreement between predicted and actual status, AUC 0.567, MCC 0.118, F-Measure 0.635, precision 0.633 and recall 0.637). Performance of MLP was better with the over-sampling method, where the baseline ZeroR prediction was 69.79% and MLP achieved 84.19% correct predictions (kappa = 0.64 showing moderate level of agreement between predicted and actual status, AUC 0.870, MCC 0.640, F-Measure 0.844, precision 0.849 and recall 0.842). With the SMOTE method, the baseline ZeroR prediction was 63.40% and MLP achieved 77.08% correct predictions (kappa = 0.51 showing weak level of agreement between predicted and actual status, AUC 0.802, MCC 0.515, F-Measure 0.772, precision 0.775 and recall 0.771). The under-sampling method performed worse than the analysis on the original data, with a ZeroR prediction of 59.39% and 57.64% correct predictions (kappa = 0.11 showing no agreement between predicted and actual status, AUC 0.542, MCC 0.112, F-Measure 0.573, precision 0.572 and recall 0.576).

All results are presented in Table 4. MCC for all models is presented graphically in Figure 2 for prediction of HSIL-Yes and NO combined. True positive rate and false positive rate for all models are presented graphically in Figure 3. The ROC curve for the worst-performing model is shown in Figure 4 and that for the best-performing model in Figure 5.

Figure 2

Matthews correlation coefficient (MCC) for categorisation of high grade squamous intraepithelial lesion (HSIL), combined for YES and NO prediction, for different equalisation methods (no correction of minority class, under-sampling, over-sampling and synthetic minority over-sampling technique [SMOTE]) for both RAW and Class settings. Best performance of multi-layer perceptron (MLP) is on the dataset with data organised in classes and the over-sampling method for the minority class (MCC = 0.64). Lowest performance is with the original dataset without correction for the minority class (MCC = 0.086).

Figure 3

True positive and false positive rate for different settings for prediction Yes and No combined and for different equalisation methods (no correction of minority class, under-sampling, over-sampling and synthetic minority over-sampling technique [SMOTE]) for both RAW and Class settings. The best-performing model from Figure 2 has a true positive rate of 0.842 and a false positive rate of 0.182. The lowest-performing model from Figure 2 has a high true positive rate of 0.814, which is almost as high as that of the best-performing model, but also a high false positive rate of 0.735.

Raw = original settings; Class = class setting; FPR = false positive rate; HSIL = high grade squamous intraepithelial lesion; overs = oversampling; TPR = true positive rate; unders = undersampling; SMOTE = synthetic minority over-sampling technique

Figure 4

Receiver operating characteristic (ROC) curve for multi-layer perceptron (MLP) performance on the dataset without grouping in classes and with no correction for the minority class, where the X axis represents 1 − specificity (false positive rate) and the Y axis represents sensitivity (true positive rate). Area under the ROC curve (AUC) = 0.594. AUC for categorisation with random guessing is 0.5. This figure represents the model with the lowest performance of MLP in our study.

Figure 5

Receiver operating characteristic (ROC) curve for multi-layer perceptron (MLP) performance on the dataset with patients grouped in classes and synthetic minority over-sampling technique (SMOTE) correction for the minority class, where the X axis represents 1 − specificity (false positive rate) and the Y axis represents sensitivity (true positive rate). Area under the ROC curve (AUC) = 0.802, which is well above classification with random guessing, where AUC is 0.5. This figure represents the best-performing model of MLP in our study.

Discussion

In medicine, we mostly deal with imbalanced classes. In such datasets, the baseline prediction is high for the majority class. In most cases, we must precisely and accurately classify patients from the minority class.³⁸ Misclassifying a patient with severe disease as negative potentially endangers their health, because with delayed diagnosis the disease can progress to a life-threatening situation or death. Such a situation endangers only the patient involved. When we classify patients who have, for example, a very contagious disease, misclassification as negative means that such false negative patients will spread the disease and endanger other healthy people. Misclassification of healthy patients as positive results in further diagnostic tests and eventually leads to the correct diagnosis. Unnecessary procedures result in greater stress for the patient, higher expenses and a bigger load on the health system. Good classification algorithms must therefore have very high sensitivity and specificity.

Cervical cancer is a preventable disease.¹ Artificial intelligence (AI) and deep learning methods are used for the optimisation of screening, diagnostic and treatment procedures and are also present in the field of cervical cancer. Cervical cytology is of vital importance in screening programmes. Mango et Laurie³⁹ published an article on computer-assisted cervical cancer screening using neural networks in 1993. They used a robotic arm for loading and unloading slides of PAP smears from a storage container, an automated microscope and an automated high-definition camera for imaging the slide. Multiple pictures of each slide were recorded and then examined by cytologists at the review station. ANN were used to recognise different cells in the images. After training the neural network on sample pictures, the overall ANN sensitivity for all cytologic findings was 96%, compared with 81% for cytologists.³⁹

Sompawong et al. used ANN on images of liquid-based cytology (LBC) PAP smears to detect and analyse features of the nucleus of the cervical cell and to screen for normal and abnormal morphological features. In their study they achieved 57.8% mean average precision and 91.7% accuracy, sensitivity and specificity. This could help technicians and cytologists in their work.⁴⁰

Holmström et al. tested the use of ANN to analyse PAP smears and detect pathological changes in rural Kenya, where cervical cancer represents a significant health burden with a high mortality rate. PAP smears were digitalised with a portable scanner, uploaded to the cloud and analysed in a regional medical centre. Sensitivity of ANN was 95.7% and specificity 84.7%, compared with 100% sensitivity and 78.4% specificity of the human examiner. AUC for ANN was 0.94. NPV was very high (99–100%), particularly for HSIL. They concluded that such a model can be very helpful in cervical cancer screening in areas with few health care professionals.⁴¹

Bao et al.⁴² and Turic et al.⁴³ published studies of AI-assisted cytology in a cancer screening programme in China. They digitalised LBC images of cervical smears and analysed them with AI. PAP smears were also analysed by cytologists. Agreement between AI and manual reading was 94.7%, with kappa 0.92, which is almost perfect agreement, and AI-assisted cytology was more sensitive for the detection of CIN2+ lesions than manual reading by 5.8%, with a slight reduction in specificity.

Colposcopy is a very important diagnostic procedure. Clinical experience is important for an accurate colposcopic result.⁴⁴ With the use of AI, specifically deep convolutional networks, it is possible to analyse colposcopic images with higher accuracy than subjective assessment by a human. In their study, Chandran and colleagues reported 92.4% sensitivity, 96.2% specificity and kappa 0.88, which showed strong association between predicted and actual status of colposcopic changes.⁴⁵ It is important that women referred for colposcopy are correctly selected, to prevent overload in colposcopic clinics. Such overload with inappropriate patients can result in misdiagnosis and unnecessary procedures and can be a threat for subsequent pregnancies.⁴⁶ Karakitsos et al.⁴⁷ used a learning vector quantizer neural network to identify patients who need referral for colposcopy. They analysed PAP smears using LBC and several markers of HR-HPV infection. All women had a colposcopically directed biopsy performed by an experienced colposcopist, and the histologic result was the gold standard to determine whether colposcopy was necessary or not. With the use of AI, they not only identified more patients in need of immediate colposcopy but also reduced the number of patients with clinically insignificant lesions compared with other methods. Combined sensitivity for the training and testing set was 85.16%, with specificity 98.01%, PPV 85.71%, NPV 97.92% and overall accuracy of 96.42%. ANN are very good at recognising pathological morphological features on images, and all parameters are very good in all studies.⁴⁷ Pouliakis et al. obtained similar results in a study of classification and regression trees (CART) for the triage of women for referral to colposcopy and risk estimation for CIN. They used LBC and several markers of HR-HPV infection. This study is important because it included cases with missing data, which can be a problem and which most studies exclude from analysis. CART had 83.28% sensitivity, 94.26% specificity, 79.04% PPV, 95.06% NPV and 100% valid cases, while other methods had only 67.75%–96.25% valid cases depending on the method used. CART performed better than cytology alone when the ASCUS+ threshold level was used (p < 0.0001).⁴⁸

In our study we used MLP, which is a back-propagation artificial neural network, on our dataset of patients who had conisation surgery at the University Gynaecologic Clinic Maribor in the years 1993–2005. As the input layer we used known risk factors for the development of cervical dysplasia and carcinoma, and as the output layer high-risk dysplasia CIN2+ Yes/No. Risk factors are important and increase the risk of developing the disease, but not all patients with risk factors develop the disease.⁴⁹ All patients with incomplete data were removed from the analysis, as in the majority of studies. The original dataset was imbalanced and patients without HSIL represented the minority class. To our knowledge, this is the first study with such settings.

MLP performed worse on the original dataset than the baseline prediction. Such an outcome can be expected in a dataset where data are imbalanced.³⁶ There are several methods to equalise imbalanced data. With the under-sampling method, we reduce the majority class by randomly selecting and removing instances from it.³⁴ When we balanced the dataset with the under-sampling method, prediction did not improve and stayed below baseline. The reason for this may be the removal of instances with important variables from the training and/or testing set. We prepared the dataset with the under-sampling method a few more times, but we could not achieve better performance with any of the settings. MLP correctly classified 57.64% of cases, which is inferior to the baseline ZeroR of 59.39%, and the kappa statistic of 0.1113 also showed no agreement between real and predicted status.

SMOTE and over-sampling methods improved the performance of MLP.³⁵ With the over-sampling method, we replicate instances from the minority class to match the size of the majority class. In this case there is always a chance that identical instances end up in both the training and the testing set, which can make the measured performance look better than it is.³⁴ The SMOTE method uses the k-nearest neighbour algorithm to create new synthetic instances, which are all unique.³⁵ In the best-performing model, where the baseline prediction ZeroR was 69.79%, MLP correctly classified 84.19% of cases and the kappa statistic of 0.64 showed moderate agreement between real and predicted status.

In real clinical practice, many patients have multiple risk factors but never develop the disease, while many with only a few become ill. It is possible that patients do not tell the truth about risk factors because these are too intimate, because they are ashamed or because they do not remember. Collecting all risk factors from patients participating in screening or another programme in a nationwide database is also questionable because of ethical considerations.⁵⁰ With our experiment we showed that, with the use of ANN, we can predict more patients who will develop HSIL than with baseline prediction, based only on the analysis of their risk factors for developing HSIL and the result of the last PAP smear. However, the performance and classification accuracy of ANN are not high enough for everyday clinical practice.

eISSN: 1581-3207