Published Online: Feb 03, 2025
Received: Sep 16, 2024
Accepted: Jan 04, 2025
DOI: https://doi.org/10.2478/amns-2025-0008
© 2025 Xia Wu, published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
Accurate image recognition is crucial in medical diagnosis. Traditional medical image recognition requires manual processing and judgment by experienced doctors, which usually requires a lot of time and energy. However, with the rapid development of artificial intelligence technology, especially the advancement of image recognition technology, the application of AI image recognition technology in medical diagnosis is gradually increasing [1–2].
Medicine is one of the most important application areas of image recognition technology, because a patient’s condition is often diagnosed and treated with the help of medical imaging (e.g., X-ray, CT, MRI). However, these imaging technologies are difficult to operate, which limits how widely they are used, and the professional background and experience of doctors also affect diagnostic accuracy [3–4]. Image recognition technology therefore offers doctors a more accurate, efficient, and faster workflow, helping them develop more accurate and targeted treatment plans in less time [5]. To realize these advantages fully, further research and innovation are needed to address large data volumes, data labeling, and privacy and security. As the technology continues to mature, AI image recognition is expected to be applied ever more widely in medical diagnosis and to bring great potential to the medical industry [6–8].
Literature [9] describes the development of virtual neural networks and analyzes DLA, emphasizing that both have shown great potential for clinical imaging applications, while most DLA implementations focus on computed tomography images, X-ray images, and similar modalities. Literature [10] examines the impact of computer vision applications on patients and the challenges they face in medical imaging. Its review concludes that deep learning methods for medical imaging computer vision are harmless and safe and can improve the diagnostic accuracy of healthcare professionals. Literature [11] fuses deep learning and CML to propose a hybrid intelligence-driven medical image recognition framework based on IoMT. Image features are extracted by a convolutional neural network and then reduced in dimensionality using the CML technique to build a strong classifier that outputs the recognition results. The experimental results verify the feasibility of the proposed scheme, which effectively improves recognition efficiency. Literature [12] describes the development and application of information technology in medical diagnostics, stating that it has aroused the interest of professionals in various industries who are seeking solutions to problems that traditional techniques cannot solve. Literature [13] addresses the difficulty of detecting gastric cancer by implementing a computerized decision support system that assists physicians in identifying gastric cancer lesion areas so that they can make a better diagnosis, and verifies the effectiveness of the system experimentally. Literature [14] discusses the applications of deep learning algorithms in cancer diagnosis and the challenges they face, especially in the detection of abdominal images, mammogram images, and similar data, and compares the advantages and disadvantages of various deep neural network architectures.
Literature [15] outlines the application of convolutional neural networks and machine learning algorithms in medical image analysis, including medical image classification, localization, and detection, and points out the barriers, trends, and directions of this research. Literature [16] describes the application of deep learning in medical image processing. It also discusses the impact of deep learning on automated disease detection and on organ and lesion segmentation, and emphasizes the need to close the remaining knowledge gap to achieve high-performance automated medical image processing systems. Literature [17] describes the design and development of a pneumonia detection system based on a deep learning approach that evaluates chest CT scan images and facilitates the detection of pneumonia. Comparison tests indicated that the proposed method helps reduce the perceived error rate. Literature [18] notes that deep learning is used in many fields and receives particular emphasis in medicine, and discusses the optimization of advanced deep learning architectures for medical image segmentation and classification together with the challenges they face. Literature [19] proposes a multimodal medical image fusion method incorporating deep learning, motivated by the needs of medical diagnosis. Analysis and verification show that the method is superior in terms of various quantitative evaluation criteria as well as visual quality.
With the continuing advance of computer technology, clinical medical diagnosis faces both challenges and opportunities. In this context, this paper integrates deep learning algorithms into medical diagnosis to explore the opportunities for medical diagnosis in the new era. Medical image diagnosis requires efficient real-time performance, but the large scale and high computational complexity of neural networks slow down inference. This paper proposes a neural network model optimized on the basis of MobileNet V2 and introduces a shuffle attention (SA) mechanism module for image recognition and classification. After the model is constructed, a CAD medical diagnosis system is designed and built to facilitate use in clinical diagnosis, and the application value of the method is then studied through empirical analysis.
Deep learning plays a very important role in machine learning. It automatically learns the distribution patterns of data from large amounts of data by constructing deep neural network models. In the development of deep learning, the emergence of the convolutional neural network (CNN) is of great significance. CNNs are widely used in computer vision, natural language processing, and other fields and have achieved many breakthroughs, laying a solid foundation for the excellent performance of deep learning in medical image analysis, data analysis, and assisted decision-making.
A CNN is built from several kinds of modules: convolutional layers, pooling layers, fully connected layers, and activation functions. The model extracts features from an image using stacked convolutional layers, activation functions, and pooling layers, and then classifies and predicts using fully connected layers. Through training and optimization, a CNN can achieve good results in image classification and target detection.
Convolutional Layer
The convolutional layer extracts features from the input image through convolution operations, which are the core of a convolutional neural network. The convolution kernel slides over the input image from left to right and from top to bottom, and the feature map is obtained by performing the convolution operation at each position. To extract different features, convolution kernels of different sizes are selected. Input data in image processing is usually image data. Taking single-channel image data as an example, the image can be represented as a two-dimensional matrix of pixel values. The convolution operation computes, for each pixel of the input image, the sum of the products of its neighborhood pixels and the corresponding elements of the convolution kernel, and uses this sum as a pixel of the new feature map. The first element of the feature map is therefore the sum of the element-wise products between the kernel and the image patch in the top-left corner of the input.
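The convolution step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `conv2d_valid`, the stride of 1, the absence of padding, and the toy image and kernel are all chosen for the example. As in most deep learning frameworks, the kernel is applied without flipping.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over a single-channel `image` (stride 1, no padding)."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    # Visit positions top to bottom, left to right.
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Sum of element-wise products between the patch and the kernel.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Hypothetical 4x4 single-channel image and 2x2 kernel.
image = np.arange(16).reshape(4, 4)
kernel = np.array([[1, 0], [0, -1]])
feature_map = conv2d_valid(image, kernel)
# The first element uses the top-left patch: 0*1 + 1*0 + 4*0 + 5*(-1) = -5.
```

A 4 × 4 input and a 2 × 2 kernel give a 3 × 3 feature map; larger kernels shrink the output further unless padding is added.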
Activation Layer
Convolutional neural networks require appropriate activation functions, and the best choice often differs between tasks. The activation function adds nonlinearity, which improves the expressive ability of the network, and a suitable choice can also alleviate the vanishing gradient problem. Common activation functions include the Sigmoid, Tanh, ReLU, and Leaky ReLU functions. The Sigmoid function was the most commonly used activation function in the early days. It maps its input to the interval (0, 1). Its disadvantages are that it is computationally expensive and prone to vanishing gradients. It is calculated as:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
The Tanh function solves the problem that the output of the Sigmoid function is not zero-centered. However, because both Tanh and Sigmoid involve exponential operations, Tanh is also computationally expensive. The Tanh function is:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
The ReLU function was introduced to address the vanishing gradient problem. Its advantage is that it is fast to compute. Its disadvantage is that, during back propagation, the gradient is exactly zero whenever the input is negative. The formula is given in equation (5):
$$\mathrm{ReLU}(x) = \max(0, x) \tag{5}$$
The Leaky ReLU function is an improvement on ReLU, given in equation (6). It differs from ReLU in that a small slope $\alpha$ is assigned to the function for negative inputs:
$$\mathrm{LeakyReLU}(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \tag{6}$$
Batch normalization layer
During the training of a convolutional neural network, the distribution of the input data to each layer changes constantly, which affects the learning ability of the network, so the features obtained during training need to be normalized by batch normalization. For a mini-batch with mean $\mu_B$ and variance $\sigma_B^2$, batch normalization is expressed as:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$
where $\epsilon$ is a small constant for numerical stability and $\gamma$ and $\beta$ are learnable scale and shift parameters.
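The activation functions and the batch normalization step described above can be sketched as follows. This is an illustrative NumPy sketch in the standard textbook forms. The function names, the default slope `alpha=0.01`, the stabilizing constant `eps`, and the toy batch are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    # Maps any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Zero-centered output in (-1, 1).
    return np.tanh(x)

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Gradient is exactly zero for negative inputs.
    return np.where(x > 0, 1.0, 0.0)

def leaky_relu(x, alpha=0.01):
    # A small slope `alpha` (assumed value) keeps negative inputs alive.
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch axis, then scale and shift.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Hypothetical batch of three samples with two features each.
batch = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
normalized = batch_norm(batch, gamma=1.0, beta=0.0)
```

With `gamma=1` and `beta=0`, each column of `normalized` has approximately zero mean and unit variance. The gradient helpers make the difference between ReLU and Leaky ReLU for negative inputs explicit.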
Pooling layer
The convolution process is followed by the pooling layer, a core component of the convolutional neural network. The purpose of the pooling layer is to reduce the dimensionality of the input feature maps, which reduces the amount of data and thus speeds up the network during training and inference. Since the features of an image are invariant and its local features are consistent across feature maps of different sizes, the features are well preserved after dimensionality reduction by the pooling layer, and the pooled feature maps are smaller and require fewer network parameters.
Fully Connected Layer
The fully connected (FC) layer transforms the feature space and integrates the previously learned local features. The feature maps produced by convolution and the other operations are flattened and fed into the fully connected layer, which converts them into a one-dimensional output vector whose length equals the number of classification categories. Softmax is generally applied to this vector to obtain the score of each category. The softmax function normalizes the data: it converts the inputs to values in (0, 1) that sum to 1. It is therefore usually used as the last layer of the network, receiving inputs from the previous layer and converting them into probabilities. For an input vector $z$ with $K$ components, softmax is given by:
$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
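The pooling and softmax operations above can be illustrated with a short NumPy sketch. It assumes non-overlapping max pooling with a window of 2, which the paper does not specify, and it subtracts the maximum score before exponentiating purely for numerical stability. The function names and toy inputs are hypothetical.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling over a 2D feature map (assumed window `size`)."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    # Drop rows and columns that do not fill a complete window.
    trimmed = feature_map[:h_out * size, :w_out * size]
    windows = trimmed.reshape(h_out, size, w_out, size)
    return windows.max(axis=(1, 3))

def softmax(scores):
    """Convert a 1D score vector into probabilities that sum to 1."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()  # does not change the result
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

feature_map = np.arange(16).reshape(4, 4)
pooled = max_pool2d(feature_map)               # 4x4 -> 2x2
probabilities = softmax([2.0, 1.0, 0.1])       # hypothetical class scores
```

Pooling halves each spatial dimension here, which is the dimensionality reduction described above. The softmax output keeps the ranking of the input scores, so the largest score receives the largest probability.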
The traditional CNN cannot perform feature extraction on data in non-Euclidean space. To handle such unstructured data, the convolutional neural network is generalized to graph-structured data, giving a special network structure called the graph convolutional network (GCN). Graph-structured data is a nonlinear data structure consisting of nodes and edges, which can describe one-to-many relationships between data in non-Euclidean space well.
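A graph convolutional network has a network structure similar to that of a CNN. It is mainly composed of graph convolutional layers and other functional layers, and the graph convolutional layer extracts features from the input graph-structured data. To express how the graph convolution layer generates feature maps, its input $H^{(l)}$ and output $H^{(l+1)}$ are defined to satisfy a nonlinear function of the layer input and the adjacency matrix $A$ of the graph:
$$H^{(l+1)} = f\left(H^{(l)}, A\right)$$
In the widely used formulation of the graph convolution layer, this function is:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$
where $\tilde{A} = A + I$ is the adjacency matrix with self-loops, $\tilde{D}$ is its degree matrix, $W^{(l)}$ is the weight matrix of layer $l$, and $\sigma$ is an activation function.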
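A single graph convolution layer can be sketched in NumPy as below. The sketch assumes the widely used symmetric normalization with self-loops and a ReLU activation. The source does not confirm that this exact form is the one used in the paper, and the function name and toy graph are hypothetical.

```python
import numpy as np

def gcn_layer(adjacency, features, weights):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W), an assumed form."""
    adjacency = np.asarray(adjacency, dtype=float)
    # Add self-loops so each node keeps its own features.
    a_tilde = adjacency + np.eye(adjacency.shape[0])
    # Symmetric normalization by the node degrees of the self-looped graph.
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    a_hat = a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Aggregate neighbor features, apply the layer weights, then ReLU.
    return np.maximum(a_hat @ features @ weights, 0.0)

# Toy graph: two nodes joined by one edge, one feature per node.
adjacency = np.array([[0, 1], [1, 0]])
features = np.array([[1.0], [3.0]])
weights = np.array([[1.0]])
output = gcn_layer(adjacency, features, weights)
```

In this toy graph every normalized weight is 0.5, so each node ends up with the average of its own feature and its neighbor's, which shows how the layer mixes information along edges.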
In the development of deep learning, researchers have generally improved models by increasing the number of network layers, because greater depth improves expressiveness and lets the network learn more features. In practice, however, adding layers does not necessarily improve accuracy and can instead cause problems such as vanishing and exploding gradients, excessive training time, overfitting, and degradation. To avoid these problems, this paper improves the network architecture.
Batch Normalization
Batch normalization (BN) was proposed to solve the problem of internal covariate shift during neural network training and is widely used as an optimization technique in deep learning. Its principle is to stabilize the data distribution by normalizing the input of each layer, which improves the training speed and generalization ability of the network. First, a batch of data to be processed is defined, and its mean and variance are used to normalize each sample in the batch. After normalization, two learnable parameters, a scale and a shift, are introduced to restore the representational capacity of the layer. After batch normalization the data distribution of the network is stabilized, which largely resolves the internal covariate shift that arises during training. In addition, the normalized data fall in the interval where the nonlinear function is more sensitive to its input, which amplifies gradient changes and avoids vanishing gradients, so training is accelerated while generalization improves.
Dropout regularization
Dropout is a common regularization technique used during neural network training to prevent overfitting.
The core idea of dropout is to randomly turn off some neurons during each training iteration, i.e., to set their outputs to zero. This reduces the interdependence between neurons and thus the risk of overfitting.
Residual learning
In convolutional neural networks, increasing the depth of the network may lead to vanishing and exploding gradients, which are generally addressed with regularization and batch normalization. However, these measures do not solve the degradation problem that occurs as the network depth increases. Residual learning, shown in Figure 1, is a good solution to this problem: a skip branch directly connects a shallow layer of the network to a deeper layer, so that the shallow-layer information is carried forward, the deeper layer can learn from it, and less shallow-layer information is lost. In the residual learning structure, the weight layers perform convolution and batch normalization.
Attention Mechanism
In recent years, the attention mechanism has become a popular technique for enhancing model performance in deep learning. Its core idea is similar to attention in human vision: people focus on the parts they consider important and selectively ignore information they consider non-critical, so that they obtain higher-value information with limited time and resources. The same principle applies in deep learning. The attention mechanism was first proposed in natural language processing, where it allows the model to focus on certain parts of an input sequence rather than treating the sequence as a whole. Focusing on the most relevant contextual information allows the model to improve prediction accuracy.

Residual learning
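The dropout and residual learning ideas described in this section can be sketched as follows. The sketch uses the common "inverted dropout" variant, which rescales the surviving activations during training. This variant, the function names, and the stand-in `transform` are assumptions made for illustration and are not details given in the paper.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units while training."""
    if not training or rate == 0.0:
        return x
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    # Rescale kept units so the expected activation is unchanged.
    return x * mask / keep_prob

def residual_block(x, transform):
    """Add the block input to the output of its weight layers (skip connection)."""
    return transform(x) + x

rng = np.random.default_rng(0)
activations = np.ones(1000)
dropped = dropout(activations, rate=0.5, rng=rng)

# If the weight layers output zero, the block reduces to the identity mapping.
identity_out = residual_block(activations, lambda v: np.zeros_like(v))
```

The identity case shows why the skip connection counters degradation: a deeper network can always fall back to passing the shallow-layer information through unchanged.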
Medical image diagnosis necessitates effective real-time performance, but conventional neural networks slow down inference because of their large size and high computational complexity. Therefore, neural network models for medical image processing need to be lightweight so that they can be easily integrated into medical devices. The traditional deeper and more complex models are no longer applicable as a result of the demand for lightweight network models for mobile devices. The lightweight network faces challenges when dealing with the complex medical image classification problem, particularly when dealing with medical pictures with unclear features. For this reason, this chapter proposes a neural network model based on MobileNet V2 optimization to achieve the classification and recognition of medical diagnosis pictures.
The MobileNet V2 network model is shown in Fig. 2. MobileNet V2 introduces a building block called “Inverted Residuals with Linear Bottlenecks”. The block starts with an expansion layer that uses a 1 × 1 convolution with ReLU activation to increase the number of channels of the input feature map, continues with a lightweight depthwise convolution, and ends with a 1 × 1 linear bottleneck convolution that projects the features back to a small number of channels. The linear bottleneck preserves information flow by using a linear activation instead of a non-linearity in the bottleneck layer, which helps prevent information loss during forward propagation. Skip connections within the building blocks link inputs and outputs, facilitating gradient flow and reducing the vanishing gradient problem. Finally, MobileNet V2 applies Global Average Pooling (GAP) at the end of the network to reduce the spatial dimensions and generate one-dimensional feature vectors for classification.

MobileNet V2 network model
The inverted residual structure in MobileNet V2 is combined with a linear bottleneck block and differs from the residual structure used in traditional residual networks. In a standard residual block, the input is first passed through a narrower layer (e.g., a convolutional layer with fewer channels), then through a wider layer that restores the channel count, and is finally summed with the input, so the block is wide at both ends and narrow in the middle. The inverted residual block reverses this order: the narrow input is first expanded to more channels, filtered by a depthwise convolution, and then projected back to a narrow output that is summed with the input. This design helps reduce computational cost and model size while improving performance.
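The forward pass of one inverted residual block can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: stride 1, a 3 × 3 depthwise kernel with zero padding, ReLU6 as the non-linearity, an expansion factor of 6, and no batch normalization. The function names and random weights are hypothetical and do not reproduce the paper's trained model.

```python
import numpy as np

def relu6(x):
    return np.minimum(np.maximum(x, 0.0), 6.0)

def pointwise_conv(x, w):
    # 1x1 convolution: x is (H, W, C_in), w is (C_in, C_out).
    return x @ w

def depthwise_conv3x3(x, k):
    # One 3x3 filter per channel: x is (H, W, C), k is (3, 3, C).
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += padded[i:i + h, j:j + w, :] * k[i, j, :]
    return out

def inverted_residual(x, w_expand, k_depthwise, w_project):
    y = relu6(pointwise_conv(x, w_expand))       # narrow -> wide
    y = relu6(depthwise_conv3x3(y, k_depthwise)) # per-channel spatial filtering
    y = pointwise_conv(y, w_project)             # wide -> narrow, no activation
    # The skip connection applies only when input and output shapes match.
    return y + x if y.shape == x.shape else y

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))
w_expand = rng.standard_normal((4, 24)) * 0.1   # assumed expansion factor 6
k_depthwise = rng.standard_normal((3, 3, 24)) * 0.1
w_project = rng.standard_normal((24, 4)) * 0.1
out = inverted_residual(x, w_expand, k_depthwise, w_project)
```

The final projection deliberately has no activation, which is the linear bottleneck: applying a ReLU to the narrow output would discard information that the few remaining channels cannot recover.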
After profiling the original structure of MobileNet V2, a shuffle attention (SA) mechanism module is introduced so that the model pays more attention to the salient features of the medical diagnostic regions in the image. The module establishes associations in the channel dimension by integrating the spatial information of the input feature map. It divides the input feature map into subgroups and rearranges the features in each subgroup. This rearrangement increases mixing in the feature map and facilitates information transfer across channels and spatial locations.
The combination of the SA module and the feature rearrangement operation enhances the learning ability of the neural network model for spatial and channel information, which in turn demonstrates higher performance in image processing and computer vision tasks.
The SA module initially divides the input feature map into subgroups.
One option for channel attention to fully capture channel dependencies is to utilize SE blocks. However, this introduces too many parameters for the design of a more lightweight attention module. Furthermore, generating channel weights by performing a faster one-dimensional convolution of size
Channel Attention
Channel attention embeds global information through Global Average Pooling (GAP) and uses it in place of the per-channel feature information, such that the feature map is reduced over its spatial dimensions to one statistic per channel.
An accurate and adaptively selectable feature was also created that operates through
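The grouping, channel attention, and rearrangement steps of the SA module can be sketched roughly as follows. This is a simplified illustration under explicit assumptions: channel weights come from global average pooling followed by a per-channel scale, an offset, and a sigmoid gate, and the rearrangement is a channel shuffle across groups. Any spatial attention branch is omitted, and the function names and the parameters `weight` and `bias` are hypothetical, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, weight, bias):
    # x: (C, H, W). Global average pooling gives one statistic per channel.
    stats = x.mean(axis=(1, 2))
    gate = sigmoid(weight * stats + bias)   # per-channel gate in (0, 1)
    return x * gate[:, None, None]

def channel_shuffle(x, groups):
    # Interleave channels so information mixes across groups.
    c, h, w = x.shape
    grouped = x.reshape(groups, c // groups, h, w)
    return grouped.transpose(1, 0, 2, 3).reshape(c, h, w)

def grouped_channel_attention(x, groups, weight, bias):
    c = x.shape[0]
    assert c % groups == 0, "channel count must be divisible by the group count"
    size = c // groups
    parts = []
    for g in range(groups):
        sl = slice(g * size, (g + 1) * size)
        parts.append(channel_attention(x[sl], weight[sl], bias[sl]))
    return channel_shuffle(np.concatenate(parts, axis=0), groups)

# Toy input: 6 channels whose values equal their channel index.
x = np.arange(6, dtype=float)[:, None, None] * np.ones((6, 2, 2))
result = grouped_channel_attention(x, groups=2, weight=np.zeros(6), bias=np.zeros(6))
```

With zero `weight` and `bias` every gate equals 0.5, so `result` is simply the input halved and with its channels reordered as 0, 3, 1, 4, 2, 5, which isolates the effect of the shuffle step.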
Manual recognition of medical images is time-consuming, subjective, inefficient, and sensitive to laboratory conditions because of the complexity and variability of medical images. To solve these problems, researchers have imported images into computers and built systems that analyze medical data automatically, i.e., image-based computer-aided diagnosis (CAD) systems. The overall process of the traditional CAD system is not yet fully automated. Only the initial step, lung region segmentation, has produced some results, and much research is still needed to improve its accuracy. The subsequent feature extraction and drug resistance recognition steps have not yet entered the automatic processing stage and remain semi-automated. For this reason, this study builds on the optimized deep learning image recognition model described above to improve the computer-aided diagnosis (CAD) system and increase its practical value in medical diagnosis.
The architecture of the CAD diagnosis system that uses deep learning image recognition is depicted in Fig. 3. In the CAD system, the whole process is divided into two main tasks, i.e., the detection task and the classification and recognition task. The detection task involves detecting medical diagnostic regions in the foreground human body image, and segmenting and processing these diagnostic image regions. The classification and recognition task refers to extracting the features of the diagnostic image after the diagnostic image region has been segmented and then inputting them into a classifier to classify them and identify whether they have lesions or other problems. The initial step of the CAD system, the preprocessing stage, aims to normalize the image and enhance the clarity of the image and takes measures such as adjusting the size of the image to a uniform size, noise reduction, contrast enhancement, and so on.

CAD system framework based on deep learning recognition
After that, features are extracted from the processed lung segmentation region. The features fall into three main types: 2D features, 3D features, and other manually collected features. The 2D features mainly include LBP features, gradient features, pixel mean and variance values, and structural features. The 3D features mainly consist of multidimensional matrices. Principal Component Analysis (PCA) is then applied to reduce the number of extracted features. The resulting refined features are used to train the classifier, and the final results are obtained by applying the trained classifier to the test set of target diagnostic images.
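The PCA step that reduces the extracted features can be sketched with NumPy's singular value decomposition. This is a minimal illustration: the function name `pca_reduce` and the toy feature matrix are hypothetical, and the paper does not state how many components it keeps.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project row-wise feature vectors onto their top principal components."""
    x = np.asarray(features, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of `vt` are the principal directions, ordered by explained variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
    return centered @ components.T, explained_ratio[:n_components]

# Toy feature matrix: four samples whose two features are perfectly correlated.
features = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
reduced, ratio = pca_reduce(features, n_components=1)
```

Because the two toy features carry the same information, a single component explains essentially all of the variance, which is the situation in which PCA can shrink the feature set before classifier training with little loss.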
In this paper, the segmented image is divided into background and medical diagnostic regions. Several evaluation metrics are available for measuring the accuracy of an image segmentation algorithm. The criteria used here are Pixel Accuracy (PA), Mean Pixel Accuracy (MPA), Mean Intersection over Union (MIoU), and Frequency Weighted Intersection over Union (FWIoU). The software environment is TensorFlow 2.1 with CUDA 10.1 and cuDNN 7.6. TensorFlow uses CUDA and cuDNN to accelerate training on NVIDIA GPUs. The hardware is a GeForce RTX 2070 SUPER (8 GB video memory). Conventional preprocessing, such as resizing images to a uniform size, is used. The input image resolution of the model is 416 × 608, the batch size is set to 2, and the model is trained for 50 epochs. The training results are shown in Fig. 4. The accuracy of the network model increases steadily during training and eventually stabilizes around 0.975. The loss decreases steadily and eventually stabilizes around -0.969.

Model training results
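The four segmentation metrics named above (PA, MPA, MIoU, and FWIoU) can all be computed from a pixel-level confusion matrix, as in the following sketch. It uses the standard definitions of these metrics. The function name and the ten toy pixel labels are hypothetical and unrelated to the paper's data, and classes absent from the ground truth are not handled.

```python
import numpy as np

def segmentation_metrics(y_true, y_pred, num_classes=2):
    """PA, MPA, per-class IoU, MIoU and FWIoU from integer label maps."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    # Rows index the true class, columns the predicted class.
    cm = np.bincount(num_classes * y_true + y_pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    correct = np.diag(cm).astype(float)
    true_count = cm.sum(axis=1)
    pred_count = cm.sum(axis=0)
    iou = correct / (true_count + pred_count - correct)
    freq = true_count / cm.sum()
    return {
        "PA": correct.sum() / cm.sum(),
        "MPA": np.mean(correct / true_count),
        "IoU": iou,
        "MIoU": iou.mean(),
        "FWIoU": np.sum(freq * iou),
    }

# Toy example: 0 = background, 1 = diagnostic region.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
metrics = segmentation_metrics(y_true, y_pred)
```

FWIoU weights each class IoU by how often the class occurs, so in images dominated by background it stays close to the background IoU, which is why the per-class IoU values are also worth reporting.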
To verify the effectiveness of the MobileNet V2 network model in this paper, the relevant Python libraries are used to test the image segmentation and calculate the results. MobileNet, VGG, and ResNet-50 are each used as the backbone network for comparison with this paper’s method, with the U-Net decoder as the up-sampling network in every case, and all models are trained and tested under the same hyper-parameter settings. The results of the performance comparison are shown in Table 1. The model that uses this paper’s network for down-sampling performs best on all four metrics: frequency weighted intersection over union (0.904), mean pixel accuracy (0.881), background intersection over union (0.941), and diagnostic region intersection over union (0.807). The proposed image segmentation model can therefore extract regions from medical diagnostic images effectively and efficiently.
Performance comparison results
Model | FWIoU | MPA | Background IoU | Diagnostic region IoU |
---|---|---|---|---|
MobileNet | 0.834 | 0.783 | 0.886 | 0.679 |
VGG | 0.856 | 0.838 | 0.896 | 0.786 |
ResNet-50 | 0.898 | 0.868 | 0.928 | 0.805 |
Ours | 0.904 | 0.881 | 0.941 | 0.807 |
To further validate the deep learning-based image recognition model in medical diagnosis, the model is evaluated after training using Sensitivity, Specificity, Positive Predictive Value (PPV), Negative Predictive Value (NPV), AUC, and F1-score. The classification of medical data calls for more evaluation metrics than other classification tasks, because the performance of the model ultimately affects patients, so the model should be evaluated from multiple perspectives while its performance is continuously improved. This image recognition experiment takes lung disease as an example and divides lung medical images into three categories: diseased, normal, and viral pneumonia. The comprehensive assessment results are shown in Table 2. The F1-score exceeds 97% for all three categories, and the AUC is as high as 99%. The mean Sensitivity, Specificity, PPV, NPV, AUC, and F1-score over the three categories are 0.973, 0.990, 0.978, 0.990, 0.996, and 0.977, respectively, an outstanding performance. Compared with the model without image segmentation, this paper’s method improves performance by focusing more on the lung diagnostic region.
Comprehensive evaluation of image recognition
Class | Sensitivity | Specificity | PPV | NPV | AUC | F1-score |
---|---|---|---|---|---|---|
Diseased | 0.965 | 0.998 | 0.995 | 0.984 | 0.999 | 0.976 |
Normal | 0.981 | 0.985 | 0.959 | 0.995 | 0.999 | 0.979 |
Viral Pneumonia | 0.974 | 0.987 | 0.981 | 0.992 | 0.999 | 0.975 |
Mean | 0.973 | 0.990 | 0.978 | 0.990 | 0.996 | 0.977 |
Mean (no image segmentation) | 0.966 | 0.985 | 0.967 | 0.983 | 0.996 | 0.964 |
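The per-class metrics reported in Table 2, except AUC, can be derived from a multi-class confusion matrix by treating each class in turn as positive and the rest as negative. The sketch below uses the standard definitions. The function name and the confusion matrix counts are hypothetical and are not the paper's data. AUC is omitted because it requires the predicted scores rather than hard labels.

```python
import numpy as np

def one_vs_rest_metrics(confusion, k):
    """Sensitivity, specificity, PPV, NPV and F1 for class `k` (rows = true class)."""
    cm = np.asarray(confusion, dtype=float)
    tp = cm[k, k]
    fn = cm[k].sum() - tp        # class k samples predicted as another class
    fp = cm[:, k].sum() - tp     # other classes predicted as class k
    tn = cm.sum() - tp - fn - fp
    sensitivity = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "ppv": ppv,
        "npv": tn / (tn + fn),
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
    }

# Hypothetical counts for three classes (e.g. diseased, normal, viral pneumonia).
confusion = [[90, 5, 5],
             [2, 95, 3],
             [3, 2, 95]]
diseased = one_vs_rest_metrics(confusion, k=0)
```

Averaging each metric over the classes gives a "Mean" row like the one in the table. Sensitivity and NPV matter most when a missed case is costly, while specificity and PPV reflect how many false alarms a clinician must review.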
To validate the effectiveness of the CAD medical diagnosis system in this paper, the LIDC-IDRI database was used as the sample set of lung CT slices for training and testing. Complete lung CT slice data from 800 patients were used for training and from 176 patients for testing. The training sample set consisted of 10,000 pseudo-colored small-block images of lung nodules extracted from the CT images of the 800 patients, together with 12,000 pseudo-colored small-block images of healthy tissue. A deep learning model learned from these data to produce a predictive model, which was then tested on the CT images of the 176 patients not included in the training set. The test set contained 321 lung nodules.
The trained model was tested with a threshold of 0.5, and results were grouped by lung nodule size. The detection results are shown in Fig. 5. All undetected lung nodules are smaller than 10 mm, and sensitivity decreases as nodule size decreases. The sensitivities for nodules smaller than 8 mm were 87.1 and 87.3, significantly lower than for nodules larger than 8 mm. Taken together, the average sensitivity of this paper’s method for the medical detection of tuberculosis is 94.6%, which can meet the needs of medical diagnosis.

Experimental results
The CAD system of this paper is compared with other CAD systems, namely those of Zhang et al., Ye et al., Choi et al., Setio et al., Ma et al., and Liu et al. The results of the comparison experiment are shown in Figure 6. The sensitivity of the proposed CAD medical diagnosis system based on deep learning image recognition is 95.62, which is higher than that of some other CAD systems, and its average number of false positives is 4.88, which is lower than that of most detection systems and higher only than those of Liu et al. and Choi et al. This further shows that the method of this paper performs well.

CAD system comparison experimental results
In the context of the computer age, this paper establishes a CAD medical diagnosis system based on deep learning image recognition. The main findings are summarized as follows:
1) To meet the demands of medical diagnosis, this paper optimizes the network model. During training, the accuracy and loss of the optimized model stabilize around 0.975 and -0.969, respectively. Compared with other network models, this paper’s model is optimal on all four indexes, namely frequency weighted intersection over union (0.904), mean pixel accuracy (0.881), background intersection over union (0.941), and diagnostic region intersection over union (0.807), and the optimization effect is obvious.
2) In the recognition of the three lung categories based on deep learning image recognition, the F1-scores all exceed 97%, and the AUC is as high as 99%. The mean values of Sensitivity, Specificity, Positive Predictive Value (PPV), Negative Predictive Value (NPV), AUC, and F1-score for the three categories are 0.973, 0.990, 0.978, 0.990, 0.996, and 0.977, respectively, which is an excellent performance and an improvement over the model that does not use image segmentation.
3) In the empirical analysis using tuberculosis as an example, the CAD medical diagnosis system in this paper is less sensitive for smaller nodules: the sensitivities for nodules smaller than 8 mm are 87.1 and 87.3. However, the average sensitivity of this paper’s method for the medical detection of tuberculosis is 94.6%, which meets the needs of medical diagnosis. Compared with other CAD systems such as those of Zhang et al., Ye et al., Choi et al., Setio et al., Ma et al., and Liu et al., the sensitivity of this paper’s CAD medical diagnostic system was 95.62, higher than that of the other CAD systems. Meanwhile, the average number of false positives was 4.88, higher only than those of Liu et al. and Choi et al., further demonstrating the feasibility and practical application value of the method in this paper.