Study on the influence of adolescent smoking on physical training vital capacity in eastern coastal areas

Published online: 15 Dec 2022
Volume & Issue: AHEAD OF PRINT
Pages: -
Received: 25 Mar 2022
Accepted: 14 Apr 2022
Introduction

The measurement of pulmonary function indices is one of the principal means of studying the harm caused by smoking. Research on the impact of smoking on pulmonary ventilation is not yet fully consistent: because there are many pulmonary function indices and different research emphases, studies have produced divergent results. At present, the proportion of smokers remains high, smokers are trending younger, and adolescent smoking in the eastern coastal areas is particularly evident. Although there are many studies on smoking and vital capacity at home and abroad, few take adolescents in the eastern coastal areas as the research object; most instead study elderly subjects with impaired pulmonary function, so the study populations are not typical and the results cannot be generalised to the public. Most of these studies also rely on existing statistical software for analysis, without an in-depth discussion of the relationships and rules among pulmonary function indices. Therefore, to better study the harm that smoking causes to lung function, we need to combine data mining algorithms to mine smoking data and pulmonary function index data, analyse and classify the data, and find the rules among the data.

Against this background, this paper studies the impact of adolescent smoking on physical training vital capacity in the eastern coastal areas and is divided into four sections. Section 1 briefly introduces current medical research on smoking and vital capacity and outlines the structure of this study. Section 2 focuses on the application of various algorithms in the medical field, introduces existing data mining algorithms and their improved variants, and summarises the shortcomings of existing research. In Section 3, taking adolescents in the eastern coastal area as an example, the pulmonary function indices are measured, the k-means algorithm is used to cluster the data and this algorithm is improved; an improved decision tree algorithm is then used to classify the measured pulmonary function indices. In Section 4, the proposed algorithm is simulated and applied to the analysis of adolescent vital capacity indices. The experimental results show that, compared with the traditional algorithms, the proposed algorithm yields tighter clusters and simpler groupings; across different data sets its running time is shorter than that of the traditional algorithms; and it achieves a classification accuracy of 96.2%.

The innovation of this paper is to propose a data mining algorithm for adolescent vital capacity. On the one hand, it improves the k-means clustering algorithm, refining the choice of the k value and shortening the running time. On the other hand, in the decision tree algorithm, an improved discretisation process is proposed on the basis of the conventional improvement idea, using the average entropy in place of the optimal threshold, and the correlation coefficient is introduced into the calculation of the information gain rate. A simplified Maclaurin expansion is used to simplify the calculation formula, reducing logarithmic calculations and effectively improving the efficiency of the algorithm.

State of the art

In the analysis of public health indicators, data mining has always been a problem that cannot be ignored. If data changes cannot be grasped in time, not only is the accuracy of test results affected, but disease research also suffers. Data mining can analyse the relationships between data and plays many roles in the analysis of public physical fitness. At present, there are many research algorithms concerning data mining. For example, Tang Y. et al. screened biochemical test and TCM Constitution questionnaire data from 2014 to 2016 according to inclusion and exclusion criteria, established a prediction matrix from 3 consecutive years of longitudinal data and built a prediction model with the treenet machine learning algorithm to reveal the dependence between physiological indices, TCM Constitution and MS [6]. Scaria et al. [7] proposed a user-friendly rule-based classification model, generated classification rules with a cuckoo search optimisation algorithm, pruned the rules through association rule mining and classified with the pruned rules. Janowski et al. [8] identified subjects from the pulmonary hypertension registry of the University of Arizona, excluded patients with missing haemodynamic RV function data (n = 23), performed unsupervised clustering (k-means) on systolic and diastolic ventricular function variables and determined the weight of variables in cluster allocation by linear discriminant analysis (LDA). In terms of algorithm improvement, Shan [9] used the grey gradient maximum entropy method to extract features from images, used the k-means method to classify the images and employed the average precision (AP) and intersection over union (IU) evaluation methods to evaluate the results. Wei et al. [10] used the k-means algorithm to analyse face features: first, the biological features of the human face are extracted; then the face features are clustered by the k-means method; finally, a support vector machine is used for classification. Vera et al. [11] analysed the usefulness of multidimensional scaling related to k-means clustering on a dissimilarity matrix when the dimension of the objects is unknown, studied the linear invariance in squared-dissimilarity k-means clustering and the use of multidimensional scaling to determine the cluster membership of observations, and addressed the problem of selecting the number of clusters for a dissimilarity matrix in k-means. Tuncer et al. [12] proposed two fully automated kidney segmentation methods, processed the image to determine the coordinates of the spine, obtained the kidney region using connected component labelling (CCL) and the k-means clustering algorithm, and segmented the kidney with different filters accordingly. Zhou et al. [13] proposed two methods to improve the k-means clustering algorithm by designing two different adjacency matrices. Velastin et al. [14] proposed a fast and effective method for the automatic grading of apples based on multiple image features and a weighted k-means clustering algorithm, providing a new way to distinguish apple defects, stems and calyxes using four images (top, bottom and both sides); under this method, the average grey value, Pra (proportion of red area) and defect area of each apple are carefully selected to improve the practicability of the method.

To sum up, at present most medical data analyses at home and abroad use classification algorithms, and the types of data studied are diversified, covering cardiac function, liver function and lung function indicators. However, most of these studies focus on the medical field itself, with little in-depth analysis of the algorithms used. Although there are many data mining algorithms and related improvement strategies, such as the decision tree algorithm, cluster analysis algorithms and the ant colony algorithm, they are relatively one-sided and rarely applied to medical data mining. Accordingly, it is of great practical significance to analyse the influence of adolescent smoking on physical training vital capacity in the eastern coastal areas.

Methodology
Research object and method

Adolescents in the eastern coastal area were selected as the research object and divided into two groups according to smoking status: non-smoking adolescents formed the control group and smoking adolescents the observation group. The vital capacity indices of the two groups were measured. The measured indices included gender, age, maximal vital capacity and the forced expiratory volume in the first second.

The research data were analysed by a clustering algorithm. Clustering algorithms have been studied in the literature for many years and are widely used in medicine. Among the various algorithms, the k-means algorithm analyses the similarity of data objects and groups highly similar data into one class [15]. The algorithm is a partitioning algorithm; it is simple to implement and highly efficient. Generally, objects are selected from the data as initial points, distances are calculated and each object is assigned to the corresponding class; the cluster centres are then repeatedly recomputed until they cease to change [16]. Assuming that there are n data objects, in the calculation we set the value of k manually, determine the initial clustering centres, select the corresponding data from the data set and then calculate the distances. According to the shortest-distance principle, the data are divided into the corresponding classes, and the cluster centres are then updated. Assuming that the given sample set is described by D = {x1, x2, …, xm}, after k-means clustering the classes are represented by C = {c1, c2, …, ck}. The measurement index is the Euclidean distance:
$$dist_{ed}(x_i, x_j) = \left[ \sum_{u=1}^{n} \left| x_{iu} - x_{ju} \right|^2 \right]^{\frac{1}{2}}$$

We use the following formulas to calculate the error:
$$E = \sum_{i=1}^{k} \sum_{x \in c_i} \left\| x - \eta_i \right\|^2, \qquad \eta_i = \frac{1}{|c_i|} \sum_{x \in c_i} x$$
where ηi is the mean vector of class ci, also called the centroid. The smaller the value of E, the higher the similarity within the classes, and thus this value needs to be minimised. Differentiating E shows that, for a fixed assignment, it attains its minimum only at x = ηi; after each iteration, the value of E decreases.
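To make the procedure concrete, the following is a minimal sketch of this k-means loop in Python with NumPy. The function name and the assumption that no cluster ever becomes empty are ours, not the paper's:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Choose k distinct samples as the initial cluster centres.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign every sample to its nearest centre (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centre as the mean vector (centroid) of its class;
        # this sketch assumes no cluster ever becomes empty.
        new_centres = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centres, centres):  # centres ceased to change
            break
        centres = new_centres
    # E: the within-class sum of squared distances from the error formula.
    E = sum(((X[labels == i] - centres[i]) ** 2).sum() for i in range(k))
    return labels, centres, E
```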

The k-means algorithm is one of the most commonly used clustering algorithms at present. It is relatively simple, easy to use, easy to combine with other algorithms and converges quickly. At the same time, it has several defects: the result depends on the initial values, and a poor initialisation may yield no useful solution. When the objective function is minimised by a gradient method, if the initial centres are themselves close to a local minimum, the algorithm easily falls into a local optimum. The algorithm is also sensitive to outliers, and it takes a long time to process massive data.

The improved k-means algorithm mainly calculates and analyses the range of values of k. In cluster analysis there are two relevant statistics. The total SSE is the sum of squared deviations within all clusters:
$$SSE = \sum_{i=1}^{k} \sum_{x \in c_i} dist(c_i, x)^2$$
where ci represents the cluster centre and x the points in the cluster. The total SSB is the sum of squared deviations between the different classes:
$$SSB = \sum_{i=1}^{k} m_i \, dist(c_i, c)^2$$
where mi is the size of cluster i, c is the overall mean and dist is the distance. For a given k value, it is generally considered that the smaller the SSE and the larger the SSB, the better. Considering the influence of the number of clusters and the sample size on the calculation results, the ratio needs to be corrected, which is done with the following formula:
$$\frac{SSB/(k-1)}{SSE/(n-k)} = \frac{SSB}{SSE} \cdot \frac{n-k}{k-1}$$

This corrected statistic is simple to calculate, shortens the running time and offers great advantages in choosing the k value, as illustrated below.
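As an illustration, the corrected ratio can be scored across candidate k values as follows. This is a sketch under the assumption that the kmeans() helper from the previous snippet is available; the function name corrected_ratio is ours:

```python
import numpy as np

def corrected_ratio(X, labels, centres):
    n, k = len(X), len(centres)                    # requires k >= 2 and n > k
    c = X.mean(axis=0)                             # overall mean vector
    sse = sum(((X[labels == i] - centres[i]) ** 2).sum() for i in range(k))
    ssb = sum((labels == i).sum() * ((centres[i] - c) ** 2).sum()
              for i in range(k))
    # Corrected ratio: (SSB / (k - 1)) / (SSE / (n - k))
    return (ssb / (k - 1)) / (sse / (n - k))

# Pick the k that maximises the corrected ratio, reusing kmeans() above:
# best_k = max(range(2, 10),
#              key=lambda k: corrected_ratio(X, *kmeans(X, k)[:2]))
```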

Cluster analysis of vital capacity index based on decision tree

To classify vital capacity, we also need the decision tree algorithm, which estimates the expected values of events from their probabilities of development and judges the likelihood of outcomes. A decision tree is a kind of machine learning model and belongs to the class of prediction models [17]. In the process of generating a decision tree, the attribute values of the data set must be partitioned, and the resulting trees differ: some are simple and some are very bloated. The decision tree algorithm therefore has strong applicability but often needs optimisation. The algorithm takes the root node as the starting point, selects the optimal attribute and generates one branch per attribute value [18]. The decision tree algorithm is itself iterative, and the recursion ends only when one of the following conditions is met: the samples at a node all belong to the same category; an empty subset is labelled with the most common class of its parent node; or a node with no remaining attributes is labelled with its most common class, as sketched below.
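The recursive construction and the three stopping conditions can be sketched as follows. Rows are assumed to be dictionaries with a "class" key, and best_attribute is a placeholder for the attribute-selection criterion given in the formulas that follow; these representational choices are ours:

```python
from collections import Counter

def best_attribute(rows, attributes):
    # Placeholder: C4.5 would return the attribute with the largest
    # information gain rate (see the formulas in the next paragraphs).
    return attributes[0]

def build_tree(rows, attributes, parent_majority=None):
    labels = [r["class"] for r in rows]
    if not rows:                        # empty subset: use the parent's
        return parent_majority          # most common class
    if len(set(labels)) == 1:           # all samples in the same category
        return labels[0]
    if not attributes:                  # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    majority = Counter(labels).most_common(1)[0][0]
    best = best_attribute(rows, attributes)
    node = {"attribute": best, "branches": {}}
    for value in {r[best] for r in rows}:       # one branch per value
        subset = [r for r in rows if r[best] == value]
        rest = [a for a in attributes if a != best]
        node["branches"][value] = build_tree(subset, rest, majority)
    return node
```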

The ID3 algorithm calculates the information gain at each decision tree node, selects the attribute with the largest gain for that node and tests all attributes. ID3 is the classic decision tree algorithm. Its generation rules are easy to understand; built from the top down, it reduces the number of tests and can process discrete data. However, its disadvantages are also obvious: it maintains only one hypothesis, the information gain is biased towards attributes with many values, and it can handle only non-incremental, discrete data. The C4.5 algorithm is an improvement of ID3. It selects attributes using the information gain rate, can deal with continuous-valued attributes and missing values, is easy to understand and has high accuracy. The calculation has two steps. The first step is to establish a reasonable model, taking the data to be analysed as the training set and other data as the test set. The second step is the data analysis itself; the many algorithms in use all require their accuracy to be calculated [19]. In the algorithm, the information gain rate must be calculated. Assuming that the data set is represented by D, that it contains d samples and that there are m classes, the amount of classification information can be expressed as:
$$Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$
where pi represents the proportion of class i. Assuming that A is one of the attributes and takes v values, the data set is divided into v subsets. Denoting by cij the number of samples of class Ci in subset Dj, the information entropy of attribute A can be calculated by the following formula:
$$Info_A(D) = \sum_{j=1}^{v} \frac{c_{1j} + c_{2j} + \cdots + c_{mj}}{d} \, Info(c_{1j}, c_{2j}, \ldots, c_{mj})$$

The information of each subset Dj is calculated as:
$$Info(c_{1j}, c_{2j}, \ldots, c_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij}$$
where pij represents the proportion of samples of class i within subset Dj. The information gain of attribute A can then be expressed as:
$$Gain(A) = Info(D) - Info_A(D)$$
We calculate the split entropy of A with the formula:
$$SplitInfo_A(D) = -\sum_{j=1}^{v} p_j \log_2 p_j$$
where pj represents the proportion of the data set falling in subset Dj. The information gain rate of A can be expressed as:
$$GainRatio(A) = \frac{Info(D) - Info_A(D)}{SplitInfo_A(D)}$$
A worked sketch of these formulas follows.
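A direct transcription of the Info / Gain / GainRatio formulas into Python might look like this (a sketch for discrete attributes; the dictionary-of-rows representation matches the tree sketch above and is our assumption):

```python
import math
from collections import Counter

def info(labels):
    # Info(D) = -sum_i p_i * log2(p_i) over the class proportions p_i.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, attribute, class_key="class"):
    n = len(rows)
    # Group the class labels of the rows by the value of attribute A.
    subsets = {}
    for r in rows:
        subsets.setdefault(r[attribute], []).append(r[class_key])
    # Info_A(D): subset entropies weighted by the size of each subset D_j.
    info_a = sum(len(s) / n * info(s) for s in subsets.values())
    gain = info([r[class_key] for r in rows]) - info_a
    # SplitInfo_A(D): entropy of the split proportions themselves.
    split = -sum(len(s) / n * math.log2(len(s) / n) for s in subsets.values())
    return gain / split if split else 0.0
```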

The C4.5 algorithm tests the information gain rate over the candidate attributes, takes the best as the partition attribute and constructs the decision tree through iteration; it is a widely used approach. However, this algorithm also has defects. Although it mitigates ID3's bias in the information gain, the gain rate reduces the interpretability in information theory terms, and during construction, splitting on attributes with many values produces many empty branches and increases the number of branches of the decision tree [20]. Therefore, this algorithm needs to be improved. Assuming a two-dimensional random variable (X, Y), consider the expectation:
$$E\{[X - E(X)][Y - E(Y)]\}$$

This expectation is the covariance:
$$Cov(X, Y) = E\{[X - E(X)][Y - E(Y)]\}$$

If the random variables are discrete, with joint probability distribution
$$P\{X = x_i, Y = y_j\} = p_{ij}$$
where i, j are natural numbers, the covariance of the two random variables can be expressed as:
$$Cov(X, Y) = \sum_{i,j} [x_i - E(X)][y_j - E(Y)] \, p_{ij}$$
If both random variables are continuous with joint density f(x, y), the covariance can be expressed as:
$$Cov(X, Y) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} [x - E(X)][y - E(Y)] \, f(x, y) \, dx \, dy$$

The relationship between the variance and covariance of random variables can be described as follows:
$$D(X + Y) = D(X) + D(Y) + 2\,Cov(X, Y)$$

Assuming random variables satisfying D(X) > 0 and D(Y) > 0, the correlation coefficient is:
$$\rho_{X,Y} = \frac{Cov(X, Y)}{\sqrt{D(X) \, D(Y)}}$$

If this value is 0, the two random variables are considered uncorrelated. There are many improved algorithms based on ID3, including C4.5, and these are likewise based on the amount of information [21]. The algorithm considered here accounts for the correlation of attributes and calculates the sum of correlation coefficients between the test attribute and the other attributes:
$$\rho = \sum_{f \in F} \frac{Cov(A, f)}{\sqrt{D(A) \, D(f)}}$$
where ρ represents the sum of the correlation coefficients between the test attribute and the other attributes, that is, the redundancy; F represents all test attributes and f is an element of F. The average correlation coefficient between the test attribute and the other attributes can then be expressed as:
$$\bar{\rho} = \frac{1}{n} \sum_{f \in F} \frac{Cov(A, f)}{\sqrt{D(A) \, D(f)}}$$

Because the C4.5 algorithm uses the information gain rate as its test criterion, the average correlation coefficient needs to be incorporated in order to balance against the other attributes. The improved information gain rate formula is expressed as:
$$GainRatio(A) = \frac{Gain(A)}{\bar{\rho} \cdot SplitInfo_A(D)}$$
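The improved criterion can be sketched as follows, assuming numeric attribute codes so that the correlation coefficients can be computed with numpy.corrcoef. The helper names and the absolute-value convention are our assumptions, and gain_ratio() is the function from the earlier sketch:

```python
import numpy as np

def mean_correlation(X, a, others):
    # X: (n_samples, n_attributes) numeric matrix; a: column of attribute A;
    # others: column indices of the remaining test attributes.
    rhos = [abs(np.corrcoef(X[:, a], X[:, f])[0, 1]) for f in others]
    return sum(rhos) / len(rhos)       # average correlation, rho-bar

def improved_gain_ratio(rows, attribute, X, a, others):
    # Divide the conventional gain rate by the redundancy term rho-bar.
    return gain_ratio(rows, attribute) / mean_correlation(X, a, others)
```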

According to this formula, the lower the correlation between the test attribute and the other attributes, the smaller the redundancy and the greater the gain rate.

For C4.5, the information gain rate is the core of the calculation. It must be computed every time a node is selected, and every computation involves logarithms. If the data set is small, the amount of calculation is modest; but in many cases there is a large amount of data, and logarithmic calculation wastes time [22]. To shorten the running time, the formula can be simplified using Taylor's theorem. Assuming that the function has derivatives of sufficient order on an interval containing x0:
$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2!}(x - x_0)^2 + \cdots + \frac{f^{(n)}(x_0)}{n!}(x - x_0)^n + R_n(x)$$
Taking x0 = 0, the formula simplifies to:
$$f(x) = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \cdots + \frac{f^{(n)}(0)}{n!}x^n$$

Then we use the Maclaurin expansion of the logarithm to eliminate logarithmic calculation:
$$f(p) = \ln(1 + p) = p - \frac{p^2}{2} + \frac{p^3}{3} - \cdots + (-1)^{n-1}\frac{p^n}{n}$$

When p lies between 0 and 1, the higher-order terms shrink as n increases, and thus, by the limit theorem, the logarithm is approximated as:
$$\ln p \approx p - 1$$

The calculation formula of information entropy can be described as:
$$Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$

Substituting pi = ci/d, the information entropy becomes:
$$Info(D) = -\sum_{i=1}^{m} \frac{c_i}{d} \log_2 \frac{c_i}{d}$$

Using the approximation above, the previous formula can be simplified as:
$$Info(D) \approx -\frac{1}{d^2 \ln 2} \sum_{i=1}^{m} c_i (c_i - d)$$
A small numerical check of this simplification follows.
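The class counts below are illustrative. Because ln p ≈ p − 1 is exact only near p = 1, the two values differ in magnitude, but the log-free form can still be used consistently when comparing attributes:

```python
import math

counts = [40, 35, 25]                  # c_i: illustrative class counts
d = sum(counts)                        # d: total number of samples

# Exact entropy: Info(D) = -sum (c_i/d) * log2(c_i/d)
exact = -sum((c / d) * math.log2(c / d) for c in counts)

# Log-free form derived above: -1/(d^2 ln 2) * sum c_i (c_i - d)
approx = -sum(c * (c - d) for c in counts) / (d ** 2 * math.log(2))

print(exact, approx)   # ~1.559 vs ~0.945: coarser, but logarithm-free
```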

Finally, the simplified calculation formula of the information gain rate is obtained:
$$GainRatio(A) = \frac{1}{\bar{\rho}} \cdot \frac{\sum_{i=1}^{m} c_i(c_i - d) - d \sum_{j=1}^{v} \sum_{i=1}^{m} \frac{c_{ij}(c_{ij} - d_j)}{d_j}}{\sum_{j=1}^{v} d_j(d_j - d)}$$

It can be seen that, after this improvement, the algorithm no longer needs to calculate logarithms; the remaining calculation is simple and the efficiency is improved. The improved method considers information gain and redundancy at the same time, and the attribute selection is more reasonable.

Result analysis and discussion
Simulation analysis

When the k-means algorithm is applied to cluster analysis, the selection of the initial points directly affects the convergence speed and stability of the algorithm, and there is no fixed starting point. Therefore, in the simulation test, the k initial points are selected to be as far apart as possible: we first select one point at random, then select the point farthest from it as the second starting point, and so on until k starting points are found (a sketch of this seeding follows). The clustering algorithm is simulated and analysed, and the parameter indices are shown in Figure 1. The Dunn coefficient reflects the shortest distance between different classes; it is generally believed that the larger this value, the better the clustering effect. The contour (silhouette) coefficient reflects the average distance between an object and the other objects; the smaller the value, the more compact the cluster. It can be seen from Figure 1 that the contour coefficient and Harabasz index of the improved k-means algorithm have improved while the Dunn coefficient has decreased, indicating that the improved algorithm produces tighter clusters and simpler groupings.
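A sketch of this farthest-point seeding (the function name is ours):

```python
import numpy as np

def farthest_point_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centres = [X[rng.integers(len(X))]]            # first point: random
    while len(centres) < k:
        # Distance from every sample to its nearest already-chosen centre.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centres], axis=0)
        centres.append(X[int(np.argmax(d))])       # farthest sample next
    return np.array(centres)
```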

Fig. 1

Comparison of algorithm index coefficients.

Table 1. Comparative analysis of running time (ms).

Data set   C4.5 algorithm   Improved C4.5 algorithm   This paper's improved algorithm
clean1     51.3             45.9                      11.2
diabetes   494.5            64.2                      10.4
glass      31.4             29.6                      7.6
wine       20.5             17.4                      6.4
sonar      45.8             30.5                      8.6
iris       16.5             19.3                      5.3

Taking the clean1, diabetes, glass, wine, sonar and iris data sets as sample data sets, this paper compares the traditional C4.5 algorithm, the improved C4.5 algorithm and the algorithm improved in this paper. The running times are shown in Table 1. From the data in the table, it can be seen that, under every data set, the running time of the algorithm improved in this paper is shorter than that of the traditional algorithms.

The results of the comparative analysis of classification are shown in Table 2. From the data in the table, it can be seen that, under different data sets, the error rate of the improved algorithm proposed in this paper is lower than that of the other two algorithms, indicating that the algorithm reduces the classification error rate while also shortening the running time.

Table 2. Analysis of error rate results (%).

Data set   C4.5 algorithm   Improved C4.5 algorithm   This paper's improved algorithm
clean1     7.12             6.49                      0.98
diabetes   8.14             4.39                      0.87
glass      19.74            9.34                      0.92
wine       7.34             1.85                      0.94
sonar      4.74             1.29                      0.87
iris       9.78             2.94                      0.64
Vital capacity cluster analysis

The processing of medical indicators is relatively troublesome: the stored information contains a great deal of non-numerical data and involves personal privacy, so a combination of methods must be adopted in data processing. Moreover, many data are stored in different databases within the system, so the relationships between data cannot be ignored. First, the improved clustering algorithm is used to divide the sample data into categories. The improved k-means algorithm divides the data set into six categories, with corresponding centroid values of 63.54, 60.76, 70.54, 76.84, 82.54 and 87.52.

When extracting the data, it is necessary to create files, record the original data, build the data conversion table and list the training and test data sets. We take the vital capacity data of adolescents in the eastern coastal area as the training set and copy it into the working sheet. The information gain is used to select the attributes with the strongest discriminating power; generally, indicators with high gain have stronger predictive ability. Using the automatic screening function of the spreadsheet, the vital capacity indices of non-smoking adolescents were divided into different grades. Taking the measurement results as the test set, the results show that the classification accuracy is 96.2% and the qualified rate of vital capacity is 64.6%.

Conclusion

In recent years, analyses of medical data have generally used existing software, with no change in the parameters used; as a result, the analysis is slow, the data processing is troublesome, the results are relatively one-sided and the information contained in the data cannot be extracted in depth. Against this background, this paper studies the influence of adolescent smoking on physical training vital capacity in the eastern coastal area. A data mining algorithm is established to mine the measured data; different algorithms are adopted for different data, and the shortcomings of the existing algorithms are remedied. The average entropy is used as the optimal threshold, and the correlation coefficient is introduced into the calculation of the information gain rate. The simulation results show that the improved algorithm proposed in this paper retains its advantages across different data sets. Applied to the mining of vital capacity data, the improved k-means algorithm divides the data set into six categories, and the test results show a classification accuracy of 96.2%. The algorithm can classify different data types accurately and improves the reliability of the results. It should be pointed out that this paper mainly improves the discretisation of continuous attributes, and there is as yet no corresponding standard for the selection of the threshold. Moreover, in the field of data mining there are many algorithms beyond the decision tree and k-means algorithms discussed in this paper, such as association rule algorithms, which need further research and analysis.


References

[1] Alinejad-Rokny H, Sadroddiny E, Scaria V. Machine learning and data mining techniques for medical complex data analysis. Neurocomputing, 2017, 276(7): 1–1. doi: 10.1016/j.neucom.2017.09.027

[2] Jad A, Jmba B, Mip A, et al. Use of data mining in the establishment of age-adjusted reference intervals for parathyroid hormone. Clinica Chimica Acta, 2020, 508: 217–220. doi: 10.1016/j.cca.2020.05.030

[3] Wisittipanit N, Pulsrikarn C, Wutthiosot S, et al. Application of machine learning algorithm and modified high resolution DNA melting curve analysis for molecular subtyping of Salmonella isolates from various epidemiological backgrounds in northern Thailand. World Journal of Microbiology and Biotechnology, 2020, 36(7): 1–13. doi: 10.1007/s11274-020-02874-7

[4] Tian X, Xu D, Guo L, et al. Improved local search algorithms for Bregman k-means and its variants. Journal of Combinatorial Optimization, 2021, (10): 1–18. doi: 10.1007/s10878-021-00771-9

[5] Larijani M R, Asli-Ardeh E A, Kozegar E, et al. Evaluation of image processing technique in identifying rice blast disease in field conditions based on KNN algorithm improvement by K-means. Food Science & Nutrition, 2019, 7(12): 3922–3930. doi: 10.1002/fsn3.1251

[6] Tang Y, Zhao T, Huang N, et al. Identification of traditional Chinese medicine constitutions and physiological indexes risk factors in metabolic syndrome: a data mining approach. Evidence-Based Complementary and Alternative Medicine, 2019, 2019: 1–10. doi: 10.1155/2019/1686205

[7] Scaria L T, Christopher T. A bio-inspired algorithm based multi-class classification scheme for microarray gene data. Journal of Medical Systems, 2019, 43(7): 208–220. doi: 10.1007/s10916-019-1353-y

[8] Janowski A M, Ravellette K S, Garcia J G, et al. Unsupervised K-means clustering to identify signatures of right ventricle dysfunction in patients with pulmonary hypertension. Circulation, 2020, 142(3): 17308. doi: 10.1161/circ.142.suppl_3.17308

[9] Shan P. Image segmentation method based on K-mean algorithm. EURASIP Journal on Image and Video Processing, 2018, 2018(1): 1–9. doi: 10.1186/s13640-018-0322-6

[10] Wei P, Zhou Z, Li L, et al. Research on face feature extraction based on K-mean algorithm. EURASIP Journal on Image and Video Processing, 2018, 2018(1): 1–9. doi: 10.1186/s13640-018-0313-7

[11] Vera J F, Macías R. On the behaviour of k-means clustering of a dissimilarity matrix by means of full multidimensional scaling. Psychometrika, 2021, 86(2): 489–513. doi: 10.1007/s11336-021-09757-2

[12] Tuncer S A, Alkan A. Spinal cord based kidney segmentation using connected component labeling and K-Means clustering algorithm. Traitement du Signal, 2019, 36(6): 1–33. doi: 10.18280/ts.360607

[13] Zhou J, Liu T, Zhu J. Weighted adjacent matrix for K-means clustering. Multimedia Tools and Applications, 2019, 78(23): 33415–33434. doi: 10.1007/s11042-019-08009-x

[14] Yu Y, Velastin S A, Yin F. Automatic grading of apples based on multi-features and weighted K-means clustering algorithm. Information Processing in Agriculture, 2020, 7(4): 556–565. doi: 10.1016/j.inpa.2019.11.003

[15] Lakshmi R, Baskar S. DIC-DOC-K-means: dissimilarity-based Initial Centroid selection for DOCument clustering using K-means for improving the effectiveness of text document clustering. Journal of Information Science, 2019, 45(6): 818–832. doi: 10.1177/0165551518816302

[16] Maithri C, Chandramouli H. Implementation of parallelized K-means and K-Medoids++ clustering algorithms on Hadoop MapReduce framework. International Journal of Innovative Technology and Exploring Engineering (IJITEE), 2019, 9(2s): 530–535. doi: 10.35940/ijitee.B1045.1292S19

[17] Li T, et al. Economic granularity interval in decision tree algorithm standardization from an open innovation perspective: towards a platform for sustainable matching. Journal of Open Innovation: Technology, Market, and Complexity, 2020, 6(4): 149. doi: 10.3390/joitmc6040149

[18] Zhang W. Research on English score analysis system based on improved decision tree algorithm and fuzzy set. Journal of Intelligent & Fuzzy Systems, 2020, 39(4): 1–13. doi: 10.3233/JIFS-189046

[19] Wang P, Zhang N, Patnaik S. Decision tree classification algorithm for non-equilibrium data set based on random forests. Journal of Intelligent & Fuzzy Systems, 2020, 39(2): 1639–1648. doi: 10.3233/JIFS-179937

[20] Es Sabery F, Hair A. A MapReduce C4.5 decision tree algorithm based on fuzzy rule-based system. Fuzzy Information and Engineering, 2020, 11(4): 1–28. doi: 10.1080/16168658.2020.1756099

[21] Gohari M, Eydi A M. Modelling of shaft unbalance: modelling a multi discs rotor using K-Nearest Neighbor and decision tree algorithms. Measurement, 2020, 151(C): 107253. doi: 10.1016/j.measurement.2019.107253

[22] Bai X, Bayar, Hasqichig. Remote Sensing Technology and Application, 2014, 29(02): 338–343.
