Journal Details
Format: Journal
eISSN: 2444-8656
First Published: 01 Jan 2016
Publication timeframe: 2 times per year
Languages: English
Open Access

Automatic Knowledge Integration Method of English Translation Corpus Based on Kmeans Algorithm

Published Online: 15 Jul 2022
Volume & Issue: AHEAD OF PRINT
Pages: -
Received: 19 Jan 2022
Accepted: 28 Mar 2022
Abstract

We propose a feature extraction method based on the K-means algorithm that exploits the textual characteristics of an English translation corpus. The method first uses a sparse autoencoder, an unsupervised learning technique, to reduce dimensionality, and then applies the K-means clustering algorithm to cluster the texts. The experimental results show that the text features extracted by the sparse autoencoder combined with the K-means algorithm can be used to cluster the knowledge in an English translation corpus and thereby achieve automatic integration. The method effectively addresses the high dimensionality, sparsity, and noise of the texts in the English translation corpus and significantly improves the accuracy of the clustering results.

Keywords

MSC 2010

Introduction

People's exploration of the general characteristics of translated texts and of the translation process has passed through several stages: empirical research on the translation process, introspective think-aloud protocols, Translog keystroke logging, and comprehensive induction and deduction. This laid the methodological foundation for the later establishment of translation corpora to analyze the characteristics of translations. The advocacy and development of descriptive research methods by polysystem theorists, together with Mona Baker's systematic research on the general characteristics of translation, led to the birth of the translated English corpus. Because of the divergence of people's thinking and the randomness of publishing practices, the structure of short texts in English translation is extremely inconsistent, and the information provided by a single short text is very limited [1]. Processing a large amount of short text therefore suffers from high sparseness. Effectively organizing and analyzing the massive, irregular, and sparse short texts in English translation has become a challenging research hotspot. Aiming at the problems of feature extraction and clustering of short English translation texts, we conduct research based on ideas from deep learning. The article uses autoencoder processing technology and the K-means algorithm to extract the hidden features of short texts. In this way, more accurate clustering results for short English translation texts can be obtained.

Algorithm flow
Basic ideas

The basic idea of the deep noise sparse autoencoder (DSAE) text clustering algorithm is to use the automatic encoding process of deep learning [2]; this also includes a regularization step and a noise-adding step. The algorithm flow is shown in Figure 1.

Figure 1

The basic flow of a noisy, sparse auto-encoded text clustering algorithm

After the short text is cleaned and segmented, the bag of words that constitutes these short texts is obtained [3]. Each short text is represented as
$$ x = (t_1, t_2, t_3, \cdots, t_i, \cdots, t_m) \quad (1) $$
where m is the total number of words in the bag of words and t_i indicates whether the short text contains the i-th word: t_i = 1 if it does and t_i = 0 otherwise.
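The following toy example illustrates the binary representation in equation (1); the vocabulary and the tokenized text are made up for the illustration.

```python
# Toy illustration of the binary bag-of-words vector of equation (1).
vocab = ["corpus", "translation", "cluster", "feature", "noise"]  # m = 5 words
text_tokens = {"translation", "corpus"}                           # one segmented short text
x = [1 if w in text_tokens else 0 for w in vocab]
print(x)  # [1, 1, 0, 0, 0]
```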

Basic autoencoder

The basic autoencoder accepts an input vector x and first applies a linear transformation to it. Under the action of the activation function, an encoding result y is obtained; we choose the sigmoid function s as the activation function, as shown in formula (2). The decoder then maps the encoding result y to the reconstructed vector z, as shown in equation (3):
$$ y = f_\theta(x) = s(Wx + b) \quad (2) $$
$$ z = g_{\theta'}(y) = s(W'y + b') \quad (3) $$

The optimal parameters θ* and θ′* are obtained by minimizing the reconstruction loss, as in equation (4). The loss function used in this paper is the Kullback-Leibler divergence, as in equation (5):
$$ \theta^*, \theta'^* = \mathop{\arg\min}\limits_{\theta,\theta'} L(x,z) = \mathop{\arg\min}\limits_{\theta,\theta'} L(x, g_{\theta'}(f_\theta(x))) \quad (4) $$
$$ L(x,z) = KL(x \| z) \quad (5) $$

The autoencoder is trained with the classic stochastic gradient descent algorithm. In each iteration, the weight matrix is updated using equation (6):
$$ W \leftarrow W - l \frac{\partial L(x,z)}{\partial W} \quad (6) $$

Where l is the learning rate; b and b′ are updated in the same way. The structure of the autoencoder is shown in Figure 2. The encoding-decoding process completes the feature extraction of the text information.
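The following is a minimal numpy sketch of one encode-decode-update step corresponding to equations (2)-(6). It is an illustration under our own assumptions, not the authors' actual Theano implementation; the dimensions, initialization, and variable names are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def kl_loss(x, z, eps=1e-8):
    # Kullback-Leibler divergence between the binary input x and its
    # reconstruction z, summed over components (equation (5)).
    x = np.clip(x, eps, 1 - eps)
    z = np.clip(z, eps, 1 - eps)
    return np.sum(x * np.log(x / z) + (1 - x) * np.log((1 - x) / (1 - z)))

m, h, lr = 5000, 1000, 1e-2                       # input dim, hidden dim, learning rate l
rng = np.random.default_rng(0)
W, b = rng.normal(0, 0.01, (h, m)), np.zeros(h)   # encoder parameters theta
W2, b2 = rng.normal(0, 0.01, (m, h)), np.zeros(m) # decoder parameters theta'

x = (rng.random(m) < 0.01).astype(float)          # toy sparse binary bag-of-words vector

y = sigmoid(W @ x + b)                            # encoding, equation (2)
z = sigmoid(W2 @ y + b2)                          # reconstruction, equation (3)
print("loss before update:", kl_loss(x, z))

# Gradient of the KL loss w.r.t. the decoder pre-activation (for a sigmoid
# output this coincides with the cross-entropy gradient), then one
# stochastic-gradient update per equation (6).
dz = z - x
dy = (W2.T @ dz) * y * (1 - y)                    # backpropagated to the encoder
W2 -= lr * np.outer(dz, y); b2 -= lr * dz
W  -= lr * np.outer(dy, x); b  -= lr * dy
```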

Figure 2

Basic autoencoder structure

L1-norm regularization

The powerful nonlinear expressive ability of the autoencoder can fully describe the unique characteristics of individual objects. However, the structures of short texts differ considerably and contain many idiosyncratic features, so if the autoencoder is applied directly, the extracted feature vectors will not reflect the common distribution characteristics of the short texts [4]. This article therefore restricts the learning ability of the autoencoder.

Following the Lasso idea, this paper uses the absolute value function as a penalty term to compress the coefficients of the autoencoder. In this way, coefficients with small absolute values are automatically compressed to 0, ensuring the sparsity of the parameters and preventing over-learning of insignificant features in short texts. Formula (5) is therefore adjusted to formulas (7) and (8):
$$ L(x,z) = KL(x \| z) + Lasso(\theta) \quad (7) $$
$$ Lasso(\theta) = \lambda \sum\limits_{j=0}^{|\theta|} |\theta_j| \quad (8) $$

λ is the coefficient of the L1-norm penalty. Its value needs to be tuned several times on the actual data; an appropriate value helps the model strike a balance between fitting ability and generalization ability.
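A minimal sketch of the Lasso penalty in equations (7)-(8) follows, added on top of the KL reconstruction loss. The function names and the way the penalty is folded into the update are our own illustrative assumptions; λ = 10⁻⁴ is the value that performs best in the experiments reported later.

```python
import numpy as np

def lasso(params, lam=1e-4):
    # Equation (8): lambda times the sum of absolute parameter values.
    return lam * sum(np.abs(p).sum() for p in params)

def l1_subgradient(p, lam=1e-4):
    # Subgradient of lam * |p|; it drives small coefficients toward 0,
    # which keeps the learned parameters sparse.
    return lam * np.sign(p)

# Illustrative use inside a training step (W, W2 as in the previous sketch):
#   total_loss = kl_loss(x, z) + lasso([W, W2])           # equation (7)
#   W  -= lr * (grad_W  + l1_subgradient(W))
#   W2 -= lr * (grad_W2 + l1_subgradient(W2))
W_demo = np.array([[0.5, -0.0003], [0.02, -1.2]])
print(lasso([W_demo]), l1_subgradient(W_demo))
```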

Noise processing

When the dimension of the hidden (encoding) layer is greater than or equal to that of the input layer, the autoencoder tends to simply copy the sparse input vector and pass it to the decoder. In this way, the goal of extracting abstract features from short texts cannot be achieved.

In response to these problems, the method adopted in this paper is to add a certain amount of noise to the short text vector and then input it into the encoder for training. On the one hand, a portion of the data is forced to 0; on the other hand, a certain percentage of the data is randomly forced to 1. The former accounts for possible missing data in the high-dimensional input vector, which a well-trained autoencoder should be able to restore [5]. The latter accounts for the irregularity of the input in the English translation corpus, since no one can guarantee that the model avoids the influence of personalized or irrelevant input. After adding noise, the input vector x becomes $\tilde x$, and the stochastic gradient descent optimization becomes:
$$ \theta^*, \theta'^* = \mathop{\arg\min}\limits_{\theta,\theta'} L(x,z) = \mathop{\arg\min}\limits_{\theta,\theta'} L(x, g_{\theta'}(f_\theta(\tilde x))) \quad (9) $$
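A minimal sketch of this noise-adding step is shown below: a fraction of the input bits is forced to 0 (simulating missing data) and a smaller fraction is forced to 1 (simulating irrelevant or personalized input). The function name is illustrative; the default probabilities match the first-layer values reported in the experiments (0.3 and 0.03).

```python
import numpy as np

def corrupt(x, p_zero=0.3, p_one=0.03, rng=None):
    # Force a fraction p_zero of the entries to 0 (masking noise) and a
    # fraction p_one to 1 ("salt" noise); the clean x is kept for the loss.
    rng = rng or np.random.default_rng(0)
    x_tilde = x.copy()
    x_tilde[rng.random(x.shape) < p_zero] = 0.0
    x_tilde[rng.random(x.shape) < p_one] = 1.0
    return x_tilde

x = (np.random.default_rng(1).random(20) < 0.3).astype(float)
x_tilde = corrupt(x)   # x_tilde is fed to the encoder; the loss in (9) still uses x
print(x, x_tilde, sep="\n")
```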

We stack multiple noisy sparse autoencoders to form a deep learning network. During training, the input of the k-th layer is the short text vector output by the encoder of the (k-1)-th layer. The k-th layer continuously adjusts its parameters by minimizing the loss function so that its input and the decoder's reconstruction are as close as possible. After the optimum is reached, the k-th layer discards its decoder and passes the abstracted low-dimensional feature vector output by its encoder to the (k+1)-th layer as input. Repeating this process constitutes a short text feature extraction model based on a deep noise sparse encoder [6], and a structural sketch of the layer-wise training follows below. Its structure is shown in Figure 3.
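The following structural sketch shows the layer-wise stacking described above. `train_dsae_layer` is a hypothetical helper (not from the paper) that trains a single noisy sparse autoencoder and returns its encoder function; the layer sizes and noise probabilities are the values reported in the experimental section.

```python
# Layer-wise stacking: the encoder output of layer k-1 becomes the training
# input of layer k; decoders are discarded after each layer is trained.
layer_sizes = [5000, 3000, 1000]   # hidden-layer sizes used in the experiments
noise_zero  = [0.3, 0.2, 0.1]      # per-layer set-to-0 probabilities
noise_one   = [0.03, 0.02, 0.01]   # per-layer set-to-1 probabilities

def train_stack(X, train_dsae_layer):
    encoders, H = [], X            # H: current representation of all short texts
    for size, p0, p1 in zip(layer_sizes, noise_zero, noise_one):
        encode = train_dsae_layer(H, n_hidden=size, p_zero=p0, p_one=p1)
        encoders.append(encode)
        H = encode(H)              # low-dimensional output feeds the next layer
    return encoders, H             # H: final 1000-dimensional feature matrix
```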

Figure 3

Short text deep learning structure

Short text clustering

After the autoencoder processing described above, together with the regularization and noise-adding steps, low-dimensional feature vectors of the short texts are obtained. This lays the foundation for further text mining. This article applies these results to the cluster analysis of short texts and discusses their impact on the clustering effect.

The K-means algorithm is a simple and efficient data clustering algorithm that is widely used in text clustering. This paper uses K-means to randomly select K vectors from the low-dimensional short text feature vectors obtained by training as the initial cluster centers. Each remaining short text vector is assigned to the nearest cluster according to its distance from the cluster centers [7]. The mean of each cluster is then recalculated, and the new cluster centers are used to reassign every short text vector. This is repeated until the assignment of the short text vectors no longer changes, yielding the final clustering result of the short texts.
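A minimal sketch of this clustering step follows. It assumes the feature matrix H produced by the stacked encoder (1000-dimensional in the experiments; a small random placeholder is used here) and uses scikit-learn's KMeans for illustration, since the paper does not name a specific K-means implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
H = rng.random((300, 50))          # placeholder for the learned low-dimensional features

# Three clusters, matching the three categories (IT, Finance, Health) in the experiments.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(H)
labels = kmeans.labels_            # cluster assignment of each short text
print(np.bincount(labels))
```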

Experiment and result analysis
Experimental data

This article selects representative English translation corpus data as the analysis object and applies the short text clustering algorithm based on the deep noise sparse autoencoder for unsupervised learning [8]. The results are then compared with manual annotations and with existing similar research to verify the algorithm's effectiveness. The data come from the English translation corpus on the big data sharing platform Datang.

Evaluation Index

This paper uses Entropy and Precision to evaluate the effectiveness of the algorithm. Entropy measures the purity of a clustering result: the smaller the value, the higher the purity of the clusters. Entropy is calculated as in equations (10) and (11):
$$ Entropy = -\sum\limits_{k=1}^{G'} \frac{|A_k|}{N} \sum\limits_{j=1}^{G} p_{jk} \times \log(p_{jk}) \quad (10) $$
$$ p_{jk} = \frac{1}{|A_k|} \left| \{ d_i \mid label(d_i) = c_j \} \right| \quad (11) $$

Where G′ represents the number of clusters obtained by the algorithm and G represents the actual number of categories. A_k represents one of the clusters. The actual label of each English translation d_i ∈ A_k, i = 1, …, |A_k|, is label(d_i), and its value equals one of the standard class labels c_j (j = 1, …, G).

Precision measures the accuracy of the clustering results: the larger the value, the higher the quality of the clustering. It takes the true class that appears most frequently in a cluster as the label of that cluster, so the Precision of each cluster equals the proportion of documents belonging to that majority class, as in formula (12):
$$ Precision(A_k) = \frac{1}{|A_k|} \max_j \left| \{ d_i \mid label(d_i) = c_j \} \right| \quad (12) $$

The total Precision of the clustering results is the weighted average of the Precision of all clusters:
$$ Precision = \sum\limits_{k=1}^{G'} \frac{|A_k|}{N} Precision(A_k) \quad (13) $$
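The sketch below computes both measures, equations (10)-(13), from predicted cluster labels and true category labels. The function name and the small toy example are illustrative, and the natural logarithm is assumed since the paper does not state the log base.

```python
import numpy as np

def entropy_and_precision(clusters, truth):
    clusters, truth = np.asarray(clusters), np.asarray(truth)
    N, ent, correct = len(truth), 0.0, 0
    for k in np.unique(clusters):
        members = truth[clusters == k]                  # true labels inside cluster A_k
        p = np.bincount(members) / len(members)         # p_jk over true classes, eq. (11)
        p = p[p > 0]
        ent += len(members) / N * -(p * np.log(p)).sum()  # weighted cluster entropy, eq. (10)
        correct += np.bincount(members).max()             # majority class count, eq. (12)
    return ent, correct / N                                # eq. (13): weighted precision

# Toy example: 9 documents, 3 clusters, 3 true categories.
clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2]
truth    = [0, 0, 1, 1, 1, 2, 2, 2, 2]
print(entropy_and_precision(clusters, truth))  # lower entropy / higher precision = better
```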

Experimental process

We preprocess the data in the English translation corpus. The links appearing in the translations are all automatically generated short links, which do not help the analysis of the content, and machine-generated accounts often deliberately add many irrelevant links. We therefore remove the short link strings and keep only the "HTTP" keyword. We then correct wrong words in the translations using a list of common typos. Finally, we use the NLPIR2014 Chinese word segmentation system to segment the cleaned text [9]. Combining the content characteristics of the English translations, we establish a stop word list and remove the stop words from the segmentation results to obtain the bag of words for the English translation corpus data.

The experiment is divided into 4 parts: (1) we cluster the text feature vectors directly; (2) we expand the text feature vectors with Wikipedia entry content before clustering; (3) we construct the feature vectors by introducing a semantic dictionary and merging similar words, and then cluster; (4) we apply the method of this paper (processing with the deep noise sparse autoencoder) before clustering. All four parts use the K-means algorithm for clustering.

Word segmentation yields 17,171 different words. Word frequency statistics show that 6,698 words appear only once, which contributes little to analyzing the similarity between short texts. The first 5,000 words can fully reflect the content of the short texts, and additional words do not influence the experimental results. Therefore, this experiment selects the first 5,000 words as the feature set of each English translation. According to the method described above, the corresponding space vectors are established (see the sketch below), and then the K-means algorithm is used for clustering directly.
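A minimal sketch of this feature construction step follows, assuming that "the first 5,000 words" means the 5,000 most frequent words in the segmented corpus; the function name and the tiny document set are illustrative.

```python
from collections import Counter

def build_vectors(tokenized_texts, n_features=5000):
    # Keep the n_features most frequent words as the feature set and build
    # binary space vectors as in equation (1).
    counts = Counter(w for text in tokenized_texts for w in text)
    vocab = [w for w, _ in counts.most_common(n_features)]
    vectors = []
    for text in tokenized_texts:
        present = set(text)
        vectors.append([1 if w in present else 0 for w in vocab])
    return vocab, vectors

docs = [["corpus", "translation"], ["cluster", "corpus"], ["noise", "feature", "corpus"]]
vocab, vectors = build_vectors(docs, n_features=5)
print(vocab, vectors)
```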

We first use Wikipedia entry content to expand the text feature vectors. We download the Chinese Wikipedia dump released on August 23, 2020, import it into a MySQL database, and then use Lucene to build an index of the Wikipedia content. Chinese text has no spaces between words to act as natural delimiters, so submitting an entire short text directly as a Wikipedia query usually returns no results. Instead, we search for each word in each English translation's word bag as a keyword to obtain Wikipedia documents that expand the information [10]. The segmentation results of these documents are combined with the original bag of words, and the information gain value is used to rank the words. The first 5,000 words are taken as the feature set of each English translation and a space vector is built. Finally, K-means is used for clustering.

The deep noise sparse autoencoder method proposed in this paper is implemented with the Python-based Theano library. As the number of hidden layers increases, the training time grows rapidly; after repeated tuning, this experiment selects a deep learning structure with three hidden layers of 5000, 3000, and 1000 nodes, respectively. When the set-to-0 and set-to-1 noise probabilities are too large, the autoencoder produces very high errors because too much information is missing. In addition, the experiments show that better results are obtained when the set-to-0 and set-to-1 probabilities decrease layer by layer, and the set-to-1 probability should not be too large. In the end, the probabilities of adding 0 noise are 0.3, 0.2, and 0.1, and the probabilities of adding 1 noise are 0.03, 0.02, and 0.01, respectively. The penalty coefficients of the L1-norm regularization tested in the experiment are 10−1, 10−2, …, 10−8, among which 10−4 performs best. The learning rate of the stochastic gradient descent algorithm is also important: if it is too large, the model easily falls into a poor local optimum; if it is too small, training takes too long. The experiment tests 8 different values, 10−1, 10−2, …, 10−8, and finds that a learning rate of 10−2 gives the best result.

Experimental results and analysis

The clustering results obtained by the above four methods are shown in Tables 1 to 4, and the corresponding information entropy and accuracy are shown in Figures 4 and 5. K-means denotes the result without any processing, Wiki+K-means the result with Wikipedia expansion, Cilin+K-means the result with the synonym thesaurus, and DSAE+K-means the result processed by the deep noise sparse autoencoder. Combining the two measures, information entropy and accuracy, the four methods rank from worst to best as K-means, Wiki+K-means, Cilin+K-means, and DSAE+K-means.

Table 1

K-means clustering results

Clustering result IT Finance Healthy
Cluster 1 777 301 286
Cluster 2 364 1072 348
Cluster 3 359 127 866

Table 2

Wiki+K-means clustering results

Clustering result IT Finance Healthy
Cluster 1 926 233 237
Cluster 2 384 1026 167
Cluster 3 190 241 1096

Table 3

Cilin+K-means clustering results

Clustering result IT Finance Healthy
Cluster 1 1120 192 194
Cluster 2 134 1278 207
Cluster 3 246 30 1099

Table 4

DSAE+K-means clustering results

Clustering result IT Finance Healthy
Cluster 1 1317 64 103
Cluster 2 23 1433 225
Cluster 3 160 3 1172

Figure 4

Information entropy of clustering

Figure 5

Accuracy of clustering

The experimental results show that directly using K-means performs the worst: its overall information entropy is 0.457 and its weighted accuracy is only 61.4%. This shows that the high dimensionality and sparsity of short texts severely weaken the traditional space vector method, to the point that it loses practical value [11]. Wiki+K-means is only slightly better than K-means. The main reason is that the short text data used in previous studies are news texts, whereas the vocabulary of English translations is more casual and variable than that of news. When Wikipedia is used to expand the words in an English translation, some words can be matched accurately to the corresponding entries, but for the other two categories of translations the retrieved entries may differ considerably from the content the author intended, so the clustering effect cannot be improved. The overall clustering effect of Cilin+K-means is reasonable, which shows that merging word frequencies of synonyms to reduce the dimensionality of the short text vectors is relatively effective, although it still cannot achieve the best clustering effect. Moreover, it places higher demands on the thesaurus, and out-of-vocabulary words strongly affect the method. DSAE+K-means achieves the best results among the four methods, with an overall information entropy of 0.207 and an overall accuracy of 87.8%; the purity of its clustering results is the highest and its accuracy is the best. This shows that the deep noise sparse autoencoder can use its nonlinear characteristics to learn low-dimensional abstract features from high-dimensional underlying features, digging out the essence of the short text space vectors and significantly improving clustering performance.

Conclusion

This paper proposes a deep noise sparse autoencoder algorithm that addresses the high dimensionality and sparsity of the space vectors of an English translation corpus. We add L1-norm regularization to avoid overfitting and control sparsity, and we add set-to-0 and set-to-1 noise to the input data. This effectively extracts the essential features of the data while reducing the dimensionality of the short text space vectors, and it achieves a good clustering effect in the cluster analysis of short texts.


References

Lydia, E. L., Kumar, P. K., Shankar, K., Lakshmanaprabu, S. K., Vidhyavathi, R. M., & Maseleno, A. Charismatic document clustering through novel K-Means non-negative matrix factorization (KNMF) algorithm using key phrase extraction. International Journal of Parallel Programming, 2020; 48(3): 496–514. doi:10.1007/s10766-018-0591-9

Santoso, J., Setiawan, E. I., Yuniarno, E. M., Hariadi, M., & Purnomo, M. H. Hybrid Conditional Random Fields and K-Means for Named Entity Recognition on Indonesian News Documents. International Journal of Intelligent Engineering and Systems, 2020; 13(3): 233–245. doi:10.22266/ijies2020.0630.22

Sengupta, S., Pandit, R., Mitra, P., Naskar, S. K., & Sardar, M. M. Word sense induction in Bengali using parallel corpora and distributional semantics. Journal of Intelligent & Fuzzy Systems, 2019; 36(5): 4821–4832. doi:10.3233/JIFS-179030

Kryštufek, B., Shenbrot, G., Klenovšek, T., & Janžekovič, F. Geometric morphometrics of mandibular shape in the dwarf fat-tailed jerboa: relevancy for trinomial taxonomy. Zoological Journal of the Linnean Society, 2021; 192(4): 1363–1372. doi:10.1093/zoolinnean/zlaa130

Wibawa, A. P., Fithri, H. K., Zaeni, I. A. E., & Nafalski, A. Generating Javanese Stopwords List using K-means Clustering Algorithm. Knowledge Engineering and Data Science, 2020; 3(2): 106–111. doi:10.17977/um018v3i22020p106-111

Fang, H., Shi, H., & Zhang, J. Heuristic Bilingual Graph Corpus Network to Improve English Instruction Methodology Based on Statistical Translation Approach. Transactions on Asian and Low-Resource Language Information Processing, 2021; 20(3): 1–14. doi:10.1145/3406205

Hsieh, T. J. T., Kuriki, I., Chen, I. P., Muto, Y., Tokunaga, R., & Shioiri, S. Basic color categories in Mandarin Chinese revealed by cluster analysis. Journal of Vision, 2020; 20(12): 6–6. doi:10.1167/jov.20.12.6

Nurhachita, N., & Negara, E. S. A Comparison Between Naïve Bayes and The K-Means Clustering Algorithm for The Application of Data Mining on The Admission of New Students. Jurnal Intelektualita: Keislaman, Sosial dan Sains, 2020; 9(1): 51–62. doi:10.19109/intelektualita.v9i1.5574

Yang, M., Liu, S., Chen, K., Zhang, H., Zhao, E., & Zhao, T. A hierarchical clustering approach to fuzzy semantic representation of rare words in neural machine translation. IEEE Transactions on Fuzzy Systems, 2020; 28(5): 992–1002. doi:10.1109/TFUZZ.2020.2969399

Çitil, H. Important Notes for a Fuzzy Boundary Value Problem. Applied Mathematics and Nonlinear Sciences, 2019; 4(2): 305–314. doi:10.2478/AMNS.2019.2.00027

Sharifi, M., & Raesi, B. Vortex Theory for Two Dimensional Boussinesq Equations. Applied Mathematics and Nonlinear Sciences, 2020; 5(2): 67–84. doi:10.2478/amns.2020.2.00014
