Journal information
License
Format
Journal
eISSN
2444-8656
First published
01 Jan 2016
Publishing frequency
2 times per year
Languages
English
Open Access

Professional English Translation Corpus Under the Binomial Theorem Coefficient

Published: 15 Jul 2022
Volume & Issue: AHEAD OF PRINT
Page range: -
Received: 07 Mar 2022
Accepted: 08 May 2022
Introduction

Text data mining is the process of discovering hidden information and new knowledge in large amounts of unstructured text. It has high commercial value and is closely related to natural language processing; its most critical step is extracting helpful text features. Common mining methods include text classification, clustering, association analysis, and trend prediction, and the maturing of statistical algorithms has driven the continuous development of text mining technology. Computing text similarity is the key and foundation of mining other text data, and it is receiving more and more attention [1]. A translation is also a text. By comparing the similarity of translated texts, we can provide critical quantitative indicators for translation quality assessment and translation comparison.

Study Design
Research questions

The translation of English passive sentences is critical in developing online machine translation [2]. What is the similarity between online machine translation and human translation in translating English passive sentences into Chinese passive sentences? What method can be used to measure this similarity objectively and scientifically? These are the two research questions of this paper.

Study Design

Text data mining methods are well suited to this task. Commonly used techniques include text classification, clustering, feature extraction, and information compression. The vector space model (VSM) is one of the most frequently used models, and information compression usually relies on principal component analysis (PCA). This study combines VSM and PCA to measure and compare the similarity between online machine translation and human translation in the corresponding translated passive sentences. In this way, a new method can be found to quantify the similarity of the translation of particular sentence patterns across versions [3]. This study builds a parallel corpus and uses data mining, combining quantitative and qualitative analysis, to compare the similarity between online machine translation and human translation in the translation of passive sentences.

Research methods
Corpus selection

The corpus selected in this paper is a self-built one-to-five English-Chinese bilingual parallel corpus of Pride and Prejudice, with a total capacity of about 1.1 million words. It includes the complete original English text, one online machine translation, and four human translations [4]. The human translations are Document A (Shanghai Translation Publishing House, 1980), Document B (Yilin Publishing House, 1985), Document C (with Zhang Yanghe, People's Literature Publishing House, 1995), and Document D (Zhejiang Literature and Art Publishing House, 2004). Baidu online translation was selected as the online translation in this study; it was generated automatically through the website http://fanyi.baidu.com/. These translations were processed both manually and by computer: the corpora of the human and online translations were cleaned and aligned in parallel, yielding a one-to-five parallel corpus.

The primary research process

Traditional VSM feature extraction is based mainly on the frequency of high-frequency words, which ignores semantic and syntactic structure; the usual high-frequency words are primarily function words such as auxiliaries. Corresponding translation studies of passive sentences need to integrate vocabulary, syntax, and semantics. Therefore, the features selected in this study are the multiple types of passive marker words in the corresponding translations [5]. We use the parallel retrieval software CUC_Paraconc3.0 to extract English passive sentences and their corresponding Chinese translations from the English-Chinese bilingual parallel corpus, and remove undesired items from the search results. In this way, a multi-version parallel corpus containing only passive sentences is constructed. We extract matching sentence pairs by editing regular expressions for English passive sentences and then statistically compare the specific data of the corresponding translations in each version, exploring the similarity between online and human translations. At the same time, we propose a similarity calculation formula that combines PCA and spatial vector distance, improving the traditional distance formula, and use the new space vector distance formula to calculate the similarity between translations. This achieves an objective, quantitative description of the similarity of passive sentence translation between versions.

Vector space model
Vector cosine distance

VSM is often used in the field of information retrieval; it is highly operable and computable. The main principle of VSM is to represent documents as vectors, taking the weights of the extracted feature items as the dimensions of the vector. This paper calculates similarity through word frequency statistics and vector dimensionality reduction, a standard measure of text similarity [6]. Vector similarity is usually expressed by the cosine of the angle between vectors: the larger the cosine value, the smaller the angle, indicating a higher similarity between the two. So VSM can be used to calculate the similarity between texts. The feature vector of each translation corresponds to a point in the space, and the relative positions of these points determine the similarity between translations. The cosine of the angle between different translation vectors can therefore be used to characterize their similarity.

First, n features are extracted from each document. These n features form a feature space, and each text can be represented as an n-dimensional vector whose components are the weights of the corresponding features in the text. Assuming the feature vector of one text is x = (x₁, x₂, ⋯, x_N) and that of the other is y = (y₁, y₂, ⋯, y_N), the cosine of the angle between the two text vectors is:

$$\cos(\theta) = \frac{\langle x, y\rangle}{|x|\,|y|} = \frac{\sum_{i=1}^{N} x_i y_i}{\sqrt{\sum_{i=1}^{N} x_i^2}\;\sqrt{\sum_{i=1}^{N} y_i^2}} \tag{1}$$

Here ⟨·, ·⟩ represents the inner product operation and |·| the length of a vector; the cosine of the angle equals the inner product of the unit vectors of x and y. The larger the cos(θ) value, the more similar the two texts are [7]; the smaller the cos(θ) value, the more significant the difference between them. This paper uses the angle cosine distance (CD). Defining d_c as the CD between the feature vectors of two texts:

$$d_c = 1 - \cos(\theta) \tag{2}$$
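As a minimal sketch of this distance (Python; the function name is ours, not the paper's):

```python
import math

def cosine_distance(x, y):
    """Angle cosine distance d_c = 1 - cos(theta) between two feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)
```

Vectors pointing in the same direction give d_c = 0; orthogonal vectors give d_c = 1.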

Vector Euclidean Distance

Euclidean distance (ED) refers to the absolute distance between two points in space. It is determined by the position of each point and reflects the absolute difference of numerical values, which aligns with the traditional cognitive concept of distance [8]. Accordingly, we also use ED to estimate the similarity between two translations. For two text vectors x = (x₁, x₂, ⋯, x_N) and y = (y₁, y₂, ⋯, y_N), the ED between them is:

$$d_e = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_N - y_N)^2} = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2} \tag{3}$$

The larger the Euclidean distance d_e, the farther apart the two vectors and the smaller the similarity between the two texts.
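The Euclidean distance can be sketched the same way (again a hypothetical helper, not code from the paper):

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance d_e between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```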

Improved vector distance

The vector CD captures the difference in direction between two vectors, while the ED reflects the difference in magnitude [9]. A similarity measure should account for both direction and magnitude. Therefore, this study proposes a method that combines the two distances.

Because the CD and ED have different scales, we must first normalize the two distances. The normalized CD d_cn and ED d_en are:

$$d_{cn} = d_c / \max d_c \tag{4}$$

$$d_{en} = d_e / \max d_e \tag{5}$$

We then perform a weighted average of the two distances, constructing a new distance formula:

$$d = (d_{cn} + d_{en})/2 \tag{6}$$
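Combining the two: for a set of translation vectors, compute all pairwise CDs and EDs, normalize each by its maximum, and average. A sketch under the same assumptions (helper names are ours):

```python
import math

def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def combined_distances(vectors):
    """Pairwise combined distance d = (d_cn + d_en) / 2, each term normalized by its max."""
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    dc = {p: cosine_distance(vectors[p[0]], vectors[p[1]]) for p in pairs}
    de = {p: euclidean_distance(vectors[p[0]], vectors[p[1]]) for p in pairs}
    max_dc, max_de = max(dc.values()), max(de.values())
    return {p: (dc[p] / max_dc + de[p] / max_de) / 2 for p in pairs}
```

A pair that maximizes both component distances gets d = 1; all other pairs lie in [0, 1].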

Application of principal component analysis

Suppose there are M samples (translations), represented in vector form as X = [x₁ x₂ ⋯ x_M]. Each sample is described by N extracted features, x_i = [x_i1 x_i2 ⋯ x_iN]^T, where [·]^T stands for transpose. The basic process of PCA is as follows:

Data Normalization

Since the extracted features may have different dimensions, their numerical values can differ greatly, which would affect the results. We therefore eliminate the differences in order of magnitude by normalizing the original data, i.e., transforming it into standardized data with mean 0 and variance 1. (When the original features share the same dimension or differ only slightly in value, the zero-mean operation alone suffices.)

$$\tilde{x}_{ik} = (x_{ik} - \bar{x}_i)/\sigma_i \tag{7}$$

where $\bar{x}_i = \left(\sum_{k=1}^{M} x_{ik}\right)/M,\; i = 1, 2, \cdots, N$ is the mean of the i-th feature across the samples, and $\sigma_i^2 = \left[\sum_{k=1}^{M} (x_{ik} - \bar{x}_i)^2\right]/M$ is the variance of the i-th feature.
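The standardization step, as a sketch (pure Python; the helper name is ours; X holds M sample rows of N features):

```python
def standardize(X):
    """Scale each feature to mean 0 and variance 1 across the M samples."""
    M, N = len(X), len(X[0])
    means = [sum(row[i] for row in X) / M for i in range(N)]
    stds = [(sum((row[i] - means[i]) ** 2 for row in X) / M) ** 0.5 for i in range(N)]
    # A constant feature (std 0) carries no information; map it to zeros.
    return [[(row[i] - means[i]) / stds[i] if stds[i] else 0.0 for i in range(N)]
            for row in X]
```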

Computing the covariance matrix
$$C_x = E(\tilde{x}\tilde{x}^T) = \frac{1}{M}\sum_{k=1}^{M} \tilde{x}_k \tilde{x}_k^T \tag{8}$$
Computing Eigenvalues and Eigenvectors

We calculate the eigenvalues λ_i of the covariance matrix C_x and sort them in descending order, λ₁ ≥ λ₂ ≥ ⋯ ≥ λ_N ≥ 0 (since C_x is non-negative definite). We then solve for the corresponding unit eigenvectors u₁, u₂, ⋯, u_N to obtain the eigenvector matrix U = (u₁, u₂, ⋯, u_N).

Calculating the Compression Dimension

The principal components are the projections of the original data onto a new coordinate system chosen according to the principle of maximum variance, and each eigenvalue reflects the variance of its component [10]. We introduce the cumulative contribution of the eigenvalues:

$$T_p = \sum_{j=1}^{p}\lambda_j \Big/ \sum_{j=1}^{N}\lambda_j \tag{9}$$

We set T_p ≥ 85%. In this way, we retain at least 85% of the information in the original data, take the smallest p that satisfies this condition, and extract p principal components.
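Choosing the compression dimension p can be sketched as follows (helper name is ours):

```python
def num_components(eigenvalues, threshold=0.85):
    """Smallest p whose cumulative contribution T_p reaches the threshold.
    `eigenvalues` must already be sorted in descending order."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for p, lam in enumerate(eigenvalues, start=1):
        cumulative += lam
        if cumulative / total >= threshold:
            return p
    return len(eigenvalues)
```

For eigenvalues [8.0, 1.5, 0.5] and the 85% threshold, the first component alone covers only 80%, so p = 2.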

Computing principal components

The extracted principal components are obtained by projecting the zero-mean data onto the coordinates of the first p eigenvectors U_p = [u₁ u₂ ⋯ u_p]:

$$Y = U_p^T \tilde{x} \tag{10}$$

The principal components extracted by PCA are also applicable to the vector space model. Therefore, the translated text vector can be reduced to a principal-component feature vector [11]. We then use the vector distance formula (6) to calculate the actual distance between two translations, comprehensively measuring their translation similarity.
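The whole reduction pipeline (zero-mean, covariance, eigen-decomposition, projection) can be sketched with NumPy; this is our illustration under the assumptions above, not the paper's code:

```python
import numpy as np

def pca_reduce(X, threshold=0.85):
    """Project zero-mean data onto the first p eigenvectors of the covariance matrix."""
    X = np.asarray(X, dtype=float)        # shape (M samples, N features)
    Xc = X - X.mean(axis=0)               # zero-mean each feature
    C = np.cov(Xc, rowvar=False)          # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh: ascending eigenvalues for symmetric C
    order = np.argsort(eigvals)[::-1]     # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    p = int(np.searchsorted(cumulative, threshold) + 1)
    return Xc @ eigvecs[:, :p]            # Y = U_p^T x~ for every sample
```

Perfectly correlated features collapse to a single component; uncorrelated features of equal variance are both retained.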

Results and Discussion

After the above steps, we extracted 1524 valid passive-structure sentences from the English original. However, the numbers of passive sentences in the Chinese translations differ greatly: the counts of passive sentences in English and Chinese are unequal and asymmetrical. Research shows that only some English passive sentences are rendered as Chinese passive sentences with passive markers. From this point of view, how English passive sentences are translated into Chinese passive sentences can reflect the characteristics of different translations.

Comparison of the similarity of passive sentences between online machine translation and human translation

In this study, 18 types of corresponding translated Chinese passive sentences were extracted; the specific data are shown in Table 1. We treat each translation as an 18-dimensional text vector, which vectorizes the translations. The eigenvalues and eigenvectors are then obtained by principal component analysis. We arrange the eigenvalues in descending order, calculate the cumulative amount of information, and extract the first few principal components that contribute the most. Normally, the cumulative contribution rate of the principal components needs to reach at least 85%; this paper takes a threshold of 90% in order to retain more information. The variance contribution rate of the first principal component is 87.77%, and that of the second is 11.59%, so the cumulative variance contribution rate of the first two principal components reaches 99.36%. This preserves most of the information content of the original text vectors, so we can replace the original 18 dimensions with the two new principal components. The correlation coefficient between the two new principal components is 0, indicating that they are mutually orthogonal; this removes the correlation and collinearity among the original features and confirms the correctness of the algorithm. Using these two principal components, we establish a new two-dimensional orthogonal coordinate system, and the spatial position of each translation is shown in Figure 1.

Table 1. Various types of Chinese passive sentences corresponding to English passive sentence pairs

Serial number Literature A Literature B Literature C Literature D Artificial average Online machine translation
1 27 23 18 28 24 247
2 6 2 0 5 3.25 0
3 16 29 32 16 23.25 2
4 21 23 21 1 16.5 4
5 5 4 5 2 4 1
6 27 26 25 11 22.25 10
7 1 1 0 1 0.75 1
8 17 13 19 7 14 28
9 3 3 6 1 3.25 2
10 36 33 44 22 33.75 18
11 5 2 3 1 2.75 0
12 4 1 2 0 1.75 0
13 1 3 2 0 1.5 0
14 0 0 0 0 0 9
15 3 4 2 0 2.25 1
16 4 7 8 3 5.5 8
17 4 3 4 4 3.75 2
18 19 11 11 12 13.25 2
Total 199 188 202 114 175.75 335

Figure 1

Principal Component Plot

The distance between translations is determined by their respective spatial locations. As Figure 1 shows, the four human translations are relatively concentrated in space, which demonstrates that principal component analysis can successfully classify and cluster the two types of translations. The line connecting each translation's position to the origin constitutes that translation's text vector. The distances and included angles between the four human translations are minimal, showing that the similarity among human translations is high. By contrast, the distance between the human translations and the online machine translation is relatively large, as is the angle, showing that a big gap remains between online machine translation and human translation. Thus, by applying principal component analysis to the features of the corresponding passive sentence translations in each version, we can distinguish and classify human translations and online machine translations. This reveals a significant difference between online machine translation and human translation in the corresponding translation of English passive sentences.

Vector distance between two types of translations

We have extracted two principal components by dimensionality reduction. Its cumulative contribution rate reached 99.36%. This explains the vast majority of the amount of information. Therefore, we can project the original 18-dimensional text vector into a 2-dimensional space, and each translation can be reduced to a two-dimensional vector (Table 2).

Table 2. Principal component eigenvectors

Literature A Literature B Literature C Literature D Artificial average Online machine translation
21.6985 18.992 13.8412 24.944 19.5938 236.525
38.9658 41.4521 49.8598 22.1225 38.3195 −14.8382

Using the two-dimensional principal component vectors and the modified vector distance formula (6), we obtain the actual vector distances between the translations (Table 3).

Table 3. Distances between passive sentence vectors in each translation

Literature A Literature B Literature C Literature D Artificial average Online machine translation
Literature A 0
Literature B 0.0147 0
Literature C 0.0502 0.0147 0
Literature D 0.0725 0.0502 0.0263 0
Artificial average 0.0051 0.0113 0.044 0.0907 0
Online machine translation 0.7276 0.9017 1 0.6271 0.753 0
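As a worked check, the combined-distance computation over the six translations can be reconstructed from the two-dimensional vectors in Table 2 (Python; this is our reconstruction, not the paper's code):

```python
import math

# Two-dimensional principal component vectors from Table 2
# (Documents A-D, the artificial average, and the online machine translation).
vectors = {
    "A":      (21.6985, 38.9658),
    "B":      (18.992,  41.4521),
    "C":      (13.8412, 49.8598),
    "D":      (24.944,  22.1225),
    "avg":    (19.5938, 38.3195),
    "online": (236.525, -14.8382),
}

def cosine_distance(x, y):
    dot = x[0] * y[0] + x[1] * y[1]
    return 1.0 - dot / (math.hypot(*x) * math.hypot(*y))

def euclidean_distance(x, y):
    return math.hypot(x[0] - y[0], x[1] - y[1])

names = list(vectors)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
dc = {p: cosine_distance(vectors[p[0]], vectors[p[1]]) for p in pairs}
de = {p: euclidean_distance(vectors[p[0]], vectors[p[1]]) for p in pairs}
max_dc, max_de = max(dc.values()), max(de.values())
d = {p: (dc[p] / max_dc + de[p] / max_de) / 2 for p in pairs}
```

Because the C-online pair maximizes both the cosine and the Euclidean distance, its combined distance is exactly 1, consistent with Table 3, while the human-human pairs come out far smaller.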

The vector distance values all lie between 0 and 1: the larger the value, the farther the distance, and the less similar the corresponding passive sentence translations. The distances are symmetric (the distance from text vector x to y equals the distance from y to x), so only the lower triangle of distance values is listed in Table 3. Table 3 shows that the distance between the online translation and each human translation is above 0.6, while the distances between human translations are basically below 0.2. This demonstrates a large difference between online machine translation and human translation. In addition, the vector distance between the Document B and Document A translations is only 0.0147, showing that these two translators have the highest similarity in translating passive sentences. Table 1 confirms that the data structures and patterns of these two translations are also relatively consistent.

The distance between the Document D version and the online version is 0.6271, the minimum distance between a human translation and the online translation. The maximum distance between human translations is 0.1866; the two values differ by roughly a factor of three, indicating that the online version is less similar to the human versions than the human versions are to one another. Online machine translation is constrained by its bilingual corpus and statistical rules, so it cannot be as flexible as a human translator when translating English passive sentences. It produces a large number of uniform "bei" (被) passive constructions, some of which do not conform to Chinese pragmatic conventions and are less readable; this suggests that machine translation needs further improvement. The human translations are all very close to the Document A version, so the artificial average is also closest to Document A. This can also be seen from the line distances and trends of the corresponding translations in Figure 2. The point corresponding to the 0.6271 distance between Document D and the online machine translation is a central turning point, showing that the similarity between these two in passive sentence translation is significantly higher than that between the other human translations and the machine translation. Conversely, Document C is farthest from the machine translation (distance 1), showing that their passive sentence translations are the least similar.

Conclusion

Based on corpus translation studies and text data mining methods, this study proposes a vector distance measurement method that combines the vector space model and principal component analysis. The study found that Chinese passive sentences are significantly fewer than English ones: there is a significant asymmetry in the translation of English-Chinese passive sentences. We used text data mining to compare the similarity of the passive sentence translation features of the two types of translations. The results show that the distances between the human and online translations are larger, while the distances among the human translations are smaller. This indicates that there is still a big gap between the quality of online translation and human translation.


References

[1] Lockwood, E., Caughman, J. S., & Weber, K. An essay on proof, conviction, and explanation: Multiple representation systems in combinatorics. Educational Studies in Mathematics, 2020; 103(2): 173–189. doi:10.1007/s10649-020-09933-8

[2] Weng, T. RETRACTED ARTICLE: Non-point source pollution in river basin based on Bayesian network and intelligent translation system of English books. Arabian Journal of Geosciences, 2021; 14(16): 1–11. doi:10.1007/s12517-021-07928-0

[3] Beeley, P. 'There are great alterations in the geometry of late'. The rise of Isaac Newton's early Scottish circle. British Journal for the History of Mathematics, 2020; 35(1): 3–24. doi:10.1080/26375451.2019.1701862

[4] Hu, Z., Cui, Y., Zhang, J., & Eviston-Putsch, J. Shalosh B. Ekhad: a computer credit for mathematicians. Scientometrics, 2020; 122(1): 71–97. doi:10.1007/s11192-019-03305-7

[5] Barnes, L. P., Inan, H. A., Isik, B., & Özgür, A. rTop-k: A statistical estimation approach to distributed SGD. IEEE Journal on Selected Areas in Information Theory, 2020; 1(3): 897–907. doi:10.1109/JSAIT.2020.3042094

[6] Popescu, F. Paronyms and Other Confusables and the ESP Translation Practice. Analele Universităţii Ovidius din Constanţa. Seria Filologie, 2019; 30(1): 220–232

[7] Rastogi, P., Poliak, A., Lyzinski, V., & Van Durme, B. Neural variational entity set expansion for automatically populated knowledge graphs. Information Retrieval Journal, 2019; 22(3): 232–255. doi:10.1007/s10791-018-9342-1

[8] Xie, T., Liu, R., & Wei, Z. Improvement of the Fast Clustering Algorithm Improved by K-Means in the Big Data. Applied Mathematics and Nonlinear Sciences, 2020; 5(1): 1–10. doi:10.2478/amns.2020.1.00001

[9] O'Neill, E. R., Parke, M. N., Kreft, H. A., & Oxenham, A. J. Role of semantic context and talker variability in speech perception of cochlear-implant users and normal-hearing listeners. The Journal of the Acoustical Society of America, 2021; 149(2): 1224–1239. doi:10.1121/10.0003532

[10] İlhan, E., & Kıymaz, İ. O. A generalization of truncated M-fractional derivative and applications to fractional differential equations. Applied Mathematics and Nonlinear Sciences, 2020; 5(1): 171–188. doi:10.2478/amns.2020.1.00016

[11] Meng, Y., Yang, N., Qian, Z., & Zhang, G. What makes an online review more helpful: an interpretation framework using XGBoost and SHAP values. Journal of Theoretical and Applied Electronic Commerce Research, 2020; 16(3): 466–490. doi:10.3390/jtaer16030029
