AHEAD OF PRINT

Journal information
eISSN: 2444-8656
First published: 01 Jan 2016
Publication frequency: 2 times per year
Language: English
Open Access

Calculation and Performance Evaluation of Text Similarity Based on Strong Classification Features

Published: 15 Jul 2022
Volume & Issue: AHEAD OF PRINT
Page range: -
Received: 12 Feb 2022
Accepted: 11 Apr 2022
Preface

Comparing only the character code strings of two texts (for example, their ASCII or Big5 encodings) makes it almost impossible to capture semantic similarity. For the pair “Today is sunny.” and “It is sunny today.”, comparing the semantics of the two strings is difficult without deep machine learning, and for “Today is sunny.” and “The sun is warm in winter.”, traditional methods cannot compare semantic similarity at all. Even a neural network architecture applied directly to such raw strings struggles to produce sufficiently accurate semantic similarity results. Therefore, comparison algorithms that extract a semantic feature string from the text, combining processing before machine learning with processing after it, have been widely used in recent years for text similarity comparison.
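As a quick illustration (not from the paper), a purely character-level comparison already shows the problem: it rewards surface overlap rather than meaning, so a paraphrase and an unrelated sentence get scores of the same order.

```python
# Character-level similarity of the example sentence pairs from the text,
# using Python's standard-library SequenceMatcher. This measures matching
# character runs only, with no notion of semantics.
from difflib import SequenceMatcher

def char_similarity(a: str, b: str) -> float:
    """Ratio of matching character runs between two raw strings (0.0-1.0)."""
    return SequenceMatcher(None, a, b).ratio()

paraphrase = char_similarity("Today is sunny.", "It is sunny today.")
unrelated = char_similarity("Today is sunny.", "The sun is warm in winter.")
print(paraphrase, unrelated)
```

The paraphrase pair scores higher only because it shares the literal substring “ is sunny”; the measure has no access to the shared meaning itself, which is the gap the semantic feature string is meant to close.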

Liu Sihua et al. (2020) conducted in-depth research on the modal semantics of “neng” and “hui” and related modal words in Chinese. Both can be translated into English with the modal “will”, yet machine learning analysis can still distinguish the “active future time” and “passive future time” readings in the Chinese semantics [1]. Wang Youliang (2019) analyzed the problem of semantic strength in Chinese adjectives; for example, judgments on adjectives expressing subjective temperature such as “warm”, “hot” and “fiery” should clearly distinguish their relative strength [2]. Zhu Jing (2020) analyzed models of semantic types and expression methods under Chinese-Russian mutual translation, and other related literature has likewise studied the semantic realization of cross-language machine translation technology [3]. Yan Bing et al. (2020) studied the semantic analysis of international diplomatic discourse on the Sino-US trade war, where machine semantic analysis must handle different expressions in a complex context [4]. A comprehensive reading of the relevant literature shows that current semantic analysis research based on machine learning and machine translation generally aims at realizing a specific function in a single application scenario. Against this background, this paper explores a universal machine learning semantic analysis system that can provide high satisfaction.

Construction Model of Semantic Function Library

In the early stages of this line of work, a semantic function library could not support the big data involved in semantic comparison, because an effective and efficient data structure for the library was difficult to design. Chinese words such as nouns, pronouns, verbs, adverbs, adjectives and prepositions carry relatively independent semantic evaluation indicators with almost no overlap. Especially in complex grammatical environments where nouns are used as verbs, adjectives or prepositions, even judging their true part-of-speech meaning by machine requires complex computation. Some studies therefore build the semantic function library with a multi-level fuzzy comparison method: a part-of-speech comparison module first classifies the part of speech of the input word, and the semantic context is then judged from the surrounding context and a second-level library.

The library focuses on the direct fuzzy realization of semantics: the output target of the semantic function library is not a judgment displayed at the human-machine interface, but a semantic depth code index used for semantic recognition by subsequent machine learning modules, as shown in Figure 1.

Figure 1

Definition of Semantic Identification Code

Thus, each fixed word in the semantic recognition library is converted into a 5-digit semantic identification code. This code does not provide data support during rigid (exact) comparison, but it is sufficient to provide isomorphic support for the heterogeneous data of natural text in machine learning.
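A minimal sketch of such a library follows. The concrete digit layout of the 5-digit code is defined in the paper's Figure 1 and is not reproduced in the text, so the codes below are purely hypothetical placeholders; only the shape of the mapping (word → one or more 5-digit codes) comes from the description above.

```python
# Hypothetical semantic function library: each fixed word maps to one or more
# 5-digit semantic identification codes. All code values here are invented
# placeholders for illustration; the real layout is given by Figure 1.
SEMANTIC_LIBRARY = {
    "observation": ["10472", "20981"],  # e.g. a noun reading and a verb reading
    "sunny": ["30115"],
}

def lookup_codes(word: str) -> list[str]:
    """Return all candidate semantic identification codes for a word ([] if absent)."""
    return SEMANTIC_LIBRARY.get(word, [])

# Every stored code is exactly 5 digits long.
assert all(len(c) == 5 for codes in SEMANTIC_LIBRARY.values() for c in codes)
```

The multi-code case ("observation") is exactly the ambiguity the next paragraph discusses: resolving which candidate code applies requires contextual judgment, not a lookup.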

In the semantic recognition library, a fixed word may correspond to multiple semantic identification codes. For example, the word “observation” may correspond to the weak semantic strength option as a noun or to the strong semantic strength option as a verb, and it may also appear in noun-to-verb or verb-to-noun usage. This requires machine learning judgment based on contextual semantic filtering, supported by convolutional neural networks; this judgment mode is analyzed below.

The actual semantic function library does not need to contain the semantic function characteristics of every word in a modern Chinese dictionary. It only needs to cover the semantic potential of 2000-3000 common words to provide comparative data support for the semantic potential of most Chinese words; that is, the semantic function library holds about 8000-12000 comparative correlation functions.

Analysis of the Overall Algorithm Pattern of Strong Classification Feature Comparison

As shown in Figure 2, two character strings of unlimited length are entered into the system for comparison. In strong convolution and stream input mode, a fuzzy neuron convolution network (see 2.1) generates the semantic string of each input with the support of the semantic function library. Each generated semantic string then undergoes a second analysis, in which the Fourier transform serves as the core basis function of the frequency domain feature analysis module (see 2.2), to obtain a frequency domain feature string. The two feature strings then pass through another fuzzy neuron convolution network (see 2.3) to obtain a comparison value of type Double. Finally, a de-fuzzy module (see 2.4) performs the de-fuzzy calculation, and a common formatted output module outputs the comparison result. The integrated algorithm uses a total of 2 fuzzy neuron convolutional networks to compare the two strings semantically, which minimizes the computing power required by each network and improves system efficiency.
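The four-stage data flow just described can be sketched as a simple function composition. All four stage functions below are trivial stand-ins, assumed only to show how the stages chain together; in the paper, stages 1 and 3 are fuzzy neuron convolution networks and stage 2 is the rigid Fourier step.

```python
# Sketch of the overall pipeline: semantic string -> frequency features ->
# fuzzy comparison -> de-fuzzified verdict. Stage implementations are placeholders.
def compare_texts(text_a, text_b, to_semantic, to_frequency, fuzzy_compare, defuzz):
    sem_a, sem_b = to_semantic(text_a), to_semantic(text_b)    # stage 1: semantic strings
    freq_a, freq_b = to_frequency(sem_a), to_frequency(sem_b)  # stage 2: frequency-domain features
    fuzzy_score = fuzzy_compare(freq_a, freq_b)                # stage 3: fuzzy comparison (Double)
    return defuzz(fuzzy_score)                                 # stage 4: de-fuzzified output

# Trivial stand-ins just to show the composition runs end to end:
verdict = compare_texts(
    "Today is sunny.", "It is sunny today.",
    to_semantic=lambda t: [ord(c) % 10 for c in t],
    to_frequency=lambda s: sum(s) / max(len(s), 1),
    fuzzy_compare=lambda a, b: 1.0 - abs(a - b) / max(a, b, 1.0),
    defuzz=lambda x: "similar" if x > 0.8 else "dissimilar",
)
```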

Figure 2

Comparison of Strong Classification Characteristics of the Overall Algorithm Data Flow Diagram

Semantic Function Library Comparison Module Design

The semantic function library operates in a double-loop convolution stream data comparison mode: each comparison string is traversed separately to obtain, for each record of the semantic function library, the corresponding result and output. This module is a typical fuzzy neuron deep convolutional network algorithm, whose core control variable is the pointer into the semantic function library and whose secondary control variables are the pointers into the two comparison strings. The output variable is the semantic string data at the two comparison string pointers. See Figure 3 for details.

Figure 3

Semantic Function Library Comparison Module Design

In the functional design of Figure 3, the two comparison strings are compared independently while a traversal pointer i runs over the comparison function library. For each i, a segment of the comparison string is taken according to the target string length in the library: the pointer j traverses the comparison string character by character to form the comparison string pointer, so that each library input and its comparison string input have equal length. This article limits each comparison string segment to no more than 4 characters, i.e. 8 bytes, so each of the two inputs of the fuzzy neural network contains at most 8 bytes of bit data. The input data must nevertheless be deeply convolved, because the system has to fully consider the context. The double-loop convolution method uses two rings, A and B, of four modules each; every convolution module is designed with the hidden layer structure 3, 7, 13, 5, 1, and its nodes follow a high-order polynomial regression. The node function can be written as: Y = \sum\limits_m \sum\limits_{n=0}^{6} A_n X_m^n

The input module takes an 8-byte bit variable and outputs a 4-byte Double variable. Its hidden layer follows the structure 3, 7, 3 and uses linear node functions, which can be written as: Y = \sum \left( A \cdot X_i + B \right)

The output module integrates the outputs of the three convolution modules A1, B1 and B4, all Double variables. Its statistical role is to binarize the three sets of input data for subsequent management. The hidden layer must reach sufficient depth, so a 5-layer design with the structure 5, 17, 31, 13, 3 is adopted. The node function can be written as: Y = \sum \frac{1}{A \cdot e^{X_i} + B}
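As a minimal sketch, the three node functions (1)-(3) can be written directly in Python. Here A, B and the coefficients A_n stand for learned parameters; the concrete values in the final line are illustrative only.

```python
import math

def polynomial_node(X, A):
    """Eq. (1): Y = sum over m and n=0..6 of A[n] * X[m]**n (high-order polynomial node)."""
    return sum(A[n] * x ** n for x in X for n in range(7))

def linear_node(X, A, B):
    """Eq. (2): Y = sum over i of (A * X[i] + B) (linear node of the input module)."""
    return sum(A * x + B for x in X)

def exp_node(X, A, B):
    """Eq. (3): Y = sum over i of 1 / (A * exp(X[i]) + B) (binarizing output node)."""
    return sum(1.0 / (A * math.exp(x) + B) for x in X)

# Illustrative parameter values only:
assert linear_node([1.0, 2.0, 3.0], A=2.0, B=1.0) == 15.0
```

Note the saturating shape of eq. (3): for large positive X_i each term tends to 0 and for large negative X_i it tends to 1/B, which is what gives the output module its binarizing character.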

In the output module of the semantic string, when the result is close to 1.000, the characteristic semantic identification code is read and, according to the comparison result output in real time, a semantic identification code is generated at the position of the comparison string pointer; when the result is close to 0.000, the semantic identification code at the comparison string pointer is set to 0. When a semantic identification code already exists at the comparison string pointer, the arithmetic average of the results is taken. Even if the machine learning semantic string generated by the algorithm cannot be verified against the semantic function library, it is sufficient for the subsequent three modules to generate machine learning results.

Combining the sub-module designs of this module yields Table 1.

Table 1. Summary Table of Design Parameters of Semantic Function Library Comparison Module

Sub-module           | Input node | Output node | Hidden layer     | Total nodes | Node function
Library input        | 1×64 bit   | 1×Double    | 3, 7, 3          | 15          | Y = Σ(A·X_i + B)
Compare string input | 1×64 bit   | 1×Double    | 3, 7, 3          | 15          | Y = Σ(A·X_i + B)
Convolution A1       | 2×Double   | 1×Double    | 3, 7, 13, 5, 1   | 32          | Y = Σ_m Σ_{n=0}^{6} A_n·X_m^n
Convolution A2       | 2×Double   | 1×Double    | 3, 7, 13, 5, 1   | 32          | Y = Σ_m Σ_{n=0}^{6} A_n·X_m^n
Convolution A3       | 1×Double   | 1×Double    | 3, 7, 13, 5, 1   | 31          | Y = Σ_m Σ_{n=0}^{6} A_n·X_m^n
Convolution A4       | 1×Double   | 1×Double    | 3, 7, 13, 5, 1   | 31          | Y = Σ_m Σ_{n=0}^{6} A_n·X_m^n
Convolution B1       | 2×Double   | 1×Double    | 3, 7, 13, 5, 1   | 32          | Y = Σ_m Σ_{n=0}^{6} A_n·X_m^n
Convolution B2       | 2×Double   | 1×Double    | 3, 7, 13, 5, 1   | 32          | Y = Σ_m Σ_{n=0}^{6} A_n·X_m^n
Convolution B3       | 1×Double   | 1×Double    | 3, 7, 13, 5, 1   | 31          | Y = Σ_m Σ_{n=0}^{6} A_n·X_m^n
Convolution B4       | 1×Double   | 1×Double    | 3, 7, 13, 5, 1   | 31          | Y = Σ_m Σ_{n=0}^{6} A_n·X_m^n
Output               | 3×Double   | 1×Double    | 5, 17, 31, 13, 3 | 73          | Y = Σ 1/(A·e^{X_i} + B)
Frequency Domain Feature Analysis Module Design

The machine learning semantic string generated in 2.1 is essentially a time-domain function: it calibrates the semantic identification code information on the character pointer sequence of the input string, so the string still carries time-domain specificity. The statistical purpose of the frequency domain feature analysis module is therefore to weaken this time-domain specificity and obtain frequency domain feature data. The module performs a frequency domain feature extraction calculation on the time-domain data, which can be realized by a single Fourier transform.

First, the semantic identification code is read according to the pointer t of the semantic string, and the frequency domain features are extracted by the Fourier transform: f(\omega) = \int_{-\infty}^{+\infty} f(t) \cdot e^{-i\omega t} \, dt

After the feature function is obtained, it is divided by the total length of the pointer t, and the result is sampled to form the frequency domain feature string.
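A discrete sketch of this step follows, treating the semantic identification codes along the pointer t as samples f(t) of a finite signal. The number of retained frequency bins (`n_features`) is an assumption; the paper does not state how many components form the feature string.

```python
import cmath

def frequency_features(semantic_codes, n_features=4):
    """Discrete analogue of the Fourier step: keep the magnitudes of the first
    few frequency bins, normalised by the total length of the pointer t
    (the 'division by the total length' mentioned above)."""
    N = len(semantic_codes)
    feats = []
    for k in range(n_features):
        coeff = sum(f * cmath.exp(-2j * cmath.pi * k * t / N)
                    for t, f in enumerate(semantic_codes))
        feats.append(abs(coeff) / N)
    return feats
```

Because the magnitudes discard phase, a constant shift of the codes along t leaves the features unchanged, which is exactly the weakening of time-domain specificity the module is designed for.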

This process is a rigid computing process and does not involve any machine learning algorithm; in other words, the paper inserts a rigid data management step between the two fuzzy neuron network modules.

Core Comparison Module Design

Two columns of frequency domain feature strings, namely frequency domain feature string A and frequency domain feature string B, are input into the core comparison module, which is also a fuzzy neuron network convolution algorithm module. See Figure 4 for details.

Figure 4

Core Comparison Module Data Flow Diagram

The fuzzy core of this module first judges the lengths of the two frequency domain feature strings and uses the difference method to convert them to equal length. Two input strings are then formed, with the converted feature string pointer as the control variable, for a 4-module convolution stage: convolutions A and B integrate the input string data (Long type variables) into the convolution cycle, while convolutions C and D provide a Double value for each output module. Finally, the arithmetic average of all comparison results under equal-length pointers is taken; this average is the fuzzy comparison result of the two strings.
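The paper does not spell out the "difference method" for equalizing the lengths; one plausible reading, sketched here as an assumption, is a linear interpolation that stretches the shorter feature string to the target length.

```python
def stretch_to_length(values, target_len):
    """Linearly interpolate a feature string to target_len samples, so that
    both inputs of the core comparison module have equal length. This
    interpolation is an assumed reading of the paper's 'difference method'."""
    if target_len == 1 or len(values) == 1:
        return [values[0]] * target_len
    out = []
    for i in range(target_len):
        pos = i * (len(values) - 1) / (target_len - 1)  # fractional source index
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out
```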

The sub-module design of this module resembles the semantic function library comparison module: the two input strings are managed by a linear regression function with hidden layer structure 3, 7, 3 and node function (2); the four convolution modules perform node management with a high-order polynomial regression function, hidden layer structure 3, 7, 13, 5, 1 and node function (1); and the single output module performs node management with a binary regression function, hidden layer structure 3, 7, 3 and node function (3). The actual design parameters of the module are summarized in Table 2.

Table 2. Summary Table of the Actual Design Parameters of the Module

Sub-module     | Input node | Output node | Hidden layer   | Total nodes | Node function
Input string A | 1×Long     | 1×Double    | 3, 7, 3        | 15          | Y = Σ(A·X_i + B)
Input string B | 1×Long     | 1×Double    | 3, 7, 3        | 15          | Y = Σ(A·X_i + B)
Convolution A  | 2×Double   | 1×Double    | 3, 7, 13, 5, 1 | 32          | Y = Σ_m Σ_{n=0}^{6} A_n·X_m^n
Convolution B  | 2×Double   | 1×Double    | 3, 7, 13, 5, 1 | 32          | Y = Σ_m Σ_{n=0}^{6} A_n·X_m^n
Convolution C  | 1×Double   | 1×Double    | 3, 7, 13, 5, 1 | 31          | Y = Σ_m Σ_{n=0}^{6} A_n·X_m^n
Convolution D  | 1×Double   | 1×Double    | 3, 7, 13, 5, 1 | 31          | Y = Σ_m Σ_{n=0}^{6} A_n·X_m^n
Output         | 2×Double   | 1×Double    | 3, 7, 3        | 16          | Y = Σ 1/(A·e^{X_i} + B)
De-Fuzzy and Formatting Output Module Design

According to the previous analysis, the final output of the algorithm is the average of binarized results after deep algebraic averaging, so the binarization feature of the final data is not pronounced. In other words, the final values of the model are mostly concentrated in the [0,1] interval, with some results falling outside it, and the output must therefore be deeply de-fuzzified.

In the de-fuzzy process, two thresholds can be defined. When the output result is greater than a value M, the similarity of the two texts lies in a high confidence region; when the output result is less than a value N, it lies in a low confidence region; and for results in the [N, M] interval, the system gives a weakly similar verdict. The final formatted output of the algorithm therefore contains three possible judgments: the semantics of the two texts are strongly similar, weakly similar, or dissimilar. For the algorithm to be adaptable to practical application scenarios, the combined output frequency of the strongly similar and dissimilar results should be at least 80%.
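The three-way de-fuzzification just described reduces to a pair of threshold tests. The values of M and N below are illustrative assumptions; the paper defines them only as design parameters.

```python
# Three-way de-fuzzification: M and N are design thresholds (values here
# are illustrative, not taken from the paper).
def defuzzify(score, M=0.8, N=0.3):
    """Map a fuzzy comparison score to one of the three formatted verdicts."""
    if score > M:
        return "strongly similar"
    if score < N:
        return "dissimilar"
    return "weakly similar"
```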

Algorithm Performance Evaluation

Because text semantic similarity evaluation is in essence a subjective judgment, 100 volunteers, undergraduate or above students with a certain level of literary criticism training in Chinese language and literature, international Chinese, and Chinese language education, were selected for the evaluation. 50 pairs of text segments were compared to measure the consistency between the system's evaluation of the 50 pairs and the volunteers' manual interpretation. To judge the accuracy of the system's semantic similarity judgment, the volunteers gave subjective ratings of very consistent (10 points), basically consistent (6 points), inconsistent (3 points) and completely inconsistent (0 points). In the resulting 5,000 evaluations, the 100 volunteers rated the system very consistent 2,763 times (55.26%), basically consistent 1,326 times (26.52%), inconsistent 635 times (12.70%) and completely inconsistent 276 times (5.52%). The comprehensive judgment accuracy of the system (the combined proportion of very consistent and basically consistent ratings) is 81.78%, and the comprehensive subjective score is 74.98 points (out of 100).
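The reported accuracy and subjective score follow directly from the stated counts and point values:

```python
# Reproducing the evaluation arithmetic from the counts reported above.
counts = {"very consistent": 2763, "basically consistent": 1326,
          "inconsistent": 635, "completely inconsistent": 276}
points = {"very consistent": 10, "basically consistent": 6,
          "inconsistent": 3, "completely inconsistent": 0}

total = sum(counts.values())  # 100 volunteers × 50 pairs = 5000 evaluations
accuracy = (counts["very consistent"] + counts["basically consistent"]) / total
score = sum(counts[k] * points[k] for k in counts) / total * 10  # rescaled to 100

print(round(accuracy * 100, 2), round(score, 2))  # 81.78 74.98
```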

Among the 5,000 evaluations, the system gave 1,031 strong similarity verdicts (20.62%), 391 weak similarity verdicts (7.82%), and 3,578 dissimilarity verdicts (71.56%). The strong similarity and dissimilarity verdicts together total 4,609, or 92.18%, which meets the design requirement of this paper (see 2.4).

Summary

The system focuses on evaluating the semantic similarity of Chinese-language text and reaches 81.78% agreement with manual judgment; only 5.52% of the volunteer ratings found the system's judgment completely inconsistent with manual judgment. Since natural language semantic judgment based on machine learning is still a cutting-edge topic, this accuracy compares favorably with the accuracy reported in related literature for single judgment targets. As a general semantic judgment algorithm, the system can still be improved through further refinement of the semantic function library and deeper training of the two fuzzy neuron network machine learning modules.


[1] Liu Sihua, Zeng Chuanlu. Comparison of Modal Semantics of “Neng” and “Hui” [J]. Journal of Shenyang University (Social Science Edition), 2020, 22(01): 95–100+105.

[2] Wang Youliang. A Study on the Comparison of Strong Relations of Semantic Relation Adjectives [J]. Journal of Jiaozuo University, 2019, 33(04): 7–11.

[3] Zhu Jing. Semantic Types of Russian-Chinese Comparative Categories and Their Expression Methods [J]. Chinese Russian Teaching, 2020, 39(01): 34–43.

[4] Yan Bing, Zhang Hui. A Comparative Analysis of Sino-US Trade War Discourse from the Perspective of Frame Semantics [J]. Foreign Languages, 2020, 36(01): 1–8.
