Open Access

Differentiation Analysis and Classification of Chinese Language and Literature Features Based on Text Big Data

  
27 Feb 2025


Introduction

A corpus-based approach is used to construct a parallel corpus with Sketch Engine, taking the report of the Nineteenth National Congress of the Communist Party of China and its English version as the corpus. Following Laviosa's research framework, the linguistic features of Chinese-English translation are analyzed, examining political documents in terms of average sentence length, lexical density, and word frequency. The results show that the English versions of Chinese political documents exhibit simplification in lexical density and word frequency, whereas this tendency is not obvious in average sentence length or lexical difficulty. From the corpus perspective, translation studies emphasize not only data statistics but also theoretical analysis and the combination of qualitative and quantitative methods [1].

Text categorization has long received attention in natural language processing, especially now that the Internet is developing vigorously and the amount of data keeps growing. In this context, improving the performance of text classifiers has become a focus of research. Although traditional machine-learning methods have achieved certain results, they cannot solve the text classification problem in a massive-data environment, so new models, algorithms, and methods are needed. The Recurrent Neural Network (RNN) is one of the most commonly used natural language processing methods; it uses a loop structure to learn from serialized data. The Convolutional Neural Network (CNN) can extract effective information from large numbers of visual images. Building on RNN and CNN, a new Chinese text categorization model, BLSTM-C, has been proposed. BLSTM-C begins with a bidirectional long short-term memory (BLSTM) layer, a special RNN that generates sequential output conditioned on both past and future context, and then feeds that sequence to a CNN layer for feature extraction [2].

By comparing the classification ability of Chinese fragments of different granularity (CDC), the feasibility, rationality, and effectiveness of applying low-granularity Chinese-character features to Chinese short text classification (CSTC) have been discussed and verified. A Support Vector Machine (SVM) classifier based on statistical learning theory is used to process the training corpus. The main work is as follows: (1) comparative analysis of Chinese document classification performance at different granularities; (2) a summary of the best feature-selection strategy; (3) model construction; and (4) experiments taking the "Chinese Text Reading Comprehension Task" as the research object, for which an experimental platform for Chinese text reading comprehension based on content/form structure and a simulation environment for Chinese text reading categories based on CSSCI were built. Granularity classification is based on keywords, and the classification effect of the SVM algorithm is improved through feature optimization of the experimental samples and comparison with external data. The experimental results show that, for the same category, different feature-selection methods lead to different results, and that CSTC performance improves as granularity increases: usually, the larger the granularity, the better the classification effect, and vice versa.
However, low-granularity features are also feasible, and reasonable weight settings can improve their CDC, even beyond that of high-granularity features [3].

The standard text image compression (JBIG2) algorithm mainly targets the characteristics of English and other alphabetic scripts and does not consider Chinese and other logographic characters. To improve the compression ratio of Chinese text images, an improved feature-based Chinese text image compression algorithm, MC-JBIG2, has been proposed. The algorithm takes Chinese characters as the research object and performs feature selection, pattern classification, and multi-level coding on them. On this basis, a prototype system for Chinese text image compression based on multi-scale analysis (MSA) is implemented; the system performs well on different types of text, with good robustness and efficiency. The method first extracts various features of the characters in the image, then combines these features into cascaded clusters and encodes the characters in each cluster, and finally optimizes the parameters of the cascaded clusters, using a Monte Carlo strategy to traverse the feasible space. Experimental results show that the MC-JBIG2 algorithm outperforms existing representative JBIG2 algorithms and systems in processing Chinese text images [4].

Recent developments in neural TTS have made it realistic to synthesize human-like speech when massive studio-quality training data from voice talents is available; in most cases, machine-learning methods can analyze and process such data to generate high-quality speech signals. With improvements in computer performance, the Bertson-Levenberg (B/L) algorithm has been widely applied. Speech signals can be represented and stored in several forms, as text or as sound. However, ordinary speakers can record only a limited amount of speech, so artificial substitutes are sought, a direction known as "human-like TTS" technology. Chinese is a text-based language with many features that distinguish it from Roman-alphabet languages: there are no spaces between adjacent words, word segmentation errors occur easily, and semantic ambiguity and unnatural prosody are common. In the study cited here, multi-speaker TTS is adopted to cope with the scarcity of target-speaker training data, and linguistic features and BERT-derived information are studied to improve the prosody of Putonghua TTS. Three linguistic features related to phonemes and prosody are investigated [5]. An improved BERT-CRF model is used to predict prosodic breaks, and a BERT model is used to derive enhanced phoneme sequences with character embeddings. Subjective tests on in-domain and out-of-domain tasks in the news, chat, and audiobook domains show that all of these elements effectively improve the prosody of Chinese TTS. The results further show that adding phonetic features significantly strengthens the contribution of syntactic structure and semantic information to sentence stress recognition and reduces the error probability, while significant differences exist across domains. These results lay a foundation for further research.
Compared with the baseline, the extra character-embedding model derived from BERT performs best, with a gain of 0.17 MOS [6].

Another line of work trains a classification model and guides the transfer from source language to target language according to translation knowledge and defined parameters. An improved expectation-maximization algorithm extracts information from aligned source and target sentences and stores it in a parallel corpus. In this process, a BLEU-based approach is used for bilingual evaluation of the substitution from source language to target language, and a machine-learning method is proposed to improve the prediction of translation quality; the issues that must be considered when implementing such a system are also discussed. The only requirement for this kind of learning is that the target language have no labeled data. The algorithm can be evaluated accurately by running an independent classifier on different parallel corpora [7].

Authorship attribution (AA) of short texts faces many challenges, such as the shortness of the texts, sparse features, and the non-standard use of arbitrary words. To address this, traditional methods have been improved with a new processing mechanism that uses machine learning to automatically construct a short-text organization network, achieving higher accuracy, coverage, stability, scalability, and robustness. Recent research proposes a hybrid model for the authorship attribution of short texts, consisting of two parts: the first combines a RoBERTa pre-trained language model to learn the features of users' tweet texts and represent posts; the second is a CNN-based learning method that predicts users' writing style. These representations are finally integrated into an ultimate AA classifier. The experimental results show that this tweet model achieves state-of-the-art results on AA datasets of known tweets [8].

Fake news is a major issue facing every society. With the development of Internet technology, fake news is ever more likely to occur, so screening truly valuable information out of massive information has become an urgent problem. Clues to fake news come from a wide range of sources and take various forms; they are huge in number, hard to identify, track, and monitor, and costly and inefficient to handle. Fake news must be detected and stopped from spreading before it does greater damage to a country, but because information is dynamic, effectively identifying it remains a challenging task. One study proposes and tests a fake-news detection framework for Thai media consisting of three modules: information retrieval, natural language processing, and machine learning. A comparative study on the test suite shows that a long short-term memory model works best, and an automated online fake-news detection web application has been deployed [9].
Manual evaluation of machine translation is key to improving the accuracy of translation output, and it can be applied to text classification in advance. One study presents a text categorization method that combines a parallel corpus with natural language processing technology: an expectation-maximization algorithm first classifies the multilingual texts, and an efficient text classifier is then obtained through training. Experimental results show that, compared with traditional machine-learning methods, this method achieves higher precision and recall, and its performance clearly exceeds that of classification models based on monolingual corpora. The differences among multilingual and cross-language texts are also analyzed, and the similarities prove to be substantial. Cross-language text classification means using training data to classify translated text, thereby classifying texts in multiple languages; the main idea of this mechanism is to use training data on a parallel corpus and apply a classification algorithm to reduce distortion and alignment errors in the machine translation process [10].

There are many motivations for adopting cloud data warehouse technology: cost reduction, on-demand pricing, offloading the data center, unlimited hardware resources, built-in disaster recovery, and so on. There are inherent differences in language surface and feature sets between on-premises and cloud data warehouse solutions. These may be subtle syntactic and semantic differences that significantly affect the correctness of results, or complete features that exist in one system but are lacking in another [11].

While a great deal of work has been done to help automate the migration of native applications to the new cloud environment, one of the major challenges slowing migration is how to deal with features that cloud technology does not yet support, or supports only partially. Cross-platform operation can be achieved by building new data centers (such as Hadoop or cloud platforms), but data interaction between virtual machines then becomes more difficult, so more powerful systems must be developed. Building on early adaptive data virtualization, new techniques have been proposed that enable applications to use complex internal database features for which external query engines lack native support. Specifically, an architecture is introduced for managing metadata differences across heterogeneous query engines and emulating database application code in a cloud environment without rewriting or modifying the application code [12].

Previous studies have shown that human perceivers can identify individuals by their biological movements, such as walking or dancing, and sign language movement is an important means of person recognition under purely biomechanical conditions. One study raises, for the first time, the question of whether deaf perceivers can identify signers from motion-capture (mocap) data. Four sign language users produced sign language words that were rendered as point-light displays, and another group of deaf participants performed the identification task.
Because the signers were recorded at different places and times, the distances between them varied over time, possibly reflecting the different opportunities afforded by their environments. When the participants watched these displays, they could correctly understand the presented content, which indicates that visual attention may be one of the most important factors in the cognitive process. In addition, all participants believed that the physical movements would affect their judgments, but the experimental results on this point are inconsistent and inconclusive. Motion-capture data and analysis based on morphological cues can help us better understand the differences between different types of signers. At present, no machine-learning method can handle all the motion features related to human performance at the same time; moreover, for different kinds of signers, "unmarked" actions help identify signers more effectively than "marked" actions. These results show that sign language is a complex and meaningful system. The experimental design and its main problems are also discussed. The first challenge is how to extract effective features from a large number of stimulus materials to distinguish signers and, on that basis, to improve classification accuracy after size normalization. The behavioral and computational results show that the mocap data contain sufficient information to determine the signer, going beyond simple morphological cues [13].

At present, object detectors based on convolutional neural networks (CNN) usually rely on the last-layer features extracted by the feature-extraction network. However, because different types of samples differ greatly, traditional feature-extraction methods struggle to capture the deeper semantic knowledge hidden in the underlying structure of the image, resulting in a high miss rate in object detection; for large-scale complex scenes, the computational overhead is also too high, and robustness and real-time performance are poor. A depth feature is position information with a specific shape and size obtained by repeated convolution or pooling of the image. To address these problems, a new DenseNet-based detection model has been proposed, composed of a feature-fusion network, a multi-scale anchor-region analysis network, and a classification-regression network. The algorithm was trained and tested on the Pascal VOC2007 dataset. Experimental results show that the method can effectively detect all the objects in the dataset with an average precision of 73.87%, a high level; its average precision is 5.10 and 3.10 points higher than that of the mainstream Fast RCNN and SSD detection models respectively [14].

With the advent of the big data era, we can use big data to analyze all kinds of questions and thus provide ourselves with richer information resources. Taking Chinese drama as the research object, one study investigates and analyzes the application status of big data analysis methods in Chinese drama, identifies existing problems, and puts forward solutions, in the hope of helping Chinese drama develop and prosper.
The big data analysis method is mainly realized with crawler software and SPSS Modeler. In the data-mining process, the data are first classified, then modeled, and finally processed with feature word clouds. The study shows that the unique performance style and technique of Chinese drama give it a distinctive character marked by generality, freehand expressiveness, and theatricality. The features are then input into a decision tree as feature matrices, the relationship between each feature and each attribute is obtained, and the results are output to a visual interface. In this way, the main characteristics of Chinese drama, such as its cultural and regional character, uniqueness, diversity, richness, and complexity, can be displayed intuitively. In view of these characteristics, corresponding laws and regulations can be promulgated at the national or governmental level to promote the development of Chinese drama and give the public a deeper understanding of it [15].

Analysis of Chinese Language Application Based on Text Big Data Era

Chinese researchers should not only analyze a large number of Chinese facts but also attend to description; the study of Chinese linguistics is a multi-disciplinary and complex problem. On this premise, Chinese grammar scholars must also have a broad vision and be good at observing things through different ways of thinking, for example: 1. multidimensional analysis; 2. holistic analysis; 3. dynamic comparison; 4. analogical reasoning; 5. system analysis. This requires us to consciously look at linguistic objects from a more pluralistic perspective. As in the story of the blind men and the elephant, from a single angle we may grasp only some aspects of an object; with more angles, we obtain a more comprehensive picture of the object of study and come closer to the truth.

Analysis of Chinese Language Differences in the Age of Text Big Data

Against the background of the "Chinese fever," the number of Chinese learners of different levels and purposes keeps increasing, and an effective test of learning ability has at least the following practical applications.

First, such a test can measure the degree of learning and whether learners possess strong autonomous learning ability, serving as both an indirect and a direct form of assessment. Second, a learning-ability test can screen out truly successful learners and thereby serve the goal of personnel training; it is therefore necessary to administer placement tests to different types of beginners so as to formulate scientific and effective teaching strategies, realizing differentiated teaching, teaching fairness, and maximal teaching benefit. Third, it is a basic condition for teachers of Chinese as a foreign language to implement a tiered teaching model.

There are many records of similar views in ancient Chinese literature. For historical reasons, however, these discussions did not receive enough attention, and it was not until the 1980s that they gradually attracted academic interest, thanks to the opportunities brought by social development and scientific and technological progress.

Characteristic analysis of the big data era

Big data is characterized by a large volume of information and high value density, and it is therefore widely applied and influential. The concept of big data emerged in the early 1990s. Internet technology is a product of the information age, and big data is its concrete embodiment across industries, representing another technological revolution. It can save a great deal of manpower and material resources: linking big data with cloud computing forms a huge database analysis system that can sort out the data and information users need in a short time and feed it back to them promptly. With the progress of science and technology, big data, as a new way of storing information, is widely used in many fields and strongly promotes social development. The technology roadmap is shown in Figure 1 below.

Figure 1.

Data-flow diagram of real-time data access

Application of Chinese Language Differences Based on Text Big Data Era

As one of the most basic subjects in China's mother-tongue education, Chinese plays a very prominent role. Introducing concepts such as communication linguistics and computer applied linguistics into linguistics, and combining big data with computer applied linguistics, can provide new ideas for Chinese language research, although in the Internet age the future development trend remains inconclusive. Since the beginning of the 21st century, with the advent of the information age, both computational linguistics and artificial-intelligence linguistics have made great progress; the arrival of the big data era in particular has had a great impact on linguistics. The influence of data information on Chinese linguistics is mainly reflected in the study of language knowledge and language theory. The big data era provides abundant material for linguistic resources, promotes the development of Chinese linguistics, and drives innovation in the Chinese language. If Chinese linguistics is to thrive in the era of big data, it must innovate.

Linguistic information resources in the big data era are numerous and diverse. People's demand for information resources keeps growing, and the information of Chinese linguistics mainly comes from the resources generated by computer technology, such as databases, which are very important information resources. With the help of modern computer clusters, we can carry out data analysis and statistics, analyze the large volumes of data stored in statistical databases, and selectively screen out the valuable linguistic materials, which effectively enhances the utilization value of Chinese linguistic knowledge. Mining the relevant laws and valuable language data in the development of Chinese linguistics from these data not only enhances the value of Chinese linguistics research, but is also of great significance for the application of Chinese linguistics in practice, and will have a far-reaching influence on the development of Chinese linguistics for later generations.

Classification algorithm of Chinese language feature differences in the era of text big data

Supervised classification is one of the effective methods for solving classification problems: a classification algorithm is obtained by training on labeled sample data, and the trained model is then used to classify new samples. Against this background, the Random Forest classifier was proposed; it offers good generalization and adaptability, and in practice it has proved feasible, effective, reliable, efficient, stable, and easy to implement and popularize. The Naive Bayes algorithm, for its part, has been widely used in text classification because of its good classification effect and ideal learning efficiency.
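
To make this train-then-classify workflow concrete, the following is a minimal sketch (not the paper's own code), assuming scikit-learn and the jieba word segmenter are available; the texts, labels, and model choice are illustrative placeholders only.

```python
# Minimal supervised text-classification sketch: segment Chinese text,
# vectorize it, train a Naive Bayes classifier, and predict a new sample.
# Assumes scikit-learn and jieba are installed; the data are toy placeholders.
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["大数据推动汉语言研究", "卷积神经网络用于图像特征提取"]
train_labels = ["linguistics", "computer-vision"]

# Chinese has no spaces between words, so jieba provides the tokenization.
model = make_pipeline(TfidfVectorizer(tokenizer=jieba.lcut), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["文本大数据下的汉语言特征分析"]))
```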

Classification algorithm of feature difference in Chinese language

The basic principle of the classification algorithm is as follows. The samples of a classification problem have n-dimensional features, denoted in sequence A1, A2, ..., An, and M class labels, denoted in sequence c1, c2, ..., cM. Under this setting, Naive Bayes classification identifies the unknown class label of a test sample S whose features are represented in vector form as (a1, a2, ..., an). Let event X_i be "the sample's class label equals c_i" and event Y be "the sample's feature vector equals (a1, a2, ..., an)", and compute the conditional probabilities P(X_i | Y) for i = 1, 2, ..., M, that is: \[P\left( X_i \mid Y \right)=P\left( C=c_i \mid A_1=a_1, A_2=a_2, \ldots, A_n=a_n \right)\] \[c(S)=\arg \max_i P(X_i \mid Y)\]

According to the multiplication formula (Bayes' theorem), P(X_i | Y) can be transformed to obtain: \[P(X_i \mid Y)=\frac{P(a_1, a_2, \ldots, a_n \mid c_i)\times P(c_i)}{P(a_1, a_2, \ldots, a_n)}\] Naive Bayes further assumes that the features are conditionally independent given the class: \[P(A_p \mid A_q, c_i)=P(A_p \mid c_i), \quad \forall p\ne q\]

Under this assumption, P(a_1, a_2, ..., a_n | c_i) can be expanded step by step: \[P(a_1, a_2, \ldots, a_n \mid c_i)=P(a_1 \mid c_i)\times P(a_2, \ldots, a_n \mid c_i)\]

Applying the expansion repeatedly gives: \[P(a_1, a_2, \ldots, a_n \mid c_i)=\prod_{j=1}^{n} P(a_j \mid c_i)\] \[P(X_i \mid Y)=\frac{\prod_{j=1}^{n} P(a_j \mid c_i)\times P(c_i)}{P(a_1, a_2, \ldots, a_n)}\]

Since the denominator P(a_1, a_2, ..., a_n) is the same for all P(X_i | Y), it can be ignored when comparing classes, and the final result is: \[c(S)=\arg \max_i \left( P(c_i)\times \prod_{j=1}^{n} P(a_j \mid c_i) \right)\]
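
As a hedged illustration (not given in the paper), this decision rule can be implemented directly once the prior P(c_i) and the likelihoods P(a_j | c_i) are available. Here `priors` and `likelihood` are assumed inputs, estimated as in the next subsection, and logarithms are used so that the product of many small probabilities does not underflow.

```python
import math

def classify(sample, priors, likelihood):
    """Naive Bayes decision rule: argmax over classes of
    log P(c_i) + sum_j log P(a_j | c_i)."""
    best_cls, best_score = None, float("-inf")
    for cls, prior in priors.items():
        score = math.log(prior)
        for j, value in enumerate(sample):
            score += math.log(likelihood(j, value, cls))  # P(a_j | c_i)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls
```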

This classifier was put forward by the American psychologist E. L. Baum in 1970 and has since been widely used in computer graphics, where it can be used to classify graphics. Its advantages are a small amount of computation, easy implementation, good effect, and ease of popularization, use, and mastery. A Bayesian network representation of the Naive Bayes algorithm is shown in Figure 2.

Figure 2.

Representation of algorithm network

Calculation method of feature difference classification algorithm for Chinese language

P(a_j | c_i) and P(c_i), which must be computed in practical applications, are called the posterior and prior probabilities respectively; both are statistics describing the value range or probability distribution of random variables. The prior probability formula is: \[P(c_i)=\frac{n_i}{n}\] where n_i is the number of training samples with class label c_i and n is the total number of training samples.

Using this prior, a posterior probability density function can be constructed on a Bayesian network (BN), yielding a classifier model with good generalization ability. Because sample features may be either discrete or continuous, the corresponding posterior probabilities are computed differently.

The posterior probability formula for discrete feature samples is: \[P(a_j \mid c_i)=\frac{n_{ij}}{n_i}\] where n_{ij} is the number of class-c_i samples whose j-th feature takes the value a_j.

In practice, n_{ij} may equal zero, which forces the final P(X_i | Y) to zero regardless of the other features; moreover, as the feature space is subdivided further, many subspaces contain too little data, producing a "curse of dimensionality" problem. Laplace smoothing with a parameter λ avoids this: \[P(a_j \mid c_i)=\frac{n_{ij}+\lambda }{n_i+k\lambda }\]

where k (the number of distinct values t_j that the j-th feature can take) satisfies the normalization condition: \[\sum_{p=1}^{t_j} P(a_j^{(p)} \mid c_i)=\sum_{p=1}^{t_j} \frac{n_{ij}^{(p)}+\lambda }{n_i+k\lambda }=1\]
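
A minimal sketch of these estimates (illustrative, with hypothetical names; it pairs with the `classify` sketch above): the prior is n_i / n and the likelihood is the Laplace-smoothed ratio (n_ij + λ) / (n_i + kλ), with k taken as the number of distinct observed values of feature j.

```python
from collections import Counter, defaultdict

def fit_discrete_nb(X, y, lam=1.0):
    """Estimate priors P(c_i) = n_i / n and Laplace-smoothed likelihoods
    P(a_j | c_i) = (n_ij + lam) / (n_i + k * lam) from discrete samples."""
    n = len(y)
    class_counts = Counter(y)                  # n_i per class
    priors = {c: cnt / n for c, cnt in class_counts.items()}

    value_counts = defaultdict(Counter)        # (class, j) -> value counts (n_ij)
    distinct = defaultdict(set)                # j -> distinct observed values
    for xs, c in zip(X, y):
        for j, v in enumerate(xs):
            value_counts[(c, j)][v] += 1
            distinct[j].add(v)

    def likelihood(j, value, cls):
        k = len(distinct[j])                   # t_j distinct values of feature j
        n_ij = value_counts[(cls, j)][value]
        return (n_ij + lam) / (class_counts[cls] + k * lam)

    return priors, likelihood
```

With these two sketches, `classify(sample, *fit_discrete_nb(X, y))` reproduces the argmax decision rule above on purely discrete data.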

The sample mean is calculated as: \[\bar{X}=\frac{1}{n}\sum_{i=1}^{n} X_i\]

The sample variance is calculated as: \[S^2=\frac{1}{n-1}\sum_{i=1}^{n}{\left(X_i-\bar{X}\right)}^2=\frac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2-n\bar{X}^2\right)\]

Let $\mu =\bar{X}$ and $\sigma^2=S^2$, estimated from the j-th feature values of the class-$c_i$ training samples; under a Gaussian assumption, the posterior probability for a continuous feature is then: \[P(a_j \mid c_i)=\frac{1}{\sigma \sqrt{2\pi }}\exp \left(-\frac{(a_j-\mu )^2}{2\sigma^2}\right)\]

Putting together the prior and posterior probability formulas required by the Naive Bayes classification algorithm: \[P(c_i)=\frac{n_i}{n}\] \[P\left( a_j \mid c_i \right)=\begin{cases} \dfrac{n_{ij}}{n_i}, & \text{discrete feature} \\[2ex] \dfrac{1}{\sigma \sqrt{2\pi }}\exp \left( -\dfrac{\left( a_j-\mu \right)^2}{2\sigma^2} \right), & \text{continuous feature} \end{cases}\]
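
For the continuous branch, a minimal sketch (assumed, not from the paper) computes μ and σ² from the j-th feature values of the class-c_i training samples and plugs them into the Gaussian formula; `values` is a hypothetical list of those feature values, and the function could serve as the continuous-feature `likelihood` for the `classify` sketch above.

```python
import math
from statistics import mean, variance

def gaussian_likelihood(values, a_j):
    """Gaussian P(a_j | c_i): mu = sample mean, sigma^2 = sample variance
    with the n-1 denominator, matching the formulas above.
    Assumes len(values) >= 2 and non-zero variance."""
    mu = mean(values)
    sigma2 = variance(values)  # statistics.variance uses the n-1 denominator
    return math.exp(-((a_j - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
```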

Multinomial model of the feature difference classification algorithm for Chinese language

The multinomial model treats every term in a document as an attribute of the document, and the same term may appear many times.

In practical applications, however, some terms play a more important role or carry more meaning for a particular document than others, and different terms differ in importance; the model nonetheless assigns an identical posterior factor to each occurrence of the same term, which keeps it simple and effective. Under the multinomial model, the posterior probability formula is: \[P(a_j \mid c_i)=\frac{n_{ij}+\lambda }{n_i+k\lambda }\] where n_{ij} is now the number of occurrences of term a_j in the class-c_i documents and n_i is the total number of term occurrences in class c_i.

The posterior probability formula for a document d under the multinomial model is then: \[P(d \mid c_i)=\prod_{j=1}^{t} P(a_j \mid c_i)\times P(c_i)\]

On this basis, a new classifier model based on Bayesian networks and support vector machines, the Bayesian classification tree model, is given; it can handle a large amount of unsupervised information in text at the same time, and experimental results show the effectiveness of the algorithm. If the document probability is expressed with each feature's posterior probability weighted by its frequency, together with the class prior probability, we obtain: \[P(d \mid c_i)=\prod_{j=1}^{n} P(A_j \mid c_i)^{v_j}\times P(c_i), \qquad v_j=|a_j|\] where v_j = |a_j| is the number of occurrences of term a_j in document d.
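
A minimal sketch of multinomial document scoring under the same smoothing (names such as `term_counts` and `class_total` are hypothetical stand-ins for n_ij and n_i): each term contributes its likelihood raised to its frequency v_j = |a_j|, computed in log space.

```python
import math
from collections import Counter

def multinomial_log_score(doc_terms, term_counts, class_total, vocab_size,
                          prior, lam=1.0):
    """log [ P(c_i) * prod_j P(a_j | c_i)^{v_j} ] for one class under the
    multinomial model, with Laplace smoothing over the vocabulary."""
    score = math.log(prior)
    for term, v_j in Counter(doc_terms).items():  # v_j = |a_j|, frequency in d
        p = (term_counts.get(term, 0) + lam) / (class_total + lam * vocab_size)
        score += v_j * math.log(p)
    return score
```

The class with the highest score is chosen, exactly as in the argmax rule above.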

The core part of the algorithm can be divided into two stages: prior probability calculation and posterior probability calculation. See Figure 3 for the algorithm flow chart.

Figure 3.

Algorithm flow chart

Experiment of Chinese Language Feature Difference Analysis Based on Text Big Data

In a sense, the Chinese language is the crystallization of human wisdom, reflecting the civilization and cultural level of a country or nation. Against the specific background of a rapidly changing and developing era, the differences in Chinese language features are analyzed here using text big data.

Analysis of Accuracy Difference of Chinese Language Features Based on Text Big Data

People attach great importance to the accuracy of language, which is reflected in how language is used; and since language expresses its accuracy through people's thinking, studying the relationship between language and thinking ability is of great significance for correctly understanding and using language. There is undoubtedly a close relationship between the two: language depends on thinking, and thinking affects language expression. With the continuous development of society, the content of social activities keeps growing richer, people's knowledge keeps increasing, and the matters they face keep growing more complex, all of which demand accuracy of language. With the continuous progress of science and technology, human thinking has shifted from vague impressions to quantified understanding, that is, from an understanding of nature to an understanding of itself, forming a "quantitative" mode of thinking. As the basis and premise of quantitative thinking, rigor and accuracy have played an important role in a series of major scientific achievements: the movement of molecules and atoms follows the strict theory of relativity, while high-rise buildings, spacecraft, and other fields show even more "strict accuracy." The accuracy of text big data is tested on text data of different levels, as shown in Table 1 and Figure 4.

Table 1. Detection accuracy of text big data based on text data of different dimensions

Training text data Vector dimension Accuracy rate (%)
80M 50 88.2
80M 100 94.1
80M 300 98.4
Figure 4.

Accuracy analysis chart of text data based on different dimensions

Research in the natural sciences reflects the accuracy and rigor of the Chinese language especially well. Scientific research constantly confronts new problems and is therefore very difficult work, precisely because it is a field full of uncertainties in which accuracy bears the brunt. Accurate control of experimental methods and accurate, reliable measurement are among the most important means by which human beings come to know the objective world, and they are essential conditions for establishing a complete theoretical system. See Table 2 and Figure 5 for the analysis of Chinese language feature accuracy based on text big data.

Table 2. Accuracy analysis of Chinese language by different technologies

Technology Accuracy of time description Accuracy of spatial description Accuracy of thing description
Text big data 78% 88% 69%
OCR technology 65% 76% 56%
Computer recognition technology 55% 57% 45%
Figure 5.

Accuracy analysis of Chinese language by different technologies

Difference Analysis of Conceptualization Features of Chinese Language Based on Text Big Data

A concept is different from the explanation of a word: it is formed from the many properties of an object by discarding the non-essential attributes and abstracting the essential ones. It is a scientific definition of the essential characteristics of things.

Chinese characters are the most widely used characters in the world and can be said to form one of the symbol systems of Chinese culture; they have rich connotations and embody the unique thinking habits of the Chinese nation. The concept of motion in Chinese, for example, refers not only to interaction between people and objects but also to the movement of objects themselves, governed by specific laws of change relating it to the acceleration of the moving object; this holds from the ancient terms "Xu, Qu and Ji" to the modern word "speed," and force and velocity are today not merely scientific concepts but common words. The words "name" and "real" in modern Chinese likewise have special meanings: their original meanings are nominal and their extended meanings verbal or adjectival, and they can be used alone or in combination, with monosyllabic uses predominating, disyllabic uses second, and multisyllabic uses varying, appearing mostly in spoken language but also in written language as lexical units with specific meanings that can be extended to other usages. See Table 3 and Figure 6 for the analysis of the conceptual features of the Chinese language based on text big data.

Table 3. Conceptual analysis of Chinese language by different technologies

Technology Abstraction Analysis of new nouns Concept reorganization and alienation
Text big data 89% 75% 91%
OCR technology 60% 80% 76%
Computer recognition technology 40% 68% 70%
Figure 6.

Conceptual analysis of Chinese language by different technologies

Hierarchy is one of the characteristics of scientific thinking: levels, surface, and interior are the basis of scientific analysis. The higher the level, the more complex the object studied; the lower the level, the simpler the relationships. Different expressions can reflect the differences and connections between things, and this requires expressing them with appropriate language art. Language is no exception: it unfolds layer by layer, deepens layer by layer, and improves step by step. In the era of science and technology, people have learned to use scientific methods to explore the essence of things deeply and divide their characteristics rationally; when expressing ourselves in language, we should therefore pay attention to linguistic hierarchy. See Table 4 and Figure 7 for the hierarchical analysis of the Chinese language with the help of text big data.

Table 4. Hierarchical analysis of Chinese language by text big data

Article type Progressive type Sequential type Parallel type
Short story 73% 50% 81%
Novel 89% 74% 79%
Scientific and technical articles 90% 54% 91%
Figure 7.

Hierarchical analysis of Chinese language by text big data

Difference Analysis of Chinese Language Compatibility Features Based on Text Big Data

Studying the characteristics of the Chinese language in the era of science and technology is of great significance. As the socialization of science accelerates, the Chinese language, in order to meet the needs of the times, has acquired richer content and stricter logical form, and great changes have taken place in its wording and concepts. With the rapid development of modern science and technology and the information industry, people's understanding of the Chinese language has deepened and research on its features has grown more thorough, providing a good platform for further study of the language. Facts have repeatedly confirmed that, amid the rapid development of science and technology and the rapid spread of information, we must absorb nourishment from traditional culture in order to develop, and the characteristics of the Chinese language will continue to evolve. For the analysis of Chinese language compatibility based on text big data, see Table 5 and Figure 8.

Table 5. Compatibility analysis of modern technology with Chinese language

Technology Integration of scientific vocabulary into environment Scientific vocabulary melts into characters Scientific vocabulary expresses feelings Perceptual vocabulary replaces scientific vocabulary
Text big data 89% 75% 88% 80%
OCR technology 56% 76% 67% 62%
Computer recognition technology 45% 65% 46% 44%
Figure 8.

Bar chart of compatibility analysis of modern technology with Chinese language

Application analysis of Chinese language features in the era of big data

When data and information were underdeveloped, Chinese linguistics focused mainly on the study of language knowledge and language theory. In the era of big data, linguistic resources contain a huge amount of information and complex linguistic content, and the focus of Chinese linguistics is no longer limited to language theory and language knowledge but is moving toward the practical application of the Chinese language. The application of computer science injects fresh vitality into linguistics: computational linguistics is becoming ever more popular and is known as artificial-intelligence linguistics, representing the progress of the times and the embodiment of linguistics in the big data era. For the application analysis of Chinese language features based on text big data, see Table 6 and Figure 9.

Table 6. Application of different technologies to Chinese language in the era of big data

Technology Carry forward the cultural spirit Teaching application Adapt to the social environment Measure the level of development
Text big data 86% 92% 89% 93%
OCR technology 71% 75% 57% 67%
Computer recognition technology 68% 77% 64% 55%
Figure 9.

Analysis of the application of different technologies to Chinese language in the era of big data

Conclusion

With the advent of the big data era, our living standards have continuously improved, and the information age has permeated every aspect of life, improving our ways of living and working. Big data is also constantly penetrating culture, which is of great significance for understanding cultural knowledge, especially the features of the Chinese language. Chinese language features are an important vehicle through which we learn cultural knowledge, and they carry an enormous cultural heritage; applying big data technology to Chinese language and culture is undoubtedly an important direction of the current era. Analyzing the feature differences of the Chinese language in text big data is beneficial, and it can also better meet Chinese language research's need for large amounts of integrated data and resources. Through big data, the required resource information can be screened and sorted in a short time, which greatly shortens research time and improves the use value of Chinese language research. Mining Chinese-language knowledge and data from text big data can not only make the Chinese language more practical, but is also of important and far-reaching significance for its future development.