Research on Corpus-Based Linguistic Feature Analysis and Pattern Recognition in English Majors in Colleges and Universities

19 March 2025

Introduction

English teaching in colleges and universities is an important part of China's higher education and shoulders the task of cultivating large numbers of excellent international talents for the country [1-2]. China is currently promoting the reform of college English teaching, and the publication of college English textbooks is thriving, with textbooks becoming more rigorous in both design and publication [3-4]. Although the textbooks now in use for college English majors have improved in writing technique and stylistic choice, problems remain, such as content that is not updated in time and a lack of clearly differentiated levels of text difficulty [5-6].

A corpus serves language teaching, and its advantage is that it can provide learners with massive language resources; scholars discussed the relationship between the two as early as twenty years ago [7-8]. Scholars have pointed out that corpus research and language teaching have formed a disciplinary co-evolutionary relationship: on the one hand, corpora are applied directly to language teaching, for example as teaching corpora or by teaching learners how to use a corpus in classroom teaching; on the other hand, they are applied indirectly, for example in the development of dictionaries and the publication of teaching materials [9-11]. In recent years, retrieval studies on corpora have centered mainly on lexical retrieval, and most studies have not explored the syntactic level deeply enough, so language learners are still confused when faced with large amounts of linguistic information [12-14].

The multidimensional/multifeature analysis method is based on corpora and computing. Drawing on the LLC corpus of spoken English and the LOB corpus of written English, researchers analyzed the distribution and co-occurrence patterns of 67 linguistic features in spoken and written language, followed by a comparative analysis of different texts [15-17]. Each dimension comprises a set of linguistic features, generally including positive and negative features, and the dimension score of a text equals the difference between the factor scores of the positive and negative features within that dimension [18-20]. This research model is currently the most comprehensive and carefully categorized method for studying register variation; it combines quantitative and qualitative research, is grounded in statistical analysis, and can be used to compare corpus variation across multiple registers. It is therefore very meaningful to conduct a corpus-based analysis of the texts of target textbooks through multidimensional/multifeature analysis [21-23].

In this paper, five methods of corpus feature analysis are presented, and texts are crawled to build a corpus of English-major learners. A convolutional neural network is then applied to corpus text recognition, and an English linguistic feature analysis and pattern recognition model based on a shallow convolutional neural network is constructed. Finally, the model is used to help English majors in colleges and universities identify and analyze the linguistic features of corpus texts.

Method
Indicators for linguistic characterization of the corpus

Density analysis

Density analysis [24] counts the number of occurrences of an individual word or word chunk and expresses it as a proportion of the vocabulary of the whole research factor or corpus; it is the most direct way to see the importance of a vocabulary item in a text. If the proportion is large, the item must be repeated many times in the text or research factor, which indicates that it is important. This function can be combined with the vocabulary search function, so that not only single words but also word chunks or regular expressions can be retrieved and their density displayed. In density analysis it is useful to introduce a black-and-white position map, which reflects the searched words or chunks more intuitively. The black-and-white position map is an accompanying function of density analysis: it counts the positions of individual words or chunks in the text or corpus and displays them as black bars on a white background, similar to a hotspot analysis, showing the researcher the specific positions of the searched words in the text and making it easy to identify their distribution in the text or corpus.
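As a minimal sketch of how such a density count and position map could be computed (the tokenizer, the sample text, and the bar rendering below are illustrative assumptions, not the tool used in this study):

```python
import re

def density_and_positions(text, target):
    """Count occurrences of `target` (a word or space-separated chunk),
    its density relative to the total token count, and its start positions."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    chunk = target.lower().split()
    n = len(chunk)
    positions = [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == chunk]
    density = len(positions) * n / len(tokens) if tokens else 0.0
    return len(positions), density, positions, len(tokens)

def position_map(positions, total_tokens, width=60):
    """A simple black-and-white position map: '#' marks text segments
    containing the target, '.' marks segments without it."""
    marks = ["."] * width
    for p in positions:
        marks[min(int(p / total_tokens * width), width - 1)] = "#"
    return "".join(marks)

text = "The mountains rise above the valley, and the mountains keep their snow."
count, density, pos, total = density_and_positions(text, "the mountains")
print(count, round(density, 3), position_map(pos, total))
```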

MI value

Mutual information value, abbreviated as MI value, is a common measure of the strength of word collocation, and its unit is the bit. The use of the MI value in English linguistics differs from its use in information science, in particular in the range of values it takes: in information science the MI value lies in the range 0~1, whereas in English linguistics the MI value between words is not restricted to this range, and the larger the value, the stronger the mutual attraction between the words. Specifically, the MI value relates the frequency of occurrence of a word in the corpus to the probability of occurrence of another word: $$I(a,b)=\log_2\frac{P(a,b)}{P(a)P(b)\cdot 2S}=\log_2\frac{F(a,b)\cdot W}{F(a)F(b)\cdot 2S}$$

If the corpus has a capacity of W running words, F(a) is the observed frequency of a multi-word sequence or lexical pattern a, F(b) is the observed frequency of its collocate b, and F(a,b) is the frequency with which the two items co-occur in the corpus, then the MI value can be computed as: $$I(a,b)=\log_2\frac{W\cdot F(a,b)}{F(a)\cdot F(b)}$$
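A minimal sketch of this frequency-based MI computation (the counts in the example are illustrative):

```python
import math

def mi_value(f_ab, f_a, f_b, w):
    """Mutual information I(a,b) = log2(W * F(a,b) / (F(a) * F(b))), in bits."""
    return math.log2(w * f_ab / (f_a * f_b))

# Illustrative counts: a 100,000-word corpus in which word a occurs 400 times,
# word b occurs 50 times, and the pair co-occurs 20 times.
print(round(mi_value(20, 400, 50, 100_000), 2))  # larger values = stronger collocation
```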

Z-value

If the total corpus capacity is W and the observed frequency of a collocate in the corpus is C1, then the average frequency of that collocate at each word position is C1/W. If the collocation span is limited to S, the expected frequency of co-occurrence of that collocate with each occurrence of the node word is C1·(2S+1)/W, where 2S is the span set to the left and right of the node word and 1 is the position occupied by the node word itself; in this paper the calculation is also applied to word chunks and similar sentence patterns, so the position occupied need not be a single word. If the node word occurs N times in the corpus, the theoretical probability P of observing a co-occurrence of the collocate with the node word is: $$P=\frac{C_1(2S+1)}{W}\cdot\frac{N}{W}$$

The expected frequency E of co-occurrence between the collocate and the node word is obtained by multiplying the theoretical co-occurrence probability P by the corpus capacity W: $$E=\frac{C_1(2S+1)N}{W}$$

The standard deviation of the distribution of the collocate in the text is then: $$SD=\sqrt{(2S+1)N\cdot\frac{C_1}{W}\left(1-\frac{C_1}{W}\right)}$$

The difference between the actual co-occurrence frequency C2 of the collocate with the node word and the expected frequency E is divided by the standard deviation to obtain the Z value, whose magnitude indicates the strength of the word collocation. To be significant at the 0.01 level, the Z value must be equal to or greater than 2.576; setting this threshold allows the researcher to retain significant collocations and filter out chance collocations that have no effect on the node word. The final Z value is: $$Z=\frac{C_2-E}{SD}$$
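A minimal sketch of this Z-value computation, combining the three formulas above (the counts in the example are illustrative):

```python
import math

def collocation_z(c1, c2, n, w, s):
    """Z value for a collocate: c1 = collocate frequency in the corpus,
    c2 = observed co-occurrence frequency with the node word,
    n = node-word frequency, w = corpus size, s = span on each side."""
    p = c1 / w                              # chance of the collocate at any word position
    expected = c1 * (2 * s + 1) * n / w     # E = C1(2S+1)N / W
    sd = math.sqrt((2 * s + 1) * n * p * (1 - p))
    return (c2 - expected) / sd

# Illustrative counts: collocate seen 300 times in a 100,000-word corpus,
# node word seen 150 times, 25 observed co-occurrences within a 4-word span.
z = collocation_z(c1=300, c2=25, n=150, w=100_000, s=4)
print(round(z, 2), "significant" if z >= 2.576 else "not significant")
```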

Value of the log-likelihood function

Log-likelihood function [25]: in parameter estimation there is a class of methods called maximum likelihood estimation. Because the estimating function involved often belongs to an exponential family, taking the logarithm does not affect its monotonicity but makes the calculation simpler, so the logarithm of the likelihood function, called the log-likelihood function, is used. The exact logarithmic function varies with the model involved, but the principle is the same: it is determined by the density function of the dependent variable and involves assumptions about the distribution of the random disturbance term. In the standard corpus-comparison form, with c and d the observed frequencies of an item in two corpora of sizes A and B respectively, the log-likelihood value is: $$\text{log-likelihood}=2c\log_e\frac{c}{\dfrac{A(c+d)}{A+B}}+2d\log_e\frac{d}{\dfrac{B(c+d)}{A+B}}$$
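A minimal sketch of this log-likelihood (keyness) calculation in that standard form (the example counts are illustrative):

```python
import math

def log_likelihood(c, d, a, b):
    """Log-likelihood keyness: c, d = observed frequencies of the item in two
    corpora of sizes a and b (running words)."""
    e1 = a * (c + d) / (a + b)   # expected frequency in corpus 1
    e2 = b * (c + d) / (a + b)   # expected frequency in corpus 2
    ll = 0.0
    if c > 0:
        ll += 2 * c * math.log(c / e1)
    if d > 0:
        ll += 2 * d * math.log(d / e2)
    return ll

# Illustrative: a word occurring 120 times in a 50,000-word examined corpus
# versus 40 times in a 100,000-word reference corpus.
print(round(log_likelihood(120, 40, 50_000, 100_000), 2))
```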

T-value

The T-value is the most common relative-status measure based on linearly transformed standardized scores and indicates the relative position of an individual within its group. In English studies it describes how a word is weighted in the research factor in which it appears, compared with another word in the same factor. The basic principle is that an individual raw score lies a certain number of standard deviations above or below the mean, which is the Z score, and the standardized score obtained by linearly rescaling this Z score is the T value. In an English corpus, when word frequencies in a text are normally distributed, Z scores are used regardless of whether the overall standard deviation is known and regardless of sample size, whereas T scores are used when the overall standard deviation is unknown and the sample is very small. When T scores are used to compare different factors in a corpus study, the two factors must be of the same nature or level. In the standard collocation form, the T score is calculated as: $$T\text{-score}=\frac{F(n,c)-\dfrac{F(n)\cdot F(c)}{N}}{\sqrt{F(n,c)}}$$ where F(n,c) is the co-occurrence frequency of the node word n and the collocate c, F(n) and F(c) are their individual frequencies, and N is the corpus size.
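A minimal sketch of this collocation T-score (again with illustrative counts):

```python
import math

def t_score(f_nc, f_n, f_c, n_corpus):
    """T-score = (F(n,c) - F(n)*F(c)/N) / sqrt(F(n,c))."""
    expected = f_n * f_c / n_corpus
    return (f_nc - expected) / math.sqrt(f_nc)

# Illustrative: node word seen 400 times, collocate 50 times, 20 co-occurrences
# in a 100,000-word corpus; values above roughly 2 are usually taken as significant.
print(round(t_score(20, 400, 50, 100_000), 2))
```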

Corpus text design

Database design is very important for the whole system, and a good design greatly eases later maintenance. The databases used in English studies are called corpora, and the texts in a corpus share certain characteristics: they must be retrievable at any time, provide a sufficiently large base, and be usable as comparative samples, so that studies based on the corpus are generalizable and wide-ranging. The corpus was retrieved mainly with WordSmith Tools 6.0, and six texts with similar word counts were collected by random sampling: four texts, "The Mountains of California", "Journey to Alaska", "Trillium", and "The Way of Nature", form the examined corpus, and two texts, "Walden" and "The Maine Forest", serve as the reference corpus.

Pattern recognition based on convolutional neural networks

With the continuous development of deep learning theory, the representation learning ability of CNNs has gradually attracted researchers' attention and has been widely applied in various fields. Compared with traditional feature extraction and classification methods, a CNN can deal directly with complex raw data, avoiding the information loss caused by manual feature selection. At the same time, the CNN model reduces the number of model parameters through local receptive fields and weight sharing, which lowers network complexity and makes the processing of data samples such as text, speech, images, and video more efficient. In this paper, CNNs are used to classify EEG signals, and the effectiveness of the additional silent-reading task and the improved filtering frequency range is verified by statistically analyzing the classification accuracy.

Convolutional Neural Network Models

To solve the problem of handwritten digit recognition, the convolutional neural network LeNet-5 was constructed [26] and achieved good recognition results. The basic structure of modern convolutional neural networks is still built on the basis of LeNet-5. The LeNet-5 CNN network structure model is shown in Figure 1.

Figure 1. LeNet-5 CNN network structure model

A CNN usually has three main layer structures: the convolutional layer, the pooling layer, and the fully connected layer. A complete CNN model can be formed by stacking these three layer structures. The different layers of the CNN model are described below.

Convolutional Layer

Convolution is the core of the CNN. In the convolutional layer, data samples are scanned with convolution kernels to extract the corresponding local features, and different kernels extract different features. During training, the parameters in the convolution kernels are constantly updated so that effective features are strengthened and extracted. The convolution is computed as: $$x_j^l=f\left(\sum_{i\in M_j} x_i^{l-1}\ast\omega_{ij}^l+b_j^l\right)$$ where $x_j^l$ is the j-th feature map in layer l, $\omega_{ij}^l$ is the convolution kernel connecting the i-th input feature map to it, $b_j^l$ is the bias term of the j-th feature map in layer l, f is the activation function, and $M_j$ denotes the set of selected input feature maps.
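A minimal NumPy sketch of this operation for a single output feature map (the "valid" window, the random inputs, and the ReLU choice are illustrative assumptions):

```python
import numpy as np

def conv_feature_map(inputs, kernels, bias):
    """One output feature map: sum of valid 2-D convolutions of each input map
    with its kernel, plus a bias, passed through ReLU (x_j = f(sum_i x_i * w_ij + b_j))."""
    h = inputs.shape[1] - kernels.shape[1] + 1
    w = inputs.shape[2] - kernels.shape[2] + 1
    out = np.full((h, w), bias, dtype=float)
    for x, k in zip(inputs, kernels):               # sum over input feature maps i
        for r in range(h):
            for c in range(w):
                out[r, c] += np.sum(x[r:r + k.shape[0], c:c + k.shape[1]] * k)
    return np.maximum(out, 0.0)                     # ReLU activation

inputs = np.random.rand(2, 6, 6)    # two 6x6 input feature maps
kernels = np.random.rand(2, 3, 3)   # one 3x3 kernel per input map
print(conv_feature_map(inputs, kernels, bias=0.1).shape)  # (4, 4)
```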

Pooling layer

The pooling layer is generally placed after the convolutional layer, so its input is the feature map output by the convolutional layer; its purpose is to filter the obtained primary features and reduce the amount of data and parameters, thereby improving the robustness of the extracted features. It is computed as: $$x_j^l=f\left(\beta_j^l\,\mathrm{down}\left(x_j^{l-1}\right)+b_j^l\right)$$ where $x_j^{l-1}$ is the j-th feature map output by the convolutional layer preceding pooling layer l, $\beta_j^l$ is the weight of the j-th feature map of pooling layer l, $b_j^l$ is the corresponding bias, and down(·) is the pooling function. In this paper, average pooling is used.

Fully connected layer

Each node of the fully connected layer is connected to every node of the previous layer, and all output features of the previous layer are combined into global features. Full connectivity is computed as: $$o_j^{(l)}=f\left(\sum_{i=1}^{n}x_i^{(l-1)}\,\omega_{ji}^{(l)}+b^{(l)}\right)$$ where $o_j^{(l)}$ is the output of neuron j in fully connected layer l, $x_i^{(l-1)}$ is neuron i in the previous layer, n is the number of neurons in the previous layer, $\omega_{ji}^{(l)}$ is the connection weight from neuron i in the previous layer to neuron j in the current layer, $b^{(l)}$ is the bias term, and f is the activation function.

Dropout layer

To prevent overfitting and improve the generalization ability of the network model, this paper introduces a dropout layer that randomly deactivates neurons of the current layer and generates a new target network, thereby reducing the number of neuron parameters and the complexity of the model, as shown below: $$P\left(r_j^{(l)}=x\right)=\begin{cases}p^x(1-p)^{1-x}, & x\in\{0,1\}\\ 0, & \text{otherwise}\end{cases}\qquad \tilde{h}^{(l)}=r^{(l)}\odot h^{(l)}$$

where p is the proportion of discarded neurons, set according to the needs of the study; the elements of the vector $r^{(l)}$ are drawn from a Bernoulli distribution with parameter p, and $r^{(l)}$ is used to mask the output $h^{(l)}$ of the previous layer to obtain a new sub-network.

The optimization function of the convolutional neural network is Adam (adaptive moment estimation), which adjusts the learning rate of each parameter in real time during training. The loss function is the cross-entropy loss, shown below, where a is the actual output of the network, y is the desired output, and n is the number of training samples: $$C=-\frac{1}{n}\sum_x\left[y\ln a+(1-y)\ln(1-a)\right]$$

The rectified linear unit (ReLU) activation function is used throughout this paper: $$f(x)=\begin{cases}x, & x>0\\ 0, & x\le 0\end{cases}$$

The ReLU function [27] keeps all positive values unchanged and maps negative values to 0. This unilateral inhibition gives the neurons in the network a sparse activation property, and the sparse model obtained through ReLU is better able to mine features and fit the training data.

Design and construction of shallow convolutional neural networks

A small amount of sample data in the training set can lead to overfitting during the training of the convolutional neural network, and high dimensionality of a single sample can also degrade the performance of the subsequent classifier. The structure of the shallow convolutional neural network [28] used here is shown in Fig. 2. In this paper, the 4000 sampling points collected during the 4 s imagining period of one experiment are divided into 10 segments of 400 sampling points, so that each EEG data sample has a size of 60 × 400; with 800 samples for each Chinese character in each experimental task, there are 3200 samples in total. The samples are randomly divided into training, validation, and test sets in the ratio 6:2:2, i.e., 1920, 640, and 640 samples respectively. A convolutional neural network is trained on the EEG data samples of each subject, and the trained network is then used to classify the test set of that subject's EEG data samples. For this study, a shallow convolutional neural network is used to classify the four types of EEG signals.

Figure 2. Shallow convolutional neural network structure
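A minimal sketch of the segmentation and 6:2:2 split described above (the array shapes follow the description; the random seed and NumPy usage are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative raw data: 320 trials x 60 channels x 4000 sampling points (4 s).
raw = rng.standard_normal((320, 60, 4000))
labels = np.repeat(np.arange(4), 80)          # four imagined characters, 80 trials each

# Split each 4000-point trial into 10 segments of 400 points -> 3200 samples of 60 x 400.
samples = raw.reshape(320, 60, 10, 400).transpose(0, 2, 1, 3).reshape(-1, 60, 400)
sample_labels = np.repeat(labels, 10)         # 800 samples per character

# Shuffle and divide into training / validation / test sets at 6:2:2.
order = rng.permutation(len(samples))
samples, sample_labels = samples[order], sample_labels[order]
n_train, n_val = int(0.6 * len(samples)), int(0.2 * len(samples))
x_train, x_val, x_test = np.split(samples, [n_train, n_train + n_val])
y_train, y_val, y_test = np.split(sample_labels, [n_train, n_train + n_val])
print(x_train.shape, x_val.shape, x_test.shape)  # (1920, 60, 400) (640, 60, 400) (640, 60, 400)
```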

The convolutional neural network model is shown in Fig. 3. The first layer of the CNN is the sample input layer; each EEG data sample is a two-dimensional matrix of 60 channels × 400 temporal sampling points. The input passes through a convolutional layer with 32 convolution kernels of size 3 × 3, yielding 32 feature maps, and then sequentially through a convolutional layer with 64 convolution kernels of size 5 × 5 and a pooling layer with 2 × 2 kernels and stride 2. Finally the data pass through a fully connected layer and a dropout layer, the softmax function classifies the features, and the output layer produces the classification result.

Figure 3. Convolutional neural network model
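As a minimal PyTorch sketch of this architecture, under assumptions not specified in the paper (padding, the placement of dropout relative to the fully connected layer, and the learning rate; average pooling follows the earlier description):

```python
import torch
import torch.nn as nn

class ShallowCNN(nn.Module):
    """Shallow CNN for 60 x 400 EEG samples: conv(32 kernels, 3x3) ->
    conv(64 kernels, 5x5) -> average pooling (2x2, stride 2) ->
    dropout -> fully connected layer -> softmax over 4 classes."""
    def __init__(self, n_classes=4, drop_p=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AvgPool2d(kernel_size=2, stride=2),      # 60x400 -> 30x200
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(drop_p),                         # dropout placement is an assumption
            nn.Linear(64 * 30 * 200, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = ShallowCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam optimizer, as in the paper
criterion = nn.CrossEntropyLoss()                           # cross-entropy loss (softmax applied internally)

x = torch.randn(8, 1, 60, 400)           # a batch of 8 EEG samples (1 channel, 60 x 400)
y = torch.randint(0, 4, (8,))            # four imagined-character classes
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(loss.item())
```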

Results and discussion
Absolute co-occurrence frequency and lexical meaning of English feature vectors

The absolute co-occurrence frequencies and lexical meanings of the English feature vectors are shown in Table 1. The results show that the transitivity identifier comprises four categories: intransitive, monotransitive, ditransitive, and complex transitive. The identifier to manage has its highest absolute co-occurrence frequency in the monotransitive category (134 occurrences), while to lack and to go very rapidly are highest in the intransitive category, with 16 and 221 occurrences respectively. The form identifier comprises six categories, including the infinitive and the general present tense. To manage, to lack, and to go very rapidly have their highest absolute co-occurrence frequencies in the past participle, general present tense, and past tense, with 39, 6, and 92 occurrences respectively. For each verb, the frequencies summed over the levels of each identification code are equal: overall, the three identifiers to manage, to lack, and to go very rapidly co-occur 140, 24, and 244 times in total, whether counted by transitivity or by form.

Table 1. Absolute co-occurrence frequency and lexical meaning of the English feature vectors

Identification code   Tag level            To manage   To lack   To go very rapidly
Transitivity          Intransitive         2           16        221
                      Monotransitive       134         3         16
                      Ditransitive         2           4         3
                      Complex transitive   2           1         4
Form                  Infinitive           27          3         51
                      General present      18          6         16
                      Present perfect      35          5         63
                      Past tense           17          4         92
                      Past participle      39          4         14
                      Imperative           4           2         8
Relative co-occurrence frequency and lexical meaning of English feature vectors

To compare and analyze different frequencies, absolute frequencies need to be converted into relative frequencies. The relative co-occurrence frequencies and lexical meanings of the English feature vectors are shown in Table 2; the quantitative analysis was conducted using statistical methods. To manage has its highest relative co-occurrence frequency in the monotransitive category (0.9571), while to lack and to go very rapidly are highest in the intransitive category (0.6667 and 0.9057 respectively). Among the forms, to manage, to lack, and to go very rapidly have their highest relative co-occurrence frequencies in the past participle, general present tense, and past tense, at 0.2786, 0.2500, and 0.3770 respectively. For each verb, the relative frequencies summed over the levels of each identification code are the same: overall, the relative co-occurrence frequencies of to manage, to lack, and to go very rapidly total one whether counted by transitivity or by form.

Table 2. Relative co-occurrence frequency and lexical meaning of the English feature vectors

Identification code   Tag level            To manage   To lack   To go very rapidly
Transitivity          Intransitive         0.0143      0.6667    0.9057
                      Monotransitive       0.9571      0.1250    0.0656
                      Ditransitive         0.0143      0.1667    0.0123
                      Complex transitive   0.0143      0.0416    0.0164
Form                  Infinitive           0.1929      0.1250    0.2090
                      General present      0.1285      0.2500    0.0656
                      Present perfect      0.2500      0.2083    0.2582
                      Past tense           0.1214      0.1667    0.3770
                      Past participle      0.2786      0.1667    0.0574
                      Imperative           0.0286      0.0833    0.0328
Characterization of English for Corpus Retrieval
Corpus Content Word Distribution Retrieval

Table 3 shows the statistical results for the distribution of content words in the corpora. The sums of the lexical densities of nouns, verbs, adjectives, adverbs, and numerals in the four examined texts, "The Mountains of California", "Journey to Alaska", "Trillium", and "The Way of Nature", are 60.89%, 61.45%, 60.88%, and 58.75% respectively, while the sums for the two reference-corpus texts are 52.07% and 44.76%. The lexical density of the examined corpus is thus much higher than that of the reference corpus. These results differ slightly from the STTR values, because the two reference-corpus texts have the largest token counts, 104,357 and 98,721 respectively, with relatively many function words and heavy modification, whereas "The Way of Nature" has only 68,352 tokens, so the number and proportion of its function words are smaller and its lexical density is correspondingly higher.

Table 3. Distribution of content words in the examined and reference corpora

Survey item      Examined corpus                                                                     Reference corpus
                 Mountains of California   Journey to Alaska   Trillium   The Way of Nature          Walden   Maine Forest
Nouns (%)        24.49                     24.75               23.08      23.16                      17.39    16.64
Verbs (%)        14.13                     16.21               16.26      18.87                      16.92    12.86
Adjectives (%)   11.92                     9.84                10.24      7.38                       7.39     5.96
Adverbs (%)      8.08                      7.01                8.85       7.03                       7.38     6.81
Numerals (%)     2.27                      3.64                2.45       2.31                       2.99     2.49
Keyword search analysis

To understand the general plot of a text, this paper uses the contextual co-occurrence function of the corpus software to retrieve keywords in context (KWIC); reading and analyzing the context on both sides of the retrieved words gives a rough picture of the main plot of the text. The contextual co-occurrence function is used to search for related nouns, adjectives, verbs, adverbs, and phrases by entering the top-ranked keywords. The retrieved content words and phrases are then summarized, categorized, and analyzed to understand their importance and details.
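A minimal sketch of a KWIC retrieval function of this kind (the window size, tokenization, and sample sentence are illustrative assumptions, not WordSmith's internals):

```python
import re

def kwic(text, keyword, window=5):
    """Return key-word-in-context lines: `window` tokens of left and right
    context around each occurrence of `keyword`."""
    tokens = re.findall(r"[a-zA-Z']+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>40} [{tok}] {right}")
    return lines

sample = ("The glaciers of the north shone in the morning light, "
          "and the forest stretched away toward the glaciers again.")
for line in kwic(sample, "glaciers", window=4):
    print(line)
```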

The statistical results of the corpus keyword search are shown in Table 4. A KWIC search of the five most frequent keywords in the four texts shows that the first-ranked keyword in all four texts is the, with frequencies of 7905, 5990, 4940, and 5316 and corresponding keyness values of 341.05, 108.99, 197.58, and 67.03. In "The Mountains of California" and "The Way of Nature" the second-ranked keyword is of, with keyness values of 339.78 and 72.38 and frequencies of 4741 and 2421 respectively, while in "Journey to Alaska" and "Trillium" the second-ranked keyword is and, with frequencies of 3832 and 2465 respectively. The keywords ranked 3-5 in "The Mountains of California" are and, in, and with, with frequencies of 3849, 2612, and 774 and keyness values of 7.46, 50.03, and 17.41; in "Journey to Alaska" they are of, in, and on, with frequencies of 2944, 1611, and 865 and keyness values of 82.65, 3.51, and 76.44; in "Trillium" they are of, a, and in, with frequencies of 2077, 1365, and 1237 and keyness values of 33.49, 24.60, and 45.18; and in "The Way of Nature" they are a, in, and it, with frequencies of 1540, 1348, and 595 and keyness values of 1.59, 27.44, and 26.92.

Table 4. Keyword search statistics

Text                      Item        Rank 1   Rank 2   Rank 3   Rank 4   Rank 5
Mountains of California   Frequency   7905     4741     3849     2612     774
                          Keyness     341.05   339.78   7.46     50.03    17.41
                          Keyword     the      of       and      in       with
Journey to Alaska         Frequency   5990     3832     2944     1611     865
                          Keyness     108.99   7.31     82.65    3.51     76.44
                          Keyword     the      and      of       in       on
Trillium                  Frequency   4940     2465     2077     1365     1237
                          Keyness     197.58   1.03     33.49    24.60    45.18
                          Keyword     the      and      of       a        in
The Way of Nature         Frequency   5316     2421     1540     1348     595
                          Keyness     67.03    72.38    1.59     27.44    26.92
                          Keyword     the      of       a        in       it

Through searching, we can quickly gather contextual information about the keywords in each text, which allows us to fully understand the theme, time, place, and other important information of each text. This helps the researcher grasp the flow of thought in the original text and helps the translator highlight such information during translation. In addition, studying the featured words of a corpus can also reveal the thematic structure and important ideas of a text.

Be verb word frequency retrieval analysis

The be-verb frequency statistics for the examined corpus are shown in Table 5. The numbers of sentences in the four texts are 2866, 2968, 2686, and 3022 in order. In "The Mountains of California" the most frequent verbs are the be verbs is (1038 occurrences) and are (729 occurrences); about 61.65% of the clauses contain these two be verbs, that is, more than half of the clauses are in the present tense. In "Journey to Alaska" the top-ranked verbs are was and were, which appear in 46.66% of the clauses, while is and are appear in 31.70%. In "Trillium", is and are account for about 43.08% of the sentences and was and were for about 26.88%. In "The Way of Nature" the be verbs is and are account for about 54.43% of the sentences. With the exception of "Journey to Alaska", the texts are predominantly in the present tense, indicating that the author recorded what he saw and heard during his journeys in the form of on-the-spot notes. The author's expeditions to the glaciers of the Alaskan region lasted more than ten years, and he continued to revise the notes taken during successive expeditions, so the past tense accounts for a larger proportion of the sentences in that text.

Table 5. Be-verb frequency statistics for the examined corpus

Name of work              Total sentences   is     are   was   were
Mountains of California   2866              1038   729   473   344
Journey to Alaska         2968              514    427   836   549
Trillium                  2686              838    319   430   292
The Way of Nature         3022              1167   478   306   266
Conclusion

Using shallow convolutional neural network corpus retrieval as a tool, this paper analyzes the linguistic features of four American ecological essays in terms of co-occurrence frequency and lexical meaning, content-word distribution, keywords, and sentence retrieval over the study samples of the texts. The main conclusions are as follows:

In pattern recognition, the identifier to manage has its highest frequency in the monotransitive category, while to lack and to go very rapidly are highest in the intransitive category, with absolute co-occurrence frequencies of 134, 16, and 221 respectively. Quantitative analysis with statistical methods shows that the corresponding highest relative co-occurrence frequencies are 0.9571 (monotransitive) for to manage and 0.6667 and 0.9057 (intransitive) for to lack and to go very rapidly. Among the morphological forms, the three identifiers have their highest relative co-occurrence frequencies in the past participle, general present tense, and past tense, at 0.2786, 0.2500, and 0.3770 respectively.

The sums of the lexical densities of nouns, verbs, adjectives, adverbs, and numerals in the four examined texts are 60.89%, 61.45%, 60.88%, and 58.75% respectively, while the sums for the two reference-corpus texts are 52.07% and 44.76%; the lexical density of the examined corpus is thus much higher than that of the reference corpus. The keyword the ranks first in all four texts, with frequencies between 4940 and 7905 and keyness values between 67.03 and 341.05. Keyword retrieval makes it possible to quickly capture the contextual information of the keywords in each text and thus fully understand important information such as theme, time, and place, and be-verb frequency retrieval can help college students quickly analyze and understand the linguistic features of a corpus text.
