Research on Corpus-Based Linguistic Feature Analysis and Pattern Recognition for English Majors in Colleges and Universities
Published online: 19 Mar 2025
Received: 28 Oct 2024
Accepted: 31 Jan 2025
DOI: https://doi.org/10.2478/amns-2025-0474
© 2025 Yuemei Fu et al., published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
English teaching in universities is an important part of China’s higher education, shouldering the task of cultivating large numbers of excellent international talents for the country [1–2]. China is currently promoting the reform of college English teaching, and the publication of college English textbooks is thriving, with textbooks becoming more rigorous in both design and publication [3–4]. Although the textbooks now in use for college English majors have improved in writing technique and stylistic choices, problems remain, such as content that is not updated in a timely manner and a lack of clearly differentiated levels of text difficulty [5–6].
The corpus serves language teaching, and its advantage is that it can provide learners with massive language resources; scholars discussed the relationship between the two as early as twenty years ago [7–8]. Scholars have pointed out that a disciplinary co-evolutionary relationship has formed between corpus research and language teaching: on the one hand, corpora are applied directly to language teaching, for example as teaching corpora or by teaching students how to use corpora in classroom teaching; on the other hand, they are applied indirectly, for example in the development of dictionaries and the publication of teaching materials [9–11]. In recent years, corpus retrieval studies have centered mainly on lexical retrieval, and most have not explored the syntactic level deeply enough, leaving language learners confused when confronted with large amounts of linguistic information [12–14].
The multidimensional/multifeature analysis method is based on corpora and computation. Using the LLC corpus of spoken English and the LOB corpus of written English, researchers analyzed the distribution and co-occurrence patterns of 67 linguistic features in spoken and written language, followed by a comparative analysis of different texts [15–17]. Each dimension comprises a set of linguistic features, generally including positive and negative features, and the dimension score of each text equals the difference between the factor scores of the positive and the negative features within that dimension [18–20]. This research model is currently the most comprehensive and carefully categorized method for studying register variation: it combines quantitative and qualitative research, rests on statistical analysis, and can compare corpus variation across multiple domains. It is therefore very meaningful to conduct a corpus-based analysis of the text organization of the target textbooks through multidimensional/multifeature analysis [21–23].
In this paper, five methods of corpus feature analysis are presented, and texts for a corpus of English-major learners are crawled to build the English-major corpus. A convolutional neural network is then applied to corpus text recognition, and an English linguistic feature analysis and pattern recognition model based on a shallow convolutional neural network is constructed. Finally, the model is used to help English majors in colleges and universities identify and analyze the linguistic features of corpus texts.
Density analysis
Density analysis [24] counts the number of occurrences of a single word or word chunk and compares it with the total vocabulary of the research factor or corpus to obtain the proportion it occupies. Density analysis shows most intuitively how important a vocabulary item is in a text: if the proportion is large, the item must be repeated many times in the text or research factor, which indicates that it is very important. This function can be combined with the vocabulary search function, which can retrieve not only a single word but also display the density of a word chunk or regular expression. In the process of density analysis, it is useful to introduce the black-and-white location map function, which reflects the searched words or chunks more intuitively. The black-and-white location map is an auxiliary function of density analysis: it counts and displays the positions of a single word or chunk in the text or corpus as black bars on a white background, equivalent to a hotspot analysis, showing the researcher the specific positions of the searched words so that their distribution in the text or corpus can be identified at a glance.
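As a minimal illustration (a sketch in plain Python, not the implementation of any particular retrieval tool), the density statistic and a textual analogue of the black-and-white location map can be computed as follows:

```python
from collections import Counter

def density(tokens, target):
    """Proportion of a text (or research factor) occupied by one word.

    `tokens` is the tokenized text; `target` is the word under study.
    A minimal sketch of the density statistic described above.
    """
    counts = Counter(tokens)
    return counts[target] / len(tokens)

def position_map(tokens, target, n_bins=50):
    """A textual analogue of the black-and-white location map: split the
    text into n_bins equal segments and mark segments containing the word."""
    size = max(1, len(tokens) // n_bins)
    bins = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    return "".join("#" if target in b else "." for b in bins)

text = "the mountains rise and the rivers run below the mountains".split()
print(density(text, "mountains"))            # 0.2
print(position_map(text, "mountains", n_bins=5))  # "#...#"
```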
MI value
Mutual information value, abbreviated as MI value, is a common method for calculating the strength of word collocation, and the unit of the MI value is the bit. The use of the MI value in English studies differs from its use in information science, in particular in its range of values. In information science, the (normalized) MI value lies in the range 0~1, whereas in corpus linguistics the MI value between words is generally greater than 0, and the larger the value, the greater the mutual attraction between the words. Specifically, the MI value relates the frequency of occurrence of a word in the corpus to the probability of its occurrence together with another word.
If the corpus has a capacity of $N$ running words, the node word occurs $f(n)$ times, the collocate occurs $f(c)$ times, and the two actually co-occur $f(n,c)$ times within the chosen span, the MI value is

$$MI(n,c)=\log_2\frac{f(n,c)\cdot N}{f(n)\cdot f(c)}$$

If the total corpus capacity is $N$, the theoretical probability that the node word and the collocate co-occur by chance is $p=\frac{f(n)}{N}\cdot\frac{f(c)}{N}$. The expected frequency of collocate–node co-occurrence can be found by multiplying this theoretical co-occurrence probability by the corpus capacity:

$$E=p\cdot N=\frac{f(n)\cdot f(c)}{N}$$

The standard deviation of the distribution of collocations in the text is further calculated as shown below:

$$\sigma=\sqrt{Np(1-p)}\approx\sqrt{E}$$

The difference between the actual frequency $f(n,c)$ and the expected frequency $E$, measured in units of this standard deviation, indicates how strongly the observed collocation departs from chance co-occurrence.
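As a concrete illustration, the following minimal Python sketch computes the MI value and the corresponding t-score (introduced in the T-value section below) from raw frequency counts; the counts used here are hypothetical:

```python
import math

def mi_score(f_node, f_coll, f_joint, corpus_size):
    """Mutual information (in bits) for a node word and a collocate:
    observed co-occurrence frequency relative to the frequency expected
    by chance, under the standard formulation given above."""
    expected = f_node * f_coll / corpus_size
    return math.log2(f_joint / expected)

def t_score(f_node, f_coll, f_joint, corpus_size):
    """t-score for the same collocation (see the T-value section)."""
    expected = f_node * f_coll / corpus_size
    return (f_joint - expected) / math.sqrt(f_joint)

# Hypothetical counts: node word 140x, collocate 500x, co-occurring 25x
# in a corpus of 100,000 running words.
print(mi_score(140, 500, 25, 100_000))  # ~5.16 bits
print(t_score(140, 500, 25, 100_000))   # ~4.86
```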
Value of the log-likelihood function
In parameter estimation there is a class of methods called “maximum likelihood estimation” [25]. Because the estimating function involved often belongs to an exponential family, taking the logarithm does not affect its monotonicity but makes the calculation simpler, so the logarithm of the likelihood function, called the “log-likelihood function”, is used. The logarithmic function varies with the model involved, but the principle is the same: it is determined from the density function of the dependent variable and involves assumptions about the distribution of the random disturbance term.
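In corpus frequency comparison, the log-likelihood statistic is standardly computed as follows: for a word with observed frequencies $O_1$ and $O_2$ in two corpora of sizes $N_1$ and $N_2$, the expected frequencies and the statistic are

$$E_i=N_i\cdot\frac{O_1+O_2}{N_1+N_2},\qquad LL=2\sum_{i=1}^{2}O_i\ln\frac{O_i}{E_i}$$

A larger $LL$ value indicates a more significant frequency difference between the two corpora.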
T-value
The T-value is the most common relative-status measure among linearly transformed standardized scores, used to indicate the relative position of an individual within its group. In English studies, it describes how a word is weighted within its research factor in comparison with another word in the same factor. The basic principle is that an individual’s raw score lies some number of standard deviations above or below the mean; this number is the Z-value, and the score obtained by linearly rescaling the Z-score is the T-value. In an English corpus, when word frequencies in a text are normally distributed, Z-scores are used regardless of whether the overall standard deviation is known and regardless of sample size, but T-scores are used when the overall standard deviation is unknown and the sample is very small. When T-scores are used to compare different factors in a corpus study, the two factors must be of the same nature or level. The T-value algorithm is as follows:
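In collocation studies the T-value is standardly computed from the observed co-occurrence frequency $f(n,c)$ of a node word and collocate, the expected frequency $\frac{f(n)\,f(c)}{N}$ (as defined in the MI section above), and the approximation that the standard deviation of the observed count is $\sqrt{f(n,c)}$:

$$t=\frac{f(n,c)-\dfrac{f(n)\,f(c)}{N}}{\sqrt{f(n,c)}}$$

A t-score of about 2 or more is conventionally taken as evidence of a significant collocation.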
Database design is very important to the whole system: a good database design brings great convenience to the later maintenance of the system. The databases used in English studies are called corpora. The texts in a corpus share certain characteristics, must be callable at any time, and must have a large enough base to serve as comparative samples, so that corpus studies can generalize over a wide range. Retrieval was carried out mainly with WordSmith Tools 6.0. Six texts with similar literary word counts were collected by random sampling: the four texts The Mountains of California, Journey to Alaska, Trillium, and The Way of Nature were used as the investigation corpus, and the two texts Walden and The Maine Forest were used as the reference corpus.
With the continuous development of deep learning theory, the representation learning ability of CNNs has gradually attracted researchers’ attention and has been widely applied in various fields. Compared with traditional feature extraction and classification methods, a CNN can deal directly with complex raw data, avoiding the information loss caused by manual feature selection. At the same time, the CNN model can reduce model parameters through local receptive fields and weight sharing, which reduces network complexity and allows data samples such as text, speech, images, and video to be processed more efficiently. In this paper, CNNs are used to classify EEG signals, and the effectiveness of the additional silent-reading task and the improved filtering frequency range is verified by statistically analyzing the classification accuracy.
To solve the problem of handwritten digit recognition, LeCun et al. constructed the convolutional neural network LeNet-5 [26], which achieved good recognition results. The basic structure of modern convolutional neural networks is still built on the foundation of LeNet-5. The LeNet-5 CNN network structure is shown in Figure 1.

LeNet-5 CNN network structure model
Usually there are three main layer structures of CNN: convolutional layer, pooling layer and fully connected layer. A complete CNN model can be formed by stacking these three layer structures. The following describes the different layers in the CNN model.
Convolutional layer: Convolution is the core of a CNN. In the convolutional layer, data samples are scanned with convolution kernels to extract the corresponding local features, and different kernels extract different features. During training, the parameters in each kernel are continually updated so that effective features are strengthened and thereby extracted. The formula for convolution is shown below:
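A standard formulation is the following, where $x_i^{l-1}$ denotes the $i$-th input feature map of layer $l$, $M_j$ the set of input maps connected to output map $j$, $k_{ij}^{l}$ the convolution kernel, $b_j^{l}$ the bias, $*$ the convolution operation, and $f(\cdot)$ the activation function:

$$x_j^{l}=f\left(\sum_{i\in M_j}x_i^{l-1}*k_{ij}^{l}+b_j^{l}\right)$$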
Pooling layer: The pooling layer is generally placed after the convolutional layer, so its input is the feature map output by the convolutional layer. Its purpose is to filter the primary features obtained and reduce the amount of data and the number of parameters, thereby improving the robustness of the extracted features. The calculation formula is as follows:
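In the usual notation, with $\mathrm{down}(\cdot)$ denoting the downsampling operation (e.g., taking the maximum or mean over each pooling region), $\beta_j^{l}$ a multiplicative coefficient, and $b_j^{l}$ a bias:

$$x_j^{l}=f\left(\beta_j^{l}\,\mathrm{down}\!\left(x_j^{l-1}\right)+b_j^{l}\right)$$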
Fully connected layer: Each node of the fully connected layer is connected to every node of the previous layer, synthesizing all the output features of that layer into global features. The formula for the fully connected layer is as follows:
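In standard form, with weight matrix $W^{l}$, bias $b^{l}$, and activation function $f(\cdot)$:

$$x^{l}=f\left(W^{l}x^{l-1}+b^{l}\right)$$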
Dropout layer: To prevent overfitting and improve the generalization ability of the network model, this paper introduces a dropout layer that randomly deactivates neurons of the current layer and generates a new target network, thus reducing the number of neuron parameters and the complexity of the network model, as shown in the following equation:
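A standard formulation is the Bernoulli-mask form, in which each neuron of the current layer is kept with probability $p$ and dropped otherwise:

$$r_j^{l}\sim\mathrm{Bernoulli}(p),\qquad \tilde{x}^{l}=r^{l}\odot x^{l}$$

where $\odot$ denotes element-wise multiplication.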
The optimizer of the convolutional neural network is chosen to be Adam (adaptive moment estimation), which adjusts the learning rate of each parameter in real time during training; the loss function adopts the cross-entropy loss; and the rectified linear unit (ReLU) activation function is used throughout this paper. The ReLU activation function is shown below:
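$$f(x)=\max(0,x)$$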
The ReLU function [27] keeps all positive values unchanged but maps negative values to 0. This unilateral inhibition gives the neurons in the network a sparse activation property, and a model sparsified by the ReLU function can better mine features and fit the training data.
A small amount of sample data in the training set can lead to overfitting during the training of a convolutional neural network, and high dimensionality within a single sample can likewise degrade the subsequent performance of the classifier. The structure of the shallow convolutional neural network [28] used here is shown in Fig. 2. In this paper, the 4000 sampling points collected during the 4 s imagination period of one experiment are divided into 10 segments of 400 sampling points, so each EEG data sample has a size of 60 × 400. With 800 samples for each Chinese character in each experimental task, there are 3200 samples in total, which are randomly divided into training, validation, and test sets in the ratio 6:2:2, i.e., 1920, 640, and 640 samples, respectively. A convolutional neural network is trained on the EEG data samples of each subject, and the trained network is then used to classify the test set of that subject’s EEG data samples. For this study, a shallow convolutional neural network is used to classify the four types of EEG signals.

Shallow convolutional neural network structure
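The segmentation and splitting procedure can be sketched as follows; the array shapes are assumptions consistent with the sample counts quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw recording: 60 channels x 4000 points per 4 s trial,
# 80 trials per Chinese character x 4 characters (shapes assumed).
n_trials, n_channels, n_points = 320, 60, 4000
raw = rng.standard_normal((n_trials, n_channels, n_points))
labels = np.repeat(np.arange(4), n_trials // 4)

# Cut each trial into 10 segments of 400 points -> 3200 samples of 60 x 400.
segments = raw.reshape(n_trials, n_channels, 10, 400)
samples = segments.transpose(0, 2, 1, 3).reshape(-1, n_channels, 400)
sample_labels = np.repeat(labels, 10)

# Random 6:2:2 split into training, validation, and test sets.
idx = rng.permutation(len(samples))
n_train, n_val = int(0.6 * len(samples)), int(0.2 * len(samples))
train, val, test = np.split(idx, [n_train, n_train + n_val])
print(len(train), len(val), len(test))  # 1920 640 640
```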
The convolutional neural network model is shown in Fig. 3. The first layer of the CNN is the sample input layer; each EEG data sample is a two-dimensional matrix of 60 channels × 400 temporal sampling points. The input passes through a convolutional layer with 32 convolution kernels of size 3 × 3, yielding 32 feature maps; then sequentially through a convolutional layer with 64 kernels of size 5 × 5 and a pooling layer with 2 × 2 pooling kernels and a stride of 2; then through a fully connected layer and a dropout layer. Finally, the softmax function classifies the features and the classification results are produced at the output layer.

Convolutional neural network model
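A minimal PyTorch sketch of this architecture is given below; the padding scheme, hidden-layer width, dropout rate, and learning rate are assumptions, as the paper does not specify them:

```python
import torch
import torch.nn as nn

class ShallowCNN(nn.Module):
    """Sketch of the shallow CNN described above: 32 kernels of 3x3,
    64 kernels of 5x5, 2x2 max pooling with stride 2, a fully connected
    layer, dropout, and a softmax over the 4 EEG classes."""

    def __init__(self, n_classes: int = 4, dropout_p: float = 0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # 32 feature maps
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, padding=2),  # 64 feature maps
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # halves H and W
        )
        # Input samples are 60 channels x 400 time points -> (1, 60, 400);
        # after 2x2/2 pooling the feature maps are (64, 30, 200).
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 30 * 200, 128),  # hidden width is an assumption
            nn.ReLU(),
            nn.Dropout(dropout_p),
            nn.Linear(128, n_classes),      # softmax is applied in the loss
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = ShallowCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam, per text
criterion = nn.CrossEntropyLoss()  # cross-entropy with internal softmax

x = torch.randn(8, 1, 60, 400)  # a batch of 8 EEG samples
loss = criterion(model(x), torch.randint(0, 4, (8,)))
```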
The absolute co-occurrence frequencies and lexical meanings of the English feature vectors are shown in Table 1. The transitivity code contains four levels, namely intransitive, monotransitive, ditransitive, and complex transitive. The to manage identifier has its highest absolute co-occurrence frequency in the monotransitive level (134 times), while the to lack and to go very rapidly identifiers have their highest absolute co-occurrence frequencies in the intransitive level, 16 and 221 times respectively. The morphological code contains six forms, including the infinitive and the general present tense. To manage, to lack, and to go very rapidly have their highest absolute co-occurrence frequencies in the past participle, the general present tense, and the past tense, with 39, 6, and 92 occurrences, respectively. The sums of the frequencies across the levels within each identification code should be equal. Overall, the three identifiers to manage, to lack, and to go very rapidly co-occur a total of 140, 24, and 244 times in both the transitivity and the morphological codes, respectively.
Absolute co-occurrence frequencies and meanings of the behavioral feature vectors
Identification code | Tag level | to manage | to lack | to go very rapidly
---|---|---|---|---
Transitivity | Intransitive | 2 | 16 | 221
 | Monotransitive | 134 | 3 | 16
 | Ditransitive | 2 | 4 | 3
 | Complex transitive | 2 | 1 | 4
Morphological form | Infinitive | 27 | 3 | 51
 | General present tense | 18 | 6 | 16
 | Present participle | 35 | 5 | 63
 | Past tense | 17 | 4 | 92
 | Past participle | 39 | 4 | 14
 | Imperative | 4 | 2 | 8
To compare and analyze the different frequencies, absolute frequencies need to be converted into relative frequencies. The relative co-occurrence frequencies and word meanings of the English feature vectors, obtained by quantitative statistical analysis, are shown in Table 2. The relative co-occurrence frequency of to manage is highest in the monotransitive level (0.9571), while to lack and to go very rapidly are highest in the intransitive level (0.6667 and 0.9057, respectively). Among the morphological forms, to manage, to lack, and to go very rapidly have their highest relative co-occurrence frequencies in the past participle, the general present tense, and the past tense, at 0.2786, 0.2500, and 0.3770, respectively. The sums of the frequencies across the levels within each identification code should be the same. Overall, the relative co-occurrence frequencies of the three identifiers to manage, to lack, and to go very rapidly each sum to one within the transitivity and morphological codes.
Relative co-occurrence frequencies and meanings of the behavioral feature vectors
Identification code | Tag level | to manage | to lack | to go very rapidly
---|---|---|---|---
Transitivity | Intransitive | 0.0143 | 0.6667 | 0.9057
 | Monotransitive | 0.9571 | 0.1250 | 0.0656
 | Ditransitive | 0.0143 | 0.1667 | 0.0123
 | Complex transitive | 0.0143 | 0.0416 | 0.0164
Morphological form | Infinitive | 0.1929 | 0.1250 | 0.2090
 | General present tense | 0.1285 | 0.2500 | 0.0656
 | Present participle | 0.2500 | 0.2083 | 0.2582
 | Past tense | 0.1214 | 0.1667 | 0.3770
 | Past participle | 0.2786 | 0.1667 | 0.0574
 | Imperative | 0.0286 | 0.0833 | 0.0328
Table 3 shows the statistics of the distribution of content words in the corpora. The sums of the lexical densities of nouns, verbs, adjectives, adverbs, and numerals in the four texts of the investigation corpus (The Mountains of California, Journey to Alaska, Trillium, and The Way of Nature) are 60.89%, 61.45%, 60.88%, and 58.75%, respectively, while the sums for the two texts of the reference corpus are 52.07% and 44.76%. The lexical density of the investigation corpus is thus much higher than that of the reference corpus. These results differ slightly from the STTR values because the two reference texts have the largest token counts, 104,357 and 98,721 respectively, with a relatively large number of function words and heavy modification, whereas The Way of Nature has a small total token count (only 68,352) and correspondingly few function words, so its lexical density is higher.
Distribution of content words in the corpora
Survey item | The Mountains of California | Journey to Alaska | Trillium | The Way of Nature | Walden | The Maine Forest
---|---|---|---|---|---|---
Nouns (%) | 24.49 | 24.75 | 23.08 | 23.16 | 17.39 | 16.64
Verbs (%) | 14.13 | 16.21 | 16.26 | 18.87 | 16.92 | 12.86
Adjectives (%) | 11.92 | 9.84 | 10.24 | 7.38 | 7.39 | 5.96
Adverbs (%) | 8.08 | 7.01 | 8.85 | 7.03 | 7.38 | 6.81
Numerals (%) | 2.27 | 3.64 | 2.45 | 2.31 | 2.99 | 2.49

(The first four texts form the investigation corpus; Walden and The Maine Forest form the reference corpus.)
To understand the general plot of each text, this paper uses the contextual co-occurrence function of the corpus software to retrieve keywords in context (KWIC); the contexts on both sides of the retrieved words are then read and analyzed, so the main plot of the text can be seen in rough outline. The contextual co-occurrence function is used to search for related nouns, adjectives, verbs, adverbs, and phrases by entering the top-ranked keywords. The retrieved content words and phrases are then summarized, categorized, and analyzed to understand their importance and details.
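A minimal sketch of such a KWIC retrieval in plain Python (the corpus software’s own implementation is not shown in the paper) is given below:

```python
def kwic(tokens, keyword, window=4):
    """Keyword-in-context retrieval: print each hit with `window` tokens
    of context on each side, as in the contextual co-occurrence function."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append(f"{left} [{tok}] {right}")
    return hits

text = "the glaciers of the north shine and the forests of the north sleep".split()
for line in kwic(text, "north", window=2):
    print(line)   # "of the [north] shine and" / "of the [north] sleep"
```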
The statistical results of the corpus keyword search are shown in Table 4. A KWIC search of the top five keywords by frequency in the four texts reveals the following. The first-ranked keyword in all four texts is the, with frequencies of 7905, 5990, 4940, and 5316 and corresponding keyness values of 341.05, 108.99, 197.58, and 67.03, respectively. In The Mountains of California and The Way of Nature, the second-ranked keyword is of, with keyness values of 339.78 and 72.38 and frequencies of 4741 and 2421, respectively; in Journey to Alaska and Trillium, the second-ranked keyword is and, with frequencies of 3832 and 2465. The keywords ranked 3–5 in The Mountains of California are and, in, and with, with frequencies of 3849, 2612, and 774 and keyness values of 7.46, 50.03, and 17.41; in Journey to Alaska they are of, in, and on, with frequencies of 2944, 1611, and 865 and keyness values of 82.65, 3.51, and 76.44; in Trillium they are of, a, and in, with frequencies of 2077, 1365, and 1237 and keyness values of 33.49, 24.60, and 45.18; in The Way of Nature they are a, in, and it, with frequencies of 1540, 1348, and 595 and keyness values of 1.59, 27.44, and 26.92.
Statistical results of the keyword search
Text | Survey item | Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5
---|---|---|---|---|---|---
The Mountains of California | Frequency | 7905 | 4741 | 3849 | 2612 | 774
 | Keyness | 341.05 | 339.78 | 7.46 | 50.03 | 17.41
 | Keyword | the | of | and | in | with
Journey to Alaska | Frequency | 5990 | 3832 | 2944 | 1611 | 865
 | Keyness | 108.99 | 7.31 | 82.65 | 3.51 | 76.44
 | Keyword | the | and | of | in | on
Trillium | Frequency | 4940 | 2465 | 2077 | 1365 | 1237
 | Keyness | 197.58 | 1.03 | 33.49 | 24.60 | 45.18
 | Keyword | the | and | of | a | in
The Way of Nature | Frequency | 5316 | 2421 | 1540 | 1348 | 595
 | Keyness | 67.03 | 72.38 | 1.59 | 27.44 | 26.92
 | Keyword | the | of | a | in | it
Through searching, we can quickly gather contextual information about the keywords in each text, which allows us to fully understand the theme, time, place, and other important information of each text. This helps the researcher to grasp the thought flow of the original text, and also helps the translator to highlight such information during the translation process. In addition, the study of featured words in the corpus can also reflect the thematic structure and important ideas of the text.
The results of the be-verb word frequency statistics for the investigation corpus are shown in Table 5. The numbers of sentences in the four texts are 2866, 2968, 2686, and 3022, respectively. In The Mountains of California, the top-ranked verbs are the be verbs is (1038 times) and are (729 times); about 61.65% of the clauses contain these two be verbs, that is, more than half of the clauses are in the present tense. In Journey to Alaska, the top-ranked be verbs are was and were, appearing in 46.66% of the clauses, while is and are appear in 31.70%. In Trillium, is and are appear in about 43.08% of the sentences and was and were in about 26.88%. In The Way of Nature, the be verbs is and are appear in about 54.43% of the sentences. With the exception of Journey to Alaska, the texts are predominantly in the present tense, which suggests that the authors recorded what they saw and heard during their journeys in the form of on-the-spot notes. The author’s expeditions to the glaciers of the Alaska region lasted more than ten years, and he continued to revise the notes taken during successive expeditions, so the past tense accounts for a larger proportion of the sentences in that text.
Statistical results of be-verb word frequency in the corpus
Name of work | Total sentences | is | are | was | were
---|---|---|---|---|---
The Mountains of California | 2866 | 1038 | 729 | 473 | 344
Journey to Alaska | 2968 | 514 | 427 | 836 | 549
Trillium | 2686 | 838 | 319 | 430 | 292
The Way of Nature | 3022 | 1167 | 478 | 306 | 266
Using shallow convolutional neural network corpus retrieval as a tool, this paper analyzes the linguistic features of four American ecological essays in terms of co-occurrence frequency and lexical meaning, content-word distribution, keywords, and sentence retrieval and analysis of the study samples. The main conclusions are as follows:
In pattern recognition, the to manage identifier is highest in the monotransitive level, while to lack and to go very rapidly are highest in the intransitive level, with absolute co-occurrence frequencies of 134, 16, and 221, respectively. Quantitative statistical analysis yields the highest relative co-occurrence frequencies of 0.9571 for to manage (monotransitive) and 0.6667 and 0.9057 for to lack and to go very rapidly (intransitive). The relative co-occurrence frequencies of the three identifiers among the morphological forms are highest in the past participle, the general present tense, and the past tense, at 0.2786, 0.2500, and 0.3770, respectively. The sums of the lexical densities of nouns, verbs, adjectives, adverbs, and numerals in the four texts of the investigation corpus are 60.89%, 61.45%, 60.88%, and 58.75%, while those of the two reference texts are 52.07% and 44.76%; the lexical density of the investigation corpus is thus much higher than that of the reference corpus. The keyword the is ranked first in all four texts, with frequencies between 4940 and 7905 and keyness values between 67.03 and 341.05. The contextual information of the keywords in each text can be captured quickly, enabling a full understanding of important information such as theme, time, and place; and the analysis of be-verb frequency retrieval can help college students quickly analyze and understand the linguistic features of a corpus.