Recommendations for Big Data-Driven English Learning Behavior Analysis and Personalized Teaching Strategy

Learning behavior analysis is currently a prominent topic in foreign language teaching research, and in the context of the ongoing teaching reform it is crucial for blended teaching. However, learning behavior is an abstract concept, so it is difficult for teachers to use it directly as a criterion for guiding students, and how to apply the findings of learning behavior research in teaching has been a focus of research [1-3]. Constructivist learning theory holds that a learner can develop knowledge only by constructing it internally [4]. In other words, learners’ experience occupies a very important position in the learning process, and teachers must attend to it in order to truly participate in that process [5-7].

In recent years, big data has become a popular tool in foreign language teaching research. Rich data can not only give teachers suggestions on teaching content and methods, but also support statistics on students’ learning data, which in turn facilitates foreign language teaching [8-10]. Big data technology can store and process huge amounts of data; it is a valuable resource and a useful tool, and it marks another technological revolution that can enhance users’ decision-making power, insight and discovery [11]. As education informatization accelerates, a large number of big data applications have appeared in the field of education. Big data in education records, integrates and analyzes the information and data generated in education and teaching activities and reflects students’ learning status and results over a given period. It thereby helps educators work more effectively while following the laws of students’ development and learning, implement personalized education, and reach the goal of comprehensive and holistic student development [12-16]. Therefore, exploring the characteristics of English majors’ learning behavior through big data analysis and forming personalized teaching strategies based on the results can provide suggestions for the distinctive teaching of English majors in the new era.

Using big data technology to make teaching practice more scientific, and to analyze the overall situation of students scientifically, is a necessary optimization and innovation of English teaching. Literature [17] reveals the positive impact of big data-supported analysis of English learner behavior on the effectiveness of English teaching and learning, and constructs a learning behavior model from the analysis results; the model helps teachers provide personalized teaching interventions according to learners’ individual characteristics and significantly improves the efficiency of English learning. Literature [18] builds a big data analysis model using student learning behavior data from a distance English education platform as a sample, in order to explore learners’ learning and behavioral patterns on an online platform and to understand their motivation for independent English learning in distance education and their demand for resources. Literature [19] examines the application of structural equation modeling to the analysis of student behavior in online courses, and provides a theoretical reference for improving the quality of online education by mining, integrating and analyzing the factors that influence online English learning behaviors. Literature [20] shows that data mining technology offers high efficiency, low resource occupancy and accurate prediction when handling learning behavior data from online English education, and introduces fuzzy neural networks to build a data mining model for English learners’ behavior data, which provides a useful reference for improving online English learning systems.
Literature [21] proposes a pattern matching clustering algorithm for English language learning; built on a big data framework, it can quickly obtain big data information from English language teaching and cluster it, which meets learners’ personalized development needs and provides an innovative path for English language teaching. Literature [22] uses a big data analysis strategy based on a neural network algorithm to analyze learner behavior data from English teaching and constructs a discrete dynamic teaching quality evaluation model covering five aspects, which helps improve teaching efficiency and implement personalized teaching. Literature [23] applies deep learning models to the automatic identification of students’ learning behavior in daily English teaching, which helps teachers understand students’ psychological characteristics, correct their learning behavior and significantly improve their learning efficiency. Supported by science and technology, big data applications in education can accurately and faithfully record students’ learning and life data at school, giving teachers technical support for understanding and analyzing students’ personalized learning modes and life patterns. The development of personalized English teaching strategies driven by big data also reflects the humanistic care of education.

This study collects student learning behavior data with the help of data mining technology. The K-means algorithm classifies learners into different categories according to their behavioral characteristics, a collaborative filtering recommendation algorithm pushes personalized learning resources, and a genetic algorithm develops the optimal learning path for each learner, which together realize the recommendation of personalized English teaching strategies. Two classes of students in school A are taken as research subjects: the experimental class adopts the personalized teaching method designed in this paper, and the control class adopts the traditional teaching method. The effect of the proposed teaching strategy is verified through a before-and-after controlled experiment on English scores.

2

Method

The application of big data analytics in personalized teaching has great potential, as it can accurately identify students’ learning styles, interest preferences, and potential difficulties by collecting, processing, and analyzing the massive amount of data generated by students in the learning process, and provide tailored learning resources and paths for each student. This data-based decision-making approach makes personalized teaching more scientific, precise and efficient.

2.1

Data acquisition

2.1.1

Limitations of traditional data collection methods

Traditional data collection methods rely mainly on teachers’ observations and records, which have some limitations. Firstly, it is difficult for teachers to accurately record every learning behavior of students in their busy teaching work, and there may be omissions or subjective judgment bias. Second, traditional data collection methods cannot provide real-time and accurate data on students’ learning behaviors, and teachers cannot understand students’ learning status in time and provide them with personalized instruction. In addition, traditional methods also require a lot of paper records and organizing work, which is inefficient and prone to data loss.

2.1.2

Learning behavior data collection based on data mining techniques

Big data refers to huge data sets that are difficult to handle with traditional data processing methods. Its core features are huge data volume, diverse types, rapid processing and rich potential value. The concept emphasizes that the size of the data set exceeds the processing capacity of conventional software tools, that the data types are numerous and complex, that the processing needs are urgent, and that the great value of the data can be mined through specialized processing.

The data collection method of student learning behavior based on data mining technology can overcome the limitations of traditional methods, effectively track student learning behavior, and provide more accurate and real-time data support for teaching management and personalized instruction. First, the use of digital tools in the learning process, such as electronic textbooks and online learning platforms, can record students’ learning behaviors and learning time and automatically save the data in the database, ensuring the accuracy and reliability of the data. Secondly, using data processing and analysis technology, students’ learning behavior can be quantitatively and qualitatively analyzed, students’ learning patterns and problems can be mined, and personalized learning suggestions and guidance programs can be provided [24].

In summary, the data collection method of students’ learning behavior based on data mining technology can provide more accurate and real-time data support for teaching management and personalized guidance, overcome the limitations of traditional methods, and effectively promote the improvement of education quality and students’ learning effect.

2.2

Cluster Analysis of Students’ English Learning Behavior

2.2.1

Application of cluster analysis in the study of learning behavior

Cluster analysis is an unsupervised machine learning method that can aggregate data samples with similar characteristics to form distinct groups or clusters. In learning behavior research, cluster analysis is widely used to identify students’ learning patterns and learning characteristics. Specifically, the application of cluster analysis in learning behavior research mainly includes the following aspects.

1) Classification of student groups. Through cluster analysis, students with similar learning behavior characteristics can be gathered together to form different student groups. This helps educators better understand the learning needs and characteristics of different student groups and provides guidance for personalized teaching.

2) Learning pattern identification. Cluster analysis can reveal students’ learning patterns and learning habits. For example, through clustering analysis of learning data on students’ online learning platforms, we can find out students’ learning activity in different time periods, the use of learning resources, etc., so as to identify students’ learning patterns.

3) Learning effect assessment. By comparing the learning performance and learning behavior characteristics of different student groups, the effects of different teaching modes and teaching methods can be assessed. For example, the differences in academic performance and learning behaviors of student groups adopting traditional teaching modes and those adopting personalized teaching modes can be compared, so as to assess the effectiveness of the two teaching modes.

4) Instructional intervention and optimization. Based on the results of clustering analysis, educators can formulate corresponding teaching interventions and optimization plans for the learning needs and problems of different student groups. For example, for student groups with low learning activity, measures such as increasing learning interaction and providing learning incentives can be taken to improve their learning motivation.

2.2.2

Principles and characteristics of the K-means algorithm

The K-means algorithm is a classical clustering algorithm for dividing data samples into K clusters such that each sample belongs to its nearest cluster center. The principle of the algorithm is relatively simple and intuitive; first, K cluster centers are randomly selected as initial points, then each sample is iteratively assigned to its nearest cluster center, and then the center position of each cluster is updated [25]. The iterative process will be repeated until the cluster centers no longer change or a predetermined number of iterations is reached.
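The iterative procedure just described can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `kmeans`, the `seed` parameter, the rule for empty clusters, and the toy data are all hypothetical.

```python
import math
import random


def kmeans(samples, k, max_iter=100, seed=0):
    """Cluster `samples` (a list of equal-length numeric tuples) into k groups."""
    rng = random.Random(seed)
    # Step 1: pick k distinct samples as the initial cluster centers.
    centers = rng.sample(samples, k)
    labels = [0] * len(samples)
    for _ in range(max_iter):
        # Step 2: assign every sample to its nearest center (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(point, centers[c]))
            for point in samples
        ]
        # Step 3: move each center to the mean of the samples assigned to it.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(samples, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[c])
        # Stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical toy data: two well-separated groups of two-attribute samples.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centers = kmeans(data, k=2)
print(labels)
```

Because the initial centers are drawn at random, a different `seed` may give a different labelling; running the function with several seeds and keeping the best result is one simple way to handle this.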

The features of the K-means algorithm include ease of implementation and high efficiency. Because its iterative process is simple, the algorithm is easy to implement and to apply to large-scale datasets. The cost of each iteration grows linearly with the number of samples, so the algorithm runs quickly on large datasets. However, K-means is sensitive to the choice of initial cluster centers and may produce different clustering results from different initializations, so in practice it is run several times and the best result is selected.

In addition, the K-means algorithm is suitable for data whose samples form obvious, nearly spherical clusters. For non-spherical clusters or clusters of different scales, the K-means algorithm may be less effective. To overcome these limitations, researchers have developed many improved variants, such as the K-means++ algorithm and the K-medoids algorithm, to improve clustering effectiveness and robustness.

Overall, the K-means algorithm is a simple but effective clustering algorithm that is widely used in many practical applications, especially in the fields of data mining, pattern recognition and machine learning. However, the use requires attention to the selection of the initial cluster centers and the adaptation to the characteristics of the data distribution in order to obtain more accurate and robust clustering results.

2.2.3

Classifying Learning Behaviors Using K-means Algorithm

The K-means algorithm is a partition-based clustering algorithm. Its partition criterion is similarity, measured by the distance between data items. Because the algorithm processes data efficiently and quickly, it is often used to cluster large-scale data.

First, the K-means algorithm requires a dataset containing a large number of data samples and a specific value for the number of clusters. Second, samples are randomly selected as the initial centers, and the distance between each remaining sample and every initial center is calculated to determine which center the sample is most similar to; the sample is then placed in the cluster of that center, and the calculation continues. The cluster centroids are updated according to the results until the error is minimized. The process of clustering data with the K-means algorithm can be expressed by formulas. Given a sample set K: (1) $\begin{array}{l} K = \{ C_{x} \mid C_{x} = (C_{x 1}, C_{x 2}, \dots, C_{x j}) \} \\ x = 1, 2, \dots, n \end{array}$

Where: n is the sample size, C_x is the attribute vector of the x-th sample, and j is the number of attributes of each sample in set K. After selecting the sample set, the cluster centers can be determined: (2) $\begin{array}{l} B = \{ I_{v} \mid I_{v} = (I_{v 1}, I_{v 2}, \dots, I_{v d}) \} \\ v = 1, 2, \dots, m \end{array}$

Where: B is the set of all cluster centroids in the dataset, I_v is the attribute vector of the v-th centroid, d is the number of attributes of each centroid, and m is the total number of clusters. Once both the data set and the cluster centers are determined, the similarity distance between a sample and a center can be calculated over the d attributes they share, indexed here by k: (3) $\begin{array}{l} T (C_{x}, I_{v}) = \sqrt{\sum_{k = 1}^{d} {(C_{x k} - I_{v k})}^{2}} \\ x = 1, 2, \dots, n; \; v = 1, 2, \dots, m \end{array}$

Where: C_x is a data item in the data set and I_v is the centroid of a preset cluster. T measures the similarity distance between the data item and the cluster centroid over the d attributes, and x and v each range over the fixed values given above. Based on this measure, the K-means algorithm can quickly and efficiently classify the massive data on students’ English learning behaviors, decide which category a particular learning behavior record belongs to, and minimize the errors generated in the classification process. It thus provides the data processing prerequisites and technical basis for the systematic study of students’ English learning behaviors.
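As a small worked illustration of the distance in equation (3), consider a sample with two attributes, C_x = (1, 2), and two cluster centers, I_1 = (0, 0) and I_2 = (2, 2). These numbers are made up for illustration and do not come from the study.

```latex
\begin{aligned}
T(C_x, I_1) &= \sqrt{(1 - 0)^2 + (2 - 0)^2} = \sqrt{5} \approx 2.24 \\
T(C_x, I_2) &= \sqrt{(1 - 2)^2 + (2 - 2)^2} = 1
\end{aligned}
```

Since T(C_x, I_2) < T(C_x, I_1), the sample is assigned to the cluster whose center is I_2.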

2.3

Recommendations for Personalized Teaching Strategies

2.3.1

Improving Collaborative Filtering Algorithms for Personalized Recommendations

Learners are diverse: they differ significantly in age, interests, education level, expectations and study habits, and each student has a learning map that fits his or her characteristics. The collaborative filtering (CF) recommendation algorithm uses learners’ historical behavior records or similarity relationships to discover items that learners may be interested in, and thereby achieves personalized recommendation [26].

The collaborative filtering recommendation algorithm consists of three main steps: extraction of learner interest preferences, similarity calculation, and recommendation list generation. First, the learner’s interest preference features are extracted. In online learning, learners reveal their preferences in many ways, such as their evaluation of learning materials (likes, ratings, favorites, sharing, comments and other behaviors), their profiles (gender, age, preferences, etc.), and their resource viewing (search behavior, browsing time, clicking behavior, etc.). Learners’ evaluation of learning resources is explicit feedback, which directly reflects their preferences, while the others are implicit feedback, whose habitual characteristics must be uncovered through data analysis. Learners’ preference characteristics can be classified with the K-means algorithm described above, which divides learners into different categories. Next, similarity is calculated to find similar learners or learning resources. Similarity is computed on vectors: the distance between two vectors is calculated, and the smaller the distance, the greater the similarity. Commonly used measures are Euclidean distance, cosine similarity, the Jaccard similarity coefficient, and the Pearson correlation coefficient.

Cosine similarity is a commonly used measure of user or item similarity. It measures the similarity between two vectors by the cosine of the angle between them. If the angle between two vectors is close to 0 degrees, the cosine similarity is close to 1, which means they are very similar. In recommendation algorithms, the cosine similarity between two users is calculated from their ratings of the items both have rated. The formula is shown below: (4) $\mathrm{Cosine} (x, y) = \frac{\sum_{i \in Y_{x y}} r_{x i} \times r_{y i}}{\sqrt{\sum_{i \in Y_{x y}} r_{x i}^{2}} \sqrt{\sum_{i \in Y_{x y}} r_{y i}^{2}}}$

where r_xi and r_yi denote the ratings of item i by users x and y, respectively, and Y_xy denotes the set of items jointly rated by users x and y.
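Equation (4) can be sketched as a short Python function. The data layout (one dictionary per user, mapping item id to rating) and the choice to return 0.0 when two users share no rated items are assumptions made for this illustration.

```python
import math


def cosine_similarity(ratings_x, ratings_y):
    """Cosine similarity of two users over the items both have rated.

    Each argument maps item id -> rating (hypothetical data layout).
    """
    # Y_xy in equation (4): items rated by both users.
    common = ratings_x.keys() & ratings_y.keys()
    if not common:
        return 0.0  # assumption: no overlap is treated as zero similarity
    dot = sum(ratings_x[i] * ratings_y[i] for i in common)
    norm_x = math.sqrt(sum(ratings_x[i] ** 2 for i in common))
    norm_y = math.sqrt(sum(ratings_y[i] ** 2 for i in common))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an all-zero rating vector has no direction
    return dot / (norm_x * norm_y)


# Hypothetical ratings: item "c" is ignored because only one user rated it.
user_x = {"a": 4, "b": 2}
user_y = {"a": 2, "b": 1, "c": 5}
print(cosine_similarity(user_x, user_y))
```

Here the two users rate their shared items in the same proportion, so the vectors point in the same direction even though the rating magnitudes differ.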

Based on the similarity between users, the target user’s rating of an item can be predicted, with each user’s mean rating as the baseline. The prediction formula is as follows: (5) ${\hat{r}}_{u i} = {\bar{r}}_{u} + \frac{\sum_{v \in S (u, K) \cap N (i)} \mathrm{sim} (u, v) (r_{v i} - {\bar{r}}_{v})}{\sum_{v \in S (u, K) \cap N (i)} | \mathrm{sim} (u, v) |}$

where ${\hat{r}}_{u i}$ denotes the predicted rating of item i by user u, $S (u, K)$ denotes the set of the K users most similar to user u, $N (i)$ denotes the set of users that have rated item i, $\mathrm{sim} (u, v)$ is the similarity between users u and v, and $r_{v i}$ is the rating of item i by user v.
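A minimal sketch of the prediction in equation (5) follows. It assumes that the baseline terms for users u and v are their mean ratings, that similarities are precomputed, and that the prediction falls back to the target user’s mean when no neighbor has rated the item; the function name and data layouts are hypothetical.

```python
def predict_rating(user, item, ratings, similarity, k=2):
    """Predict `user`'s rating of `item` from the k most similar users.

    ratings:    {user: {item: rating}}
    similarity: {(u, v): sim(u, v)} for every pair (user, v) used below
    Both layouts are hypothetical, chosen for illustration.
    """
    def mean(u):
        return sum(ratings[u].values()) / len(ratings[u])

    # S(u, K): the k users most similar to `user`.
    others = [v for v in ratings if v != user]
    top_k = sorted(others, key=lambda v: similarity[(user, v)], reverse=True)[:k]
    # Keep only the neighbors who rated the item: S(u, K) ∩ N(i).
    neighbors = [v for v in top_k if item in ratings[v]]

    denom = sum(abs(similarity[(user, v)]) for v in neighbors)
    if denom == 0:
        return mean(user)  # assumption: fall back to the user's own mean
    numer = sum(
        similarity[(user, v)] * (ratings[v][item] - mean(v)) for v in neighbors
    )
    return mean(user) + numer / denom


# Hypothetical ratings and similarities for three learners.
ratings = {
    "u": {"a": 4, "b": 2},
    "v": {"a": 5, "b": 3, "c": 4},
    "w": {"a": 1, "c": 3},
}
similarity = {("u", "v"): 1.0, ("u", "w"): 0.5}
print(predict_rating("u", "c", ratings, similarity))
```

Subtracting each neighbor’s mean before weighting compensates for learners who rate systematically high or low.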

The Jaccard correlation coefficient, also known as the Jaccard similarity coefficient, is a metric used to compare the similarity and difference between finite sample sets. It is defined as the ratio of the size of the intersection of two sets to the size of their union. Specifically, given two sets A and B, the Jaccard correlation coefficient $J (A, B)$ is: (6) $J (A, B) = \frac{| A \cap B |}{| A \cup B |} = \frac{| A \cap B |}{| A | + | B | - | A \cap B |}$

Where $| A |$ denotes the number of elements in set A (its cardinality), and $| B |$ denotes the number of elements in set B.

The Jaccard correlation coefficient has a value between 0 and 1. When the Jaccard correlation coefficient is equal to 1, it means that the two data sets are identical. When the Jaccard correlation coefficient is equal to 0, it means that the two datasets do not have any elements in common. The higher the Jaccard coefficient, the more similar the two samples are.

The metric complementary to the Jaccard correlation coefficient is the Jaccard distance, which describes the degree of dissimilarity between sets. It measures the difference between two sets as the proportion of elements that belong to only one of the sets among all elements of the two sets; the larger the Jaccard distance, the less similar the samples are. The formula is defined as follows: (7) $d_{j} (A, B) = 1 - J (A, B) = \frac{| A \cup B | - | A \cap B |}{| A \cup B |} = \frac{| A \triangle B |}{| A \cup B |}$ where $A \triangle B$ is the symmetric difference of A and B.
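The Jaccard coefficient of equation (6) and the Jaccard distance of equation (7) map directly onto Python set operations. In this sketch, treating two empty sets as identical is an assumption, and the example sets of resource ids are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Jaccard distance, defined as 1 - J(A, B)."""
    return 1.0 - jaccard(a, b)


# Hypothetical sets of resources viewed by two learners.
viewed_a = {"r1", "r2", "r3"}
viewed_b = {"r2", "r3", "r4"}
print(jaccard(viewed_a, viewed_b), jaccard_distance(viewed_a, viewed_b))
```

Because it only needs set membership, this measure suits implicit feedback such as which resources a learner has viewed or bookmarked, where no numeric rating exists.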

Similarity was calculated using Pearson’s correlation coefficient $P_{i, j}$, calculated as: (8) $P_{i, j} = \frac{\sum_{c \in I_{i, j}} (r_{i, c} - {\bar{r}}_{i}) (r_{j, c} - {\bar{r}}_{j})}{\sqrt{\sum_{c \in I_{i, j}} {(r_{i, c} - {\bar{r}}_{i})}^{2}} \sqrt{\sum_{c \in I_{i, j}} {(r_{j, c} - {\bar{r}}_{j})}^{2}}}$

Where i and j denote learner i and learner j, $I_{i, j}$ is the set of learning resources rated by both learners, $r_{i, c}$ and $r_{j, c}$ are the ratings of learning resource c by learner i and learner j, and ${\bar{r}}_{i}$ and ${\bar{r}}_{j}$ are the average ratings of learner i and learner j over all learning resources. The correlation coefficient P takes values in $[- 1, 1]$, with larger absolute values indicating a stronger correlation between the two learners. Finally, the recommendation list is generated based on the similarity values.
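The Pearson similarity between two learners can be sketched as below. The sketch assumes each learner’s mean is taken over all resources that learner has rated while the sums run over co-rated resources only, and it returns 0.0 when the coefficient is undefined; the dictionary layout is hypothetical.

```python
import math


def pearson(ratings_i, ratings_j):
    """Pearson correlation of two learners over co-rated resources.

    Each argument maps resource id -> rating (hypothetical data layout).
    """
    common = ratings_i.keys() & ratings_j.keys()  # co-rated resources
    if not common:
        return 0.0  # assumption: undefined similarity is reported as 0
    # Means over everything each learner has rated.
    mean_i = sum(ratings_i.values()) / len(ratings_i)
    mean_j = sum(ratings_j.values()) / len(ratings_j)
    numer = sum((ratings_i[c] - mean_i) * (ratings_j[c] - mean_j) for c in common)
    dev_i = math.sqrt(sum((ratings_i[c] - mean_i) ** 2 for c in common))
    dev_j = math.sqrt(sum((ratings_j[c] - mean_j) ** 2 for c in common))
    if dev_i == 0 or dev_j == 0:
        return 0.0  # assumption: constant ratings carry no correlation signal
    return numer / (dev_i * dev_j)


# Hypothetical ratings: the second learner rates in the same order, twice as high.
learner_i = {"a": 1, "b": 2, "c": 3}
learner_j = {"a": 2, "b": 4, "c": 6}
print(pearson(learner_i, learner_j))
```

Unlike cosine similarity on raw ratings, centering on each learner’s mean lets the coefficient become negative when two learners’ preferences run in opposite directions.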

Personalized recommendation helps learners access resources efficiently, quickly find learning content of interest, reduce search time, and avoid “drowning” in the sea of information.

2.3.2

Genetic Algorithm Based Learning Path Customization

Personalized teaching is also reflected in the customization of personalized learning paths for learners based on their individual differences, using genetic algorithms to find the optimal learning content and customize learning paths for learners.

The genetic algorithm (GA) is an optimization algorithm based on Darwin’s theory of evolution. Each individual is an entity characterized by its chromosome. The algorithm encodes the chromosomes, establishes the initial population, selects individuals in each generation according to the fitness function, and then applies the genetic operators of crossover and mutation to produce a new population. The population evolves generation by generation toward increasingly good approximate solutions [27]. Genetic algorithms are widely used in adaptive learning. A GA searches globally, selects results according to their fitness and makes them progressively better adapted, so it can efficiently analyze a learner’s preferences and customize learning paths that match the learner’s cognitive level.

Based on the learner database (including learners’ basic information such as name and ID, and learner characteristics such as learning style and knowledge level) and the course resource description table, the genetic algorithm searches for the optimal learning path for each learner.

The ELT course is labeled D and is divided into instructional units d₁, d₂, …, d_n, where d_i ∈ N. Each instructional unit d_i (1 ≤ i ≤ n) contains a set of key concepts that learners are expected to master, labeled U_i. For each instructional unit d_i (1 ≤ i ≤ n), a complexity C_i ∈ N, subject to θ ∈ N, is defined to describe the difficulty of d_i. If mastery of the key concepts in instructional unit d_i is more difficult than in instructional unit d_j, then C_i > C_j.

Define U = U₁∪U₂∪…∪U_n; then U is the set of key concepts contained in course D. Define the dependency matrix $K = \{ K_{i j} \}$ of the teaching units, where 1 ≤ i, j ≤ n. If d_j depends on d_i, then K_ij = 1; otherwise K_ij = 0.

Teaching sequence X is a combination of a series of teaching units, i.e. X = X₁, X₂, …, X_m, where m < n, X_i ∈ D and $(\forall i \neq j) \Rightarrow X_{i} \neq X_{j}$ .

Sequence X is acceptable if each unit occurs only once and every unit appears in the sequence before any unit that depends on it. That is, given $(\forall i \neq j) \Rightarrow X_{i} \neq X_{j}$, teaching sequence X is acceptable if, for any pair of teaching units $(X_{p}, X_{q})$ with X_p, X_q ∈ X, 1 ≤ p, q ≤ m and p < q, teaching unit X_p does not depend on teaching unit X_q (as recorded in matrix K).
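The acceptability condition can be checked mechanically against the dependency matrix. The sketch below assumes units are identified by 0-based indices and that `depends[i][j] == 1` means unit j depends on unit i, following the definition of K; the function name and the example matrix are hypothetical.

```python
def is_acceptable(sequence, depends):
    """Check whether a teaching sequence respects the dependency matrix.

    sequence: list of unit indices (0-based, an assumption of this sketch)
    depends:  matrix K, where depends[i][j] == 1 if unit j depends on unit i
    """
    # Each unit may occur only once.
    if len(set(sequence)) != len(sequence):
        return False
    # For p < q, the earlier unit X_p must not depend on the later unit X_q,
    # i.e. K[X_q][X_p] must be 0.
    for p in range(len(sequence)):
        for q in range(p + 1, len(sequence)):
            if depends[sequence[q]][sequence[p]] == 1:
                return False
    return True


# Hypothetical course: unit 1 depends on unit 0, and unit 2 depends on unit 1.
K = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
print(is_acceptable([0, 1, 2], K), is_acceptable([1, 0, 2], K))
```

A check of this kind can be used to discard or repair chromosomes that crossover or mutation has made invalid.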

The complexity of sequence X is defined as $\sum_{X_{i} \in X} C_{i} (X_{i})$, where $C_{i} (X_{i})$ is the complexity of instructional unit X_i.

For each consecutive pair $(X_{i}, X_{i + 1})$ of instructional sequence X, 1 ≤ i ≤ m − 1, define $r_{i} (X_{i}, X_{i + 1})$ as the number of instructional units X_i+1 on which the learner’s mastery of instructional unit X_i depends.

The instructional unit $d_{i}$ is the gene of the GA, an instructional sequence is a chromosome, and the set of instructional sequences is the population. The fitness function of the genetic algorithm is: (9) $F (X) = \sum_{i = 1}^{m - 1} C_{i} (X_{i}) \cdot r_{i} (X_{i}, X_{i + 1}) + C_{m} (X_{m})$

If F(X) < F(Y), then chromosome X is preferred to chromosome Y, i.e., learning sequence X is preferred to sequence Y; when F(X) → min, sequence X is optimal. The learning sequences are selected based on the fitness function, and the optimal set of learning sequences is generated using combinatorial crossover and double mutation.
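The fitness function of equation (9) and the "smaller is better" comparison can be sketched as follows. The pair term r is passed in as a caller-supplied function because its exact computation is not spelled out here; the complexity values and the lookup table for r are hypothetical illustration data.

```python
def fitness(sequence, complexity, r):
    """F(X) from equation (9); a lower value means a better sequence.

    sequence:   list of unit ids X_1, ..., X_m
    complexity: {unit id: complexity C}
    r:          function r(X_i, X_{i+1}) for consecutive units (caller-supplied)
    """
    if not sequence:
        raise ValueError("sequence must contain at least one unit")
    # Sum C(X_i) * r(X_i, X_{i+1}) over consecutive pairs, i = 1 .. m-1.
    total = sum(
        complexity[sequence[i]] * r(sequence[i], sequence[i + 1])
        for i in range(len(sequence) - 1)
    )
    # Add the complexity of the last unit, C(X_m).
    return total + complexity[sequence[-1]]


# Hypothetical complexities and pair values for three units.
complexity = {0: 2, 1: 3, 2: 1}
pair_values = {(0, 1): 1, (1, 2): 2, (1, 0): 3, (0, 2): 1}


def r(a, b):
    return pair_values.get((a, b), 1)  # assumption: default pair value of 1


candidates = [[0, 1, 2], [1, 0, 2]]
best = min(candidates, key=lambda seq: fitness(seq, complexity, r))
print(best, fitness(best, complexity, r))
```

In a full genetic algorithm, this comparison would drive selection in every generation, with crossover and mutation producing the next set of candidate sequences.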

3

Results and discussion

In this paper, a 3-month teaching practice study was conducted with students of the class of 2023 at school A who were taking the college English course and majoring in financial engineering or international economics and trade. The 300 students were divided into two teaching classes. The 150 students in the first class formed the experimental class and were taught with the personalized teaching method designed in this paper. The 150 students in the second class formed the control class and were taught with the traditional teaching method.

3.1

Learning behavior cluster analysis

In this paper, a Python application was written in Visual Studio Code to extract four items of learning behavior data for students in the experimental class from the statistics of the Learning Access platform: video viewing time, number of visits, percentage of task points completed, and online quiz scores. In addition to these four items, the students’ ID numbers and names were retained, which makes it easier to look up student information after classification. The learning behavior data were then cleaned by excluding records with zero logins: the records of 2 students were removed and 148 records were retained. Because the extracted items differ in order of magnitude, the study standardized them with the standard deviation standardization method (Z-score).
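Z-score standardization rescales each feature to zero mean and unit standard deviation, so that features measured on different scales contribute comparably to distance calculations. The sketch below uses the population standard deviation, which is an assumption because the variant is not stated here; the sample values are made up.

```python
import statistics


def z_score(values):
    """Standardize a list of numbers to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical values for one feature, e.g. number of visits per student.
visits = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_score(visits))
```

Each of the four behavior features would be standardized separately in this way before clustering. scikit-learn’s `StandardScaler` follows the same population-standard-deviation convention.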

The K-means clustering algorithm was used to cluster the four items of learning behavior data of the 148 students remaining after data cleaning. The standardized data were fitted with the K-means model provided by scikit-learn, a third-party Python library for machine learning, in Visual Studio Code. K-means requires the number of categories to be specified in advance; to determine it accurately, a questionnaire survey on online learning initiative and learning effectiveness was carried out among the 148 students. After several rounds of training and clustering, the online learners were finally divided into four types, completing the classification.

The clustering distribution of learning behavior data is shown in Table 1. 0, 1, 2 and 3 are the category labels of each category of students after cluster analysis, and students belonging to the same label belong to the same type of learners. The 148 students who participated in the online learning English course on the Learning Access platform were categorized into four types of learners.

Table 1.

Learning behavior data clustering

Class label	Number of students	Video viewing duration	Number of visits	Task point completion percentage	Online quiz score
0	24	-0.7792	-0.9316	-0.58422	-1.17966
1	56	0.11468	0.17027	0.41268	-0.49009
2	37	-0.6313	-0.61891	-0.64247	0.71358
3	31	1.20287	1.3203	0.3627	0.70907

In this paper, the clustering results are presented as a scatterplot of the categorized data. Because the learning behavior data are 4-dimensional, they were reduced to 2 dimensions with Python’s scikit-learn library so that the sample points can be drawn in the plane; the visualized clustering results are shown in Figure 1. The sample points with label 0 are marked with blue stars and lie in the lower left area of the plot; these 24 learners are classified as low-motivation, low-achievement learners (Type 1). The sample points with label 1 are marked with orange triangles and lie in the lower right area; these 56 learners are classified as low-motivation, high-achievement learners (Type 2). The sample points with label 2 are marked with purple dots and lie in the upper left area; these 37 learners are classified as high-motivation, low-achievement learners (Type 3). The sample points with label 3 are marked with green hexagons and lie in the upper right area; these 31 learners are classified as high-motivation, high-achievement learners (Type 4). The sample points of each type are roughly normally distributed around their cluster center and the clusters do not overlap, which indicates that the category boundaries are clear and the classification is reasonable.

3.2

Personalized Recommendation Analysis with Collaborative Filtering

In this paper, the recommendation algorithms are evaluated mainly through offline experiments. The root mean square error (RMSE) is an important index for evaluating the performance of recommendation models in offline experiments, and it applies to scenarios in which user ratings are predicted. It is calculated as shown in equation (10):

$RMSE = \sqrt{\frac{\sum_{i = 1}^{n} {(\hat{r}_{ui} - r_{ui})}^{2}}{n}}$ (10)

where $\hat{r}_{ui}$ denotes the predicted score of user $u$ for resource $i$, $r_{ui}$ is the true score, and $n$ is the number of resources. The smaller the RMSE value, the better the performance of the recommendation model.
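As a small worked illustration of the RMSE definition above, the following sketch computes it for one user over a handful of resources. The function name `rmse` and the rating values are hypothetical and chosen only for the example.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between predicted and true ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(predicted))


# Hypothetical ratings for n = 4 resources: two predictions are off by 1,
# two are exact, so the mean squared error is 2 / 4 = 0.5.
example = rmse([4.0, 3.0, 5.0, 2.0], [5.0, 3.0, 4.0, 2.0])
```

Here `example` equals the square root of 0.5, roughly 0.707.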

The RMSE of the improved collaborative filtering algorithm under different numbers of neighbors is compared in Fig. 2. Compared with the traditional user-based collaborative filtering algorithm, this paper's method yields a smaller RMSE as the number of neighboring users increases, and its average RMSE is 0.07 lower. This shows that the method is better able to recommend resources that meet users' knowledge point needs and learning preferences.
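For orientation, the baseline in this comparison, traditional user-based collaborative filtering with a variable number of neighbors, can be sketched as below. This is a generic textbook form under stated assumptions (cosine similarity on co-rated resources, similarity-weighted averaging, leave-one-out evaluation); it is not the improved algorithm of this paper, and the function names and rating matrix are hypothetical.

```python
import numpy as np


def predict_rating(ratings, user, item, k):
    """Predict ratings[user, item] from the k most similar users.

    ratings: 2-D float array of positive ratings, 0 meaning "not rated".
    """
    rated = ratings > 0
    sims = []
    for v in range(ratings.shape[0]):
        if v == user or not rated[v, item]:
            continue
        common = rated[user] & rated[v]
        common[item] = False  # never use the target resource itself
        if not common.any():
            continue
        a, b = ratings[user, common], ratings[v, common]
        sims.append((float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))), v))
    neighbors = sorted(sims, reverse=True)[:k]
    if not neighbors:
        return float(ratings[rated].mean())  # fall back to the global mean
    weight = sum(s for s, _ in neighbors)
    return sum(s * ratings[v, item] for s, v in neighbors) / weight


def leave_one_out_rmse(ratings, k):
    """Hide each known rating in turn, predict it, and report the RMSE."""
    errors = []
    for u, i in zip(*np.nonzero(ratings)):
        held_out = ratings.copy()
        held_out[u, i] = 0
        errors.append((predict_rating(held_out, u, i, k) - ratings[u, i]) ** 2)
    return float(np.sqrt(np.mean(errors)))


# Hypothetical 3-user x 3-resource rating matrix (0 = not rated).
R = np.array([[5.0, 3.0, 0.0], [5.0, 3.0, 4.0], [1.0, 5.0, 2.0]])
```

Evaluating `leave_one_out_rmse(R, k)` for a range of `k` values gives the kind of RMSE-versus-neighbors curve that Fig. 2 reports, although on a realistic data set rather than this toy matrix.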

3.3

Learning path customization analysis of genetic algorithms

The study quantitatively tracked and recorded learning behavior information during the first 3 months of teaching practice in the experimental class, such as the time spent on each part of a page, the content and number of clicks, the targets and content of comments, and the content shared.

Scatter plots were used to summarize some of the learning behaviors of the four types of learners, and the learning path behavior analysis is shown in Figure 3. Panels (a) to (d) represent the learning path behaviors of low-motivation low-achievement learners, low-motivation high-achievement learners, high-motivation low-achievement learners, and high-motivation high-achievement learners, respectively. The x-axis represents chronological order, the y-axis represents the sequence of learning activities (e.g., syllabus, forums, resources, exercises), and each point represents one behavior. Low-motivation low-achievement learners (Type 1) show no clear choice of learning path, with no apparent focus among the syllabus, forums, resources, and exercises. Low-motivation high-achievement learners (Type 2) focused primarily on exercises and the syllabus, high-motivation low-achievement learners (Type 3) on the syllabus and forums, and high-motivation high-achievement learners (Type 4) on exercises and forums.

The learning sequences are selected according to the genetic algorithm's fitness function, and the optimal set of learning sequences is generated using combinatorial crossover and double mutation. The recommended learning paths for the four types of students in the experimental class are shown in Figure 4. The nodes in the figure represent different sequences of learning activities, where lower-level learning behaviors are executed after higher-level learning behaviors; for example, Behavior 2 in LEVEL 2 is executed after Behavior 1 in LEVEL 1. The numbers on the edges are the learning path coefficients, i.e., the likelihood of occurrence. From Fig. 4, the optimal learning paths for the four types of students are {1-2-4-3}, {1-2-3-4}, {1-3-4-2}, and {1-3-2-4}, respectively. Generating personalized learning paths with the genetic algorithm and pushing them accurately to learners not only addresses the problems of learner disorientation and cognitive overload, but also makes efficient use of learning resources and promotes learners' active construction, internalization, and transfer of knowledge.
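A genetic search over orderings of four learning activities can be sketched as follows. The operators here are stand-ins, since this section does not specify the paper's "combinatorial crossover" and "double mutation": order crossover is used as an assumed substitute, and double mutation is interpreted as up to two random swaps. The path coefficients in `PATH_COEFF` are invented for illustration and are not the values in Figure 4.

```python
import random

ACTIVITIES = [1, 2, 3, 4]

# Hypothetical path coefficients: likelihood that activity b follows a.
PATH_COEFF = {
    (1, 2): 0.8, (1, 3): 0.6, (1, 4): 0.2,
    (2, 1): 0.1, (2, 3): 0.5, (2, 4): 0.7,
    (3, 1): 0.1, (3, 2): 0.4, (3, 4): 0.6,
    (4, 1): 0.1, (4, 2): 0.3, (4, 3): 0.5,
}


def fitness(seq):
    """Sum of path coefficients along consecutive activities."""
    return sum(PATH_COEFF[(a, b)] for a, b in zip(seq, seq[1:]))


def order_crossover(p1, p2, rng):
    """Keep a slice of p1 in place and fill the rest in p2's order."""
    i, j = sorted(rng.sample(range(len(p1)), 2))
    middle = p1[i:j + 1]
    rest = [g for g in p2 if g not in middle]
    return rest[:i] + middle + rest[i:]


def double_mutation(seq, rng, rate=0.2):
    """Apply up to two independent random swaps (assumed interpretation)."""
    seq = seq[:]
    for _ in range(2):
        if rng.random() < rate:
            a, b = rng.sample(range(len(seq)), 2)
            seq[a], seq[b] = seq[b], seq[a]
    return seq


def evolve(pop_size=20, generations=30, seed=0):
    """Return the fittest activity sequence found by the search."""
    rng = random.Random(seed)
    population = [rng.sample(ACTIVITIES, len(ACTIVITIES)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # truncation selection
        children = [population[0]]             # elitism: keep the best so far
        while len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            children.append(double_mutation(order_crossover(p1, p2, rng), rng))
        population = children
    return max(population, key=fitness)


best_path = evolve()
```

With only four activities there are just 24 possible orderings, so exhaustive search would do equally well; the genetic formulation becomes worthwhile when the number of learning activities grows.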

3.4

Comparison of the effectiveness of teaching practices

The English scores of the two classes before the teaching practice are shown in Figure 5. The average English scores of the experimental class and the control class before the practice were 62.13 and 63.14, respectively. The difference was not statistically significant (T=0.03, P=0.82), which indicates that the two groups, although slightly different in average score, were comparable overall and suitable for this teaching practice.

The English scores of the two classes after the three-month teaching practice are shown in Figure 6. The average English scores of the experimental and control classes after the practice were 72.63 and 65.14, respectively. The test scores of the two classes were analyzed with SPSS, and the difference was significant (p=0.002).
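The kind of comparison run in SPSS can be illustrated with a short sketch. The source does not name the test used, so an independent two-sample t-test with pooled variance is assumed here; the score lists are made-up values for the example and are not the study's data.

```python
import math
import statistics


def pooled_t(a, b):
    """Student's t statistic and degrees of freedom for two independent samples."""
    na, nb = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    pooled_var = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t = (statistics.mean(a) - statistics.mean(b)) / std_err
    return t, na + nb - 2


# Hypothetical post-test scores for two small groups (illustration only).
experimental = [70, 75, 68, 80, 72]
control = [65, 60, 66, 70, 64]
t_stat, dof = pooled_t(experimental, control)
```

For these made-up scores the means are 73 and 65, the pooled variance is 17.5, and the statistic is 8 divided by the square root of 7, about 3.02, with 8 degrees of freedom. Converting the statistic to a p-value requires the t distribution, which a statistics package such as SPSS or SciPy's `scipy.stats.ttest_ind` provides.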

In summary, the strategy of using big data technology to analyze students' English learning behaviors and then teaching in a personalized way on that basis can significantly improve students' English performance and raise the quality of education.

3.5

Pedagogical recommendations

Based on the results of the study, we propose the following recommendations on data analysis of college students' English learning behaviors and personalized teaching in order to improve the quality of education and students' learning experience.

1) Schools should actively explore and implement personalized teaching programs, provide personalized courses and resources according to students' motivation, learning methods, time allocation and frustration coping ability, and customize different teaching materials and methods to better meet the needs of different students.

2) Schools can enhance students' academic performance by stimulating their motivation, for example by encouraging students to participate in interest courses, programs or competitions, providing them with information on career planning and prospects, and helping them to understand the importance of learning English, so as to stimulate their interest and increase their motivation to learn.

3) Schools should encourage students to try different learning methods, such as reading, listening, speaking and writing, and provide them with diversified learning resources to meet the learning styles and preferences of different students.

4) Schools should provide students with frustration coping support, including psychological counseling, academic support, and study skills training, in order to help students better cope with difficulties and frustrations in learning.

5) Schools can provide training courses on time management skills to help students allocate their study time more effectively. Students should also learn to make study plans and allocate their study time rationally to enhance their academic performance.

4

Conclusion

In this paper, we collect a variety of data from students' learning processes and use them to cluster and analyze their learning behaviors. Based on a collaborative filtering algorithm and a genetic algorithm, learning resources are pushed in a personalized way and learning paths are customized for the different types of student groups.

After K-Means clustering, the learning behaviors of students in the experimental class of school A can be divided into four types: low-motivation low-achievement learners, low-motivation high-achievement learners, high-motivation low-achievement learners, and high-motivation high-achievement learners.

The average RMSE of the collaborative filtering recommendation algorithm designed in this paper is 0.07 lower than that of the traditional collaborative filtering algorithm, so it is better able to recommend resources that meet users' knowledge needs and learning preferences. Personalized learning paths are then designed with the genetic algorithm for the four types of learners in the experimental class.

At the end of the 3-month teaching practice, the average English score of the experimental class was 7.49 points higher than that of the control class, and the difference was significant (P=0.002). The personalized teaching strategy based on students' learning behaviors designed in this paper contributed to students' English achievement. Accordingly, suggestions are made to promote personalized teaching, enhance learning motivation, diversify learning methods, and strengthen frustration coping support and learning time management, with a view to providing a reference for realizing personalized teaching and enhancing students' learning outcomes.

Language: English

Publication timeframe: 1 time per year
Journal Subjects: Life Sciences, Life Sciences, other, Mathematics, Applied Mathematics, General Mathematics, Physics, Physics, other

Lichao Zhang

Cuiping Shi

Published Online: Sep 26, 2025

Received: Feb 10, 2025

Accepted: May 11, 2025

DOI: https://doi.org/10.2478/amns-2025-1031

Keywords: Data mining, K-means algorithm, Collaborative filtering algorithm, Genetic algorithm, Personalized teaching

© 2025 Lichao Zhang and Cuiping Shi, published by Sciendo.

This work is licensed under the Creative Commons Attribution 4.0 International License.
