Intelligent Analytics for Educational Big Data and Its Application to Instructional Management
Published online: 03 Feb 2025
Received: 29 Aug 2024
Accepted: 16 Dec 2024
DOI: https://doi.org/10.2478/amns-2025-0002
© 2025 Nan Zhang, published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
The rapid progress of informatization has driven the development of information technologies such as cloud computing, the Internet of Things, and big data, which in turn bring unprecedented opportunities for the informatization of education. Online education based on big data is changing traditional education and has attracted a great deal of attention in the education sector for its ability to “teach according to the student’s abilities” [1–2]. Modern education is trending toward personalization and standardization grounded in large amounts of data. The most typical feature of big data is its sheer volume and the size of the data space. Educational big data are mainly generated in the course of teaching activities and educational management, and they comprise all the static and dynamic data of the whole teaching and management process [3–4]. The sources of educational big data are scattered, varied in type, uneven in quality, and inconsistent in standards, and different sources may contain duplicated and highly redundant data. It is therefore of great significance to build student portraits and to analyze and mine students’ potential characteristics, self-worth tendencies, academic trends, and so on from such widely sourced and varied educational big data [5–6].
Intelligent analysis of educational big data creates new opportunities for school development and teaching reform. More and more intelligent devices, online systems, and platforms are entering the campus, and the construction of smart campuses and smart classrooms continues to advance [7–8]. In realizing the smart classroom, data is the foundation, teaching is the root, wisdom is the key, and management is the guarantee, which pushes research on educational big data to a new height. Teaching management is an important factor affecting teaching effectiveness, and it remains a difficulty in several schools [9–10]. It is crucial for teachers to analyze rationally and manage classroom teaching effectively by drawing on the real-time data collected by smart devices in the “new classroom” and “smart classroom” [11]. Intelligent devices increasingly collect and use real data from the classroom itself, improving their ability to interact with people and thereby releasing more of the value of data, which is the embodiment of data intelligence [12]. How to use the data produced in education and teaching and give full play to the value of data intelligence has become a recent research hotspot in the field of education. However, how to truly combine big data with teaching management practice still needs further exploration [13–15].
Using intelligent technologies to analyze and mine educational data in depth, and thereby find the correlations among educational phenomena, educational content, and educational laws, is an urgent requirement of the times and conforms to the inherent logic of educational development. Literature [16] systematically reviews the application of data mining techniques and learning analytics to educational data over the last decade and provides information on resources such as specific tools and freely available datasets to help readers better understand and apply these methods. Literature [17] deployed a deep artificial neural network in a virtual learning environment and verified the effectiveness of this deep learning model through intervention experiments; the model can mine information about students’ learning behaviors from educational big data more accurately and provides technical support for sustainable education. Literature [18] emphasizes the importance of applying big data and introduces in detail the emerging multimedia intelligent learning and analysis technologies and the online streaming technologies for multimedia data applied in education. Literature [19] uses SWOT analysis to examine the trend of applying smart technologies in education and offers suggestions that prompt readers to consider how digital technologies can better meet the needs of a new generation adapting to a digital society.
As research on educational big data intelligence deepens and various intelligent devices are integrated and optimized, data can provide increasingly effective help for teaching management. Literature [20] designed an intelligent teaching management system based on big data technology and verified its effectiveness through performance testing; the system can transform existing management data into useful knowledge, thus improving the level and quality of school management. Literature [21] points out that using big data for in-depth research and fine-grained management of teaching is a new trend, and that an intelligent automation system developed through big data learning analytics can analyze students’ learning behaviors descriptively, predictively, and prescriptively and improve their learning process. Literature [22] created an online interactive educational intelligent system based on cloud-based big data analytics and verified its feasibility through compatibility, security, and system performance tests; the system can improve students’ learning efficiency and the efficiency of teaching management while supporting the key aspects of the teaching process. Literature [23], after analyzing the dilemmas, challenges, and innovative significance of education management in colleges and universities in the big data era, proposed a big data intelligent education management system that adapts to the management needs of the era by optimizing and adjusting education management work and provides students with better teaching services.
This study proposes using the DBSCAN clustering method and the XGBoost model to construct an academic early warning system. Taking students of Lanzhou University of Technology as the research object, the DBSCAN method is used to study the behavioral data generated by students during their time at school and to analyze the connection between borrowing and consumption behaviors and students’ performance. The proposed XGBoost early warning model is compared with other early warning models in a performance test to verify its accuracy. Then, the distribution of grades in different professional courses is studied, and finally the XGBoost early warning model is used to mine the key features affecting students’ academic performance in different semesters, in support of teaching management in colleges and universities.
Data is an important information resource, but only through continuous mining and refining can potential resources become usable wealth. Data mining is one of the important steps of knowledge discovery: it is the process of extracting implicit, previously unknown, but potentially useful information and knowledge from massive data, i.e., a class of deep data analysis methods. It should be noted that these massive data are usually random, fuzzy, and incomplete [24]. Data mining is also an interdisciplinary field that mainly involves machine learning, pattern recognition, neural networks, mathematical statistics, databases, rough sets, fuzzy mathematics, and so on.
Because data mining technology can comprehensively analyze the hidden intrinsic connections between teaching evaluation results and various factors, applying it to teaching evaluation is very meaningful. As teaching data continues to grow, applying data mining to teaching management in colleges and universities can promote further reform, improvement, and development of the teaching system. In the past, teaching management relied on personal experience or on the experience of others, which led to lagging management and had a negative impact on academic development and the cultivation of talent. Data mining technology can objectively reflect the problems in teaching management in colleges and universities and may provide an important basis for teaching decision-making.
Data mining technology has the following characteristics:
The scale of the data being processed is very large, reaching GB, TB, or even larger orders of magnitude.
Queries are generally ad hoc queries issued by decision makers (users), who often cannot state precise query requirements and rely on the system itself to find the data that may be of interest.
In some applications the data changes rapidly, so data mining must respond quickly in order to provide decision support at any time.
In data mining, rules are discovered on the basis of statistical laws. A discovered rule therefore need not hold for all data; it is considered valid once a certain threshold is reached, and a large number of rules may be discovered as a result.
The rules discovered by data mining are dynamic. They reflect only the current state of the database and must be updated as new data is continually added.
Artificial Neural Network
Artificial neural networks, also called neural networks or connection models, are algorithmic mathematical models that imitate the behavioral characteristics of animal neural networks to perform distributed parallel information processing [25]. Such a network processes information by relying on the complexity of the system and adjusting the interconnections among its large number of internal nodes. Neural networks are therefore a powerful tool that can readily be applied to prediction, classification, and clustering.
Decision Tree
Decision trees are strong tools for classification and prediction because they represent rules explicitly, which makes data mining based on decision tree methods attractive. Decision trees have a significant advantage in handling non-numerical data, as they require much less data preprocessing than neural networks.
Association rules
An association rule is an implication of the form $X \Rightarrow Y$, where $X$ and $Y$ are itemsets with $X \cap Y = \varnothing$.
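The fundamental principle of the DBSCAN density clustering algorithm is that the sample points within each cluster are closely related to each other, and the density around a point inside a cluster must be greater than the density around a point outside it, that is, the points within the same category must be related [26]. The density is described by a pair of parameters ($\varepsilon$, minPts), where $\varepsilon$ is the neighborhood radius and minPts is the minimum number of sample points required within that radius.
The $\varepsilon$-neighborhood of a data point is the set of sample points whose distance from it is no greater than $\varepsilon$.
In a sample set, if the distance between data point $p$ and data point $q$ is no greater than $\varepsilon$, then $q$ lies in the $\varepsilon$-neighborhood of $p$.
Core object: if there are no fewer than minPts sample points within a distance of $\varepsilon$ of a data point, that point is a core object.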

DBSCAN clustering principle
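Direct density reachability: data point $q$ is directly density reachable from data point $p$ if $q$ lies in the $\varepsilon$-neighborhood of $p$ (the points within distance $\varepsilon$ of $p$) and $p$ is a core object, i.e., the number of data points within the $\varepsilon$-neighborhood of $p$ is at least minPts.
Density reachability: sample point $q$ is density reachable from sample point $p$ if there is a chain of sample points $p_1 = p, p_2, \dots, p_n = q$ in which each $p_{i+1}$ is directly density reachable from $p_i$, so that consecutive points in the chain are no more than $\varepsilon$ apart.
Density connected: data points $p$ and $q$ are density connected if there is a data point $o$ from which both $p$ and $q$ are density reachable.
Boundary point: a sample point is called a boundary point if it belongs to a certain class, but the number of sample points in its $\varepsilon$-neighborhood is less than minPts, so it is not a core object.
Noise point: a point that belongs to no cluster and is not density reachable from any core object is called a noise point. For example, the red point in Fig. 1 is labeled as a noise point.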
Based on the above definitions, a cluster produced by the DBSCAN algorithm is the largest set of sample points that are linked to one another by density reachability and density connectivity. The specific process is as follows:
First, mark all sample points as not visited. Second, randomly select an unvisited sample point $p$, mark it as visited, and find the sample points in its $\varepsilon$-neighborhood. Third, if point $p$ is a core object, create a new cluster containing $p$ and the points that are directly density reachable from it; otherwise, temporarily mark $p$ as a noise point. Fourth, iteratively execute the third step to find the points that are directly density reachable from each core object in the cluster; sample points that are directly density reachable from more than one core object belong to the same cluster. Finally, when no new sample points can be added to the set, the clustering process ends.
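The clustering procedure described above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this study: the function names, the parameter values `eps=3` and `min_pts=3`, and the sample (grade, books borrowed) points are hypothetical, and points are scanned in index order rather than selected at random so that the result is reproducible.

```python
import math

NOISE = -1  # label for points that end up in no cluster


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a boundary point
            continue
        cluster_id += 1  # points[i] is a core object: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # boundary point of this cluster
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core object: keep expanding
    return labels


# Hypothetical (grade, books borrowed) pairs, for illustration only.
sample = [(60, 1), (62, 1), (61, 2), (90, 12), (91, 13), (92, 12), (30, 20)]
print(dbscan(sample, eps=3, min_pts=3))
```

In this sketch the number of clusters is not an input; it follows from the choice of `eps` and `min_pts`.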
The student academic alert model was created to inform students about their likely future academic performance. The alert results fall into three categories: incomplete, deferred completion, and successful completion. Incomplete means that the student took more than four additional years to earn a degree or did not finish it. Deferred completion means that the student took between one and three additional years to earn a degree. Successful completion means that the student has graduated with a degree. The XGBoost-based academic alert model for college students is shown in Figure 2 and contains three main parts: data preprocessing, the student academic alert algorithm, and student academic alert feature analysis.
Data preprocessing. To improve the performance and warning accuracy of the student academic alert algorithm, the raw dataset needs to be processed. The original dataset cannot be used directly for training because it may contain missing data, outliers, and redundant feature attributes. Data cleaning, feature selection, data transformation, and imbalance processing are the main aspects of data preprocessing in this paper.
Student academic early warning algorithm. This algorithm is the core of the whole model. It analyzes students’ data, such as final grades, enrollment grades, scholarships, and other information, to obtain their academic warning results. In this study, the XGBoost algorithm, which has achieved good results in various fields, is chosen to construct the student academic early warning model.
Feature analysis of students’ academic early warning. This part analyzes students’ academic features to determine how strongly each feature influences academic results, which helps in understanding students’ academic situation.

Student academic warning model
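The preprocessing stage of the model could be sketched roughly as follows. This is an illustration under stated assumptions rather than the pipeline used in this paper: the text names the preprocessing steps but not the specific techniques, so dropping incomplete rows, min-max scaling, and random oversampling, as well as the column names `gpa` and `outcome`, are hypothetical choices.

```python
import random


def drop_incomplete(rows):
    """Data cleaning: keep only rows with no missing (None) values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def min_max_scale(rows, column):
    """Data transformation: rescale one numeric column to the range [0, 1]."""
    values = [r[column] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return [{**r, column: (r[column] - lo) / span} for r in rows]


def oversample(rows, label_key, seed=0):
    """Imbalance processing: randomly repeat minority-class rows (assumed method)."""
    rng = random.Random(seed)
    groups = {}
    for r in rows:
        groups.setdefault(r[label_key], []).append(r)
    target = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        balanced.extend(rng.choice(g) for _ in range(target - len(g)))
    return balanced


# Hypothetical records, for illustration only.
raw = [
    {"gpa": 2.0, "outcome": "successful"},
    {"gpa": 4.0, "outcome": "successful"},
    {"gpa": None, "outcome": "deferred"},
    {"gpa": 3.0, "outcome": "incomplete"},
]
prepared = oversample(min_max_scale(drop_incomplete(raw), "gpa"), "outcome")
```

Oversampling is applied last so that the duplicated rows do not influence the scaling range.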
The XGBoost algorithm, whose full name is “Extreme Gradient Boosting”, was proposed in 2016 as a boosting learning algorithm. It improves on the GBDT algorithm by adopting a second-order Taylor expansion and a regularization term, which effectively avoid overfitting and accelerate convergence [27]. Compared with other algorithms, XGBoost is fast, efficient, generalizes well, and is suitable for large-scale tabular data, so it has been widely used in statistics and machine learning.
The ensemble idea of the XGBoost algorithm is to build multiple weak evaluators through residual fitting and accumulate them into a strong evaluator. Each additional weak evaluator is expected to improve the model, so that the final predictions are as close as possible to the true values and generalize as well as possible to unseen samples.
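The XGBoost algorithm can be expressed in additive form as:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$
Where $\hat{y}_i$ is the predicted value for sample $x_i$, $K$ is the number of trees, $f_k$ is the $k$-th tree, and $\mathcal{F}$ is the space of regression trees. The objective function consists of a loss term and a regular term:
$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \quad \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
Where $l$ is the loss function, $T$ is the number of leaf nodes of a tree, $w_j$ is the weight of leaf $j$, and $\gamma$ and $\lambda$ are penalty coefficients. Applying the second-order Taylor expansion to the loss at the $t$-th iteration and dropping the constant terms gives:
$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
Where $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the previous prediction $\hat{y}_i^{(t-1)}$. Writing $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ for the set $I_j$ of samples falling in leaf $j$, the optimal leaf weight is $w_j^* = -G_j / (H_j + \lambda)$, and the corresponding objective value and split gain are:
$$Obj^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{7}$$
$$Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \tag{8}$$
Here $G_L$, $H_L$ and $G_R$, $H_R$ are the sums of $g_i$ and $h_i$ over the samples in the left and right child nodes of a candidate split.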
Equation (7) can be used to evaluate the goodness of the tree model, with smaller values indicating a better model. From this, the score formula (8) used for the tree to perform node splitting can be derived, and finally, the Gain value of each feature is ranked to find the optimal feature and the optimal cut-off point.
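To make the node-splitting score concrete, the sketch below scans the candidate cut-off points of a single feature and keeps the one with the largest gain. It assumes the standard XGBoost form of the gain, $\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$, where $G$ and $H$ are sums of first and second derivatives of the loss. The function names, the toy data, and the default values of `lam` and `gamma` are hypothetical.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of splitting a node into left/right children (standard XGBoost form)."""
    def score(g_sum, h_sum):
        return g_sum * g_sum / (h_sum + lam)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Scan midpoints between sorted feature values; return (gain, threshold)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    g_total, h_total = sum(g), sum(h)
    g_left = h_left = 0.0
    best = (float("-inf"), None)
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        g_left += g[i]
        h_left += h[i]
        if x[i] == x[nxt]:
            continue  # no cut-off point between equal feature values
        gain = split_gain(g_left, h_left, g_total - g_left, h_total - h_left,
                          lam, gamma)
        if gain > best[0]:
            best = (gain, (x[i] + x[nxt]) / 2)
    return best


# Toy example with squared-error loss: g_i = prediction - y_i, h_i = 1.
x = [1, 2, 3, 4]
y = [0, 0, 1, 1]
prediction = 0.5
g = [prediction - yi for yi in y]
h = [1.0] * len(y)
gain, threshold = best_split(x, g, h)
```

Repeating this scan over every feature and ranking the resulting gains gives the optimal feature and cut-off point for the node.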
The research in this paper is based on behavioral data generated by students at Lanzhou University of Technology, most of which is recorded through students’ campus cards. Students use a single card for book borrowing, recharging, and consumption. The original data contains 600 book borrowing records of students in the second semester of 2023, more than 22,000 card consumption records, and 10,000 grade records. The book borrowing data for one semester is 25. The passing grade is 65 points, and the excellent grade is between 80-100 points.
The student behavior data covers three dimensions: student achievement, book borrowing, and campus card consumption. The DBSCAN algorithm is used to cluster these dimensions pairwise: student achievement and book borrowing in Figure 3, student achievement and consumption amount in Figure 4, and book borrowing and consumption amount in Figure 5.
Cluster analysis of student achievement and borrowing amount
The analysis is carried out using the DBSCAN clustering algorithm. The number of data clusters k is set to 4, the distance from each data object to the cluster center is calculated using the Euclidean distance, and the clustering results obtained are shown in Figure 3. The horizontal axis in Fig. 3 represents students’ grades, and the vertical axis represents the number of books a student borrowed in a semester. Four colors represent the four data clusters. The red cluster contains students with grades of 0-65, who borrowed 1-8 books in a semester, most of them only 1 book. The green cluster contains students with grades of 63-73, who borrowed 1-17 books. The blue cluster contains students with grades of 75-80, who borrowed 0-16 books. The orange cluster contains students with grades of 80-100, who borrowed 1-24 books. Taken together, the results show that undergraduate students borrow few books, with most borrowing only one. The visualization also shows that students with good grades borrow about twice as many books as students with failing grades.
Therefore, the number of books borrowed also gives an indication of a student’s grades. For example, in this data a student who borrows 17 books in a semester has an academic grade of 78 or more and therefore a passing grade.
Cluster analysis of students’ grades and consumption quotas
The DBSCAN clustering algorithm is again used for the analysis. The number of data clusters is set to 4, the distance between data objects is calculated using the Euclidean distance, and the final clustering result is shown in Figure 4. The horizontal axis in Figure 4 represents students’ academic performance, and the vertical axis represents the amount they spend at school in one month. The clusters in the figure are arranged by spending amount: the red cluster covers spending of 0-240 yuan, the green cluster about 230-380 yuan, the blue cluster about 400-600 yuan, and the orange cluster about 600-1700 yuan. The visualization shows no obvious internal relationship between student performance and campus card consumption, but it does reveal the activity trends and consumption characteristics of most undergraduates, who can be divided into the four categories shown in the figure.
Cluster analysis of book borrowing volume and consumption quota
The DBSCAN clustering algorithm is again used for the analysis. The number k of data clusters is set to 4, the distance between data objects is calculated using the Euclidean distance, and the clustering results are shown in Figure 5. In Figure 5, the horizontal axis represents the number of books borrowed by students, and the vertical axis represents the amount students spend at school per month.
The clusters in the figure are divided by spending amount: the red cluster covers spending of about 100-240 yuan, the green cluster about 300-410 yuan, the blue cluster about 400-600 yuan, and the orange cluster about 600-1200 yuan. The visualization of this partition-based clustering shows that students who spend about 400-500 yuan borrow significantly more books than students in the other spending categories. Most students with monthly spending of less than 250 yuan or more than 1,000 yuan did not borrow books.

Student achievement and borrowing number clustering

Student achievement and consumption amount clustering

Student borrowing number and consumer data clustering
For the four algorithms, RF, GBDT, XGBoost, and LGB, the training parameters need to be tuned before training to improve the prediction performance of each model. Multiple combinations of training parameters are tested, and the combination with the best prediction effect is chosen as the final one. Since the prediction of academic warning in this paper is a classification problem, classification precision, prediction accuracy, recall, and F1 score are used as the evaluation indexes of model performance. The results of the four algorithms are shown in Table 1.
Four algorithms test results
Algorithm | Classification accuracy | Prediction accuracy | Recall rate | F1 score |
---|---|---|---|---|
XGBoost | 0.96341 | 0.89452 | 0.88973 | 0.92106 |
RF | 0.9418 | 0.8703 | 0.85812 | 0.85033 |
LGB | 0.93752 | 0.88594 | 0.83093 | 0.87901 |
GBDT | 0.91602 | 0.87763 | 0.84945 | 0.85132 |
Table 1 shows that the XGBoost algorithm achieves 96% classification precision, 89% prediction accuracy, 88% recall, and a 92% F1 score. The RF algorithm is the next best, and the GBDT algorithm has the worst performance, trailing the XGBoost algorithm by 4.7, 1.7, 4.0, and 7.0 percentage points on the four metrics, respectively. Of these four algorithms, XGBoost therefore classifies the dataset used in this paper best, so it is chosen for student academic prediction.
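As an illustration of how such evaluation indexes can be computed for a three-class problem, the sketch below derives accuracy and macro-averaged precision, recall, and F1 score from true and predicted labels. The averaging scheme is an assumption, since the text does not state how the per-class values were combined, and the function name and label values are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 score."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1_scores = [], [], []
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1_scores) / n,
    }


# Hypothetical labels, for illustration only.
y_true = ["successful", "successful", "deferred", "deferred", "incomplete", "incomplete"]
y_pred = ["successful", "successful", "deferred", "successful", "incomplete", "incomplete"]
metrics = classification_metrics(y_true, y_pred)
```

Macro averaging weights every class equally, which matters when one outcome, such as incomplete, is much rarer than the others.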
To visualize whether the grade distributions of different courses in this school’s academic data are normal, this experiment divides grades into five achievement intervals: (80, 100], (60, 80], (40, 60], (20, 40], and [0, 20]. To analyze the distribution of students’ achievement in different professional courses, the achievement data of several courses are randomly selected as examples. The probability theory course for management majors, the psychology and English courses, the Java programming course for accounting majors, and the Java programming course for software engineering majors are selected, and the histograms of their grade distributions are shown in Fig. 6, Fig. 7, Fig. 8, and Fig. 9.
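The interval division can be sketched as a small binning routine. The function names and sample scores are hypothetical, and the sketch assumes that a score lying exactly on a boundary, such as 80, belongs to the lower interval, here (60, 80].

```python
import math
from collections import Counter

# Interval labels, lowest first. Assumption: boundary scores go to the lower interval.
INTERVALS = ["[0, 20]", "(20, 40]", "(40, 60]", "(60, 80]", "(80, 100]"]


def grade_interval(score):
    """Map a score in [0, 100] to the label of its achievement interval."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    return INTERVALS[max(0, math.ceil(score / 20) - 1)]


def grade_histogram(scores):
    """Count how many scores fall in each interval (empty intervals included)."""
    counts = Counter(grade_interval(s) for s in scores)
    return {label: counts.get(label, 0) for label in INTERVALS}


# Hypothetical course scores, for illustration only.
histogram = grade_histogram([0, 20, 55, 72, 80, 81, 100])
```

Keeping empty intervals in the result makes histograms for different courses directly comparable.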

Distribution of probability theory scores for management majors

Distribution of psychology and English scores

Distribution of Java programming scores for accounting majors

Distribution of Java programming scores for software engineering majors
The four graphs show that grades for the Java programming course in the accounting major are mainly distributed around 70-80 points, whereas grades for the psychology and English courses in the management major are mainly distributed around 70. Grades for the Java programming course in the software engineering major are spread across every achievement interval and do not conform to a normal distribution. The grade distributions of different courses in different majors therefore differ considerably.
This large difference is caused not only by the characteristics of each course and the strictness of the teacher’s marking but also by missing values and a relatively large number of erroneous values in the raw data, which distort the overall distribution of students’ grades. The student data therefore needs further processing.
The correlation coefficients between students’ performance features were calculated using the Pearson correlation coefficient, and the relationship between first-semester GPA and second-semester GPA is shown in Figure 10 as an example.

Feature correlation scatter plot
Fitting the scatter plot shows that first-semester GPA and second-semester GPA are positively correlated overall, which is consistent with common sense: performance in one semester is strongly correlated with performance in the next, and students who start out learning poorly are more likely to perform poorly later as well. In addition, the skewed normal distributions of first-semester and second-semester GPAs indicate that the largest number of students have GPAs between 4.0 and 4.5, which is consistent with expectations for overall student performance.
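For reference, the Pearson correlation coefficient between two features is the covariance of the two series divided by the product of their standard deviations. A minimal sketch follows; the function name and the GPA values are hypothetical and are not taken from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical first- and second-semester GPAs, for illustration only.
gpa_first = [2.1, 3.0, 3.6, 4.2, 4.4]
gpa_second = [2.0, 3.2, 3.3, 4.0, 4.5]
r = pearson(gpa_first, gpa_second)
```

A value of `r` close to 1 indicates a strong positive linear relationship, while a value near 0 indicates no linear relationship.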
To improve the interpretability of the model and analyze which data features have a greater impact on its effectiveness, this section ranks the input features by importance. Feature importance analysis plays an important role in predictive modeling: a higher importance value indicates that the feature is more useful to the model, and ranking the features reveals which are most and least relevant to the target. This interpretation can in turn guide the collection of better-quality data.
Since XGBoost is the best predictor, the results produced by the XGBoost model were used as the basis for calculating the importance value of each feature, and the output is shown in Figure 11. V1 represents the number of failed credits in the second semester, V2 represents the number of failed courses in the second semester, V3 represents the second-semester grade point average, V4 represents the second-semester grade point, V5 represents the first-semester grade point average, V6 represents the second-semester credits taken, V7 represents the number of failed credits in the first semester, V8 represents the first-semester GPA, V9 represents the number of failed courses in the first semester, and V10 represents the number of credits taken in the first semester.

Ranking of feature importance
The features are ranked by importance, with the horizontal axis representing the features of each semester and the vertical axis representing their importance. The importance values of the number of failed credits, the number of failed courses, and the grade point average in the second semester are 0.35, 0.3, and 0.25, respectively, so these three features are the most important indicators of whether a student receives an academic warning. The three indicators with the smallest impact on academic warning are the first-semester GPA, the first-semester grade point average, and the credits taken in the first semester.
The results in Figure 11 correspond to the actual situation. To reduce the difficulty of learning and raise students’ enthusiasm, the first-semester curriculum tends to be basic and simple, consisting mostly of introductory courses that are not very difficult, so most students perform relatively well. The second-semester curriculum adds more specialized courses, and the difficulty rises. Students find learning more strenuous, so this semester differentiates students’ learning levels more clearly and opens a gap between students with weaker learning ability and the others.
The article first uses the DBSCAN clustering algorithm to analyze the in-school behavioral data of students at Lanzhou University of Technology, examining the relationships among students’ grades, book borrowing, and monthly consumption amount. The clustering results demonstrate a significant difference in borrowing quantity and consumption amount between students with good grades and students with unqualified grades, and the borrowing and consumption amounts of students with good grades are more stable. The performance of the proposed XGBoost prediction model is then compared with that of other models: its classification accuracy reaches 96%, its prediction accuracy 89%, and its recall 88%, which is better than the other models. Finally, based on the academic data of students in the school, XGBoost is used to calculate feature importance. The results indicate that the first-semester courses were simpler and students performed well, whereas in the second semester, as course difficulty increased, students’ performance diverged. The second-semester indicators therefore carry a higher weight in the academic warning. Using the XGBoost model for academic warning helps teaching administrators better understand students’ academic performance and learning.