Construction of a machine learning-based model for stratified assessment of college students’ mental health and design of intervention pathways

University is a critical period of growth and development for students. The pursuit of learning and exploration gives rise to a wealth of psychological activity, and students’ psychological qualities continue to develop [1–2]. In carrying out cultural quality education, colleges and universities cannot ignore students’ mental health problems. The most common problems at present include employment pressure, academic pressure, and emotional problems, all of which hinder students’ healthy growth and development [3–4]. From the perspective of positive psychology, mental health intervention and education should help college students develop sound moral norms and values, form good self-knowledge and self-evaluation abilities, and improve their mental health and capacity for self-intervention [5–6].

At this stage, the lack of clear goals for the mental health intervention of college students has led to problems in the content and methods of mental health education, and the traditional mental health education model can hardly be adapted to the psychological characteristics of college students in the new era. Therefore, it is of great significance to explore the mental health interventions for college students and assess the mental health level of college students, which is an inevitable measure to improve the mental health level of contemporary college students [7–9].

Traditional mental health assessment relies on interviews and questionnaires administered by a doctor or counselor. The diagnostic result depends largely on the experience of the assessor and the honesty of the respondent, so it is susceptible to subjective differences and may lead to misdiagnosis, missed diagnosis, and inconsistency between successive diagnoses [10–11]. In intelligent assisted diagnosis, the doctor still makes the diagnosis, but the technical means and the basis for judgment have changed dramatically. Technically, intelligent assisted diagnosis uses modern information technology to collect, summarize, and analyze large amounts of data, then extracts and transforms the data and stores them in a database system, providing as comprehensive a data basis as possible for the diagnosis of mental illness [12–13]. With the development of big data, artificial intelligence, and related technologies, intelligent assisted diagnosis has gradually been applied to complex diagnostic tasks and offers broad prospects for the prediction, screening, and assisted diagnosis and treatment of psychosomatic disorders [14–16].

In this paper, students’ mental health data are first preprocessed: after data screening and data cleaning, the data are integrated and converted. The data are then scaled to a specific range to complete normalization, and the SMOTE algorithm is used to deal with the imbalance of the sample data. Students’ mental health is then characterized by three main factors: physical condition at the individual factor level, growth history at the family factor level, and social intimacy at the school factor level. Under the Stacking framework, a two-layer multi-learner fusion algorithm is proposed that combines several different learners, such as SVM, RF, and XGBoost, to construct the overall network structure of the DSMFA-based student mental health prediction model. Taking the data collected from 5,762 online questionnaires completed by college students as the research object, we carry out a stratified assessment of students’ mental health and explore their mental health problems.

2

Hierarchical model for assessing the mental health of students in higher educational institutions

2.1

Processing of student mental health data

This experimental study was conducted on students enrolled in higher education. The dataset was collected through an online questionnaire over a period of three months, with a deadline of August 2024, and 5,762 completed responses were finally obtained. The dataset consists of three main components: personality trait information, demographic information, and psychological state information.

Data preprocessing underpins the performance of every model and algorithm. Faced with massive raw data, it is necessary to clarify the objectives according to the actual needs of the study and to screen out the attributes the subject requires. Much of the raw data cannot be used directly. Selecting the required features during data screening and applying dimensionality reduction to the raw data reduces redundancy and alleviates problems such as slow model operation caused by excessive data dimensions. In addition, because of program failures, database read and write errors, and other unknown causes, the raw data often contain missing values and values that do not make sense, and such erroneous data lower the quality of the entire dataset. Data quality plays an almost decisive role in the accuracy and reliability of a trained model. Therefore, before model building and implementation, the original data must be preprocessed so that they better meet the model’s needs and the model achieves better prediction results.

2.1.1

Pre-processing of data

During the questionnaire assessment, various factors can affect the integrity of the final scale data. In this project, the following steps are taken to preprocess the raw data for anomalies.

1)

Data screening

When analyzing massive data, we generally do not need to use every attribute of the original data set. This is because the primary goal of data collection is to gather as much and as comprehensive data as possible, but the specific use of the data is not considered carefully. And with a specific research objective, it is necessary to filter out the data needed for that research objective.

2)

Data Cleaning

Data cleaning addresses the accidental errors that remain after data screening, such as inconsistent, anomalous, or missing data, which lower data quality. Different measures are generally taken for different situations, mainly missing value processing, outlier processing, and duplicate processing. In this paper, abnormal data in the demographic information table and the two major scale tables are handled as follows: first, a record that is missing more than three relevant attributes is deleted; second, missing numerical values are filled with the mean, and missing string values are filled with the mode.

3)

Data Integration

In practical applications, data may come from multiple sources, and data from different sources must be integrated to create a unified data set. This integration is generally achieved by merging, deduplicating, and reorganizing data from multiple sources. The key to data integration is to ensure the consistency, integrity, and reliability of the data, and to make the data convenient to use in subsequent model building.

4)

Data conversion:

Data transformation is also known as data mapping processing, which means that data in different application scenarios needs to be transformed in different ways in order to improve the performance of the modeling algorithm.
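The two cleaning rules in step 2) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the field names (`age`, `sleep_hours`, `gender`, `grade`, `major`) and the helper `clean_records` are hypothetical, and missing values are assumed to be encoded as `None`.

```python
from statistics import mean, mode

def clean_records(records, max_missing=3):
    """Apply the two cleaning rules to a list of dict records (None = missing)."""
    # Rule 1: drop any record that is missing more than `max_missing` attributes.
    kept = [dict(r) for r in records
            if sum(v is None for v in r.values()) <= max_missing]
    if not kept:
        return kept
    # Rule 2: fill numeric gaps with the column mean, string gaps with the mode.
    for col in kept[0]:
        present = [r[col] for r in kept if r[col] is not None]
        if not present:
            continue  # nothing to impute from
        if all(isinstance(v, (int, float)) for v in present):
            fill = mean(present)
        else:
            fill = mode(present)
        for r in kept:
            if r[col] is None:
                r[col] = fill
    return kept

# Hypothetical questionnaire rows; the last one is missing four attributes.
rows = [
    {"age": 19, "sleep_hours": 7.0, "gender": "F", "grade": "1", "major": "math"},
    {"age": None, "sleep_hours": 6.0, "gender": "F", "grade": "2", "major": "art"},
    {"age": 21, "sleep_hours": None, "gender": None, "grade": "2", "major": "art"},
    {"age": None, "sleep_hours": None, "gender": None, "grade": None, "major": "art"},
]
cleaned = clean_records(rows)
```

The function copies each record before imputing, so the raw rows stay available for auditing.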

This research does not involve any questions that reveal the identity of the respondent. An informed consent form was provided on the first page of the questionnaire for the respondent’s confirmation, and all participants were informed and took part in this research voluntarily.

2.1.2

Normalization of data

Data normalization is an operation that scales the data to a specific range, usually [0, 1] or [-1, 1]. It gives different features the same scale and avoids problems caused by differences in the scales of the data themselves. Of the various normalization methods available, min-max normalization is commonly used. The conversion function is as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$

Where x′ is the scaled value, x is the original data value, x_min is the data minimum, and x_max is the data maximum. The min-max normalization method scales data to a specified range by subtracting the minimum value from each data point and dividing by the difference between the maximum and minimum values.

Normalization of data on demographic information attributes and personality trait scales can effectively avoid the impact of different feature scales and size differences on the training and prediction results of early warning models.
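As a small illustration of equation (1), the sketch below scales one feature column to [0, 1] and, optionally, to another target range such as [-1, 1]. The function name `min_max_scale`, the handling of a constant column, and the sample values are assumptions made for the example.

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Scale values linearly so the minimum maps to `lo` and the maximum to `hi`."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # A constant feature has no spread to rescale; map every value to `lo`.
        return [lo for _ in values]
    return [lo + (hi - lo) * (x - x_min) / (x_max - x_min) for x in values]

ages = [18, 19, 21, 22]                      # hypothetical demographic attribute
scaled = min_max_scale(ages)                 # range [0, 1], equation (1)
scaled_sym = min_max_scale(ages, -1.0, 1.0)  # range [-1, 1]
```

In practice the minimum and maximum should be taken from the training set only and reused for the test set, so that no information leaks from test data into preprocessing.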

2.1.3

Balancing of data

Data balancing refers to the use of various techniques, during model training, to bring the numbers of samples in the different classes into balance so as to improve the accuracy and stability of the model. In actual data collection, the numbers of samples in different classes are very likely to be unbalanced. Especially in binary classification problems, one category may contain far more samples than the other. Such imbalance can bias the training and evaluation of the model and thus reduce its accuracy and reliability. Therefore, the data must be balanced to limit the loss of warning-model performance caused by data imbalance.

To solve the imbalance problem in mental state attribute data, common balancing techniques include down-sampling, up-sampling, and generating artificial samples. In this project, the SMOTE algorithm, a method of generating artificial samples, is used to address the sample imbalance [17]. SMOTE enriches the diversity of the mental state dataset by generating artificial samples that are similar to the original samples, which mitigates the overfitting that up-sampling by simple duplication can cause and the information loss of down-sampling. As applied here, the generation process is as follows. First, for each sample x with anxiety tendency, compute its Euclidean distance to every other sample with anxiety tendency to obtain its k nearest neighbors. Second, set the sampling multiplicity N according to the degree of imbalance, and for each anxiety-prone sample x randomly select several samples from its k nearest neighbors. Third, for each randomly selected nearest neighbor x_n, construct a new sample from the original sample according to x_new = x + rand(0,1) × (x_n – x), where rand(0,1) is a random number between 0 and 1. Finally, the numbers of samples with and without anxiety tendency are made equal, and a new mental state dataset is obtained. The training datasets of the depression and stress binary classification models are handled in the same way.
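The SMOTE generation steps can be sketched as follows. This is a simplified illustration under stated assumptions rather than the implementation used in the study: the function name `smote`, its parameters, and the toy minority samples are hypothetical, and each synthetic point is interpolated on the line segment between a minority sample and one of its k nearest minority neighbors.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def smote(minority, n_new_per_sample=1, k=5, seed=0):
    """Generate synthetic minority samples by interpolating towards neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority-class neighbours of x (excluding x itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: euclidean(x, p))[:k]
        for _ in range(n_new_per_sample):
            x_n = rng.choice(neighbours)
            gap = rng.random()  # rand(0, 1)
            # x_new = x + gap * (x_n - x): a point on the segment from x to x_n.
            synthetic.append([a + gap * (b - a) for a, b in zip(x, x_n)])
    return synthetic

# Hypothetical anxiety-prone samples with two normalized features each.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, n_new_per_sample=2, k=2)
```

Because every synthetic point lies between two real minority samples, the new data stay inside the region already occupied by the minority class.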

2.2

Characteristic constructs of students’ mental health

Based on psychological and sociological theories, a hierarchical analysis method was used to construct the indicator system of mental health assessment of college students mainly covering three core indicators and subdivided into 18 specific assessment indicators at the basic level. The table of the characteristic index system is specifically shown in Table 1. Next, this section will work on the feature construction from the personal factor level of physical condition, family factor level of growth history, and school factor level of social intimacy, respectively.

Table 1.

Mental health assessment index system

Target layer	Characteristic layer	Basic layer
Characteristic index system	Personal factor	Life sense
		Neuroticism
		Quality of sleep
		Self-evaluation
		Health pressure
		Sense of meaning
		Life attitude
		Favourability
		Extroversion
		Life goal
	Family factor	Family pressure
		Relationship pressure
		Career pressure
		Frustration pressure
	School factor	School environmental pressure
		Interpersonal pressure
		Academic pressure
		Openness

1)

Social intimacy

College students, as a special group in the process of social and cultural development, are at a critical stage of personal growth. Some of the college students who show withdrawn characteristics (which can be identified by homogeneity elimination method in the mining of college students’ social networks) are often accompanied by a variety of psychological disturbances, and it is difficult to maintain normal interpersonal interactions, and this unhealthy mode of social interactions urgently needs to be corrected and adjusted in a timely manner. The following calculations were made to obtain the mutual intimacy matrix between the students:

For student A, an intimacy matrix D_A (a row vector) is obtained from the intimacy formula, as in equation (2). It contains the intimacy values between student A and the other students at different locations and represents the strength of the relationship between student A and each of them: each element D(A,n) is the intimacy value between student A and student n.

$$D_{A} = [D(A, B) \; \dots \; D(A, n)] \tag{2}$$

Extending the intimacy relationship between student A and the other students to the whole school gives the intimacy matrix D in equation (3). This is a two-dimensional matrix in which each row corresponds to a student and each column holds that student’s intimacy value with another student. Element D(A,B) represents the intimacy value of student A and student B:

$$D = \begin{bmatrix} D(A, B) & \dots & D(A, n) \\ \vdots & \ddots & \vdots \\ D(n, A) & \dots & D(n, n) \end{bmatrix} \tag{3}$$

By analyzing the intimacy matrix DA, it is possible to understand the closeness between student A and other students in different locations, while the whole intimacy matrix D, as in equation (3), can be used to depict the distribution of intimacy among students in the whole school. This helps to gain a deeper understanding of the structure and characteristics of students’ social networks.

Constructing the similarity matrix:

The similarity matrix GS is calculated from the intimacy matrix D of equation (3), as shown in equation (4):

$$GS = e^{\frac{D^{2}}{2 \sigma^{2}}} \tag{4}$$

2)

Physical condition

The three indicators of a student’s physical condition are the frequency of medical treatment, the weighted score of physical education tests, and the punch-card score from sports apps. Each of the three is assumed to follow a normal distribution, so students’ physical condition can be compared through each student’s position within the three distributions, i.e., the standardized scores (Z-scores), as follows:

For the frequency of medical visits p_A of student A, the whole group of students is P = [p₁, p₂, …, p_n–1, p_n]. Assuming that the data of this group conform to a normal distribution, the mean and standard deviation of the distribution are calculated and then used to compute the student’s Z-score.

The probability density of the normal distribution is given by equation (5):

$$f(x) = \frac{1}{\sqrt{2 \pi} \sigma_{0}} \exp\left(-\frac{(x - \mu)^{2}}{2 \sigma_{0}^{2}}\right) \tag{5}$$

For the frequency of medical visits, the mean is calculated as in equation (6):

$$\bar{X}_{\text{medical}} = \frac{\sum_{i = 1}^{n} p_{i}}{n} \tag{6}$$

where $\bar{X}_{\text{medical}}$ is the sample mean of the number of visits and n is the sample size. The standard deviation is given by equation (7):

$$S_{\text{medical}} = \sqrt{\frac{\sum_{i = 1}^{n} (p_{i} - \bar{X}_{\text{medical}})^{2}}{n - 1}} \tag{7}$$

where S_medical denotes the sample standard deviation. The standardized score is then calculated as in equation (8):

$$Z_{\text{medical}} = \frac{X_{\text{medical}} - \bar{X}_{\text{medical}}}{S_{\text{medical}}} \tag{8}$$

where X_medical is the student’s number of medical visits p_A.

The same steps are applied to the sports test scores and the sports app punch-card data, which are likewise assumed to be normally distributed.

Finally, the mean of the Z-scores of the three factors is calculated as in equation (9):

$$\text{Average Z-score} = \frac{Z_{\text{medical}} + Z_{\text{test}} + Z_{\text{app}}}{3} \tag{9}$$

This mean Z-score represents the student’s average relative position across the three distributions. In a standard normal distribution, which has a mean of 0 and a standard deviation of 1, a data point equal to the mean has a Z-score of 0, indicating that it lies at the center of the distribution. The absolute value of the mean Z-score therefore shows how far the student lies from the center, and its sign shows on which side. This score can then be used in predicting and judging students’ mental health, providing data support for the subsequent model.

3)

Growth history

Individual mental health is deeply shaped by a variety of factors in the course of growing up, including the family environment (e.g., parents’ marital status, family atmosphere, and parent-child interactions), school experiences (academic pressure, subject choices, and achievement), and social relationships (peer interactions, and campus environment). Family background affects individual’s sense of security, emotional development, and social skills, and especially poses challenges to the mental health of children left behind. Successful experiences, interest development, and adaptability in the academic process are related to future expectations and mental health status. At the same time, the social environment and peer interactions also play a significant role in the social skills and psychological well-being of individuals. In conclusion, recognizing and understanding the multiple factors in an individual’s developmental process, and providing positive guidance and support at the right time, can help to promote the development of all aspects of mental health, and can help to formulate intervention strategies to help children cope with life’s challenges and cultivate positive mental qualities.
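Returning to the physical condition feature, the standardized-score calculation of equations (6) to (9) can be sketched as below. The cohort values, the dictionary keys (`medical`, `test`, `app`), and the function names are hypothetical; the sample standard deviation uses the n − 1 denominator of equation (7).

```python
from statistics import mean, stdev

def z_score(value, population):
    """Equation (8): standardized score from the sample mean and sample std."""
    return (value - mean(population)) / stdev(population)

def physical_condition_score(student, cohort):
    """Equation (9): average of the Z-scores of the three indicators."""
    keys = ("medical", "test", "app")
    return sum(z_score(student[k], cohort[k]) for k in keys) / len(keys)

# Hypothetical cohort data for the three indicators.
cohort = {
    "medical": [1, 2, 3, 4, 5],       # number of medical visits
    "test": [60, 70, 80, 90, 100],    # weighted physical education test score
    "app": [10, 20, 30, 40, 50],      # sports app punch-card score
}
student_a = {"medical": 3, "test": 80, "app": 30}  # sits at every cohort mean
score_a = physical_condition_score(student_a, cohort)
```

A student who sits exactly at the cohort mean on all three indicators receives a score of zero, which matches the interpretation of the mean Z-score given for equation (9).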

2.3

Mental Health Stratified Assessment Modeling

This chapter proposes a two-layer multi-learner fusion algorithm based on the Stacking framework, where the first layer is the combination layer of base learners and the second layer is the output layer of meta-learners [18]. The Stacking algorithm is used to combine multiple different learners, synthesize the predictive classification results from the training of each heterogeneous learner to form a new training dataset, and send it to the second layer for re-training to improve the final prediction performance of the model.

2.3.1

Learner Selection

The DSMFA-based model is trained on the student mental health prediction dataset using several different base learners. Once the first-layer base learners have finished training, a new training set and a new test set are obtained. The second-layer meta-learner is trained on this new dataset in order to synthesize the outputs of the individual base learners, and it gives the final classification prediction. This subsection focuses on selecting the base learners and the meta-learner.

1)

Base Learner

The key factors affecting the prediction performance of the model are the prediction ability of the base learners and their variability. In order to ensure the accuracy of the classification results of the final prediction of the model, through the practice of a variety of learners, this chapter adopts three heterogeneous learners, SVM, RF and XGBoost, as the first layer of base learners of DSMFA, and the following is the introduction of each base learner.

(1)

SVM [19].

SVM is a machine learning algorithm widely used for classification tasks. Given a dataset {x_i, y_i}, where x_i ∈ Rⁿ and y_i ∈ Y = {–1, 1}, the core idea is to find an optimal hyperplane ω^Tx + b = 0 that satisfies two conditions: it classifies the data as correctly as possible, and it maximizes the margin between the two categories. The classification function f(x) is shown in equation (10):

$$f(x) = \operatorname{sign}(\omega^{T} x + b) = \begin{cases} +1, & \omega^{T} x + b > 0 \\ -1, & \omega^{T} x + b < 0 \end{cases} \tag{10}$$

where sign(·) denotes the sign function.

(2)

RF [20].

RF is an ensemble learning algorithm based on Bagging. Let the classification result of input sample x be R(x); its decision formula is shown in equation (11):

$$R(x) = \arg\max_{b_{j}} \sum_{i = 1}^{n} I(h_{i}(x) = b_{j}), \quad j = 1, 2 \tag{11}$$

where h_i(x) denotes the classification prediction of the i-th decision tree (i = 1, 2, …, n), b_j denotes the j-th category in the classification labels (the student mental health prediction task has been transformed into a binary classification task in this chapter), and I(·) denotes the indicator function, with I = 1 when h_i(x) = b_j and I = 0 otherwise.

(3)

XGBoost [21].

XGBoost, also known as Extreme Gradient Boosting, is an algorithm based on gradient boosting decision trees. It can be described by equation (12):

$$\hat{y}_{i} = \sum_{k = 1}^{K} f_{k}(x_{i}) \tag{12}$$

where $\hat{y}_{i}$ denotes the classification result of the model, f_k ∈ F denotes the k-th decision tree model, K denotes the number of decision trees, F denotes the set of all decision trees, and x_i denotes the i-th sample.

The objective function of XGBoost consists of a loss function and a regularization term, as in equations (13) and (14); the regularization term is where XGBoost refines the objective:

$$L^{(t)} = \sum_{i = 1}^{n} \ell\left(y_{i}, \hat{y}_{i}^{(t - 1)} + f_{t}(x_{i})\right) + \Omega(f_{t}) \tag{13}$$

$$\Omega(f_{t}) = \gamma T + \frac{1}{2} \lambda \|\omega\|^{2} \tag{14}$$

where ℓ(·) denotes the loss function, y_i denotes the true label of the i-th sample, $\hat{y}_{i}^{(t - 1)}$ denotes the prediction after the first (t – 1) iterations, f_t denotes the decision tree added in the t-th iteration, Ω(f_t) denotes the regularization term of the t-th iteration, γ denotes the complexity penalty coefficient of the decision tree, T denotes the number of leaf nodes of the decision tree, ω denotes the vector of leaf weights, and λ denotes the smoothing (regularization) coefficient on the leaf weights.

A second-order expansion of the objective function leads to equation (15):

$$L^{(t)} \simeq \sum_{i = 1}^{n} \left[ g_{i} f_{t}(x_{i}) + \frac{1}{2} h_{i} f_{t}^{2}(x_{i}) \right] + \gamma T + \frac{1}{2} \lambda \sum_{j = 1}^{T} \omega_{j}^{2} \tag{15}$$

where g_i and h_i denote the first-order and second-order derivatives of the loss function with respect to the prediction $\hat{y}_{i}^{(t - 1)}$ for sample x_i, q(x_i) denotes the leaf node to which x_i is assigned, so that f_t(x_i) = ω_{q(x_i)}, and ω_j denotes the weight of the j-th leaf node.

2)

Meta-learner

The DSMFA proposed in this chapter combines the prediction results obtained from three different base learners, combines the new feature data as the input of the second layer meta-learner, and finally outputs the classification prediction results of the whole model. Since the base learner adopts a more structured modeling algorithm that may produce overfitting, the meta-learner favors simpler algorithms to avoid overfitting.

DSMFA uses a simple logistic regression classifier as the second-layer meta-learner. Its classification model f(x) is shown in equation (16):

$$f(x) = \frac{1}{1 + e^{-(\omega^{T} x + b)}} \tag{16}$$

where ω denotes the model parameter vector, x denotes the feature vector of a sample, and b denotes the model bias parameter.

Logistic regression is trained by maximizing the log-likelihood of the samples; the goal is to find the optimal ω and b so that the predicted probabilities match the sample classes as closely as possible.
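To make the decision rules of the selected learners concrete, the sketch below evaluates equation (10) for SVM, equation (11) for RF, and equation (16) for logistic regression on hand-picked values. The function names and numbers are illustrative assumptions; training of the parameters is not shown, and the XGBoost objective is omitted because it describes training rather than a decision rule.

```python
import math
from collections import Counter

def svm_decision(w, b, x):
    """Equation (10): class label from the sign of w^T x + b.

    The equation leaves a score of exactly 0 undefined; -1 is returned here.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

def rf_vote(tree_predictions):
    """Equation (11): the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def logistic(w, b, x):
    """Equation (16): predicted probability of the positive class."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

svm_label = svm_decision([1.0, -1.0], 0.5, [2.0, 1.0])  # score = 1.5
rf_label = rf_vote([1, 0, 1, 1, 0])                     # three of five trees vote 1
lr_prob = logistic([0.0, 0.0], 0.0, [3.0, 4.0])         # zero weights give 0.5
```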

2.3.2

Model network structure

The DSMFA-based model is a two-layer Stacking framework consisting of a base learner layer and a meta-learner layer. The constructed student mental health prediction dataset is first used to train the three base learners in the base learner layer (SVM, RF, and XGBoost) with 5-fold cross-validation.

1)

Input layer

The source of the input layer is the students’ mental health prediction dataset obtained after three phases of feature selection, data preprocessing, and data enhancement in this chapter, which is used as the input data for the subsequent learner, and the dataset is partitioned into the training dataset X_train and the test dataset X_test in a certain ratio.

2)

Base Learner Layer

DSMFA selects SVM, RF, and XGBoost as the base learners in the base learner layer, denoted B₁, B₂, and B₃. Taking the t-th algorithm in the base learner layer as an example, the base learner model obtained in each round of cross-validation training is given by equation (17):

$$B_{t}^{k} = L_{t}(X_{\text{train}} - X_{\text{train}}^{k}) \tag{17}$$

where $B_{t}^{k}$ denotes the t-th base learner model after the k-th round of 5-fold cross-validation training, L_t(·) denotes the t-th classification algorithm of the base learner layer, and $X_{\text{train}} - X_{\text{train}}^{k}$ is the training set with the k-th fold $X_{\text{train}}^{k}$ held out for validation.

The base learner $B_{t}^{k}$ trained in each round is then used to predict the classification results for its validation set, as in equation (18), and the predictions of the five rounds are collected as in equation (19):

$$p_{t}^{k} = B_{t}^{k}(X_{\text{train}}^{k}) \tag{18}$$

$$P_{t} = (p_{t}^{1}, p_{t}^{2}, p_{t}^{3}, p_{t}^{4}, p_{t}^{5}) \tag{19}$$

where $p_{t}^{k}$ denotes the classification result predicted by the t-th base learner on the validation set $X_{\text{train}}^{k}$ after the k-th round of 5-fold cross-validation training, and P_t denotes the set of classification results of the t-th base learner over the 5 validation sets once 5-fold cross-validation is complete.

During 5-fold cross-validation, the base learner $B_{t}^{k}$ trained in each round also predicts the classification results for the test set X_test. The 5 predictions are then averaged row by row, and the result is used as the new test set for the meta-learner layer, as in equations (20) and (21):

$$z_{t}^{k} = B_{t}^{k}(X_{\text{test}}) \tag{20}$$

$$Z_{t} = \frac{1}{5} \sum_{k = 1}^{5} z_{t}^{k} \tag{21}$$

where $z_{t}^{k}$ denotes the classification result predicted by the t-th base learner on the test set X_test after the k-th round of 5-fold cross-validation training, and Z_t denotes the average of the test set predictions obtained by the t-th base learner over the 5 rounds.

3)

Meta-learner layer

In the DSMFA meta-learner layer, logistic regression is used as the meta-learner, denoted M. The new training set $X_{\text{train}}'$ and test set $X_{\text{test}}'$ produced by the base learner layer are used as its input data, where $X_{\text{train}}'$ is (P₁, P₂, P₃) and $X_{\text{test}}'$ is (Z₁, Z₂, Z₃). The training process of meta-learner M is shown in equation (22):

$$M = L_{LR}(X_{\text{train}}') \tag{22}$$

Where L_LR(·) denotes the logistic regression algorithm.

4)

Output layer

The final classification result is obtained from the trained meta-learner M. Because the model’s prediction is a binary classification of whether one student ranks higher or lower than another in a pairwise comparison, the higher and lower rankings of each student against all other students are counted to obtain the final predicted mental health result for that student.
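The data flow of equations (17) to (22) can be sketched end to end in plain Python. This is an illustration of the stacking mechanics only, under loud assumptions: the base learners here are toy one-feature threshold rules (`make_stump`, hypothetical) standing in for SVM, RF, and XGBoost, the meta-learner is a small gradient-ascent logistic regression (`fit_logistic`, hypothetical), and the data are synthetic.

```python
import math
import random
from statistics import mean

def make_stump(feature):
    """Toy base learner (stand-in for SVM/RF/XGBoost): threshold one feature."""
    def fit(X, y):
        pos = mean(x[feature] for x, label in zip(X, y) if label == 1)
        neg = mean(x[feature] for x, label in zip(X, y) if label == 0)
        cut = (pos + neg) / 2
        # Predict 1.0 on the side of the cut where the positive class lies.
        return lambda x: 1.0 if (x[feature] > cut) == (pos > neg) else 0.0
    return fit

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Meta-learner L_LR: logistic regression fitted by batch gradient ascent."""
    w, b = [0.0] * len(X[0]), 0.0
    def prob(x):
        return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
    for _ in range(epochs):
        errors = [label - prob(x) for x, label in zip(X, y)]
        for d in range(len(w)):
            w[d] += lr * mean(e * x[d] for e, x in zip(errors, X))
        b += lr * mean(errors)
    return prob

def stack_features(learners, X_train, y_train, X_test, k=5, seed=0):
    """Return the meta-level training set P and test set Z."""
    idx = list(range(len(X_train)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    P = [[0.0] * len(learners) for _ in X_train]
    Z = [[0.0] * len(learners) for _ in X_test]
    for t, fit in enumerate(learners):
        for fold in folds:
            held_out = set(fold)
            rest = [i for i in idx if i not in held_out]
            # Eq. (17): train on the training set with fold k held out.
            model = fit([X_train[i] for i in rest], [y_train[i] for i in rest])
            for i in fold:
                P[i][t] = model(X_train[i])    # Eqs. (18)-(19): out-of-fold predictions
            for j, x in enumerate(X_test):
                Z[j][t] += model(x) / k        # Eqs. (20)-(21): average over k models
    return P, Z

# Synthetic data: three features, label 1 when their sum exceeds 1.5.
rng = random.Random(42)
X = [[rng.random() for _ in range(3)] for _ in range(120)]
y = [1 if sum(x) > 1.5 else 0 for x in X]
X_train, y_train, X_test = X[:100], y[:100], X[100:]

learners = [make_stump(0), make_stump(1), make_stump(2)]
P, Z = stack_features(learners, X_train, y_train, X_test)
meta = fit_logistic(P, y_train)                       # Eq. (22)
predictions = [1 if meta(z) >= 0.5 else 0 for z in Z]
```

Each row of P is predicted by a model that never saw that row during training, which is what keeps the meta-learner from simply learning the base learners' training-set fit.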

3

Analysis of the stratified assessment of the mental health of students in higher educational institutions

In this chapter, the stratified assessment analysis will be carried out on the mental health data of 5,762 college students collected through the online questionnaire. And before formally starting the assessment analysis, the performance of the stratified assessment model of college students’ mental health constructed in this paper will be evaluated.

3.1

Analysis of model performance evaluation

The hierarchical assessment model of the mental health of students in higher education is built in Jupyter Notebook. The steps to build the model are as follows.

1)

First bring the preprocessed training set into the model.

2)

Determine the main parameters of the model and determine the corresponding parameter ranges for the grid search.

3)

Traverse all combinations of parameter enumeration values using a combination of grid search and five-fold cross-validation. The final parameters are determined by the parameter values that correspond to when the model has the highest accuracy.

4)

Build a model based on the best parameters obtained from the training set and bring the test set into that model to output the corresponding values of the evaluation metrics such as accuracy, recall and F1 value.
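The combination of grid search and five-fold cross-validation in steps 2) and 3) can be sketched as follows. The model is a toy one-dimensional k-nearest-neighbour classifier (`fit_knn`, hypothetical) used only as a stand-in for the real learners, and the parameter grid and data are likewise assumptions made for the example.

```python
from itertools import product

def fit_knn(X, y, n_neighbors):
    """Toy 1-D k-nearest-neighbour classifier, used only as a stand-in model."""
    def predict(x):
        order = sorted(range(len(X)), key=lambda i: abs(X[i][0] - x[0]))
        votes = sum(y[i] for i in order[:n_neighbors])
        return int(2 * votes > n_neighbors)
    return predict

def cross_val_accuracy(fit, params, X, y, k=5):
    """Mean validation accuracy over k folds for one parameter combination."""
    scores = []
    for f in range(k):
        val_idx = list(range(f, len(X), k))
        held = set(val_idx)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx], **params)
        scores.append(sum(model(X[i]) == y[i] for i in val_idx) / len(val_idx))
    return sum(scores) / k

def grid_search(fit, param_grid, X, y, k=5):
    """Try every combination in the grid and keep the most accurate one."""
    names = list(param_grid)
    best_params, best_score = None, -1.0
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = cross_val_accuracy(fit, params, X, y, k)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical data: one feature, label 1 for the upper half of the range.
X = [[i / 20] for i in range(20)]
y = [int(i >= 10) for i in range(20)]
best_params, best_score = grid_search(fit_knn, {"n_neighbors": [1, 3, 15]}, X, y)
```

The selected parameters are then used to refit the model on the full training set before the test set is evaluated, as in step 4).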

3.1.1

Modeling and assessment results

The test set of student mental health data used to assess the model contained 165 cases. The confusion matrix of the model on the test set is shown in Table 2. The test set contained 58 cases with a positive mental health status and 107 cases with a non-positive status. Of the positive cases, 54 were assessed correctly and 4 incorrectly, a correct assessment rate of 93.10%; of the non-positive cases, 102 were assessed correctly and 5 incorrectly, a correct assessment rate of 95.33%. In total, the model in this paper assessed 156 test cases correctly and 9 incorrectly, an overall correct rate of 94.55%.

Table 2.

Confusion matrix

	Predicted result	Predicted result
Actual result	1 (positive)	0 (non-positive)
1 (positive)	54	4
0 (non-positive)	5	102
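The evaluation metrics can be derived directly from the four cells of Table 2. The sketch below treats class 1 (positive) as the positive class; the function name `classification_metrics` is an assumption, and the formulas are the standard definitions of accuracy, precision, recall, specificity, and F1.

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,                 # correct rate on actual positives
        "specificity": tn / (tn + fp),    # correct rate on actual non-positives
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts read from Table 2: rows are actual classes, columns are predictions.
metrics = classification_metrics(tp=54, fn=4, fp=5, tn=102)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```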

The ROC curve for the test set of the model in this paper is shown in Figure 1, corresponding to an AUC value of 0.9259.

3.1.2

Comparison of model results

The model in this paper uses the Stacking algorithm with SVM, RF, and XGBoost models as base learners. In this section, it is compared with the SVM, RF, and XGBoost models individually: the optimal parameters of each model are determined on the training set, and the test set is then fed into the tuned model for validation. The final results are shown in Table 3. On the same test set, the classification performance of this paper’s model is clearly better than that of the SVM, RF, and XGBoost models. The AUC values of the four models differ by less than 0.01, indicating that their overall ranking ability is similar. However, this paper’s model is much better than the other models in accuracy, precision, recall, and F1 value, which are 90.18%, 92.84%, 92.73%, and 94.16%, respectively. Taking all the evaluation indexes together, this paper’s model is more effective at assessing mental health status.

Table 3.

Performance data

Model	Accuracy (%)	Precision (%)	Recall (%)	F1 (%)	AUC
RF	82.47	64.96	70.37	73.79	0.9056
SVM	83.11	79.17	70.37	74.51	0.9067
XGBoost	85.71	80.77	77.78	79.25	0.9148
Model of this article	90.18	92.84	92.73	94.16	0.9154
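The F1 value is the harmonic mean of precision and recall. As a quick illustration, the sketch below applies that standard definition to the precision and recall reported for the SVM and XGBoost rows of Table 3; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (inputs and output in percent)."""
    return 2 * precision * recall / (precision + recall)


svm_f1 = round(f1_score(79.17, 70.37), 2)   # SVM row of Table 3
xgb_f1 = round(f1_score(80.77, 77.78), 2)   # XGBoost row of Table 3
print(svm_f1, xgb_f1)
```

Both values agree with the F1 column of the table for these two rows.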

3.1.3

Model Interpretation and Characteristic Importance

Building on the stratified assessment model of college students’ mental health constructed in this paper, SHAP values are used to explain the model, as shown in Figure 2. The SHAP method is an additive explanatory model based on the Shapley value from game theory and is a post hoc method for explaining model output. For each sample, it decomposes the model’s predicted value into the contributions of the individual variables, so that a black-box model can be interpreted at both the local and the global level. It can take into account the influence of individual variables as well as groups of variables, and it helps to address the problem of multicollinearity. The absolute value of a SHAP value represents the degree of influence of the feature, and its sign represents the direction of that influence.
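The additive property behind SHAP can be illustrated by computing exact Shapley values for a very small model. This is a sketch under stated assumptions: the three-feature toy model, its weights, the instance, and the all-zero baseline are invented for illustration, and the exhaustive enumeration below is not the `shap` library or the paper's actual procedure.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_features):
    """Exact Shapley values for a set function value(frozenset_of_feature_indices)."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy model with one interaction term between features 0 and 1.
weights = [2.0, 1.0, -3.0]
instance = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]


def toy_model(x):
    return sum(w * v for w, v in zip(weights, x)) + 4.0 * x[0] * x[1]


def coalition_value(present):
    """Model output when only the features in `present` take the instance's values."""
    x = [instance[j] if j in present else baseline[j] for j in range(3)]
    return toy_model(x)


phi = shapley_values(coalition_value, 3)
print([round(p, 6) for p in phi])
```

The three contributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the negative value for the third feature shows how the sign encodes the direction of influence.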

Figure 2 gives the order of importance of the variables. Emotional stress has the greatest influence on mental health status, followed by the degree of self-acceptance. The remaining variables, in descending order of importance, are feelings of life, neuroticism, quality of sleep, stress in choosing a career, self-evaluation, interpersonal stress, relationship stress, academic stress, stress in the school environment, search for a sense of meaning, attitude towards life, goals in life, desirability, extroversion, health stress, family stress, openness, and frustration stress.

3.2

Mental health assessment analysis

In this section, the stratified mental health assessment model constructed in this paper is applied to the mental health data of 5,762 college students collected through an online questionnaire.

3.2.1

Analysis of basic student attributes

In this paper, student characteristics that do not change in the short term are called static characteristics; those discussed in the analysis of basic student information include gender, age, country, type of domicile, and whether or not the student is an only child. Characteristics that do change in the short term are called dynamic characteristics, such as phases of anxiety symptoms, depressive symptoms, obsessive-compulsive symptoms, and other psychological abnormalities. In this section, the students’ basic attribute information in the mental health assessment data is analyzed statistically, and the distribution of the number of students for each attribute is shown in statistical charts. The statistical analysis of students’ basic information and the number of students with abnormal mental health status is shown in Figure 3, where panels (a) to (d) show the data by gender, age group, type of household registration, and only-child status.

As can be seen from Figure 3(a), the ratio of male to female students is 35:65. Females therefore make up a higher proportion of the assessed sample than males, and females may show more delicate emotions.

Figure 3(b) shows that most of the students are between 17 and 25 years old. Specifically, 4,711 students, or about 81.76% of the total, are between 17 and 22. It can be inferred that students at this stage may be in a rebellious period, in which the incidence of psychological abnormalities is relatively high.

Figure 3(c) shows that students with rural household registration make up the larger share: 4,512 students, or 78.31% of the total. By contrast, only 1,250 students, or 21.69% of the total, have urban household registration. These different backgrounds may affect students’ mental health differently.

Figure 3(d) shows that the number of only children is slightly higher than the number of non-only children, and the two groups are close in size.
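The shares quoted for the age and household groups follow directly from the counts and the sample size of 5,762. A minimal sketch of that arithmetic, with illustrative dictionary keys:

```python
total_students = 5762

# Counts taken from the text; the keys are illustrative labels.
counts = {
    "aged 17-22": 4711,
    "rural household": 4512,
    "urban household": 1250,
}

# Share of the whole sample, in percent, rounded to two decimals.
shares = {name: round(100 * n / total_students, 2) for name, n in counts.items()}
print(shares)
```

The rural and urban counts also add up to the full sample, which is a simple consistency check on the household figures.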

3.2.2

Mental health data analysis

In this section, students are assessed on different areas of psychological status, including somatization, obsessive-compulsive symptoms, interpersonal sensitivity, depression, anxiety, hostility, fear, paranoia, and psychoticism. Each psychological symptom is rated on a five-point scale from 1 (none) to 5 (severe), indicating the severity of the symptom experienced. The results include the total and mean score for each dimension, as well as the number of positive items, that is, the number of items scored between 2 and 5, which indicates that the symptom described by the item is present. The scores and counts for the different psychological states are shown in Figure 4. The figure shows that the psychological states detected most often were anxiety, interpersonal sensitivity, and depression, with 1,303, 946, and 850 detections, accounting for 22.62%, 16.42%, and 14.76% of the total number of detections, respectively.
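The per-dimension summary described above (total score, mean score, and number of positive items) can be sketched as follows. The function name and the ten example responses are hypothetical; only the 1 to 5 rating scale and the rule that an item scored 2 to 5 counts as positive come from the text.

```python
def score_dimension(item_scores):
    """Summarize one symptom dimension from its 1-5 item ratings."""
    if any(s not in (1, 2, 3, 4, 5) for s in item_scores):
        raise ValueError("ratings must be integers from 1 to 5")
    total = sum(item_scores)
    return {
        "total": total,
        "mean": round(total / len(item_scores), 2),
        # An item counts as positive when it is scored between 2 and 5.
        "positive_items": sum(1 for s in item_scores if s >= 2),
    }


# Hypothetical responses of one student to a ten-item dimension.
example_items = [1, 3, 2, 1, 4, 1, 2, 5, 1, 2]
result = score_dimension(example_items)
print(result)
```

Applying the same function to each of the nine dimensions would give the per-dimension totals, means, and positive-item counts for one respondent.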

The analysis next focuses on the three psychological states of anxiety, interpersonal sensitivity, and depression; the score statistics are shown in Figure 5. Anxiety scores range from 10 to 50, with higher scores indicating greater anxiety. An anxiety score above 30 indicates stronger anxiety, and a score below 20 indicates weaker anxiety. In this batch of data, 1,303 students show strong anxiety symptoms and 260 show severe anxiety symptoms. Depression scores range from 13 to 65. A depression score above 26 indicates strong depression, and a score above 39 indicates more severe depression. The score distribution shows that 698 and 171 students have strong and more severe depression, respectively, accounting for 73.78% and 18.08% of the students with a depressed psychological state, which indicates that depression remains a serious problem in the mental health of college students. Interpersonal sensitivity scores range from 9 to 45. Higher scores indicate greater sensitivity in relationships, and scores above 15 indicate that individuals are sensitive or highly sensitive in their interactions with others. The graph shows that 1,274 students exhibit interpersonal sensitivity, which is characterized by introversion and difficulty interacting with others. Over time, this tendency can develop into a withdrawn mindset, which in turn increases the risk of other psychological problems such as depression and anxiety.
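The cut-offs described above can be written as simple threshold rules. This is a sketch only: the function names are illustrative, and the labels "intermediate" and "below cutoff" are assumptions, because the text does not name the bands that fall outside the stated thresholds.

```python
def anxiety_level(score):
    """Anxiety total, 10-50: above 30 is 'stronger', below 20 is 'weaker'."""
    if not 10 <= score <= 50:
        raise ValueError("anxiety score must lie between 10 and 50")
    if score > 30:
        return "stronger"
    if score < 20:
        return "weaker"
    return "intermediate"  # assumed label: the 20-30 band is not named in the text


def depression_level(score):
    """Depression total, 13-65: above 26 is 'strong', above 39 is 'more severe'."""
    if not 13 <= score <= 65:
        raise ValueError("depression score must lie between 13 and 65")
    if score > 39:
        return "more severe"
    if score > 26:
        return "strong"
    return "below cutoff"  # assumed label for scores of 26 or less


def is_interpersonally_sensitive(score):
    """Interpersonal sensitivity total, 9-45: above 15 counts as sensitive."""
    if not 9 <= score <= 45:
        raise ValueError("interpersonal sensitivity score must lie between 9 and 45")
    return score > 15


print(anxiety_level(35), depression_level(40), is_interpersonally_sensitive(16))
```

Counting how many students fall into each band with rules of this kind yields the kind of tallies reported for Figure 5.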

4

Mental health intervention pathways for students in higher education

The assessment and analysis in the previous chapter showed that anxiety, interpersonal sensitivity, and depression are the most frequently detected positive psychological states among college students, and that students’ mental health problems urgently need attention and solutions. This chapter proposes pathways for mental health intervention to support the mental health of college students.

1)

Constructing a favorable ecological interaction path on campus.

At the meso level of the school, both the creation of the environment and the mobilization of students’ motivation during a mental health crisis should be designed precisely, so as to build a regional development model for a mentally healthy campus, a social practice service model, and a comprehensive management service model for homogeneous groups.

2)

Implementation of case management and individual precision intervention.

Individual-level precision intervention is aimed mainly at individual students and their families, and is an intervention strategy for controllable cognitive, behavioral, and social functioning problems caused by psychological crises. When individual students have more serious psychological problems, a case-by-case approach can be used to provide precise psychological crisis interventions for the students in need. For more serious problems, casework services should also be provided in conjunction with the student’s family of origin, and interventions should not be undertaken unilaterally or hastily. Intervention should be carried out under the guidance of properly qualified psychological counselors and social workers.

3)

Social practice service model.

Social practice is an important way for students to integrate into society and accumulate social experience after school. The social practice model is not only an important platform for integrating industry and education and fostering school-enterprise cooperation, but also an important way to help students complete their personal socialization.

5

Conclusion

This paper builds a stratified assessment model of college students’ mental health to provide an effective means for the stratified processing and analysis of student mental health data. In the model performance evaluation, the correct rate of the model for positive mental health status is 93.1%, the correct rate for non-positive status is 95.33%, and the overall correct rate reaches 93.94%. Compared with the SVM, RF, and XGBoost models, the AUC values of the four models, including this paper’s model, differ only slightly, but this paper’s model performs much better than the other models in accuracy, precision, recall, and F1 value, which are 90.18%, 92.84%, 92.73%, and 94.16%, respectively. An online questionnaire was used to collect student mental health data in colleges and universities, and the 5,762 student records collected were used as the research object for the mental health assessment. In the analysis of students’ basic attributes, the ratio of male to female students is 35:65; 4,711 students, about 81.76% of the total, are aged 17 to 22; 4,512 students, 78.31% of the total, have rural household registration; and the number of only children is slightly higher than the number of non-only children. Among all positive psychological states, anxiety, interpersonal sensitivity, and depression were detected in the largest numbers of students, namely 1,303, 946, and 850, accounting for 22.62%, 16.42%, and 14.76% of the total number of students, respectively. Among them, 260 students had severe symptoms of anxiety, 171 had more serious depression, and as many as 1,274 showed obvious sensitivity in interpersonal relationships. In view of the mental health problems revealed by the assessment, several intervention pathways are proposed to provide a reference for mental health work in colleges and universities.

Language: English

Publication frequency: 1 issue per year
Journal subjects: Biology, Biology, other, Mathematics, Applied Mathematics, Mathematics, General, Physics, Physics, other



Jie Gao

Published online: 19 March 2025

Submitted: 07 Nov. 2024

Accepted: 18 Feb. 2025

DOI: https://doi.org/10.2478/amns-2025-0476

Keywords: Stacking framework, Data mining, SVM model, RF model, XGBoost model, Student mental health

© 2025 Jie Gao, published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
