With the popularisation of higher education, the number of college students is increasing steadily, and so is the number of students with psychological health problems. Detecting students’ psychological abnormalities in time is one of the main challenges that colleges and universities currently face. Using digital technology to mine, collect and analyse the data generated by psychological health education in colleges makes it possible to track the dynamic development of students’ psychological health problems. In this paper, therefore, the psychological health problems of college students are identified and classified by establishing an improved logistic regression model. Behavioural characteristics are quantified from students’ relationships with their classmates, regularity of life and economic conditions, and are combined according to their differences. The test results show that the model performs well, can identify college students’ psychological health problems, and can help educators intervene in and treat students’ psychological problems.

#### Keywords

- logistic regression model
- psychological health problems
- data mining

With the rapid development of the knowledge economy and the popularisation of higher education, the number of college students is increasing steadily, and so is the number of students with psychological problems. At present, colleges investigate students’ psychological health problems mainly through paper or online questionnaires. For the convenience of later tracking, students are generally required to provide detailed personal information. However, out of concern for personal privacy, many students worry that they will be treated differently or labelled when filling out the questionnaire. As a result, their answers do not objectively reflect their true psychological status, and most of the feedback contains false information [1,2,3]. In psychological research, big data technology has had a profound impact on research logic, methods and tools. In traditional research on psychological health education, the difficulty of compiling data statistics has limited the in-depth development of related research to a certain extent [4]. The arrival of big data technology has broadened thinking, renewed research platforms and eased the difficulty of data collection. Therefore, it is urgent to mine, collect and utilise the data generated by psychological health education in colleges, and to combine big data technology with psychological health education in colleges and universities.

At present, researchers have analysed in depth the sources and causes of, and countermeasures for, college students’ psychological health problems, which is of vital significance for detecting abnormal psychological problems in time and for providing scientific theoretical support for psychological health education [5,6,7]. At the same time, colleges and universities are trying to appoint full-time psychology teachers to teach psychological health courses and provide psychological consultation. Although they attach great importance to psychological health education, the lack of professionals means that digital technology cannot be used effectively, and thus psychological health education cannot be implemented efficiently [8, 9]. In addition, students’ psychological health develops dynamically. However, most colleges suffer from problems such as insufficient attention to evaluation, a weak teaching staff and an imperfect evaluation system [10], so it is difficult to capture students’ dynamic psychological development and changes during their studies. With the help of digital technology, and provided that the indicators are scientifically sound, stable and sensitive, establishing a dynamic identification and evaluation system for students’ psychological problems can effectively promote the development of college students’ psychological health.

In the psychological health identification model, deeper information and features are extracted by analysing and summarising a large amount of student data. The aim is to identify, from students’ behavioural data and other data collected in college, whether students show a tendency towards psychological abnormality, thus providing guidance for college teachers and psychological education staff.

Data mining is a complicated process: through a series of calculations on a large amount of data, it finds the potential and hidden internal relations in the data and extracts valuable information. The specific process is shown in Figure 1:

Selection and preparation of data set. According to the actual demand, select the initial data set and collect as much relevant data as possible. The more relevant the data, the higher the accuracy, but the amount of calculation will also increase [11].

Data preprocessing. The collected data may contain isolated points and outliers, so it is necessary to preprocess the data to make it usable. Methods of data preprocessing include data cleaning, transformation, integration and reduction [12].

Data mining. According to the data characteristics after preprocessing and the actual needs of users, select appropriate methods and models for data mining, find the intrinsic relationships in the data and extract hidden valuable information.

Analysis and application of results. Analyse the obtained results in connection with the actual situation; if they are not applicable to the actual situation, the above steps need to be repeated.

Logistic regression is a kind of data mining technology. For a given data set, it obtains a nonlinear model by transforming a linear model, so that the output predicts the actual value as closely as possible [13, 14]. In practice, it is a classification model. Compared with other algorithms, logistic regression has a simple form, strong interpretability and fast training speed; its main idea is to fit the decision boundary as closely as possible and output the predicted value. It builds on linear regression: for the multivariate input $x = (x_{1}, x_{2}, \dots, x_{n})^{T}$, the linear expression of the predicted output is

$$z = w^{T}x + b \tag{1}$$

where $w = (w_{1}, w_{2}, \dots, w_{n})^{T}$ is the weight vector and $b$ is the bias.

The sigmoid function is usually selected as the mapping function, and its expression is:

$$y = \frac{1}{1 + e^{-z}}$$

Introduce Formula (1) so that:

$$y = \frac{1}{1 + e^{-(w^{T}x + b)}}$$

The value of $y$ lies between 0 and 1.

It can be seen that the sigmoid function takes the domain $(-\infty, +\infty)$ and maps it to the range $(0, 1)$.

Regarding $y$ as the probability that a sample is a positive example, the log-odds of the sample are linear in $x$:

$$\ln\frac{y}{1 - y} = w^{T}x + b$$
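The mapping from the linear expression to a probability can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the names `sigmoid`, `predict_proba` and `log_odds`, and the weight and bias values in the usage example, are assumptions made for this sketch.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function; the two branches avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, w, b):
    """Probability of the positive class for one feature vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # linear expression
    return sigmoid(z)


def log_odds(p: float) -> float:
    """Inverse of the sigmoid: recovers the linear expression from p."""
    return math.log(p / (1.0 - p))


if __name__ == "__main__":
    w, b = [0.8, -0.5], 0.1  # hypothetical parameters, for illustration only
    p = predict_proba([1.0, 2.0], w, b)
    print(p, log_odds(p))
```

Applying `log_odds` to the predicted probability returns the value of the linear expression, which is why the model stays easy to interpret despite the nonlinear output.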

Due to the serious imbalance between normal and abnormal samples, classifying the data without further processing will distort the results of the model [15]. Therefore, before binary classification, it is necessary to balance the data and to combine the features according to their differences.
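The text does not state which balancing method is used. One common option, shown here purely as an assumption, is random undersampling of the majority (negative) class down to a chosen negative-to-positive ratio. The function name `undersample` and its parameters are hypothetical.

```python
import random


def undersample(samples, labels, neg_per_pos=2.0, seed=0):
    """Randomly drop negatives (label 0) so that at most `neg_per_pos`
    negatives remain for every positive (label 1). All positives are kept."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep_neg = min(len(neg), int(round(neg_per_pos * len(pos))))
    chosen = pos + rng.sample(neg, keep_neg)
    rng.shuffle(chosen)  # avoid an ordered positives-then-negatives set
    return [samples[i] for i in chosen], [labels[i] for i in chosen]


if __name__ == "__main__":
    labels = [1] * 10 + [0] * 100      # toy data with a 1:10 imbalance
    samples = list(range(len(labels)))
    xs, ys = undersample(samples, labels, neg_per_pos=2.0)
    print(ys.count(1), ys.count(0))
```

Undersampling discards data, so it trades some information about normal students for a training set in which the abnormal class is not swamped.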

The samples of normal students are marked as negative examples, coded 0, and the samples of abnormal students are marked as positive examples, coded 1. Before training the model, the data should first be standardised to avoid the influence of dimensional units between feature samples. Data standardisation refers to scaling all feature data to a fixed range so that different units and magnitudes do not affect model training. Because the features selected in this experiment span a large numerical range, the data are normalised first, that is, mapped to [0, 1] by a linear transformation.

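For the sample values $x_{1}, x_{2}, \dots, x_{n}$ of a feature, each value $x_{i}$ is transformed into $x_{i}'$ by:

$$x_{i}' = \frac{x_{i} - \min_{j} x_{j}}{\max_{j} x_{j} - \min_{j} x_{j}}$$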

According to this formula, the range of sample data can be compressed to [0,1], so as to eliminate the differences among different feature samples and avoid the influence of numerical dimension among feature samples.
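The linear mapping to [0, 1] described above corresponds to min-max normalisation applied to each feature separately. A minimal sketch follows; the function names are illustrative, and mapping a constant feature to 0 is a choice made here, not something the text specifies.

```python
def min_max_normalise(column):
    """Scale one feature column linearly to [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant feature: no spread to rescale, so map every value to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def normalise_features(rows):
    """Normalise every feature (column) of a list of sample rows."""
    columns = list(zip(*rows))
    scaled = [min_max_normalise(col) for col in columns]
    return [list(row) for row in zip(*scaled)]


if __name__ == "__main__":
    rows = [[1, 100], [3, 300], [2, 200]]  # toy data, two features
    print(normalise_features(rows))
```

Because each column is scaled on its own, a feature measured in large units no longer dominates a feature measured in small units during training.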

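The logistic regression model has an important influence on the results of sample classification, so the parameters of the logistic regression model are optimised. According to the transformation of the sigmoid function, the assumed function is:

$$h_{w,b}(x) = \frac{1}{1 + e^{-(w^{T}x + b)}}$$

Among them, $w$ and $b$ are the weight vector and bias of the linear expression, and $(x_{1}, y_{1}), \dots, (x_{i}, y_{i})$ are the training samples, where $x_{i}$ is the feature vector of the $i$-th sample and $y_{i}$ is its label.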

To prevent the model from over-fitting, a regularisation term is added to the loss function and is represented as:

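Introducing the features into the model gives the feature vector $x = (x_{1}, x_{2}, x_{3}, x_{4})^{T}$. That is, according to the optimised parameters $w$ and $b$, the prediction model for identifying students with abnormal psychology is constructed as follows:

$$\hat{y} = \frac{1}{1 + e^{-(w^{T}x + b)}}$$

In this paper, the model updated with the above parameters is used to predict students’ psychological problems, in which $x_{1}$, $x_{2}$, $x_{3}$ and $x_{4}$ are the quantified behavioural features.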
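The parameter optimisation can be illustrated with plain gradient descent on a regularised loss. The sketch below makes several assumptions that the text does not confirm: the loss is the mean cross-entropy, the regularisation term is an L2 penalty `lam / (2m) * ||w||^2`, and the optimiser is batch gradient descent. The names `loss` and `train` and all hyperparameter values are hypothetical.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def loss(X, y, w, b, lam):
    """Mean cross-entropy plus an assumed L2 penalty lam / (2m) * ||w||^2."""
    m, eps = len(X), 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / m + lam / (2 * m) * sum(wj * wj for wj in w)


def train(X, y, lam=0.1, lr=0.5, epochs=500):
    """Batch gradient descent; the bias b is not regularised."""
    m, n = len(X), len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(n):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Gradient of the penalty term is (lam / m) * w.
        w = [wj - lr * (gj + lam * wj) / m for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b


if __name__ == "__main__":
    X = [[0.0], [0.1], [0.9], [1.0]]  # toy, already normalised feature
    y = [0, 0, 1, 1]
    w, b = train(X, y)
    print(w, b, loss(X, y, w, b, 0.1))
```

The penalty shrinks the weights towards zero, which limits how sharply the model can fit a small number of abnormal samples.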

In the mining of college education data, experts and scholars often pay attention to the characteristics related to students’ achievements and ignore students’ psychological activities [16]. Therefore, based on the five-factor model [17,18], this paper extracts and quantifies the behavioural characteristics related to students’ psychological health problems from a large amount of original data. Drawing on psychology, pedagogy and other related knowledge, indicators are established to extract students’ characteristics, and a relatively complete set of behavioural characteristics is constructed to measure students’ psychological health problems. Based on the responsibility and extraversion traits of the five-factor model, students’ behaviour is quantified from three perspectives: student-classmate relationship, regularity and economic situation.

Most college students study in a closed or semi-closed environment, so their psychological health problems can be reflected in their relationships with classmates. The number of times two students co-occur is directly proportional to the probability that they become friends; therefore, association rules can be used to determine classmate relationships.

In order to obtain students’ friends at college, it is necessary to calculate the co-occurrence data set between pairs of students, which is represented as:

In order to eliminate contingency, if the co-occurrence times of student A and student B are greater than T, the two students are considered as friends. Considering that the threshold value of each location is related to the total number of times, it is defined as:

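Among them, the index $i$ denotes the student and the superscript $l$ denotes the location, so the threshold is set separately for each location.

Association rules of the form $A \Rightarrow B$ are used to explore the relationship, that is, whether the presence of student $A$ implies the presence of student $B$.

Confidence is defined as:

$$\mathrm{conf}(A \Rightarrow B) = \frac{\mathrm{count}(A, B)}{\mathrm{count}(A)}$$

where $\mathrm{count}(A, B)$ is the number of co-occurrences of students $A$ and $B$, and $\mathrm{count}(A)$ is the total number of occurrences of student $A$.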
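The co-occurrence counting and the friendship threshold can be sketched as follows. The record format `(student_id, location, minute)`, the 5-minute window and the single global threshold are all assumptions made for illustration; the text relates the threshold to the total count at each location, which this sketch does not attempt to reproduce. Confidence is computed in the standard association-rule sense.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_occurrences(records, window=5):
    """Count how often two students appear at the same location within
    `window` minutes. records: iterable of (student_id, location, minute)."""
    by_location = defaultdict(list)
    for student, location, minute in records:
        by_location[location].append((minute, student))
    pair_counts = Counter()
    for visits in by_location.values():
        visits.sort(key=lambda v: v[0])  # order by time only
        for (t1, s1), (t2, s2) in combinations(visits, 2):
            if s1 != s2 and t2 - t1 <= window:
                pair_counts[frozenset((s1, s2))] += 1
    return pair_counts


def friends(pair_counts, threshold):
    """Pairs whose co-occurrence count is greater than the threshold T."""
    return {pair for pair, n in pair_counts.items() if n > threshold}


def confidence(pair_counts, visit_counts, a, b):
    """conf(a => b): co-occurrences of a and b over all visits of a."""
    return pair_counts[frozenset((a, b))] / visit_counts[a]


if __name__ == "__main__":
    records = [
        ("A", "canteen", 0), ("B", "canteen", 2), ("C", "canteen", 50),
        ("A", "canteen", 100), ("B", "canteen", 103),
        ("A", "bath", 10), ("B", "bath", 12),
    ]
    counts = co_occurrences(records)
    visits = Counter(student for student, _, _ in records)
    print(friends(counts, threshold=2), confidence(counts, visits, "A", "B"))
```

The pairwise comparison inside each location is quadratic in the number of visits, which is acceptable for a sketch but would need a sliding window over the sorted visits for a full year of records.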

The regularity of students’ behaviour reflects their self-discipline and orderliness in college. Students with strong regularity have better self-control, and the regularity of students’ behaviour is closely related to their academic performance [19,20,21]. Therefore, it can be considered that students with strong behavioural regularity are able to plan their own lives. In this paper, the regularity of daily behaviours such as eating and bathing is quantified, and whether this regularity differs significantly between students with abnormal psychology and normal students is explored. Shannon entropy is used as the measure, and its definition is as follows:

$$H(X) = -\sum_{i=1}^{n} p(x_{i}) \log p(x_{i})$$

Here $X$ is the behaviour under consideration and $p(x_{i})$ is the probability that it occurs in the $i$-th of $n$ time intervals:

$$p(x_{i}) = \frac{n_{i}}{\sum_{j=1}^{n} n_{j}}$$

where $n_{i}$ is the number of times the behaviour occurs in the $i$-th time interval. The lower the entropy, the more regular the behaviour.
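As an illustration of how Shannon entropy measures regularity, the sketch below computes the entropy of the time bins in which a behaviour occurs. Binning by hour of day and the base-2 logarithm are assumptions of this example; the text does not specify the bin width or the logarithm base.

```python
import math
from collections import Counter


def shannon_entropy(events, base=2):
    """Entropy of the empirical distribution of `events` (e.g. hour bins).
    Lower values mean the behaviour is concentrated in fewer bins."""
    counts = Counter(events)
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log(c / total, base) for c in counts.values()
    )


if __name__ == "__main__":
    regular = [12] * 10              # lunch always in the 12:00 bin
    irregular = [7, 12, 18, 22] * 3  # meals spread evenly over four bins
    print(shannon_entropy(regular), shannon_entropy(irregular))
```

A student who always eats in the same hour has an entropy of zero, while a student whose meals are spread evenly over four different hours has an entropy of about 2 bits.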

Studies [23] have shown that the psychological status of college students is influenced by the family economic situation: many students from poor families suffer from long-term feelings of inferiority and depression, and tend to show behaviours such as unwillingness to communicate with others. However, owing to limited conditions, it is impossible to know the economic status of students’ families directly. In order to explore whether there are significant differences in college performance between psychologically normal and abnormal students, this paper measures students’ financial situation from their financial aid and their consumption at college.

Given the time and amount of each purchase, the annual consumption data of 2,000 students (436,044 items in total) were first processed and analysed. The statistics of the average consumption of the two kinds of samples over one semester and over 1 year showed no obvious difference in data distribution. A Wilcoxon hypothesis test is then applied. The null hypothesis is that there is no significant difference in consumption level between normal students and psychologically abnormal students. The alternative hypothesis is that there is a significant difference in consumption level between normal students and students with psychological disorders. It is found that, at the 0.05 significance level, the consumption level of the two groups is

According to the above characteristics, the quantitatively extracted sample data is introduced into the improved logistic regression model obtained before, which can effectively reflect the psychological health problems of college students. The evaluation index of this model is defined as:

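Accuracy, that is, the proportion of correctly predicted samples to all samples. In the confusion matrix,

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision, which refers to the proportion of correctly predicted positive samples ($TP$) among all samples predicted as positive:

$$Precision = \frac{TP}{TP + FP}$$

Recall represents the proportion of correctly predicted positive samples ($TP$) among all samples that are actually positive:

$$Recall_{rate} = \frac{TP}{TP + FN}$$

In this model, for a positive sample in the original data labels, if it is predicted as a positive example, the prediction result of this sample is a True Positive example, or $TP$ for short. On the contrary, if it is predicted as a negative example, the result is a False Negative example ($FN$). Similarly, a False Positive example ($FP$) is a negative sample predicted as a positive example, and a True Negative example ($TN$) is a negative sample predicted as a negative example.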
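The three evaluation indices follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions; the function names and the convention of returning 0.0 when a denominator is zero are choices made for this example.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def evaluate(y_true, y_pred):
    """Accuracy, precision and recall; 0.0 when a denominator is zero."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # toy labels
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # toy predictions
    print(evaluate(y_true, y_pred))
```

On imbalanced data accuracy alone can be misleading, since predicting every student as normal already scores highly, which is why precision and recall are reported alongside it.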

The original sample ratio of this model is about 10:1. As shown in Figures 2–4, if the original data are used directly to train the model, the logistic regression performs poorly on the sample data; as the proportions of positive and negative samples approach each other, the accuracy, precision and recall improve. When the positive-negative ratio of the sample is 1:2, the model performs best, with the accuracy, precision and recall reaching 76.2%, 83.4% and 86.6%, respectively. When the positive-negative ratio of the sample is less than 1:2, the performance of the model is somewhat reduced.

In addition, it is necessary to identify students with psychological disorders in the shortest possible time. For this reason, data sets covering one semester, 1 year and 3 years are selected for comparison: the second semester of the sophomore year (one semester), the sophomore year (1 year) and the college stage (freshman to junior year). There is no data for senior students, because senior students in our college have few examined classes and most of them are on internships outside the college during their senior year. The model trained on the features described above is used to classify these data.

Recall indicates the proportion of samples correctly predicted as positive among all samples that are actually positive. The purpose of the experiment in this paper is to identify students with abnormal psychology among all students, that is, to identify as many positive examples as possible, so recall is an important evaluation index for this model. The higher the recall, the more students with abnormal psychology the model identifies. The results show that the improved logistic regression model performs well when all 3 years of data are used. In addition, the recall obtained with 1 year’s data is slightly lower than that obtained with all the data. Combined with the classification accuracy, it can be considered that at least 1 year of student behaviour data is needed to identify students with psychological disorders, and that the more data used, the more robust the logistic regression model.

Research on the dynamic evaluation of college students’ psychological health based on digital technology can provide new research methods and theoretical practice for follow-up psychological health education in colleges and universities. Therefore, based on the logistic regression model in data mining, and drawing on the responsibility and extraversion traits of personality theory, student-classmate relationship, regularity of life and economic condition are quantified. Then, through the differential combination of data on students’ behavioural characteristics, an improved logistic regression model is constructed to classify and predict students’ psychological health problems. Finally, accuracy, precision and recall are used as the evaluation indices of the model. The results show that when the positive-negative ratio of the sample is 1:2, the model performs best, with the accuracy, precision and recall reaching 76.2%, 83.4% and 86.6%, respectively. Meanwhile, at least 1 year of data on students’ behaviour is needed to identify students’ psychological abnormalities reliably, and the more data adopted, the more robust the logistic regression model.

