Optimization Research on Interactive Methods of Ideological and Political Education in Colleges and Universities under Intelligent Teaching Environment
Published online: 19 Mar 2025
Received: 02 Nov 2024
Accepted: 07 Feb 2025
DOI: https://doi.org/10.2478/amns-2025-0454
© 2025 Rong Ji, published by Sciendo
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Artificial Intelligence (AI) is a science and technology that aims to enable computer systems to simulate and perform tasks requiring human intelligence, drawing on fields such as intelligent technology and deep learning. The goal of AI is to give computers the ability to “think” and “learn”, that is, to understand, learn, and solve complex problems [1–2]. By applying big data technologies and powerful algorithms, AI can analyze and interpret information to provide accurate predictions and decisions [3–4]. Industries including power, healthcare, finance, transportation, and manufacturing have widely adopted AI technologies.
As an important component of higher education, the ideological and political education of college students can draw on artificial intelligence technology to drive reform in education and teaching. With the development of society, ideological and political education in colleges and universities has encountered problems such as “grasping the big while letting go of the small” and “reaching a wide range of areas but lacking depth” [5–6]. In this regard, artificial intelligence technology can be used to carry out educational and teaching reforms that promote precision education in ideological and political teaching. Specifically, it can analyze the learning data and behavioral patterns of college students, provide targeted feedback and improvement suggestions for ideological and political education, and recommend personalized learning resources and teaching programs, thereby enhancing students’ learning outcomes and interest [7–10]. It is also possible to create an interactive virtual learning environment with the help of artificial intelligence technology to stimulate students’ interest in appreciating, experiencing, and exploring the content of ideological and political education, while improving teachers’ perception of their teaching and effectively transforming their teaching mode, so as to better realize precision education [11–14]. Promoting ideological and political education through artificial intelligence technology not only allows students’ learning progress to be better monitored, but also helps teachers improve their teaching mode and efficiency, which better serves society and advances the goals of sustainable social development and educational reform [15–17].
In this paper, 3ds Max modeling is used to construct 3D teaching models of the multimedia blackboard, display table, characters, and so on, and rendering optimization is carried out on the models and the scene to construct a realistic, content-rich teaching scene. On top of the VR ideological and political teaching scene, a processing framework based on a message mechanism is designed to reduce coupling between scene modules and enhance system scalability. The principles and methods of the noise reduction algorithm are then introduced, and the processing of the acquired speech signals is described, with steps including data preprocessing, feature parameter extraction, and identification of the speaker’s gender, identity, and emotion. Subsequently, a deep learning model combining CNN, LSTM, and an attention mechanism is proposed to establish the mapping relationship between temporal features and speech emotion by extracting the temporal features of classroom speech, in order to better support intelligent teaching scenarios. The paper concludes with comparison experiments to verify the effectiveness of the proposed method, and a specific ideological and political teaching video is used to analyze teacher and student learning states with the interaction behavior method based on classroom speech recognition proposed in this paper.
This system uses teaching models that include perspective characters, multimedia blackboards, showcases, and the venue room structure, all of which can be drawn as 3D models in 3ds Max. The models are then exported in .FBX format and imported into Unity3D for function integration, during which logic scripts and attribute settings are written in Unity3D; finally, the fully functional model is exported as an .apk file [18]. The model development process is shown in Figure 1.
Scenario and learning tool modeling
The 3D learning scene is the core of this system’s modeling. It mainly includes the venue room, the multimedia blackboard, the showcase, and so on. First, 3ds Max is used to model every object in three dimensions; existing primitives such as cuboids and spheres can be edited during this process, after which the model surfaces are rendered or mapped. The multimedia blackboard in particular needs its surface reflection reduced: if the reflection is too strong, the screen display appears washed out. A subsequent video render texture further enhances its visual effect. For reasons of space, the detailed modeling process is not described here. During model building and optimization, the models’ axes need to be corrected (i.e., all rotations are made around the Y axis), which makes it easier to combine all models consistently; each large model also involves parent-child relationships that must be fixed during construction. Model construction and optimization are shown in Figure 2.
Viewpoint character model
Characters are created to give users a more realistic experience: they can move freely from any perspective in the 3D learning scene, watch algorithm derivations up close, and act as a third-person perspective in the system [19]. This paper therefore recommends two ways of acquiring character models: (1) Many free character models are available in the Unity Asset Store and can be acquired from Unity’s repository. (2) For customized character models, the Ready Player Me automatic modeling platform can be used; it is the most widely used cross-platform tool for creating avatars, and supports automatic modeling from photos, manual modeling, and model export.
Because the character model does not involve the core of this paper and needs to be personalized, this paper adopts the form of photo automatic modeling of Ready Player Me, which is also more convenient and efficient, and the online automatic modeling process is shown in Figure 3.

Model development process

Model build and optimization

Automatic online modeling process
Model Rendering
In the field of graphic design, rendering is a key step for enhancing visual effects; it uses image editing tools to represent the model graphically and is the final stage of model construction, matching the model to its environment. The rendering process involves modifying the materials and colors of the model, as well as the reflective effects of lighting on it. The stronger the desired sense of realism between model and environment, the higher the requirements on rendering quality, and a realistic environment helps stimulate the user’s interest in learning.
Lighting arrangement
Point light source: a light source for specific scenes, similar to common bulbs, flames, or diffuse light; it can illuminate a certain range of the scene, with the range controlled by adjusting its size.
Parallel light: the light most commonly used in development, similar to sunlight in the real world. The direction of the light source is fixed and unaffected by the movement of objects, so all object shadows point in one direction.
Spotlight: similar to a flashlight or stage light in daily life; a beam of light is emitted from a single point in a given direction. Its direction, distance, range, and so on are adjusted through parameters.
Area light: less frequently used and available only in baked lighting mode; it can simulate soft, smooth shadow effects.
Scene Integration
Unity3D has a dedicated Prefabs folder for storing models. All the required models are copied into this folder, the 3D house model is placed in the virtual coordinate space and aligned parallel to it, and then the other models, such as the multimedia blackboard, displays, and characters, are placed inside the house. The scale attribute of each model is set so that it is scaled to a size similar to its real-world counterpart, preserving a sense of consistency, and the relative spatial coordinates between models are adjusted to establish the prototype of the whole scene. The model is fixed in the center of the visualizer, the character is fixed in front of the visualizer, and the multimedia blackboard is positioned facing the character. When an algorithm is derived, the visualizer demonstrates the 3D code derivation process, the multimedia blackboard provides auxiliary explanations, and intelligent voice interaction provides supporting questions and answers.
Analysis of module problems in VR
A virtual reality system has a multi-scene, multi-module structure, with switching back and forth between different scenes and function modules calling one another. All of these function calls are controlled by C# scripts, and calls between functions are often made by mounting and instantiating script classes. This approach has a major defect: the coupling between function modules is too high, function expansion is limited, and the script code becomes highly redundant. As the number of functional modules grows, the calls become disorganized.
The principle of message mechanism
A message mechanism is a message-passing process that serves modular or loosely structured software systems. It can be applied to small or distributed systems and links calls to the various components or modules, so that different modules can cooperate and interact through messages, which greatly reduces module coupling. Message mechanisms offer technical advantages such as asynchronous and instant message delivery, freedom in format definition, few restrictions on receivers, and convenient management and distribution.
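To illustrate this principle, the following Python sketch shows a minimal publish/subscribe message center. It is a hypothetical stand-in for the C#-based framework used in the system, and the message name "ui/open_panel" is invented for the example: senders publish named messages and receivers subscribe to them, so neither side references the other directly.

```python
from collections import defaultdict

class MessageCenter:
    """Minimal publish/subscribe message center: modules register
    handlers for named messages instead of calling each other directly."""
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, msg_type, handler):
        self._handlers[msg_type].append(handler)

    def publish(self, msg_type, payload=None):
        # Deliver the message to every registered handler in order.
        for handler in self._handlers[msg_type]:
            handler(payload)

# Two decoupled modules interacting only through messages.
log = []
center = MessageCenter()
center.subscribe("ui/open_panel", lambda p: log.append(f"UI opened {p}"))
center.subscribe("ui/open_panel", lambda p: log.append(f"audio cue for {p}"))
center.publish("ui/open_panel", "quiz")
```

Adding a new module then only requires a new `subscribe` call; no existing script has to be modified, which is the coupling reduction described above.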
Message mechanism processing framework design
To cope with complex teaching scenarios and provide rich functional modules of course content, a message mechanism processing framework is designed to manage the functional modules of VR teaching scenarios, which also effectively reduces module coupling and improves functional scalability. The processing framework based on the message mechanism is shown in Figure 4. It consists of a message management center and multiple function management modules, each of which manages one or more C# scripts for specific functions. Nine major modules are covered.
During the execution of specific functional modules, taking UI interface management as an example, the processing framework flow structure of the message mechanism is shown in Figure 5.

Processing framework based on message mechanism

Flow structure of the message mechanism processing framework
In this paper, the raw audio data is processed with an algorithm based on a Perceptual Masking Deep Neural Network (PM-DNN). The main principle of the PM-DNN noise reduction algorithm is to model the time-frequency masking relationship between speech and noise, enabling high-performance noise reduction for speech in complex environments. The noise reduction process is divided into two parts: training and enhancement [20]. In the training phase, noise typical of offline classroom environments is added to pure speech, and after neural network training the mapping relationship between classroom noise and the pure training speech is obtained. In the testing phase, the noisy classroom audio test set is fed into the trained neural network model to obtain pure, denoised classroom speech. The PM-DNN noise reduction algorithm model is shown in Figure 6.

PM-DNN noise reduction algorithm model
The specific noise reduction process is as follows:
(1) Use the PM-DNN network to train on pure speech features and output the separated speech amplitude spectrum. (2) Calculate the perceptual masking threshold of the separated speech. (3) Perform the correlation calculation between the masking threshold and the noisy spectrum to estimate the time-frequency mask. (4) Combine the quantities computed above with the phase of the noisy signal.
After these calculations, the final output is the enhanced speech signal.
After training and validation, the PM-DNN algorithm has good noise reduction performance in both high signal-to-noise ratio and low signal-to-noise ratio noise environments.
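The enhancement step can be illustrated in isolation. The following sketch is a toy illustration, not the trained PM-DNN itself: it assumes a time-frequency mask has already been estimated (here given by hand) and applies it to the magnitude of a noisy STFT while reusing the noisy phase, which is the standard way masking-based enhancement reconstructs the signal.

```python
import numpy as np

def apply_tf_mask(noisy_stft, mask):
    """Apply an estimated time-frequency mask (values in [0, 1], e.g. a
    network's output) to the magnitude of a noisy STFT, keeping phase."""
    mag = np.abs(noisy_stft)
    phase = np.angle(noisy_stft)
    return mask * mag * np.exp(1j * phase)

# Toy example: one frame, two frequency bins; the mask keeps bin 0
# (speech-dominated) and suppresses bin 1 (noise-dominated).
noisy = np.array([[1.0 + 0.0j, 2.0 + 0.0j]])
mask = np.array([[1.0, 0.1]])
enhanced = apply_tf_mask(noisy, mask)   # bin 1 attenuated to 0.2
```

An inverse STFT of the masked spectrogram would then yield the denoised waveform.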
In the process of speech recognition, in order to further improve recognition efficiency and facilitate the subsequent extraction of the characteristic parameters of the speech signal, a preprocessing step is added before recognition to remove redundancy. The preprocessing steps mainly include pre-emphasis, framing and windowing, and endpoint detection.
Pre-emphasis
Pre-emphasis is used to enhance the high-frequency part of the signal and eliminate the lip-radiation effect. The principle of pre-emphasis is to pass the audio signal through a high-pass filter, output the high-frequency components, and obtain the enhanced audio signal. The transfer function of the high-pass filter is shown in equation (3):
$H(z) = 1 - \alpha z^{-1}$ (3)
where $\alpha$ is the pre-emphasis coefficient, usually taken between 0.9 and 1.
Framing and windowing
The amplitude of an audio signal varies frequently over the time domain, so it is a non-stationary signal that is difficult to analyze quantitatively with mathematical methods [21]. However, the audio signal is relatively stable over short intervals (generally 10 ms to 30 ms by default), a property known as the short-term stationarity of speech. Based on this, the audio signal can be sliced into short frames that satisfy this condition before quantitative analysis. When slicing, consecutive frames may or may not overlap; to avoid incoherence between frames, this paper uses the overlapping framing method, in which each frame partially overlaps the previous one so that the sub-signals transition smoothly. The interval between the heads of neighboring frames is called the frame shift. The overlapping framing of the speech signal is shown in Figure 7. After framing, a windowing operation is applied to each frame to minimize the spectral leakage caused by framing: the window enhances the audio signal near the sample point and attenuates other, irrelevant waveforms. The choice of window function is important, as processing the speech signal with different window functions yields different experimental results. The two most commonly used windows are the Hamming window and the rectangular window; this paper chooses the Hamming window, whose formula is shown in equation (6):
$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$ (6)
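The pre-emphasis, framing, and Hamming-windowing steps can be sketched as follows. This is a minimal NumPy illustration; the coefficient 0.97 and the frame length of 400 samples with a 160-sample shift (25 ms and 10 ms at 16 kHz) are assumed typical values, not taken from the paper.

```python
import numpy as np

def preprocess(signal, alpha=0.97, frame_len=400, frame_shift=160):
    """Pre-emphasis y[n] = x[n] - alpha * x[n-1], then overlapping
    framing and Hamming windowing (lengths in samples)."""
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * frame_shift: i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames

x = np.random.randn(16000)    # 1 s of audio at 16 kHz
frames = preprocess(x)        # 98 overlapping, windowed frames of 400 samples
```

Each row of `frames` is one short-time frame ready for feature extraction.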
where $N$ is the frame length in samples.
Endpoint detection
The speech signal contains many frames, and the short-time energy of each frame varies: there is a clear difference between the short-time energy of noise segments and pure audio segments, and the short-time energy of voiced and unvoiced sounds also differs. By setting a short-time energy threshold, the starting and ending positions of the audio signal can be picked out. The short-time energy of the $n$-th frame can be defined as $E_n = \sum_m x_n^2(m)$, where $x_n(m)$ is the $m$-th sample of the $n$-th frame.

Overlapping framing of the speech signal
The short-time zero-crossing rate is the number of times a frame of the audio signal alternates between positive and negative values across the horizontal axis. It can be expressed as $Z_n = \frac{1}{2}\sum_m \left|\operatorname{sgn}[x_n(m)] - \operatorname{sgn}[x_n(m-1)]\right|$, where $\operatorname{sgn}[\cdot]$ is the sign function.
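The two endpoint-detection quantities can be computed as follows; the frames here are toy values chosen only to show the contrast between a silent frame and a voiced-like frame.

```python
import numpy as np

def short_time_energy(frame):
    """E_n = sum of squared samples in the frame."""
    return np.sum(frame.astype(float) ** 2)

def zero_crossing_rate(frame):
    """Half the sum of |sgn differences|: counts sign alternations."""
    signs = np.sign(frame)
    return 0.5 * np.sum(np.abs(np.diff(signs)))

silence = np.zeros(8)
voiced = np.array([0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5])
# silence: energy 0; voiced toy frame: energy 2.0, 7 zero crossings
```

Thresholding these two values per frame is what separates speech segments from silence and noise.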
After the preprocessing step, feature extraction begins. The extraction and screening of feature parameters is an important part of speaker identification. The extraction methods for the fundamental frequency, the resonance peak (formant), and the Mel-frequency cepstral coefficients are described in detail below.
Extracting the fundamental frequency
In this paper, the cepstrum method is used to extract the fundamental (pitch) frequency. The cepstrum is defined as $c(n) = \mathrm{IDFT}\{\log\left|\mathrm{DFT}\{x(n)\}\right|\}$.
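The cepstrum method can be sketched as follows. This is an illustrative NumPy implementation; the 60-400 Hz search range is an assumed typical pitch range, not a value from the paper.

```python
import numpy as np

def pitch_from_cepstrum(frame, fs, fmin=60.0, fmax=400.0):
    """Cepstrum pitch detection: c = real(IFFT(log|FFT(x)|)); the
    quefrency of the cepstral peak within the plausible pitch range
    corresponds to the pitch period, so F0 = fs / q_peak."""
    spectrum = np.fft.fft(frame)
    cepstrum = np.fft.ifft(np.log(np.abs(spectrum) + 1e-10)).real
    q_lo, q_hi = int(fs / fmax), int(fs / fmin)
    q_peak = q_lo + np.argmax(cepstrum[q_lo:q_hi])
    return fs / q_peak

# Synthetic harmonic-rich voiced frame at F0 = 200 Hz (10 harmonics
# with 1/k amplitudes), Hamming-windowed as in the preprocessing step.
fs, n = 16000, 1600
t = np.arange(n) / fs
frame = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 11))
f0 = pitch_from_cepstrum(frame * np.hamming(n), fs)
```

The periodic ripple of the harmonics in the log spectrum produces a cepstral peak at the pitch period, here 80 samples, giving an estimate near 200 Hz.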
The process of extracting the fundamental frequency is shown in Figure 8.
Extracting the resonance peak frequencies
After the cepstrum calculation, the fundamental frequency and the spectral envelope can be separated. Taking the DFT of the cepstral coefficients and then the logarithm of its modulus yields a smooth logarithmic spectrum that reflects the resonance structure of the input audio; the frequencies of its spectral peaks are the resonance peak frequencies, so finding the maxima of this smooth logarithmic spectrum gives the resonance peaks. The process of extracting the resonance peaks is shown in Figure 9.
Extracting Mel-frequency cepstral coefficients
The Mel cepstral coefficients effectively model the auditory perception of the human ear, so they are widely used in speaker recognition and in speech gender and emotion recognition. Their extraction has two steps: first, the frequency of the input audio signal is converted to the Mel frequency; second, cepstral computation is performed on the Mel frequency to obtain the Mel cepstral coefficients. The Mel cepstral coefficient extraction process is shown in Figure 10.

Process of extracting the fundamental frequency

The process of extracting the resonance peak

Mel cepstral coefficient extraction process
The formula for converting the actual frequency of the input audio signal to the Mel frequency is as follows:
$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$
After calculating the Mel frequency, a discrete Fourier transform is performed on the audio signal so that its energy distribution can be observed; the larger the energy spectrum, the more eigenvalues are required to recover the signal. Let the speech time-domain signal of a frame be $x(n)$ with frame length $N$; its DFT is $X(k) = \sum_{n=0}^{N-1} x(n)\,e^{-j2\pi nk/N}, \; 0 \le k \le N-1$, and the energy spectrum is $|X(k)|^2$.
The energy spectrum is then filtered with a triangular filter bank, which reduces the amount of computation; at the same time, the resonance phenomenon becomes more pronounced, the influence of other waveforms on the audio signal is reduced, and the eigenvalues become easier to obtain. Assume there are $M$ triangular filters with center frequencies $f(m)$, $m = 1, 2, \ldots, M$, spaced uniformly on the Mel scale.
The frequency response of the triangular filter is expressed as follows:
$H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}$
After processing with the triangular filter bank, the logarithmic energy of each triangular filter is calculated using the following equation:
$s(m) = \ln\left(\sum_{k=0}^{N-1} |X(k)|^2 H_m(k)\right), \quad 1 \le m \le M$
After calculating the logarithmic energy, the Mel-frequency cepstral coefficients are obtained by computing the inverse discrete cosine transform:
$C(n) = \sum_{m=1}^{M} s(m) \cos\left(\frac{\pi n (m - 0.5)}{M}\right), \quad n = 1, 2, \ldots, L$
where $M$ is the number of triangular filters and $L$ is the order of the Mel cepstral coefficients.
Because the static Mel-frequency cepstral coefficients express only limited characteristics, they cannot fully represent the original audio signal; in effect they describe only the static properties of speech. Difference parameters are therefore appended to capture its dynamic characteristics. The formula for the difference parameter is shown in equation (15):
$d(t) = \dfrac{\sum_{k=1}^{K} k\left(C(t+k) - C(t-k)\right)}{2\sum_{k=1}^{K} k^2}$ (15)
where $C(t)$ is the cepstral coefficient of the $t$-th frame and $K$ is the width of the difference window.
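The full chain from windowed frames to Mel cepstral coefficients with appended difference parameters can be sketched as follows. This is a compact NumPy illustration: the 26 filters, 13 retained coefficients, and the simple difference $d(t) = (C(t{+}1) - C(t{-}1))/2$ (equation (15) with $K=1$) are common defaults assumed here, not values from the paper.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced evenly on the Mel scale,
    Mel(f) = 2595 * log10(1 + f / 700)."""
    mel_max = 2595 * np.log10(1 + (fs / 2) / 700)
    hz_pts = 700 * (10 ** (np.linspace(0, mel_max, n_filters + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):                      # rising slope
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                      # falling slope
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    return fbank

def mfcc(frames, fs, n_filters=26, n_ceps=13):
    """Frames -> power spectrum -> Mel filterbank -> log energy -> DCT."""
    n_fft = frames.shape[1]
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    log_e = np.log(power @ mel_filterbank(n_filters, n_fft, fs).T + 1e-10)
    m = np.arange(1, n_filters + 1)
    return np.stack([
        np.sqrt(2 / n_filters) *
        np.sum(log_e * np.cos(np.pi * n * (m - 0.5) / n_filters), axis=1)
        for n in range(n_ceps)
    ], axis=1)

def delta(ceps):
    """Append first-order difference parameters (K = 1)."""
    d = np.zeros_like(ceps)
    d[1:-1] = (ceps[2:] - ceps[:-2]) / 2
    return np.hstack([ceps, d])

frames = np.random.randn(98, 400)            # windowed frames, 16 kHz
feats = delta(mfcc(frames, fs=16000))        # 13 static + 13 delta per frame
```

Each row of `feats` is the 26-dimensional static-plus-dynamic feature vector for one frame.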
In this section, a deep learning model based on the combination of CNN-LSTM-Attention mechanism is proposed to extract temporal features in speech and establish the mapping relationship between temporal features and speech emotion.
Convolutional neural networks not only have the advantages of traditional neural networks, such as adaptability and fault tolerance, but also offer automatic feature extraction and weight sharing. A convolutional neural network calculates its output through forward propagation, and the biases and weights used in the calculation are adjusted through backward propagation. Convolutional and pooling layers are the most basic structures of convolutional neural networks, and all convolutional neural network models are improvements on this basic structure.
The convolutional layer consists of two phases: feature extraction and feature mapping. In the feature extraction stage, each neuron is connected to the feature maps of the previous layer, and a convolutional filter performs the convolution operation to extract local features. After local feature extraction, feature mapping is required, generally by applying an activation function to map each value. The most commonly used activation function is the Sigmoid function, whose formula is $f(x) = \dfrac{1}{1 + e^{-x}}$.
Direct classification or regression prediction after the convolutional layer is prone to overfitting, so a pooling layer is generally introduced. The pooling layer collects the main feature information, reducing the dimensionality of the training data to lower the probability of overfitting, and also making the network more robust to changes in the size of the feature data. The general form of the pooling layer can be written as
$x_j^{l} = f\left(\beta_j^{l}\,\mathrm{down}(x_j^{l-1}) + b_j^{l}\right)$
where $\mathrm{down}(\cdot)$ is the down-sampling (pooling) function and $\beta_j^{l}$ and $b_j^{l}$ are the multiplicative and additive biases.
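A minimal example of one common choice of the down-sampling function, max pooling over non-overlapping 2x2 regions, keeping only the strongest response in each region:

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling: keep the maximum of each
    size x size region, shrinking the feature map by `size` per axis."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

fmap = np.array([[1., 2., 0., 1.],
                 [3., 4., 1., 0.],
                 [0., 1., 5., 6.],
                 [2., 1., 7., 8.]])
pooled = max_pool2d(fmap)   # -> [[4., 1.], [2., 8.]]
```

The 4x4 feature map shrinks to 2x2 while the dominant activations survive, which is exactly the dimensionality reduction described above.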
The LSTM recurrent neural network has a processing unit called a cell, which determines whether intermediate learning results are useful for the overall result. Each cell contains three gate units: an input gate, an output gate, and a forget gate, which filter the computational results; the structure of the LSTM gate unit is shown in Figure 11. The forget gate filters out results of each LSTM operation that do not meet the required conditions, retaining only those beneficial to the network. Current research shows that the LSTM recurrent neural network can solve the long-term dependence problem and generalizes very well.

LSTM unit structure
Suppose $x_t$ is the input at time $t$, $h_{t-1}$ is the previous hidden state, and $c_{t-1}$ is the previous cell state. The gate computations can be written as
$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \quad i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$
where $\sigma$ is the Sigmoid function and the $W$ and $b$ terms are the weight matrices and biases of each gate. The cell state and output are then updated as
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \quad h_t = o_t \odot \tanh(c_t)$
where $\odot$ denotes element-wise multiplication.
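One LSTM time step can be written out directly. This is an illustrative NumPy sketch of the standard gate equations, with randomly initialized weights rather than trained ones; the hidden size 4 and input size 3 are arbitrary example values.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step: the forget, input, and output gates decide
    what the cell state discards, stores, and exposes. W maps the
    concatenated [h_prev, x_t] to the four gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.size
    f = sigmoid(z[:H])           # forget gate
    i = sigmoid(z[H:2 * H])      # input gate
    o = sigmoid(z[2 * H:3 * H])  # output gate
    g = np.tanh(z[3 * H:])       # candidate cell state
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
H, D = 4, 3                                  # hidden size, input size
W = rng.standard_normal((4 * H, H + D))
b = np.zeros(4 * H)
h, c = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, b)
```

Iterating `lstm_step` over the frames of an utterance yields the sequence of hidden states that the attention layer then weighs.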
Introducing the attention mechanism can remedy structural deficiencies in the design of recurrent neural network models. The attention mechanism, proposed by Treisman and Gelade to simulate the attention mechanism of the human brain, can be simply understood as a combinatorial function that computes a probability distribution of attention to highlight the effect of key inputs on the output. It considers all inputs, but instead of giving each the same weight, it pays more attention to certain specific inputs.
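The weighting idea can be sketched as attention pooling over a sequence of hidden states. The query vector `w` stands in for learned attention parameters, and the hidden states are toy values; the point is only that the softmax turns scores into a probability distribution that emphasizes certain time steps.

```python
import numpy as np

def attention_pool(H, w):
    """Score each time step's hidden state against a query w, softmax
    the scores into a probability distribution, and return the weighted
    sum: higher-scoring steps contribute more to the output."""
    scores = H @ w
    weights = np.exp(scores - scores.max())   # stable softmax
    weights /= weights.sum()
    return weights @ H, weights

# Three time steps of 2-d hidden states; the query favors the last step.
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 5.0]])
w = np.array([0.0, 1.0])
context, weights = attention_pool(H, w)
```

The resulting `context` vector, dominated by the highest-scoring time step, is what feeds the final emotion classifier.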
To verify the effectiveness of the proposed method, several sets of comparison experiments are conducted on two datasets, CASIA and Emo-DB. The per-emotion comparisons of this paper’s method are shown in Figure 12 (Figure a for the CASIA dataset and Figure b for the Emo-DB dataset). As the figure shows, on the CASIA dataset the improved attention mechanism achieves higher recognition accuracy for the three emotions Angry, Fear, and Happy than the other methods; for the Sad emotion it is inferior to SEnet and CBAM, and for the remaining emotions it is about level with the best of the others. On the Emo-DB dataset, it achieves the highest recognition accuracy for the Angry, Disgust, and Fear emotions and is on par with SEnet and ECAnet for Sadness, which further validates the effectiveness of the proposed method.

The comparison of each emotion in the CASIA and Emo-DB data sets
A comparison of the accuracy of the different methods on the CASIA dataset is shown in Table 1. As the table shows, when 3D-DCNN is used for speech emotion recognition on the public CASIA dataset, the model trains relatively fast but achieves only 52.6% recognition accuracy. Experiments B-D improve accuracy to some extent by adding network layers: Experiment B reaches 83.4% by stacking a 34-layer network (ResNet), but its computational complexity is relatively high and it still falls slightly short of the method proposed in this paper, which verifies the effectiveness of our method.
Comparison of the accuracy of different methods on the CASIA dataset
Experimental group | Data set | Processing method | Accuracy rate |
---|---|---|---|
A | CASIA | 3D-DCNN | 52.6% |
B | CASIA | ResNet | 83.4% |
C | CASIA | Data enhancement+DCNN+LSTM | 75.1% |
D | CASIA | CNN+BLSTM | 70.2% |
E | CASIA | Ours | 89.8% |
The comparison of the accuracy of different methods on the Emo-DB dataset is shown in Table 2. As the table shows, the model proposed in this paper achieves the highest recognition accuracy among the compared models on the public Emo-DB dataset, at 87.2%.
Comparison of the accuracy of different methods on the Emo-DB dataset
Experimental group | Data set | Processing method | Accuracy rate |
---|---|---|---|
A | Emo-DB | 3D-CRNN | 82.2% |
B | Emo-DB | Improved speech processing+2-DCNN | 83.9% |
C | Emo-DB | DCNN_LSTM | 81.1% |
D | Emo-DB | DCNN_DTPM | 83.9% |
E | Emo-DB | ML ELM_AE | 81.4% |
F | Emo-DB | Ours | 87.2% |
The classroom video selected for this experiment is a teaching video of ideological and political education at a university. It is a public video recorded in an offline teaching scene, which ensures the authenticity and validity of the experiment. The teacher interacts frequently with the students, often lets them discuss problems, and asks many questions about specific problems. Classroom activities included lecturing, questioning, answering, student exercises, discussions, and experiments. The class was taught by a female teacher of about 40 years of age, lasted 40 minutes, and involved 24 students who spoke a total of 22 times.
For classroom speech interaction analysis, the audio data must be cropped to remove the muted segments when students are doing exercises and the noisy segments when students are discussing, retaining clear teacher-student interactive speech as the experimental data. After removing invalid speech segments such as classroom discussions, exercises, and experiments, six effective interaction speech segments were retained, with durations of 6 minutes 30 seconds, 1 minute 30 seconds, 3 minutes 40 seconds, 3 minutes 50 seconds, 6 minutes 40 seconds, and 1 minute, totaling 22 minutes 50 seconds of effective classroom interaction speech.
Classroom interaction segment 1 is the beginning of the class, dominated by the teacher’s lecture, which mainly reviews knowledge and poses classroom questions to check the students’ mastery of the previous lesson; the emotion changes of interaction segment 1 are shown in Figure 13. In the first 2 minutes of the lecture, the emotion value P lies mainly in the range 0 to 1.5, indicating that the teacher’s emotion has not yet been fully activated and is at a low degree of positivity at the start of the lesson. From 2 minutes to 5 minutes 30 seconds, the teacher questions the students on the content of the previous lesson, so overall emotion fluctuates widely, varying within the range [-1,1]: during classroom question and answer, the discussion is carried in a questioning tone and the students answer together, so the teacher’s and students’ voices interleave and the emotion assessment value fluctuates strongly. From 5 minutes 30 seconds to 6 minutes, as the teacher introduces the application of ideology and politics in everyday life, the affective value P increases, indicating that the teacher’s mood is positive when extending ideological and political knowledge, which is conducive to activating the teacher’s affective state.

Emotion changes of interaction segment 1
Classroom interaction segment 2 focused on explaining the students’ ideological and political test questions. The questions discussed were closed questions; three students took the initiative to speak in turn, and after 1:05 the teacher gave a positive evaluation of the students’ explanations and summarized the knowledge. The emotion changes in interaction segment 2 are shown in Figure 14. The emotion value P of segment 2 is mainly distributed between [1.5,2.1]. Before 1:05, during the students’ speeches, the emotion is positive; after 1:05, the teacher’s emotion rises from the range [0,1.5] seen in segment 1 to about 1.8, indicating that the students’ active speaking has a positive effect on the teacher’s emotion and that teacher and student emotions are correlated. Compared with segment 1, the emotion value of segment 2 is significantly higher, indicating that by the 8th minute of the class the speech emotion value P reaches a high level and the classroom is fully activated.

Emotion changes of interaction segment 2
Interactive voice segment 3 follows the students’ exercises and discussions and is a practice Q&A segment in which students correct errors in the questions; the overall affective value P is distributed between [1.1,1.8], and the emotion changes of interaction segment 3 are shown in Figure 15. The affective values are similar to those of segment 2: the teacher asks continuous questions, the students answer confidently and correctly, and the teacher gives encouragement and affirmation, indicating that in the question-and-answer stage the classroom remains active and the students’ emotions are actively mobilized.

Emotion changes of interaction segment 3
In classroom interaction segment 4, the students explained the reasons behind a problem, which was an open-ended question; three students took the initiative to speak in the first minute, after which the teacher guided and communicated with the students according to the question. The overall classroom affective value P is distributed in the range [0.2,1.3], and the emotion changes of interaction segment 4 are shown in Figure 16. At 1:50 the teacher rewards the students who perform well and the voice affective value rises, while at 2:30 the teacher questions the students in a doubtful tone and voice pleasantness drops to near 0.8, which indicates a certain validity of the model.

Emotion changes of interaction segment 4
Classroom interaction segment 5 was preceded by a 5-minute accompanying exercise and a 1-minute discussion; its emotion changes are shown in Figure 17. The segment mainly consists of answering the accompanying exercise and a question-and-answer session in which students explain the topic. Emotion fluctuates widely because five different students give explanations; each student’s emotional pleasantness differs, and there may be nervousness during blackboard explanations. Because the emotion value P of segment 5 is distributed in [0.3,1.3], a marked decline from the earlier pleasant emotion, the students’ classroom motivation has declined in the later part of the class.

Interactive fragment five emotional changes
The emotions of classroom interaction segment VI are shown in Figure 18. The lesson content here is the classroom summary: the teacher asked the students to reflect on what they had learned in this lesson, and two students shared their learning gains. Classroom emotions show a rising trend, with the P value distributed in [0.8, 1.4], which shows that students' emotions run higher when they speak freely.

Interactive fragment six emotional changes
The above experiments show that the overall classroom emotion value P is mostly distributed in [0, 2], indicating that the classroom as a whole is in a positive and active state. Positive emotion has a positive impact on the effect of teacher-student interaction, indicating that the atmosphere of this ideological and political classroom is good and that classroom teacher-student interaction forms a virtuous cycle.
Viewed as a whole, the pleasantness of classroom interaction rises first and then falls, which may be related to the stage of the course. At the beginning of the lesson the classroom atmosphere is not yet activated and emotion fluctuates widely; by about the 8th minute, positive classroom emotion is activated and pleasantness remains high, and it is still high at the 30-minute mark. In the second half of the lesson, however, pleasantness is at a lower level overall, indicating that the voice emotions of both students and teacher begin to weaken late in the class. This analysis shows that after the method of this paper is applied to ideological and political education in colleges and universities, the classroom has a good teaching atmosphere and the teacher's methods are effective: discussions, exercises, experiments and other activities enrich the classroom and stimulate students' enthusiasm, so that classroom pleasantness is sustained for a long time, and the teacher makes good use of rewards and incentives to motivate students to think actively.
In this section, 28 video cases of ideological and political education classes at a university are selected for analysis. Basic statistics for the video lesson cases of classes 1 through 4 are shown in Table 3. As the table shows, the overall average conversion rate of teacher-student interaction behavior in the ideological and political classroom is 0.3462; the class averages are 0.3643 for class 1, 0.3371 for class 2, 0.3371 for class 3, and 0.3471 for class 4. It can be concluded that class 1 has the highest conversion rate of classroom teacher-student interaction behavior.
Video class case basic information statistics
Class | Province | Teacher Number | Total Number of Actions (N) | Serial Behavior Number (G) | Behavioral Conversion Rate (G-1)/N | Teaching Model
---|---|---|---|---|---|---
Class 1 | Guizhou | A1 | 79 | 26 | 0.32 | Interactive type
 | Jiangsu | A2 | 78 | 35 | 0.44 | Dialog type
 | Shanghai | A3 | 72 | 26 | 0.35 | Interactive type
 | Ningxia | A4 | 74 | 24 | 0.31 | Interactive type
 | Shaanxi | A5 | 80 | 35 | 0.43 | Dialog type
 | Sichuan | A6 | 76 | 29 | 0.37 | Interactive type
 | Xinjiang | A7 | 72 | 25 | 0.33 | Interactive type
Class 2 | Anhui | B1 | 82 | 37 | 0.44 | Dialog type
 | Beijing | B2 | 84 | 22 | 0.25 | Interactive type
 | Hubei | B3 | 80 | 21 | 0.25 | Interactive type
 | Jilin | B4 | 87 | 23 | 0.25 | Interactive type
 | Jiangxi | B5 | 77 | 39 | 0.49 | Dialog type
 | Liaoning | B6 | 83 | 29 | 0.34 | Interactive type
 | Qinghai | B7 | 82 | 29 | 0.34 | Interactive type
Class 3 | Beijing | C1 | 79 | 32 | 0.39 | Interactive type
 | Guangxi | C2 | 66 | 34 | 0.50 | Dialog type
 | Hainan | C3 | 71 | 23 | 0.31 | Practice type
 | Heilongjiang | C4 | 82 | 19 | 0.22 | Interactive type
 | Inner Mongolia | C5 | 83 | 29 | 0.34 | Interactive type
 | Yunnan | C6 | 83 | 23 | 0.27 | Interactive type
 | Chongqing | C7 | 78 | 27 | 0.33 | Interactive type
Class 4 | Fujian | D1 | 71 | 26 | 0.35 | Interactive type
 | Guangdong | D2 | 80 | 31 | 0.38 | Interactive type
 | Hebei | D3 | 86 | 31 | 0.35 | Interactive type
 | Henan | D4 | 78 | 18 | 0.22 | Practice type
 | Hunan | D5 | 75 | 26 | 0.33 | Interactive type
 | Tianjin | D6 | 81 | 38 | 0.46 | Dialog type
 | Zhejiang | D7 | 77 | 27 | 0.34 | Interactive type
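The per-teacher rates in Table 3 follow the formula given in the column header, (G-1)/N, and the class averages quoted above can be reproduced from the table's rounded rates. A minimal sketch (the rate values below are transcribed directly from Table 3):

```python
# Behavioral conversion rate: (G - 1) / N, where N is the total number of
# actions and G the number of serial behaviors (definition from Table 3).
def conversion_rate(n_actions: int, n_serial: int) -> float:
    return round((n_serial - 1) / n_actions, 2)

# Teacher A1: N = 79, G = 26 -> (26 - 1) / 79 = 0.32
assert conversion_rate(79, 26) == 0.32

# Class averages over the seven teachers in each class (rates from Table 3).
rates = {
    "Class 1": [0.32, 0.44, 0.35, 0.31, 0.43, 0.37, 0.33],
    "Class 2": [0.44, 0.25, 0.25, 0.25, 0.49, 0.34, 0.34],
    "Class 3": [0.39, 0.50, 0.31, 0.22, 0.34, 0.27, 0.33],
    "Class 4": [0.35, 0.38, 0.35, 0.22, 0.33, 0.46, 0.34],
}
means = {cls: round(sum(r) / len(r), 4) for cls, r in rates.items()}
print(means)  # Class 1 has the highest average, 0.3643
```

The averages computed this way match the values reported in the text (0.3643, 0.3371, 0.3371, 0.3471).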
Among the 28 selected video lesson cases, interactive lessons using the method proposed in this paper account for 71%. In more than half of the actual teaching, teacher and student activity are comparable in proportion, with frequent teacher-student interaction and ample time for teachers to guide and comment on learning outcomes. This reflects the "teacher-led, student-centered" teaching concept advocated in quality lessons: the teacher's leading role cultivates students' spirit of independent learning and realizes classroom interaction. Applying the method proposed in this paper to ideological and political education in colleges and universities can not only cultivate students' interest in learning but also increase their confidence in learning, fostering a spirit of mutual cooperation and giving full play to students' role as the main body, so that students truly become the masters of their own learning.
According to the coding table of teacher-student interaction behaviors in the ideological and political education classroom in colleges and universities, the teacher-student interaction behaviors of the four classes were coded. The data for Class 1, Class 2, Class 3, and Class 4, coded in NVivo, were then imported in time-series order into GSEQ (software for analyzing teacher-student interaction behaviors) for lag sequential analysis, in which behavioral transition sequences with Z-scores greater than 3 were selected from the adjusted residual table to indicate the significance level of the sequences. With the help of Gephi, a schematic diagram of teacher-student interaction behavior transition relationships was drawn, which clarifies the relationships between the behavioral sequences more intuitively. On this basis, the significant behavioral sequences of classroom teacher-student interaction were analyzed, the characteristics of teacher-student interaction in ideological and political education classrooms in colleges and universities were summarized, and useful references were provided for teachers' professional development and the level of classroom interaction.
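The Z-scores that lag sequential analysis tools such as GSEQ report are adjusted residuals of the lag-1 transition frequency matrix. The sketch below is an illustrative reimplementation of the standard Allison-Liker adjusted residual, not the tool's actual code, and the 3-behavior frequency matrix is a made-up toy example, not data from this study:

```python
import math

def adjusted_residuals(freq):
    """Allison-Liker adjusted residuals for a lag-1 transition
    frequency matrix freq[i][j] (row = given behavior, col = following behavior)."""
    n = sum(sum(row) for row in freq)
    row_tot = [sum(row) for row in freq]
    col_tot = [sum(freq[i][j] for i in range(len(freq))) for j in range(len(freq[0]))]
    z = [[0.0] * len(freq[0]) for _ in freq]
    for i, row in enumerate(freq):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n          # expected transition count
            var = exp * (1 - row_tot[i] / n) * (1 - col_tot[j] / n)
            z[i][j] = (obs - exp) / math.sqrt(var) if var > 0 else 0.0
    return z

# Toy 3-behavior transition counts (hypothetical, for illustration only).
freq = [[2, 10, 3],
        [9, 1, 4],
        [3, 5, 8]]
z = adjusted_residuals(freq)
# Cells with Z > 1.96 are conventionally read as significant at p < 0.05;
# this paper applies the stricter cutoff Z > 3.
```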
The teacher-student interaction behavior of the Class 1 ideological and political education classroom was coded with the coding system, the data were processed and transformed, and the software was used to generate a behavioral transition table and then an adjusted residual table in Excel. The adjusted residuals of teacher-student interaction behavior transitions in the ideological and political education classroom are shown in Table 4 (codes 1-14 denote the 14 behavior categories in the coding system).
Adjusted residuals of teacher-student interaction behavior transitions
 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 3.89 | 1.27 | -0.82 | -1.88 | -0.61 | 5.87 | 0.89 | 0.14 | 0.43 | -0.96 | 0.57 | -1.9 | -0.66 | 1.38 |
2 | 1.28 | 1 | 0.3 | -0.63 | 3.35 | 4.5 | -1.25 | 1.15 | -1.52 | 1.4 | -0.98 | -1.54 | 4.6 | 1.99 |
3 | 1.08 | -0.72 | -0.74 | 0.26 | -0.36 | -0.68 | 1.53 | 0.42 | 0.49 | 1.11 | 1.46 | 0.91 | -1.69 | 0.18 |
4 | 4.52 | -1.17 | 4.85 | 11.87 | 0.06 | 2 | -1.74 | 0.73 | -1.82 | -1.55 | -0.99 | -0.4 | -0.14 | 0.75 |
5 | -0.68 | 1.98 | 1.06 | -1.03 | 21.21 | -0.71 | -0.33 | -1.91 | 0.33 | 1.49 | -0.41 | 0.2 | -0.53 | -1.66 |
6 | -0.36 | 1.59 | 1.19 | -0.68 | 1.75 | 4.47 | -1.4 | 0.08 | 0.55 | -1.31 | 0.19 | -0.82 | 1.84 | -1.28 |
7 | 0.01 | 0.91 | -0.8 | 1.21 | 1.18 | -1.35 | 22.84 | -0.51 | -0.38 | -1.78 | -0.71 | 3.25 | -1.42 | 1.13 |
8 | 0.26 | 1.2 | 0.25 | 3.5 | 1.16 | -0.7 | 0.04 | -0.85 | -0.08 | -1.38 | 0.85 | 1.3 | 0.86 | 1.38 |
9 | 0.83 | 0.72 | -1.98 | 1.63 | 0.34 | 1.56 | 1.79 | -1.51 | 8.45 | 4.55 | 0.45 | -1.91 | -1.09 | 1.77 |
10 | 1.32 | -0.76 | 0.42 | -1.94 | 1.79 | -1.13 | -1.88 | -1.61 | 3.6 | 12.84 | -0.64 | 0.58 | 0.12 | 0.91 |
11 | -0.74 | -0.55 | -1.25 | 0.45 | -1.53 | 0.9 | 1.66 | -0.89 | -1.2 | 1.27 | 16.54 | -0.45 | -0.6 | 0.07 |
12 | 5.64 | 0.16 | -0.17 | -1.97 | -1.16 | -1.95 | 0.58 | -0.44 | 1.09 | -1.88 | -1.95 | 4.87 | -0.29 | -1.56 |
13 | 1.45 | 0.87 | -1.75 | 1.6 | -0.68 | 3.28 | -0.06 | 0.53 | -1.91 | -1.87 | 0.29 | -0.02 | 5.21 | -1.86 |
14 | -1.87 | 0.79 | 1.05 | 0.99 | -0.17 | 0.02 | -1.67 | 0.24 | -0.7 | 1.77 | 0.56 | 3.64 | 4.05 | 0.57 |
From the table, it can be seen that interactive behavioral transitions between teacher and students focused mainly on teacher instructions (6) and teacher feedback (4). The most frequent behavior during interaction in the Class 1 ideological and political classroom was teacher instructions (6): for example, after asking a question, the teacher issued an instruction to the students (1→6, Z=5.87, p<0.05). When the teacher gives praise or encouragement for a student's behavior, the student feels a sense of achievement and confidence (4→3, Z=4.85, p<0.05). The analysis shows that teacher-student interaction follows mainly a question-response-feedback pattern, with most interaction occurring as verbal behavior between teacher and students.
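The significant sequences cited above can be read off Table 4 programmatically by scanning for cells above the paper's Z > 3 cutoff. A minimal sketch, with the residual matrix transcribed from Table 4:

```python
# Adjusted residuals from Table 4 (rows = given behavior, cols = following behavior).
Z = [
    [3.89, 1.27, -0.82, -1.88, -0.61, 5.87, 0.89, 0.14, 0.43, -0.96, 0.57, -1.90, -0.66, 1.38],
    [1.28, 1.00, 0.30, -0.63, 3.35, 4.50, -1.25, 1.15, -1.52, 1.40, -0.98, -1.54, 4.60, 1.99],
    [1.08, -0.72, -0.74, 0.26, -0.36, -0.68, 1.53, 0.42, 0.49, 1.11, 1.46, 0.91, -1.69, 0.18],
    [4.52, -1.17, 4.85, 11.87, 0.06, 2.00, -1.74, 0.73, -1.82, -1.55, -0.99, -0.40, -0.14, 0.75],
    [-0.68, 1.98, 1.06, -1.03, 21.21, -0.71, -0.33, -1.91, 0.33, 1.49, -0.41, 0.20, -0.53, -1.66],
    [-0.36, 1.59, 1.19, -0.68, 1.75, 4.47, -1.40, 0.08, 0.55, -1.31, 0.19, -0.82, 1.84, -1.28],
    [0.01, 0.91, -0.80, 1.21, 1.18, -1.35, 22.84, -0.51, -0.38, -1.78, -0.71, 3.25, -1.42, 1.13],
    [0.26, 1.20, 0.25, 3.50, 1.16, -0.70, 0.04, -0.85, -0.08, -1.38, 0.85, 1.30, 0.86, 1.38],
    [0.83, 0.72, -1.98, 1.63, 0.34, 1.56, 1.79, -1.51, 8.45, 4.55, 0.45, -1.91, -1.09, 1.77],
    [1.32, -0.76, 0.42, -1.94, 1.79, -1.13, -1.88, -1.61, 3.60, 12.84, -0.64, 0.58, 0.12, 0.91],
    [-0.74, -0.55, -1.25, 0.45, -1.53, 0.90, 1.66, -0.89, -1.20, 1.27, 16.54, -0.45, -0.60, 0.07],
    [5.64, 0.16, -0.17, -1.97, -1.16, -1.95, 0.58, -0.44, 1.09, -1.88, -1.95, 4.87, -0.29, -1.56],
    [1.45, 0.87, -1.75, 1.60, -0.68, 3.28, -0.06, 0.53, -1.91, -1.87, 0.29, -0.02, 5.21, -1.86],
    [-1.87, 0.79, 1.05, 0.99, -0.17, 0.02, -1.67, 0.24, -0.70, 1.77, 0.56, 3.64, 4.05, 0.57],
]

# Select transitions above the paper's significance cutoff (Z > 3).
significant = [(i + 1, j + 1, z) for i, row in enumerate(Z)
               for j, z in enumerate(row) if z > 3]
for src, dst, z in significant:
    print(f"{src} -> {dst}: Z = {z}")
```

The sequences discussed in the text, 1→6 (Z=5.87) and 4→3 (Z=4.85), appear among the selected cells.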
The teacher-student behavior transition relationships in the Class 1 ideological and political classroom are shown in Figure 19. The figure indicates that most continuity sequences occur between teacher behaviors and student behaviors, and the continuity of student behaviors is significantly lower than that of teacher behaviors. This shows that interaction in ideological and political education classroom teaching in colleges and universities is organized and guided by the teacher.

Class 1 ideological and political classroom teacher-student behavior transition relationships
This paper studies interactive methods of ideological and political education in colleges and universities in an intelligent teaching environment, taking four classes at a college as the research object and analyzing them with the methods proposed in this paper. The following conclusions are drawn:
Among the video lesson cases of ideological and political education classrooms at the university, class 1 had the highest conversion rate of teacher-student interaction behavior, 0.3643. 71% of the classrooms used the method proposed in this paper for ideological and political teaching, and in these classrooms the proportions of teacher and student activity were comparable and teacher-student interaction was good.
The analysis of teacher-student interaction methods in ideological and political education classrooms in colleges and universities conducted in this study provides a reference for the development of teacher education and plays a positive role in improving the effectiveness of ideological and political education courses.