Research on the Optimization of Personalized Learning Paths and Teaching Practice Strategies of Deep Reinforcement Learning for Dance Choreographers
Published online: 26 Sep 2025
Received: 19 Jan 2025
Accepted: 20 Apr 2025
DOI: https://doi.org/10.2478/amns-2025-1041
Keywords
© 2025 Liang Ma, published by Sciendo.
This work is licensed under the Creative Commons Attribution 4.0 International License.
With the development of the times and the improvement of people’s living standards, dance has become an important form of culture and art, and more and more people have begun to learn and appreciate it. In colleges and universities, choreography has also become a popular major [1-4]. Dance choreography is a comprehensive art discipline that requires students to have solid basic skills and rich knowledge of dance, and to be able to use choreography techniques skillfully in creation and performance. However, the sustainable development of choreography programs in colleges and universities and the cultivation of professional talents cannot be separated from personalized professional teaching methods [5-8].
Dance choreography is an art form covering a variety of elements; it requires not only skill training and performance but also the in-depth exploration and cultivation of individual characteristics. Therefore, the teaching of choreography requires both a grasp of what students have in common and attention to the cultivation of individuality [9-12]. With the development of education, more and more educators and students lean toward personalized learning, and the design and implementation of personalized curricula has become a very important educational task. For a course such as choreography, individualized design and implementation are even more challenging [13-16].
Educators need to understand students’ backgrounds, interests, learning styles, and physical conditions, and, based on these needs, design appropriate choreography programs and activities to meet students’ learning and developmental needs. In addition, when designing personalized dance choreography courses, educators need to emphasize students’ comprehensive literacy [17-20]. Choreography requires a great deal of physical training, cultural awareness, and aesthetic ability, so educators need to design challenging training and practice of choreography works according to students’ physical conditions and abilities [21-23] in order to help students improve their choreography level and quality. Deep reinforcement learning plays an optimizing role for such personalized learning and thus promotes the development of teaching practice [24-25].
Deep reinforcement learning combines the advantages of deep learning and reinforcement learning and provides an efficient decision-optimization method. In this paper, we first review the basic principles of reinforcement learning and deep learning, and design an adaptive learning path recommendation model for dance choreographers based on a reinforcement learning algorithm that combines value and policy. Learning goal features and domain knowledge features are added to the model, and an LSTM and a Transformer are used to predict the learner’s cognitive state and knowledge-point coverage, respectively, while changes in the difficulty of the learning content are also taken into account. The learner’s state, action, and reward values are modeled mathematically using the Actor-Critic algorithm, and the D3QN algorithm is used to implement the choreography content recommendation function. In addition, the effect of learning path optimization is tested through experiments, and the practical effect of the new teaching strategy is verified through t-tests.
Reinforcement learning addresses the problem of how an agent can maximize the rewards it obtains in a complex, uncertain environment. A reinforcement learning system consists of two parts: the agent and the environment. After the agent observes the state $s_t$ returned by the environment, it selects an action $a_t$ according to its policy; the environment then feeds back a reward $r_t$ and moves to the next state $s_{t+1}$, and the interaction repeats.
A policy is the model by which the agent chooses its next action. The agent decides its subsequent actions according to a certain policy. Policies can be categorized into two types: stochastic and deterministic.
The stochastic policy is commonly represented by the conditional probability distribution in equation (1):
$\pi(a\mid s)=P(A_t=a\mid S_t=s)$ (1)
The deterministic policy is as in equation (2):
$a_t=\mu(s_t)$ (2)
The value function is used to evaluate the goodness of the current state: a state is good if it leads to high rewards from the subsequent actions. The larger the value of the value function, the more considerable the expected future reward, and the more favorable the current state is for the agent. For all $s\in S$, the state-value function under policy $\pi$ is defined as in equation (3):
$V_{\pi}(s)=\mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^{k}r_{t+k+1}\mid S_t=s\right]$ (3)
where $\gamma\in[0,1]$ is the discount factor and $r_{t+k+1}$ is the reward received $k+1$ steps after time $t$.
The next state of the agent is determined jointly by its current state and the action it takes at this moment, a process that involves two key elements: the state transition probability and the reward function. The transition probability of taking action $a$ in state $s$ and moving to state $s'$ is defined as in equation (4):
$p(s'\mid s,a)=P(S_{t+1}=s'\mid S_t=s, A_t=a)$ (4)
The reward function, on the other hand, defines the expected reward the system obtains for performing action $a$ in state $s$, as in equation (5):
$R(s,a)=\mathbb{E}\!\left[r_{t+1}\mid S_t=s, A_t=a\right]$ (5)
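For readers less familiar with these definitions, the following toy sketch (not part of this paper’s model) shows how the transition probabilities $p(s'\mid s,a)$ and the reward function $R(s,a)$ of a small two-state MDP can be written down and used in a single Bellman backup; all states, actions, and values are illustrative.

```python
# Toy illustration (not the paper's model): a two-state MDP whose
# transition probabilities p(s'|s,a) and reward function R(s,a)
# are stored as plain dictionaries.
transitions = {
    ("s0", "a0"): {"s0": 0.2, "s1": 0.8},   # p(s'|s0, a0)
    ("s0", "a1"): {"s0": 0.9, "s1": 0.1},
    ("s1", "a0"): {"s0": 0.5, "s1": 0.5},
    ("s1", "a1"): {"s0": 0.0, "s1": 1.0},
}
rewards = {("s0", "a0"): 1.0, ("s0", "a1"): 0.0,
           ("s1", "a0"): 0.5, ("s1", "a1"): 2.0}

def one_step_backup(state: str, action: str, value: dict, gamma: float = 0.9) -> float:
    """One Bellman backup: R(s,a) + gamma * sum_s' p(s'|s,a) * V(s')."""
    return rewards[(state, action)] + gamma * sum(
        p * value[s_next] for s_next, p in transitions[(state, action)].items()
    )

V = {"s0": 0.0, "s1": 0.0}
print(one_step_backup("s0", "a0", V))  # 1.0 on the first backup
```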
Deep reinforcement learning integrates the ability of deep learning to parse complex data with the decision-making ability of reinforcement learning, and can generate optimal decisions directly from multidimensional input information, constructing a seamless end-to-end decision control system. In this system, the agent feeds the information generated by interacting with the environment into the network, accumulating experience and driving iterative updates of the decision-network parameters, with a view to learning the optimal decision policy [27].
The DQN (Deep Q-Network) algorithm is a classic value-based deep reinforcement learning algorithm. DQN combines a convolutional neural network with the Q-learning algorithm of traditional reinforcement learning, and uses an experience replay mechanism to store the transition samples $(s_t, a_t, r_t, s_{t+1})$ generated by interaction with the environment, from which random mini-batches are drawn for training.
The DQN model uses a deep convolutional neural network to approximate the optimal action-value function in equation (7):
$Q^{*}(s,a)=\max_{\pi}\mathbb{E}\!\left[r_t+\gamma r_{t+1}+\gamma^{2} r_{t+2}+\cdots\mid s_t=s, a_t=a,\pi\right]$ (7)
where $\theta$ denotes the parameters of the approximating network $Q(s,a;\theta)$.
In addition to using a deep convolutional network $Q(s,a;\theta)$ to approximate the current value function, the DQN model uses another network of the same structure, the target network $Q(s,a;\theta^{-})$, to produce the target values used in training.
At network initialization, the target-network parameters $\theta^{-}$ are copied from the current-network parameters $\theta$; thereafter, $\theta^{-}$ is refreshed with $\theta$ only every fixed number of iterations. Training minimizes the squared error between the target value $y_t=r_t+\gamma\max_{a'}Q(s_{t+1},a';\theta^{-})$ and the current estimate $Q(s_t,a_t;\theta)$.
Using SGD, taking the partial derivatives with respect to the parameter $\theta$ gives the gradient formula
$\nabla_{\theta}L(\theta)=\mathbb{E}\!\left[\big(y_t-Q(s_t,a_t;\theta)\big)\nabla_{\theta}Q(s_t,a_t;\theta)\right]$.
A significant drawback of the DQN algorithm is that the max operator in the target tends to overestimate action values, and the algorithm is only applicable to discrete, low-dimensional action spaces.
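The temporal-difference target and loss described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the network used in this paper: the state dimension, the number of actions, and the two-layer MLP are assumed placeholders.

```python
import torch
import torch.nn as nn

# Minimal sketch of the DQN update described above; state_dim, n_actions
# and the two-layer MLP are illustrative choices, not the paper's network.
class QNet(nn.Module):
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, s):
        return self.net(s)

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch                    # tensors sampled from the replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; theta)
    with torch.no_grad():                            # the target network is not updated by this loss
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)
```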
Policy-based methods are suitable for continuous or high-dimensional action spaces and have the advantages of simple policy parameterization and fast convergence.
The REINFORCE algorithm is a typical policy-based deep reinforcement learning algorithm. It is based on the idea of gradient ascent to maximize long-term returns by directly updating the parameters of the policy function. The core idea of the REINFORCE algorithm is to use the policy function to define the probability distribution of choosing an action in a given state, and then compute the gradient based on the trajectories obtained from sampling, and ultimately use the gradient ascent method to update the parameters of the policy function.
The advantage of the REINFORCE algorithm is that it directly optimizes the policy function without needing to estimate a value function, making it suitable for problems with both discrete and continuous action spaces. However, the REINFORCE algorithm also has disadvantages, such as low sampling efficiency and high variance [29].
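A minimal sketch of the REINFORCE update follows, assuming a policy network that outputs a categorical distribution over actions and that the log-probabilities of the sampled actions have been collected along one trajectory; the return normalization is a common variance-reduction add-on rather than part of the basic algorithm.

```python
import torch

# Minimal sketch of the REINFORCE update: weight the log-probabilities of the
# sampled actions by the discounted return and ascend the policy gradient.
def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    returns, g = [], 0.0
    for r in reversed(rewards):                 # compute discounted returns G_t
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # optional variance reduction
    loss = -(torch.stack(log_probs) * returns).sum()               # gradient ascent via negative loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```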
The Actor-Critic algorithm is a reinforcement learning algorithm that combines the policy-gradient approach with the value-function approach. It consists of two parts: the Actor, which represents the policy, and the Critic, which estimates the value function.
The Actor is responsible for tuning the parameter vector $\theta$ of the parameterized policy $\pi_{\theta}(a\mid s)$ by policy gradient, while the Critic approximates the value function with a parameterized vector $w$ and provides the evaluation signal used to update the Actor.
The Actor network can be described as a network that computes the probabilities of all available actions and selects the one with the highest output value, while the Critic network evaluates the selected action by estimating the value of the new state resulting from executing that action [30].
The Deterministic Policy Gradient (DPG) algorithm is a common Actor-Critic algorithm. DPG models the policy as a deterministic policy $a=\mu_{\theta}(s)$ rather than as a probability distribution over actions.
Compared with the stochastic policy gradient, the deterministic policy gradient removes the integral over actions and integrates only over states, so importance sampling over actions is not needed; the gradient becomes equation (12), which greatly improves efficiency:
$\nabla_{\theta}J(\mu_{\theta})=\mathbb{E}_{s\sim\rho^{\mu}}\!\left[\nabla_{\theta}\mu_{\theta}(s)\,\nabla_{a}Q^{\mu}(s,a)\big|_{a=\mu_{\theta}(s)}\right]$ (12)
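The following sketch illustrates one Actor-Critic update with a deterministic actor in the spirit of Eq. (12); the network sizes, learning rates, and the use of a one-step TD target are illustrative assumptions, not the configuration used later in this paper.

```python
import torch
import torch.nn as nn

# Sketch of one Actor-Critic update with a deterministic actor, following
# the deterministic policy gradient in Eq. (12). Network sizes are illustrative.
actor  = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2), nn.Tanh())
critic = nn.Sequential(nn.Linear(8 + 2, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ac_step(s, a, r, s_next, gamma=0.99):
    # Critic: regress Q(s, a) toward the one-step TD target.
    with torch.no_grad():
        q_next = critic(torch.cat([s_next, actor(s_next)], dim=1))
        target = r + gamma * q_next
    q = critic(torch.cat([s, a], dim=1))
    critic_loss = nn.functional.mse_loss(q, target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend grad_theta mu(s) * grad_a Q(s, a)|a=mu(s), i.e. maximize Q(s, mu(s)).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```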
In this study, the ALPRM (adaptive learning path recommendation model) shown in Fig. 1 is constructed based on the deep reinforcement learning framework; the model consists of two layers: dynamic learning environment characterization and adaptive learning path recommendation.
In the dynamic learning environment characterization layer, the core dynamic features among the learner’s personality traits and the domain knowledge features are extracted to characterize the dynamic learning environment. In the adaptive learning path recommendation layer, the main components of the MDP are redefined: the “state” is defined as the representation model of the dynamic learning environment, the “action space” as the set of candidate learning objects, and the “return value” as a function of the difficulty feature of the relevant learning object. The dynamic environment feature variables are used to train the policy network for deep reinforcement learning, and the trained model is finally used to recommend the learning object that best fits the learner’s current learning state.

ALPRM diagram integrating domain knowledge features
In this study, learning goal features and domain knowledge features are added to characterize the dynamic learning environment as a joint representation of the learner’s cognitive state, learning goal, knowledge-point concept coverage, and learning-content difficulty.
In this study, the LSTM model was used to predict the cognitive state of the learner, the Transformer model was used to predict the conceptual coverage of the next knowledge point, and the dynamic difficulty value of the learning object was calculated based on the cognitive state of the learner.
Suppose there is a course $C$ containing a total of $n$ knowledge-point concepts.
The LSTM model is used to predict learners’ mastery of knowledge concepts and to track their cognitive state. The input to the LSTM model is the learner’s historical answer sequence, in which each element encodes the exercise attempted and whether it was answered correctly.
where $\cdot$ denotes the dot product, $\sigma$ denotes the sigmoid activation function, and $W$ and $b$ are the weight matrices and bias vectors of the gates.
When the training of the LSTM model is finished, a learner’s historical answer records are input and the output of the model is his or her mastery of all the knowledge-point concepts of the course, denoted as a cognitive-state vector with one component in $[0,1]$ per concept.
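A compact sketch of such an LSTM-based cognitive-state tracker is given below; the encoding of the (exercise, response) pairs, the number of concepts, and the hidden size are assumptions made for illustration, not the exact architecture of the model above.

```python
import torch
import torch.nn as nn

# Sketch of an LSTM-based cognitive-state tracker: the input at each step is
# an encoding of (exercise attempted, correct / incorrect), and the output is
# the predicted mastery of each knowledge-point concept.
class CognitiveStateLSTM(nn.Module):
    def __init__(self, n_concepts: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2 * n_concepts, hidden_size=hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, n_concepts), nn.Sigmoid())

    def forward(self, answer_seq):
        # answer_seq: [batch, time, 2 * n_concepts]; mastery in [0, 1] per concept.
        out, _ = self.lstm(answer_seq)
        return self.head(out[:, -1, :])          # cognitive state after the last answer

model = CognitiveStateLSTM(n_concepts=10)
history = torch.zeros(1, 5, 20)                   # a learner's 5 most recent (exercise, response) records
print(model(history).shape)                       # torch.Size([1, 10]) -> mastery of K1..K10
```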
The Transformer model is used to predict knowledge-point concept coverage in order to pinpoint the knowledge-point concepts that the learner should learn next. Positional coding is embedded in the input of the Transformer model to characterize the sequential information in the historical learning record; the input of the model is the sum of the embedding of the historical exercise sequence and the corresponding positional encoding.
The Transformer model utilizes a decoder to predict the concept coverage of the next question. The decoder is connected to the encoder through a self-attention mechanism, and the output of the model is finally obtained through a fully connected neural network.
When the training of the Transformer model is finished and a learner’s record of exercises is input, the output of the model is the probability of occurrence of every knowledge-point concept of the course in the next exercise, denoted as a coverage-probability vector with one component per concept, where a larger component indicates that the corresponding concept is more likely to be the one the learner should study next.
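The coverage predictor can be sketched as follows. Note that the model described above uses an encoder-decoder Transformer; for brevity this sketch uses an encoder-only stack with learned positional embeddings, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Simplified sketch of a knowledge-point coverage predictor (encoder-only stand-in
# for the encoder-decoder model described in the text). Sizes are illustrative.
class CoveragePredictor(nn.Module):
    def __init__(self, n_concepts: int, d_model: int = 64, max_len: int = 200):
        super().__init__()
        self.item_emb = nn.Embedding(n_concepts, d_model)
        self.pos_emb  = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Sequential(nn.Linear(d_model, n_concepts), nn.Sigmoid())

    def forward(self, concept_ids):
        # concept_ids: [batch, time] indices of the concepts practised so far.
        pos = torch.arange(concept_ids.size(1), device=concept_ids.device)
        x = self.item_emb(concept_ids) + self.pos_emb(pos)   # embedding + positional coding
        h = self.encoder(x)
        return self.head(h[:, -1, :])   # probability that each concept is covered by the next exercise

pred = CoveragePredictor(n_concepts=10)
print(pred(torch.tensor([[0, 3, 3, 7]])).shape)   # torch.Size([1, 10])
```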
Difficulty is a core factor to consider when recommending study materials for choreography; this study uses Equation (18) and Equation (19) to calculate the difficulty of the exercises.
The Actor-Critic components designed in this paper include: states, actions, and return values.
State: this study treats the dynamic learning environment as the state of the Actor-Critic framework, characterized by the learner’s cognitive state, learning goal, knowledge-point concept coverage, and learning-content difficulty described above.
Action: the policy network is a pre-trained neural network model that accepts the learning-environment state, samples a learning object from the action space according to the output probability distribution, and recommends it to the learner, where the action space is the set of candidate learning objects (exercises) of the course.
Reward value: this study refines the calculation of the reward value by giving the agent a reward both at each step of its exploration and at the end of the exploration; the reward function is designed to combine these step-wise and terminal rewards.
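As a purely hypothetical illustration of how these three components might be assembled in code, the sketch below concatenates the environment features into a state vector and shapes a step reward from goal matching and difficulty fit; the actual reward formula and weights of this paper are defined by the equations in this subsection and are not reproduced here.

```python
import numpy as np

# Hypothetical illustration only: the terms and weights below are assumptions,
# not the reward function defined in this paper.
def build_state(cognitive_state, goal_mask, coverage_prob, difficulty):
    # Concatenate the dynamic-environment features into one state vector.
    return np.concatenate([cognitive_state, goal_mask, coverage_prob, [difficulty]])

def step_reward(recommended_concepts, goal_mask, difficulty, mastery, done, w=(1.0, 0.5, 2.0)):
    # Reward a recommendation whose concepts match the learning goal and whose
    # difficulty sits near the learner's current mastery; add a terminal bonus.
    goal_match = float(np.mean(goal_mask[recommended_concepts]))
    difficulty_fit = 1.0 - abs(difficulty - (1.0 - float(np.mean(mastery))))
    return w[0] * goal_match + w[1] * difficulty_fit + (w[2] if done else 0.0)
```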
The D3QN algorithm is used to realize the choreography study-material recommendation function. Two Q-networks of the same structure are maintained: a current network and a target network. The current network selects the action with the largest estimated value, while the target network evaluates the selected action, which alleviates the overestimation problem of standard DQN, where $\theta$ denotes the parameters of the current network and $\theta^{-}$ those of the target network.
After the algorithm is run for many iterations, the policy network is trained. When all the variables in the dynamic learning environment model constructed above are input into the neural network, the corresponding exercises can be output.
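A sketch of the two networks implied by this design is given below: a dueling Q-network and a Double-DQN target in which the current network selects the exercise and the target network evaluates it. Layer sizes and the exact head structure are illustrative assumptions, not the configuration used in this study.

```python
import torch
import torch.nn as nn

# Sketch of a dueling Q-network plus a Double-DQN target computation:
# the current network selects the exercise, the target network evaluates it.
class DuelingQNet(nn.Module):
    def __init__(self, state_dim: int, n_exercises: int):
        super().__init__()
        self.feat = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.value = nn.Linear(64, 1)
        self.adv   = nn.Linear(64, n_exercises)

    def forward(self, s):
        h = self.feat(s)
        a = self.adv(h)
        return self.value(h) + a - a.mean(dim=1, keepdim=True)

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)        # action selection: current network
        q_eval = target_net(s_next).gather(1, best_a).squeeze(1)  # action evaluation: target network
        return r + gamma * (1 - done) * q_eval
```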
Considering the characteristics of personalized learning path recommendation and the current research status, this paper adopts the “Dance Choreography” learning platform constructed with JSP+MySQL technology as the experimental object, and analyzes the experimental effect of the constructed personalized learning path recommendation model.
The “Dance Choreography” e-learning platform consists of four modules: learning navigation, learning resources, problem solving and exploration, and learning interaction. The knowledge items in the modules are categorized according to the chapters of the knowledge points. The learning navigation module consists of learning objectives, the knowledge tree, and key and difficult points. The learning resources module consists of videos, e-lessons, and textbooks. The problem solving and exploration module consists of example-problem analysis, exercises, and quizzes. The learning interaction module consists of discussion forums.
In order to improve the system’s extraction and mining of learning users’ access paths, the original log data need to be further pre-processed. First, the knowledge items under each learning module are redefined and mapped to identifiers, as shown in Table 1.
Knowledge item mapping
| Learning module | Knowledge item | Mapping |
|---|---|---|
| Topic selection | Overall design | K1 |
| Topic selection | Material selection and design | K2 |
| Movement design | Structure design | K3 |
| Movement design | Movement arrangement | K4 |
| Stage presentation | Music selection | K5 |
| Stage presentation | Creation and performance | K6 |
| Stage presentation | Stage composition | K7 |
| Movement foundation | Basic techniques | K8 |
| Movement foundation | Basic dance steps | K9 |
| Movement foundation | Common dance poses | K10 |
In this paper, a web-page data collector was used to capture the learning data of 80 users from the log of the web platform. Considering that the dance choreography learning materials limit the scope of learning modules selected by users, the experiment chooses “Practice of Dance Choreography”, which has a comprehensive distribution of learning modules, as the experimental collection area. The number of visits to learning-user nodes, the learning paths, and the test scores are obtained. The learning-user node access volume refers to each user’s click count and dwell time on each knowledge-item node, as shown in Table 2.
Learner’s node traffic
| Knowledge item | Clicks | Duration (seconds) |
|---|---|---|
| K1 | 103 | 41856 |
| K2 | 97 | 28965 |
| K3 | 106 | 39604 |
| K4 | 471 | 299521 |
| K5 | 434 | 253799 |
| K6 | 240 | 217802 |
| K7 | 366 | 342194 |
| K8 | 584 | 344995 |
| K9 | 317 | 208394 |
| K10 | 281 | 120122 |
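To illustrate how raw platform logs can be turned into the node-traffic statistics of Table 2, the following sketch maps knowledge-item names to the identifiers of Table 1 and accumulates clicks and dwell time; the log record format (user, knowledge item, seconds) is an assumption, since the platform’s actual log schema is not described here.

```python
from collections import defaultdict

# Illustrative preprocessing of platform logs into Table 2-style node traffic.
# Only the K1-K10 mapping follows Table 1; the log record format is assumed.
ITEM_MAP = {"Overall design": "K1", "Material selection and design": "K2",
            "Structure design": "K3", "Movement arrangement": "K4",
            "Music selection": "K5", "Creation and performance": "K6",
            "Stage composition": "K7", "Basic techniques": "K8",
            "Basic dance steps": "K9", "Common dance poses": "K10"}

def node_traffic(log_records):
    clicks, duration = defaultdict(int), defaultdict(int)
    for user, item_name, seconds in log_records:
        node = ITEM_MAP[item_name]
        clicks[node] += 1                 # one click per visited record
        duration[node] += seconds         # accumulated dwell time
    return clicks, duration

clicks, duration = node_traffic([("u1", "Overall design", 120), ("u2", "Basic techniques", 300)])
print(clicks["K1"], duration["K8"])       # 1 300
```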
Based on the method for building models of similar learning users, 8 groups of similar user clusters were established for the 80 learning users, and the recommended TopN-1 was calculated. The parameters were initialized and computed according to the ant colony algorithm, mainly including the similarity values between users and between learning styles; the resulting similar learning paths and personalized recommendations (TopN-2) of each group are shown in Table 3.
The learning path and personalized recommendation of similar user groups
| Learning level | Similar user group | Similar learning path | TopN-2 |
|---|---|---|---|
| 90-100 | A1 | K1, K3, K2, K4, K5, K6, K7, K8 | K9 (70%) |
| 90-100 | A2 | K1, K2, K3, K5, K6, K4, K7, K8 | K10 (13%) |
| 80-90 | B1 | K2, K4, K7, K6, K9, K10, K8 | K3 (71%) |
| 80-90 | B2 | K2, K3, K4, K5, K7, K9 | K6 (80%), K8 (45%) |
| 70-80 | C1 | K5, K4, K6, K8, K9 | K3 (60%) |
| 70-80 | C2 | K4, K6, K7, K10, K8 | K2 (50%), K3 (65%) |
| 60-70 | D1 | K5, K4, K10, K7 | K2 (55%), K3 (35%), K9 (60%) |
| 60-70 | D2 | K6, K9, K10 | K3 (51%), K4 (63%), K8 (72%), K9 (43%) |
Starting from the target requirements of personalized learning path recommendation, we introduce two performance indicators: learning efficiency and misguidance-control effectiveness. Learning efficiency denotes the rate of improvement in learning performance after a period of continuous use of the personalized learning path recommendation. Misguidance-control effectiveness is measured by the increase in the concentration of knowledge-item check-ins after learning users adopt the personalized learning path recommendation scheme, compared with before. The higher the concentration of knowledge-item check-ins, the better the problem of learners getting lost during online learning is controlled.
To this end, a definition of knowledge quantity is first introduced: an e-learning platform is a network system composed of multiple knowledge nodes, each of which consists of a number of knowledge items.
Five dance choreography learning users were randomly selected from each of the eight similar user groups for personalized learning path recommendation, and the misguidance-control rate of the personalized learning path recommendation was obtained for each group, as shown in Table 4.
The misguidance control rate for personalized learning paths
| Learning level | Similar user group | Average misguidance control rate |
|---|---|---|
| 90-100 | A1 | 2.3 |
| 90-100 | A2 | 1.1 |
| 80-90 | B1 | 6.0 |
| 80-90 | B2 | 6.9 |
| 70-80 | C1 | 11.5 |
| 70-80 | C2 | 11.7 |
| 60-70 | D1 | 15.4 |
| 60-70 | D2 | 16.7 |
Comparing the density of choreography knowledge-item check-ins before and after the personalized learning path recommendation, Figures 2 and 3 show the check-in density before and after the recommendation, respectively. Figure 4 shows the trend of achievement after the personalized learning path recommendation.

The knowledge item check-in density before recommendation

The knowledge item check-in density after recommendation

The development trend of dance performance after recommendation
The data show that the personalized learning path recommendation exerts a clear guiding and controlling effect on the learning of choreography course learners: after accepting the recommended learning content, learners’ knowledge-item check-in density is significantly higher than before the recommendation. After receiving the path recommendation, the learners’ dance choreography performance also improves; in particular, learners whose choreography level was between 60-70 points and 70-80 points improve significantly.
In order to test whether the teaching strategy of using reinforcement learning for personalized recommendation of learning paths in dance choreography is practically effective, two classes of dance majors in a university were randomly selected as an experimental class (N=43) and a control class (N=45). The new teaching strategy was applied in the experimental class, while the control class adopted the original teaching strategy. Before the experiment began, the two classes were pretested and compared in terms of their performance levels in dance choreography, and it was found that there was no significant difference between the pretest scores of the two classes (p>0.05), so it was considered that the two classes were homogeneous and fulfilled the requirements of the experiment.
The post-test data are the end-of-semester exam results at the end of the teaching experiment on optimized, recommended dance choreography learning paths. The students chose their dance choreography test questions by drawing lots and completed the choreography of a work on the spot, and each student’s choreography exercise was scored and judged separately by the dance instructors of the preschool education major. The post-test scores obtained are shown in Table 5; the average score of the experimental class was 0.54 points higher than that of the control class, and a further t-test analysis was carried out to test whether the difference was statistically significant.
The performance of the students’ dance choreography skills
| Items | Class A (experimental) | Class B (control) |
|---|---|---|
| Overall design | 7.27 | 6.93 |
| Material selection and design | 6.88 | 6.45 |
| Music selection | 6.85 | 6.40 |
| Structure design | 6.60 | 5.79 |
| Movement arrangement | 6.48 | 5.88 |
| Stage composition | 6.82 | 6.38 |
| Basic dance steps | 6.64 | 6.19 |
| Common dance poses | 6.82 | 6.10 |
| Basic techniques | 6.79 | 6.26 |
| Creation and performance | 6.52 | 5.88 |
| Mean | 6.77 | 6.23 |
The pre-test and post-test scores of the experimental class were entered into SPSS 19.0 for a paired-samples t-test; the results are shown in Table 6. The obtained Sig. value is 0.000, smaller than the critical value of 0.05, so the null hypothesis that there is no significant difference between the two population means can be rejected. In other words, the paired-samples t-test on the scores measured before and after the implementation of the personalized learning path recommendation-oriented dance choreography teaching experiment shows a significant difference for the experimental class, with the post-test scores markedly higher than the pre-test scores.
The result of paired sample t test
| | Pre-test data (N=43) | Post-test data (N=43) | t | P (sig.) |
|---|---|---|---|---|
| Mean | 6.23 | 6.77 | 13.871 | 0.000* |
From this, it can be judged that the experimental class students’ performance in the dance choreography course improved more significantly at the end of the experiment applying the new teaching strategy than before the experiment began.
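The same paired-samples t-test can be reproduced outside SPSS, for example with SciPy; the score lists below are placeholders rather than the study’s raw data, since only class-level means are reported here.

```python
from scipy import stats

# Paired-samples t-test in Python; the scores below are placeholders,
# not the study's raw data (only class means are reported in the text).
pre_test  = [6.1, 6.4, 5.9, 6.5, 6.3]     # pre-test scores of the same students
post_test = [6.7, 6.9, 6.4, 7.0, 6.8]     # post-test scores after the experiment
t_stat, p_value = stats.ttest_rel(post_test, pre_test)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")   # p < 0.05 -> reject the null hypothesis of equal means
```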
The t-test results for each sub-item of the experimental class are shown in Table 7. The p-values obtained from the paired samples t-tests of the pre- and post-test scores of each sub-item of the students in the experimental class A were all less than the critical value of 0.05.
T test results of each item of the experimental class
| Items | Pre-test | Post-test | P (sig.) |
|---|---|---|---|
| Overall design | 6.95 | 7.27 | 0.007 |
| Material selection and design | 6.23 | 6.88 | 0.000 |
| Music selection | 6.51 | 6.85 | 0.013 |
| Structure design | 5.91 | 6.60 | 0.000 |
| Movement arrangement | 5.97 | 6.48 | 0.000 |
| Stage composition | 6.41 | 6.82 | 0.001 |
| Basic dance steps | 6.38 | 6.64 | 0.000 |
| Common dance poses | 6.02 | 6.82 | 0.000 |
| Basic techniques | 5.92 | 6.79 | 0.000 |
| Creation and performance | 6.01 | 6.52 | 0.000 |
After the implementation of the personalized learning path recommendation-oriented choreography teaching experiment, the post-test scores of the experimental class in the three previously weak areas of “structure design”, “movement arrangement”, and “basic techniques” increased significantly compared with the pre-test scores, by 11.68%, 8.54%, and 16.69% respectively (for example, “structure design” rose from 5.91 to 6.60, an increase of (6.60 − 5.91)/5.91 ≈ 11.7%). This means that the weak links of the experimental class students in dance choreography were effectively improved in this learning path recommendation teaching experiment, and their dance choreography scores improved markedly compared with those before the experiment began.
The pre-test and post-test scores of the control class B students were likewise entered into SPSS 19.0 for paired-samples t-tests. The results of the paired-samples t-tests for the control class on the pre-test and post-test are shown in Table 8.
The result of the paired sample t test for the control class
| Items | Pre-test | Post-test | P (sig.) |
|---|---|---|---|
| Overall design | 6.81 | 7.09 | 0.264 |
| Material selection and design | 6.30 | 6.31 | 0.452 |
| Music selection | 6.75 | 6.33 | 0.276 |
| Structure design | 5.50 | 5.63 | 0.166 |
| Movement arrangement | 5.91 | 5.68 | 0.395 |
| Stage composition | 6.42 | 6.33 | 0.381 |
| Basic dance steps | 6.24 | 6.60 | 0.213 |
| Common dance poses | 5.76 | 6.21 | 0.460 |
| Basic techniques | 6.73 | 6.50 | 0.509 |
| Creation and performance | 5.90 | 6.02 | 0.275 |
| Mean | 6.232 | 6.27 | 0.541 |
The Sig. value of the t-test for the overall mean score is 0.541 > 0.05, so the null hypothesis that there is no significant difference in the overall level of dance choreography before and after the experiment in the control class cannot be rejected; that is, although the control class students’ dance choreography performance increased slightly after receiving the traditional mode of teaching, the increase is not statistically significant. In the t-tests for each sub-dimension of choreography achievement, the Sig. values of all sub-dimensions were likewise greater than 0.05, indicating that the differences between pre- and post-test scores for each sub-dimension were not significant.
It can be inferred that the students’ dance choreography level did not improve significantly after the control class was taught with traditional teaching strategies, and that the teaching effect was not as good as that of the personalized learning path optimization and recommendation teaching strategy proposed in this paper.
In this study, a personalized learning path recommendation model is designed based on a deep reinforcement learning algorithm, which can dynamically provide students with appropriate choreography learning content according to the learning environment. The results show that the density of dance choreography knowledge-item check-ins increases significantly after recommendation with the method of this paper, and the learning users’ dance choreography scores all show an upward trend while learning with this method. In the comparative teaching-strategy practice, the post-test performance of the experimental class using the new teaching strategy is 0.54 points higher than that of the control class (p<0.05), and all sub-items of dance choreography improve significantly; among them, the optimization effects on “structure design”, “movement arrangement”, and “basic techniques” are the most obvious, with improvement rates of 11.68%, 8.54%, and 16.69% respectively. In contrast, the choreography level of the control class did not improve significantly. Accordingly, this paper concludes that the proposed deep reinforcement learning algorithm can effectively optimize personalized learning paths for dance choreography and has reliable effects in teaching practice.