Construction of an Accurate Tracking and AI Evaluation System for Dance Movements by Incorporating Image Recognition Technology
Published Online: Mar 17, 2025
Received: Nov 05, 2024
Accepted: Feb 10, 2025
DOI: https://doi.org/10.2478/amns-2025-0294
© 2025 Jianchao Luan et al., published by Sciendo
This work is licensed under the Creative Commons Attribution 4.0 International License.
Dance is an art form in which people express emotion and beauty through rhythmic body movement, and it is among the earliest artistic behaviors in human history. As a performing art carried by the limbs and trunk, dance depends on the professional interpretation of each dancer [1-3]. Every excellent dance work requires dancers with a solid foundation: through continuous training, each dancer must improve body coordination, flexibility, suppleness, and control of strength, and must practice the movements until they are fluent, so that the performance fully demonstrates the charm of dance [4-6].
Traditional dance training is inseparable from the dance teacher’s on-site teaching and assistance. With the outbreak of COVID-19, some traditional classrooms had to change from on-site tutoring to online teaching. Under epidemic prevention and control, many dance learners needed to learn and train online [7-9]. However, dance is a special discipline: whether in a teacher’s online classroom or in a practitioner’s online self-study, it is difficult for teachers to identify clearly where a student’s movements go wrong, and equally difficult for learners to recognize problems in their own movements through remote video observation alone. Online dance learning therefore faces several difficulties [10-11]. First, teachers and practitioners cannot obtain accurate movement data intuitively from video alone. Second, practitioners have difficulty recognizing the key defects in the corresponding dance techniques. Third, online learning lacks interactive feedback and an intuitive evaluation of movement quality [12-13].
With the increasing maturity of image recognition technology, human posture data can be acquired in real time from recorded video or a live camera [14-15]. However, extracting joint coordinates for every frame of a dance video produces a very large amount of data, and it is extremely difficult for both learners and teachers to draw useful information about the movement from such a mass of coordinates. How to track dance movements accurately and derive an effective movement quality assessment from complex posture data is therefore an important research direction in online dance teaching [16-20].
This paper focuses on accurate tracking and intelligent evaluation of dance movements. An industrial camera acquires dance movement images, which are transmitted over a network cable through a switch to a computing workstation. The dance movements in the images are segmented with the CTC human movement segmentation algorithm to obtain the posture movement sequence, and the captured sequences are analyzed with a human skeletal model. Using the GL-Compound similarity calculation method, global and local parameters are combined to represent the movement of each body region, the similarity matching sequence between human posture skeleton joint point sequences is calculated, and the accuracy and completion of the dance learner’s movements are analyzed quantitatively.
The acquisition setup of the human posture analysis system is depicted in Fig. 1. It mainly comprises an industrial camera, a switch, a computing workstation, and a human-computer interaction interface. The industrial camera performs image acquisition; the camera and the computing workstation are each connected to the switch by a network cable, so that image data is transmitted from the camera to the computing workstation.

Human attitude analysis system
In recent years, artificial intelligence technology has advanced rapidly and has been applied in color selection, medical testing, autonomous driving, security, and other fields. These technologies are inseparable from the acquisition and processing of images. In the recognition process of the industrial camera system, the camera takes a picture of the target object and passes it to the switch through the network port, and the picture is then processed by the algorithm to obtain the target area. Dance is in constant motion, and an industrial camera can continuously capture multiple frames of the dance at different moments, so that the dancer is imaged throughout the movement. The captured images are transferred to the memory buffer of the industrial computer and processed by a specific algorithm to obtain the external characteristics of the movement, in preparation for the later analysis.
To capture high-quality, high-resolution images of dance movements, an industrial camera with a high frame rate, high resolution, and low cost is needed. Industrial cameras are mainly divided into CMOS and CCD types. With the development of semiconductor technology, CMOS industrial cameras have improved greatly relative to CCD cameras in resolution, power consumption, cost, and vibration resistance, so this paper uses a CMOS industrial camera to capture images of dance movements. The camera’s image acquisition function allows three-dimensional data to be collected for 25 skeletal joint points of the human dance posture. Based on the 3D data, the human skeletal structure is established, the depth information of the human body is reconstructed, the dynamic change of the joint points is tracked in 3D space as the dance posture changes, and a 3D data set of human dance posture movement is generated for subsequent research.
CTC stands for connectionist temporal classification, a classification model for temporal sequences. Treating the video as a continuous image sequence that preserves human pose information, the action sequence is segmented according to the trend of the major joint positions across consecutive frames. For action segmentation on the captured image data, the CTC idea yields an end-to-end approach that generates predictions directly from the original data, transforming the segmentation problem into a sequence labeling problem. CTC determines the segmentation points automatically, so the segmentation points need not be annotated manually; this greatly improves segmentation efficiency and avoids the cumbersome feature extraction of traditional methods. The CTC-based human action segmentation model proposed in this section consists of three parts: GRU neural units, logistic regression, and the CTC loss. The GRU unit is the basic building block; it avoids the long-term dependence problem of ordinary RNN units and requires fewer parameters and less training time than LSTM. The CTC loss is calculated from the model output and the true label sequence of the action sequence. During training, the model differentiates the CTC loss and updates its parameters accordingly. The input to the model is a human action time sequence, and the output is the probability that each frame in that sequence belongs to each label or category.
For the sequence labeling task, assuming that all of its labels constitute the alphabet A, the number of softmax output layer units of CTC is one unit more than the alphabet A labels. The first |
With F representing a many-to-one mapping, a transformation that maps an output path
Mapping F can convert numerous different output paths into the same labeled sequence, a step made possible by CTC’s use of unsegmented data, which allows the network to predict labeled sequences with unknown segmentation points.
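As an illustration of the many-to-one mapping F, the following minimal sketch uses the standard CTC convention of merging consecutive repeated labels and then dropping the blank. The blank symbol `-`, the function name `collapse_path`, and the toy labels are assumptions made for this example, not names from the paper.

```python
from itertools import groupby

BLANK = "-"  # hypothetical symbol for the extra (blank) output unit


def collapse_path(path, blank=BLANK):
    """Map a frame-level output path to a label sequence.

    Standard CTC collapse: merge consecutive repeats, then remove blanks.
    """
    merged = [label for label, _ in groupby(path)]
    return [label for label in merged if label != blank]


# Three different frame-level paths that all collapse to the labels a, b.
paths = ["aa-b-", "-a-bb", "a--b-"]
for p in paths:
    print(p, "->", collapse_path(p))

# A blank between two identical labels keeps them as separate labels.
print("a-ab", "->", collapse_path("a-ab"))
```

Because several paths map to one label sequence, the probability of a label sequence is obtained by summing over all paths that collapse to it, which is why no frame-level segmentation is needed in the training data.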
The analysis of human movement posture based on the acquired images estimates the posture of the human body from one or more viewpoints. To analyze dance posture data, this paper uses feature points to mark the joint points of the human body and regards the human skeleton as a model of connected rigid bodies: the line connecting two adjacent feature points represents a rigid body, so the segment between adjacent joint points does not deform at any moment. The main action postures are the movements of the head, torso, hips, and limbs.
Human movement is a complex process. Leaving aside the roles of the muscles, the nervous system, and other factors, it can be abstracted as the movement of a simple chain system of connected rigid bodies. Let Coor Joint denote the absolute position coordinates of a feature point in the world coordinate system. The upper limb is composed of two rigid bodies, the upper arm and the lower arm, connected by the elbow joint; the lower limb is composed of two rigid bodies, the thigh and the calf, connected by the knee joint, with the thigh attached to the hips at the hip joint. The head, torso, and hips are each represented as one rigid body by a line connecting the joint points. Given this rigid body structure, the spatial coordinates of the marker points are the data from which the vectors are calculated, and these vectors are the basis for the similarity matching analysis that follows.
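The step from joint coordinates to rigid body vectors can be sketched as follows. The joint names, the coordinates, and the helper functions are hypothetical and chosen only for illustration; the paper does not give its own data layout.

```python
import math

# Hypothetical world coordinates of three left-arm joints (metres).
joints = {
    "shoulder_l": (0.20, 1.40, 0.00),
    "elbow_l": (0.20, 1.10, 0.00),
    "wrist_l": (0.20, 1.10, 0.25),
}

# Each rigid body is the segment between two adjacent joints (parent, child).
bones = {
    "upper_arm_l": ("shoulder_l", "elbow_l"),
    "lower_arm_l": ("elbow_l", "wrist_l"),
}


def bone_vector(joints, parent, child):
    """Vector pointing from the parent joint to the child joint."""
    return tuple(c - p for p, c in zip(joints[parent], joints[child]))


def bone_length(vector):
    """Euclidean length of a bone vector."""
    return math.sqrt(sum(x * x for x in vector))


vectors = {name: bone_vector(joints, p, c) for name, (p, c) in bones.items()}
lengths = {name: bone_length(v) for name, v in vectors.items()}

for name in bones:
    print(name, vectors[name], round(lengths[name], 3))
```

Under the rigid body assumption, the length of each bone stays constant from frame to frame, so only the direction of each vector carries posture information.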
The human skeleton, when abstracted by skeletal modeling, can be viewed as a rigid body whose relative positions of points within it remain unchanged during rotation and translation. The rotation of a rigid body is the rotation of any point
A quaternion is a generalization of the complex numbers consisting of one real part and three imaginary parts, capable of representing rotations in three-dimensional space. Quaternions are often used in visual SLAM, gaming, navigation, and similar fields to represent rotations and object poses. The definition of the quaternion is given below.
In math, quaternions are higher order complex numbers containing four components, quaternion
In general, for convenience quaternions can be simplified to
Suppose a rotates in three dimensions with axis of rotation
Then this rotation can be expressed as a quaternion
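For readers who want to experiment with this representation, the sketch below uses the standard axis-angle convention, in which a rotation by angle θ about a unit axis n corresponds to the quaternion (cos(θ/2), sin(θ/2)·n). The function names and the (w, x, y, z) component order are assumptions of this example and may differ from the notation of the original equations.

```python
import math


def axis_angle_to_quaternion(axis, theta):
    """Unit quaternion (w, x, y, z) for a rotation of theta radians about axis."""
    x, y, z = axis
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("rotation axis must be non-zero")
    s = math.sin(theta / 2.0) / norm  # also normalizes the axis
    return (math.cos(theta / 2.0), x * s, y * s, z * s)


def quat_mul(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    )


def rotate(point, q):
    """Rotate a 3D point by unit quaternion q via q * p * conj(q)."""
    conj = (q[0], -q[1], -q[2], -q[3])
    _, x, y, z = quat_mul(quat_mul(q, (0.0, *point)), conj)
    return (x, y, z)


# Rotating the x unit vector by 90 degrees about the z axis gives the y unit vector.
q = axis_angle_to_quaternion((0.0, 0.0, 1.0), math.pi / 2.0)
print(rotate((1.0, 0.0, 0.0), q))
```

A unit quaternion stores a joint rotation in four numbers and avoids the gimbal lock of Euler angles, which is one commonly cited reason for preferring it in motion data.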
Based on the advantages of the representation of quaternions, this paper chooses quaternions as the basic unit for describing human movements. For the human body movement sequence, the time sequence
Each frame
A quaternion consists of four components, so
Each frame of data in the image sequence represents the human pose at that moment in time, and the dimension of each frame is
In addition to pose features, human dance motion detection uses initial optical flow features and 3D-SIFT features. The displacement field formed by stacking the optical flow between consecutive pairs of frames serves as the initial optical flow feature; it describes the motion between video frames explicitly, which makes the motion easier to recognize. Whereas 2D-SIFT features are suited to interest point matching between static images, 3D-SIFT features extend the descriptor along the time axis and so also capture motion in video.
Considering the complexity and professionalism of dance movements, and in order to assess the degree of standardization of the movements more accurately, this paper divides the 24 human joint points captured by posture estimation into regions. The division is shown in Fig. 2: five body regions, namely the left hand (LH), right hand (RH), left leg (LL), right leg (RL), and upper body (UB), plus one lumbar-abdominal center region (Root). Practicing dance requires an understanding of the movement of each joint and of its relationship with its parent joint, so the regions need to be described from global to local positions. This paper proposes the GL-Compound method, which introduces the concepts of the global motion parameter (GMP) and the local motion parameter (LMP). The movement of the limbs and of the upper-body core region relative to the center region of the body is represented as the superposition of a global motion and the local motion of the skeletal spatial positions at the limb ends.

GL-Compound physical activity area division
The motion parameters of the left and right arm parts are calculated in the same way, and in the case of the right hand RH, for example, its motion can be regarded as a superposition of the motion of two parts, one part of which is the global motion of the plane
Equation (11) solves for the Euclidean distance between joints in three-dimensional space, with
Any plane is formed by three joints, assuming that the coordinates of the joints are
After Equation (14) obtains the plane normal vector
Equation (15) calculates the cosine angles of the two plane normal vectors
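The three geometric quantities named in this passage (the Euclidean distance between joints, the normal vector of a plane through three joints, and the cosine of the angle between two plane normals) can be sketched with elementary vector operations. The function names and sample coordinates are hypothetical, and the cross-product construction of the normal is the usual textbook one, assumed here because the original equations did not survive.

```python
import math


def sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def joint_distance(p, q):
    """Euclidean distance between two joints in 3D space."""
    return math.dist(p, q)


def plane_normal(j1, j2, j3):
    """Normal of the plane through three joints (not normalized)."""
    return cross(sub(j2, j1), sub(j3, j1))


def cosine_between_normals(n1, n2):
    """Cosine of the angle between two plane normal vectors."""
    denom = math.sqrt(dot(n1, n1)) * math.sqrt(dot(n2, n2))
    if denom == 0.0:
        raise ValueError("degenerate plane: the three joints are collinear")
    return dot(n1, n2) / denom


# Two hypothetical planes: the xy-plane and the xz-plane, which are perpendicular.
n_a = plane_normal((0, 0, 0), (1, 0, 0), (0, 1, 0))
n_b = plane_normal((0, 0, 0), (1, 0, 0), (0, 0, 1))
print(joint_distance((0, 0, 0), (3, 4, 0)), n_a, n_b, cosine_between_normals(n_a, n_b))
```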
Having analyzed the changes in dance posture, the GL-Compound method represents the movement of each body region with global parameters combined with local parameters. This section defines the posture similarity calculation between learners and professional dancers on that basis.
Each beat movement contains the movement parameters of the five body regions and the one core region. The GMP parameter has at most 4 dimensions and the LMP parameter at most 3; parameters of body parts that need not be considered are replaced by 0, which yields a 7-dimensional movement parameter vector for each beat movement. Different basic movements have different numbers of beats. Assuming the dance movement being learned has n beats, the movement data contains n samples, each a 7-dimensional vector, so each set of basic movements is represented as an n × 7 matrix. Dance scoring traditionally attends to the coordination and proportion of movement between the whole body and its parts, and does not depend on the lengths between body joints, so cosine similarity is used to calculate the similarity between the movement parameter vectors of the corresponding body parts of two sets of movements in the same beat. During assessment, to keep the student’s movements synchronized with the template movements while allowing a certain time difference, a metronome interface with voice and text prompts for movement details is needed; it requires the user to complete a specific movement within a given time frame and so keeps the movement tempo synchronized as far as possible:
Equation (16) illustrates that when learners follow a dance movement, the feature vector of a body part used to represent the movement parameters is
When the overall similarity S < 0.5, the learner is considered not to have learned the movement well. When S > 0.5, the learner’s movements are considered standard, and when S > 0.8, the completion is considered high and excellent. The system points out to the user the body parts with lower similarity and emphasizes the exercises to focus on.
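The scoring rule can be illustrated with a minimal sketch. It assumes that the overall similarity S is the mean of the per-beat cosine similarities between the learner’s and the template’s 7-dimensional parameter vectors for one body region; this averaging step, the handling of all-zero vectors, the treatment of S exactly equal to 0.5, and all names and sample values are assumptions of this example rather than details confirmed by the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length parameter vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0  # assumption: an all-zero (unused) vector contributes no similarity
    return dot / (nu * nv)


def overall_similarity(learner, template):
    """Mean per-beat cosine similarity of two n x 7 matrices (assumed aggregation)."""
    if not learner or len(learner) != len(template):
        raise ValueError("both movements need the same non-zero number of beats")
    per_beat = [cosine_similarity(l, t) for l, t in zip(learner, template)]
    return sum(per_beat) / len(per_beat)


def grade(s):
    """Thresholds from the text; S == 0.5 is not specified and falls in the lowest band here."""
    if s > 0.8:
        return "excellent"
    if s > 0.5:
        return "standard"
    return "needs improvement"


# Hypothetical two-beat movement: beat 1 matches exactly, beat 2 only partly.
template = [[1, 0, 0, 0, 1, 0, 0], [0, 1, 0, 0, 0, 1, 0]]
learner = [[1, 0, 0, 0, 1, 0, 0], [0, 1, 0, 0, 0, 0, 1]]
s = overall_similarity(learner, template)
print(round(s, 3), grade(s))
```

Because cosine similarity ignores vector magnitude, two dancers with different limb lengths but the same movement proportions receive the same score, which matches the motivation given above for choosing it.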
Before the proposed detection method is put to use in actual dance posture detection, its feasibility and detection effect need to be analyzed objectively. The NTU-RGB+D human dance posture movement dataset was selected as the basis for this experimental study. The dataset comprises 60 categories of human dance movement with diverse postures and dance types, for a total of 56880 movement samples. A ballet dance video from the dataset is selected and segmented; the result, shown in Table 1, is six segments of different actions.
Table 1. Segmentation results of the dance movement
Fragment number | Action description | Start frame | End frame |
---|---|---|---|
ZT-01 | Bend your knees | 1 | 171 |
ZT-02 | Big kick | 172 | 233 |
ZT-03 | Winding | 234 | 296 |
ZT-04 | Toe | 297 | 355 |
ZT-05 | Bray | 356 | 399 |
ZT-06 | Toe drawing | 400 | 519 |
To verify the detection effect of the proposed human dance posture detection method more intuitively, the proposed method is taken as the experimental group, and the detection method based on the improved ViT and the detection method based on the OpenPose model are taken as control group 1 and control group 2, respectively, for comparative analysis. The intersection over union (IoU) index is selected to evaluate the detection effect. IoU is a criterion for measuring the accuracy of object detection on a specific dataset; it describes the similarity between the detected target area and the actual area, and is computed as:

$$\mathrm{IoU} = \frac{|A \cap U|}{|A \cup U|}$$

where A denotes the dance posture movement interval determined by the detection algorithm and U denotes the actual dance posture movement interval. When the IoU lies within 0.5~1.0, the human dance posture movement detection is considered accurate, and the closer the value is to 1.0, the higher the accuracy. When the IoU is less than 0.5, the detection is considered inaccurate. The six dance posture movement segments were detected separately, MATLAB simulation software was used to calculate the IoU values after applying the three detection methods, and the results are compared in Figure 3.
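As a concrete illustration of this index, the sketch below computes the ratio of intersection to union for two frame intervals with inclusive endpoints. Treating A and U as frame ranges is an assumption based on the start and end frames listed in Table 1; the detected interval used in the example is hypothetical, and only the ground-truth range 172 to 233 (segment ZT-02) comes from the table.

```python
def interval_iou(detected, actual):
    """Intersection over union of two inclusive frame intervals (start, end)."""
    a_start, a_end = detected
    u_start, u_end = actual
    if a_start > a_end or u_start > u_end:
        raise ValueError("interval start must not exceed interval end")
    # Number of frames shared by both intervals (0 if they do not overlap).
    inter = max(0, min(a_end, u_end) - max(a_start, u_start) + 1)
    union = (a_end - a_start + 1) + (u_end - u_start + 1) - inter
    return inter / union


actual = (172, 233)    # segment ZT-02 from Table 1
detected = (180, 240)  # hypothetical detector output
iou = interval_iou(detected, actual)
print(round(iou, 3), "accurate" if iou >= 0.5 else "inaccurate")
```

With these numbers the two intervals share 54 frames out of a union of 69, so the detection would count as accurate under the 0.5 threshold stated above.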

Comparison of the IoU index for the three methods
The detection results of the three methods differ significantly. For the experimental group, the intersection over union (IoU) of the detection method proposed in this article always lies within 0.5~1.0; all values exceed 0.85 and are close to 1.0. With the method of control group 1, the IoU for dance posture segments 2 and 4 is less than 0.5, and with the method of control group 2, the IoU for dance posture segments 3 and 6 is less than 0.5. The proposed dance posture detection method is therefore highly feasible, with a significant advantage in detection accuracy.
From the NTU-RGB+D dataset, the upper body, the lower body, and the whole body regions are selected, and RGB features, SIFT features, optical flow (Flow) features, and combined SIFT and optical flow features (SIFT+Flow) are extracted from each to compare recognition rates. Table 2 displays the recognition results of the different features extracted from the different body regions.
Table 2. Recognition rates of different features extracted from different body regions
Body part | RGB | SIFT | Flow | SIFT+Flow |
---|---|---|---|---|
Upper body | 10.2 | 25.5 | 34.2 | 48.3 |
Lower body | 10.5 | 26.2 | 34.7 | 48.0 |
Upper body + lower body | 24.3 | 34.3 | 46.2 | 58.1 |
Whole body | 22.8 | 31.7 | 41.7 | 55.8 |
The recognition rates show that features extracted from the upper body alone and from the lower body alone perform similarly. The lower body is slightly higher for RGB, SIFT, and Flow (10.5, 26.2, and 34.7, against 10.2, 25.5, and 34.2), while for SIFT+Flow the upper body is marginally higher (48.3 against 48.0). The recognition rate is highest when the combination of body regions (upper body + lower body) is used, and recognizing the whole-body region directly gives a lower recognition rate than the combination. One reason is that, among the selected movement categories, the number of dances dominated by upper body movements is close to the number dominated by lower body movements, with lower body dances somewhat more numerous. When the whole-body region is used and the movement involves only the upper body or only the lower body, the two regions interfere with each other. In addition, dance movements are often named after their lower body and upper body movements separately, so training the regions separately improves recognition accuracy.
The experiment was run on a PC with a Core(TM) i5-3470 3.2 GHz CPU and 4 GB RAM, with Matlab as the development environment. The motion database contains 18 sets of dance movement clips of about 1200 frames each. The experimental subjects are randomly selected college students with a basic knowledge of dance.
The experimental subjects were asked to imitate the standard movements in the database and perform the corresponding dance movements under the motion capture system. The joint point movement characteristics of each subject’s left arm were extracted and compared with the standard movements; this paper focuses on the final movements of the left arm in a single section of the dance (within 0-10 s). Taking a local motion sequence captured in real time as an example, the results are shown in Fig. 4, which compares each main movement change of the test subject with the standard movement. The comparison shows that the degree of elbow bending in the 4-6 s interval and the amplitude of the left arm swing in the 6-8 s interval differ significantly from the standard action, and the difference diagram of the left arm posture makes the discrepancy between the tested action and the standard action evident. The experimental results verify that GL-Compound similarity calculation and matching can detect differences from, and conformity to, the standard movement clearly and efficiently, with high robustness, which lays a foundation for the scientific training of dance.

Timing diagram of the left arm motion parameters
Movement quality assessment, that is, assessing how well a performer accomplishes a given task, is of great significance in sports, medicine, and teaching. In dance, for example, performance is judged by experts with many years of training in the field; without on-site evaluation by these experts, dancers cannot get the feedback they need to improve. An automatic tool for assessing the quality of dancers’ movements is therefore needed.
In this paper, the GL-Compound method represents the movement of each body region using global parameters combined with local parameters. The scoring results are shown in Fig. 5, where the line graph shows the attention assigned to the high-quality score features of the dynamic flow branch for one sample. For convenience of presentation, only 32 of the 256 attention segments are plotted, and two high-attention segments and two low-attention segments are shown. Each segment is 16 frames long, of which only 8 frames are displayed in the figure.

Qualitative study of the quality score decoupling module
According to the model design of this paper, a high dance movement similarity score for a clip means that the clip shows movements that contribute positively to the quality score. For example, the contestant in the image clip from frames 50 to 100 completed a good Freeze movement with perfect body control and thus received a high score, up to 0.94. The contestant in the clip from frames 120 to 150 did not complete the movement and controlled the body posture poorly, and thus received a lower score, as low as 0.17. In the clip from frames 225 to 250 the contestant only performed a turn, and thus also received a lower dance movement similarity score.
In this paper, dance movement images are captured by an industrial camera, and human movements are then segmented using the CTC method. Human posture is recognized through the human skeletal model. The GL-Compound method is introduced to calculate the similarity matching sequence between the skeleton joint point sequences of human postures, completing the quantitative analysis of the accuracy and completion of the dance learner’s movements. The analysis shows that the proposed human dance posture detection method has a good detection effect, with intersection over union (IoU) values exceeding 0.85, and that the detection accuracy advantage is significant for dance movements dominated by lower body movement. Using GL-Compound similarity calculation and matching, the dance posture of the human body is analyzed precisely to obtain the effective differences in movement, which provides theoretical support for the scientific training of dance.