The application of topological data analysis to human motion recognition

Human motion analysis is an important research topic in computer vision, with a wide range of applications such as video surveillance, medical assistance and virtual reality. It concerns the detection, tracking and recognition of human activities and behaviours. The development of low-cost range sensors enables precise 3D tracking of body position. The aim of this paper is to present and evaluate a novel method based on topological data analysis (TDA) for motion capture (kinematic) processing and human action recognition. In contrast to existing methods of this type, we characterise human actions in terms of topological features. The recognition process is based on topological persistence, which is stable under perturbations. The advantages of TDA are noise resistance and the ability to extract global structure from local information. The proposed method deals very effectively with the task of human action recognition, even on the difficult classes of motion found in karate techniques. To evaluate our solution, we performed three-fold cross-validation on a data set containing 360 recordings across twelve motion classes. The classification process requires neither machine learning nor dynamical systems theory. The proposed classifier achieves a total recognition rate of 0.975 and outperforms the state-of-the-art methods (Hachaj, 2019) that use support vector machines and principal component analysis-based feature generation.


Introduction
Human motion analysis is an important field of applied computer science with many practical applications. Among these applications we can mention sport and health surveillance, medical and disability assistance, gaming and human-computer interaction (Mokari, Mohammadzade and Ghojogh, 2020). The process of human motion registration is called motion capture (MoCap). Various sensors are used to perform this registration, depending on the specific MoCap technology, for example, video cameras or inertial measurement units (IMU). After the motion data have been gathered, they are processed in order to generate motion features that are useful for further analysis. In most cases, motion is modelled as a multidimensional time series in which each time series represents the coordinates of a certain part of the body. However, these measurements of biological (kinematic) activities might differ between individuals even if they perform the same action. This is because each person has slightly different body proportions and flexibility, which affects motion trajectories. Additionally, the same action may be performed at different speeds. Therefore, human action recognition remains among the most challenging problems of digital signal classification. In the following subsections, we discuss principal component analysis (PCA)-based methods of action recognition, which are considered the most effective methods for motion feature selection. We then discuss state-of-the-art action classification with topological data analysis.

Action classification with topological data analysis
Besides the most popular approaches, which utilise already verified and effective PCA-based features and various classifiers, topology-based approaches have recently begun to emerge in the field of human action analysis. Homology theory can be successfully used in motion recognition; however, relatively few studies have been reported on this topic, particularly in the field of human action classification. Periodic motion analysis based on dynamical systems theory is presented in Dirafzoon, Lokare and Lobaton (2016), Tralie (2016), Tralie and Berger (2018), Vejdemo-Johansson, Pokorny, Skraba and Kragic (2015) and Venkataraman, Ramamurthy and Turaga (2016). These types of approaches are sometimes combined with machine learning methods such as SVM (Anirudh, Venkataraman, Natesan Ramamurthy and Turaga, 2016; Som et al., 2018) or convolutional neural networks (Umeda, 2017). A method of analysing collective motion supported by machine learning is described in Bhaskar et al. (2019).
https://doi.org/10.37705/TechTrans/e2021011
There are also other approaches that can be used to solve similar tasks. Comprehensive and state-of-the-art surveys on motion capture pattern recognition methods can be found in Presti and La Cascia (2016), Cornacchia, Ozcan, Zheng and Velipasalar (2016) and Idris et al. (2019).

Motivation for this paper
Based on the survey presented in the previous section, we can observe that the topology-based approaches focus on simple periodic activities such as walking, running, bicycling and waving. Moreover, they need additional methods and tools from the field of dynamical systems and machine learning. In practice, many potential applications of motion analysis require the precise classification of complex, non-periodic movements. This classification should be independent of the speed of execution, insignificant deviations, the individual's proportions and other characteristics of the body. Improperly performed or classified movement may result in serious negative consequences, e.g. in medical rehabilitation. In this paper, we propose a novel topological method of classifying complex human motion. In contrast to existing topological approaches, this method does not need to be supported by machine learning or dynamical systems tools. Despite this, the presented algorithm achieves high levels of accuracy in recognising very complex and varied movements such as martial arts techniques. Both the implementation of the proposed method and the data set we used for its evaluation are available to download in order to enable our research to be reproduced.

Material and methods
In this section, we will present the proposed human action classification method and the data set we have used for evaluation purposes.

Topological overview
The number of n-dimensional holes in a topological space is counted by the rank of the n-th homology group, called the n-th Betti number (β_n). For example, for X ⊂ ℝ^3, β_0, β_1 and β_2 are the numbers of connected components, independent tunnels and independent voids, respectively. Topological spaces are usually constructed from simplexes (points, edges, triangles, tetrahedrons…) or hypercubes (points, edges, squares, cubes…) that form structures called simplicial complexes (Edelsbrunner, Letscher & Zomorodian, 2000) or cubical complexes (Mrozek, Żelawski, Gryglewski, Han & Krajniak, 2012), respectively. Unfortunately, Betti numbers are sensitive to noise and do not differentiate between small and large holes. This problem has been solved by so-called persistent homology (Edelsbrunner, Letscher & Zomorodian, 2000; Zomorodian & Carlsson, 2005). Persistent homology tracks the changes of the Betti numbers when the topological space is gradually built by adding simplexes or cubes in a specific order. The holes are born, persist for some time and then die. The holes which die quickly are likely to be caused by noise, whereas the holes with a long lifetime represent the actual topology of the data.
A set of points in ℝ^n, called a point cloud, can be transformed into a simplicial complex, for example, the Vietoris-Rips complex. In a filtration process, a sequence of increasing subcomplexes is created, and the number of holes and their lifetimes are calculated and visualised as a so-called persistence diagram or barcode (Edelsbrunner & Harer, 2010; Ghrist, 2008). The barcode is a collection of intervals whose lengths correspond to the lifetimes of the holes. The persistence diagram is a collection of points whose coordinates are the birth and death times of the holes. These 2-dimensional diagrams visualise global geometrical properties of the multidimensional shape formed by the point cloud.
The strength of TDA lies in its ability to extract global structure from local information and its stability under perturbations. TDA is very successful in high-dimensional data analysis.
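For 0-dimensional homology of a Vietoris-Rips filtration, the birth and death of holes described above can be traced with a union-find structure: every point is born as its own component at scale 0, and a component dies when an edge first connects it to another one. The sketch below is a minimal illustration, assuming that an edge enters the filtration at a scale equal to the Euclidean distance between its endpoints; the function name and the sample cloud are hypothetical and are not taken from the published implementation.

```python
import math
from itertools import combinations


def zero_dim_barcode(points):
    """Return the finite death times of the 0-dimensional classes, in increasing order.

    Assumption: an edge enters the Vietoris-Rips filtration at a scale equal
    to the Euclidean distance between its endpoints. Every point is born at
    scale 0; the one component that never dies is not reported.
    """
    parent = list(range(len(points)))

    def find(i):
        # Walk up to the component representative, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[a], points[b]), a, b)
        for a, b in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Two components merge, so one of them dies at this scale.
            parent[root_a] = root_b
            deaths.append(length)
    return deaths


# Hypothetical cloud: two tight clusters that are far apart.
cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
barcode = zero_dim_barcode(cloud)
```

In this toy cloud the three short bars come from merges inside the clusters, while the single long bar reflects the gap between the clusters, which is the kind of long-lived feature that persistence is meant to separate from noise.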

Data set
The data set we used for validation was an open data set containing the motion capture of three experienced Shorin-Ryu karate athletes (world and national medalists) (Github, online). This data set has already been used in research on human action classification (Hachaj, 2019) and proved to be challenging due to the complexity of the motions. Each of the three individuals in the recordings performed twelve types of karate techniques, each repeated 10 times (360 recorded actions altogether). There are four types of blocking techniques (age uke and gedan barai with the left and right hand), two types of elbow strikes (empi with the left and right elbow) and six types of kicks (hiza geri [knee kick], mae geri and yoko geri with the left and right leg). Detailed descriptions of those motions with illustrations can be found, for example, in Funakoshi (1996). This particular data set is very interesting and challenging for action recognition purposes due to several factors. It consists of karate techniques which are well defined and are easily repeatable by skilled karate practitioners. These techniques are performed at high speed and involve large parts of the human body. The proportions and flexibility of the human body are important: the same technique may have different ranges of motion depending on each person's individual characteristics. There are not many examples of each technique, which makes the machine-learning procedure difficult. All techniques start from the same initial stance (zenkutsu dachi) and only the middle parts of each action differ. We also have to emphasise that skilled fighters perform the initial parts of attacks (kicks and punches) in a similar manner, in order to avoid signalling to the opponent their intention of using a particular technique. To an amateur observer, limb trajectories of various techniques might seem very similar. This particular data set was acquired using the Shadow 2.0 wireless motion capture system consisting of 17 IMU (inertial measurement unit) sensors with a tracking frequency of 100 Hz, 0.5 degrees of static accuracy and 2 degrees of dynamic accuracy. Motion recordings contain twenty body joints. More details about the system, its calibration and the output data can be found in Hachaj, Piekarczyk and Ogiela (2017).

Action recognition
Suppose we have measurements coming from the location sensors used during any physical activity (for example, a martial arts technique). The following notation is used: M is the number of sensors, K is the number of measurements for each sensor, and s_ijk is the i-th coordinate of the location of the j-th sensor at time t_k, where i ∈ {1, 2, 3}, j ∈ {1, …, M} and k ∈ {1, …, K}. In TDA, the point cloud is transformed into a nested family of simplicial complexes (for instance, the Čech or Vietoris-Rips complex) in the filtration process. The persistent homology of the filtered simplicial complex (the lifetime of holes in the complex) is visualised by barcodes or persistence diagrams.
To measure the similarity of martial arts techniques represented by the point clouds C_1 and C_2, one can compute the bottleneck or Wasserstein distance (Edelsbrunner & Harer, 2010) W_p(B(C_1), B(C_2)) between their barcodes B(C_1) and B(C_2).
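For 0-dimensional barcodes of a Rips filtration, every bar starts at 0, so a barcode can be stored as a list of death times and the Wasserstein distance reduces to a one-dimensional matching problem. The sketch below is an illustration under stated assumptions rather than the routine used in the experiments: points of the persistence diagram are compared with the maximum norm, a bar may be matched to the diagonal at a cost of half its length, and p ≥ 1. Under these assumptions an optimal matching can be taken to be order-preserving, which allows a simple dynamic programme. The function name is hypothetical.

```python
def wasserstein_zero_dim(deaths_a, deaths_b, p=1):
    """p-Wasserstein distance between two barcodes whose bars all start at 0.

    Assumptions: diagram points (0, d) are compared with the maximum norm,
    so two bars cost |d1 - d2| and a bar sent to the diagonal costs d / 2.
    """
    a = sorted(deaths_a)
    b = sorted(deaths_b)
    # cost[i][j]: cheapest way to account for the first i bars of a
    # and the first j bars of b.
    cost = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        cost[i][0] = cost[i - 1][0] + (a[i - 1] / 2) ** p
    for j in range(1, len(b) + 1):
        cost[0][j] = cost[0][j - 1] + (b[j - 1] / 2) ** p
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + abs(a[i - 1] - b[j - 1]) ** p,  # match two bars
                cost[i - 1][j] + (a[i - 1] / 2) ** p,  # bar of a goes to the diagonal
                cost[i][j - 1] + (b[j - 1] / 2) ** p,  # bar of b goes to the diagonal
            )
    return cost[-1][-1] ** (1 / p)


# A long bar and a short bar: sending both to the diagonal (2 + 0.5)
# is cheaper than matching them to each other (3).
example = wasserstein_zero_dim([4.0], [1.0])
```

General persistence diagrams, whose births differ, need a full assignment solver instead of this shortcut.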
To improve the classification process, an additional stage is introduced before the topological recognition. Let h be the index of the hip sensor. For the point cloud C and the j-th sensor, the bounding box is described by its minimal and maximal coordinates relative to the initial hip position. For the point clouds C_1 and C_2, the distance between the bounding boxes of the j-th sensor location is then defined in terms of the Euclidean distance d.
Let h_l, h_r, f_l, f_r be the indexes of the sensors on the left hand, right hand, left foot and right foot, respectively. The distance d_hf between the hand and foot bounding boxes of two point clouds is defined from the bounding box distances of these four sensors.
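The bounding box stage can be sketched as follows. The concrete formulas in this sketch are assumptions made for illustration and are not the authors' definitions: each box is the pair of minimal and maximal corners of a sensor trajectory after subtracting the initial hip position, two boxes are compared by adding the Euclidean distances between matching corners, and the hand and foot distance is the sum over the four limb sensors. The dictionary keys and function names are hypothetical.

```python
import math

# Hypothetical sensor names; a recording maps each name to a list of (x, y, z).
LIMBS = ("left_hand", "right_hand", "left_foot", "right_foot")


def bounding_box(track, hip_start):
    """Minimal and maximal corner of one sensor track, relative to hip_start."""
    shifted = [tuple(c - h for c, h in zip(point, hip_start)) for point in track]
    lower = tuple(min(axis) for axis in zip(*shifted))
    upper = tuple(max(axis) for axis in zip(*shifted))
    return lower, upper


def box_distance(box_1, box_2):
    # Assumed combination: Euclidean distance between the two lower corners
    # plus Euclidean distance between the two upper corners.
    return math.dist(box_1[0], box_2[0]) + math.dist(box_1[1], box_2[1])


def hand_foot_distance(recording_1, recording_2):
    # Assumed combination: sum of the box distances of the four limb sensors.
    total = 0.0
    for limb in LIMBS:
        box_1 = bounding_box(recording_1[limb], recording_1["hip"][0])
        box_2 = bounding_box(recording_2[limb], recording_2["hip"][0])
        total += box_distance(box_1, box_2)
    return total


def make_recording(offset, hand_reach):
    """Toy recording: every limb moves from the hip to one other position."""
    hip = (offset, offset, offset)
    moved = (offset + 1.0, offset + 1.0, offset + 1.0)
    reach = (offset + hand_reach, offset + 1.0, offset + 1.0)
    return {
        "hip": [hip],
        "left_hand": [hip, reach],
        "right_hand": [hip, moved],
        "left_foot": [hip, moved],
        "right_foot": [hip, moved],
    }


same_move_elsewhere = hand_foot_distance(make_recording(0.0, 1.0), make_recording(5.0, 1.0))
longer_reach = hand_foot_distance(make_recording(0.0, 1.0), make_recording(0.0, 2.0))
```

Because every box is expressed relative to the initial hip position, the same movement recorded at a different place in the room gives a distance of zero, while a longer reach of one hand gives a positive distance.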

Results
The algorithm was implemented in R using the TDA package. We have published the source code in an online repository in order to make the experiment reproducible (footnote 1).
The data set presented in Section 2.2 was split into two subsets: training and validation data sets. Training was performed on the data of two people (240 recordings of twelve motion classes), while evaluation was performed on the data of the third person (120 recordings of twelve motion classes). We performed a three-fold cross-validation in which we first used the data from individuals 1 & 2 for training and the data from individual 3 for validation, then the data from individuals 2 & 3 for training and the data from individual 1 for validation. Finally, the data from individuals 1 & 3 were used for training and the data from individual 2 for validation. The results were averaged and we calculated the total recognition rate over all classes. The final result of our topology-based method was 0.975.
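The evaluation protocol described above, in which each person is held out once, can be sketched as follows. This is a minimal illustration: the record layout `(person, label, data)`, the function names and the toy nearest-neighbour classifier are hypothetical and stand in for the real recordings and the proposed classifier.

```python
def leave_one_person_out(recordings, classify):
    """Average recognition rate over folds that each hold out one person.

    recordings: list of (person, label, data) tuples.
    classify:   function (training_recordings, data) -> predicted label.
    """
    people = sorted({person for person, _, _ in recordings})
    rates = []
    for held_out in people:
        train = [r for r in recordings if r[0] != held_out]
        test = [r for r in recordings if r[0] == held_out]
        correct = sum(classify(train, data) == label for _, label, data in test)
        rates.append(correct / len(test))
    return sum(rates) / len(rates), rates


def nearest_neighbour(train, data):
    # Toy classifier on one-dimensional data, used only to exercise the protocol.
    return min(train, key=lambda r: abs(r[2] - data))[1]


toy = [
    (1, "a", 0.0), (1, "b", 10.0),
    (2, "a", 1.0), (2, "b", 11.0),
    (3, "a", 2.0), (3, "b", 4.0),
]
mean_rate, fold_rates = leave_one_person_out(toy, nearest_neighbour)
```

Splitting by person rather than by recording keeps all repetitions of one performer in the same fold, so the reported rate reflects recognition of movements by a person not seen during training.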
Figure 1 presents the point cloud and the obtained barcode for a sample karate technique.
Table 1 presents the confusion matrix. The matrix is row-normalised. Each row represents the judge (ground-truth) label and each column represents the predicted label. Numbers in the table are percentage values (the average accuracy and variance from the three-fold cross-validation).
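Row normalisation of a confusion matrix can be sketched as follows. The labels and predictions in this example are invented for illustration and are not the values reported in Table 1; the function name is hypothetical.

```python
from collections import Counter


def row_normalised_confusion(true_labels, predicted_labels, classes):
    """Confusion matrix in percent; each row (true class) sums to 100."""
    counts = Counter(zip(true_labels, predicted_labels))
    matrix = []
    for actual in classes:
        row_total = sum(counts[(actual, predicted)] for predicted in classes)
        matrix.append([
            100.0 * counts[(actual, predicted)] / row_total if row_total else 0.0
            for predicted in classes
        ])
    return matrix


# Invented example: one of four side kicks is predicted as a front kick.
truth = ["mae geri", "mae geri", "yoko geri", "yoko geri", "yoko geri", "yoko geri"]
predicted = ["mae geri", "mae geri", "yoko geri", "yoko geri", "yoko geri", "mae geri"]
matrix = row_normalised_confusion(truth, predicted, ["mae geri", "yoko geri"])
```

With row normalisation, the diagonal entry of each row is the per-class recognition rate, which makes classes with different numbers of recordings directly comparable.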
As can be seen, the proposed method sometimes misclassifies left and right versions of the same techniques (empi, gedan barai) because topology is not changed by reflections and the bounding boxes may be too similar in both versions. The lowest accuracy and the highest variance are obtained in the case of the yoko geri technique (side kick), which is sometimes misclassified as the mae geri technique (front kick). These techniques differ mainly in the direction of the kick, but from a topological point of view, they are almost identical and only the bounding boxes can identify the key differences.
The results obtained by the proposed algorithm have been compared with three other algorithms that utilise PCA-based motion features and angle-based motion description, namely Das, Wilson, Lazarewicz and Finkel (2006), Zago et al. (2017) and Hachaj (2019). The description of the setups of these three algorithms on the data set presented in Section 2.2 has been published in Hachaj (2019). In Das, Wilson, Lazarewicz and Finkel (2006), PCA is performed on planar angle-based features of motion and each action is mapped onto 2D space. Following this, PCA features were calculated and the data were classified by SVM, obtaining a recognition rate of 0.647 with three-fold cross-validation. In Zago et al. (2017), the movement was interpreted as a time series of postures, where a posture was defined as a 60-dimensional vector composed of the body joint positions at a given time. Finally, 40-dimensional vectors were used for classification with SVM, obtaining a recognition rate of 0.628 with three-fold cross-validation. In the work of Hachaj (2019), there are 100 bagged classifiers trained on ten classes each, each of them utilising twenty-five PCA features, obtaining a recognition rate of 0.939 with three-fold cross-validation. The source code for this experiment can be downloaded from the online repository (Github, online).
1 https://github.com/marcinz00/MartialArtsTDA

Discussion
The presented algorithm is independent of body proportions and the speed of execution because TDA focuses on the study of mutual relations between elements of a data set. Topology is not changed by operations such as translation, rotation and scaling. Therefore, comparing similar movements located in other areas of space (for instance, left and right versions of the same techniques) does not make much sense because it is time consuming and may lead to misclassifications. The bounding boxes used in the first step of the described algorithm effectively eliminate unnecessary matches. Obviously, there are differences in hand and foot length between children and adults, but the length of the boxes may be normalised by estimating the performer's height (e.g. with the vertical distance between the initial positions of the head and hip sensors). Topological persistence is stable under perturbations, so the algorithm can classify patterns effectively even in the case of inaccurate movements or noisy measurements from sensors. The quality and precision of techniques can be estimated by the Wasserstein distance.

Conclusion
The method proposed in this paper deals very effectively with the task of human action recognition, even on difficult data sets, such as those relating to karate techniques. The presented algorithm does not require the use of additional methods such as dynamical systems tools or machine learning. The persistent homology approach provides high accuracy even with very complex and varied martial arts movements. The TDA classifier outperforms the state-of-the-art method (Hachaj, 2019), which is based on PCA-based feature generation and SVM, by 0.036 (a recognition rate of 0.975 compared with 0.939). These results are very promising and show that TDA is an effective tool in the field of human action recognition.
The point cloud C representing the martial arts technique is formed by the recorded sensor locations (s_1jk, s_2jk, s_3jk) ∈ ℝ^3, taken over all sensors j and all time steps t_k.
d_hf measures the similarity of the bounding boxes of the point clouds representing hand and foot movements. This stage significantly reduces the classification time because d_hf detects the difference between techniques very quickly. As a result, the more precise topological classification analyses only those classes that are not rejected during the bounding box stage.
The main algorithm recognises the action represented by a point cloud C. C is matched with each point cloud from a given training set T. Let Id(P) denote the unique identifier of the action represented by the point cloud P ∈ T. First, fast recognition using the bounding boxes is performed. If the classification is uncertain, i.e. the input point cloud is very similar to more than one training point cloud (as controlled by the parameter ε), the algorithm performs the topological recognition process restricted to the most probable techniques, collected in the set S. During topological recognition, the point cloud C is converted to the Rips complex. The obtained barcode is compared to the barcode of each training pattern by computing the Wasserstein distance for 0-dimensional persistent homology, which serves as the measure of similarity. The algorithm returns the unique identifier of the action represented by the training pattern which constitutes the best match to the input point cloud. The parameter ε was set empirically.
Alg. 1. Action recognition
function RecognizeAction(C, T)
    Let R_1 ∈ T be such that d_hf(C, R_1) = min{d_hf(C, P): P ∈ T}
    I = Id(R_1)
    S = {P ∈ T: d_hf(C, P) ≤ ε}
    if card(S) > 2 then
        Let R_2 ∈ S be such that W_p(B(C), B(R_2)) = min{W_p(B(C), B(P)): P ∈ S}
        I = Id(R_2)
    return I
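The two-stage structure of Alg. 1 can be sketched in Python as follows. This is an illustration rather than the published R implementation: the two distance functions are passed in as parameters (standing in for d_hf and for the Wasserstein distance between barcodes), the training set is assumed to be a list of (identifier, point cloud) pairs, and the cardinality threshold is copied from the listing. The toy data at the end are invented.

```python
def recognize_action(cloud, training_set, box_distance, barcode_distance, epsilon):
    """Two-stage recognition: bounding boxes first, barcodes only if needed.

    training_set:     list of (identifier, point_cloud) pairs.
    box_distance:     stand-in for d_hf(C, P).
    barcode_distance: stand-in for W_p(B(C), B(P)).
    """
    scored = [
        (box_distance(cloud, pattern), identifier, pattern)
        for identifier, pattern in training_set
    ]
    # Stage 1: identifier of the pattern with the closest bounding boxes.
    best = min(scored, key=lambda item: item[0])[1]
    # Patterns whose bounding boxes lie within epsilon of the input.
    candidates = [
        (identifier, pattern)
        for distance, identifier, pattern in scored
        if distance <= epsilon
    ]
    # Stage 2: threshold as written in the listing of Alg. 1.
    if len(candidates) > 2:
        best = min(candidates, key=lambda item: barcode_distance(cloud, item[1]))[0]
    return best


# Toy example in which "point clouds" are plain numbers.
training = [("a", 0.0), ("b", 1.0), ("c", 1.2), ("d", 50.0)]
toy_box = lambda c, p: abs(c - p)
toy_barcode = lambda c, p: {0.0: 3.0, 1.0: 2.0, 1.2: 1.0, 50.0: 9.0}[p]

fast_only = recognize_action(1.05, training, toy_box, toy_barcode, epsilon=0.01)
with_topology = recognize_action(1.05, training, toy_box, toy_barcode, epsilon=2.0)
```

With a small ε the answer comes from the bounding boxes alone; with a larger ε several candidates survive the first stage and the barcode distance makes the final choice among them.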

Fig. 1. The point cloud (B) and the barcode (C) for a yoko geri kick with the left leg (A: screenshot from BVHacker software). In panel B, the positions of the left foot, right foot, left hand and right hand are marked in red, green, blue and orange, respectively.