Introduction

Action recognition technology is widely used in video understanding, human–computer interaction, intelligent control and other fields, but it is hampered by background clutter, illumination changes, occlusion and camera jitter. Thus, the accuracy of action recognition algorithms still faces great challenges.

In the early stage of the video understanding field, approaches based on hand-crafted feature representation were the main research direction. Dense trajectories (DT) [1] and its improved version, improved dense trajectories (iDT), showed the best performance among such methods. With the development and maturity of deep learning technology, researchers turned to deep learning algorithms for action recognition: Simonyan et al. [3] proposed the two-stream method based on RGB and optical flow; Feichtenhofer et al. [4] introduced a residual structure into the two-stream convolutional network for information exchange between streams; Tran et al. proposed C3D [5] and an improved structure [6], which use 3D convolution to jointly learn static appearance and motion characteristics. However, these video feature-based methods (two-stream and 3D convolution methods) tend to be affected by background, illumination, camera movement and other factors, cannot represent human action sequence information well, and perform poorly on data sets containing complex actions.

Benefiting from the improved performance of human pose estimation and related algorithms, skeleton-based methods, which are not affected by background, illumination and other factors, have become increasingly popular in the action recognition field. The traditional skeleton-point method [7] requires hand-crafted features and traversal rules, which is inefficient. Common deep learning skeleton methods construct the skeleton information as coordinate vectors or pseudo-images and feed them into convolutional neural networks (CNNs) or recurrent neural networks (RNNs) for action recognition [8, 9, 10, 11, 12]. Graph convolution methods [13, 14, 15] construct the human skeleton points as graph nodes and the connections between skeleton points as graph edges, then apply an operation analogous to traditional 2D convolution on the skeleton graph, achieving significant results.

The spatial–temporal graph convolutional neural network (ST-GCN) [13] models the dynamic skeleton based on the time-sequence representation of human joint positions, and extends graph convolution into a spatial–temporal graph convolutional network. As the first method to use a graph convolutional neural network for skeleton-based action recognition, it differs from previous methods in that it implicitly learns the information of various body parts by exploiting the locality and temporal dynamics of graph convolution. By eliminating the need for manual assignment of body parts, the model is easier to design and can learn better action representations. However, the convolution operation in ST-GCN is performed only on the 1-neighbour set of the root node, so global node information cannot be modelled or represented. For example, the interacting joints in brushing teeth, clapping and other actions are not adjacent, so it is necessary to learn the relationship between these joints through an attention mechanism to improve recognition performance. Common attention methods include SENet [16], the convolutional block attention module (CBAM) [17] and the non-local network [18].

To solve the above problems, this paper proposes an ST-GCN algorithm based on node attention (NA-STGCN). In the NA-STGCN network, we introduce an attention module that helps the network focus on the connections between different nodes (both adjacent and non-adjacent) and learn the importance of each node. Specifically, we introduce the attention mechanism of SENet into the convolutional layers of ST-GCN to learn the correlation between nodes. The effect of introducing the attention module was verified by experiments. The experimental results on NTU-RGB+D show that NA-STGCN, with the node attention module introduced, improves accuracy over ST-GCN.

The second part of this paper introduces the related work; the third part introduces the original ST-GCN model and our proposed NA-STGCN model; the fourth part presents the experimental results and analysis; and the last part summarises the algorithm.

Related work

With the rapid development of human pose estimation and graph neural networks, most common skeleton-based action recognition methods can now be categorised into three classes: CNN-based, RNN-based and graph convolutional network-based methods.

The traditional method in [7] requires traversal rules and hand-crafted features to realise skeleton action recognition, which is inefficient and inaccurate. Recently, deep learning has achieved great success, which has made deep learning-based skeleton modelling methods an active research topic. Among CNN-based methods, Liu et al. [8] put forward a novel two-stream model based on 3D CNNs, which had not been proposed before; Li et al. [9] put forward a global spatial aggregation scheme, which learns joint co-occurrence features better than local aggregation. Among RNN-based methods, building on RNNs with long short-term memory (LSTM) and convolutional neural networks, Zhang et al. [10] designed the view-adaptive networks VA-RNN and VA-CNN. To analyse the hidden sources of action-related information within the input data over both the spatial and temporal domains concurrently, Liu et al. [11] extended RNN-based methods to the spatio-temporal domain. Following the physical structure of the human body, Du et al. [12] divided the human skeleton into five parts and fed them separately into five bidirectional recurrent subnetworks; the method in [12] is an end-to-end hierarchical RNN for skeleton-based action recognition. Yan et al. [13] put forward ST-GCN, a dynamic skeleton model that automatically learns both spatial and temporal patterns from data; it is therefore superior to previous methods and overcomes their limitations, improving not only expressive ability but also generalisation ability. In graph convolution operations, Shi et al. [14] used non-local attention to model multi-level semantic information, which increases the flexibility of the graph construction model and its generality to diverse data samples. Inspired by deformable part-based models (DPMs), Thakkar et al. [15] designed a part-based graph convolutional network that improves recognition performance compared with a model using the entire skeleton graph; in this network, the skeleton graph is divided into four subgraphs with joints shared between subgraphs.

Method
Original ST-GCN model

In the method proposed in [13], the CUHK team extended the graph neural network to a spatial–temporal graph model, called ST-GCN, to provide a general representation of skeleton sequences for action recognition, as shown in Figure 1(a). The model is built on a sequence of skeleton graphs, where each node corresponds to a joint of the human body. There are two types of edges: spatial edges consistent with the natural connectivity of the joints, and temporal edges connecting the same joint across consecutive time steps. Multiple spatial–temporal graph convolution layers are stacked to extract feature information from the graph, and a SoftMax classifier is then used for prediction.

Fig. 1

(a) Spatial–temporal skeletal graph. (b) Partitioning strategy graph.

According to studies on motion analysis, the spatial structure of the graph in ST-GCN is partitioned as shown in Figure 1(b). The neighbour set of each node is divided into three subsets. Taking a shoulder node as an example, the first subset is the node itself (green point), the second subset is the set of adjacent nodes closer to the centre of gravity of the whole skeleton (blue point), and the third subset is the set of neighbouring nodes farther from the centre of gravity (yellow points). Each colour represents a learnable weight used for learning the information between nodes. A minimal sketch of how such a partitioned adjacency could be constructed follows.
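As an illustration, the following is a minimal sketch (not the authors' code) of how such a three-subset adjacency tensor might be built; the `edges` list and the `hop_to_center` distances (graph distance from each joint to the joint nearest the skeleton's centre of gravity) are hypothetical inputs.

```python
import numpy as np

def build_partitioned_adjacency(edges, hop_to_center, V):
    """Sketch of the partition strategy in Figure 1(b): subset 0 is the node
    itself, subset 1 its neighbours nearer the centre of gravity, subset 2
    its neighbours farther from it."""
    adj = np.zeros((V, V))
    for i, j in edges:                      # undirected skeleton bones
        adj[i, j] = adj[j, i] = 1
    A = np.zeros((3, V, V))
    A[0] = np.eye(V)                        # green: the root node itself
    for i in range(V):
        for j in range(V):
            if adj[i, j]:
                if hop_to_center[j] < hop_to_center[i]:
                    A[1, i, j] = 1          # blue: closer to the centre
                else:
                    A[2, i, j] = 1          # yellow: farther from the centre
    return A
```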

In a single frame, the graph convolution process of ST-GCN can be expressed by the following equation:

$$f_{out} = \sum\limits_j \left( \Lambda_j^{-\frac{1}{2}} A_j \Lambda_j^{-\frac{1}{2}} \otimes M_j \right) f_{in} W_j \quad (1)$$

where f_in denotes the input feature map of size C_in × T × V; f_out denotes the output feature map of size C_out × T × V; C_in and C_out are the numbers of input and output channels; T is the number of video frames; V is the number of skeleton points; A is the adjacency tensor of size 18 × 18 × 3, where 18 is the number of human skeleton points (taking OpenPose as an example) and 3 is the number of subsets in the partition strategy; the element A_ij of matrix A represents the connectivity from node i to node j; A_0 = I is the self-connection matrix of each node, A_1 corresponds to the near-centre subset and A_2 to the far-centre subset; $\Lambda_j^{ii} = \sum_k A_j^{ki} + \alpha$ is the normalised diagonal matrix, with α set to 0.001 to prevent empty rows in A_j; ⊗ denotes element-wise multiplication between matrices; W_j is a convolution kernel of size C_out × C_in × 1 × 1, used for the 1 × 1 convolution operation; and M is a learnable attention matrix of size V × V, used for learning the importance of different skeleton points, initialised to all ones. It can be seen from Eq. (1) that when an element of matrix A is 0 there is no connection between the corresponding nodes, and no matter what value M takes, the output remains 0; hence M can only learn weights for nodes within the one-neighbour set.
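For concreteness, the following is a minimal PyTorch sketch of Eq. (1). It is not the authors' implementation, and it uses the simpler column normalisation A_j Λ_j^{−1} in place of the symmetric Λ_j^{−1/2} A_j Λ_j^{−1/2} form, a simplification often used in practice.

```python
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """Minimal sketch of Eq. (1); A has shape (K, V, V), K = 3 partition subsets."""
    def __init__(self, c_in, c_out, A):
        super().__init__()
        K, V, _ = A.shape
        # Column-normalise each partition; alpha = 0.001 keeps empty columns finite.
        Lambda = A.sum(dim=1, keepdim=True) + 0.001       # (K, 1, V)
        self.register_buffer('A', A / Lambda)             # ~ A_j * Lambda_j^{-1}
        self.M = nn.Parameter(torch.ones(K, V, V))        # mask M, initialised to ones
        self.conv = nn.Conv2d(c_in, K * c_out, kernel_size=1)  # the 1x1 kernels W_j
        self.K, self.c_out = K, c_out

    def forward(self, x):                                 # x: (N, C_in, T, V)
        n, _, t, v = x.shape
        x = self.conv(x).view(n, self.K, self.c_out, t, v)
        # Sum over partitions j of (A_j ⊗ M_j) applied along the joint axis.
        return torch.einsum('nkctv,kvw->nctw', x, self.A * self.M)
```

For example, `SpatialGraphConv(3, 64, A)` applied to a (batch, 3, T, 18) tensor of joint coordinates yields a (batch, 64, T, 18) feature map.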

Our proposed NA-STGCN model

In the ST-GCN network, the receptive field of the convolution kernel covers only the one-neighbour range, so it can only extract local feature information. Global feature information plays a more important role in actions where the graph distance between the interacting nodes is greater than 1, such as clapping, drinking water, and so on. Therefore, we propose a node attention model called NA-STGCN, which integrates the attention method of SENet into ST-GCN. By introducing the attention module, the network can focus on the connections between different nodes (both adjacent and non-adjacent) and learn the importance of each node.

Figure 2 is a schematic diagram of the NA-STGCN network structure. GCN denotes the spatial graph convolution operation and TCN denotes the temporal convolution operation. The left part of Figure 2 shows the overall structure: the network contains nine GCN + TCN modules (layers 1–9), each with a residual connection. The number of output channels is 64 for layers 1–3, 128 for layers 4–6 and 256 for layers 7–9. A sketch of this backbone configuration follows the figure.

Fig. 2

Structure diagram of NA-STGCN. GCN, graph convolutional neural network; NA-STGCN, node attention spatial–temporal graph convolutional neural network.
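The channel layout described above could be assembled as follows; `STGCNBlock` is a hypothetical class standing in for one GCN + TCN module with a residual connection, and the 3 input channels of layer 1 (3D joint coordinates) are our assumption, not stated in Figure 2.

```python
import torch.nn as nn

# Channel widths per the text: layers 1-3 output 64, layers 4-6 output 128,
# layers 7-9 output 256. The 3-channel input (x, y, z coordinates) is assumed.
CHANNELS = [(3, 64), (64, 64), (64, 64),
            (64, 128), (128, 128), (128, 128),
            (128, 256), (256, 256), (256, 256)]

def build_backbone(block_cls):
    """Stack the nine GCN + TCN blocks; block_cls(c_in, c_out) is hypothetical."""
    return nn.Sequential(*(block_cls(c_in, c_out) for c_in, c_out in CHANNELS))
```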

The idea for the attention module in our study comes from SENet. In SENet, the SE module first performs a squeeze operation on the feature map obtained by convolution to get channel-level global features; it then performs an excitation operation on the global features to learn the relationships between channels and obtain their weights; finally, these weights are multiplied by the original feature map to get the final features. In essence, the SE module performs attention, or gating, along the channel dimension. The advantage of this attention mechanism is that it focuses on the channel features carrying the most information while suppressing unimportant channel features.
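For reference, a standard SE block (as in SENet [16]) can be sketched as follows; the reduction ratio of 16 is SENet's default, not a parameter taken from this paper.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard SENet channel attention (squeeze-and-excitation)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                  # x: (N, C, H, W)
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))             # squeeze: global average pool -> (N, C)
        w = self.fc(s).view(n, c, 1, 1)    # excitation: per-channel weights in (0, 1)
        return x * w                       # reweight the original feature map
```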

However, our proposed attention module differs slightly from SENet. As shown in the dotted box on the right side of Figure 2, we first globally pool the c × t dimensions of the GCN output feature map to obtain node-level features; then, because the number of nodes is small (e.g., 18 or 25), we perform the excitation operation on the node features to learn the relationships between nodes, followed by a squeeze operation to obtain the importance of each node; finally, we multiply these weights by the original feature map to obtain the final features. By adding this attention-learning module, NA-STGCN learns the correlation between different nodes. In Section 4, following the conclusion in [19] that using non-local attention in ST-GCN and adding attention modules in layers 2 and 3 achieves better results, we experimentally explore the effect of adding our node-attention modules in layers 2 and 3. Moreover, to analyse the importance of every node, we estimate the class activation map (CAM [20]) of every node. Finally, we compare NA-STGCN with previous representative methods.
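Our reading of this node attention module is sketched below under the same hedges: pooling runs over the c × t dimensions so that gating happens per joint, and the reduction ratio of 2 is a hypothetical choice since the paper does not state one.

```python
import torch.nn as nn

class NodeAttention(nn.Module):
    """Hedged sketch of the node attention module: SE-style gating over the
    V joint nodes instead of the C channels."""
    def __init__(self, num_nodes, reduction=2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(num_nodes, num_nodes // reduction), nn.ReLU(inplace=True),
            nn.Linear(num_nodes // reduction, num_nodes), nn.Sigmoid())

    def forward(self, x):                  # x: (N, C, T, V) GCN output
        n, _, _, v = x.shape
        s = x.mean(dim=(1, 2))             # pool the c x t dimensions -> (N, V)
        w = self.fc(s).view(n, 1, 1, v)    # per-node importance weights in (0, 1)
        return x * w                       # reweight every joint's features
```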

Experiment

To verify the global feature extraction ability and recognition performance of the NA-STGCN structure, experiments on skeleton-based action recognition were carried out on NTU-RGB+D [21]. The experimental platform is as follows: Linux system, an i7-7700 CPU, a 1070 graphics card, 16 GB of memory, and the PyTorch deep learning framework.

Data sets

The NTU-RGB+D data set is a public data set annotated with 3D joint information for human action recognition. It contains 56,880 action clips covering 60 action categories. All the clips were performed by 40 volunteers in a laboratory environment and captured by three cameras at the same height but different horizontal angles: −45°, 0° and 45°. The data set uses the 3D joint positions detected by the Kinect sensor in each frame, with 25 joints per subject in each skeleton sequence. There are two partition protocols for the NTU-RGB+D data set, cross-subject (CS) and cross-view (CV), and both are used to test the recognition accuracy of the model.

Experimental allocation

The experiments are carried out with the PyTorch deep learning framework. The optimiser is stochastic gradient descent (SGD) with Nesterov momentum of 0.9; the initial learning rate is 0.1 and is decayed to 0.01 and 0.001 at epochs 10 and 50, respectively; the training batch size is 16; cross entropy is used as the loss function for gradient backpropagation; and the weight decay factor is 10^−4.
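These hyperparameters map onto PyTorch roughly as follows; `model` is assumed to be an NA-STGCN instance defined elsewhere, and the milestone scheduler is our reading of the stated decay schedule.

```python
import torch

# Hedged sketch of the training configuration described above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=1e-4)
# Learning rate decays to 0.01 at epoch 10 and 0.001 at epoch 50.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[10, 50], gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()  # cross-entropy loss; batch size 16
```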

Experimental results and analysis
Loss comparison between NA-STGCN and ST-GCN

To verify that NA-STGCN has better global information modelling ability than the traditional spatial–temporal graph convolution, loss comparison experiments were carried out on the NTU-RGB+D data set in the CV partition. Figure 3 shows the loss curves of ST-GCN and NA-STGCN as a function of the number of training epochs. As shown in Figure 3, NA-STGCN converges faster than ST-GCN, and its loss value is also lower.

Fig. 3

Change curve of loss values. NA-STGCN, node attention spatial–temporal graph convolutional neural network; ST-GCN, spatial–temporal graph convolutional neural network.

Comparison of node activation response maps between NA-STGCN and ST-GCN

To verify that NA-STGCN has the ability to model global node information and to learn the relative importance of different nodes, we estimated the response values of the different nodes of a person in specific action segments using the method in [20]. Figure 4 shows the node response maps of NA-STGCN and ST-GCN for the clapping and brushing actions. We chose these two actions as analysis examples because they rely heavily on information exchange between non-adjacent joints. A sketch of how such per-node responses can be computed is given after the figure.

Fig. 4

Node activation response maps: (a) top left: ST-GCN clapping action; (b) top right: NA-STGCN clapping action; (c) bottom left: ST-GCN brushing action; (d) bottom right: NA-STGCN brushing action. NA-STGCN, node attention spatial–temporal graph convolutional neural network; ST-GCN, spatial–temporal graph convolutional neural network.
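As a hedged sketch of how such per-node responses can be obtained in the spirit of CAM [20] (the paper does not give its exact procedure): weight the final-layer feature map by the classifier weights of the target class, then pool over channels and time; the function and argument names below are ours.

```python
import torch

def node_cam(feat, fc_weight, cls):
    """Per-node class activation map; feat is the final-layer feature map of
    shape (C, T, V), fc_weight the classifier weights (num_classes, C)."""
    w = fc_weight[cls]                               # (C,) weights of target class
    cam = torch.einsum('c,ctv->tv', w, feat)         # class-weighted sum over channels
    cam = cam.mean(dim=0)                            # average over time -> (V,)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)  # normalise to [0, 1]
    return cam                                       # per-node response values
```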

According to Figure 4(a) and 4(b), for the clapping action, NA-STGCN has larger response values than ST-GCN at the hand, elbow and shoulder nodes, and smaller response values at the trunk and lower-limb nodes, indicating that through the node attention method NA-STGCN has learned the importance of different nodes to the action category. Similarly, Figure 4(c) and 4(d) show that, for the brushing action, NA-STGCN has larger response values at the left-hand and neck nodes than at other parts. Since this action shows a person brushing with the left hand, the response of NA-STGCN better matches the actual situation.

Overall, NA-STGCN can model the global information of nodes through the node attention method and adaptively learn the importance of different nodes.

Comparison of results between NA-STGCN and representative methods

In order to verify the recognition accuracy of the NA-STGCN model, experiments and tests were carried out on the NTU-RGB+D skeleton action data sets in CV and CS partition modes. The experimental results are shown in Table 1.

Comparison with representative methods (%).

Model CS CV
Two-Stream 3DCNN [8] 66.8 72.6
TCN [22] 74.3 83.1
Clip + CNN + MTLN [23] 79.6 84.8
VA-LSTM [10] 79.4 87.6
ST-GCN [13] 81.5 88.3
NA-STGCN (ours) 85.8 89.3

CS, cross-subject partition; CV, cross-view partition; LSTM, long short-term memory; NA-STGCN, node attention spatial–temporal graph convolutional neural network; ST-GCN, spatial–temporal graph convolutional neural network.

As can be seen from Table 1, on the NTU-RGB+D data set, the recognition accuracy of NA-STGCN is 85.8% in the CS partition mode and 89.3% in the CV partition mode. Compared with ST-GCN, NA-STGCN improves accuracy by 4.3% and 1.0% in the CS and CV partition modes, respectively. The improvement in the CS partition mode is significant, whereas the gain in the CV partition mode is limited to 1%, possibly because the camera angle affects the learning of node attention.

In conclusion, the experimental results of the three comparisons above show that NA-STGCN achieves higher accuracy and faster convergence than ST-GCN and other representative methods. Moreover, the node activation analysis shows that NA-STGCN can model the global information of nodes and adaptively learn the importance of different nodes.

Conclusion

In this study, a new skeleton graph neural network, NA-STGCN, is proposed to solve the flaw of the original ST-GCN model, which learns information only from one-neighbour nodes. Specifically, we add the SENet attention mechanism to the GCN layer to enable the network to learn the interaction information of all joints. NA-STGCN can therefore learn long-range relationships in skeletal action sequences, overcoming the defect of ST-GCN. The experiments carried out on NTU-RGB+D show that the recognition accuracy of NA-STGCN is better than that of ST-GCN. Moreover, to analyse the importance of every node, we estimated the CAM of every node; the results show that NA-STGCN can focus on the connection relationships between different nodes (both adjacent and non-adjacent) and learn the importance of each node. In the future, we will study the effects of multiple attention mechanisms on ST-GCN and explore methods to improve ST-GCN accuracy by combining multiple modalities.
