Open Access Paper
Skeleton-based one-shot action recognition on graph neural network
24 May 2022
Hao Chen, Yuzhuo Fu, Ting Liu
Proceedings Volume 12260, International Conference on Computer Application and Information Security (ICCAIS 2021); 122600E (2022) https://doi.org/10.1117/12.2637828
Event: International Conference on Computer Application and Information Security (ICCAIS 2021), 2021, Wuhan, China
Abstract
In our work, we implemented one-shot action recognition using skeleton data. For data preprocessing, we mapped skeleton sequence coordinates into signal images. In the feature extraction module, we used a ResNet18-based feature extractor. For the few-shot learning part, we adopted a metric learning model based on a graph neural network. With this pipeline, we achieved leading accuracy on the NTU RGB+D 120 one-shot dataset.

1.

INTRODUCTION

Research on one-shot recognition in computer vision mainly focuses on areas such as one-shot action recognition, pedestrian re-identification, and face recognition. Existing methods perform well on one-shot action recognition, but typically focus on a single modality such as images or skeleton sequences. We instead consider a signal-level representation that allows flexible encoding of signals into images and fusion of signals from different sensor modalities.

Encoding skeleton data into signal images has several benefits. First, it generalizes across different sensor modalities, as long as the sensor produces multivariate sequential data. Second, an image-like representation makes it possible to reuse well-established, high-performing image classification architectures.

In our research, the signals come from three-dimensional skeleton sequences collected by an RGB-D camera, along with other video-derived measurements. For details on the signal representation, see Section 3.

Since few-shot learning algorithms need to make full use of the relationship between the support set data and the query set data1, GNNs have great potential for few-shot learning. Garcia and Bruna2 construct a graph in which the support set and the query set are densely connected. We experimentally verified this model on few-shot image classification, matching state-of-the-art performance with fewer parameters.

To our knowledge, our method is the first to apply a GNN-based structure to one-shot action classification on skeleton data. We achieve the best accuracy on the 5-way 1-shot task of the NTU RGB+D 120 one-shot dataset.

2.

RELATED WORK

We briefly outline methods related to data representation, one-shot recognition, and graph neural networks.

2.1

Image representation

Our method is based on image representations of sensor sequences3. Wang et al.4 encoded joint trajectories into images from three different spatial perspectives. Liu et al.5 proposed a combination of skeleton visualization methods and jointly trained them across multiple streams.

2.2

Few-shot classification

The matching network1 introduced a new class of methods that uses an attention mechanism to build a differentiable nearest-neighbor classifier, and the prototypical network6 extends it by defining prototypes as the mean of the embedded support examples of each class. Meta-LSTM7 updates the model with an LSTM meta-learner, treating the learner's parameters as the LSTM's internal state. MAML8 learns only a good parameter initialization and then adapts with plain SGD. Reptile9 uses only first-order gradients.

2.3

Graph neural network

Gori et al.10 first proposed the graph neural network, formulated as trainable recurrent message passing whose fixed points can be adjusted. Li et al.11 further proposed a variant using gated recurrent units and modern optimization techniques. Kipf and Welling12 introduced a scalable approach to semi-supervised learning on graph-structured data. GNNs have since been applied to few-shot learning2, 13, framing the task as node labeling.

3.

APPROACH

To solve the action recognition task across a range of sensor modalities, we consider the problem at the signal level. The signals are encoded into a discernible image representation. This image-like representation allows direct adoption of established image classification architectures for feature extraction. To solve the one-shot problem, we apply a metric network based on a graph neural network. An illustration of our approach is given in Figure 1.

Figure 1.

The framework of one-shot action recognition. Φ(x) denotes the embedding network (ResNet18 or Resw).


3.1

Problem definition

We consider the one-shot action recognition problem as a few-shot metric learning problem. First, we encode the action sequence, converting the signal level to an image representation. The input in our case is a signal matrix S ∈ ℝ^{N×M}, where each row vector represents a discrete 1-dimensional signal and each column vector represents a sample of all sensors at one specific time step. The matrix is transformed to an RGB image I ∈ {0,…,255}^{H×W×3} by normalizing the signal length M to W and the range of the signals to H. The identity of each signal is encoded in the color channel. Our dataset thus consists of N training images I_1, …, I_N with labels y_i ∈ {1,…,C}. We consider input-output pairs (𝒯_i, Y_i) drawn from a distribution P of partially-labeled image collections

𝒯 = { {(x_1, l_1), …, (x_s, l_s)}, {x̃_1, …, x̃_r}, {x̄_1, …, x̄_t} },  l_i ∈ {1, …, K},  x_i, x̃_j, x̄_k ∼ 𝒫_l(ℝ^N),    (1)

and Y = (y_1, …, y_t) ∈ {1, …, K}^t, for arbitrary values of s, r, t and K. Here s is the number of labeled samples, r is the number of unlabeled samples, t is the number of samples to classify, and K is the number of classes. 𝒫_l(ℝ^N) denotes a class-specific image distribution over ℝ^N. Given a training set {(𝒯_i, Y_i)}_{i≤L}, we consider the standard supervised learning objective min_Θ (1/L) Σ_{i≤L} ℓ(Φ(𝒯_i; Θ), Y_i) + 𝒱(Θ), using the model Φ specified in Section 3, where 𝒱 is a standard regularization objective.

When r = 0, t = 1 and s = qK, there is a single image in the collection with unknown label. If, in addition, each label occurs exactly q times, this setting is called q-shot, K-way learning. In our work, we focus on the one-shot problem (q = 1).
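For concreteness, the episode construction can be sketched as follows. This is a minimal illustration of the q = 1 setting, not our actual data loader; the `labels` index (sample index → class id) is hypothetical.

```python
import random
from collections import defaultdict

def sample_one_shot_episode(labels, K=5, seed=None):
    """Sample a K-way one-shot episode: s = K labeled supports (q = 1),
    r = 0 unlabeled samples, and t = 1 query with unknown label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, c in labels.items():
        by_class[c].append(idx)

    classes = rng.sample(sorted(by_class), K)       # episode classes
    support = [(rng.choice(by_class[c]), ep)        # one labeled sample per class
               for ep, c in enumerate(classes)]

    q_label = rng.randrange(K)                      # ground-truth label of the query
    used = {i for i, _ in support}
    pool = [i for i in by_class[classes[q_label]] if i not in used]
    query_idx = rng.choice(pool or by_class[classes[q_label]])
    return support, query_idx, q_label
```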

3.2

Representation

Our method is based on a discernible image representation. We therefore propose a novel and compact signal-level representation: multivariate signals or higher-level feature sequences are recombined into a 3-channel image. Each row of the resulting image corresponds to one joint, and each column corresponds to one sample of the sequence. The color channels red, green and blue represent the signals' x-, y- and z-values, respectively. The resulting images are normalized to the range 0-1.
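A minimal sketch of this encoding is given below. The per-channel normalization over the whole sequence is one plausible reading, as the paper does not spell out the normalization granularity; resizing to the backbone's fixed input size is assumed as a final step.

```python
import numpy as np

def skeleton_to_signal_image(seq):
    """Encode a skeleton sequence as a 3-channel signal image (Section 3.2).

    seq: array of shape (T, J, 3) -- T frames, J joints, (x, y, z) per joint.
    Returns a (J, T, 3) float image: row = joint, column = time sample,
    R/G/B = x/y/z, normalized to [0, 1].
    """
    img = np.transpose(np.asarray(seq, dtype=np.float32), (1, 0, 2))  # (J, T, 3)
    lo = img.min(axis=(0, 1), keepdims=True)   # per-channel minimum
    hi = img.max(axis=(0, 1), keepdims=True)   # per-channel maximum
    return (img - lo) / np.maximum(hi - lo, 1e-8)
    # The image is then resized to the backbone's fixed H x W input.
```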

3.3

Feature extraction

The feature extraction module of our model is based on a ResNet structure. We expect that, after joint training, the feature extraction module combines tightly with the few-shot classifier module. We used ResNet18 and Resw (the network optimized by our approach) for feature extraction.

On the one hand, we apply a 1×1 convolution layer before the two 3×3 convolution layers in every main stage of ResNet18, i.e., in each residual block. This increases the nonlinearity of the model before each residual block is down-sampled, and the results show improved feature extraction. On the other hand, we postpone the down-sampling within the residual block by adjusting the strides of the two convolution layers.
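The exact layer configuration of Resw is not fully specified in the text; the following PyTorch sketch shows one plausible reading of the two modifications (the added 1×1 convolution, and the stride moved from the first to the second 3×3 convolution to delay down-sampling).

```python
import torch.nn as nn

class ReswBlock(nn.Module):
    """Sketch of the modified ResNet18 basic block (assumed details)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv0 = nn.Conv2d(in_ch, out_ch, 1, bias=False)  # added 1x1: extra nonlinearity
        self.bn0 = nn.BatchNorm2d(out_ch)
        self.conv1 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=stride,
                               padding=1, bias=False)          # delayed down-sampling
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = self.relu(self.bn0(self.conv0(x)))
        out = self.relu(self.bn1(self.conv1(out)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))
```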

After the last feature layer, we use a two-layer perceptron to transform the features to the embedding size. The embedding is then refined by the metric learning approach.
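The two-layer perceptron head might look as follows; the embedding width of 128 is a hypothetical choice, as the paper does not report it.

```python
import torch.nn as nn

def embedding_head(feat_dim=512, emb_dim=128):
    """Map the backbone's pooled features (512-d for ResNet18) to the
    embedding size refined by the metric network. emb_dim is assumed."""
    return nn.Sequential(
        nn.Linear(feat_dim, feat_dim),
        nn.ReLU(inplace=True),
        nn.Linear(feat_dim, emb_dim),
    )
```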

3.4

Metric network

The goal of few-shot learning is to propagate label information from labeled samples to unlabeled query images. This propagation can be formalized as posterior inference over a graphical model determined by the input images and labels. The input 𝒯 contains a collection of labeled and unlabeled images. We associate 𝒯 with a fully-connected graph G_𝒯 = (V, E), whose nodes v_a ∈ V correspond to the images in 𝒯 (both labeled and unlabeled).

In the graph neural network part, we are given an input signal F ∈ ℝ^{|V|×d} on the vertices of a weighted graph G. We apply a family ℳ of graph-intrinsic linear operators that act locally on this signal, including the adjacency operator A, where (AF)_i := Σ_{j∼i} w_{i,j} F_j, with i∼j if (i, j) ∈ E and w_{i,j} the associated weight. A GNN layer Gc(·) receives as input a signal F^{(k)} ∈ ℝ^{|V|×d_k} and produces F^{(k+1)} ∈ ℝ^{|V|×d_{k+1}} as

F^{(k+1)} = Gc(F^{(k)}) = ρ( Σ_{B∈ℳ} B F^{(k)} θ_B^{(k)} ),    (2)

where θ_B^{(k)} ∈ ℝ^{d_k×d_{k+1}} are trainable parameters and ρ(·) is a point-wise non-linearity. Based on this basic formula, we did some exploration. In particular, inspired by message-passing algorithms, we stack a multilayer perceptron on the absolute difference between two node vectors:

Ã^{(k)}_{i,j} = ψ(F_i^{(k)}, F_j^{(k)}) = MLP( |F_i^{(k)} − F_j^{(k)}| ),    (3)

where ψ is a symmetric function parametrized by a neural network. ψ acts as a metric, learned as a nonlinear combination of the absolute differences between the individual features of the two nodes. We then normalize the trainable adjacency to a stochastic kernel by applying a softmax along each row. Adding the edge feature kernel Ã^{(k)} to the generator family ℳ and applying equation (2) yields the update rule for the node features.
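A compact PyTorch sketch of equations (2) and (3) follows, assuming the operator family ℳ = {I, Ã^{(k)}} (identity plus the learned kernel) and a two-layer MLP for ψ; the hidden width and nonlinearities are illustrative choices, not values from the paper.

```python
import torch
import torch.nn as nn

class LearnedAdjacency(nn.Module):
    """Equation (3): A_ij = MLP(|F_i - F_j|), rows softmax-normalized."""
    def __init__(self, d, hidden=64):
        super().__init__()
        self.psi = nn.Sequential(nn.Linear(d, hidden),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(hidden, 1))

    def forward(self, F):                                # F: (V, d) node features
        diff = (F.unsqueeze(1) - F.unsqueeze(0)).abs()   # (V, V, d) pairwise |F_i - F_j|
        logits = self.psi(diff).squeeze(-1)              # (V, V) edge scores
        return torch.softmax(logits, dim=-1)             # stochastic kernel per row

class GcLayer(nn.Module):
    """Equation (2) with the family M = {identity, learned adjacency}."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.adj = LearnedAdjacency(d_in)
        self.theta_id = nn.Linear(d_in, d_out)   # theta for B = I
        self.theta_adj = nn.Linear(d_in, d_out)  # theta for B = A
        self.rho = nn.LeakyReLU()

    def forward(self, F):                        # F: (V, d_in) -> (V, d_out)
        A = self.adj(F)
        return self.rho(self.theta_id(F) + self.theta_adj(A @ F))
```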

Adjacency learning is especially important when the input set is assumed to have some geometric structure but the metric is not known a priori, as in our case. In a general graph, the network depth is chosen on the order of the graph diameter, so that all nodes obtain information from the entire graph. In our context, however, since the graph is densely connected, depth is simply interpreted as giving the model greater expressive power.

3.5

Composition of initial node features

The input set 𝒯 is mapped to node features as follows. For a signal image x_i ∈ 𝒯 with known label l_i, the one-hot encoding of the label is concatenated with the embedding features of the image at the input of the GNN:

F_i^{(0)} = ( φ(x_i), h(l_i) ),    (4)

where φ is the ResNet embedding network and h(l_i) is a one-hot encoding of the label.
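A sketch of equation (4) is given below. The uniform fill for the unlabeled query node is an assumption carried over from Garcia and Bruna2, since the text above only specifies the labeled case.

```python
import torch
import torch.nn.functional as Fn

def initial_node_features(embeddings, labels, K):
    """Equation (4): concatenate the image embedding with a label encoding.

    embeddings: (V, d) tensor of phi(x_i); labels: list of class ids,
    with -1 marking the query node whose label is unknown.
    """
    rows = []
    for emb, l in zip(embeddings, labels):
        if l >= 0:
            h = Fn.one_hot(torch.tensor(l), K).float()   # labeled support node
        else:
            h = torch.full((K,), 1.0 / K)                # unknown label (assumed fill)
        rows.append(torch.cat([emb, h]))
    return torch.stack(rows)                             # (V, d + K)
```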

4.

TRAINING

Our model aims to predict the label Y corresponding to the signal image x̄ to be classified, associated with node * in the graph. The last layer of the GNN is therefore a softmax that maps node features to the K-simplex. We then apply the cross-entropy loss evaluated at node *:

ℓ(Φ(𝒯; Θ), Y) = − Σ_{k=1}^{K} y_k log P(Y_* = k | 𝒯).    (5)
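In code, working with the pre-softmax node scores, equation (5) reduces to a one-liner, since the cross-entropy routine applies the log-softmax internally:

```python
import torch
import torch.nn.functional as Fn

def episode_loss(node_scores, query_node, query_label):
    """Equation (5): cross-entropy at the query node *.

    node_scores: (V, K) pre-softmax outputs of the last GNN layer;
    query_node: index of node *; query_label: ground-truth class id.
    """
    target = torch.tensor([query_label])
    return Fn.cross_entropy(node_scores[query_node].unsqueeze(0), target)
```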

5.

EXPERIMENT

We applied our method to two datasets. First, we used skeleton sequences from NTU RGB+D 12014 as our large-scale one-shot recognition dataset. We also used the UTD-MHAD15 dataset to demonstrate the generalization of our model.

5.1

Datasets

a) NTU RGB+D 120: The NTU RGB+D 12014 dataset is a wide-ranging action recognition dataset comprising RGB+D data and skeleton sequences. It consists of 114,480 sequences covering 120 action classes performed by 106 subjects from 155 different views. We use the one-shot protocol defined by the dataset authors: the classes are divided into a training set and a test set whose categories do not overlap, with 100 classes for training and 20 for testing. Classes A1, A7, A13, A19, A25, A31, A37, A43, A49, A55, A61, A67, A73, A79, A85, A91, A97, A103, A109 and A115 are previously unseen.

b) UTD-MHAD: UTD-MHAD15 is a dataset of 27 actions performed by 8 subjects with 4 repetitions each, covering RGB-D image data, skeleton sequences and inertial data. The RGB-D camera is placed frontally to the demonstrating person. We use 23 classes for training and the remaining classes for testing. Besides the 23/4 training/testing split, we also evaluate 19/8 and 15/12 splits. Other works have gone as far as 11/16 and 7/20 splits; we believe our three settings are sufficiently demanding and credible.

5.2

Results

On the NTU RGB+D 120 one-shot dataset we compare against APSR14 and SL-DML3. Table 1 shows the results with a training set of 100 action classes and a test set of 20 previously unseen action classes. Our method (using our own feature extraction backbone Resw) is 1.2% higher than the previous best method (SL-DML). Table 2 shows results for an increasing number of training classes (100 training classes and 20 test classes are considered the standard protocol), as well as the effect of different feature extraction networks on classification. We use ResNet18 and our own feature extraction network Resw, optimized from ResNet; with our backbone we achieve better results than with the standard ResNet18. All results are given in percent; the best results are bolded. The results show that our method performs better when the number of training classes is small, indicating stronger robustness.

Table 1.

One-shot results on NTU-RGB+D 120 one-shot.

Approach | Accuracy (%)
Resw+GNN (ours) | 52.1
SL-DML | 50.9
APSR | 45.3
Average pooling3 | 42.9
Fully connected3 | 42.1

Table 2.

One-shot results with different numbers of training classes on NTU-RGB+D 120 one-shot.

Train classes | ResNet18 (%) | Resw (%) | SL-DML (%) | APSR (%)
20 | 38.9 | 39.3 | 36.7 | 29.1
40 | 42.9 | 43.8 | 42.4 | 34.8
60 | 48.7 | 48.5 | 49.0 | 39.2
100 | 51.5 | 52.1 | 50.9 | 45.3

On the UTD-MHAD dataset, we conducted experiments under two criteria and compare our results with SL-DML at different training/testing ratios; the results are shown in Table 3. On UTD-MHAD we only use our backbone Resw, and all results below are based on it.

Table 3.

One-shot results with different train/test splits on UTD-MHAD.

Train/test | Resw-skl. (%) | Resw-fused (%) | SL-DML-skl. (%) | SL-DML-fused (%)
23/4 | 91.5 | 89.6 | 92.7 | 90.2
19/8 | 74.6 | 76.8 | 74.8 | 76.0
15/12 | 80.7 | 78.5 | 81.1 | 78.7

6.

CONCLUSION

In our work, we achieve state-of-the-art results for one-shot action recognition on the NTU RGB+D 120 dataset. We are the first to apply a graph neural network to the task of one-shot action recognition on skeleton data. We have also optimized the structure of the ResNet embedding network, increasing the nonlinearity of the model and the effectiveness of the feature extraction part, which leads to better results with our new backbone.

REFERENCES

[1] Vinyals, O., Blundell, C. and Wierstra, D., "Matching networks for one shot learning," NIPS, 3630–3638 (2016).

[2] Garcia, V. and Bruna, J., "Few-shot learning with graph neural networks," ICLR (2018).

[3] Memmesheimer, R., Theisen, N. and Paulus, D., "SL-DML: Signal level deep metric learning for multimodal one-shot action recognition," ICPR, 4573–4580 (2020).

[4] Wang, P., Li, W., Li, C. and Hou, Y., "Action recognition based on joint trajectory maps with convolutional neural networks," Knowledge-Based Systems, 158, 43–53 (2018). https://doi.org/10.1016/j.knosys.2018.05.029

[5] Liu, M., Liu, H. and Chen, C., "Enhanced skeleton visualization for view invariant human action recognition," Pattern Recognition, 68, 346–362 (2017). https://doi.org/10.1016/j.patcog.2017.02.030

[6] Snell, J., Swersky, K. and Zemel, R., "Prototypical networks for few-shot learning," NIPS, 4077–4087 (2017).

[7] Ravi, S. and Larochelle, H., "Optimization as a model for few-shot learning," ICLR (2017).

[8] Finn, C., Abbeel, P. and Levine, S., "Model-agnostic meta-learning for fast adaptation of deep networks," ICML (2017).

[9] Nichol, A., Achiam, J. and Schulman, J., "On first-order meta-learning algorithms," CoRR (2018).

[10] Gori, M., Monfardini, G. and Scarselli, F., "A new model for learning in graph domains," Proc. IJCNN (2005). https://doi.org/10.1109/IJCNN.2005.1555942

[11] Li, Y., Tarlow, D., Brockschmidt, M. and Zemel, R., "Gated graph sequence neural networks," (2015).

[12] Kipf, T. N. and Welling, M., "Semi-supervised classification with graph convolutional networks," ICLR (2017).

[13] Liu, Y., Lee, J., Park, M., Kim, S. and Yang, Y., "Transductive propagation network for few-shot learning," ICLR (2019).

[14] Liu, J., Shahroudy, A., Perez, M. L., Wang, G., Duan, L. Y. and Kot, A. C., "NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding," IEEE Transactions on Pattern Analysis and Machine Intelligence, 816–833 (2019).

[15] Chen, C., Jafari, R. and Kehtarnavaz, N., "UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor," 2015 IEEE Inter. Conf. on Image Processing (ICIP), 168–172 (2015).