KEYWORDS: Action recognition, Convolution, Physics, Education and training, 3D modeling, Modeling, Visual process modeling, Neural networks, Deep learning
Human action recognition is important for many applications such as surveillance monitoring, safety, and healthcare. As 3D body skeletons can accurately characterize body actions and are robust to camera views, we propose a 3D skeleton-based human action recognition method. Unlike existing skeleton-based methods that use only geometric features for action recognition, we propose a physics-augmented encoder and decoder model that produces physically plausible geometric features for human action recognition. Specifically, given the input skeleton sequence, the encoder performs a spatiotemporal graph convolution to produce spatiotemporal features for both predicting human actions and estimating the generalized positions and forces of body joints. The decoder, implemented as an ODE solver, takes the joint forces and solves the Euler-Lagrange equation to reconstruct the skeleton in the next frame. By training the model to simultaneously minimize the action classification error and the 3D skeleton reconstruction error, the encoder is driven to produce features that are discriminative and consistent with both the body skeletons and the underlying body dynamics. The physics-augmented spatiotemporal features are used for human action classification. We evaluate the proposed method on NTU-RGB+D, a large-scale dataset for skeleton-based action recognition. Compared with existing methods, our method achieves higher accuracy and better generalization ability.
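To make the decoder step concrete, the following is a minimal sketch of one forward-dynamics update, not the authors' implementation. It assumes a single-joint pendulum in place of the full body skeleton, so the Euler-Lagrange equation reduces to m l^2 q'' + m g l sin(q) = tau, and it assumes a semi-implicit Euler integrator as the ODE solver. The names `pendulum_step` and `rollout`, and all parameter values, are hypothetical.

```python
import math


def pendulum_step(q, qd, tau, dt, mass=1.0, length=1.0, gravity=9.81):
    """Advance one frame of a single-joint pendulum (illustrative assumption).

    q is the generalized position (angle from the downward vertical, radians),
    qd the generalized velocity, and tau the joint torque, which in the
    abstract's setting would be estimated by the encoder.
    """
    inertia = mass * length ** 2
    # Euler-Lagrange equation solved for the acceleration:
    # m l^2 q'' + m g l sin(q) = tau
    qdd = (tau - mass * gravity * length * math.sin(q)) / inertia
    # Semi-implicit Euler: update the velocity first, then the position.
    qd_next = qd + dt * qdd
    q_next = q + dt * qd_next
    return q_next, qd_next


def rollout(q0, qd0, torques, dt):
    """Reconstruct a sequence of states from an initial state and torques."""
    states = [(q0, qd0)]
    q, qd = q0, qd0
    for tau in torques:
        q, qd = pendulum_step(q, qd, tau, dt)
        states.append((q, qd))
    return states
```

In a training loop, the reconstructed positions from such a rollout would be compared against the observed next-frame skeleton, and that reconstruction error would be added to the classification loss.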
Complex human events are high-level human activities that are composed of a set of interacting primitive human actions over time. Complex human event recognition is important for many applications, including security surveillance, healthcare, sports, and games. It requires recognizing not only the constituent primitive actions but also, more importantly, their long-range spatiotemporal interactions. To meet this requirement, we propose to exploit the self-attention mechanism in the Transformer to model and capture the long-range interactions among primitive actions. We further extend the conventional Transformer to a probabilistic Transformer in order to quantify the event recognition confidence and to detect anomalous events. Specifically, given a sequence of human 3D skeletons, the proposed model first performs primitive action localization and recognition. The recognized primitive human actions and their features are then fed into the probabilistic Transformer for complex human event recognition. By using a probabilistic attention score, the probabilistic Transformer can not only recognize complex events but also quantify its prediction uncertainty. Using the prediction uncertainty, we further propose to detect anomalous events in an unsupervised manner. We evaluate the proposed probabilistic Transformer on the FineDiving and Olympics Sports datasets for both complex event recognition and abnormal event detection. The FineDiving dataset consists of complex events composed of primitive diving actions. The experimental results show that our method outperforms the baseline methods.
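The uncertainty-based anomaly detection described above can be illustrated with a small sketch. The abstract does not specify how the uncertainty is computed, so this example assumes that the probabilistic model yields several sampled logit vectors per event and that the uncertainty measure is the entropy of the averaged class distribution. The function names and the threshold are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predictive_entropy(logit_samples):
    """Entropy (in nats) of the class distribution averaged over samples.

    logit_samples is a list of logit vectors, one per stochastic forward
    pass (an assumption about how the probabilistic model is queried).
    """
    probs = [softmax(sample) for sample in logit_samples]
    n_samples = len(probs)
    n_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / n_samples for c in range(n_classes)]
    return -sum(p * math.log(p) for p in mean if p > 0.0)


def is_anomalous(logit_samples, threshold):
    """Flag an event whose predictive entropy exceeds a chosen threshold."""
    return predictive_entropy(logit_samples) > threshold
```

Because the threshold is applied to the uncertainty alone, no anomaly labels are needed, which matches the unsupervised setting; the threshold itself would have to be chosen on held-out normal data.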
Existing approaches usually perform facial landmark detection and head pose estimation independently and sequentially, ignoring their coupled relations. We introduce a unified framework, named coupled cascade regression (CCR), for simultaneous facial landmark detection and head pose estimation. Based on the cascade regression framework, we propose to learn two separate regressors to update the landmark locations and three-dimensional (3D) face model parameters at each cascade level. To capture the coupled relations between the landmark locations and head pose, we further apply the 3D face projection model to refine the prediction results in each cascade iteration and make them consistent. CCR leverages both the learning methods and the projection model to perform facial landmark detection and pose estimation simultaneously, improving the performance of both tasks. We also propose to learn the cascade regressors from a combination of real and synthesized face images to address the limited variation in head pose available for training. Experimental results on the Helen, Labeled Face Parts in the Wild (LFPW), 300-W, and Boston University datasets show that our proposed CCR method outperforms other conventional methods for both landmark detection and head pose estimation.
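The role of the 3D face projection model can be sketched as follows. The abstract does not state which camera model is used, so this example assumes a weak-perspective projection and a rotation composed as Rz(roll) Ry(yaw) Rx(pitch); both choices, and all function names, are illustrative assumptions rather than the CCR formulation.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_matrix(yaw, pitch, roll):
    """Head rotation from Euler angles in radians (convention is assumed)."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    rx = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    rz = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]
    return matmul3(rz, matmul3(ry, rx))


def project_landmarks(points_3d, yaw, pitch, roll, scale, tx, ty):
    """Weak-perspective projection of 3D face points to 2D landmarks."""
    r = rotation_matrix(yaw, pitch, roll)
    projected = []
    for x, y, z in points_3d:
        # Rotate, keep the first two rows, then scale and translate.
        u = scale * (r[0][0] * x + r[0][1] * y + r[0][2] * z) + tx
        v = scale * (r[1][0] * x + r[1][1] * y + r[1][2] * z) + ty
        projected.append((u, v))
    return projected
```

Within a cascade iteration, landmarks projected from the current 3D face parameters could be compared with the directly regressed landmarks, and the two estimates adjusted until they agree.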