KEYWORDS: 3D modeling, Education and training, Image segmentation, Network architectures, 3D mask effects, RGB color model, Binary data, 3D image reconstruction, 3D image processing, Neural networks
In recent years, various methods have been proposed for reconstructing the 3D shape of an object from a single-view image. Methods that reconstruct the object as a single model show promising results but often lack part-level details, while part-level reconstruction methods provide recognition of parts but struggle to represent detailed shapes due to the use of a single primitive type. To address these limitations, this paper proposes a Compositionally Generalizable 3D Structure Prediction Network using Multiple Types of Primitives (CompNet-MTP). CompNet-MTP first estimates the parameters of each type of primitive for every part and then selects the appropriate primitive type to construct the 3D shape of the object. In the experiments, we used cylinders in addition to cuboids, which are commonly used as primitive shapes. Experimental results confirm the effectiveness of the proposed network in handling multiple types of primitives.
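As an illustration of the estimate-then-select scheme described above, the following sketch shows per-part regression heads for two primitive types and a classifier that picks one type per part. The feature dimension, the parameterizations (10 cuboid and 9 cylinder parameters), and the head structure are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class PrimitiveHeads(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # One regression head per primitive type (parameterizations assumed):
        # cuboid: center (3) + size (3) + rotation quaternion (4) = 10 params.
        self.cuboid_head = nn.Linear(feat_dim, 10)
        # cylinder: center (3) + radius (1) + height (1) + axis quaternion (4) = 9.
        self.cylinder_head = nn.Linear(feat_dim, 9)
        # Classifier that selects the primitive type for each part.
        self.type_head = nn.Linear(feat_dim, 2)

    def forward(self, part_feats):            # part_feats: (num_parts, feat_dim)
        cuboid_params = self.cuboid_head(part_feats)
        cylinder_params = self.cylinder_head(part_feats)
        type_logits = self.type_head(part_feats)
        choice = type_logits.argmax(dim=-1)   # 0 = cuboid, 1 = cylinder
        return cuboid_params, cylinder_params, choice

heads = PrimitiveHeads()
feats = torch.randn(4, 256)                   # features of 4 hypothetical parts
_, _, choice = heads(feats)
print(choice)                                 # selected primitive type per part
```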
Previous 3D pose and shape estimation methods often suffer from depth ambiguity. We therefore present a novel method that reduces depth ambiguity by explicitly considering the depth of a person's body surface. The key idea is to minimize the difference between the depth estimated from an input image and the projected depth of a reconstructed 3D mesh. This allows the proposed method to estimate 3D pose and body shape with plausible 3D joint locations. Evaluations show that the proposed method produces more appropriate 3D meshes and reduces both 3D pose and shape estimation errors.
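The depth-consistency idea lends itself to a simple loss term. Below is a minimal sketch, assuming an L1 penalty restricted to pixels covered by the rendered mesh; the paper's exact formulation may differ.

```python
import torch

def depth_consistency_loss(estimated_depth, projected_depth, body_mask):
    """estimated_depth, projected_depth, body_mask: (H, W) tensors.
    body_mask is 1 where the rendered mesh covers a pixel, 0 elsewhere."""
    diff = (estimated_depth - projected_depth).abs() * body_mask
    return diff.sum() / body_mask.sum().clamp(min=1)  # mean over body pixels

H, W = 64, 64
est = torch.rand(H, W)        # stand-in for a depth network's output
proj = torch.rand(H, W)       # stand-in for the mesh's rendered (projected) depth
mask = (proj > 0.5).float()   # stand-in body-surface mask
print(depth_consistency_loss(est, proj, mask))
```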
Gestures complement the content of an utterance and help listeners understand it. In the field of gesture generation, the task of generating gestures from utterances has attracted attention. The main approach is to associate utterances with gestures using deep neural networks, which requires a co-speech gesture dataset. However, building such datasets is costly and time-consuming because it requires a reliable pose estimation system (such as motion capture) and manual adjustments. We propose an automatic method to collect a co-speech gesture dataset from online speech videos by extracting a variety of utterance and gesture pairs from them. In addition, we use the collected dataset to train a deep neural network and confirm that our automatically collected dataset can serve as a supervisory signal for speech-driven gesture generation.
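A skeleton of the kind of collection pipeline the abstract describes might look as follows: align utterance segments from speech recognition with per-frame pose estimates and keep the pairs that pass a simple quality check. All names and the joint-count threshold are hypothetical placeholders, not a real API.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds
    end: float
    text: str

def collect_pairs(poses, segments, min_visible_joints=10):
    """poses: dict mapping frame time (seconds) to a list of detected joints;
    segments: utterance Segments from speech recognition (both hypothetical)."""
    pairs = []
    for seg in segments:
        frames = sorted(t for t in poses if seg.start <= t <= seg.end)
        # Discard utterances whose frames have too few reliably detected joints.
        if frames and all(len(poses[t]) >= min_visible_joints for t in frames):
            gesture = [poses[t] for t in frames]
            pairs.append((seg.text, gesture))
    return pairs

# Toy usage with synthetic pose and utterance data.
poses = {0.1: [0] * 12, 0.2: [0] * 12, 0.5: [0] * 5}
segments = [Segment(0.0, 0.3, "hello everyone"), Segment(0.4, 0.6, "thanks")]
print(len(collect_pairs(poses, segments)))  # -> 1 (second utterance rejected)
```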
We propose a refinement module that improves action recognition by considering the semantic relevance between verbs and nouns. Existing methods recognize an action as a combination of a verb and a noun; however, they occasionally produce semantically implausible combinations, such as “drink a cupboard” or “open a carrot”. To tackle this problem, we propose a method that incorporates a word embedding model into an action recognition network. The word embedding model is trained to capture the co-occurrence of verbs and nouns and is used to refine the initial class probabilities estimated by the network. Experimental results show that our method improves the estimation accuracy of verbs and nouns on the EPIC-KITCHENS dataset.
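The refinement step can be illustrated with a toy re-weighting of the network's initial verb and noun probabilities by a co-occurrence score. The multiplicative rule and the raw score matrix below are assumptions; the paper uses a learned word embedding model rather than explicit counts.

```python
import numpy as np

def refine(verb_probs, noun_probs, cooccur):
    """verb_probs: (V,), noun_probs: (N,), cooccur: (V, N) scores in [0, 1].
    Returns a refined joint distribution over (verb, noun) pairs."""
    joint = np.outer(verb_probs, noun_probs) * cooccur
    return joint / joint.sum()

verbs = ["open", "drink"]
nouns = ["cupboard", "water"]
verb_probs = np.array([0.4, 0.6])
noun_probs = np.array([0.5, 0.5])
cooccur = np.array([[0.9, 0.1],   # "open cupboard" plausible
                    [0.0, 0.9]])  # "drink cupboard" implausible
joint = refine(verb_probs, noun_probs, cooccur)
v, n = np.unravel_index(joint.argmax(), joint.shape)
print(verbs[v], nouns[n])         # -> drink water
```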
Visual tracking produces a trajectory of a person's movement in a video and is an important element of human behavior analysis. However, most visual tracking methods do not achieve the precision and speed required for real-world use: high accuracy and low computational complexity are both demanded, yet the two trade off against each other. We therefore propose a tracking method that balances accuracy and computational complexity. In this paper, we present a method named Dual Cost Graph (DCG)-Tracker, which uses two graphs, a clique graph and a flow network. We evaluated DCG-Tracker on the PNNL Parking Lot 1 dataset [1, 2] and showed that it balances accuracy and speed.
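The abstract gives few details of the clique graph and flow network, so the following is only a simplified stand-in for the underlying idea of cost-based data association: build a detection-to-track cost matrix and solve the matching. A real flow-network formulation would also handle track births, deaths, and occlusion.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracks, detections):
    """tracks: (n, 2) and detections: (m, 2) arrays of 2-D positions.
    Returns (track_idx, det_idx) pairs minimizing total Euclidean cost."""
    cost = np.linalg.norm(tracks[:, None, :] - detections[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols)]

tracks = np.array([[0.0, 0.0], [5.0, 5.0]])
detections = np.array([[5.2, 4.9], [0.1, -0.2]])
print(associate(tracks, detections))  # -> [(0, 1), (1, 0)]
```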
In classification tasks, the accuracy of a classifier depends on its training data, and inter-class imbalanced data are known to degrade classification accuracy. Previous approaches tend to use data augmentation to address inter-class imbalance, but the possibility of intra-class imbalance has been ignored. In this paper, we propose a novel method to address intra-class imbalance with a Generative Adversarial Network (GAN). The key idea is to examine the distribution of training data in latent space. We experimentally demonstrate that the proposed method generates diverse images and improves classification accuracy on the CIFAR-10 dataset.
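One way to realize "examining the distribution in latent space" is sketched below: cluster a class's latent codes, find the under-populated mode, and draw new codes around it for the generator to synthesize. The k-means step and the sampling rule are illustrative assumptions, not necessarily the paper's procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def sparse_latent_codes(latents, n_clusters=5, n_new=100, scale=0.1, seed=0):
    """latents: (n, d) latent codes of one class's training images.
    Returns n_new codes drawn around the least-populated cluster center."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(latents)
    counts = np.bincount(km.labels_, minlength=n_clusters)
    sparse = counts.argmin()                    # under-represented latent mode
    rng = np.random.default_rng(seed)
    noise = rng.normal(scale=scale, size=(n_new, latents.shape[1]))
    return km.cluster_centers_[sparse] + noise  # feed these codes to the generator

# Toy class whose latent codes are 95% in one mode and 5% in another.
rng = np.random.default_rng(0)
latents = np.vstack([rng.normal(0, 1, (95, 8)), rng.normal(4, 1, (5, 8))])
print(sparse_latent_codes(latents).shape)       # (100, 8)
```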
Recently, dense trajectories [1] have been shown to be a successful video representation for action recognition and have demonstrated state-of-the-art results on a variety of datasets. However, when these trajectories are applied to gesture recognition, similar and fine-grained motions are difficult to distinguish. In this paper, we propose a new method in which dense trajectories are calculated in segmented regions around detected human body parts. Spatial segmentation is achieved by body part detection [2], and temporal segmentation is performed over a fixed number of video frames. The proposed method removes background video noise and can recognize similar and fine-grained motions. Because only a few video datasets are available for gesture classification, we constructed a new gesture dataset and evaluated the proposed method on it. The experimental results show that the proposed method outperforms the original dense trajectories.
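The two segmentation steps are simple to state in code. The sketch below assumes a 15-frame clip length and axis-aligned body-part boxes; trajectory extraction itself is out of scope here.

```python
def temporal_segments(num_frames, clip_len=15):
    """Split a video into consecutive fixed-length clips (assumed length)."""
    return [(s, min(s + clip_len, num_frames))
            for s in range(0, num_frames, clip_len)]

def in_part_region(point, boxes):
    """Keep a trajectory point only if it falls inside a body-part box."""
    x, y = point
    return any(x0 <= x <= x1 and y0 <= y <= y1 for x0, y0, x1, y1 in boxes)

print(temporal_segments(40))                       # [(0, 15), (15, 30), (30, 40)]
print(in_part_region((12, 8), [(10, 5, 30, 25)]))  # True
```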
In industrial plants, a remote monitoring system that eliminates physical tour inspection is often desirable. However, the image sequence captured by a mobile inspection robot is hard to interpret because objects of interest are often partially occluded by obstacles such as pillars or fences. Our aim is to improve the image sequence so as to increase the efficiency and reliability of remote visual inspection. We propose a new depth-based image processing technique that removes needless foreground objects and electronically recovers the occluded background. Our algorithm is based on spatiotemporal analysis that enables fine and dense depth estimation, depth-based precise segmentation, and accurate interpolation. We apply this technique to a real image sequence captured by the mobile inspection robot. The resulting image sequence is satisfactory in that the operator can perform correct visual inspection with less fatigue.
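A toy version of the depth-based removal step is sketched below: treat pixels nearer than a threshold as the obstacle and fill them from frames in which the same background pixels are visible. The threshold rule and cross-frame fill are simplifying assumptions standing in for the paper's spatiotemporal analysis.

```python
import numpy as np

def remove_foreground(frames, depths, near_thresh):
    """frames: (T, H, W) grayscale video, depths: (T, H, W) per-pixel depth.
    Pixels nearer than near_thresh are obstacles, filled from other frames."""
    out = frames.astype(float).copy()
    occluded = depths < near_thresh                # obstacle mask per frame
    for t in range(len(frames)):
        for s in range(len(frames)):               # borrow visible background
            fill = occluded[t] & ~occluded[s]
            out[t][fill] = frames[s][fill]
            occluded[t] &= ~fill                   # mark pixels as recovered
    return out

frames = np.random.rand(3, 4, 4)
depths = np.random.rand(3, 4, 4) * 10
print(remove_foreground(frames, depths, near_thresh=2.0).shape)  # (3, 4, 4)
```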
We propose Hierarchical Distributed Template Matching, which reduces the computational cost of template matching while maintaining the same reliability as conventional template matching. To achieve this cost reduction without loss of reliability, we first evaluate the correlation of shrunken images in order to select the maximum depth of the hierarchy. Then, for each level of the hierarchy, we choose a small number of template points in the original template and build a sparse distributed template. The locations of the template points are optimized so that they yield a distinct peak in the correlation score map. Experimental results demonstrate that our method reduces the computational cost to less than 1/10 that of conventional hierarchical template matching, while achieving a matching precision of 0.6 pixels.
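The sparse distributed template idea can be illustrated as follows: evaluate the matching score at only a few template points instead of the whole template. Point selection here is random for brevity, whereas the paper optimizes the point locations for a distinct correlation peak; the hierarchical coarse-to-fine search over shrunken images is also omitted.

```python
import numpy as np

def sparse_ssd(image, template, points, top_left):
    """Sum of squared differences evaluated only at the selected points."""
    ty, tx = top_left
    return sum((image[ty + y, tx + x] - template[y, x]) ** 2 for y, x in points)

def match(image, template, n_points=16, seed=0):
    """Exhaustive search scored with a sparse (randomly chosen) template."""
    rng = np.random.default_rng(seed)
    h, w = template.shape
    points = list(zip(rng.integers(0, h, n_points), rng.integers(0, w, n_points)))
    best, best_score = None, np.inf
    for ty in range(image.shape[0] - h + 1):
        for tx in range(image.shape[1] - w + 1):
            score = sparse_ssd(image, template, points, (ty, tx))
            if score < best_score:
                best, best_score = (ty, tx), score
    return best

image = np.random.rand(40, 40)
template = image[10:20, 15:25].copy()
print(match(image, template))  # -> (10, 15)
```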