Current state-of-the-art unsupervised person re-identification (Re-ID) algorithms train the model on pseudo labels produced by clustering unlabeled data, but clustering inevitably generates noisy pseudo labels, which limits the performance of unsupervised person Re-ID. To tackle this issue, we propose a contrastive regularization loss that encourages the model to concentrate on learning representations from correct pseudo labels while ignoring noisy ones as far as possible. Guided by this loss, training focuses on high-quality correct pseudo labels and suppresses the negative effect of noisy pseudo labels, thereby improving unsupervised person Re-ID. The proposed method is evaluated on three person Re-ID benchmark datasets, and the results confirm its effectiveness and show that it outperforms other state-of-the-art approaches.
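As a rough illustration of the idea (the abstract does not give the exact formulation), the following PyTorch sketch shows one plausible contrastive regularization: an InfoNCE-style loss over cluster centroids in which a per-sample confidence weight suppresses samples suspected of carrying noisy pseudo labels. The function name, the weighting scheme, and the temperature are assumptions, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def contrastive_regularization(features, pseudo_labels, centroids,
                               confidence, tau=0.05):
    """InfoNCE-style loss over cluster centroids, down-weighting
    samples whose pseudo labels are likely noisy (hypothetical form).

    features:      (N, D) L2-normalized embeddings
    pseudo_labels: (N,)   cluster ids from the clustering step
    centroids:     (K, D) L2-normalized cluster centroids
    confidence:    (N,)   per-sample weight in [0, 1]; low values mark
                          samples suspected of carrying noisy labels
    """
    logits = features @ centroids.t() / tau                    # (N, K) similarities
    per_sample = F.cross_entropy(logits, pseudo_labels, reduction="none")
    # Correct-looking labels dominate the loss; noisy ones are suppressed.
    return (confidence * per_sample).sum() / confidence.sum().clamp(min=1e-8)
```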
The aim of fusing infrared and visible images is to obtain high-quality images by enhancing textural details and exploiting the complementary strengths of the two modalities. Since the details of visible images are not obvious in low light, current fusion methods struggle to recover complementary contours and texture details. To address the poor quality of infrared-visible fusion images under low-light conditions, this study presents a novel fusion method for infrared and visible light based on generative adversarial networks (referred to as UFIVL). Specifically, pruning is introduced into the existing densely connected decoder to reduce network complexity without quality loss. A new overall optimization objective is designed that includes an adaptive contrast-limited histogram equalization loss and a joint gradient loss, which address the loss of contrast and brightness in the fused image and the difficulty of capturing detailed features in low-light scenes, respectively. Experimental results on the LLVIP dataset show that, compared with other state-of-the-art methods, the fused images generated by the proposed method perform better both subjectively and objectively.
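The abstract does not specify the joint gradient loss, but a common form pushes the fused image's gradients toward the stronger of the two source gradients at each pixel. The following PyTorch sketch assumes that form; the Sobel operators and the L1 penalty are our choices, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

def grad_mag(img):
    """Sobel gradient magnitude of a single-channel batch (B, 1, H, W)."""
    gx = F.conv2d(img, SOBEL_X.to(img), padding=1)
    gy = F.conv2d(img, SOBEL_Y.to(img), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def joint_gradient_loss(fused, ir, vis):
    # Push the fused gradients toward the stronger of the two source
    # gradients at every pixel, preserving texture detail in low light.
    target = torch.maximum(grad_mag(ir), grad_mag(vis))
    return F.l1_loss(grad_mag(fused), target)
```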
The goal of human-object interaction (HOI) detection is to localize both the human and the object in an image and to recognize the interactions between them. HOIs are usually scattered across the image, and traditional CNN-based methods cannot aggregate such scattered information. Many newer methods utilize contextual features cropped from CNN outputs, which are sometimes not effective enough. To overcome this challenge, we utilize a deformable transformer to aggregate the whole feature map output from the CNN. The attention mechanism and query-based predictions are the keys: as the success of methods based on graph neural networks shows, attention is effective at aggregating contextual information image-wide, and the queries can extract the features of each human-object pair without mixing in the features of other instances. Because the deformable transformer extracts effective embeddings, the prediction heads can be fairly simple. Experimental results show that the proposed method is effective for HOI detection.
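To make the query-based design concrete, here is a minimal sketch of such a head. A standard (non-deformable) transformer decoder is used as a stand-in, since deformable attention is not in core PyTorch; the layer sizes, number of queries, and class counts (HICO-DET's 117 verbs and 80 objects) are assumptions.

```python
import torch
import torch.nn as nn

class QueryHOIHead(nn.Module):
    """Minimal query-based HOI head: learned queries attend to the CNN
    feature map, and simple per-query heads predict the HOI triplet."""

    def __init__(self, d_model=256, num_queries=100, num_verbs=117, num_objs=80):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.human_box = nn.Linear(d_model, 4)    # per-query human box
        self.object_box = nn.Linear(d_model, 4)   # per-query object box
        self.obj_cls = nn.Linear(d_model, num_objs + 1)
        self.verb_cls = nn.Linear(d_model, num_verbs)

    def forward(self, memory):                    # memory: (B, HW, d_model)
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        h = self.decoder(q, memory)               # each query = one HOI pair
        return (self.human_box(h).sigmoid(), self.object_box(h).sigmoid(),
                self.obj_cls(h), self.verb_cls(h))
```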
Ultrasonic phased-array total focusing method (TFM) imaging can achieve dynamic focusing over the full range, with clear imaging and a strong ability to characterize defects, so the TFM algorithm has become the gold standard against which other post-processing algorithms are tested. However, the large amount of data and time-consuming computation limit the application of TFM imaging in some industrial fields, and reducing the number of phased-array transmitting elements degrades imaging quality. To improve the imaging efficiency of the TFM algorithm while preserving imaging quality, this paper proposes a method combining a Siamese convolutional neural network (SCNN) and a genetic algorithm (GA) to obtain an optimal sparse array layout that approximates the imaging effect of the full array with a limited number of effective elements. After an appropriate sparsity ratio is selected, the GA is used to optimize the sparse array layout: the sparse array elements emit ultrasonic waves, and the full array elements receive the echo signals for imaging. The SCNN is trained on a self-built industrial defect dataset to output the similarity between the sparse-TFM image and the full-array TFM image, and this similarity serves as an objective index of imaging quality. The optimal sparse array layout is selected by combining subjective and objective evaluation.
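The GA step can be sketched as a search over binary transmit-element masks whose fitness is the SCNN similarity score. In the NumPy sketch below the `fitness` callable stands in for the trained SCNN, and all GA hyperparameters (population size, mutation rate, one-point crossover) are assumptions.

```python
import numpy as np

def ga_sparse_layout(fitness, n_elements=64, n_active=16,
                     pop_size=40, generations=100,
                     rng=np.random.default_rng(0)):
    """Genetic search for a binary transmit-element mask. fitness(mask)
    is assumed to return the SCNN similarity between the sparse-TFM
    image produced by `mask` and the full-array TFM image."""

    def random_mask():
        m = np.zeros(n_elements, bool)
        m[rng.choice(n_elements, n_active, replace=False)] = True
        return m

    def repair(m):  # keep exactly n_active elements switched on
        on, off = np.flatnonzero(m), np.flatnonzero(~m)
        if len(on) > n_active:
            m[rng.choice(on, len(on) - n_active, replace=False)] = False
        elif len(on) < n_active:
            m[rng.choice(off, n_active - len(on), replace=False)] = True
        return m

    pop = [random_mask() for _ in range(pop_size)]
    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        order = np.argsort(scores)[::-1]
        elite = [pop[i] for i in order[: pop_size // 2]]   # selection
        children = []
        while len(children) < pop_size - len(elite):
            a, b = rng.choice(len(elite), 2, replace=False)
            cut = rng.integers(1, n_elements)              # one-point crossover
            child = np.concatenate([elite[a][:cut], elite[b][cut:]])
            flip = rng.random(n_elements) < 0.02           # mutation
            children.append(repair(child ^ flip))
        pop = elite + children
    return max(pop, key=fitness)
```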
Modal translation between multimodal images is an effective complementary scheme when images of a certain modality are difficult to obtain. Since pixel-level modal translation yields the highest-quality images, it has become a research hotspot in recent years. The generative adversarial network (GAN) is a standard architecture for image generation, but because of the complexity of both the GAN structure and the image generation task, GAN training is often unstable. In this paper, on the basis of the U-Net, dense blocks are used to enrich the feature information in the down-sampling encoding and up-sampling decoding operations, so as to reduce information loss and obtain higher-quality images. At the same time, dense long connections are introduced to link the encoding and decoding operations of the same stage, so that the network can effectively combine low- and high-level features and improve performance. Experimental results show that the proposed method is effective for modal translation of multimodal images, and the image quality is better than that of several state-of-the-art methods.
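For reference, here is a minimal PyTorch sketch of a dense block and of the long skip connection pattern described above; the growth rate, depth, and layer composition are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer sees the concatenation of all previous feature maps,
    so fine details survive the down-/up-sampling path."""

    def __init__(self, in_ch, growth=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch + i * growth, growth, 3, padding=1),
                nn.BatchNorm2d(growth), nn.ReLU(inplace=True))
            for i in range(n_layers)])

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)  # dense concatenation of all features

# Dense long connection: concatenate the encoder output of a stage with
# the decoder feature of the same stage before further decoding, e.g.
#   dec_in = torch.cat([upsampled_decoder_feat, encoder_feat_same_stage], dim=1)
```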
In recent years, deep neural networks have achieved impressive progress in object detection. However, detecting the interactions between objects remains challenging, and many researchers study human-object interaction (HOI) detection as a basic task in detailed scene understanding. Most conventional HOI detectors follow a two-stage pipeline and are usually slow at inference. One-stage methods that directly detect HOI triplets in parallel break through the limitations of object detection, but the features they extract are still insufficient. To overcome these drawbacks, we propose an improved one-stage HOI detection approach in which an attention aggregation module and a dynamic point matching strategy play the key roles. Attention aggregation explicitly enhances the semantic expressiveness of interaction points by aggregating contextually important information, while the matching strategy effectively filters out negative HOI pairs at inference time. Extensive experiments on two challenging HOI detection benchmarks, V-COCO and HICO-DET, show that our method achieves performance comparable to the state of the art without any additional human pose or language features.
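The abstract does not define the matching strategy, but interaction-point detectors commonly predict, for each interaction point, offsets to the human and object centers, and discard pairs inconsistent with any point. The sketch below assumes that scheme; the function, its signature, and the distance threshold are all hypothetical.

```python
import torch

def filter_hoi_pairs(points, h_off, o_off, h_centers, o_centers, radius=8.0):
    """Keep (human, object) pairs consistent with some interaction point.

    points:      (P, 2) predicted interaction-point locations
    h_off, o_off:(P, 2) predicted offsets from each point to the human /
                 object center
    h_centers:   (H, 2) and o_centers: (O, 2) detected box centers
    Returns a (P, H, O) boolean mask of surviving triplets.
    """
    pred_h = points + h_off                  # where the human should be
    pred_o = points + o_off                  # where the object should be
    dh = torch.cdist(pred_h, h_centers)      # (P, H) distances
    do = torch.cdist(pred_o, o_centers)      # (P, O) distances
    keep = (dh[:, :, None] < radius) & (do[:, None, :] < radius)
    return keep  # pairs far from every interaction point are dropped
```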
With the development and maturing of object detection techniques, more and more researchers are applying deep learning to the detection and classification of image data obtained by dynamic vision sensors (DVS). Since the event stream produced by a DVS carries no grayscale feature information, we convert it into frame images and train the YOLOv3 neural network model on them for object detection. Because the IoU loss function in the original YOLOv3 network cannot represent the distance between the predicted box and the ground truth when they do not overlap, this paper improves YOLOv3 by employing the GIoU loss function to achieve more accurate detection. Experiments on a self-collected dataset show that the GIoU-improved YOLOv3 network performs well and can accurately recognize and classify human actions.
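GIoU has a standard definition (Rezatofighi et al., 2019): IoU penalized by the fraction of the smallest enclosing box not covered by the union, which stays informative for non-overlapping boxes. A straightforward PyTorch implementation:

```python
import torch

def giou_loss(pred, target, eps=1e-7):
    """GIoU loss for boxes in (x1, y1, x2, y2) format, shape (N, 4)."""
    # Intersection of each predicted box with its target box.
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)

    # Smallest box enclosing both: its uncovered part penalizes distance.
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    enclose = cw * ch + eps

    giou = iou - (enclose - union) / enclose
    return (1.0 - giou).mean()
```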
Multimodal medical image fusion extracts information from images of different modalities into a single image that captures the tissue characteristics of the source images. Different medical imaging modalities such as CT and MRI often produce different visual morphology, yet the salient features of tissues appear basically the same to the human eye. Exploiting this characteristic, an improved image fusion algorithm based on visual saliency detection is proposed in this paper. First, the GBVS algorithm is used to compute the visual saliency of the two registered source images, and the source images are decomposed in the NSST domain into low-frequency and high-frequency sub-bands. For the low-frequency sub-bands, local energy and the GBVS saliency map are fed into a fuzzy logic system to obtain the weights for the fused low-frequency sub-band. For the high-frequency sub-bands, the NSML values of each sub-band are calculated and compared to obtain the fused high-frequency sub-band. The final fused image is obtained by the inverse NSST transform. Applied to multimodal medical image fusion, this method effectively enhances the visual quality of the image while preserving the salient features of tissues. Experiments on multimodal fusion of gray-scale medical images show that the proposed method better retains salient image features and overall contrast, and achieves better objective indices than the comparison models.
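The high-frequency rule (compare NSML values, keep the larger) can be sketched with a plain sum-modified Laplacian standing in for the paper's NSML variant; the window size and border handling (wrap-around via `np.roll`) are simplifications.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sml(band, window=3):
    """Sum-modified Laplacian: an activity measure for a sub-band,
    used here as a stand-in for the paper's NSML."""
    lx = np.abs(2 * band - np.roll(band, 1, axis=1) - np.roll(band, -1, axis=1))
    ly = np.abs(2 * band - np.roll(band, 1, axis=0) - np.roll(band, -1, axis=0))
    return uniform_filter(lx + ly, size=window)   # local sum (scaled)

def fuse_high_freq(band_a, band_b):
    """Choose-max rule: at each pixel keep the coefficient from the
    sub-band with the larger activity."""
    mask = sml(band_a) >= sml(band_b)
    return np.where(mask, band_a, band_b)
```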
The human-object interaction (HOI) detection task is defined as inferring all the <human, verb, object> triplets in an image, helping computers obtain a more comprehensive understanding of the visual scene. Most existing HOI detection methods focus on local instance features and rarely consider information from the background. Our core idea is that the relationship between the human, the object, and the surrounding background contains important cues for HOI detection. Following the short-term memory selection (STMS) mechanism, we regard the interaction relationship as the result of the human and object stimulating their union area, and simulate this stimulation process with a recurrent neural network: the features in the union area of the human and object, together with the human and object features themselves, are fed into the RNN, and its output is the representation of the interaction relationship. Combined with the visual and spatial features of the instances, a multi-stream network is used to detect HOIs in the image. Experiments on V-COCO and HICO-DET show that the proposed model achieves better performance, verifying the effectiveness of our method.
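One way to read "human and object stimulating the union area" is an RNN whose hidden state starts from the union-area feature and is updated by the human and object features in turn; that interpretation, and every name below, is an assumption. A minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

class InteractionRNN(nn.Module):
    """The union-area feature initializes the hidden state (the region
    being 'stimulated'); human and object features are the two time
    steps, and the final hidden state represents the interaction."""

    def __init__(self, dim=1024):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, f_human, f_object, f_union):       # each (B, dim)
        h0 = f_union.unsqueeze(0)                         # (1, B, dim) initial state
        steps = torch.stack([f_human, f_object], dim=1)   # (B, 2, dim) inputs
        _, h_n = self.rnn(steps, h0)
        return h_n.squeeze(0)                             # interaction representation
```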
To address the problem that typical convolutional neural networks fail to model actions over their full temporal extent, this study proposes a novel video action recognition algorithm based on an improved 3D convolutional network (iC3D) architecture with K-means keyframe extraction and sparse representation classification (SRC). During feature extraction, K-means keyframe extraction reduces the redundant information produced by consecutive video frames and enlarges the temporal receptive field. Meanwhile, to improve noise immunity, sparse coding and its reconstruction errors are used for classification. The proposed method achieves 96.5% recognition accuracy on the typical video action classification dataset UCF101, outperforming other competing methods. In addition, we built an in-the-wild test dataset to verify the generalization performance of the proposed model.
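The two components can be sketched as follows: pick the frame nearest each K-means centroid as a keyframe, and classify by the class-wise reconstruction error of a sparse code. Using Lasso as the sparse coder, and all parameter values, are our assumptions rather than the paper's choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso

def keyframes(frame_feats, k=16):
    """Pick the frame closest to each K-means centroid, in temporal
    order, so near-duplicate consecutive frames collapse to keyframes."""
    km = KMeans(n_clusters=k, n_init=10).fit(frame_feats)   # (T, D) features
    idx = [int(np.argmin(np.linalg.norm(frame_feats - c, axis=1)))
           for c in km.cluster_centers_]
    return sorted(set(idx))

def src_classify(x, dictionary, labels):
    """Sparse-representation classification: code x over the training
    dictionary (N_atoms, D), then pick the class whose atoms give the
    smallest reconstruction error."""
    alpha = Lasso(alpha=0.01, max_iter=5000).fit(dictionary.T, x).coef_
    errs = {c: np.linalg.norm(x - dictionary[labels == c].T
                              @ alpha[labels == c])
            for c in np.unique(labels)}
    return min(errs, key=errs.get)
```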
Instance segmentation for obstacle detection based on machine vision and deep learning is quite important for autonomous driving systems. This paper proposes an instance segmentation method based on Mask R-CNN with feature fusion of RGB and depth images. It extracts depth-image features with a two-layer network-in-network (NiN) and uses convolution to fuse the RGB and depth features and reduce their dimensionality; the edge texture in the depth image improves the accuracy of bounding-box localization. Experimental results on a typical benchmark dataset demonstrate the effectiveness of the proposed method, which improves segmentation accuracy by 4% and recall by 2%.
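A minimal sketch of the fusion idea, assuming a NiN-style depth branch (convolutions followed by 1x1 convolutions) and a 1x1-conv fusion layer; channel sizes and kernel choices are ours, not the paper's.

```python
import torch
import torch.nn as nn

class DepthFusion(nn.Module):
    """Two-layer NiN-style depth branch plus 1x1-conv fusion that merges
    depth features into the RGB feature map and reduces dimensionality."""

    def __init__(self, rgb_ch=256, depth_ch=64):
        super().__init__()
        self.depth_branch = nn.Sequential(   # NiN: each conv followed by a 1x1 conv
            nn.Conv2d(1, depth_ch, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(depth_ch, depth_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(depth_ch, depth_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(depth_ch, depth_ch, 1), nn.ReLU(inplace=True))
        # 1x1 convolution fuses the concatenated maps back to rgb_ch channels.
        self.fuse = nn.Conv2d(rgb_ch + depth_ch, rgb_ch, 1)

    def forward(self, rgb_feat, depth):
        d = self.depth_branch(depth)         # depth edge texture sharpens boxes
        d = nn.functional.interpolate(d, size=rgb_feat.shape[-2:])
        return self.fuse(torch.cat([rgb_feat, d], dim=1))
```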