SPIE Journal Paper | 28 September 2022
KEYWORDS: Optical tracking, Video, Code division multiplexing, Network architectures, Convolution, Visualization, Chemical species, Adaptive optics, Head, Transformers
Single object tracking has recently achieved great success. However, fast motion, occlusion, and deformation still make it difficult for traditional trackers to adapt to changes in object appearance. In this work, a dynamic memory network is proposed for visual tracking to handle the template matching process globally. More specifically, we first build a memory model, which consists of a memory feature fusion module and a memory bank. With the memory model, the network not only accepts the first frame as initial information but also memorizes selected frames from the video sequence to provide rich temporal information. Second, an innovative sampling strategy is adopted in the tracking process. By updating the template and guiding the selection of memory frames, our model can output higher-quality features. In addition, we introduce a spatial-channel fused attention module that effectively improves the representational capability and discriminability of the model. Our proposed method obtains compelling results on six challenging tracking benchmarks: OTB100, VOT2019, UAV123, NFS, GOT-10k, and LaSOT. Extensive experiments demonstrate that our approach shows satisfactory robustness and strong application potential at real-time speed.
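To make the memory bank idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the first-frame feature is kept permanently and that later frames are stored only when a tracker confidence score passes a threshold, with the oldest stored frame evicted once the bank is full. The abstract does not specify the sampling strategy, so the names `MemoryBank`, `capacity`, `min_confidence`, and `maybe_store`, as well as the threshold rule itself, are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, List, Optional


@dataclass
class MemoryBank:
    """Toy memory bank: first frame kept permanently, later frames stored selectively."""

    capacity: int = 3             # hypothetical limit on stored memory frames
    min_confidence: float = 0.8   # hypothetical selection threshold
    initial: Optional[Any] = None
    frames: Deque[Any] = field(default_factory=deque)

    def initialize(self, first_frame_feature: Any) -> None:
        """Store the first-frame feature and clear any previously memorized frames."""
        self.initial = first_frame_feature
        self.frames.clear()

    def maybe_store(self, feature: Any, confidence: float) -> bool:
        """Memorize a frame only if its confidence is high enough (assumed rule)."""
        if confidence < self.min_confidence:
            return False
        if len(self.frames) == self.capacity:
            self.frames.popleft()  # evict the oldest memorized frame
        self.frames.append(feature)
        return True

    def read(self) -> List[Any]:
        """Return the first-frame feature followed by the memorized frames, oldest first."""
        return [self.initial, *self.frames]


# Usage with placeholder strings standing in for feature maps.
bank = MemoryBank(capacity=2, min_confidence=0.5)
bank.initialize("f0")
bank.maybe_store("f1", 0.9)   # stored
bank.maybe_store("f2", 0.1)   # rejected: low confidence
bank.maybe_store("f3", 0.7)   # stored
bank.maybe_store("f4", 0.6)   # stored, evicts "f1"
print(bank.read())            # ['f0', 'f3', 'f4']
```

In a real tracker the stored items would be feature tensors, and the output of `read()` would presumably be passed to a feature fusion step; the strings here only illustrate the bookkeeping.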