Accurate crowd counting in congested scenes remains challenging due to the trade-off between efficiency and generalization. To address this issue, we propose a mobile-friendly solution for network deployment in scenarios demanding high response speed. To bring the potential of global crowd representations to lightweight counting models, this work presents a novel mobile vision transformer architecture for crowd counting (CCMTNet), which aims to improve both efficiency and generality in real-time crowd counting tasks on resource-constrained computing devices. The framework interleaves a linear CNN structure with self-attention blocks, endowing the model with both local feature extraction and global high-dimensional crowd information processing at low computational cost. In addition, several experimental networks at different scales based on the proposed architecture are comprehensively evaluated to balance accuracy loss against the compression of computational cost. Extensive experiments on three mainstream crowd counting datasets demonstrate the effectiveness of the proposed network. In particular, CCMTNet reconciles counting accuracy and efficiency in comparison with traditional lightweight CNN networks.
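The sketch below illustrates the general idea of interleaving convolutional layers with self-attention blocks, in the spirit of MobileViT-style hybrids: a depthwise-separable convolution captures local features, and multi-head self-attention over the flattened feature map models global crowd context. It is a minimal, hypothetical example; module names, dimensions, and the exact block layout are illustrative assumptions, not the authors' CCMTNet design.

```python
import torch
import torch.nn as nn


class HybridCNNAttentionBlock(nn.Module):
    """Illustrative CNN + self-attention block (not the exact CCMTNet block):
    local features via depthwise-separable convolution, global context via
    multi-head self-attention over spatial tokens."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Local feature extraction: depthwise + pointwise convolution.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )
        # Global crowd-context modeling with self-attention.
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.local(x)                      # (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C) spatial tokens
        q = self.norm(tokens)
        attn_out, _ = self.attn(q, q, q)
        tokens = tokens + attn_out             # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    block = HybridCNNAttentionBlock(channels=64)
    features = block(torch.randn(1, 64, 32, 32))
    print(features.shape)  # torch.Size([1, 64, 32, 32])
```

Because attention here is applied to the downsampled feature map rather than raw pixels, the quadratic cost of self-attention stays modest, which is what makes such hybrid blocks attractive for resource-constrained, real-time counting.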