MHViTPose: multiscale hybrid vision transformer for human pose estimation

Junhui Qu; Ziyan Zhao; Xiang Yu; Wei Zhang

doi:10.1117/12.3006800

10 October 2023

Proceedings Volume 12799, Third International Conference on Advanced Algorithms and Signal Image Processing (AASIP 2023); 127991R (2023) https://doi.org/10.1117/12.3006800
Event: 3rd International Conference on Advanced Algorithms and Signal Image Processing (AASIP 2023), 2023, Kuala Lumpur, Malaysia

Abstract

Despite the significant progress achieved by vision Transformers, some limitations remain to be addressed in human pose estimation. First, the Transformer lacks the inductive bias and local feature attention of CNNs, so it requires extensive training data and many training iterations to achieve satisfactory results. We therefore propose a hybrid network that combines convolution with a Transformer. Second, to handle human body images at different scales, we build a Transformer pyramid structure that recognizes bodies at multiple scales by progressively reducing the input resolution. Our algorithm achieves an accuracy of 77.3% with a computational complexity of 19.6 GFLOPs. Compared with traditional direct regression methods, it considerably improves detection accuracy; compared with traditional Transformer methods, it reduces training complexity and significantly increases detection speed.
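
The effect of progressively reducing resolution in a Transformer pyramid can be illustrated with a small calculation. The sketch below is not the authors' implementation: the 256x192 input size and the per-stage strides (4, 2, 2, 2) are assumptions chosen only for illustration, and the function names are hypothetical. It counts the tokens at each stage and the number of query-key pairs that full self-attention would compare, which grows with the square of the token count.

```python
def pyramid_token_counts(height, width, strides):
    """Return a list of (h, w, tokens) for each pyramid stage.

    Each stage downsamples the previous feature map by its stride.
    The strides are illustrative assumptions, not values from the paper.
    """
    stages = []
    h, w = height, width
    for stride in strides:
        if h % stride or w % stride:
            raise ValueError("feature map size must be divisible by the stride")
        h, w = h // stride, w // stride
        stages.append((h, w, h * w))
    return stages


def attention_pairs(tokens):
    """Number of query-key pairs compared by full self-attention."""
    return tokens * tokens


if __name__ == "__main__":
    # Assumed input size and strides, for illustration only.
    for h, w, n in pyramid_token_counts(256, 192, (4, 2, 2, 2)):
        print(f"{h}x{w}: {n} tokens, {attention_pairs(n)} attention pairs")
```

Under these assumptions, each halving of the resolution divides the token count by 4 and the number of attention pairs by 16, which is one reason a pyramid layout can lower the cost of the later stages.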

(2023) Published by SPIE. Downloading of the abstract is permitted for personal use only.

Citation

Junhui Qu, Ziyan Zhao, Xiang Yu, and Wei Zhang "MHViTPose: multiscale hybrid vision transformer for human pose estimation", Proc. SPIE 12799, Third International Conference on Advanced Algorithms and Signal Image Processing (AASIP 2023), 127991R (10 October 2023); https://doi.org/10.1117/12.3006800

PROCEEDINGS
7 PAGES

KEYWORDS

Transformers

Pose estimation

Feature extraction

Visualization

Education and training

Feature fusion

Human vision and color perception
