1. INTRODUCTION

Colorectal cancer (CRC) originates from colorectal polyps. Advanced CRC has a very high mortality rate and poses a serious threat to human health, yet if detected early it is one of the most curable cancers [1]. Therefore, accurately locating polyps at an early stage is crucial for the diagnosis of colorectal cancer. However, because manual diagnosis is experience-dependent and time-consuming, there is an urgent need for reliable automatic polyp segmentation methods that can precisely locate lesion areas in medical images. In clinical practice, polyp segmentation must achieve high accuracy, robustness, and efficiency.

Early methods were primarily based on shape, color, and superpixels. For instance, Sánchez-González et al. [2] proposed automatic colon polyp segmentation through contour region analysis. Thanks to the powerful visual representation capabilities of deep learning, significant progress has since been made in polyp segmentation. Previous studies have used convolutional neural network (CNN)-based algorithms for polyp segmentation. Among them, U-Net [3] has demonstrated excellent segmentation performance in biomedical imaging, and various U-Net variants have emerged; UNet++ [4], for example, has shown outstanding performance in polyp segmentation tasks. However, these methods often fail to capture small or structurally complex regions. DeepLabV3+ [5] attempts to address this issue by connecting low-level feature maps from the backbone network, but this is insufficient to recover such details from the input image. To solve these problems, a more powerful polyp segmentation architecture is needed. SANet [6] focuses on rich shallow features and introduces a Probability Correction Strategy (PCS).

In recent years, the Vision Transformer (ViT) [7], which first introduced transformers to image recognition, has been applied successfully to computer vision tasks. ViT divides the input image into patches, treats each patch as a token, and uses multi-head attention to extract feature information. Compared with CNN-based methods, ViT is stronger at capturing long-range dependencies. However, it suffers from weak local feature representation and lacks the inductive biases inherent to CNNs, making it difficult for ViT to excel in dense prediction tasks that require high resolution. To address these challenges, researchers introduced the Pyramid Vision Transformer (PVT) [8] and PVTv2 [9], which benefit from a progressive shrinking pyramid design and are more efficient than the original ViT. Although Polyp-PVT [10] achieved higher segmentation accuracy in polyp segmentation using PVT, its ability to exploit pyramid features is limited: features from different layers are handled separately, so there is little effective information interaction among them. Subsequent polyp segmentation models have focused on enriching contextual information to improve accuracy. Shi et al. [11] introduced the efficient context-aware Polyp-Mixer network, which uses multi-head mixers to capture context information across various subspaces. Motivated by these observations, this paper proposes CFA-PVT, which combines a PVT encoder with stage-bridging feature fusion, feature enhancement, and global adaptive modules for accurate polyp segmentation.

2. METHODOLOGY

2.1 Overview

As shown in Figure 1, the network architecture takes polyp images as input.
It extracts multi-scale long-range dependency features laterally from the PVT encoder, generating feature maps Xi ∈ ℝ^(Ci×Hi×Wi) (i ∈ {1, 2, 3, 4}) with spatial resolutions of H/4 × W/4, H/8 × W/8, H/16 × W/16, and H/32 × W/32, respectively, where Ci ∈ {64, 128, 320, 512}. These features are progressively aggregated across stages via the stage bridging module (CFM) to capture richer semantic information, yielding the fused feature Γ1 and the polyp prediction mask M1. The feature enhancement module (FEM) uses channel and spatial attention to strengthen local detail information for better polyp localization and to obtain the enhanced low-level feature Γ2. Meanwhile, the global adaptive module (GAM) aligns the high-level and low-level features Γ1 and Γ2 and outputs the polyp prediction mask M2. Finally, the model is optimized jointly with Loss1 and Loss2.

2.2 Stage bridging module

As shown in Figure 1, the network uses the Pyramid Vision Transformer PVTv2 [9], a variant of PVT [8], as the encoder. Given an input image I ∈ ℝ^(C×H×W), we first adjust the channels of the high-level features X2, X3, and X4 from the last three stages to 32 through 1×1 convolutions. Subsequently, X4 is upsampled by 2× and 4×, each followed by a 1×1-3×3-1×1 residual convolution block, and then concatenated with X3 and X2, respectively. The resulting feature maps are then simply fused to obtain the final feature map Γ1. By using the CFM to integrate features in this way, we fully exploit local and global contextual information. This top-down, multi-stage bridging feature fusion effectively guides the network to recover missing parts and details of objects.

2.3 Feature enhancement module

Low-level features typically contain rich details such as textures and edges, but because they are close to the input, they also carry a large amount of redundant information. Inspired by [12], as shown in Figure 2, we use channel and spatial attention mechanisms to enhance the interaction between global and local information, refine the features, and filter out irrelevant responses. This allows the model to focus on important channels and spatial locations and thus capture polyp details more accurately. First, the feature map is aggregated using average pooling, max pooling, and soft pooling [13], and each aggregated descriptor is enhanced individually through a shared MLP network. The channel attention map Mc is obtained by

Mc = σ(η(Pavg(F)) + η(Pmax(F)) + η(Psoft(F))),

where σ(·) is the sigmoid function, η denotes the shared MLP, and Pavg(·), Pmax(·), and Psoft(·) represent average pooling, max pooling, and soft pooling, respectively. Next, the outputs of the three pooling operations computed along the channel dimension are concatenated, and a 7×7 convolutional layer followed by a sigmoid function produces the spatial attention map

Ms = σ(f^(7×7)([Fmax; Favg; Fsoft])),

where f^(7×7) denotes a convolution with a 7×7 kernel and Fmax, Favg, and Fsoft are the maximum, average, and exponentially weighted values obtained along the channel dimension.
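To make the FEM attention computation concrete, the following PyTorch sketch shows one possible implementation of the channel and spatial attention described above. It is a minimal illustration rather than the paper's implementation: the channel count, the MLP reduction ratio, and the hand-written soft-pooling helper (used in place of the SoftPool package of [13]) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_pool(x, dim):
    # Exponentially weighted (soft) pooling along `dim`, following the idea of SoftPool [13].
    w = torch.softmax(x, dim=dim)
    return (w * x).sum(dim=dim, keepdim=True)

class FeatureEnhancementModule(nn.Module):
    """Sketch of the FEM: channel attention from avg/max/soft pooled descriptors passed
    through a shared MLP, then spatial attention from a 7x7 convolution."""

    def __init__(self, channels=32, reduction=4):  # channel count and reduction ratio are assumptions
        super().__init__()
        self.mlp = nn.Sequential(                   # shared MLP (the eta(.) above)
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.conv7x7 = nn.Conv2d(3, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        b, c, h, w = x.shape
        # Channel attention: three pooled descriptors, each refined by the shared MLP, then summed.
        avg = F.adaptive_avg_pool2d(x, 1)
        mx = F.adaptive_max_pool2d(x, 1)
        soft = soft_pool(x.view(b, c, -1), dim=2).view(b, c, 1, 1)
        mc = torch.sigmoid(self.mlp(avg) + self.mlp(mx) + self.mlp(soft))
        x = x * mc
        # Spatial attention: max / average / soft values along the channel dimension, 7x7 conv.
        mx_s, _ = x.max(dim=1, keepdim=True)
        avg_s = x.mean(dim=1, keepdim=True)
        soft_s = soft_pool(x, dim=1)
        ms = torch.sigmoid(self.conv7x7(torch.cat([mx_s, avg_s, soft_s], dim=1)))
        return x * ms
```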
2.4 Global adaptive module

This module takes two inputs: the low-level feature Γ2 from the FEM and the high-level feature Γ1 from the CFM, which are concatenated along the channel dimension. Inspired by [14], the module simultaneously captures spatial non-local contextual features and global correlation features across different channels. As shown in Figure 3, attention weights are first obtained by applying a 1×1 convolution Wk followed by a softmax function. Subsequently, a linear feature transformation is performed with a 1×1 convolution Wv that uses a channel reduction ratio, significantly reducing the parameter count. This is followed by a sequence of layer normalization and ReLU layers, and the transformed features are finally fused with the input through addition.

2.5 Loss function

The total loss consists of the main loss Loss2, which supervises the final segmentation result, and the auxiliary loss Loss1, which supervises local information; the two are trained jointly. Each loss combines a weighted binary cross-entropy loss L^w_BCE and a weighted intersection-over-union loss L^w_IoU. The total loss can be represented as

Loss = Loss1 + Loss2, with Lossk = L^w_BCE(Mk, G) + L^w_IoU(Mk, G), k ∈ {1, 2},

where G denotes the ground-truth mask.
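The paper does not spell out the exact form of the weighted BCE and weighted IoU terms. As a non-authoritative illustration, the sketch below implements one widely used formulation in PyTorch (the boundary-aware weighting popularized by PraNet [19]); the 31×31 pooling window and the weighting factor of 5 are assumptions, not values taken from this paper.

```python
import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    """Weighted BCE + weighted IoU loss for one prediction map (logits) and its binary mask."""
    # Pixels near object boundaries receive larger weights (assumed weighting scheme).
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)

    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    pred = torch.sigmoid(pred)
    inter = ((pred * mask) * weit).sum(dim=(2, 3))
    union = ((pred + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()

# Joint supervision of the two prediction maps M1 and M2 (shapes B x 1 x H x W):
# total_loss = structure_loss(m1_logits, gt) + structure_loss(m2_logits, gt)
```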
3. EXPERIMENT

3.1 Implementation details

The experiments are implemented in the PyTorch framework, using PVTv2 [9] as the encoder backbone and the Adam optimizer with an initial learning rate of 1×10^-4. All experiments are conducted on an NVIDIA A6000 GPU (48 GB VRAM) with a batch size of 16 and are trained for 100 epochs, with all input images resized to 352×352 and a multi-scale training strategy with scale factors {0.75, 1, 1.25}. We use three prominent polyp datasets, Kvasir-SEG [15], CVC-ClinicDB [16], and CVC-ColonDB [17], to evaluate the model's performance. Table 1 summarizes the details of these datasets and their respective data splits. Notably, CVC-ColonDB is an unseen dataset used exclusively for evaluation.

Table 1. Details of the datasets, where 'n/a' indicates that the corresponding data is not available.

To quantitatively evaluate the performance of the proposed method, we employ four widely used metrics: mean Dice (mDice), mean intersection over union (mIoU), weighted F-measure (F^w_β), and S-measure (Sα) [18].

3.2 Comparisons with state-of-the-art methods

To comprehensively evaluate the effectiveness of our model, we compare it against four state-of-the-art methods: U-Net [3], UNet++ [4], PraNet [19], and Polyp-PVT [10]. All of their predictions were provided directly by the respective authors. We use the Kvasir-SEG dataset to assess the model's learning capability and the CVC-ColonDB dataset to evaluate its generalization ability. The quantitative results are listed in Table 2, and our approach achieves the best performance on all evaluated metrics. Specifically, it attains an mDice of 0.924, an mIoU of 0.872, an F^w_β of 0.917, and an Sα of 0.930 on Kvasir-SEG. On CVC-ColonDB, it achieves an mDice of 0.812, an mIoU of 0.733, an F^w_β of 0.796, and an Sα of 0.869. As shown in Table 2, CFA-PVT yields clear improvements over the compared methods on both datasets, demonstrating strong learning and generalization capabilities.

Table 2. Segmentation results of different methods. The best results are highlighted in bold and underlined.
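For reference, the sketch below shows how the two overlap-based metrics reported above (Dice and IoU) can be computed per image; scores are then averaged over the test set to give mDice and mIoU. Whether predictions are thresholded (and at what value) is an assumption here, and the weighted F-measure and S-measure follow their original definitions in [18] and are not reproduced.

```python
import numpy as np

def dice_and_iou(pred, gt, threshold=0.5, eps=1e-8):
    """Dice and IoU for one predicted probability map against one binary ground-truth mask."""
    pred_bin = (pred >= threshold).astype(np.float64)   # binarization threshold is an assumption
    gt_bin = (gt >= 0.5).astype(np.float64)
    inter = (pred_bin * gt_bin).sum()
    dice = (2 * inter + eps) / (pred_bin.sum() + gt_bin.sum() + eps)
    iou = (inter + eps) / (pred_bin.sum() + gt_bin.sum() - inter + eps)
    return dice, iou

# mDice / mIoU are the means of these per-image scores over all test images.
```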
Figure 4 presents qualitative results, illustrating the segmentation outcomes of our network and the compared methods on various polyp datasets. Polyp segmentation is challenging, particularly for polyps of varying shapes and sizes, including those obscured by folds in the colon. Nevertheless, CFA-PVT demonstrates strong segmentation performance, producing prediction maps that closely approximate the ground truth and are relatively complete.

3.3 Ablation study

To further explore the impact of each component, we evaluate each component of CFA-PVT on the Kvasir-SEG dataset in this section to gain a deeper understanding of our model. PVTv2 is used as the baseline for the ablation experiments, with components gradually removed, and the results are listed in Table 3.

Table 3. Ablation study results on Kvasir-SEG.
After removing the CFM, the performance of the model drops significantly: mDice decreases from 0.924 to 0.911 and mIoU from 0.872 to 0.850, indicating that the CFM effectively perceives the overall region and extracts rich semantic information. Removing the FEM reduces mDice and mIoU by 0.4% and 0.3%, respectively, demonstrating that it selectively focuses on crucial channel responses while suppressing irrelevant and redundant information. Removing the GAM reduces mDice and mIoU by 0.7% and 0.6%, respectively, validating the GAM's contribution to global polyp localization. These results demonstrate that each module contributes effectively to the final prediction.

4. CONCLUSION

This paper introduces CFA-PVT, a novel network for colorectal polyp segmentation. By integrating transformers with convolutional layers and employing contextual aggregation for adaptive feature fusion, CFA-PVT achieves highly accurate segmentation of polyp structures. Experimental results on the Kvasir-SEG and CVC-ColonDB datasets demonstrate superior performance compared with existing methods, confirming the effectiveness of the proposed design. Future work will extend the encoder-decoder framework to incorporate more advanced vision transformer architectures.

ACKNOWLEDGMENTS

This work is supported by the Scientific Research Project of Tianjin Education Commission, China (Grant No. 2022ZD038).

REFERENCES
[1] Kolligs, F. T., “Diagnostics and epidemiology of colorectal cancer,” Visceral Medicine 32(3), 158 (2016). https://doi.org/10.1159/000446488
[2] Sánchez-González, A., García-Zapirain, B., Sierra-Sosa, D. and Elmaghraby, A., “Automatized colon polyp segmentation via contour region analysis,” Computers in Biology and Medicine 100, 152–164 (2018). https://doi.org/10.1016/j.compbiomed.2018.07.002
[3] Ronneberger, O., Fischer, P. and Brox, T., “U-Net: Convolutional networks for biomedical image segmentation,” Lecture Notes in Computer Science, 234–241 (2015). https://doi.org/10.1007/978-3-319-24574-4
[4] Zhou, Z., Siddiquee, M. M. R., Tajbakhsh, N. and Liang, J., “UNet++: A nested U-Net architecture for medical image segmentation,” Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support 11045, 3–11 (2018).
[5] Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., et al., “Encoder-decoder with atrous separable convolution for semantic image segmentation,” Computer Vision – ECCV 2018, Pt VII, 11211, 833–851 (2018).
[6] Wei, J., Hu, Y., Zhang, R., Li, Z., Zhou, S. K., et al., “Shallow attention network for polyp segmentation,” Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, Pt I, 12901, 699–708 (2021).
[7] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929 (2020).
[8] Wang, W., Xie, E., Li, X., Fan, D.-P., et al., “Pyramid Vision Transformer: A versatile backbone for dense prediction without convolutions,” IEEE/CVF International Conference on Computer Vision, 548–558 (2021).
[9] Wang, W., Xie, E., Li, X., Fan, D.-P., et al., “PVT v2: Improved baselines with Pyramid Vision Transformer,” Computational Visual Media 8(3), 415–424 (2022). https://doi.org/10.1007/s41095-022-0274-8
[10] Dong, B., Wang, W., Fan, D.-P., Li, J., et al., “Polyp-PVT: Polyp segmentation with Pyramid Vision Transformers,” CAAI Artificial Intelligence Research, 9150015 (2023).
[11] Shi, J., Zhang, Q., Tang, Y. and Zhang, Z., “Polyp-Mixer: An efficient context-aware MLP-based paradigm for polyp segmentation,” IEEE Transactions on Circuits and Systems for Video Technology 33(1), 30–42 (2023). https://doi.org/10.1109/TCSVT.2022.3197643
[12] Woo, S., Park, J., Lee, J.-Y. and Kweon, I. S., “CBAM: Convolutional block attention module,” Computer Vision – ECCV 2018, Pt VII, 11211, 3–19 (2018).
[13] Stergiou, A., Poppe, R. and Kalliatakis, G., “Refining activation downsampling with SoftPool,” IEEE/CVF International Conference on Computer Vision, 10337–10346 (2021).
[14] Cao, Y., Xu, J., Lin, S., Wei, F., et al., “GCNet: Non-local networks meet squeeze-excitation networks and beyond,” IEEE/CVF International Conference on Computer Vision Workshops, 1971–1980 (2019).
[15] Jha, D., Smedsrud, P. H., Riegler, M. A., Halvorsen, P., et al., “Kvasir-SEG: A segmented polyp dataset,” MultiMedia Modeling, Pt II, 11962, 451–462 (2020). https://doi.org/10.1007/978-3-030-37734-2
[16] Bernal, J., Sánchez, F. J., Fernández-Esparrach, G., Gil, D., et al., “WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians,” Computerized Medical Imaging and Graphics 43, 99–111 (2015). https://doi.org/10.1016/j.compmedimag.2015.02.007
[17] Tajbakhsh, N., Gurudu, S. R. and Liang, J., “Automated polyp detection in colonoscopy videos using shape and context information,” IEEE Transactions on Medical Imaging 35(2), 630–644 (2016). https://doi.org/10.1109/TMI.2015.2487997
[18] Fan, D.-P., Cheng, M.-M., Liu, Y., Li, T. and Borji, A., “Structure-measure: A new way to evaluate foreground maps,” IEEE International Conference on Computer Vision, 4558–4567 (2017).
[19] Fan, D.-P., Ji, G.-P., Zhou, T., Chen, G., et al., “PraNet: Parallel reverse attention network for polyp segmentation,” Lecture Notes in Computer Science, 263–273 (2020). https://doi.org/10.1007/978-3-030-59725-2