1. INTRODUCTION

Colorectal cancer (CRC) originates from colorectal polyps. Advanced CRC has a very high mortality rate and poses a serious threat to human health, yet if detected early it is one of the most curable cancers [1]. Therefore, accurately locating polyps at an early stage is crucial for the diagnosis of colorectal cancer. However, because manual diagnosis is experience-dependent and time-consuming, there is an urgent need for reliable automatic polyp segmentation methods that can precisely locate lesion areas in medical images. In clinical practice, polyp segmentation must achieve high accuracy, robustness, and efficiency.

Early methods were primarily based on shape, color, and superpixels. For instance, Sánchez-González et al. [2] proposed automatic colon polyp segmentation through contour region analysis. Thanks to the powerful visual representation capabilities of deep learning, significant progress has since been made in polyp segmentation. Previous studies have used convolutional neural network (CNN)-based algorithms for polyp segmentation. Among them, U-Net [3] has demonstrated excellent segmentation performance in biomedical imaging, and various U-Net variants have emerged; UNet++ [4], for example, has shown outstanding performance in polyp segmentation tasks. However, these methods often fail to capture small or structurally complex regions. DeepLabV3+ [5] attempts to address this issue by connecting low-level feature maps from the backbone network, but this is insufficient to recover such details from the input image. To solve these problems, a more powerful polyp segmentation architecture is needed. SANet [6] focuses on rich shallow features and introduces a Probability Correction Strategy (PCS).

In recent years, the Vision Transformer (ViT) [7], which first introduced transformers to image recognition, has been applied successfully to computer vision tasks. ViT divides the input image into patches, treats each patch as a token, and uses multi-head attention to extract feature information. Compared with CNN-based methods, ViT is stronger at capturing long-range dependencies. However, it suffers from weak local feature representation and lacks the inductive biases inherent to CNNs, making it difficult for ViT to excel in dense prediction tasks that require high resolution. To address these challenges, researchers introduced the Pyramid Vision Transformer (PVT) [8] and PVTv2 [9], which benefit from a progressive shrinking pyramid design and are more efficient than the original ViT. Although Polyp-PVT [10] achieved higher segmentation accuracy in polyp segmentation using PVT, its ability to exploit pyramid features is limited: features from different layers are handled separately, so there is little effective information interaction among them. Subsequent polyp segmentation models have focused on enriching contextual information to improve accuracy. Shi et al. [11] introduced the efficient context-aware Polyp-Mixer network, which uses multi-head mixers to capture context information across various subspaces. Motivated by these observations, this paper proposes CFA-PVT, which combines a PVT encoder with stage-bridging feature fusion, feature enhancement, and global adaptive modules for accurate polyp segmentation.

2. METHODOLOGY

2.1 Overview

As shown in Figure 1, the network architecture takes polyp images as input.
It extracts multi-scale long-range dependency features laterally from the PVT encoder, generating feature maps Xi ∈ ℝ^(Ci×Hi×Wi) (i ∈ {1, 2, 3, 4}) with spatial resolutions of H/4 × W/4, H/8 × W/8, H/16 × W/16, and H/32 × W/32, respectively, where Ci ∈ {64, 128, 320, 512}. These features are progressively aggregated across stages via the stage bridging module (CFM) to capture richer semantic information, yielding the fused feature Γ1 and the polyp prediction mask M1. The feature enhancement module (FEM) uses channel and spatial attention to strengthen local detail information for better polyp localization and to obtain the enhanced low-level feature Γ2. Meanwhile, the global adaptive module (GAM) aligns the high-level and low-level features Γ1 and Γ2 and outputs the polyp prediction mask M2. Finally, the model is optimized jointly with Loss1 and Loss2.

2.2 Stage bridging module

As shown in Figure 1, the network uses the Pyramid Vision Transformer PVTv2 [9], a variant of PVT [8], as the encoder. Given an input image I ∈ ℝ^(C×H×W), we first adjust the channels of the high-level features X2, X3, and X4 from the last three stages to 32 through 1×1 convolutions. Subsequently, X4 is upsampled by 2× and 4×, each followed by a 1×1-3×3-1×1 residual convolution block, and then concatenated with X3 and X2, respectively. The resulting feature maps are then simply fused to obtain the final feature map Γ1. By using the CFM to integrate features in this way, we fully exploit local and global contextual information. This top-down, multi-stage bridging feature fusion effectively guides the network to recover missing parts and details of objects.

2.3 Feature enhancement module

Low-level features typically contain rich details such as textures and edges, but because they are close to the input, they also carry a large amount of redundant information. Inspired by [12], as shown in Figure 2, we use channel and spatial attention mechanisms to enhance the interaction between global and local information, refine the features, and filter out irrelevant responses. This allows the model to focus on important channels and spatial locations and thus capture polyp details more accurately. First, the feature map is aggregated using average pooling, max pooling, and soft pooling [13], and each aggregated descriptor is enhanced individually through a shared MLP network. The channel attention map Mc is obtained by

Mc = σ(η(Pavg(F)) + η(Pmax(F)) + η(Psoft(F))),

where σ(·) is the sigmoid function, η denotes the shared MLP, and Pavg(·), Pmax(·), and Psoft(·) represent average pooling, max pooling, and soft pooling, respectively. Next, the outputs of the three pooling operations computed along the channel dimension are concatenated, and a 7×7 convolutional layer followed by a sigmoid function produces the spatial attention map

Ms = σ(f^(7×7)([Fmax; Favg; Fsoft])),

where f^(7×7) denotes a convolution with a 7×7 kernel and Fmax, Favg, and Fsoft are the maximum, average, and exponentially weighted values obtained along the channel dimension.
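To make the FEM attention computation concrete, the following PyTorch sketch shows one possible implementation of the channel and spatial attention described above. It is a minimal illustration rather than the paper's implementation: the channel count, the MLP reduction ratio, and the hand-written soft-pooling helper (used in place of the SoftPool package of [13]) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_pool(x, dim):
    # Exponentially weighted (soft) pooling along `dim`, following the idea of SoftPool [13].
    w = torch.softmax(x, dim=dim)
    return (w * x).sum(dim=dim, keepdim=True)

class FeatureEnhancementModule(nn.Module):
    """Sketch of the FEM: channel attention from avg/max/soft pooled descriptors passed
    through a shared MLP, then spatial attention from a 7x7 convolution."""

    def __init__(self, channels=32, reduction=4):  # channel count and reduction ratio are assumptions
        super().__init__()
        self.mlp = nn.Sequential(                   # shared MLP (the eta(.) above)
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.conv7x7 = nn.Conv2d(3, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        b, c, h, w = x.shape
        # Channel attention: three pooled descriptors, each refined by the shared MLP, then summed.
        avg = F.adaptive_avg_pool2d(x, 1)
        mx = F.adaptive_max_pool2d(x, 1)
        soft = soft_pool(x.view(b, c, -1), dim=2).view(b, c, 1, 1)
        mc = torch.sigmoid(self.mlp(avg) + self.mlp(mx) + self.mlp(soft))
        x = x * mc
        # Spatial attention: max / average / soft values along the channel dimension, 7x7 conv.
        mx_s, _ = x.max(dim=1, keepdim=True)
        avg_s = x.mean(dim=1, keepdim=True)
        soft_s = soft_pool(x, dim=1)
        ms = torch.sigmoid(self.conv7x7(torch.cat([mx_s, avg_s, soft_s], dim=1)))
        return x * ms
```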
2.4 Global adaptive module

This module takes two inputs: the low-level feature Γ2 from the FEM and the high-level feature Γ1 from the CFM, which are concatenated along the channel dimension. Inspired by [14], the module simultaneously captures spatial non-local contextual features and global correlation features across different channels. As shown in Figure 3, attention weights are first obtained by applying a 1×1 convolution Wk followed by a softmax function. Subsequently, a linear feature transformation is performed with a 1×1 convolution Wv that uses a channel reduction ratio, significantly reducing the parameter count. This is followed by a sequence of layer normalization and ReLU layers, and the transformed features are finally fused with the input through addition.

2.5 Loss function

The total loss consists of the main loss Loss2, which supervises the final segmentation result, and the auxiliary loss Loss1, which supervises local information; the two are trained jointly. Each loss combines a weighted binary cross-entropy loss L^w_BCE and a weighted intersection-over-union loss L^w_IoU. The total loss can be represented as

Loss = Loss1 + Loss2, with Lossk = L^w_BCE(Mk, G) + L^w_IoU(Mk, G), k ∈ {1, 2},

where G denotes the ground-truth mask.
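The paper does not spell out the exact form of the weighted BCE and weighted IoU terms. As a non-authoritative illustration, the sketch below implements one widely used formulation in PyTorch (the boundary-aware weighting popularized by PraNet [19]); the 31×31 pooling window and the weighting factor of 5 are assumptions, not values taken from this paper.

```python
import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    """Weighted BCE + weighted IoU loss for one prediction map (logits) and its binary mask."""
    # Pixels near object boundaries receive larger weights (assumed weighting scheme).
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)

    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    pred = torch.sigmoid(pred)
    inter = ((pred * mask) * weit).sum(dim=(2, 3))
    union = ((pred + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()

# Joint supervision of the two prediction maps M1 and M2 (shapes B x 1 x H x W):
# total_loss = structure_loss(m1_logits, gt) + structure_loss(m2_logits, gt)
```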
3. EXPERIMENT

3.1 Implementation details

The experiments are implemented in the PyTorch framework, using PVTv2 [9] as the encoder backbone and the Adam optimizer with an initial learning rate of 1×10^-4. All experiments are conducted on an NVIDIA A6000 GPU (48 GB VRAM) with a batch size of 16 and are trained for 100 epochs, with all input images resized to 352×352 and a multi-scale training strategy with scale factors {0.75, 1, 1.25}. We use three prominent polyp datasets, Kvasir-SEG [15], CVC-ClinicDB [16], and CVC-ColonDB [17], to evaluate the model's performance. Table 1 summarizes the details of these datasets and their respective data splits. Notably, CVC-ColonDB is an unseen dataset used exclusively for evaluation.

Table 1. Details of the datasets, where 'n/a' indicates that the corresponding data is not available.

To quantitatively evaluate the performance of the proposed method, we employ four widely used metrics: mean Dice (mDice), mean intersection over union (mIoU), weighted F-measure (F^w_β), and S-measure (Sα) [18].

3.2 Comparisons with state-of-the-art methods

To comprehensively evaluate the effectiveness of our model, we compare it against four state-of-the-art methods: U-Net [3], UNet++ [4], PraNet [19], and Polyp-PVT [10]. All of their predictions were provided directly by the respective authors. We use the Kvasir-SEG dataset to assess the model's learning capability and the CVC-ColonDB dataset to evaluate its generalization ability. The quantitative results are listed in Table 2, and our approach achieves the best performance on all evaluated metrics. Specifically, it attains an mDice of 0.924, an mIoU of 0.872, an F^w_β of 0.917, and an Sα of 0.930 on Kvasir-SEG. On CVC-ColonDB, it achieves an mDice of 0.812, an mIoU of 0.733, an F^w_β of 0.796, and an Sα of 0.869. As shown in Table 2, CFA-PVT yields clear improvements over the compared methods on both datasets, demonstrating strong learning and generalization capabilities.

Table 2. Segmentation results of different methods. The best results are highlighted in bold and underlined.
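For reference, the sketch below shows how the two overlap-based metrics reported above (Dice and IoU) can be computed per image; scores are then averaged over the test set to give mDice and mIoU. Whether predictions are thresholded (and at what value) is an assumption here, and the weighted F-measure and S-measure follow their original definitions in [18] and are not reproduced.

```python
import numpy as np

def dice_and_iou(pred, gt, threshold=0.5, eps=1e-8):
    """Dice and IoU for one predicted probability map against one binary ground-truth mask."""
    pred_bin = (pred >= threshold).astype(np.float64)   # binarization threshold is an assumption
    gt_bin = (gt >= 0.5).astype(np.float64)
    inter = (pred_bin * gt_bin).sum()
    dice = (2 * inter + eps) / (pred_bin.sum() + gt_bin.sum() + eps)
    iou = (inter + eps) / (pred_bin.sum() + gt_bin.sum() - inter + eps)
    return dice, iou

# mDice / mIoU are the means of these per-image scores over all test images.
```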
Figure 4 presents qualitative results, illustrating the segmentation outcomes of our network and the compared methods on various polyp datasets. Polyp segmentation is challenging, particularly for polyps of varying shapes and sizes, including those obscured by folds in the colon. Nevertheless, CFA-PVT demonstrates strong segmentation performance, producing prediction maps that closely approximate the ground truth and are relatively complete.

3.3 Ablation study

To further explore the impact of each component, we evaluate each component of CFA-PVT on the Kvasir-SEG dataset in this section to gain a deeper understanding of our model. PVTv2 is used as the baseline for the ablation experiments, with components gradually removed, and the results are listed in Table 3.

Table 3. Ablation study results on Kvasir-SEG.
After removing the CFM, the performance of the model drops significantly: mDice decreases from 0.924 to 0.911 and mIoU from 0.872 to 0.850, indicating that the CFM effectively perceives the overall region and extracts rich semantic information. Removing the FEM reduces mDice and mIoU by 0.4% and 0.3%, respectively, demonstrating that it selectively focuses on crucial channel responses while suppressing irrelevant and redundant information. Removing the GAM reduces mDice and mIoU by 0.7% and 0.6%, respectively, validating the GAM's contribution to global polyp localization. These results demonstrate that each module contributes effectively to the final prediction.

4. CONCLUSION

This paper introduces CFA-PVT, a novel network for colorectal polyp segmentation. By integrating transformers with convolutional layers and employing contextual aggregation for adaptive feature fusion, CFA-PVT achieves highly accurate segmentation of polyp structures. Experimental results on the Kvasir-SEG and CVC-ColonDB datasets demonstrate superior performance compared with existing methods, confirming the effectiveness of the proposed design. Future work will extend the encoder-decoder framework to incorporate more advanced vision transformer architectures.

ACKNOWLEDGMENTS

This work is supported by the Scientific Research Project of Tianjin Education Commission, China (Grant No. 2022ZD038).

REFERENCES
[1] Kolligs, F. T., “Diagnostics and epidemiology of colorectal cancer,” Visceral Medicine 32(3), 158 (2016). https://doi.org/10.1159/000446488
[2] Sánchez-González, A., García-Zapirain, B., Sierra-Sosa, D. and Elmaghraby, A., “Automatized colon polyp segmentation via contour region analysis,” Computers in Biology and Medicine 100, 152–164 (2018). https://doi.org/10.1016/j.compbiomed.2018.07.002
[3] Ronneberger, O., Fischer, P. and Brox, T., “U-Net: Convolutional networks for biomedical image segmentation,” Lecture Notes in Computer Science, 234–241 (2015). https://doi.org/10.1007/978-3-319-24574-4
[4] Zhou, Z., Siddiquee, M. M. R., Tajbakhsh, N. and Liang, J., “UNet++: A nested U-Net architecture for medical image segmentation,” Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support 11045, 3–11 (2018).
[5] Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., et al., “Encoder-decoder with atrous separable convolution for semantic image segmentation,” Computer Vision – ECCV 2018, Pt VII, 11211, 833–851 (2018).
[6] Wei, J., Hu, Y., Zhang, R., Li, Z., Zhou, S. K., et al., “Shallow attention network for polyp segmentation,” Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, Pt I, 12901, 699–708 (2021).
[7] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929 (2020).
[8] Wang, W., Xie, E., Li, X., Fan, D.-P., et al., “Pyramid Vision Transformer: A versatile backbone for dense prediction without convolutions,” IEEE/CVF International Conference on Computer Vision, 548–558 (2021).
[9] Wang, W., Xie, E., Li, X., Fan, D.-P., et al., “PVT v2: Improved baselines with Pyramid Vision Transformer,” Computational Visual Media 8(3), 415–424 (2022). https://doi.org/10.1007/s41095-022-0274-8
[10] Dong, B., Wang, W., Fan, D.-P., Li, J., et al., “Polyp-PVT: Polyp segmentation with Pyramid Vision Transformers,” CAAI Artificial Intelligence Research, 9150015 (2023).
[11] Shi, J., Zhang, Q., Tang, Y. and Zhang, Z., “Polyp-Mixer: An efficient context-aware MLP-based paradigm for polyp segmentation,” IEEE Transactions on Circuits and Systems for Video Technology 33(1), 30–42 (2023). https://doi.org/10.1109/TCSVT.2022.3197643
[12] Woo, S., Park, J., Lee, J.-Y. and Kweon, I. S., “CBAM: Convolutional block attention module,” Computer Vision – ECCV 2018, Pt VII, 11211, 3–19 (2018).
[13] Stergiou, A., Poppe, R. and Kalliatakis, G., “Refining activation downsampling with SoftPool,” IEEE/CVF International Conference on Computer Vision, 10337–10346 (2021).
[14] Cao, Y., Xu, J., Lin, S., Wei, F., et al., “GCNet: Non-local networks meet squeeze-excitation networks and beyond,” IEEE/CVF International Conference on Computer Vision Workshops, 1971–1980 (2019).
[15] Jha, D., Smedsrud, P. H., Riegler, M. A., Halvorsen, P., et al., “Kvasir-SEG: A segmented polyp dataset,” MultiMedia Modeling, Pt II, 11962, 451–462 (2020). https://doi.org/10.1007/978-3-030-37734-2
[16] Bernal, J., Sánchez, F. J., Fernández-Esparrach, G., Gil, D., et al., “WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians,” Computerized Medical Imaging and Graphics 43, 99–111 (2015). https://doi.org/10.1016/j.compmedimag.2015.02.007
[17] Tajbakhsh, N., Gurudu, S. R. and Liang, J., “Automated polyp detection in colonoscopy videos using shape and context information,” IEEE Transactions on Medical Imaging 35(2), 630–644 (2016). https://doi.org/10.1109/TMI.2015.2487997
[18] Fan, D.-P., Cheng, M.-M., Liu, Y., Li, T. and Borji, A., “Structure-measure: A new way to evaluate foreground maps,” IEEE International Conference on Computer Vision, 4558–4567 (2017).
[19] Fan, D.-P., Ji, G.-P., Zhou, T., Chen, G., et al., “PraNet: Parallel reverse attention network for polyp segmentation,” Lecture Notes in Computer Science, 263–273 (2020). https://doi.org/10.1007/978-3-030-59725-2