1. INTRODUCTION

According to the report released by the World Health Organization (WHO) in 2021, the number of cancer deaths worldwide increased by 37% from 2000 to 2019, reaching about 9.3 million in 2019 [1]. Among the variety of cancers, gastric cancer is the fifth most common cancer and the fourth leading cause of cancer death in the world [2]. The Epstein-Barr virus (EBV)-positive tumor is one molecular subtype of gastric cancer [3]; it often responds well to immune checkpoint inhibitors [4] and has a favorable prognosis [5]. However, EBV testing is often time-consuming and costly. It would therefore be desirable to help pathologists determine more accurately whether a patient belongs to the EBV group based only on cost-efficient analysis of pathological images. Most recent works [6-8] focused on the task of classifying gastric cancer into positive and negative categories, with the exception of one recent work [9] in which a deep convolutional neural network (CNN) with a ResNet backbone [10] was trained to predict the microsatellite instability (MSI) and microsatellite stability (MSS) molecular subtypes of gastric cancer.

In this study, we propose an innovative classification method for EBV prediction based on self-supervised learning and multi-scale ensemble prediction. First, considering that adjacent regions of the same tissue in a whole slide image (WSI) often share similar pathological features, a novel formation of positive pairs is proposed for contrastive learning of the CNN feature extractor. With self-supervised learning, more unlabeled pathological images can be used to train a more generalizable feature extractor for downstream classification tasks. Second, inspired by the diagnostic process of pathologists, who often inspect WSIs at multiple magnifications, we propose a multi-scale ensemble model for the prediction of EBV status (EBV vs. non-EBV). Experiments on two external pathological image datasets show that the proposed self-supervised learning yields a more effective EBV classifier and that the multi-scale ensemble model further improves prediction stability.

2. METHOD

The proposed multi-scale ensemble classifier consists of three individual classifiers, each of which classifies image patches of a unique scale (magnification 10×, 5×, or 2.5×). The feature extractor of each individual classifier is pre-trained by self-supervised learning without using patch labels, and the classifier head is then fine-tuned on the available labelled image patches.

2.1 Self-supervised learning of the feature extractor

The motivation for self-supervised learning of each feature extractor is to speed up the whole process of classifier training. For a specific patch scale, a huge number of patches can be generated (e.g., by overlapped regular sampling from each large WSI), so training a full CNN classifier (both the feature extractor and the classifier head) on this huge number of patches would be very time-consuming. Alternatively, a feature extractor pre-trained on a natural image dataset (e.g., ImageNet) could be fixed so that only the classifier head needs to be trained, but such a pre-trained extractor may not be powerful enough to extract and represent pathological image features.
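As a concrete illustration of the overlapped regular sampling mentioned above, the following is a minimal Python sketch, assuming OpenSlide-readable WSIs scanned at a 40× base magnification; the patch size, stride, and function names are illustrative assumptions rather than the exact pipeline used in this study.

```python
import numpy as np
import openslide  # assumes WSIs are stored in an OpenSlide-readable format

PATCH = 224               # output patch size in pixels
MAGS = (10.0, 5.0, 2.5)   # target magnifications of the three classifiers
BASE_MAG = 40.0           # assumed magnification of pyramid level 0

def extract_multiscale_patches(wsi_path, stride=112):
    """Regularly sample patches at 10x, 5x, and 2.5x; a stride smaller than
    PATCH (here 112, i.e., 50% overlap) makes neighboring samples overlap."""
    slide = openslide.OpenSlide(wsi_path)
    w0, h0 = slide.dimensions                  # level-0 width and height
    patches = []
    for mag in MAGS:
        down = BASE_MAG / mag                  # downsample factor vs. level 0
        span = int(PATCH * down)               # level-0 footprint of one patch
        step = int(stride * down)
        for y in range(0, h0 - span + 1, step):
            for x in range(0, w0 - span + 1, step):
                # read at level 0 and resize; a real pipeline would read from
                # the nearest pyramid level instead for speed
                region = slide.read_region((x, y), 0, (span, span)).convert("RGB")
                patches.append((mag, x, y, np.asarray(region.resize((PATCH, PATCH)))))
    return patches
```

In practice, patches falling on background (non-tissue) regions would also be filtered out, e.g., by a simple saturation threshold, before being used for self-supervised training.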
To make full use of the huge number of pathological image patches while speeding up training, we propose using a relatively small number of patches to train the feature extractor in a self-supervised manner, and then using the huge number of patches to train only the classifier head (often a two- or three-layer MLP) with the pre-trained feature extractor fixed. In this study, a contrastive learning strategy was applied to self-train each feature extractor, and a novel way of forming positive and negative pairs of image patches was proposed. Specifically, positive pairs are generated not only from two augmentations of the same patch (Figure 1 left, from an original patch x_i to its two augmented versions), but also, novelly, from augmentations of two neighboring patches (Figure 1 left, from two original neighboring patches x_i and x_j to their augmented versions). Augmented neighboring patches are treated as positive pairs because adjacent patches come from the same tissue, which is composed of cell groups with similar morphology and the same function, and therefore often share similar morphological features. For negative pair generation, the MoCo method [11] was adopted: a large number of negative pairs can be formed from a memory buffer that stores feature representations of patches from both the current mini-batch and hundreds of previous mini-batches. In general, the two patches of a negative pair come from two different WSIs or from different locations in one WSI. Based on the positive and negative pairs generated at a specific patch scale, the corresponding MoCo model can be trained well, and its trained encoder is kept as the feature extractor for the corresponding patch classifier (Figure 1 right); a code sketch of this pairing scheme is given at the end of Section 2.2.

2.2 Multi-scale ensemble classifier

Once a feature extractor has been self-trained on image patches of one scale, it can be used to extract a feature vector for each labelled patch of the same scale, where the labelled patches come from the annotated EBV regions of each WSI. Then, on top of the feature extractor, an MLP classifier head can be trained to predict each patch as EBV or non-EBV. Compared to training a full CNN classifier (i.e., the CNN feature extractor plus the final FC output layer), training a two- or three-layer MLP is much more efficient because of the much smaller number of model parameters and the smaller input representation (feature vectors rather than image patches). Once each MLP has been trained on top of the corresponding feature extractor, the extractor and the MLP are combined into a CNN classifier. With three image magnifications, three such CNN classifiers are ensembled for the prediction of any new image patch (Figure 2). During inference, at each location (often representing a local region) in a new WSI, three image patches at the different magnifications are cropped and fed into the corresponding classifiers, producing three output probabilities; their average is used as the final prediction probability for that location.
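To make the proposed pair formation concrete, the following minimal PyTorch sketch computes one MoCo-style InfoNCE loss in which the positive key for each query patch is an augmentation of either the patch itself or one of its spatial neighbors. The encoders, queue, and augmentation pipeline follow MoCo [11]; all function and variable names here are illustrative assumptions, not the exact implementation.

```python
import random
import torch
import torch.nn.functional as F

def moco_neighbor_loss(f_q, f_k, queue, batch, neighbors, augment, tau=0.07):
    """One contrastive step with neighbor-based positives (illustrative).

    f_q / f_k : query encoder and momentum (key) encoder, as in MoCo [11]
    queue     : (K, d) tensor of key features from previous mini-batches
    batch     : list of image patches, each a tensor of shape (3, 224, 224)
    neighbors : neighbors[i] is a list of patches spatially adjacent to batch[i]
    augment   : stochastic augmentation applied independently to every patch
    """
    # query view: an augmentation of the anchor patch itself
    q_in = torch.stack([augment(x) for x in batch])
    # key view: an augmentation of the anchor OR of one of its neighbors,
    # so adjacent patches of the same tissue also form positive pairs
    k_in = torch.stack([
        augment(random.choice(nbrs) if nbrs and random.random() < 0.5 else x)
        for x, nbrs in zip(batch, neighbors)
    ])
    q = F.normalize(f_q(q_in), dim=1)            # (N, d)
    with torch.no_grad():
        k = F.normalize(f_k(k_in), dim=1)        # (N, d), no gradient to keys
    l_pos = (q * k).sum(dim=1, keepdim=True)     # (N, 1) positive logits
    l_neg = q @ queue.t()                        # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(len(batch), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)       # InfoNCE: positive is class 0
```

The queue update and the momentum update of f_k are maintained exactly as in MoCo and are omitted here for brevity.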
3. EXPERIMENT

3.1 Experimental setup

Three pathological image datasets of gastric cancer were used to evaluate the proposed method: one public dataset, TCGA, and two private datasets, SYSUCC-Internal and SYSUCC-MultiCenter (see Table 1 for details). SYSUCC-Internal was used to self-train the feature extractors, while SYSUCC-MultiCenter and TCGA were each used to train the classifier heads and to evaluate classifier performance. SYSUCC-MultiCenter and TCGA were each split into five folds at the slide level, and five-fold cross-validation was adopted for evaluation, each time with three folds for MLP training, one fold for validation, and the remaining fold for testing. The average area under the ROC curve (AUC) over the five testing folds was reported for SYSUCC-MultiCenter and TCGA respectively. Note that only image patches from tissue regions in SYSUCC-Internal were regularly extracted for feature extractor training, while patches from annotated tumor regions in SYSUCC-MultiCenter and TCGA were used for classifier head training and classifier evaluation.

Table 1. Statistics of datasets.
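As a sketch of the slide-level splitting described above, scikit-learn's GroupKFold can be used to guarantee that all patches of one slide fall into the same fold; the variable names are illustrative assumptions, and the paper's exact fold assignment may differ.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def slide_level_folds(patch_features, patch_labels, slide_ids, n_folds=5):
    """Yield (train, val, test) index arrays with no slide shared across
    folds: all patches of a slide land in exactly one fold."""
    gkf = GroupKFold(n_splits=n_folds)
    folds = [test_idx for _, test_idx in
             gkf.split(patch_features, patch_labels, groups=slide_ids)]
    # each cross-validation round: 3 folds train, 1 validation, 1 test
    for i in range(n_folds):
        test, val = folds[i], folds[(i + 1) % n_folds]
        train = np.concatenate([folds[j] for j in range(n_folds)
                                if j not in (i, (i + 1) % n_folds)])
        yield train, val, test
```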
The three feature extractors share the same backbone, i.e., the convolutional layers of ResNet-50. Since the MoCo v2 [12] method was used to train the feature extractors, a two-layer MLP projection head, as in MoCo v2, was attached to each feature extractor to reduce the feature dimension from 2048 to 128. Each feature extractor was trained by the MoCo method with the suggested hyper-parameters, using the SGD optimizer with batch size 128 for 100 epochs. Each image patch was pre-processed by Vahadane's method [13] for color normalization and randomly cropped to 224×224 pixels, and general data augmentations were performed, including random rotation, horizontal and vertical flips, and brightness, contrast, and saturation changes.

The three classifier heads also share one architecture, i.e., a three-layer MLP with layer output dimensions of 2048, 2048, and 2, respectively. Each MLP was trained by the SGD optimizer (batch size 256, momentum 0.9) for 100 epochs with an initial learning rate of 0.001. In the first 20 epochs the learning rate was adjusted by linear warmup, and it was then dynamically adjusted by cosine annealing over the remaining epochs. For data processing, each patch from TCGA and SYSUCC-MultiCenter was center-cropped to 224×224 pixels and fed to the corresponding well-trained feature extractor to obtain the feature vector used as input to the classifier head.

3.2 Performance evaluation

Both patch-level and WSI-level AUCs were reported for SYSUCC-MultiCenter and TCGA respectively. For slide-level AUCs, the classification probabilities of all patches in each WSI were averaged to give the classification probability of the corresponding WSI. The original MoCo and MoCo v2 were used as two self-supervised learning baselines for comparison. In addition, simultaneously fine-tuning each pre-trained feature extractor (initialized on the ImageNet dataset) while training the classifier head was used as another baseline ('Finetune'), where the batch size was set to 32 because of the large memory consumption. From Table 2, it can be observed that on TCGA our method achieved the best patch-level and slide-level performance at 2.5× magnification, while on SYSUCC-MultiCenter it achieved the best patch-level performance and comparable slide-level performance at 10× magnification. This demonstrates that although the proposed self-supervised learning helps improve classification performance, patch scale affects model performance, and its effect may vary across datasets. Compared to the single-scale classifiers, which were affected by patch scale, the proposed multi-scale ensemble classifier (Table 2, last row) was more stable, achieving the best patch-level and slide-level performance on SYSUCC-MultiCenter and comparable performance on TCGA.

Table 2. Performance comparison between our method and baselines.
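For concreteness, the following is a minimal PyTorch sketch of the classifier-head architecture and learning-rate schedule described in Section 3.1, together with the slide-level AUC aggregation used in the evaluation; the per-epoch warmup granularity and the helper names are assumptions, not the exact implementation.

```python
import math
import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score

# three-layer MLP head on top of frozen 2048-d ResNet-50 feature vectors
head = nn.Sequential(
    nn.Linear(2048, 2048), nn.ReLU(inplace=True),
    nn.Linear(2048, 2048), nn.ReLU(inplace=True),
    nn.Linear(2048, 2),    # logits for EBV vs. non-EBV
)
optimizer = torch.optim.SGD(head.parameters(), lr=0.001, momentum=0.9)

EPOCHS, WARMUP, BASE_LR = 100, 20, 0.001

def lr_at(epoch):
    """Linear warmup over the first 20 epochs, then cosine annealing."""
    if epoch < WARMUP:
        return BASE_LR * (epoch + 1) / WARMUP
    t = (epoch - WARMUP) / (EPOCHS - WARMUP)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * t))

for epoch in range(EPOCHS):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(epoch)
    # ... one training epoch over feature vectors with batch size 256 ...

def slide_auc(patch_probs, patch_slides, slide_labels):
    """Slide-level AUC: average the EBV probabilities of each slide's patches."""
    by_slide = {}
    for prob, sid in zip(patch_probs, patch_slides):
        by_slide.setdefault(sid, []).append(prob)
    sids = sorted(by_slide)
    scores = [sum(by_slide[s]) / len(by_slide[s]) for s in sids]
    return roc_auc_score([slide_labels[s] for s in sids], scores)
```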
4. CONCLUSION

In this study, we proposed a self-supervised learning method for classifying pathological images into EBV or non-EBV categories, with the help of a novel positive pair formation based on neighboring patches in each WSI. By training the feature extractor with self-supervised learning, potentially many more image patches can be used to efficiently train only a classifier head. In addition, the fusion of multi-scale classifiers further improved the stability of EBV prediction.

REFERENCES
[1] World Health Organization, "World Health Statistics 2021: Monitoring Health for the SDGs, Sustainable Development Goals," (2021).
[2] Bray, F., Ferlay, J., Soerjomataram, I., et al., "Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries," CA: A Cancer Journal for Clinicians, 68(6), 394-424 (2018).
[3] Cancer Genome Atlas Research Network, "Comprehensive molecular characterization of gastric adenocarcinoma," Nature, 513(7517), 202-209 (2014). https://doi.org/10.1038/nature13480
[4] Kim, S. T., Cristescu, R., Bass, A. J., et al., "Comprehensive molecular characterization of clinical responses to PD-1 inhibition in metastatic gastric cancer," Nature Medicine, 24(9), 1449-1458 (2018). https://doi.org/10.1038/s41591-018-0101-z
[5] Qiu, M. Z., He, C. Y., Lu, S. X., et al., "Prospective observation: clinical utility of plasma Epstein-Barr virus DNA load in EBV-associated gastric carcinoma patients," International Journal of Cancer, 146(1), 272-280 (2020). https://doi.org/10.1002/ijc.v146.1
[6] Oikawa, K., Saito, A., Kiyuna, T., et al., "Pathological diagnosis of gastric cancers with a novel computerized analysis system," Journal of Pathology Informatics, 8(1), 5 (2017). https://doi.org/10.4103/2153-3539.201114
[7] Song, Z., Zou, S., Zhou, W., et al., "Clinically applicable histopathological diagnosis system for gastric cancer detection using deep learning," Nature Communications, 11(1), 1-9 (2020). https://doi.org/10.1038/s41467-020-18147-8
[8] Tsaku, N. Z., Kosaraju, S. C., Aqila, T., et al., "Texture-based deep learning for effective histopathological cancer image classification," in IEEE International Conference on Bioinformatics and Biomedicine, 973-977 (2019).
[9] Kather, J. N., Pearson, A. T., Halama, N., et al., "Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer," Nature Medicine, 25(7), 1054-1056 (2019). https://doi.org/10.1038/s41591-019-0462-y
[10] He, K., Zhang, X., Ren, S., et al., "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778 (2016).
[11] He, K., Fan, H., Wu, Y., et al., "Momentum contrast for unsupervised visual representation learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729-9738 (2020).
[12] Chen, X., Fan, H., Girshick, R., et al., "Improved baselines with momentum contrastive learning," arXiv:2003.04297 (2020).
[13] Vahadane, A., Peng, T., Albarqouni, S., et al., "Structure-preserved color normalization for histological images," in IEEE International Symposium on Biomedical Imaging, 1012-1015 (2015).