Automated object detection is becoming increasingly relevant in a wide variety of military applications, including the detection of drones, ships, and vehicles in visible-light and IR video. In recent years, deep learning-based object detection methods, such as YOLO, have shown promise in many object detection applications. However, current methods have limited success when objects of interest span only a few pixels, e.g. objects far away or small objects closer by. This matters because accurate small object detection translates to early detection, and the earlier an object is detected, the more time is available for action. In this study, we investigate novel image analysis techniques designed to address some of the challenges of (very) small object detection by taking temporal information into account. We implement six methods, three of which are based on deep learning and use the temporal context of a set of frames within a video. These methods consider neighboring frames when detecting objects, either by stacking them as additional channels or by computing difference maps. We compare these spatio-temporal deep learning methods with YOLO-v8, which considers only single frames, and with two traditional moving object detection methods. Evaluation is performed on a set of videos encompassing a wide variety of challenges, including various objects, scenes, and acquisition conditions, to assess real-world performance.
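The two ways of injecting temporal context mentioned in the abstract (stacking neighboring frames as channels, and difference maps) can be sketched as follows; this is an illustrative NumPy sketch, not the authors' implementation, and the function names and neighbor count are assumptions.

```python
import numpy as np

def stack_temporal_channels(frames, t, n_neighbors=1):
    """Stack frame t with its temporal neighbors along the channel axis,
    clamping indices at the start and end of the video."""
    idx = [min(max(t + d, 0), len(frames) - 1)
           for d in range(-n_neighbors, n_neighbors + 1)]
    return np.stack([frames[i] for i in idx], axis=-1)

def difference_map(frames, t):
    """Absolute difference between frame t and its predecessor,
    which highlights moving (potentially very small) objects."""
    prev = frames[max(t - 1, 0)].astype(np.int16)
    return np.abs(frames[t].astype(np.int16) - prev).astype(np.uint8)
```

With `n_neighbors=1`, a grayscale frame becomes a 3-channel input that a standard detector backbone can consume without architectural changes.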
Deep learning has emerged as a powerful tool for image analysis in various fields, including the military domain. It has the potential to automate and enhance tasks such as object detection, classification, and tracking. Training images for the development of such models are typically scarce due to the restricted nature of this type of data. Consequently, researchers have focused on using synthetic data for model development, since simulated images are fast to generate and can, in theory, form a large and diverse data set. When using simulated training data, it is important to consider the variety needed to bridge the gap between simulated and real data. So far, it is not fully understood which variations are important and how much variation is needed. In this study, we investigate the effect of simulation variety. We do so for the development of a deep learning-based military vehicle detector that is evaluated on real-world images of military vehicles. To construct the synthetic training data, 3D models of the vehicles are placed in front of diverse background scenes. We experiment with the number of images, background scene variations, 3D model variations, model textures, camera-object distance, and various object rotations. The insights we gain can be used to prioritize future efforts in creating synthetic data for deep learning-based object detection models.
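The core compositing step described above (placing a rendered vehicle in front of a background scene) can be sketched as a simple alpha blend; this is a minimal stand-in for the paper's rendering setup, with array shapes and the function name chosen for illustration.

```python
import numpy as np

def composite(background, obj, mask, top, left):
    """Alpha-composite a rendered object crop onto a background scene.

    background: (H, W, 3) uint8 scene image
    obj:        (h, w, 3) rendered vehicle crop
    mask:       (h, w) float alpha in [0, 1] from the renderer
    (top, left): paste position of the crop's upper-left corner
    """
    out = background.copy()
    h, w = mask.shape
    m = mask[..., None]  # broadcast alpha over the color channels
    region = out[top:top + h, left:left + w].astype(float)
    blended = m * obj.astype(float) + (1.0 - m) * region
    out[top:top + h, left:left + w] = blended.astype(np.uint8)
    return out
```

Varying `(top, left)`, the crop scale, and the chosen background per call is one way to realize the background, distance, and rotation variations the study examines.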
Purpose: Ensembles of convolutional neural networks (CNNs) often outperform a single CNN in medical image segmentation tasks, but inference is computationally more expensive and makes ensembles unattractive for some applications. We compared the performance of differently constructed ensembles with the performance of CNNs derived from these ensembles using knowledge distillation, a technique for reducing the footprint of large models such as ensembles.
Approach: We investigated two different types of ensembles, namely, diverse ensembles of networks with three different architectures and two different loss functions, and uniform ensembles of networks with the same architecture but initialized with different random seeds. Additionally, for each ensemble, a single student network was trained to mimic the class probabilities predicted by the teacher model (the ensemble). We evaluated the performance of each network, the ensembles, and the corresponding distilled networks across three different publicly available datasets. These included chest computed tomography scans with four annotated organs of interest, brain magnetic resonance imaging (MRI) with six annotated brain structures, and cardiac cine-MRI with three annotated heart structures.
Results: Both uniform and diverse ensembles obtained better results than any of the individual networks in the ensemble. Furthermore, applying knowledge distillation resulted in a single network that was smaller and faster without compromising performance compared with the ensemble it learned from. The distilled networks significantly outperformed the same network trained directly on reference segmentations instead of via knowledge distillation.
Conclusion: Knowledge distillation can compress segmentation ensembles of uniform or diverse composition into a single CNN while maintaining the performance of the ensemble.
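The teacher-student setup described in this abstract (a student trained on the ensemble's averaged class probabilities) can be sketched as follows; this is a minimal NumPy illustration of the idea, not the authors' training code, and the soft cross-entropy loss shown is one common choice among several distillation losses.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_teacher(member_logits):
    """Average the softmax outputs of the ensemble members per voxel,
    yielding the soft class probabilities the student mimics."""
    return np.mean([softmax(m) for m in member_logits], axis=0)

def distillation_loss(student_logits, teacher_probs, eps=1e-8):
    """Cross-entropy between the teacher's soft probabilities and the
    student's predicted distribution, averaged over voxels."""
    log_p = np.log(softmax(student_logits) + eps)
    return float(-(teacher_probs * log_p).sum(axis=-1).mean())
```

Minimizing this loss over the student's parameters drives its per-voxel class distribution toward the ensemble's, which is what lets a single CNN retain the ensemble's performance at a fraction of the inference cost.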
Type 2 Diabetes (T2D) is a chronic metabolic disorder that can lead to blindness and cardiovascular disease. Information about early-stage T2D might be present in retinal fundus images, but to what extent these images can be used in a screening setting is still unknown. In this study, deep neural networks were employed to differentiate between fundus images from individuals with and without T2D. We investigated three methods to achieve high classification performance, measured by the area under the receiver operating characteristic curve (ROC-AUC). A multi-target learning approach that simultaneously outputs retinal biomarkers as well as T2D performed best (AUC = 0.746 [±0.001]). Furthermore, the classification performance can be improved when images with high prediction uncertainty are referred to a specialist. We also show that combining the images of the left and right eye per individual can further improve the classification performance (AUC = 0.758 [±0.003]), using a simple averaging approach. The results are promising, suggesting the feasibility of screening for T2D from retinal fundus images.
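The two post-processing steps mentioned in the abstract (averaging the per-eye predictions and referring uncertain cases) are simple to sketch; the band thresholds below are illustrative assumptions, not values taken from the study.

```python
import numpy as np

def combine_eyes(p_left, p_right):
    """Average the per-eye T2D probabilities into one score per individual,
    as in the simple averaging approach described in the abstract."""
    return 0.5 * (np.asarray(p_left) + np.asarray(p_right))

def refer_uncertain(probs, band=(0.3, 0.7)):
    """Flag predictions inside an uncertainty band for specialist referral.
    The band edges are hypothetical; the study's criterion may differ."""
    probs = np.asarray(probs)
    return (probs > band[0]) & (probs < band[1])
```

Excluding the flagged cases from automated classification and scoring the remainder is what raises the effective AUC on the retained set.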
Innovative technologies for minimally invasive interventions have the potential to add value to vascular procedures in the hybrid operating theater (HOT). Restricted budgets require prioritization of the development of these technologies. We aim to provide vascular surgeons with a structured methodology for incorporating possibly conflicting criteria when prioritizing the development of new technologies. We propose a multi-criteria decision analysis framework, based on the MACBETH methodology, to evaluate the value of innovative technologies for the HOT. The framework is applied to a specific case: the new HOT in a large teaching hospital. Three upcoming innovations are scored for three different endovascular procedures. Two vascular surgeons scored the expected performance of these innovations for each of the procedures on six performance criteria and weighed the importance of these criteria. The overall value of the innovations was calculated as the weighted average of the performance scores. On a scale from 0 to 100 describing the overall value, the current HOT scored near the midpoint of the scale (49.9). A wound perfusion measurement tool scored highest (69.1) of the three innovations, mainly due to the relatively high score for crural revascularization procedures (72). The novel framework can be used to determine the relative value of innovative technologies for the HOT. When development costs are assumed to be similar, and a single budget holder decides on technology development, priority should be given to the development of a wound perfusion measurement tool.
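The aggregation step described above (overall value as the weighted average of criterion scores) reduces to a few lines; this sketch assumes scores on the 0-100 value scale and criterion weights that need not sum to one, and it omits the MACBETH value-function elicitation that precedes this step.

```python
def overall_value(scores, weights):
    """Weighted average of per-criterion performance scores (0-100 scale).

    scores:  performance score of one innovation on each criterion
    weights: importance weight per criterion (normalized internally)
    """
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total
```

Ranking the candidate innovations by this value, per procedure or overall, is what produces the prioritization reported in the case study.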
A pipeline of unsupervised image analysis methods for the extraction of geometrical features from retinal fundus images has previously been developed. Features related to vessel caliber, tortuosity, and bifurcations have been identified as potential biomarkers for a variety of diseases, including diabetes and Alzheimer's disease. The current computationally expensive pipeline takes 24 minutes to process a single image, which impedes implementation in a screening setting. In this work, we approximate the pipeline with a convolutional neural network (CNN) that enables processing of a single image in a few seconds. As an additional benefit, the trained CNN is sensitive to key structures in the retina and can be used as a pretrained network for related disease classification tasks. Our model is based on the ResNet-50 architecture and outputs four biomarkers that describe global properties of the vascular tree in retinal fundus images. Intraclass correlation coefficients between the predictions of the CNN and the results of the pipeline showed strong agreement (0.86-0.91) for three of the four biomarkers and moderate agreement (0.42) for the fourth. Class activation maps were created to illustrate the attention of the network. The maps show qualitatively that the activations of the network overlap with the biomarkers of interest and that the network is able to distinguish venules from arterioles. Moreover, local high- and low-tortuosity regions are clearly identified, confirming that a CNN is sensitive to key structures in the retina.
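The agreement measure used above can be computed directly; the abstract does not state which ICC form was used, so as an illustrative sketch we assume the one-way random-effects ICC(1,1), treating the CNN and the pipeline as two "raters" of each image.

```python
import numpy as np

def icc_1_1(ratings):
    """One-way random-effects ICC(1,1) for an (n_subjects, k_raters) array,
    e.g. CNN predictions and pipeline outputs as two raters per image."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    subj_means = ratings.mean(axis=1)
    # Between-subject and within-subject mean squares.
    msb = k * ((subj_means - ratings.mean()) ** 2).sum() / (n - 1)
    msw = ((ratings - subj_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

Values near 1 indicate the CNN reproduces the pipeline's biomarker values almost exactly, matching the 0.86-0.91 range reported for three of the biomarkers.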