Interest in multiple object tracking (MOT) has grown in recent years, in both civil and military contexts, as it enhances situational awareness for better decision-making. Typically, state-of-the-art methods integrate motion and appearance features to preserve the trajectory of each object over time, using new detection information when available. Visual features are fundamental for resolving temporary occlusions or complex trajectories, i.e. non-linear motion associated with high object speeds or low framerates. Currently, these features are extracted by powerful deep learning-based models trained on the re-identification (ReID) task. However, research focuses mostly on scenarios involving pedestrians or vehicles, limiting the adaptability and transferability of such methods to other use cases. In this paper we investigate the added value of a variety of appearance features for comparing vessel appearance. We also include recent foundation models, which have shown out-of-the-box applicability to unseen circumstances. Finally, we discuss how these robust visual features could improve multiple object tracking performance in the specialized domain of maritime surveillance.
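A core operation behind appearance-based association is comparing ReID embeddings of tracked objects and new detections. The sketch below is a minimal, hypothetical illustration of appearance-only matching via cosine similarity; the 512-dimensional embeddings, the greedy matching, and the threshold are assumptions, not the method evaluated in the paper (which would typically also use motion cues and one-to-one assignment, e.g. the Hungarian algorithm).

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two appearance embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def match_detections(track_embeddings, detection_embeddings, threshold=0.6):
    """Greedy appearance-only association: each track keeps the best-matching
    detection whose similarity exceeds the (hypothetical) threshold. A real
    tracker would enforce one-to-one assignment, e.g. via the Hungarian
    algorithm."""
    matches = []
    for t_id, t_emb in track_embeddings.items():
        scores = {d_id: cosine_similarity(t_emb, d_emb)
                  for d_id, d_emb in detection_embeddings.items()}
        if scores:
            best_id, best_score = max(scores.items(), key=lambda kv: kv[1])
            if best_score >= threshold:
                matches.append((t_id, best_id, best_score))
    return matches

# Usage with random placeholder embeddings (a ReID model would produce these):
rng = np.random.default_rng(0)
tracks = {0: rng.normal(size=512), 1: rng.normal(size=512)}
dets = {"a": rng.normal(size=512), "b": rng.normal(size=512)}
print(match_detections(tracks, dets, threshold=-1.0))
```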
The use of simulated data for training deep learning models has proven to be a promising strategy for automated situational awareness, particularly when real data is scarce. Such simulated datasets are important in fields where access to environments or objects of interest is limited, including space, security, and defense. When simulating a dataset to train a vehicle detector using 3D models, one ideally has access to high-fidelity models for each class of interest. In practice, 3D model quality can vary significantly across classes, often due to different data sources or limited detail available for certain objects. In this study, we investigate the impact of this 3D model variation on the performance of a fine-grained military vehicle detector that distinguishes 15 classes and is trained on simulated data. Our research is driven by the observation that variations in polygon count among 3D models significantly influence class-specific accuracies, leading to imbalances in overall model performance. To address this, we implemented four decimation strategies aimed at standardizing the polygon count across the different models. While these approaches reduced overall accuracy, measured in average precision (AP) and AP@50, they also contributed to a more balanced confusion matrix, reducing class prediction bias. Our findings suggest that rather than uniformly lowering the detail level of all models, future work should focus on enhancing the detail of low-polygon models to achieve a more effective and balanced detection performance.
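As a rough illustration of polygon-count standardization, the hypothetical sketch below decimates every model above an assumed common triangle budget using Open3D's quadric decimation; the file paths, the budget, and the single strategy shown are placeholders for the four strategies compared in the paper.

```python
# Hypothetical sketch: equalize polygon counts across per-class 3D models with
# Open3D's quadric decimation. Paths and the target count are placeholders.
import open3d as o3d

model_paths = {"tank_a": "models/tank_a.obj", "howitzer_b": "models/howitzer_b.obj"}
target_triangles = 20_000  # assumed common budget, e.g. the lowest count in the set

for name, path in model_paths.items():
    mesh = o3d.io.read_triangle_mesh(path)
    if len(mesh.triangles) > target_triangles:
        mesh = mesh.simplify_quadric_decimation(
            target_number_of_triangles=target_triangles)
    o3d.io.write_triangle_mesh(f"models_decimated/{name}.obj", mesh)
    print(name, "->", len(mesh.triangles), "triangles")
```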
Artificial intelligence (AI) models are at the core of improving computer-assisted tasks such as object detection, target recognition, and mission planning. The development of AI models typically requires a large set of representative data, which can be difficult to acquire in the military domain. Challenges include uncertain and incomplete data, complex scenarios, and scarcity of historical or threat data. A promising alternative to real-world data is the use of simulated data for AI model training, but the gap between real and simulated data can impede effective transfer from synthetic to real-world scenarios. In this study, we provide an overview of the state-of-the-art methods for exploiting simulation data to train AI models for military applications. We identify specific simulation considerations and their effects on AI model performance, such as simulation variation and simulation fidelity. We investigate the importance of these aspects by showcasing three studies where simulated data is used to train AI models for military applications, namely vehicle detection, target classification and course of action support. In the first study, we focus on military vehicle detection in RGB images and study the effect of simulation variation and the combination of a large set of simulated data with few real samples. Subsequently, we address the topic of target classification in sonar imagery, investigating how to effectively integrate a small set of simulated objects into a large set of low-frequency synthetic aperture sonar data. We conclude with a study on mission planning, where we experiment with the fidelities of different aspects in our simulation environment, such as the level of realism in movement patterns. Our findings highlight the potential of using simulated data to train AI models, but also illustrate the need for further research on this topic in the military domain.
Space domain awareness has gained traction in recent years, encompassing the charting and cataloging of space objects, anticipating orbital paths, and keeping track of re-entering objects. Radar techniques can be used to monitor the fast-growing population of satellites, but so far they are mainly used for detection and tracking. For the characterization of a satellite’s capabilities, more detailed information, such as inverse synthetic-aperture radar (ISAR) imaging, is needed. Deep learning has become the preferred method for automated image analysis in various applications. Development of deep learning models typically requires large amounts of training data, but recent studies have shown that synthetic data can be used as an alternative, in combination with domain adaptation techniques to overcome the domain gap between synthetic and real data.
In this study, we present a deep learning-based methodology for automated segmentation of the satellite’s bus and solar panels in ISAR images. We first train a segmentation model using thousands of fast simulated ISAR images and then fine-tune the model using a domain adaptation technique that only requires a few samples of the target domain. As a proof of concept, we use a small set of high-fidelity simulated ISAR images closely resembling real ISAR images as the target domain. Our proof of concept demonstrates that this domain adaptation technique effectively bridges the gap between the training and target radar image domains. Consequently, fast simulated (low-fidelity) synthetic datasets prove to be invaluable for training segmentation models for ISAR images, especially when combined with domain adaptation techniques.
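The abstract does not specify the domain adaptation technique, so the sketch below shows a generic few-shot fine-tuning stand-in: freeze the encoder trained on the fast simulated images and update only the segmentation head on a handful of target-domain samples. The model (torchvision's DeepLabV3-ResNet50), the three classes, the learning rate, and the data loader are assumptions.

```python
# Sketch of few-shot fine-tuning as a stand-in for the (unspecified) domain
# adaptation step: keep the encoder trained on fast simulated ISAR images fixed
# and update only the segmentation head on a few target-domain samples.
import torch
from torch import nn
from torchvision.models.segmentation import deeplabv3_resnet50

model = deeplabv3_resnet50(num_classes=3)  # assumed: background, bus, solar panel
# ... load weights pretrained on the large fast-simulated ISAR set here ...

for p in model.backbone.parameters():      # freeze source-domain features
    p.requires_grad = False

optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def adapt(model, few_shot_loader, epochs=20):
    """few_shot_loader yields (image, mask) pairs from the target domain."""
    model.train()
    for _ in range(epochs):
        for images, masks in few_shot_loader:
            optimizer.zero_grad()
            logits = model(images)["out"]   # (B, 3, H, W)
            loss = criterion(logits, masks) # masks: (B, H, W) class indices
            loss.backward()
            optimizer.step()
```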
Automatic object detection is increasingly important in the military domain, with potential applications including target identification, threat assessment, and strategic decision-making processes. Deep learning has become the standard methodology for developing object detectors, but obtaining the necessary large set of training images can be challenging due to the restricted nature of military data. Moreover, for meaningful deployment of an object detection model, it needs to work in various environments and conditions, in which prior data acquisition might not be possible. The use of simulated data for model development can be an alternative to real images, and recent work has shown the potential of training a military vehicle detector on simulated data. Nevertheless, fine-grained classification of detected military vehicles, when training on simulated data, remains an open challenge.
In this study, we develop an object detector for 15 vehicle classes, including similar-looking types such as multiple battle tanks and howitzers. We show that combining a few real data samples with a large amount of simulated data (12,000 images) leads to a significant improvement compared with using either source individually. Adding just two samples per class improves the mAP to 55.9 [±2.6], compared to 33.8 [±0.7] when only simulated data is used. Further improvements are achieved by adding more real samples and using Grounding DINO, a foundation model pretrained on vast amounts of data (mAP = 90.1 [±0.5]). In addition, we investigate the effect of simulation variation, which we find is important even when more real samples are available.
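One simple way to realize the real-plus-simulated training mix described above is to concatenate the large simulated set with the few real samples, repeating the real samples so they appear regularly during an epoch. This is a minimal sketch under that assumption; the dataset objects, repeat factor, and loader settings are placeholders, not the paper's exact recipe.

```python
# Minimal sketch of mixing ~12,000 simulated images with a few real samples
# per class; the repeat factor is an assumed heuristic, not a reported setting.
from torch.utils.data import ConcatDataset, DataLoader

def build_mixed_dataset(simulated_ds, real_ds, real_repeat=50):
    """Repeat the small real dataset so it is not drowned out by the large
    simulated dataset during training."""
    return ConcatDataset([simulated_ds] + [real_ds] * real_repeat)

# train_loader = DataLoader(build_mixed_dataset(sim_ds, real_ds),
#                           batch_size=16, shuffle=True)
```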
KEYWORDS: Data modeling, Object detection, Transformers, Education and training, Performance modeling, 3D modeling, Sensors, Visual process modeling, Linear filtering, Computer vision technology
Collecting and annotating real-world data for the development of object detection models is a time-consuming and expensive process. In the military domain in particular, data collection can also be dangerous or infeasible. Training models on synthetic data may provide a solution for cases where access to real-world training data is restricted. However, bridging the reality gap between synthetic and real data remains a challenge. Existing methods usually build on top of baseline Convolutional Neural Network (CNN) models that have been shown to perform well when trained on real data, but that have limited ability to perform well when trained on synthetic data. For example, some architectures are designed to be fine-tuned on large quantities of training data and are prone to overfitting on synthetic data. Related work also tends to ignore best practices from object detection on real data, e.g. by training on synthetic data from a single environment with relatively little variation. In this paper we propose a methodology for improving the performance of a pre-trained object detector when training on synthetic data. Our approach focuses on extracting the salient information from synthetic data without forgetting useful features learned from pre-training on real images. Based on the state of the art, we incorporate data augmentation methods and a Transformer backbone. Besides reaching relatively strong performance without any specialized synthetic-data transfer methods, we show that our methods improve the state of the art on synthetic-data-trained object detection for the RarePlanes and DGTA-VisDrone datasets, and reach near-perfect performance on an in-house vehicle detection dataset.
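The abstract does not list the exact augmentations, so the following is a hedged sketch of a photometric augmentation pipeline in that spirit, meant to discourage the detector from latching onto rendering artifacts; the chosen transforms and magnitudes are assumptions.

```python
# Assumed photometric augmentation pipeline for synthetic training images;
# photometric transforms leave bounding boxes untouched. Geometric
# augmentations (flips, crops) would also need to update the boxes,
# e.g. via torchvision.transforms.v2, which supports bounding boxes.
from torchvision import transforms

synthetic_train_transforms = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    transforms.RandomGrayscale(p=0.1),
    transforms.ToTensor(),
])
```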
Combining data from multiple sensors to improve the overall robustness and reliability of a classification system has become crucial in many applications, from military surveillance and decision support to autonomous driving, robotics, and medical imaging. This so-called sensor fusion is especially interesting for fine-grained target classification, in which very specific sub-categories (e.g. ship types) need to be distinguished, a task that can be challenging with data from a single modality. Typical modalities are electro-optical (EO) image sensors, which can provide rich visual details of an object of interest, and radar, which can yield additional spatial information. Several fusion techniques exist, defined by the approach used to combine data from these sensors. For example, late fusion can merge class probabilities output by separate processing pipelines dedicated to each of the individual sensors. In particular, deep learning (DL) has been widely leveraged for EO image analysis, but typically requires a lot of data to adapt to the nuances of a fine-grained classification task. Recent advances in foundation models have shown high potential when dealing with in-domain data scarcity, especially in combination with few-shot learning. This paper presents a framework to effectively combine EO and radar sensor data, and shows how this method outperforms stand-alone single-sensor methods for fine-grained target classification. We adopt a strong few-shot image classification baseline based on foundation models, which robustly handles the lack of in-domain data and exploits rich visual features. In addition, we investigate a weighted and a Bayesian fusion approach to combine the target class probabilities obtained from the image classification model and from radar kinematic features. Experiments with data acquired in a measurement campaign at the port of Rotterdam show that our fusion method improves on the classification performance of the individual modalities.
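Under the common definitions of these two late-fusion schemes, the sketch below combines per-class probabilities from the EO classifier with those derived from radar kinematics: a convex weighted average, and a naive-Bayes style product of posteriors with a (here uniform) class prior. The weight, the prior, and the example numbers are assumptions, not the paper's settings.

```python
import numpy as np

def weighted_fusion(p_eo, p_radar, w=0.7):
    """Convex combination of the two class-probability vectors."""
    p = w * p_eo + (1.0 - w) * p_radar
    return p / p.sum()

def bayesian_fusion(p_eo, p_radar, prior=None):
    """Naive-Bayes style fusion: multiply per-sensor posteriors, divide by the
    class prior once (uniform by default), and renormalize."""
    if prior is None:
        prior = np.full_like(p_eo, 1.0 / p_eo.size)
    p = (p_eo * p_radar) / prior
    return p / p.sum()

# Example with three hypothetical ship classes:
p_eo = np.array([0.6, 0.3, 0.1])     # from the few-shot image classifier
p_radar = np.array([0.4, 0.4, 0.2])  # from radar kinematic features
print(weighted_fusion(p_eo, p_radar), bayesian_fusion(p_eo, p_radar))
```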
Automated object detection is becoming more relevant in a wide variety of applications in the military domain. This includes the detection of drones, ships, and vehicles in visible-light and infrared (IR) video. In recent years, deep learning-based object detection methods, such as YOLO, have proven promising in many applications. However, current methods have limited success when objects of interest are small in terms of pixels, e.g. objects far away or small objects closer by. This is important, since accurate small object detection translates to early detection, and the earlier an object is detected, the more time is available for action. In this study, we investigate novel image analysis techniques that are designed to address some of the challenges of (very) small object detection by taking temporal information into account. We implement six methods, of which three are based on deep learning and use the temporal context of a set of frames within a video. These methods consider neighboring frames when detecting objects, either by stacking them as additional channels or by considering difference maps. We compare these spatio-temporal deep learning methods with YOLO-v8, which only considers single frames, and with two traditional moving object detection methods. Evaluation is done on a set of videos that encompasses a wide variety of challenges, including various objects, scenes, and acquisition conditions, to show real-world performance.
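Both ways of adding temporal context mentioned above (neighboring frames as extra channels, or difference maps) can be illustrated with a small NumPy sketch; the window size and grayscale input are assumptions, and the actual detector architectures are not shown.

```python
import numpy as np

def stack_frames(frames, t, k=1):
    """Gather frame t and its k neighbours on both sides as extra channels.
    frames: (T, H, W) grayscale video; returns (2k+1, H, W)."""
    idx = np.clip(np.arange(t - k, t + k + 1), 0, len(frames) - 1)
    return frames[idx]

def difference_maps(frames, t, k=1):
    """Absolute differences between frame t and its neighbours, highlighting
    small moving objects against a (roughly) static background."""
    stack = stack_frames(frames, t, k).astype(np.float32)
    return np.abs(stack - stack[k])

# video = np.random.rand(100, 480, 640)  # placeholder grayscale clip
# model_input = np.concatenate([stack_frames(video, 10),
#                               difference_maps(video, 10)], axis=0)
```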
Deep learning has emerged as a powerful tool for image analysis in various fields, including the military domain. It has the potential to automate and enhance tasks such as object detection, classification, and tracking. Training images for the development of such models are typically scarce, due to the restricted nature of this type of data. Consequently, researchers have focused on using synthetic data for model development, since simulated images are fast to generate and can, in theory, make up a large and diverse data set. When using simulated training data it is important to consider the variety needed to bridge the gap between simulated and real data. So far it is not fully understood which variations are important and how much variation is needed. In this study, we investigate the effect of simulation variety. We do so for the development of a deep learning-based military vehicle detector that is evaluated on real-world images of military vehicles. To construct the synthetic training data, 3D models of the vehicles are placed in front of diverse background scenes. We experiment with the number of images, background scene variations, 3D model variations, model textures, camera-object distance, and various object rotations. The insights that we gain can be used to prioritize future efforts towards creating synthetic data for deep learning-based object detection models.
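Purely as an illustration of the variation axes listed above, the hypothetical sketch below samples per-image scene parameters (background, model, texture, distance, rotation); the value ranges and the render() call are placeholders, not the actual data-generation pipeline.

```python
# Hypothetical sampling of simulation parameters per rendered image; ranges,
# class names, and render() are assumptions for illustration only.
import random

BACKGROUNDS = ["forest", "desert", "urban", "snow"]
VEHICLE_MODELS = {"tank_a": ["texture_clean", "texture_dusty"],
                  "howitzer_b": ["texture_clean"]}

def sample_scene():
    vehicle = random.choice(list(VEHICLE_MODELS))
    return {
        "background": random.choice(BACKGROUNDS),
        "vehicle": vehicle,
        "texture": random.choice(VEHICLE_MODELS[vehicle]),
        "distance_m": random.uniform(30, 500),  # camera-object distance
        "yaw_deg": random.uniform(0, 360),      # object rotation
    }

# for i in range(12_000):
#     render(sample_scene(), out_path=f"sim/{i:05d}.png")  # render() is assumed
```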
Purpose: Ensembles of convolutional neural networks (CNNs) often outperform a single CNN in medical image segmentation tasks, but inference is computationally more expensive and makes ensembles unattractive for some applications. We compared the performance of differently constructed ensembles with the performance of CNNs derived from these ensembles using knowledge distillation, a technique for reducing the footprint of large models such as ensembles.
Approach: We investigated two different types of ensembles, namely diverse ensembles of networks with three different architectures and two different loss functions, and uniform ensembles of networks with the same architecture but initialized with different random seeds. Additionally, for each ensemble, a single student network was trained to mimic the class probabilities predicted by the teacher model, i.e. the ensemble. We evaluated the performance of each network, the ensembles, and the corresponding distilled networks across three different publicly available datasets. These included chest computed tomography scans with four annotated organs of interest, brain magnetic resonance imaging (MRI) with six annotated brain structures, and cardiac cine-MRI with three annotated heart structures.
Results: Both uniform and diverse ensembles obtained better results than any of the individual networks in the ensemble. Furthermore, applying knowledge distillation resulted in a single network that was smaller and faster without compromising performance compared with the ensemble it learned from. The distilled networks significantly outperformed the same network trained on the reference segmentations instead of with knowledge distillation.
Conclusion: Knowledge distillation can compress segmentation ensembles of uniform or diverse composition into a single CNN while maintaining the performance of the ensemble.
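A minimal sketch of the distillation step described in the Approach, under common assumptions: the teacher signal is the ensemble's averaged softmax output and the student is trained with a temperature-scaled KL divergence. The temperature and loss weighting (often combined with a supervised term) are assumptions, and the paper's exact formulation may differ.

```python
# Distillation sketch: teacher = averaged ensemble probabilities,
# loss = temperature-scaled KL divergence (a common choice, assumed here).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, T=2.0):
    """student_logits: (B, C, H, W); teacher_probs: averaged softmax output of
    the ensemble with the same shape. Returns the KL divergence at temperature T."""
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    # Re-soften the teacher distribution at the same temperature.
    teacher_T = F.softmax(torch.log(teacher_probs.clamp_min(1e-8)) / T, dim=1)
    return F.kl_div(log_p_student, teacher_T, reduction="batchmean") * (T * T)

@torch.no_grad()
def ensemble_probs(models, images):
    """Average the class probabilities of all ensemble members (the teacher)."""
    return torch.stack([F.softmax(m(images), dim=1) for m in models]).mean(dim=0)
```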
Type 2 Diabetes (T2D) is a chronic metabolic disorder that can lead to blindness and cardiovascular disease. Information about early-stage T2D might be present in retinal fundus images, but to what extent these images can be used in a screening setting is still unknown. In this study, deep neural networks were employed to differentiate between fundus images from individuals with and without T2D. We investigated three methods to achieve high classification performance, measured by the area under the receiver operating characteristic curve (ROC-AUC). A multi-target learning approach that simultaneously outputs retinal biomarkers as well as T2D works best (AUC = 0.746 [±0.001]). Furthermore, the classification performance can be improved when images with high prediction uncertainty are referred to a specialist. We also show that combining the images of the left and right eye of each individual, using a simple averaging approach, can further improve the classification performance (AUC = 0.758 [±0.003]). The results are promising, suggesting the feasibility of screening for T2D from retinal fundus images.
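The per-individual averaging and the uncertainty-based referral described above can be summarized in a few lines; the referral band below is illustrative, and the paper's actual uncertainty measure and thresholds are not specified in the abstract.

```python
def combine_eyes(p_left, p_right):
    """Simple averaging of the per-eye T2D probabilities, as described above."""
    return 0.5 * (p_left + p_right)

def refer_uncertain(prob, low=0.3, high=0.7):
    """Refer cases with an uncertain combined probability to a specialist;
    the band limits are illustrative assumptions, not the paper's thresholds."""
    if low < prob < high:
        return "refer to specialist"
    return "T2D suspected" if prob >= high else "no T2D suspected"

print(refer_uncertain(combine_eyes(0.62, 0.55)))
```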
Innovative technologies for minimally invasive interventions have the potential to add value to vascular procedures in the hybrid operating theater (HOT). Restricted budgets require prioritization of the development of these technologies. We aim to provide vascular surgeons with a structured methodology to incorporate possibly conflicting criteria when prioritizing the development of new technologies. We propose a multi-criteria decision analysis framework, based on the MACBETH methodology, to evaluate the value of innovative technologies for the HOT. The framework is applied to a specific case: the new HOT in a large teaching hospital. Three upcoming innovations are scored for three different endovascular procedures. Two vascular surgeons scored the expected performance of these innovations for each of the procedures on six performance criteria and weighted the importance of these criteria. The overall value of the innovations was calculated as the weighted average of the performance scores. On a scale from 0-100 describing the overall value, the current HOT scored halfway along the scale (49.9). A wound perfusion measurement tool scored highest (69.1) of the three innovations, mainly due to its relatively high score for crural revascularization procedures (72). The novel framework could be used to determine the relative value of innovative technologies for the HOT. When development costs are assumed to be similar, and a single budget holder decides on technology development, priority should be given to the development of a wound perfusion measurement tool.
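The overall value calculation above is a weighted average of per-criterion scores. The sketch below shows that arithmetic only; the criterion names, weights, and scores are invented placeholders, not the values elicited from the surgeons.

```python
# Weighted-average value score (MACBETH-style aggregation); all numbers and
# criterion names below are hypothetical placeholders.
def overall_value(scores, weights):
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to one
    return sum(weights[c] * scores[c] for c in weights)

weights = {"safety": 0.3, "efficiency": 0.2, "image_quality": 0.2,
           "ergonomics": 0.1, "costs": 0.1, "innovation": 0.1}
wound_perfusion_tool = {"safety": 80, "efficiency": 70, "image_quality": 60,
                        "ergonomics": 65, "costs": 55, "innovation": 75}
print(round(overall_value(wound_perfusion_tool, weights), 1))
```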
A pipeline of unsupervised image analysis methods for the extraction of geometrical features from retinal fundus images has previously been developed. Features related to vessel caliber, tortuosity, and bifurcations have been identified as potential biomarkers for a variety of diseases, including diabetes and Alzheimer’s disease. The current computationally expensive pipeline takes 24 minutes to process a single image, which impedes implementation in a screening setting. In this work, we approximate the pipeline with a convolutional neural network (CNN) that enables processing of a single image in a few seconds. As an additional benefit, the trained CNN is sensitive to key structures in the retina and can be used as a pretrained network for related disease classification tasks. Our model is based on the ResNet-50 architecture and outputs four biomarkers that describe global properties of the vascular tree in retinal fundus images. Intraclass correlation coefficients between the predictions of the CNN and the results of the pipeline showed strong agreement (0.86-0.91) for three of the four biomarkers and moderate agreement (0.42) for one biomarker. Class activation maps were created to illustrate the attention of the network. The maps show qualitatively that the activations of the network overlap with the biomarkers of interest, and that the network is able to distinguish venules from arterioles. Moreover, local high- and low-tortuosity regions are clearly identified, confirming that the CNN is sensitive to key structures in the retina.
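Framing the pipeline approximation as a regression task, a minimal sketch could look as follows: a ResNet-50 whose final layer is replaced by a four-unit head predicting the four global biomarkers, trained against the pipeline's outputs. The loss and initialization are assumptions; only the architecture choice and the number of outputs follow the abstract.

```python
# Regression sketch: ResNet-50 with a 4-unit head approximating the pipeline's
# four vascular biomarkers. Loss and weight initialization are assumptions.
import torch
from torch import nn
from torchvision.models import resnet50

model = resnet50(weights=None)
model.fc = nn.Linear(model.fc.in_features, 4)  # four biomarker outputs
criterion = nn.MSELoss()

# images: (B, 3, H, W) fundus photos; targets: (B, 4) pipeline biomarker values
# loss = criterion(model(images), targets)
```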