Interest in multiple object tracking (MOT) has grown in recent years, in both civil and military contexts, as it enhances situational awareness for better decision-making. State-of-the-art methods typically integrate motion and appearance features to preserve the trajectory of each object over time, incorporating new detection information when available. Visual features are fundamental for resolving temporary occlusions or complex trajectories, i.e. non-linear motion associated with high object speeds or low frame rates. Currently, these features are extracted by powerful deep learning-based models trained on the re-identification (ReID) task. However, research focuses mostly on scenarios involving pedestrians or vehicles, limiting the adaptability and transferability of such methods to other use cases. In this paper we investigate the added value of a variety of appearance features for comparing vessel appearance. We also include recent advances in foundation models, which show out-of-the-box applicability to unseen circumstances. Finally, we discuss how these robust visual features could improve multiple object tracking performance in the specialized domain of maritime surveillance.
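As a rough illustration of how motion and appearance cues are typically fused during association (not the specific method of this paper), the sketch below combines a cosine distance between ReID embeddings with an IoU-based motion cost and solves the assignment with the Hungarian algorithm; the weighting parameter `alpha` and the embedding inputs are assumptions.

```python
# Hypothetical sketch: fusing appearance (ReID) and motion cues for track-detection association.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(track_boxes, track_feats, det_boxes, det_feats, alpha=0.5):
    """Combine appearance (cosine) and motion (IoU) costs, solve with the Hungarian algorithm."""
    cost = np.zeros((len(track_boxes), len(det_boxes)))
    for i, (t_box, t_feat) in enumerate(zip(track_boxes, track_feats)):
        for j, (d_box, d_feat) in enumerate(zip(det_boxes, det_feats)):
            app = 1.0 - np.dot(t_feat, d_feat) / (
                np.linalg.norm(t_feat) * np.linalg.norm(d_feat) + 1e-9)
            mot = 1.0 - iou(t_box, d_box)
            cost[i, j] = alpha * app + (1 - alpha) * mot
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols)), cost
```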
KEYWORDS: Data modeling, Object detection, Transformers, Education and training, Performance modeling, 3D modeling, Sensors, Visual process modeling, Linear filtering, Computer vision technology
Collecting and annotating real-world data for the development of object detection models is a time-consuming and expensive process. In the military domain in particular, data collection can also be dangerous or infeasible. Training models on synthetic data may provide a solution for cases where access to real-world training data is restricted. However, bridging the reality gap between synthetic and real data remains a challenge. Existing methods usually build on top of baseline Convolutional Neural Network (CNN) models that have been shown to perform well when trained on real data, but have limited ability to perform well when trained on synthetic data. For example, some architectures are designed to be fine-tuned on large quantities of training data and are therefore prone to overfitting on synthetic data. Related work usually ignores various best practices from object detection on real data, e.g. by training on synthetic data from a single environment with relatively little variation. In this paper we propose a methodology for improving the performance of a pre-trained object detector when training on synthetic data. Our approach focuses on extracting the salient information from synthetic data without forgetting useful features learned from pre-training on real images. Based on the state of the art, we incorporate data augmentation methods and a Transformer backbone. Besides reaching relatively strong performance without any specialized synthetic data transfer methods, we show that our methods improve the state of the art in object detection trained on synthetic data for the RarePlanes and DGTA-VisDrone datasets, and reach near-perfect performance on an in-house vehicle detection dataset.
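As an illustration of the kind of data augmentation that can help bridge the reality gap (the paper's exact recipe is not reproduced here), the sketch below builds a hypothetical augmentation pipeline with the albumentations library; the chosen transforms and their probabilities are assumptions.

```python
# Illustrative augmentation pipeline (assumed, not the paper's exact recipe) using albumentations,
# applied to synthetic training images to increase appearance variation and reduce overfitting.
import albumentations as A

augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
        A.HueSaturationValue(p=0.3),   # perturb the colors of rendered textures
        A.GaussNoise(p=0.3),           # mimic sensor noise absent in synthetic imagery
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

# Usage on one synthetic sample (image: HxWx3 uint8 array, boxes as [x_min, y_min, x_max, y_max]):
# out = augment(image=image, bboxes=boxes, labels=labels)
# aug_image, aug_boxes = out["image"], out["bboxes"]
```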
Combining data from multiple sensors to improve the overall robustness and reliability of a classification system has become crucial in many applications, from military surveillance and decision support to autonomous driving, robotics, and medical imaging. This so-called sensor fusion is especially interesting for fine-grained target classification, in which very specific sub-categories (e.g. ship types) need to be distinguished, a task that can be challenging with data from a single modality. Typical modalities are electro-optical (EO) image sensors, which can provide rich visual details of an object of interest, and radar, which can yield additional spatial information. Several fusion techniques exist, defined by the approach used to combine data from these sensors. For example, late fusion can merge class probabilities output by separate processing pipelines dedicated to each of the individual sensor data. In particular, deep learning (DL) has been widely leveraged for EO image analysis, but typically requires a lot of data to adapt to the nuances of a fine-grained classification task. Recent advances in DL, in particular foundation models, have shown high potential when dealing with in-domain data scarcity, especially in combination with few-shot learning. This paper presents a framework to effectively combine EO and radar sensor data, and shows how this method outperforms stand-alone single-sensor methods for fine-grained target classification. We adopt a strong few-shot image classification baseline based on foundation models, which robustly handles the lack of in-domain data and exploits rich visual features. In addition, we investigate a weighted and a Bayesian fusion approach to combine target class probabilities output by the image classification model and radar kinematic features. Experiments with data acquired in a measurement campaign at the port of Rotterdam show that our fusion method improves on the classification performance of individual modalities.
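The two probability-level fusion schemes mentioned above can be illustrated with a minimal sketch; the class set, weights, and priors below are assumed values, not those used in the experiments.

```python
# Minimal sketch of weighted and Bayesian (naive-Bayes style) fusion of per-class probabilities.
import numpy as np

def weighted_fusion(p_eo, p_radar, w_eo=0.6, w_radar=0.4):
    """Convex combination of per-class probabilities from the EO and radar pipelines."""
    fused = w_eo * p_eo + w_radar * p_radar
    return fused / fused.sum()

def bayesian_fusion(p_eo, p_radar, prior=None):
    """Multiply class likelihoods (and an optional prior), then renormalize."""
    prior = np.ones_like(p_eo) if prior is None else prior
    fused = prior * p_eo * p_radar
    return fused / fused.sum()

# Example with three hypothetical ship classes:
p_eo = np.array([0.7, 0.2, 0.1])     # probabilities from the few-shot image classifier
p_radar = np.array([0.5, 0.4, 0.1])  # probabilities derived from radar kinematic features
print(weighted_fusion(p_eo, p_radar), bayesian_fusion(p_eo, p_radar))
```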
Semantic segmentation of aerial images is a critical task in various domains such as urban planning, and monitoring deforestation or critical infrastructure. However, the annotation process required for training accurate segmentation models is often time-consuming and labor-intensive. This paper presents a novel approach to address this challenge by leveraging the power of clustering techniques applied to the embeddings obtained from a SimCLRv2 model pretrained on the ImageNet dataset. By using this clustering approach, fewer training samples are needed, and the annotation only needs to be done for each cluster instead of each pixel in the image, significantly reducing the annotation time. Our proposed method uses SimCLRv2 to obtain rich feature representations (embeddings) from a dataset of unlabeled aerial images. These embeddings are then subjected to clustering, enabling the grouping of semantically similar image regions. In addition to directly using these clusters as class labels, we can treat these clusters as pseudo-classes, allowing us to construct a pseudo-label dataset for fine-tuning a segmentation network. Through experiments conducted on two benchmark aerial image datasets (Potsdam and Vaihingen), we demonstrate the effectiveness of our approach in achieving segmentation results in line with similar works on few-shot segmentation while significantly reducing the annotation effort required, thereby highlighting its practical applicability. Overall, the combination of SimCLRv2 embeddings and clustering techniques presents a promising avenue for achieving accurate image segmentation while minimizing the annotation burden, making it highly relevant for remote sensing applications and aerial imagery analysis.
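A minimal sketch of the clustering step is given below, assuming the SimCLRv2 embeddings are already available as an N x D array; the number of clusters is an assumed parameter, not the value used in the experiments.

```python
# Hedged sketch: cluster patch/region embeddings with k-means and use cluster ids as pseudo-labels.
import numpy as np
from sklearn.cluster import KMeans

def cluster_pseudo_labels(embeddings, n_clusters=6, seed=0):
    """Cluster embeddings; each cluster id acts as a pseudo-class label."""
    kmeans = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    pseudo_labels = kmeans.fit_predict(embeddings)   # shape (N,)
    return pseudo_labels, kmeans.cluster_centers_

# Annotation then reduces to naming each cluster once (e.g. "building", "vegetation")
# instead of labeling every pixel; the resulting pseudo-label maps can be used to
# fine-tune a segmentation network.
embeddings = np.random.rand(1000, 128)   # placeholder for SimCLRv2 feature vectors
labels, centers = cluster_pseudo_labels(embeddings)
```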
Automated object detection is becoming more relevant in a wide variety of applications in the military domain. This includes the detection of drones, ships, and vehicles in visible and IR video. In recent years, deep learning-based object detection methods, such as YOLO, have proven promising in many applications. However, current methods have limited success when objects of interest cover only a small number of pixels, e.g. objects far away or small objects close by. This is important, since accurate small object detection translates to early detection, and the earlier an object is detected, the more time is available for action. In this study, we investigate novel image analysis techniques that are designed to address some of the challenges of (very) small object detection by taking into account temporal information. We implement six methods, of which three are based on deep learning and use the temporal context of a set of frames within a video. The methods consider neighboring frames when detecting objects, either by stacking them as additional channels or by considering difference maps. We compare these spatio-temporal deep learning methods with YOLOv8, which only considers single frames, and with two traditional moving object detection methods. Evaluation is done on a set of videos that encompasses a wide variety of challenges, including various objects, scenes, and acquisition conditions, to show real-world performance.
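To illustrate the two ways temporal context can be fed to a detector, the following sketch (assuming grayscale frames of equal size) builds a stacked-channel input and a frame-difference map; it is a simplified stand-in for the implemented methods, not their actual code.

```python
# Illustrative sketch of two forms of temporal context: channel stacking and difference maps.
import numpy as np

def stack_frames(frames, t, k=1):
    """Return frame t with its k neighbors on either side stacked along the channel axis."""
    window = frames[t - k : t + k + 1]   # (2k+1, H, W)
    return np.stack(window, axis=-1)     # (H, W, 2k+1) multi-channel detector input

def difference_map(frames, t):
    """Absolute difference with the previous frame, highlighting small moving objects."""
    return np.abs(frames[t].astype(np.float32) - frames[t - 1].astype(np.float32))

frames = np.random.randint(0, 255, size=(10, 480, 640), dtype=np.uint8)  # placeholder video
x_stack = stack_frames(frames, t=5, k=1)   # 3-channel spatio-temporal input
x_diff = difference_map(frames, t=5)
```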
Electro-optical (EO) sensors are essential for surveillance in military and security applications. Recent technological advancements, especially the developments in Deep Learning (DL), have enabled improved object detection and tracking in complex and dynamic environments. Most of this research focuses on readily available visible light (VIS) images. To apply these technologies to thermal infrared (TIR) imagery, DL networks can be retrained using image data in the TIR domain. However, such a training set with enough samples is not easily available. This paper presents an unsupervised domain adaptation method for ship detection in TIR imagery using paired VIS and TIR images. The proposed method leverages the pairing of VIS and TIR images and performs domain adaptation using detections in the VIS imagery as ground truth to provide training data for the TIR domain. The method performs ship detection from the VIS images using a pretrained convolutional neural network (CNN). These detections are subsequently improved using a tracking algorithm. The proposed TIR object detection model follows a two-stage training process. In the first stage, the model's head is trained, which consists of the regression layers that output the bounding boxes of the detected objects. In the second stage, the model's feature extractor is trained to learn more discriminative features. The method is evaluated on a dataset of recordings at Rotterdam harbor. Experiments demonstrate that the resulting TIR detector performs comparably with its VIS counterpart, in addition to providing reliable detections in adverse environmental conditions where the VIS model fails. The proposed method has significant potential for real-world applications, including maritime surveillance.
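A rough PyTorch sketch of such a two-stage training schedule is given below; the detector structure (a `backbone` feature extractor and a `head` with regression layers), the loss function, and the pseudo-label loader are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch: stage 1 trains the head on VIS-derived pseudo-labels with a frozen backbone,
# stage 2 unfreezes the backbone to learn TIR-specific discriminative features.
import torch

def _run_epochs(model, loader, loss_fn, optimizer, epochs):
    for _ in range(epochs):
        for tir_img, pseudo_targets in loader:   # pseudo targets come from the VIS detector + tracker
            loss = loss_fn(model(tir_img), pseudo_targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def train_two_stage(model, tir_loader, loss_fn, epochs_head=5, epochs_full=10, lr=1e-4):
    # Stage 1: freeze the feature extractor, train only the detection head (regression layers).
    for p in model.backbone.parameters():
        p.requires_grad = False
    _run_epochs(model, tir_loader, loss_fn,
                torch.optim.Adam(model.head.parameters(), lr=lr), epochs_head)

    # Stage 2: unfreeze the feature extractor and train the whole model at a lower learning rate.
    for p in model.backbone.parameters():
        p.requires_grad = True
    _run_epochs(model, tir_loader, loss_fn,
                torch.optim.Adam(model.parameters(), lr=lr * 0.1), epochs_full)
    return model
```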
Early threat assessment of vessels is an important surveillance task during naval operations. Whether a vessel is a threat depends on a number of aspects. Amongst those are the vessel class, the closest point of approach (CPA), the speed and direction of the vessel, and the presence of possibly threatening items on board the vessel, such as weapons. Currently, most of these aspects are observed by operators viewing the camera imagery. Whether a vessel is a potential threat will depend on the final assessment of the operator. Automated analysis of electro-optical (EO) imagery for aspects of potential threats during surveillance can support the operator during observation. This can relieve the operator from continuous monitoring and provide tools for a better overview of possible threats in the surroundings during a surveillance task. In this work, we apply different processing algorithms, including detection, tracking and classification, to recorded multi-band EO imagery in a harbor environment with many small vessels. With the results we aim to automatically determine the vessel's CPA, the number of people on board, and the presence of possibly threatening items on board the vessel. We thereby show that our algorithms can support the operator in assessing whether a vessel poses a threat.
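For the CPA aspect specifically, a minimal sketch of a constant-velocity closest-point-of-approach computation is shown below, assuming own ship at the origin and a relative position and velocity estimate from the tracker; the numbers in the example are arbitrary.

```python
# Minimal sketch of a closest-point-of-approach (CPA) computation from a vessel track.
import numpy as np

def closest_point_of_approach(rel_pos, rel_vel):
    """rel_pos, rel_vel: 2D relative position [m] and velocity [m/s] of the vessel w.r.t. own ship."""
    speed_sq = np.dot(rel_vel, rel_vel)
    if speed_sq < 1e-9:                        # vessel is (relatively) stationary
        return 0.0, float(np.linalg.norm(rel_pos))
    t_cpa = max(0.0, -np.dot(rel_pos, rel_vel) / speed_sq)    # time of closest approach [s]
    d_cpa = float(np.linalg.norm(rel_pos + rel_vel * t_cpa))  # distance at that time [m]
    return t_cpa, d_cpa

# Example with arbitrary values: vessel 800 m east, 300 m north, closing at 6 m/s west, 1.5 m/s south.
t_cpa, d_cpa = closest_point_of_approach(np.array([800.0, 300.0]), np.array([-6.0, -1.5]))
```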
Automatic detection and tracking of persons and vehicles can greatly increase situational awareness in many military applications. Various methods for detection and tracking have been proposed, both rule-based and learning-based. With the advent of deep learning, learning approaches generally outperform rule-based approaches. Neural networks pre-trained on datasets like MS COCO can give reasonable detection performance on military datasets. However, for optimal performance it is advised to optimize the training of these pre-trained networks with a representative dataset. In typical military settings, it is a challenge to acquire enough data and to split the training and test set properly. In this paper we evaluate fine-tuning on military data and compare different pre- and post-processing methods. First, we compare a standard pre-trained RetinaNet detector with a fine-tuned version, trained on similar objects recorded at distances different from those in the test set. With respect to distance, this training set is therefore out-of-distribution. Next, we augment the training examples by both increasing and decreasing their size. Once detected, we use a template tracker to follow the objects, compensating for any missing detections. We show the results on detection and tracking of persons and vehicles in visible imagery in a military long-range detection setting. The results show the added value of fine-tuning a neural network with augmented examples, where final network performance is similar to human visual performance for the detection of targets with a target area of tens of pixels in a moderately cluttered land environment.
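To illustrate how a template tracker can bridge missed detections, the sketch below uses OpenCV normalized cross-correlation within a local search window; the margin parameter and the grayscale inputs are assumptions, not the paper's exact tracker.

```python
# Hedged sketch: re-locate a target near its previous box when the detector misses a frame.
import cv2
import numpy as np

def track_by_template(frame_gray, template_gray, prev_box, margin=20):
    """Search for the target template near its previous box (x, y, w, h); return updated box and score."""
    x, y, w, h = prev_box
    x0, y0 = max(0, x - margin), max(0, y - margin)
    x1 = min(frame_gray.shape[1], x + w + margin)
    y1 = min(frame_gray.shape[0], y + h + margin)
    search = frame_gray[y0:y1, x0:x1]
    result = cv2.matchTemplate(search, template_gray, cv2.TM_CCOEFF_NORMED)
    _, score, _, top_left = cv2.minMaxLoc(result)   # best match within the search window
    return (x0 + top_left[0], y0 + top_left[1], w, h), score
```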
Imaging systems can be used to obtain situational awareness in maritime situations. Important tools for these systems are automatic detection and tracking of objects in the acquired imagery, for which numerous methods are being developed. When designing a detection or tracking algorithm, its quality should be ensured by a comparison with existing algorithms and/or with a ground truth. Detection and tracking methods are often designed for a specific task, so evaluation with respect to this task is crucial, which demands different evaluation measures for different tasks. We therefore propose a variety of quantitative measures for the performance evaluation of detectors and trackers for a variety of tasks. The proposed measures form a rich set from which an algorithm designer can choose in order to optimally design and assess a detection or tracking algorithm for a specific task. We compare these evaluation measures by using them to assess detection and tracking quality in different maritime detection and tracking situations, obtained from three real-life infrared video data sets. With the proposed set of evaluation measures, a user is able to quantitatively assess the performance of a detector or tracker, which enables an optimal design of their approach.
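As a simple example of the kind of frame-level measures involved, the sketch below computes true positives and false alarms for a single class at one IoU threshold; the greedy matching and the threshold value are illustrative choices, not the proposed measures themselves.

```python
# Simple sketch (single class, one IoU threshold) of detection-probability / false-alarm bookkeeping.
import numpy as np

def evaluate_frame(gt_boxes, det_boxes, iou_fn, iou_thr=0.5):
    """Greedily match detections to ground truth; return (#true positives, #false alarms, #targets)."""
    matched = set()
    tp = 0
    for det in det_boxes:
        ious = [iou_fn(det, gt) if i not in matched else 0.0 for i, gt in enumerate(gt_boxes)]
        if ious and max(ious) >= iou_thr:
            matched.add(int(np.argmax(ious)))
            tp += 1
    return tp, len(det_boxes) - tp, len(gt_boxes)

# Over a sequence: detection probability = sum(tp) / sum(targets);
# false-alarm rate = sum(false alarms) / number of frames.
```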
Detecting maritime targets with electro-optical (EO) sensors is an active area of research. One current trend is to automate target detection through image processing or computer vision. Automation of target detection will decrease the number of people required for lower-level tasks, which frees capacity for higher-level tasks. A second trend is that the targets of interest are changing; more distributed and smaller targets are of increasing interest. Technological trends enable combined detection and identification of targets through machine learning. These trends and new technologies require a new approach in target detection strategies with specific attention to choosing which sensors and platforms to deploy.
In our current research, we propose a ‘maritime detection framework 2.0’, in which multi-platform sensors are combined with detection algorithms. In this paper, we present a comparison of detection algorithms for EO sensors within our developed framework and quantify the performance of this framework on representative data.
Automatic detection can be performed within the proposed framework in three ways: 1) using existing detectors, such as detectors based on movement or local intensities; 2) using a newly developed detector based on saliency on the scene level; and 3) using a state-of-the-art deep learning method. After detection, false alarms are suppressed using consecutive tracking approaches. The performance of these detection methods is compared by evaluating the detection probability versus the false alarm rate for realistic multi-sensor data.
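One common way to suppress false alarms with tracking is an M-out-of-N confirmation rule; the sketch below is a generic illustration with assumed values for M and N, not the framework's specific logic.

```python
# Hedged sketch: confirm a candidate track only if it was detected in at least m of the last n frames.
from collections import deque

class TrackConfirmation:
    """Sliding window of hit/miss flags for a single candidate track."""

    def __init__(self, m=3, n=5):
        self.m = m
        self.history = deque(maxlen=n)

    def update(self, detected: bool) -> bool:
        """Record whether the candidate was detected this frame; return True once confirmed."""
        self.history.append(detected)
        return sum(self.history) >= self.m

# A candidate detection that never accumulates enough hits is discarded as a false alarm.
```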
New types of maritime targets require new target detection strategies. Combining new detection strategies with existing tracking technologies shows a potential increase in the detection performance of the complete framework.
Both normal aging and neurodegenerative diseases such as Alzheimer’s disease cause morphological changes of the brain. To better distinguish between normal and abnormal cases, it is necessary to model changes in brain morphology owing to normal aging. To this end, we developed a method for analyzing and visualizing these changes for the entire brain morphology distribution in the general aging population. The method is applied to 1000 subjects from a large population imaging study in the elderly, from which 900 were used to train the model and 100 were used for testing. The results of the 100 test subjects show that the model generalizes to subjects outside the model population. Smooth percentile curves showing the brain morphology changes as a function of age and spatiotemporal atlases derived from the model population are publicly available via an interactive web application at agingbrain.bigr.nl.
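A simplified sketch of how age-conditional percentile curves could be derived for a single morphology measure is shown below; the sliding-window approach, window width, and synthetic example data are assumptions for illustration and not the model used in the paper.

```python
# Illustrative sketch: percentile curves of a brain morphology measure as a function of age.
import numpy as np

def percentile_curves(ages, values, query_ages, window=5.0, percentiles=(5, 25, 50, 75, 95)):
    """For each query age, compute percentiles of the measure over subjects within +/- window years."""
    curves = {p: [] for p in percentiles}
    for a in query_ages:
        in_window = values[np.abs(ages - a) <= window]
        for p in percentiles:
            curves[p].append(np.percentile(in_window, p) if in_window.size else np.nan)
    return {p: np.array(v) for p, v in curves.items()}

# Synthetic placeholder data standing in for the 900 training subjects (volume in arbitrary units).
ages = np.random.uniform(65, 90, 900)
volumes = 500 - 2.0 * (ages - 65) + np.random.randn(900) * 10
curves = percentile_curves(ages, volumes, query_ages=np.arange(65, 91))
```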