Fine-tuning a pretrained model with real data for a machine learning task requires many hours of manual work, especially for computer vision tasks, where collection and annotation of data can be very time-consuming. We present a framework and methodology for synthetic data collection that is efficient in terms of the time taken to collect and annotate data, makes use of free and open-source software tools and 3D assets, and beats the state-of-the-art trained on real data, which is the ultimate test for any synthetic-to-real approach. We test our approach on a set of image classes from ObjectNet, a challenging image classification benchmark test dataset that is designed to be similar in many respects to ImageNet but with a wider variety of viewpoints, rotations, and backgrounds, which can make it more difficult for transfer learning problems. The novelty of our approach stems from the way we create complex backgrounds for 3D models using 2D images laid out as decals in a 3D game engine, where synthetic images are captured programmatically with a large number of systematic variations. We demonstrate that our approach is highly effective, resulting in a deep learning model with a top-1 accuracy of 72% on the ObjectNet data, which is a new state-of-the-art result. In addition, we present an efficient strategy for learning rate tuning that is an order of magnitude faster than regular grid search.
1. Introduction

Data, which are crucial to training machine learning models, can either be obtained from the real world or synthesized. Synthetic data generation is an increasingly popular technique for training deep learning models, especially in computer vision.1 A variety of methods can produce synthetic data with varying degrees of realism, based on how closely their properties resemble those of real data for the same task.2 Synthetic data hold a lot of promise as a cost-effective and scalable solution for data-hungry deep neural networks. As early as 1989, the autonomous land vehicle in a neural network (ALVINN)3 used synthetic images from a simulator to train its neural network. However, synthetic data often lack the diversity and richness of real data, so they are commonly used in conjunction with real data in training sets.4–6 With the advent of sophisticated open-source game engines and renderers, high-quality, high-volume synthetic datasets have started to become more common. Another enabling factor is the availability of packages and plugins such as UnrealCV,7 which enables easier programmatic access to generate images from game engines with ground truths. An example of such a high-quality, high-volume dataset is the virtual Karlsruhe Institute of Technology and Toyota Technological Institute (VKITTI) dataset,4 which has seven times more images than the original KITTI dataset8 on which it was based. It was also demonstrated in Ref. 4 that pretraining with virtual data followed by fine-tuning with real data can outperform training on real data alone. More recent techniques, such as domain randomization,9 generate larger and more diverse image sets than VKITTI4 that perform well on their own, even without real images. They do so by taking steps to bridge the reality gap between synthetic and real data. We do something similar, but we create complex backgrounds for objects in a 3D virtual world using 2D images pasted on the floor of that 3D world (as described in Sec. 3.3.1). There has also been a paradigm shift in computer vision, where the practice of transfer learning has become widely accepted as the norm for many high-level tasks, such as semantic segmentation10–12 and object detection.13–15 We therefore exploit advances in open-source plugins and game engines, combined with unique techniques in synthetic data generation, to perform state-of-the-art synthetic-to-real transfer learning, and make the following contributions.
2. Related Research

2.1. Approaches to Generating Synthetic Data

For computer vision tasks, such as image classification, object detection, and semantic segmentation, the different approaches to generating synthetic datasets can be classified as follows.

Cut and paste approach. This involves cutting and pasting foreground objects onto background scenes, thus creating numerous combinations of synthetic images.17 Dvornik et al.18 used a dedicated convolutional neural network (CNN) to choose potential bounding boxes where an appropriate object can be placed, based on an object score for each box in a given image. This resulted in improved performance on VOC12.19 Wang et al.20 utilized a simpler approach, where foreground objects of a class were carefully pasted onto background images from which a similar instance of that same class had been removed. For example, a teddy bear instance from one image is pasted onto another image from which a similar teddy bear was removed. They called this instance switching, and the advantage is that context, shape, and sometimes scale can be preserved to a certain extent. However, the major limitations of the cut and paste approach, such as inconsistent lighting (between foreground and background) and the creation of boundary artifacts, remain unresolved. One exception is a complex pipeline involving a geometrically consistent cut and paste methodology combined with 3D-specific image perturbation, which improves upon state-of-the-art results in monocular 3D depth estimation21 on the nuScenes dataset.22

Realistic synthetic environment approach. A 3D environment with well-placed 3D objects can provide a testing ground for various applications, particularly navigation and mapping. Several works have created outdoor environments for training self-driving cars23–26 and unmanned aerial vehicles.27–29 A notable dataset created using this approach is VKITTI,4 modeled after KITTI,30 an urban self-driving dataset with ground-truth annotations of bounding boxes and semantic segmentation masks. A popular dataset for urban semantic segmentation is SYNTHIA,31 which has data from a virtual New York City for 13 urban classes, such as roads, buildings, and pedestrians. Both SYNTHIA31 and VKITTI4 used the Unity game engine,32 whereas other works, including DeepDrive33 and VIVID,34 use the Unreal Engine to create their virtual environments. The last couple of years have also seen a rise in high-quality synthetic data generation pipelines,35,36 which can produce realistic synthetic scenes, albeit with the limitation that they require 3D scans of objects to work.

Hybrid approach. This combines synthetically generated 3D foreground objects layered on background images taken from the real world. A good example is presented in Ref. 9, where the synthetic images are composed by combining 3D models with random textures against a background of random images taken from Flickr 8k.37 A 2022 approach called photorealistic neural domain randomization (PNDR)38 utilizes a neural rendering technique, which learns a combination of modular neural networks to generate high-quality renderings, randomizing different aspects of a scene, including lighting and materials, while still preserving realism. In our work, we use a simpler hybrid approach, as described in Sec. 3.
Apart from using only a limited set of nine images from Google Images, a distinguishing feature of our approach is how we convert the 2D images into decal surfaces so that they integrate into the environment in a more realistic fashion than in Ref. 9; in addition, properties such as the surface brightness of objects can be varied to introduce greater diversity into the dataset.

2.2. Combination of Synthetic and Real Images

It has been shown in numerous instances39 that synthetic data alone are not sufficient to train general-purpose models. Earlier, synthetic datasets were used to complement real datasets so that models trained on such combined datasets could generalize well to real-world test data. This addition of synthetic datasets for training helped outperform models trained on real datasets alone. In 2014, virtual human images from the video game Half-Life 2 were used in conjunction with a real dataset called INRIA40 for the task of pedestrian detection, and it was shown on several benchmark datasets that this outperformed training with just INRIA.41 Synthetic humans were generated with the help of 3D templates from Ref. 5 and shape information from Ref. 42 to produce a synthetic dataset called SURREAL.43 The authors beat the state-of-the-art (approaches trained on real data only) on multiple tasks, including body pose estimation and depth estimation, by training with a combination of real and virtual data. Synthetic data have been used in object detection since 2015, when Peng et al.6 used CAD images to fine-tune an RCNN object detector44 and showed that synthetic images of objects are useful when there is limited real data available, as in the Office dataset,45 where the performance was better when trained on synthetic CAD images than on webcam images of the same objects. The state-of-the-art dataset generator called Kubric,46 released in 2022 and built using Blender47 and PyBullet,48 can flexibly generate synthetic data for 11 different types of computer vision tasks, from 3D-NeRF49 to optical flow estimation. Even Kubric,46 with all the flexibility and scalability of its pipeline, only generates synthetic datasets that beat the state-of-the-art when used in conjunction with real data.

2.3. Pure Synthetic Data

Data from a virtual world were used to train a multiclass object detector50 that outperformed the state-of-the-art methods at the time, the deformable parts model (DPM)51 and aggregate channel features (ACF),52 on the PETS09-S2L1 challenge.53 They achieved the highest score on the precision metric even without using a pretrained network backbone and obtained competitive results on other metrics, such as recall and false positive count. McCormac et al.,54 using their SceneNet dataset, showed that pretraining purely on synthetic data for semantic segmentation resulted in an improvement over pretraining on ImageNet when the final transfer learning task is segmentation on real datasets. Tremblay et al.55 achieved state-of-the-art results in pose estimation with six degrees of freedom using only synthetic images, utilizing a combination of domain-randomized and photorealistic images to train their pose estimator. Similarly, Hinterstoisser et al.56 demonstrated that training on pure synthetic data can outperform training on real data; their detector works on a set of 64 retail objects under various poses, heavy background clutter, partial occlusion, and illumination changes. However, their work relies on obtaining high-quality 3D scans of objects to train an object detector.
In 6D object detection, PNDR,38 mentioned in Sec. 2.1, achieves state-of-the-art results using purely synthetic data.

2.4. ObjectNet

Popular datasets, such as ImageNet,58 have been a major driving factor in the advancement of algorithms and network architectures. One of the newer ones is ObjectNet,2 which was designed to be a difficult test-only dataset (as opposed to being used for both testing and training) and to challenge standard practices for transfer learning, which do not work well on it. Some samples from the ObjectNet paper2 are shown in Fig. 1. Leave-one-out contrastive learning (LooC)59 constructs separate embedding spaces for each invariant feature. When tested on a subset of ObjectNet containing 13 classes, their LooC network achieves 32.6% top-1 accuracy over a base benchmark of 30.9% using a supervised linear classifier. In context-gated convolution (CGC),60 context-aware CNNs are created, where the weights are modified based on a global context, allowing improved extraction of representative local patterns. Testing on ObjectNet, on a subset of 113 classes that are in common with ImageNet classes, their CGC architecture improves on the baseline Resnet50 (29.35%) by 2.18% points, achieving a top-1 accuracy of 31.53% on the 113 ObjectNet classes. This is comparable to the same number of classes tested in the original ObjectNet paper.2 Big transfer (BiT)61 achieves 58.1% top-1 accuracy on ObjectNet, using large-scale pretraining on their internal dataset called JFT-300M, which has more than 1 billion labels spread over more than 300 million images. As their dataset has not been made public, one cannot directly compare against their work, let alone reproduce it or improve on it. A huge attention-based transformer with two billion parameters,62 pretrained on JFT-3B, an even larger version of the JFT-300M61 dataset, achieved 70.53% on ObjectNet. Contrastive language-image pretraining,63 which uses natural language supervision in the form of text-image paired training, manages to achieve 72.3% top-1 accuracy on ObjectNet. A similar method called locked-image text tuning,57 where the image models are locked after pretraining and the text models are tuned for the task at hand, achieves 81.1%, albeit with the limiting requirement of JFT61 pretraining and millions of ground truth image-text pairs.

3. Methodology

Our methodology includes the following steps, which are described in the sections that follow.
3.1. Test Dataset

There are 113 classes that overlap between ObjectNet and ImageNet. We organized those classes into categories by their purposes, such as chair and bench (category: furniture); cell phone and laptop (electronics); and vase and lampshade (home décor). We then randomly sampled one class from each of the top 10 categories, making 10 class labels in total, which we refer to as ObjectNet_subset. They are: mug, drill, umbrella, TV, cell phone, chair, bicycle, tennis racket, stuffed animal, and vase.

3.2. Toolchain

For each of our classes, we downloaded freely available 3D models from the 3D marketplaces TurboSquid, CGTrader, and Free3D. Based on the availability of models and the complexity of each object (some objects have more intraclass variability than others), a varied number of 3D models was downloaded for each object class. For instance, different umbrellas may vary in size, color, and slightly in design, and so do tennis rackets; but other classes, such as chairs, vary far more widely in their types, from swivel chairs to basic plastic ones. Table 1 shows the number of 3D models downloaded and used for each object class. The geometry of the 3D models consists of meshes, a form of boundary representation.

Table 1 Count of 3D models used for each class.
We then imported those models into the Unreal Engine (v4.24),67 a 3D game engine that allows 3D modeling, animation, and game development, along with support for plugins and Python scripting. We used the built-in Unreal Python plugin for scripting the spawning, rotation, and scaling of 3D models; modifying the lighting; and changing object backgrounds. In Unreal, we used the AirSim plugin68 to capture images from predefined angles and distances from each object. AirSim enables API control of virtual cars and RAVs inside the Unreal Engine and has been specifically designed to enable research in autonomous vehicles, computer vision, and reinforcement learning.

3.3. Experimental Setup

3.3.1. Background and object layout

Multiple copies of each object are laid out in a rectangular grid on the Unreal environment floor. Each copy of the object has variations in rotation, scale, and background surface. Each object is also surrounded by a floor material in the shape of a big square, which serves as the object's background when photographed from above. There is also a point light above each object for illumination. A screenshot of some of the objects with simple backgrounds (such as wooden surfaces) and point lights above them is shown in Fig. 2.

Decals. We also create more complex backgrounds using decals, as mentioned in Sec. 2.1. Decals in the real world are special papers with designs that are pasted onto surfaces, such as glass and metal. Decals inside the Unreal Engine are analogous to real-world decals in that they are materials that can be wrapped around 3D polygons. We create decals out of photographs sourced from the Internet and wrap them around huge square tiles on top of which objects are placed. A screenshot of a diverse array of complex decal-wrapped background tiles is shown in Fig. 3.

3.3.2. Lighting

Every point light is placed at a random position 2.5 m above ground level. That point light has a random color, with each of the red, green, and blue (RGB) channels taking a value in the range 100 to 255 (e.g., white light has RGB channels 255, 255, 255). The light bulb icons represent the point lights above each object. In order to avoid point light colors mixing with each other, the objects and their respective point lights are placed far apart in the virtual environment, as shown in Fig. 4. Apart from the point lights, which provide object illumination, there is also a default "skylight" in the Unreal Engine that provides scene illumination. The intensity of the scene illumination is set to a minimal level so that, even when the random point light intensities become very low, the objects remain visible in low-light conditions. There are also other scene variations that are applied, namely saturation and contrast. Floor decals are applied evenly to all objects multiplicatively, i.e., each object has 10 different copies with 10 different decals, whereas saturation and contrast settings are applied only to subsets, i.e., the entire dataset is divided into subsets and a combination of saturation and contrast settings is applied to each subset. This is because saturation and contrast result in minor visual changes, and applying such settings multiplicatively would make the resulting dataset exponentially larger without adding as much visual variation within the dataset.
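To make the layout concrete, the following is a minimal sketch of how the grid of object copies, decal background tiles, and randomized point lights described above can be scripted with the Unreal Editor Python API. It is illustrative rather than the exact pipeline code: the asset folder paths, grid spacing, scale range, and light intensity range shown here are placeholder assumptions.

```python
# Illustrative sketch (assumed asset paths and ranges): lay out object copies on a
# grid, place a decal background tile under each copy, and add a randomized point
# light above it, using the Unreal Editor Python API available in UE 4.24.
import random
import unreal

GRID_SPACING = 2000.0   # distance between objects (cm); kept large so light colors do not mix
LIGHT_HEIGHT = 250.0    # 2.5 m above ground level, as described in Sec. 3.3.2

model_paths = unreal.EditorAssetLibrary.list_assets("/Game/Models/Mug")   # hypothetical folder of 3D models
decal_paths = unreal.EditorAssetLibrary.list_assets("/Game/Decals")       # hypothetical folder of decal materials

for i, model_path in enumerate(model_paths):
    for j, decal_path in enumerate(decal_paths):
        location = unreal.Vector(i * GRID_SPACING, j * GRID_SPACING, 0.0)

        # Spawn one copy of the 3D model with a random rotation and scale.
        mesh = unreal.EditorAssetLibrary.load_asset(model_path)
        rotation = unreal.Rotator(roll=0.0, pitch=0.0, yaw=random.uniform(-45.0, 45.0))
        actor = unreal.EditorLevelLibrary.spawn_actor_from_object(mesh, location, rotation)
        scale = random.uniform(0.8, 1.2)                                   # placeholder scale range
        actor.set_actor_scale3d(unreal.Vector(scale, scale, scale))

        # Spawn a decal actor under the object to act as the complex 2D background tile.
        decal_actor = unreal.EditorLevelLibrary.spawn_actor_from_class(unreal.DecalActor, location)
        decal_comp = decal_actor.get_component_by_class(unreal.DecalComponent)
        decal_comp.set_decal_material(unreal.EditorAssetLibrary.load_asset(decal_path))

        # Spawn a point light 2.5 m above the object with a random RGB color in [100, 255].
        light_loc = unreal.Vector(location.x, location.y, LIGHT_HEIGHT)
        light = unreal.EditorLevelLibrary.spawn_actor_from_class(unreal.PointLight, light_loc)
        light_comp = light.get_component_by_class(unreal.PointLightComponent)
        r, g, b = [random.randint(100, 255) / 255.0 for _ in range(3)]
        light_comp.set_light_color(unreal.LinearColor(r, g, b, 1.0))
        light_comp.set_intensity(random.uniform(500.0, 5000.0))            # intensity range is a guess
```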
3.4. Capture of Synthetic Images

The second phase is image capture, done using AirSim after placing all the objects. At each object location, images are captured from four different viewpoints, as shown in Fig. 5, where a stuffed animal is photographed from directly above at a fixed capture height and then, keeping that height, from three further positions displaced along the x and y axes. Based on these displacements, the camera angles are also adjusted so that they face the centre of the object at all times. Because the decal backgrounds are flat, the captured synthetic images contain only two-dimensional information about those backgrounds; when the camera is rotated and photographs are not taken directly from the top down, the three-dimensional cues implied by the 2D backgrounds are not accurate. Yet, our results in Sec. 4.2 show that the model performs well despite this loss of 3D information. To view the objects in the virtual environment on top of the backgrounds at their actual scale and spacing, a mug is photographed starting from the default height and then progressively zoomed out until some neighboring background tiles are visible, as shown in Fig. 4. The tiles and objects are far apart to prevent the light rays of nearby point light sources from mixing with each other. During the batch capture of images, parameters such as image resolution, gamma, and camera field of view can be easily modified through a JSON file, as described on our GitHub page (https://github.com/saiabinesh/hybrid-synth).
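The multi-viewpoint capture itself can be driven from the AirSim Python client. The snippet below is a minimal, illustrative sketch rather than the exact capture script: the capture height, lateral offsets, object grid coordinates, and output file names are placeholder assumptions (in the actual pipeline, such parameters come from the JSON settings file mentioned above).

```python
# Illustrative sketch (placeholder coordinates): move the AirSim camera to four
# viewpoints per object and save an RGB image from each, keeping the camera
# pointed at the centre of the object.
import math
import airsim

CAPTURE_HEIGHT = -3.0                                   # NED coordinates: negative z is up (placeholder)
LATERAL_OFFSETS = [(0.0, 0.0), (1.5, 0.0), (0.0, 1.5), (1.5, 1.5)]   # placeholder x/y displacements (m)
OBJECT_CENTRES = [(0.0, 0.0), (20.0, 0.0)]              # placeholder grid locations of objects (m)

client = airsim.VehicleClient()                         # AirSim in computer-vision mode
client.confirmConnection()

for obj_id, (ox, oy) in enumerate(OBJECT_CENTRES):
    for view_id, (dx, dy) in enumerate(LATERAL_OFFSETS):
        cam_x, cam_y = ox + dx, oy + dy
        # Tilt and turn the camera so that it faces the centre of the object.
        horiz_dist = math.hypot(dx, dy)
        pitch = -math.pi / 2 if horiz_dist == 0 else -math.atan2(abs(CAPTURE_HEIGHT), horiz_dist)
        yaw = math.atan2(oy - cam_y, ox - cam_x) if horiz_dist > 0 else 0.0

        pose = airsim.Pose(airsim.Vector3r(cam_x, cam_y, CAPTURE_HEIGHT),
                           airsim.to_quaternion(pitch, 0.0, yaw))
        client.simSetVehiclePose(pose, ignore_collision=True)

        # Request an RGB scene image from camera "0" as compressed PNG bytes and save it.
        png_bytes = client.simGetImage("0", airsim.ImageType.Scene)
        airsim.write_file(f"object{obj_id:03d}_view{view_id}.png", png_bytes)
```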
3.5. Collection of Real Data

To collect real data, a Google Chrome extension called "Download All Images"69 was used to batch-download images. For every search query, which is the name of the object class, e.g., "drill," all the search results from Google Images are downloaded after scrolling down to the end of the page, until the "end of the results" message is displayed. The extension downloads all the images displayed in the current Google Images page. Finally, after scrolling down to the very end and downloading the images, they are manually inspected for incorrect images, such as thumbnails and wrong results.

3.6. Testing on ObjectNet

We choose ResNet15270 as our backbone architecture, as it is a well-researched and popular architecture in the computer vision community, with good baseline performance figures for many of the common computer vision tasks. It has also been one of the primary networks benchmarked by the authors and creators of ObjectNet.2 After the collection of both synthetic and real data, a ResNet15270 backbone CNN (pretrained on ImageNet) is fine-tuned on these respective images, and the top-1 accuracy for both the synthetic and the real data is calculated.

3.7. Learning Rate Tuning

The learning rate (LR) is one of the most important hyperparameters to tune, and it can be inferred from Fig. 8 that the LR significantly affects the performance of Resnet152 on the ObjectNet dataset (up to 40% points difference in validation accuracy between LRs). In this section, we explain in detail our LR tuning strategy, which involves taking an upper-bound LR value from an LR range test and using it in an exponential StepLR decay scheduler. We elucidate our heuristics to calculate the optimal parameters for the decay schedule.

3.7.1. LR range test

For LR tuning, we start with the LR range test,71 in which the network is trained for a small number of epochs while the LR is increased after every minibatch from a small lower bound toward a large upper bound and the loss is monitored; the largest LR at which the loss is still improving, just before it starts to diverge, is taken as the upper-bound LR. We modify this method, wherein we calculate the validation loss (instead of the training loss) on the entire validation set after every fixed number of minibatches. Doing this ensures that any covariate shift (difference in distribution between training and test data) is accounted for and that the LR is tuned on a validation set that resembles the distribution of the test set.
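The following is a minimal PyTorch-style sketch of this modified range test; the model, data loaders, optimizer settings, LR bounds, and evaluation interval are assumptions for illustration rather than the exact values used in our experiments.

```python
# Illustrative sketch (assumed PyTorch setup): LR range test in which the LR grows
# after every minibatch and the selection criterion is the loss on the *entire*
# validation set, evaluated every `eval_every` minibatches.
import torch
import torch.nn.functional as F

def validation_loss(model, val_loader, device="cuda"):
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for x, y in val_loader:
            x, y = x.to(device), y.to(device)
            total += F.cross_entropy(model(x), y, reduction="sum").item()
            n += y.size(0)
    model.train()
    return total / n

def lr_range_test(model, train_loader, val_loader, lr_min=1e-6, lr_max=1.0,
                  num_steps=500, eval_every=50, device="cuda"):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_min, momentum=0.9)
    growth = (lr_max / lr_min) ** (1.0 / num_steps)   # multiplicative LR increase per minibatch
    history = []                                      # (LR used, validation loss) pairs
    step = 0
    model.to(device).train()
    while step < num_steps:
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            F.cross_entropy(model(x), y).backward()
            optimizer.step()
            step += 1
            if step % eval_every == 0:
                lr_used = optimizer.param_groups[0]["lr"]
                history.append((lr_used, validation_loss(model, val_loader, device)))
            for group in optimizer.param_groups:      # raise the LR for the next minibatch
                group["lr"] *= growth
            if step >= num_steps:
                break
    # Simple proxy for the upper bound: the LR at which the validation loss was lowest.
    upper_bound_lr = min(history, key=lambda pair: pair[1])[0]
    return upper_bound_lr, history
```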
3.7.2. Parameters for StepLR

We take the value obtained from the LR range test and use it in conjunction with an exponential decay schedule for the LR, also known as "StepLR," because it gave better top-1 accuracy. StepLR consists of decaying the LR at each step by multiplying the previous LR by a decay rate. We combine the range test from the CLR policy and the standard exponential decay policy to create a better-performing LR strategy. We also calculate the other parameters, namely the LR decay factor and the step size for the StepLR schedule. This LR tuning strategy is a part of our methodology. The steps of the strategy can be summarized as follows: run the LR range test to obtain an upper-bound LR, use that value as the initial LR, choose the step size and the total number of training epochs, and compute the decay rate for the StepLR schedule from these quantities. Here, the step size and the number of epochs are constants that are determined empirically. The minimum step size is set at 10,000 iterations because reducing the LR more frequently than once every 10,000 iterations caused divergence in the loss value. This makes sure that, for larger datasets with more than 10,000 iterations per epoch, the LR is reduced gradually every epoch for a total of 15 epochs. The exponential decay rate is then calculated from the upper-bound LR and these constants (from Ref. 65) and used in the StepLR schedule.
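The following is a minimal PyTorch-style sketch of how such a StepLR schedule can be configured. The exact decay-rate formula is not reproduced here; the sketch computes a standard exponential decay rate that takes the LR from the range-test upper bound down to an assumed final value over the planned number of decay steps, and the step-size rule shown is one plausible reading of the heuristics described above.

```python
# Illustrative sketch (assumed PyTorch setup and placeholder final LR): build an
# optimizer and StepLR scheduler from the range-test upper bound.
import torch

def steplr_schedule(model, lr_max, iterations_per_epoch, num_epochs=15,
                    lr_final=None, min_step_size=10_000):
    # One reading of the rule above: never decay more often than every 10,000
    # iterations, and for larger datasets decay once per epoch instead.
    step_size = max(min_step_size, iterations_per_epoch)
    total_iterations = num_epochs * iterations_per_epoch
    num_decays = max(1, total_iterations // step_size)

    if lr_final is None:
        lr_final = lr_max / 100.0                            # placeholder target for the final LR
    gamma = (lr_final / lr_max) ** (1.0 / num_decays)        # per-decay multiplicative factor

    optimizer = torch.optim.SGD(model.parameters(), lr=lr_max, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=step_size, gamma=gamma)
    return optimizer, scheduler

# Usage: call scheduler.step() once per training iteration, so that the LR is
# multiplied by gamma every `step_size` iterations.
```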
4. Results

4.1. Efficiency of Synthetic Data Generation

The time taken to collect real data using Google Images includes the time taken to query, download, and manually inspect the images, as described in Sec. 3.5. From our experiments, this process took an average of 37.5 min per class, which equates to 62.5 min per 1000 images. This is the best-case scenario, where one only has to query and download images that are already indexed on the web by Google Images. In practice, the effort and time involved are often higher, as one may have to acquire or find actual objects belonging to a label and then photograph them. Collection of synthetic data involves placing the objects in the virtual environment and then capturing images of them, as described in Secs. 3.3 and 3.4.
Placing objects took 40 min for a dataset of 31,200 images (76.92 s per 1000 images). The overall time taken for placing objects and capturing images was 2.34 min per 1000 images. Synthetic data are not only more than an order of magnitude faster to collect, as shown in Table 2, but also easier to prototype and to create new versions of.

Table 2 Time taken. Synthetic data takes 11× less time.
4.2. Top-1 Accuracy

4.2.1. Against state-of-the-art

Our method has achieved the highest top-1 accuracy on ObjectNet data so far, as shown in Table 3. (This excludes works that use huge transformers62 trained on proprietary datasets61 and other works that use natural language supervision.57) The closest comparable work is LooC,59 which achieves 32.6% top-1 accuracy. CGC60 uses context-aware CNNs, improving over the aforementioned Resnet50 baseline of 29.35% by 2.18% points and achieving a top-1 accuracy of 31.53% on 113 ObjectNet classes. BiT61 achieves 58.1%, using large-scale pretraining on their internal JFT-300M dataset,75 which has more than 300 million images.

Table 3 Top-1 accuracy of the state-of-the-art in ObjectNet classification, with the third column showing the number of test classes from ObjectNet used in each of the works.

4.2.2. Against real data scraped from the Internet

We were also able to beat real data collected for our specific subset of classes, as outlined in Sec. 3.5. We fine-tune the final layer of a pretrained Resnet152 backbone on both Real10 and Synthetic10_v4. Using our LR strategy, the upper-bound LRs for Real10 and Synthetic10_v4 were each fixed from the LR range test. Early stopping was implemented with a patience value of 5, whereby training terminates if there is no improvement in validation accuracy for five epochs. The other common training parameters were kept the same for both synthetic and real images. When we tested both models on ObjectNet_subset, the Synthetic10_v4 model outperformed the top-1 accuracy of Real10 by more than 3% points, as shown in Fig. 6.

Night-time images. To test the night-time and low-light performance of our model trained on synthetic data, we created the Objectnet_night subset, some samples of which are shown in Fig. 7. We then tested the top-1 accuracy on the Objectnet_night subset for both the Synthetic10_v4 and the Real10 models. These results, tabulated in Table 4, show that our synthetic model loses only seven tenths of a percentage point when going from the full dataset to the night subset, whereas the Real10 model loses 18.9% points. This is because our data generation pipeline included a broad range of lighting variations, as mentioned in Sec. 3, including minimal ambient scene lighting and random point light intensities that can go very low.

Table 4 Top-1 accuracy of models trained on real and synthetic data on the Objectnet10 dataset and the Objectnet_night subset with only night-time and very low-light test images.
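For completeness, the following is a minimal PyTorch-style sketch of the fine-tuning setup described in this section: an ImageNet-pretrained ResNet152 whose final layer is replaced and trained on the 10 classes, with early stopping at a patience of 5 epochs. The optimizer, scheduler, data loaders, and epoch budget are illustrative assumptions rather than the exact training configuration.

```python
# Illustrative sketch (assumed PyTorch setup): fine-tune only the final layer of an
# ImageNet-pretrained ResNet152, with early stopping on validation accuracy.
import torch
import torch.nn as nn
from torchvision import models

def build_model(num_classes=10):
    model = models.resnet152(pretrained=True)
    for param in model.parameters():                      # freeze the pretrained backbone
        param.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)   # new trainable classification head
    return model

def fine_tune(model, train_loader, val_loader, optimizer, scheduler,
              max_epochs=50, patience=5, device="cuda"):
    # optimizer/scheduler could come, e.g., from a StepLR setup like the earlier sketch.
    criterion = nn.CrossEntropyLoss()
    best_acc, epochs_without_improvement = 0.0, 0
    model.to(device)
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
            scheduler.step()                              # stepped per iteration (Sec. 3.7.2)
        # Validation accuracy for early stopping.
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                correct += (model(x).argmax(dim=1) == y).sum().item()
                total += y.size(0)
        acc = correct / total
        if acc > best_acc:
            best_acc, epochs_without_improvement = acc, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:    # no improvement for 5 epochs
                break
    return model, best_acc
```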
4.3. Learning Rate Scheduling and Training Strategy

We use the validation set of the Real10 Google Images dataset to tune the LR. The graph in Fig. 8 shows validation accuracy at the end of each epoch, up to 50 epochs, for six different LRs on the same version of synthetic data. The six LRs chosen for this graph are logarithmically spaced. We stopped after 50 epochs because none of the models were showing evidence of further improvement by then. The effect of LR on convergence is most pronounced at the extremes, where the lowest LR is very slow to learn (improving 0.44% per epoch) compared with a higher LR (1.6% per epoch). Similarly, the highest LRs cause divergence in learning and a subsequent decrease in validation accuracy: the curve for one of the highest LRs starts at 10.4% and is back at 10.2% at the end of 30 epochs, with very little learning in between, while another even decreases in accuracy compared with the first epoch. We postulate that the decrease in performance occurs because high LRs overshoot the minima and settle on solutions that are in less optimal regions of the weight space. Accordingly, this graph demonstrates the importance of selecting a suitable LR for both the maximum accuracy potential and the convergence speed of the network. Table 5 shows the effectiveness of our strategy compared with the cyclical learning rate (CLR)71 and manual tuning. Our top-1 accuracy shows a nearly 30% increase over manual tuning and also beats CLR by 16% when using the same LR.

Table 5 The difference in parameters and accuracy when using manual tuning or CLR71 versus using our strategy.
A standard parameter search takes at least 30 epochs of training per candidate LR to monitor the learning curve and fix one value. For example, a typical LR search over six logarithmically increasing LRs, each run for 30 epochs, would take a total of 180 epochs to tune and set the LR schedule, as opposed to the two or three epochs our strategy needs. This equates to time/resource savings between a factor of 180/3 and 180/2 in this case, i.e., an average saving of 75×, similar to that of CLR, while achieving better accuracy. It is worth mentioning that our method resulted in a higher accuracy of 72% compared with a standard grid search followed by stochastic gradient descent.

4.4. Synthetic Data Generation Parameters

Inspired by studies such as Refs. 9, 56, and 76, many variations, including point lights, image saturation, and contrast, were included minimally from the beginning. But the extent to which each parameter needs to be varied was an open question. Consider the case of lighting. In a previous study9 that used the Unreal Engine, random point lights with random intensities were shown to be a major contributing factor to the performance of their object detector, but the optimal range of intensity and color variations of the point lights is unknown. To study the effect of each of those variations, we started with a minimal number of variations (two or three) in each factor to keep the final dataset size and the Unreal environment at a manageable level (five variations in ten factors for each object would result in a dataset size of 5^10, i.e., nearly 10 million images per object). Then we incrementally tuned a combination of factors to monitor the performance improvement, if any, and decided whether to keep or discard those factors and/or their additional variations. The variations that caused the most improvement in the top-1 accuracy are listed in Table 6.

Table 6 Model versions and the features added. v1: basic version with simple 90 deg rotations and variations in point lights, contrast, and saturation; v2: v1 with a larger number of 3D models; v3: random rotation in the range (−45 deg, 45 deg) on all objects, more point light variations, and complex backgrounds from Google Images; v4: v3 with multiple-scale images.
As shown in Table 6, complex backgrounds combined with point light variations in intensity and color improved the accuracy by 27% points. The complex backgrounds for the objects were downloaded from Google Images and scripted as decals onto the floor of the Unreal Engine. An example of a chair, laid out on the default grey floor of the Unreal Engine and on three other background images embedded on the floor, is shown in Fig. 9. In addition to the image capture angles (which simulate rotations), random object rotations (around all three axes) in the range of (−45 deg, 45 deg) also boosted the accuracy: because the image capture angles are fixed, the random rotations help capture even more varied views of the objects. So did having images of objects at multiple scales; v4, which is v3 with the same pictures additionally taken at multiple scales, gives an 11.2% point improvement, as seen in Table 6. This was simulated by moving the camera on the virtual drone proportionally closer to and farther from the object when taking the pictures, i.e., zooming in and out.

4.5. Ablation Studies

Although the incremental approach detailed in the previous section helped broadly understand the effect of each feature added to the synthetic data, an ablation study is necessary to more accurately quantify the loss of performance when each individual feature is removed. The limitation of this ablation study is the resultant decrease in dataset size when the dataset is generated without a particular feature. For example, the overall dataset size is 31,200 images, which includes 3120 images from each of the 10 backgrounds (including the default grey background). When nine of the complex backgrounds were removed from the data generation process, the resulting dataset became smaller, owing to the number of backgrounds reducing from 10 to 1. To combat this limitation, each of the smaller datasets resulting from the feature removals was duplicated to match the size of the full-featured dataset (31,200 images); a sketch of one way to implement this is given after Table 7. It can be seen from Table 7 that the most important feature contributing to the performance was complex backgrounds, with a 35.5% point decrease when just the default backgrounds were used. The second most important feature is rotations, which, when removed, gave only 44.2% accuracy, a 28% point decrease. The remaining two major features, point light colors and multiple scales, caused 12.6% and 11.1% point reductions in accuracy, which are smaller but nonetheless substantial.

Table 7 Ablation studies with comparison against the full dataset with no features removed.
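The size-equalization step used in the ablation study can be implemented in several ways; the sketch below approximates the duplication described above by oversampling the ablated dataset with replacement until it reaches the size of the full-featured dataset. It assumes PyTorch-style dataset objects and is illustrative rather than the exact code used.

```python
# Illustrative sketch (assumed PyTorch setup): oversample an ablated dataset so
# that each training run sees the same number of images (31,200) as the full
# dataset, so accuracy differences are not caused by dataset size alone.
from torch.utils.data import DataLoader, RandomSampler

def equalized_loader(ablated_dataset, target_size=31_200, batch_size=64):
    # Sampling with replacement for `target_size` draws effectively duplicates the
    # smaller ablated dataset up to the size of the full-featured dataset.
    sampler = RandomSampler(ablated_dataset, replacement=True, num_samples=target_size)
    return DataLoader(ablated_dataset, batch_size=batch_size, sampler=sampler)
```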
4.5.1. Generalization performance

To test whether generalization performance was reduced by fine-tuning on the synthetic dataset, we first tested the off-the-shelf ImageNet-pretrained Resnet152 network on the subset of classes we call ImageNet10, which is just the subset of ImageNet containing the 10 classes of our experiment. We then compared that vanilla network against the same network fine-tuned on our synthetic data, measuring the top-1 accuracy in both cases. This comparison shows that taking the network that has been trained on ImageNet and fine-tuning it with synthetic data not only increases the performance on ObjectNet but also increases it slightly on ImageNet as well. The fact that the top-1 accuracy did not decrease on ImageNet data demonstrates that synthetic data generated using our hybrid approach helps counter the effect of catastrophic forgetting and does not decrease the generalization power of a network in the image classification task.

5. Conclusions

We have successfully tackled the very challenging ObjectNet dataset by training on purely synthetic data and managed to outperform real data on it, a feat that only two other works have achieved to date.38,56 We have also beaten the state-of-the-art CNN classification performance for ObjectNet, with our 72% top-1 accuracy. We have also demonstrated with ablation studies (in Sec. 4.5) that the most important contributing factor (with 35% points) to our performance is the complex backgrounds created using our novel decal method. Moreover, the network trained on our synthetic data generalizes well to ImageNet also, as shown in Sec. 4.5.1. Our entire synthetic data generation pipeline is publicly available (https://github.com/saiabinesh/hybrid-synth), and so is our dataset,16 so that this research can be reproduced, extended, or repurposed for different tasks. We have shown in Sec. 4.1 that it is more economical in terms of time and effort to collect and preprocess synthetic data using our pipeline than to batch download preindexed real data for image classification. In addition, we have presented an LR adjustment strategy that is faster to tune and 16% points more accurate than standard CLR71 in tackling ObjectNet using synthetic data (refer to Sec. 4.3).

6. Discussion and Future Work

Although our technique has proven effective on ObjectNet, its effectiveness might depend on the availability and quality of 3D models. Some models do not import correctly and may need to be corrected for things such as object pivots and surface normals, requiring additional time and effort. Synthetic data may be even more efficient for tasks such as object detection and semantic segmentation, where annotating real data is much harder, requiring hand labeling or drawing of polygons, bounding boxes, etc. That is something we are pursuing at the moment, and it could be a good candidate for future work.

Acknowledgments

This publication has emanated from research supported in part by a research grant from the European Union's Horizon 2020 Research and Innovation Program (Grant No. 700264) (ROCSAFE) and a research grant from the Science Foundation Ireland (SFI) (Grant No. SFI/12/RC/2289_P2), co-funded by the European Regional Development Fund (Insight Centre for Data Analytics). The authors would like to acknowledge the Irish Centre for High-End Computing (ICHEC) for the provision of computational facilities and support.
For the purpose of Open Access, the authors have applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission. The authors have no relevant financial interests in the manuscript and no other potential conflicts of interest to disclose.

References

S. Nikolenko, Synthetic Data for Deep Learning, Springer International Publishing (2021).
A. Barbu et al., "ObjectNet: a large-scale bias-controlled dataset for pushing the limits of object recognition models," in Adv. in Neural Inf. Process. Syst. (2019).
D. A. Pomerleau, "ALVINN: an autonomous land vehicle in a neural network," (1989).
A. Gaidon et al., "Virtual worlds as proxy for multi-object tracking analysis," in Proc. IEEE Conf. Comput. Vision and Pattern Recognit., 4340–4349 (2016). https://doi.org/10.1109/CVPR.2016.470
M. Loper et al., "SMPL: a skinned multi-person linear model," ACM Trans. Graphics 34(6), 1–16 (2015). https://doi.org/10.1145/2816795.2818013
X. Peng et al., "Learning deep object detectors from 3D models," in Proc. IEEE Int. Conf. Comput. Vision, 1278–1286 (2015). https://doi.org/10.1109/ICCV.2015.151
W. Qiu et al., "UnrealCV: virtual worlds for computer vision," in Proc. 25th ACM Int. Conf. Multimedia, 1221–1224 (2017).
J. Tremblay et al., "Training deep networks with synthetic data: bridging the reality gap by domain randomization," in Proc. IEEE Conf. Comput. Vision and Pattern Recognit. Workshops, 969–977 (2018).
L.-C. Chen et al., "DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2017). https://doi.org/10.1109/TPAMI.2017.2699184
K. He et al., "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vision, 2961–2969 (2017).
J. Long, E. Shelhamer and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Comput. Vision and Pattern Recognit., 3431–3440 (2015).
R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vision, 1440–1448 (2015).
S. Ren et al., "Faster R-CNN: towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2016). https://doi.org/10.1109/TPAMI.2016.2577031
A. Sharif Razavian et al., "CNN features off-the-shelf: an astounding baseline for recognition," in Proc. IEEE Conf. Comput. Vision and Pattern Recognit. Workshops, 806–813 (2014). https://doi.org/10.1109/CVPRW.2014.131
S. A. Natarajan and M. G. Madden, "Hybrid synthetic data that outperforms real data in ObjectNet," (2021). https://dx.doi.org/10.21227/x84r-vh21
D. Dwibedi, I. Misra and M. Hebert, "Cut, paste and learn: surprisingly easy synthesis for instance detection," in Proc. IEEE Int. Conf. Comput. Vision, 1301–1310 (2017). https://doi.org/10.1109/ICCV.2017.146
N. Dvornik, J. Mairal and C. Schmid, "Modeling visual context is key to augmenting object detection datasets," in Proc. Eur. Conf. Comput. Vision (ECCV), 364–380 (2018).
M. Everingham et al., "The Pascal visual object classes (VOC) challenge," Int. J. Comput. Vision 88(2), 303–338 (2010). https://doi.org/10.1007/s11263-009-0275-4
H. Wang et al., "Data augmentation for object detection via progressive and selective instance-switching," (2019).
Q. Lian et al., "Exploring geometric consistency for monocular 3D object detection," in Proc. IEEE/CVF Conf. Comput. Vision and Pattern Recognit., 1685–1694 (2022). https://doi.org/10.1109/CVPR52688.2022.00173
H. Caesar et al., "nuScenes: a multimodal dataset for autonomous driving," in Proc. IEEE/CVF Conf. Comput. Vision and Pattern Recognit., 11621–11631 (2020).
B. Paden et al., "A survey of motion planning and control techniques for self-driving urban vehicles," IEEE Trans. Intell. Veh. 1(1), 33–55 (2016). https://doi.org/10.1109/TIV.2016.2578706
S. Milz et al., "Visual SLAM for automated driving: exploring the applications of deep learning," in Proc. IEEE Conf. Comput. Vision and Pattern Recognit. Workshops, 247–257 (2018). https://doi.org/10.1109/CVPRW.2018.00062
H. Lategahn, A. Geiger and B. Kitt, "Visual SLAM for autonomous ground vehicles," in IEEE Int. Conf. Rob. and Autom., 1732–1737 (2011). https://doi.org/10.1109/ICRA.2011.5979711
T. Fossen, K. Y. Pettersen and H. Nijmeijer, Sensing and Control for Autonomous Vehicles, Springer (2017).
C. Kanellakis and G. Nikolakopoulos, "Survey on computer vision for UAVs: current developments and trends," J. Intell. Rob. Syst. 87(1), 141–168 (2017). https://doi.org/10.1007/s10846-017-0483-z
J. Courbon et al., "Vision-based navigation of unmanned aerial vehicles," Control Eng. Pract. 18(7), 789–799 (2010). https://doi.org/10.1016/j.conengprac.2010.03.004
A. Al-Kaff et al., "Survey of computer vision algorithms and applications for unmanned aerial vehicles," Expert Syst. Appl. 92, 447–463 (2018). https://doi.org/10.1016/j.eswa.2017.09.033
A. Geiger et al., "Vision meets robotics: the KITTI dataset," Int. J. Rob. Res. 32(11), 1231–1237 (2013). https://doi.org/10.1177/0278364913491297
G. Ros et al., "The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes," in Proc. IEEE Conf. Comput. Vision and Pattern Recognit., 3234–3243 (2016). https://doi.org/10.1109/CVPR.2016.352
S. Borkman et al., "Unity perception: generate synthetic data for computer vision," (2021).
C. Quiter and M. Ernst, "deepdrive/deepdrive: 2.0," https://zenodo.org/record/1248998#.Y_kt0nbP1aQ (2018).
K.-T. Lai et al., "VIVID: virtual environment for visual deep learning," in Proc. 26th ACM Int. Conf. Multimedia, 1356–1359 (2018).
A. Eftekhar et al., "Omnidata: a scalable pipeline for making multi-task mid-level vision datasets from 3D scans," in Proc. IEEE/CVF Int. Conf. Comput. Vision, 10786–10796 (2021). https://doi.org/10.1109/ICCV48922.2021.01061
Z. Li et al., "OpenRooms: an end-to-end open framework for photorealistic indoor scene datasets," (2020).
M. Hodosh, P. Young and J. Hockenmaier, "Framing image description as a ranking task: data, models and evaluation metrics," J. Artif. Intell. Res. 47, 853–899 (2013). https://doi.org/10.1613/jair.3994
S. Zakharov et al., "Photo-realistic neural domain randomization," Lect. Notes Comput. Sci. 13685, 310–327 (2022). https://doi.org/10.1007/978-3-031-19806-9_18
S. I. Nikolenko, "Introduction: the data problem," in Synthetic Data for Deep Learning, 1–17, Springer International Publishing, Cham (2021).
N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recognit. (CVPR'05), 886–893 (2005). https://doi.org/10.1109/CVPR.2005.177
J. Xu et al., "Learning a part-based pedestrian detector in a virtual world," IEEE Trans. Intell. Transp. Syst. 15(5), 2121–2131 (2014). https://doi.org/10.1109/TITS.2014.2310138
K. M. Robinette et al., "Civilian American and European surface anthropometry resource (CAESAR), final report. Volume 1. Summary," (2002).
G. Varol et al., "Learning from synthetic humans," in Proc. IEEE Conf. Comput. Vision and Pattern Recognit., 109–117 (2017).
R. Girshick et al., "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vision and Pattern Recognit., 580–587 (2014). https://doi.org/10.1109/CVPR.2014.81
K. Saenko et al., "Adapting visual category models to new domains," Lect. Notes Comput. Sci. 6314, 213–226 (2010). https://doi.org/10.1007/978-3-642-15561-1_16
K. Greff et al., "Kubric: a scalable dataset generator," in Proc. IEEE/CVF Conf. Comput. Vision and Pattern Recognit., 3749–3761 (2022).
B. O. Community, Blender - A 3D Modelling and Rendering Package, Blender Foundation (2021).
E. Coumans and Y. Bai, "PyBullet, a Python module for physics simulation for games, robotics and machine learning," (2016). http://pybullet.org
R. Martin-Brualla et al., "NeRF in the wild: neural radiance fields for unconstrained photo collections," in Proc. IEEE/CVF Conf. Comput. Vision and Pattern Recognit., 7210–7219 (2021).
E. Bochinski, V. Eiselein and T. Sikora, "Training a convolutional neural network for multi-class object detection using solely virtual world data," in 13th IEEE Int. Conf. Adv. Video and Signal Based Surveill. (AVSS), 278–285 (2016). https://doi.org/10.1109/AVSS.2016.7738056
P. F. Felzenszwalb et al., "Object detection with discriminatively trained part-based models," IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645 (2009). https://doi.org/10.1109/TPAMI.2009.167
P. Dollár et al., "Fast feature pyramids for object detection," IEEE Trans. Pattern Anal. Mach. Intell. 36(8), 1532–1545 (2014). https://doi.org/10.1109/TPAMI.2014.2300479
J. Ferryman and A. Shahrokni, "PETS2009: dataset and challenge," in Twelfth IEEE Int. Workshop on Perform. Eval. Tracking and Surveill., 1–6 (2009). https://doi.org/10.1109/PETS-WINTER.2009.5399556
J. McCormac et al., "SceneNet RGB-D: can 5M synthetic images beat generic ImageNet pre-training on indoor segmentation?," in Proc. IEEE Int. Conf. Comput. Vision, 2678–2687 (2017). https://doi.org/10.1109/ICCV.2017.292
J. Tremblay et al., "Deep object pose estimation for semantic robotic grasping of household objects," in Conf. Rob. Learn. (CoRL) (2018).
S. Hinterstoisser et al., "An annotation saved is an annotation earned: using fully synthetic training for object detection," in Proc. IEEE/CVF Int. Conf. Comput. Vision Workshops (2019).
X. Zhai et al., "LiT: zero-shot transfer with locked-image text tuning," in Proc. IEEE/CVF Conf. Comput. Vision and Pattern Recognit., 18123–18133 (2022). https://doi.org/10.1109/CVPR52688.2022.01759
J. Deng et al., "ImageNet: a large-scale hierarchical image database," in IEEE Conf. Comput. Vision and Pattern Recognit., 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848
T. Xiao et al., "What should not be contrastive in contrastive learning," in Int. Conf. Learn. Represent. (2021).
X. Lin et al., "Context-gated convolution," Lect. Notes Comput. Sci. 12363, 701–718 (2020). https://doi.org/10.1007/978-3-030-58523-5_41
A. Kolesnikov et al., "Big transfer (BiT): general visual representation learning," Lect. Notes Comput. Sci. 12350, 491–507 (2020). https://doi.org/10.1007/978-3-030-58558-7_29
X. Zhai et al., "Scaling vision transformers," in Proc. IEEE/CVF Conf. Comput. Vision and Pattern Recognit., 12104–12113 (2022).
A. Radford et al., "Learning transferable visual models from natural language supervision," in Int. Conf. Mach. Learn., 8748–8763 (2021).
M. Wisdom and A. Wisdom, "TurboSquid," https://www.turbosquid.com/Search/3D-Models/marketplace (2000).
S. Shah et al., "AirSim: high-fidelity visual and physical simulation for autonomous vehicles," in Field and Service Robotics, 621–635, Springer (2018).
"Download all images," https://chrome.google.com/webstore/detail/download-all-images/ifipmflagepipjokmbdecpmjbibjnakm?hl=en (2021).
K. He et al., "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vision and Pattern Recognit., 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
L. N. Smith, "Cyclical learning rates for training neural networks," in IEEE Winter Conf. Appl. Comput. Vision (WACV), 464–472 (2017).
J. R. A. Solares et al., "Deep learning for electronic health records: a comparative review of multiple deep neural architectures," J. Biomed. Inf. 101, 103337 (2020). https://doi.org/10.1016/j.jbi.2019.103337
G. Hacohen and D. Weinshall, "On the power of curriculum learning in training deep networks," in Proc. 36th Int. Conf. Mach. Learn., 2535–2544 (2019).
R. Yedida and S. Saha, "Beginning with machine learning: a comprehensive primer," Eur. Phys. J. Spec. Top. 230(10), 2363–2444 (2021). https://doi.org/10.1140/epjs/s11734-021-00209-7
C. Sun et al., "Revisiting unreasonable effectiveness of data in deep learning era," in Proc. IEEE Int. Conf. Comput. Vision, 843–852 (2017). https://doi.org/10.1109/ICCV.2017.97
A. Prakash et al., "Structured domain randomization: bridging the reality gap by context-aware synthetic data," in Int. Conf. Rob. and Autom. (ICRA), 7249–7255 (2019). https://doi.org/10.1109/ICRA.2019.8794443
Biography

Sai Abinesh Natarajan received his MSc degree in data analytics. He is a PhD student at the School of Computer Science of the National University of Ireland, Galway (NUIG), under the supervision of Prof. Michael Madden. His PhD work is in the area of procedural 3D/2D hybrid synthetic image generation for data-scarce computer vision tasks. He has worked on the image analysis module for the Horizon 2020 EU project ROCSAFE.

Michael G. Madden is an established professor of computer science at the National University of Ireland, Galway. His research is focused on new theoretical advances in machine learning and data mining, motivated by important practical applications, on the basis that challenging applications foster novel algorithms, which in turn enable innovative applications. Specific research topics include: new methods for combining domain knowledge with data mining; time series data analysis; reasoning under uncertainty; Bayesian networks; reinforcement learning; and applications in science, engineering, and medicine. He was the coordinator of a project called ROCSAFE (Remotely Operated CBRN Scene Assessment and Forensic Examination) that was funded by the EU's Horizon 2020 program. To date, his research has led to more than 100 publications, four patents, 12 PhD graduates, and a spin-out company.