KEYWORDS: Image processing, Electronic imaging, Evolutionary algorithms, Video, Machine learning, Visual process modeling, Video processing, Super resolution, Data processing, Data modeling
In recent years, deep learning has become prevalent in solving applications from multiple domains. Convolutional neural networks (CNNs) in particular have demonstrated state-of-the-art performance for the task of image classification. However, the decisions made by these networks are not transparent and cannot be directly interpreted by a human. Several approaches have been proposed to explain the reasoning behind a prediction made by a network. We propose a taxonomy for grouping these methods based on their assumptions and implementations. We focus primarily on white-box methods that leverage information about the internal architecture of a network to explain its decision. Given the task of image classification and a trained CNN, our work aims to provide a comprehensive and detailed overview of a set of methods that can be used to create explanation maps for a particular image, which assign an importance score to each pixel of the image based on its contribution to the decision of the network. We also propose a further classification of the white-box methods based on their implementations to enable better comparisons and help researchers find methods best suited for different scenarios.
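The core idea of an explanation map, a per-pixel importance score for a classifier's decision, can be sketched with a minimal model-agnostic example. This is a finite-difference stand-in for gradient-based methods, not any specific method surveyed in the paper; the toy linear "network" is purely illustrative.

```python
import numpy as np

def explanation_map(score_fn, image, eps=1e-4):
    """Pixel importance as |d score / d pixel|, estimated with
    central finite differences over each pixel independently."""
    image = image.astype(float)
    sal = np.zeros_like(image)
    for idx in np.ndindex(image.shape):
        up = image.copy(); up[idx] += eps
        dn = image.copy(); dn[idx] -= eps
        sal[idx] = abs(score_fn(up) - score_fn(dn)) / (2 * eps)
    # Normalize to [0, 1] so the map can be rendered as a heat map.
    return sal / sal.max() if sal.max() > 0 else sal

# Toy "network": a linear score with weights concentrated at two pixels.
w = np.zeros((3, 3)); w[1, 1] = 2.0; w[0, 0] = 1.0
sal = explanation_map(lambda x: float((w * x).sum()), np.ones((3, 3)))
```

White-box methods avoid this per-pixel probing by reading gradients or activations directly from the network's internals, which is far cheaper.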
KEYWORDS: Convolutional neural networks, Buildings, Image classification, RGB color model, Visualization, Cultural heritage, Image processing, Data modeling, Convolution, Video
We propose a convolutional neural network to classify images of buildings using sparse features at the network's input in conjunction with primary color pixel values. As a result, a trained neural model is obtained to classify Mexican buildings into three classes according to architectural style: prehispanic, colonial, and modern, with an accuracy of 88.01%. We face the problem of poor information in the training dataset due to the unequal availability of cultural material, and we propose a data augmentation and oversampling method to solve it. The results are encouraging and allow for prefiltering of content in search tasks.
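The oversampling-with-augmentation idea can be sketched as follows. This is a minimal illustration with hypothetical helper names, not the authors' exact procedure; the flip "augmentation" is a placeholder.

```python
import random

def augment(image):
    # Placeholder augmentation: horizontal flip of a nested-list image.
    return [row[::-1] for row in image]

def oversample(dataset, target_per_class):
    """Balance classes by adding augmented copies of minority-class
    samples until every class holds target_per_class examples."""
    by_class = {}
    for image, label in dataset:
        by_class.setdefault(label, []).append(image)
    balanced = []
    for label, images in by_class.items():
        pool = list(images)
        while len(pool) < target_per_class:
            pool.append(augment(random.choice(images)))
        balanced.extend((img, label) for img in pool[:target_per_class])
    return balanced
```

Balancing before training keeps the minority class (here, the era with the least available material) from being drowned out by the majority classes.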
Prediction of visual saliency in images and video is a highly researched topic. Target applications include quality assessment of multimedia services in a mobile context, video compression techniques, recognition of objects in video streams, etc. In the framework of mobile and egocentric perspectives, visual saliency models cannot be founded only on bottom-up features, as suggested by feature integration theory. The central-bias hypothesis is not respected either. In this case, the top-down component of human visual attention becomes prevalent. Visual saliency can be predicted on the basis of seen data. Deep convolutional neural networks (CNNs) have proven to be a powerful tool for the prediction of salient areas in still images. In our work we also focus on the sensitivity of the human visual system to residual motion in a video. A deep CNN architecture is designed in which we incorporate input primary maps as the color values of pixels and the magnitude of local residual motion. Complementary contrast maps allow for a slight increase in accuracy compared to the use of color and residual motion only. The experiments show that the choice of input features for the deep CNN depends on the visual task: for interest in dynamic content, the 4K model with residual motion is more efficient, and for object recognition in egocentric video the pure spatial input is more appropriate.
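Feeding such "input primary maps" to a CNN amounts to stacking per-pixel feature planes into one input tensor. A minimal sketch, assuming H x W x C channel-last layout (the function name and exact plane ordering are illustrative, not the paper's):

```python
import numpy as np

def stack_input_maps(rgb, residual_motion, contrast=None):
    """Stack per-pixel feature planes into an H x W x C input tensor:
    three color channels, the residual-motion magnitude, and an
    optional complementary contrast map."""
    planes = [rgb[..., i] for i in range(3)] + [residual_motion]
    if contrast is not None:
        planes.append(contrast)
    return np.stack(planes, axis=-1)
```

Swapping input planes in or out (e.g. dropping residual motion for egocentric content) changes only the input depth, leaving the rest of the architecture untouched.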
In the computing world, the consumption and generation of multimedia content are in constant growth due to the popularization of mobile devices and new communication technologies. Retrieving information from multimedia content to describe Mexican buildings is a challenging problem. Our objective is to determine patterns related to three building eras (Pre-Hispanic, colonial, and modern). For this purpose, existing recognition systems need to process a large number of videos and images. Automatic learning systems train their recognition capability with a semantically annotated database. We built the database taking into account high-level feature concepts, user knowledge, and experience. The annotations help correlate context and content to understand the data in multimedia files. Without a method, the user would need to remember everything and register these data manually. This article presents a methodology for quick image annotation using a graphical interface and intuitive controls, emphasizing the two most important aspects: the time consumed during the annotation task and the quality of the selected images. Thus, we classify images only by their era and their quality. Finally, we obtain a dataset of Mexican buildings preserving contextual information with semantic annotations for the training and testing of building recognition systems. Research on low-level content descriptors is another possible use for this dataset.
Multimedia content production and storage in repositories are now an increasingly widespread practice. Indexing concepts for search in multimedia libraries are very useful for users of the repositories. However, the search tools for content-based retrieval and automatic video tagging still lack consistency. Regardless of how these systems are implemented, it is of vital importance to possess a large set of videos whose concepts are tagged with ground truth (training and testing sets). This paper describes a novel methodology for making complex annotations on video resources through the ELAN software. Concepts related to Mexican nature are annotated in the High-Level Features (HLF) development set of TRECVID 2014 in a collaborative environment. Based on this set, each nature concept observed is tagged on each video shot using concepts of the TRECVID 2014 dataset. We also propose new concepts, such as tropical settings, urban scenes, actions, events, weather, and places, to name a few, as well as specific concepts that best describe the video content of Mexican culture. We have been careful to tag the database with concepts of nature and ground truth. It is evident that a collaborative environment is more suitable for the annotation of concepts related to ground truth and nature. As a result, a Mexican nature database was built, which serves as the basis for testing and training sets to automatically classify new multimedia content of Mexican nature.
Content-based image retrieval (CBIR) has become an interesting and urgent research topic due to the increasing need for indexing and classification of multimedia content in large databases. Low-level visual descriptors, such as color-based, texture-based, and shape-based descriptors, have been used for the CBIR task. In this paper we propose a color-based descriptor which describes image contents well, integrating both the global feature provided by the dominant color and the local features provided by the color correlogram. The performance of the proposed descriptor, called the Dominant Color Correlogram Descriptor (DCCD), is evaluated against some MPEG-7 visual descriptors and other color-based descriptors reported in the literature, using two image datasets of different size and contents. The performance of the proposed descriptor is assessed using three different metrics commonly used in the image retrieval task: ARP (Average Retrieval Precision), ARR (Average Retrieval Rate), and ANMRR (Average Normalized Modified Retrieval Rank). Precision-recall curves are also provided to show the better performance of the proposed descriptor compared with other color-based descriptors.
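The local component here, the color correlogram, captures how often a pixel of a given color has a same-colored pixel at a fixed distance. A minimal sketch of the auto-correlogram on a quantized image (four axis-aligned offsets only; the full descriptor and its fusion with the dominant color are the paper's contribution, not reproduced here):

```python
import numpy as np

def autocorrelogram(img, n_colors, d):
    """For each quantized color c: probability that a pixel of color c
    has a pixel of the same color at city-block distance d."""
    h, w = img.shape
    same = np.zeros(n_colors)
    total = np.zeros(n_colors)
    for y in range(h):
        for x in range(w):
            c = img[y, x]
            for dy, dx in ((0, d), (d, 0), (0, -d), (-d, 0)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    total[c] += 1
                    same[c] += img[ny, nx] == c
    return np.divide(same, total, out=np.zeros(n_colors), where=total > 0)
```

Unlike a plain histogram, this statistic is sensitive to the spatial layout of colors, which is what lets it complement a global dominant-color feature.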
The paper contributes to no-reference video quality assessment of broadcast HD video over IP networks and DVB. In this work we have enhanced our bottom-up spatio-temporal saliency map model by considering the semantics of the visual scene. Thus we propose a new saliency map model based on face detection, which we call the semantic saliency map. A new fusion method is proposed to merge the bottom-up saliency maps with the semantic saliency map. We show that our NR metric WMBER weighted by the spatio-temporal-semantic saliency map provides higher results than the WMBER weighted by the bottom-up spatio-temporal saliency map. Tests are performed on two H.264/AVC video databases for video quality assessment over lossy networks.
The human visual system is very complex and has been studied for many years, specifically for the purposes of efficient encoding of visual content, e.g. video content from digital TV. There is physiological and psychological evidence indicating that viewers do not pay equal attention to all exposed visual information, but only focus on certain areas known as the focus of attention (FOA) or saliency regions. In this work, we propose a novel saliency-based objective quality assessment metric for assessing the perceptual quality of decoded video sequences affected by transmission errors and packet losses. The proposed method weights the Mean Square Error (MSE), giving a Weighted MSE (WMSE), according to the saliency map calculated at each pixel. Our method was validated through subjective quality experiments.
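The weighting step can be sketched in a few lines. This is a minimal illustration with an assumed normalization (weights summing to one), not the authors' exact formulation:

```python
import numpy as np

def wmse(ref, dist, saliency):
    """Saliency-weighted MSE: per-pixel squared errors weighted by a
    saliency map whose weights are normalized to sum to one."""
    w = saliency / saliency.sum()
    return float((w * (ref.astype(float) - dist.astype(float)) ** 2).sum())
```

With a uniform saliency map this reduces to the ordinary MSE; concentrating saliency on a region makes errors there dominate the score, which is the intended perceptual effect.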
The paper presents the Argos evaluation campaign of video content analysis tools supported by the French Techno-Vision program. This project aims at developing the resources of a benchmark of content analysis methods and algorithms. The paper describes the types of tasks evaluated, the way the content set has been produced, the metrics and tools developed for the evaluations, and the results obtained at the end of the first phase.
Data visualization techniques are penetrating various technological areas. In the field of multimedia, such as information search and retrieval in multimedia archives, or digital media production and post-production, data visualization methodologies based on large graphs offer an exciting alternative to conventional storyboard visualization. In this paper we develop a new approach to the visualization of multimedia (video) documents based both on large-graph clustering and on preliminary video segmentation and indexing.
In the domain of video indexing, one of the research topics is the automatic extraction of information to reach the objective of automatically describing and organizing the content. Considering a video stream, different kinds of information can be taken into account, but we can suppose that most of the information is contained in the foreground objects, so the number of objects, their shapes, their contours, and so on can constitute a good basis for the content description. This paper describes a new approach to extracting foreground objects from an MPEG-2 video stream, in the framework of the "rough indexing paradigm" we define. This paradigm allows us to reach this purpose in near real time while maintaining a good level of detail.
In this paper, we describe a new way to create an object-oriented video surveillance system that monitors activity in a site. The process is performed in two steps: first, human faces are detected as a guess for objects of interest; then these entities are tracked through the video stream. The guideline here is not to perform a very accurate detection and tracking based, for example, on contours, but to provide a global image processing system on a simple personal computer that takes advantage of the cooperation of detection and tracking. The scheme we propose thus provides a simple, fast solution that tracks a few specific points of interest on the object boundary and possibly engages a motion-based detection in order to recover the object of interest in the scene or to detect new objects of interest as well. This tracker also enables learning motion activities, detecting unusual activities, and supplying statistical information about motion in a scene.
The problem of tracking triangulated meshes for Video Object Planes (VOPs) in video sequences is considered. Triangulated meshes are constructed on the basis of hierarchical region-based models of a VOP. The mesh is articulated; that is, each polygonal region in a VOP is Delaunay-triangulated separately and all partial meshes are connected in the global triangulation. Nodal optical flow is computed from the motion parameters of associated regions according to an affine motion model using a hierarchy of region-based models. The tracking of such a VOP along a video sequence is based on a full region-based tracking of a hierarchical region-based model of the VOP. A forward tracking scheme and a backward motion compensation for hybrid coding are proposed. Experimental results for a complex articulated VOP are presented.
This paper presents a novel approach to the automatic partitioning of video sequences based on scene change detection and global motion estimation. The method is based on a 1D representation of images, the Bin transform, which is a discrete version of the Radon transform. Analysis of the motion and detection of scene changes are realized in the transform domain using online statistical techniques. The analysis of a 1D signal rather than the usual 2D image signal limits computational complexity by itself and permits fast algorithms.
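The efficiency argument can be sketched with the simplest possible 1D projection: summing the frame along rows (a stand-in for the Bin transform, which projects along several directions) and comparing consecutive projections. The threshold rule below is illustrative, not the paper's online statistical test:

```python
import numpy as np

def row_projection(frame):
    """Collapse a 2-D frame into a 1-D signal by summing along rows."""
    return frame.sum(axis=1)

def is_scene_change(prev_frame, curr_frame, threshold):
    # A cut is declared when the mean absolute difference between the
    # consecutive 1-D projections exceeds a threshold. The comparison
    # touches O(H) values instead of O(H*W) pixels.
    diff = np.abs(row_projection(curr_frame) - row_projection(prev_frame))
    return float(diff.mean()) > threshold
```

Global motion shows up in such projections as shifts of the 1D signal, so both tasks reduce to cheap 1D signal analysis.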
In an architectural database that is to be used by architects, urbanists, sociologists, geometers, etc., querying must be simplified. The aim of this work is to retrieve the images of a building that best fit a specified point of view. Original data are provided in DXF and TIFF formats (maps and images, respectively). A loose linking between these two types of information is obtained through textual attributes. However, the same building is photographed several times, and more than a single building can appear in a picture. After determining the point of view by simple 'clicks' on a map, we take advantage of the geometrical description of the building in order to draw its outline. Then, the images that have been textually associated with the selected building undergo a five-step image-processing algorithm: conversion from the RGB color space to the intensity component, Nagao filtering, oriented gradient filtering, thresholding, and correlation-based hierarchical full-search matching. If the building objects are not completely masked by natural ones, the 'rectangular' shapes of frontage and side walls correspond well to the sketch and the requested images are returned to the user.
Region-based coding methods can provide a solution of maximal quality for the transmission of video sequences through low bit-rate channels due to their 'scene-adaptive' nature. Here, the visual quality of frames predicted by motion compensation on nearly semantically homogeneous regions is better than for the conventional block-based and quad-tree-based methods. If the error signal is not encoded, the most bit-consuming component is the description of the region borders and of the topology of the segmentation map. A planar graph with polygonal arcs is used to represent the geometrical form and relations of regions in each frame. A method allowing the segmentation description to be adapted to the variable available bit-rate is proposed, based on rate-distortion theory and constrained optimization. The method uses the concept of description layers. The 'basic' layer contains only the triplet nodes of the graph and the vertices of the highest contrast and curvature. The 'maximal enhancement' layer contains all the nodes and polygonal vertices of the segmentation graph. The choice of the 'optimal' layer for each polygonal arc is made independently, minimizing a Lagrangian cost function that combines rate and distortion measures. The entropy estimates for the encoding of all vertices of a given arc are taken as the rate measure. The distortion measure is the total sum of squared DFD in the area delimited by the basic layer and the maximal enhancement layer for a given arc. Experiments on the 420-625 CCIR videoconference sequence showed a 30 percent decrease in bit-rate for an unnoticeable loss in quality.
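The independent per-arc layer choice can be sketched as a one-line Lagrangian minimization. The data layout below (a list of (rate, distortion) pairs per arc) is an assumed simplification; in the paper the rates are entropy estimates of vertex encoding and the distortions are sums of squared DFD:

```python
def choose_layers(arcs, lam):
    """Per-arc layer selection minimizing the Lagrangian cost
    J = D + lam * R.  `arcs` is a list where each entry holds the
    (rate, distortion) pair of every candidate description layer."""
    return [min(range(len(layers)),
                key=lambda i: layers[i][1] + lam * layers[i][0])
            for layers in arcs]
```

Sweeping the multiplier `lam` trades bit-rate against distortion: a large `lam` penalizes rate and pushes arcs toward the sparse basic layer, a small `lam` admits the full enhancement layer.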
We present a two-step segmentation scheme for the very low bit rate coding of general-purpose video scenes. Our objective is to determine the regions of interest before the actual segmentation procedure, so as to reduce the computational overhead introduced by this relatively complex process and to avoid over-segmentation of the scene. Simulation results show that the proposed scheme yields a radical reduction in the number of discrete spatio-temporal regions, while the background is identified as one uniform region, even when it is characterized by complex global motion.
Content-based image coders are now the center of attention for the currently emerging MPEG-4 standard. A method based on spatio-temporal segmentation for motion image coding is developed in this paper. The method is designed for sequences characterized by homogeneous global motion (camera motion) and the presence of semantic objects having their proper motions. A key property of the method is the fit of moving borders to the spatial contours of regions, which allows for a high quality of predicted images without any error encoding.
KEYWORDS: Image segmentation, Image compression, Computer programming, Image quality, Error analysis, Video coding, Quantization, Visualization, Signal to noise ratio, Video
In this paper a method for very low bit rate coding of video sequences is described. The method is based on a spatio-temporal segmentation of the sequence into semantically significant regions with polygonal shapes. The motion parameters, the structural description of regions, and the motion-compensation error should be encoded to reconstruct high-quality images at the decoder. In the coding scheme, the temporal coherence of the region structure and of the motion-compensation error is taken into account to achieve a very low bit rate.
This paper describes a spatio-temporal segmentation of image sequences for object-oriented low bit rate image coding. The spatio-temporal segmentation is obtained by merging spatially homogeneous regions into motion-homogeneous ones. Each of these regions is characterized by a motion parameter vector and structural information which represents a polygonal approximation of the region border. The segmentation of the current frames in the sequence is obtained by prediction and refinement. A predictive coding scheme is developed. Some estimates of the quantity of information are given.
This paper describes a method of image coding based on contour detection and a region-growing procedure. Contour detection allows the most evident frontiers of homogeneous regions to be found. The region-growing procedure serves to close contours and obtain a more precise segmentation. The goal of this segmentation is to obtain regions with borders which are easy to code, so a split-and-merge procedure and a quad-tree representation are used, which give a more regular closure of contours than usual contour-wise methods. The particularity of the method is the choice of growing centers: they are placed on the skeleton of non-closed regions and used in the split-and-merge region-growing procedure. The center-placement scheme aims to obtain a uniform distribution of centers inside a contour, which results in an approximately constant speed of growing.