To answer semantically complex questions about an image, a Visual Question Answering (VQA) model must fully understand the visual scene, especially the dynamic interactions between objects. This task inherently requires reasoning about the visual relationships among the objects in the image, and the reasoning process should be guided by information from the question. In this paper, we propose a semantic relation graph reasoning network in which the semantic relation reasoning process is guided by a cross-modal attention mechanism. In addition, a Gated Graph Convolutional Network (GGCN), constructed from the cross-modal attention weights, injects the semantic interaction information between objects into their visual features, producing relation-aware features. In particular, we train a semantic relationship detector to extract the semantic relationships between objects for constructing the semantic relation graph. Experiments demonstrate that the proposed model outperforms most state-of-the-art methods on the VQA v2.0 benchmark dataset.
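The abstract does not give the GGCN formulation itself, but the general idea of attention-guided gated message passing over object features can be sketched. The following NumPy toy sketch is an illustration only: the adjacency construction (outer product of attention weights), the sigmoid gating, and all function names are assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(question_vec, object_feats):
    # attention weight of each object w.r.t. the question (hypothetical scoring)
    scores = object_feats @ question_vec          # (N,)
    return softmax(scores)

def gated_graph_conv(object_feats, attn, W, W_gate):
    # adjacency derived from cross-modal attention weights (an assumption:
    # outer product of per-object attention, self-loops removed)
    A = np.outer(attn, attn)
    np.fill_diagonal(A, 0.0)
    A = A / (A.sum(axis=1, keepdims=True) + 1e-8)  # row-normalize

    msg = A @ object_feats @ W                     # aggregate neighbor messages
    gate = 1.0 / (1.0 + np.exp(-(object_feats @ W_gate)))  # sigmoid gate
    # inject relational information into visual features -> relation-aware features
    return object_feats + gate * msg
```

In this sketch the gate decides, per feature dimension, how much relational information flows into each object's representation; the actual model presumably learns `W` and `W_gate` end-to-end with the attention module.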