Interactive video object segmentation (iVOS) aims to efficiently produce high-quality segmentation masks of a target object in a video with the help of user interactions. Numerous methods have recently been proposed to advance iVOS, but their use of user intent is limited in two ways. First, typical interaction modules generate the segmentation directly, without any further analysis of the input interaction, and thus miss valuable information. Second, recent iVOS approaches do not consider the raw interaction information during propagation. As a result, the final segmentation can be corrupted by erroneous information carried over from the previous round's segmentation masks. To address these weaknesses, this paper proposes ISP, an Iterative Segmentation and Propagation based iVOS method that explores user intent more thoroughly. ISP models user intent directly in both the PGI2M module and the TP module. First, ISP extracts a coarse-grained segmentation mask by analyzing the user's input, and this mask is then used as a prior to aid the PGI2M module. Second, ISP introduces a new interaction-driven self-attention module that recalls the user's intent in the TP module. Extensive experiments on two public datasets show the superiority of ISP over existing methods.