9 October 2024 Retrieval-augmented prompts for text-only image captioning
Jinbo Huang, Gang Xie, Li Cao
Proceedings Volume 13288, Fourth International Conference on Computer Graphics, Image, and Virtualization (ICCGIV 2024); 1328811 (2024) https://doi.org/10.1117/12.3045215
Event: Fourth International Conference on Computer Graphics, Image, and Virtualization (ICCGIV 2024), 2024, Chengdu, China
Abstract
Supervised image captioning methods have made significant progress, but high-quality human-annotated image-text paired datasets are costly to collect. Recently, pre-trained vision-language models such as CLIP have shown exceptional performance in cross-modal association, offering new approaches to image captioning, such as zero-shot captioning through purely textual training. However, the modality gap between image and text embeddings is an obstacle to cross-modal alignment in text-only captioning. Moreover, insufficient visual understanding and over-reliance on textual data during training lead to hallucinations (e.g., object misidentification and inaccurate object counts), producing implausible captions. To tackle these issues, this paper presents RAPCap, a text-only image captioning method built on retrieval-augmented prompts. Specifically, RAPCap enhances the language model's understanding of images by incorporating similar captions obtained through retrieval augmentation, thereby alleviating hallucinations. During inference, RAPCap translates the image into the textual space to bridge the modality gap. Experimental results demonstrate that RAPCap achieves new state-of-the-art performance on Flickr30k and performs competitively on MSCOCO compared with previous zero-shot captioning methods.
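The abstract names two ingredients, retrieval of similar captions to form a prompt and translation of the image into the textual space, without giving implementation details. The following is only a minimal sketch of how such a pipeline could look, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the prompt template, the temperature, and the tiny hand-written 3-dimensional vectors that stand in for CLIP embeddings. The projection step uses a similarity-weighted average over stored caption embeddings, which is one common way to map an image embedding into text space; the abstract does not say that RAPCap does it this way.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def project_to_text_space(image_emb, memory, temperature=0.1):
    """Hypothetical projection: softmax-weighted average of caption embeddings.

    `memory` is a list of (caption, text_embedding) pairs. The result lies in
    the span of text embeddings, which is the sense in which it is 'textual'.
    """
    scores = [cosine(image_emb, emb) / temperature for _, emb in memory]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(image_emb)
    return [
        sum(w * emb[i] for w, (_, emb) in zip(weights, memory)) / total
        for i in range(dim)
    ]

def retrieve_captions(query_emb, memory, k=2):
    """Return the k captions whose embeddings are most similar to the query."""
    ranked = sorted(memory, key=lambda item: cosine(query_emb, item[1]), reverse=True)
    return [caption for caption, _ in ranked[:k]]

def build_prompt(retrieved):
    """Hypothetical prompt template that lists the retrieved captions."""
    lines = ["Similar captions:"]
    lines += [f"- {caption}" for caption in retrieved]
    lines.append("Caption for this image:")
    return "\n".join(lines)

# Toy caption memory; the vectors are made-up stand-ins for CLIP text embeddings.
MEMORY = [
    ("a dog runs across a grassy field", [0.9, 0.1, 0.0]),
    ("two dogs play with a ball", [0.8, 0.3, 0.1]),
    ("a man rides a bicycle down a street", [0.1, 0.9, 0.2]),
    ("a plate of food on a table", [0.0, 0.2, 0.9]),
]

# Made-up stand-in for the CLIP image embedding of a dog photo.
image_emb = [0.85, 0.2, 0.05]

text_like = project_to_text_space(image_emb, MEMORY)
retrieved = retrieve_captions(text_like, MEMORY, k=2)
prompt = build_prompt(retrieved)
print(prompt)
```

In a real system the prompt would be passed to a language model decoder together with the projected embedding. That step is omitted here because the abstract does not describe it.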
(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Jinbo Huang, Gang Xie, and Li Cao "Retrieval-augmented prompts for text-only image captioning", Proc. SPIE 13288, Fourth International Conference on Computer Graphics, Image, and Virtualization (ICCGIV 2024), 1328811 (9 October 2024); https://doi.org/10.1117/12.3045215
KEYWORDS
Education and training
Visualization
Visual process modeling
Data modeling
Information visualization
Image retrieval
Performance modeling