Contrastive Language-Image Pre-training (CLIP) is vulnerable to adversarial attacks that cause misclassification through subtle perturbations imperceptible to the human eye. Although adversarial training strengthens CLIP models against such attacks, it often degrades their accuracy on clean images. To tackle this challenge, we propose a novel defense strategy that leverages human brain activity data. The proposed method combines features of brain activity with those of adversarial examples, enhancing the robustness of CLIP while maintaining high accuracy on clean images. Experimental results demonstrate the effectiveness of our method for the accurate retrieval of both clean and adversarial images. These results highlight the potential of brain activity data to overcome existing challenges of adversarial defense in foundation models.
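A minimal sketch of the fusion idea in PyTorch, assuming the brain activity features are projected into CLIP's embedding space and mixed with the adversarial image features by a convex combination; the projection layer, the fusion weight `alpha`, and all dimensions are illustrative assumptions, not details taken from the paper:

```python
import torch
import torch.nn as nn

class BrainFusion(nn.Module):
    """Hypothetical fusion of brain activity features with CLIP image features."""
    def __init__(self, img_dim=512, brain_dim=1024, alpha=0.5):
        super().__init__()
        self.brain_proj = nn.Linear(brain_dim, img_dim)  # map brain features into CLIP space
        self.alpha = alpha                               # assumed fusion hyperparameter

    def forward(self, img_feat, brain_feat):
        # Convex combination of adversarial-image and brain features,
        # then re-normalization as in standard CLIP retrieval.
        fused = self.alpha * img_feat + (1 - self.alpha) * self.brain_proj(brain_feat)
        return fused / fused.norm(dim=-1, keepdim=True)

# Usage: img_feat from a CLIP image encoder on (possibly adversarial) images,
# brain_feat from recorded brain activity for the same stimuli.
fusion = BrainFusion()
img_feat = torch.randn(8, 512)     # placeholder CLIP image features
brain_feat = torch.randn(8, 1024)  # placeholder brain activity features
robust_feat = fusion(img_feat, brain_feat)
```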
Clustered Federated Learning (CFL) is a modern approach to handling heterogeneous settings in Federated Learning. However, conventional models often overfit to the data within each cluster because communication between clients in different clusters receives little attention. To address this insufficient knowledge sharing in CFL, we present a novel approach that facilitates inter-cluster communication. Our proposed method promotes effective knowledge sharing among clusters while also improving clustering efficiency through layer separation. Experimental results show that our proposed method significantly enhances model performance.
KEYWORDS: Image processing, Image classification, Data modeling, Image enhancement, Semantics, Classification systems, Atmospheric modeling, Systems modeling, Education and training
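One way the layer-separation idea can be realized is sketched below, assuming the model splits into globally shared layers (averaged across all clients, enabling inter-cluster knowledge sharing) and cluster-specific layers (averaged only within a cluster); the `head.` prefix marking the split is a hypothetical convention, not the paper's actual design:

```python
import copy
import torch

def aggregate(client_states, cluster_ids, head_prefix="head."):
    """client_states: list of state_dicts; cluster_ids: list of ints, one per client."""
    new_states = [copy.deepcopy(s) for s in client_states]
    for k in client_states[0].keys():
        if k.startswith(head_prefix):
            # Cluster-specific layers: average only within each cluster.
            for c in set(cluster_ids):
                members = [i for i, cid in enumerate(cluster_ids) if cid == c]
                avg = torch.stack([client_states[i][k] for i in members]).mean(0)
                for i in members:
                    new_states[i][k] = avg
        else:
            # Shared layers: global averaging enables inter-cluster knowledge sharing.
            avg = torch.stack([s[k] for s in client_states]).mean(0)
            for s in new_states:
                s[k] = avg
    return new_states

# Toy usage with three clients in two clusters:
states = [{"body.w": torch.randn(4, 4), "head.w": torch.randn(4, 2)} for _ in range(3)]
out = aggregate(states, cluster_ids=[0, 0, 1])
```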
Classification models are commonly trained on real-world data and may therefore learn spurious correlations, leading to over-reliance on features not directly relevant to the subject. In this paper, we propose a novel framework for generating counterfactual images. Our framework enables us to confirm whether classification models are sensitive to changes in the features under consideration. We introduce state-of-the-art caption and image generators, which enable better counterfactual image generation as well as more efficient processing. Experimental results show that the counterfactual images generated by our method have superior feature-perturbation capabilities, allowing us to assess the robustness of classification models more effectively. The main improvements our framework offers over existing methods are higher-quality counterfactual images and a lower computational cost.
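As one hedged illustration of how such a framework can be assembled, the sketch below captions an image, perturbs a single attribute in the caption, and regenerates the image; BLIP and Stable Diffusion img2img are stand-ins chosen for illustration, since the abstract does not name the exact caption and image generators:

```python
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)
generator = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5").to(device)

def counterfactual(image: Image.Image, old: str, new: str) -> Image.Image:
    # Caption the image, then swap one attribute in the caption text.
    inputs = processor(image, return_tensors="pt").to(device)
    caption = processor.decode(captioner.generate(**inputs)[0], skip_special_tokens=True)
    edited = caption.replace(old, new)  # e.g. "snowy" -> "grassy" background
    # strength controls how far the regenerated image may drift from the original
    return generator(prompt=edited, image=image, strength=0.6).images[0]
```

A classifier whose prediction flips on the counterfactual likely relied on the perturbed feature rather than on the subject itself.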
This paper presents a method for generating expert comments on human motion in sports videos using a Multimodal Large Language Model (MLLM). In the proposed method, a pretrained Vision Transformer (ViT) encoder and a transformer encoder extract tokens from sports videos, enabling expert comment generation that takes temporal information into account. Experiments using basketball videos from the Ego-Exo4D dataset validated the effectiveness of incorporating temporal information in expert comment generation, demonstrating the proposed method's superiority over existing techniques.
KEYWORDS: Video, Education and training, Video coding, Video acceleration, Curium, Feature extraction, Information technology, Information science, Design, Video processing
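A minimal sketch of the described token-extraction stage, with timm's `vit_base_patch16_224` standing in for the pretrained ViT encoder and a small transformer encoder adding temporal context; layer counts and dimensions are assumptions:

```python
import torch
import torch.nn as nn
import timm

class TemporalTokenizer(nn.Module):
    """Per-frame ViT features followed by a temporal transformer encoder."""
    def __init__(self, dim=768, layers=2, heads=8):
        super().__init__()
        self.vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, frames):                 # frames: (B, T, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.vit(frames.flatten(0, 1))  # (B*T, 768) per-frame features
        feats = feats.view(b, t, -1)
        return self.temporal(feats)             # (B, T, 768) temporally contextualized tokens

tokens = TemporalTokenizer()(torch.randn(1, 8, 3, 224, 224))
# These tokens would then be projected into the MLLM's embedding space
# to condition expert-comment generation.
```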
With the rapid development of recording and storage hardware, efficient methods for retrieving desired videos are required. Among video retrieval methods, cross-modal video retrieval, which aims to retrieve a target video from a natural language query, has attracted attention. Cross-modal video retrieval is realized by learning a common representation of videos and texts so that their similarity can be calculated directly from their content. However, traditional cross-modal video retrieval approaches focus only on global features and ignore fine-grained information such as a single action or event in the video. In this paper, we propose using a large language model to extract rich action and event information from the text and match it with the paired video hierarchically. We design a prompt that elicits the semantically informative action and event components in the form of natural language. Experimental results demonstrate the effectiveness of our method.
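The two ingredients can be sketched as below, under assumptions about the prompt wording, the weighting `lam`, and feature shapes: an LLM prompt that elicits action/event components, and a hierarchical score combining global and component-level similarity:

```python
import torch
import torch.nn.functional as F

# Hypothetical prompt sent to an LLM; its output lines become component queries.
PROMPT = (
    "List every distinct action and event in the following video "
    "description, one per line, as short natural-language phrases:\n{caption}"
)

def hierarchical_score(video_feat, text_feat, comp_feats, lam=0.5):
    """video_feat: (D,), text_feat: (D,), comp_feats: (K, D) component embeddings."""
    global_sim = F.cosine_similarity(video_feat, text_feat, dim=0)
    # Each action/event component matches the video independently;
    # averaging aggregates the fine-grained evidence.
    comp_sim = F.cosine_similarity(comp_feats, video_feat.unsqueeze(0)).mean()
    return lam * global_sim + (1 - lam) * comp_sim

score = hierarchical_score(torch.randn(512), torch.randn(512), torch.randn(4, 512))
```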
This paper presents an action-event prediction method based on masked modeling for bidirectional time-series analysis in soccer. Since optimal action events should be selected based on changes in match situations, it is important to consider bidirectional time-series changes in the data. To predict the next action event while considering the bidirectional time series, the proposed method learns the context of action-event sequences by predicting masked action events from the preceding and following contexts. The prediction accuracy of our method improves on that of a unidirectional method, which shows the effectiveness of taking the bidirectional time series into account in soccer.
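The masked-modeling setup can be sketched BERT-style as below, where a masked event is predicted through bidirectional self-attention over the preceding and following events; the event vocabulary size, sequence length, and model width are assumptions:

```python
import torch
import torch.nn as nn

class MaskedEventModel(nn.Module):
    """BERT-style masked modeling over soccer action-event sequences."""
    def __init__(self, n_events=40, dim=128, max_len=64):
        super().__init__()
        self.mask_id = n_events                      # extra token id for [MASK]
        self.embed = nn.Embedding(n_events + 1, dim)
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)  # bidirectional self-attention
        self.head = nn.Linear(dim, n_events)

    def forward(self, events):                       # events: (B, T) ids, some set to mask_id
        pos = torch.arange(events.size(1), device=events.device)
        h = self.encoder(self.embed(events) + self.pos(pos))
        return self.head(h)                          # logits over events at every position

model = MaskedEventModel()
seq = torch.randint(0, 40, (2, 16))
seq[:, 7] = model.mask_id                            # mask the event to be predicted
logits = model(seq)                                  # logits[:, 7] predicts the masked event
```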
This paper presents a novel Transformer-based customer interest estimation method using videos from a security camera in a real store. Expectations for the application of Artificial Intelligence (AI) technology in various industrial fields have increased. In this study, we focus on the retail industry rather than the manufacturing industry, where AI technology has already been widely introduced. In the retail industry, understanding customer interest in products is a significant issue, and Point-of-Sale (POS) data have been used to analyze customer behavior. However, information about customers before purchase cannot be obtained from POS data, a limitation inherent to that type of data. To address this problem, we propose a new customer interest estimation method using visual information obtained from a security camera. The proposed method consists of three phases: a re-identification phase, a 3D posture estimation phase, and an interest estimation phase. The advantage of our architecture is that the re-identification phase enables individual identification of multiple persons, the 3D posture estimation phase obtains highly expressive posture information, and the interest estimation phase then enables individual interest estimation. Finally, we achieve high customer interest estimation performance using data obtained from a real store.
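A skeleton of the three-phase architecture is sketched below, with the interest estimation phase shown as a small Transformer over one tracked person's 3D pose sequence; the concrete re-ID and pose models are abstracted away, and all shapes, class names, and the binary output are assumptions:

```python
import torch
import torch.nn as nn

class InterestPipeline(nn.Module):
    """Phase 3: a small Transformer over a track of 3D poses (17 joints x 3 coords)."""
    def __init__(self, pose_dim=17 * 3, dim=64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.proj = nn.Linear(pose_dim, dim)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 2)               # interested / not interested

    def forward(self, poses):                       # poses: (B, T, 51) for one tracked person
        h = self.temporal(self.proj(poses))
        return self.head(h.mean(dim=1))             # clip-level interest logits

# Phase 1 (re-identification) groups detections into per-person tracks, and
# phase 2 (3D posture estimation) turns each track into the (T, 51) pose
# sequence consumed above.
logits = InterestPipeline()(torch.randn(2, 30, 51))
```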