Proceedings Volume Fifth International Conference on Computer Vision and Information Technology (CVIT 2024), 1344301 (2024) https://doi.org/10.1117/12.3057395
This PDF file contains the front matter associated with SPIE Proceedings Volume 13443, including the Title Page, Copyright information, Table of Contents, and Conference Committee information.
Proceedings Volume Fifth International Conference on Computer Vision and Information Technology (CVIT 2024), 1344302 (2024) https://doi.org/10.1117/12.3055583
Applying single-image super-resolution methods to real-world images with complex and unknown degradation is a challenge. Recent research has made significant progress in solving this problem by employing more complex degradation models to emulate real-world scenes, thereby improving perceptual image quality. However, these methods are often limited by network structure, leading to the generation of over-smoothed results with insufficient detail. In this paper, a texture-attention discriminator architecture based on U-Net is proposed for real-world super-resolution tasks. The architecture effectively directs attention to the high-frequency details of the image by leveraging frequency information extracted through the Laplacian pyramid. As a result, it complements GAN-based super-resolution methods in recapturing complex real textures, resulting in better perceptual image quality when applied to real-world images.
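As an illustration of the frequency decomposition this abstract refers to, the sketch below builds a small Laplacian pyramid with OpenCV and keeps the high-frequency residual bands that a texture-attention discriminator could attend to. The level count and OpenCV-based implementation are assumptions, not the authors' actual code.

```python
# Illustrative sketch: extracting high-frequency detail with a Laplacian pyramid.
# Level count and the OpenCV implementation are assumptions, not the paper's code.
import cv2
import numpy as np

def laplacian_pyramid(img: np.ndarray, levels: int = 3):
    """Return a list of high-frequency residual images, finest level first."""
    residuals = []
    current = img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(current)
        up = cv2.pyrUp(down, dstsize=(current.shape[1], current.shape[0]))
        residuals.append(current - up)   # high-frequency band at this scale
        current = down
    return residuals

img = (np.random.rand(128, 128, 3) * 255).astype(np.uint8)  # stand-in for a real photo
high_freq = laplacian_pyramid(img)       # bands the discriminator could weight
```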
Proceedings Volume Fifth International Conference on Computer Vision and Information Technology (CVIT 2024), 1344303 (2024) https://doi.org/10.1117/12.3055769
This paper proposes a two-stage video scene segmentation method based on multimodal semantic interaction. The method divides the video scene segmentation task into two stages: shot audio-visual representation and multimodal scene segmentation. In the first stage, the method leverages the high correlation and complementarity between audio-visual information by using an interactive attention module to deeply explore audio-visual semantic information. Simultaneously, it introduces a self-supervised learning strategy to improve the model's generalization ability by utilizing the temporal structure characteristics of scenes. In the second stage, the method constructs a multimodal feature fusion module, learning a unified shot representation from the audio-visual representation based on the attention mechanism. Additionally, it builds a visual discrimination loss to regulate the influence of audio-visual features, further enhancing the discriminative power of shot representation. Experimental results on the MovieNet benchmark dataset show that the proposed method can achieve more accurate video scene segmentation.
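To make the first-stage idea concrete, the following is a minimal sketch of an audio-visual interactive attention step in which each modality queries the other, assuming per-shot visual and audio feature sequences of the same embedding size. It illustrates cross-attention in general, not the paper's exact module.

```python
# Minimal cross-attention sketch between visual and audio shot features.
# Dimensions and the residual connections are illustrative assumptions.
import torch
import torch.nn as nn

class InteractiveAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.v_from_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual, audio):
        # Each modality queries the other, so complementary cues are exchanged.
        v_out, _ = self.v_from_a(query=visual, key=audio, value=audio)
        a_out, _ = self.a_from_v(query=audio, key=visual, value=visual)
        return visual + v_out, audio + a_out

visual = torch.randn(2, 10, 256)   # (batch, shots, dim), hypothetical shapes
audio = torch.randn(2, 10, 256)
v_enh, a_enh = InteractiveAttention()(visual, audio)
```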
Proceedings Volume Fifth International Conference on Computer Vision and Information Technology (CVIT 2024), 1344304 (2024) https://doi.org/10.1117/12.3055873
As science and technology continue to advance, industrial robot technology is also maturing. Many enterprises are committed to raising the level of factory automation, deploying high-end mechanical equipment to replace manual labor and improve manufacturing efficiency. An automatic sorting system can sort complex materials by parameters such as product specification, size, and color without human intervention and place them in designated areas. Traditional material sorting is mostly manual, relying on a large workforce to identify, classify, and place items; this approach is labor-intensive, workers tire easily, and effective working time is limited, so sorting efficiency is low and labor costs are high. A vision system that guides the machine to sort items can avoid the problems of manual sorting. Therefore, this article presents the design of an OpenMV-guided material sorting system.
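The color-based decision such a vision-guided sorter makes can be sketched as below; the HSV ranges and class labels are hypothetical examples, and the snippet uses OpenCV in desktop Python rather than the OpenMV firmware API.

```python
# Hypothetical color-threshold classification for a vision-guided sorter.
# HSV ranges and class labels are made-up examples, not the article's values.
import cv2
import numpy as np

HSV_RANGES = {
    "red_part":  ((0, 120, 70), (10, 255, 255)),
    "blue_part": ((100, 120, 70), (130, 255, 255)),
}

def classify_by_color(bgr_image: np.ndarray) -> str:
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    best_label, best_pixels = "unknown", 0
    for label, (lo, hi) in HSV_RANGES.items():
        mask = cv2.inRange(hsv, np.array(lo), np.array(hi))
        pixels = int(cv2.countNonZero(mask))
        if pixels > best_pixels:
            best_label, best_pixels = label, pixels
    return best_label   # the robot would place the item in this class's bin
```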
Proceedings Volume Fifth International Conference on Computer Vision and Information Technology (CVIT 2024), 1344305 (2024) https://doi.org/10.1117/12.3055740
The advent of deep learning has significantly expanded the application of human pose estimation in fields such as image processing and human-computer interaction. Nonetheless, current techniques are limited in their ability to accurately recognize poses in multi-person scenes, under occlusion, and under varying lighting conditions. This paper introduces a novel approach to human skeleton recognition that is applicable to complex environments, with the aim of improving recognition accuracy and robustness in these scenarios. The method utilizes a deep learning model that incorporates a multi-branch structure and occlusion-handling techniques. It was tested on numerous video datasets, and the results demonstrate its superior skeleton recognition ability in complex scenes. Additionally, it exhibits high adaptability and accuracy in multi-person scenes, occlusion situations, and lighting changes.
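The multi-branch idea can be illustrated as parallel feature branches whose outputs are fused before predicting keypoint heatmaps. The branch designs, channel counts, and any occlusion-handling details below are assumptions; the paper's actual architecture is not specified here.

```python
# Illustrative multi-branch feature extractor for pose/keypoint heatmaps.
# All layer sizes are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class MultiBranchPoseNet(nn.Module):
    def __init__(self, keypoints: int = 17):
        super().__init__()
        # Two parallel branches with different receptive fields.
        self.branch_small = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.branch_large = nn.Sequential(
            nn.Conv2d(3, 32, 7, padding=3), nn.ReLU(),
            nn.Conv2d(32, 32, 7, padding=3), nn.ReLU())
        # Fuse branches and predict one heatmap per keypoint.
        self.head = nn.Conv2d(64, keypoints, 1)

    def forward(self, x):
        fused = torch.cat([self.branch_small(x), self.branch_large(x)], dim=1)
        return self.head(fused)

heatmaps = MultiBranchPoseNet()(torch.randn(1, 3, 256, 256))  # (1, 17, 256, 256)
```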
Proceedings Volume Fifth International Conference on Computer Vision and Information Technology (CVIT 2024), 1344306 (2024) https://doi.org/10.1117/12.3055603
To address the incomplete capture of contextual information within each modality in existing video retrieval algorithms, this paper introduces sine-cosine 2D positional encoding into the unimodal encoders to capture fine-grained local details and improve the accuracy of moment localization and highlight detection. First, the video and text are processed to extract visual, audio, and text features. The visual and audio features are fed into unimodal encoders embedded with the sine-cosine 2D positional encoding to capture global temporal relationships, and multimodal fusion is then carried out. The fused features and the text features are used to generate aligned queries. Finally, query decoding and prediction produce the moment localization and highlight detection results, completing the video retrieval. The proposed method was experimentally verified on four datasets: QVHighlights, Charades-STA, TVSum, and YouTube Highlights. On the QVHighlights dataset, the mAP of the moment localization and highlight detection tasks reached 39.09 and 39.68, respectively. On the Charades-STA dataset, Recall@1 at an IoU threshold of 0.5 reached 51.13. The mAP on the TVSum and YouTube Highlights datasets reached 83.7 and 75.3, respectively. This work provides theoretical support for the realization of video retrieval technology.
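A minimal sketch of a fixed sine-cosine 2D positional encoding, in the spirit of the encoding described above, is given below; the grid sizes and dimensionality are examples, and how the paper injects the encoding into its encoders is not reproduced here.

```python
# Fixed sine-cosine 2D positional encoding: half the channels encode rows,
# half encode columns. Sizes below are illustrative.
import numpy as np

def sincos_1d(positions: np.ndarray, dim: int) -> np.ndarray:
    """Standard 1D sinusoidal embedding of shape (len(positions), dim)."""
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    angles = positions[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sincos_2d(height: int, width: int, dim: int) -> np.ndarray:
    row = sincos_1d(np.arange(height), dim // 2)          # (H, dim/2)
    col = sincos_1d(np.arange(width), dim // 2)           # (W, dim/2)
    row = np.repeat(row[:, None, :], width, axis=1)       # (H, W, dim/2)
    col = np.repeat(col[None, :, :], height, axis=0)      # (H, W, dim/2)
    return np.concatenate([row, col], axis=-1)            # (H, W, dim)

pe = sincos_2d(8, 8, 256)    # would be added to unimodal features before encoding
```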
Proceedings Volume Fifth International Conference on Computer Vision and Information Technology (CVIT 2024), 1344307 (2024) https://doi.org/10.1117/12.3056633
Building upon verified neural mechanisms of the biological visual system, this paper introduces a bio-inspired model for object motion state detection. The model aims to ensure high detection accuracy while addressing the interpretability issues prevalent in current deep learning models. The proposed Object Motion Detection System (OMDS) draws inspiration from the directional and speed sensitivity observed in the biological visual system, which arises inherently from physiological structures rather than learned behaviors. It is therefore feasible to replicate motion detection functionality through bio-inspired modeling that simulates the structure of the visual system. To validate this concept, we conducted extensive experiments to assess the detection accuracy and robustness of OMDS under various conditions, and compared its performance with the convolutional neural networks ResNet and ResNeXt under identical conditions. The results demonstrate that, on the given dataset, OMDS not only surpasses the convolutional neural network models but also reflects characteristics observed in the biological visual system, resulting in high accuracy, low hardware requirements, and enhanced interpretability.
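Direction selectivity in biological vision is classically modeled by correlating a delayed signal from one receptor with the current signal from its neighbour (a Reichardt-style elementary motion detector). The sketch below implements that classical mechanism as a generic illustration of the bio-inspired idea; it is not the paper's OMDS.

```python
# Classic Reichardt-style motion detector on 1D intensity signals over time.
# Offered as a generic bio-inspired illustration, not the paper's OMDS.
import numpy as np

def reichardt_response(frames: np.ndarray) -> np.ndarray:
    """frames: (T, W) intensities. Returns net rightward-motion response per step."""
    prev, curr = frames[:-1], frames[1:]
    # Delayed left input times current right input, minus the mirrored term.
    rightward = prev[:, :-1] * curr[:, 1:]
    leftward = prev[:, 1:] * curr[:, :-1]
    return (rightward - leftward).mean(axis=1)   # >0 suggests rightward motion

t = np.arange(20)[:, None]
x = np.arange(64)[None, :]
moving_right = np.sin(0.3 * (x - t))              # a pattern drifting rightward
print(reichardt_response(moving_right).mean() > 0)   # True: positive response
```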
Joel C. de Goma, Izaac Manuelle Lachica, Mikaela Queqquegan, Red Stephen Villarama, Alberto C. Villaluz
Proceedings Volume Fifth International Conference on Computer Vision and Information Technology (CVIT 2024), 1344308 (2024) https://doi.org/10.1117/12.3057285
This study presents a comprehensive methodology for developing and testing a machine-learning model, utilizing the YOLOv8 architecture, to analyze handgun handling states in videos. Four datasets, covering ready-to-fire, low-ready, holstered, and no-handgun images, were meticulously curated and annotated for model training, validation, and testing. The YOLOv8 model was trained with varying epochs and batch sizes, demonstrating robust performance in detecting and classifying handgun poses, with an overall mean Average Precision (mAP) of 98.02%. Comparative analysis against six other handgun detection methods revealed YOLOv8's superior performance, particularly in precision and mAP. Lastly, the study emphasizes the model's effectiveness in real-world scenarios and recommends further exploration of its applications, hyperparameter optimization, continuous dataset refinement, and leveraging its strengths for enhanced public safety measures.
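A minimal sketch of training and evaluating a YOLOv8 detector with the Ultralytics API follows; the dataset YAML, epoch count, batch size, and file names are placeholders, not the study's actual settings.

```python
# Sketch of YOLOv8 training/evaluation with the Ultralytics API.
# "handgun_states.yaml" and "clip.mp4" are hypothetical placeholders.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")                      # pretrained nano checkpoint
model.train(data="handgun_states.yaml",         # hypothetical 4-class dataset:
            epochs=100, batch=16, imgsz=640)    # ready-to-fire, low-ready, holstered, none
metrics = model.val()                           # reports precision, recall, mAP
results = model.predict("clip.mp4", conf=0.5)   # run on a video, frame by frame
```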
Intelligent Information System and Management Based on Data
Proceedings Volume Fifth International Conference on Computer Vision and Information Technology (CVIT 2024), 1344309 (2024) https://doi.org/10.1117/12.3056791
In fashion recommendation systems, as the number of garments grows, the number of possible outfit combinations grows exponentially, which slows training, causes excessive memory usage, and makes related problems increasingly prominent. Hashing techniques can effectively reduce memory usage and improve recommendation speed. As more and more product images are accompanied by text descriptions, multimodal modeling has become a research hotspot in recommendation systems. Most past research processes the different modalities separately and then combines them by weighted averaging, so the correlation between the two modalities is not fully exploited. We believe that the visual and textual features of the same product are semantically consistent and share the same aesthetic characteristics. Therefore, we explore new modeling approaches to mine the higher-order connections between the two modalities. Another issue is that visual information is much more important than textual information in fashion recommendation, so it is not reasonable to treat the two modalities as equal. We model the hash-processed binary representation, optimize the network structure so that visual features are included in the ranking used to compute the loss, and also use textual features to assist the image features in modeling. Tests on two large datasets from Polyvore show that our model achieves significant improvements over state-of-the-art models on key evaluation metrics for personalized recommendation and compatibility modeling.
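One common way to realize the hashing step, offered only as an illustration, is to project a fused visual-text embedding to a binary code with a straight-through sign so that the network remains trainable; the fusion, dimensions, and code length below are assumptions, not the paper's model.

```python
# Sketch: binary hash codes from fused visual-text embeddings via a
# straight-through sign. Sizes and fusion are illustrative assumptions.
import torch
import torch.nn as nn

class HashHead(nn.Module):
    def __init__(self, in_dim: int = 512, bits: int = 64):
        super().__init__()
        self.proj = nn.Linear(in_dim, bits)

    def forward(self, visual, text):
        fused = self.proj(torch.cat([visual, text], dim=-1))
        codes = torch.sign(fused)
        # Straight-through estimator: binary forward pass, identity gradient.
        return fused + (codes - fused).detach()

visual = torch.randn(4, 256)                # hypothetical image embeddings
text = torch.randn(4, 256)                  # hypothetical description embeddings
binary = HashHead()(visual, text)           # 64-bit codes in {-1, +1}
```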
Proceedings Volume Fifth International Conference on Computer Vision and Information Technology (CVIT 2024), 134430A (2024) https://doi.org/10.1117/12.3056706
Convolutional neural networks (CNNs) perform excellently in many image processing and computer vision tasks. However, their complex structure and vast number of parameters require substantial computational and storage resources, which makes them challenging to deploy, especially on mobile or embedded devices. Furthermore, CNNs often face issues such as limited transferability and susceptibility to overfitting. Sparsity and pruning are commonly used techniques to address these issues; current methods include Spatial Dropout, block sparsity, structured pruning, dynamic pruning, and model-independent retraining-free sparsity. Our algorithm enhances CNNs by implementing a Dropout-like operation within the convolutional kernels. Drawing inspiration from sparse CNN and ROCKET methods, this approach employs randomly sparse convolutional kernels to reduce the data density processed during convolution operations. This novel method improves performance and efficiency, demonstrating its potential as a significant advancement in CNN architecture. The method is tested on several popular datasets by adjusting parameters within the same model and on different hardware. It demonstrates improved training speed, higher accuracy, and reduced overfitting compared to traditional CNNs, as measured by FLOPs and validation-set accuracy.
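The kernel-level Dropout-like operation can be sketched as a convolution whose weights are masked by a fixed random binary pattern drawn at construction time; the sparsity rate and layer sizes below are illustrative assumptions rather than the paper's configuration.

```python
# Sketch: convolution with a fixed random binary mask applied to the kernel,
# a Dropout-like sparsification inside the kernel itself. Sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomSparseConv2d(nn.Conv2d):
    def __init__(self, *args, sparsity: float = 0.5, **kwargs):
        super().__init__(*args, **kwargs)
        # Mask drawn once at construction; masked weights stay zero forever.
        mask = (torch.rand_like(self.weight) > sparsity).float()
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

layer = RandomSparseConv2d(3, 16, kernel_size=3, padding=1, sparsity=0.5)
out = layer(torch.randn(1, 3, 32, 32))      # (1, 16, 32, 32)
```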
Proceedings Volume Fifth International Conference on Computer Vision and Information Technology (CVIT 2024), 134430B (2024) https://doi.org/10.1117/12.3055791
In the new era of the booming digital economy, the digital transformation of auditing is inevitable, and carrying out internal auditing in universities with data mining techniques is a beneficial exploration of this trend. This paper builds a data-mining-based auditing framework for the opening and sharing of large-scale instruments and equipment in universities. It designs a three-step data mining process of data collection, data preprocessing, and data analysis, and proposes a "3×2" data analysis system encompassing two types of supervision points, two evaluation indicators, and two recommendation bases, aligned with the three functions of internal auditing: supervision, evaluation, and recommendation. The framework offers ideas for using audit digitization to improve the management of large-scale instrument and equipment opening and sharing in universities, and provides a practical path for the transformation and upgrading of internal auditing in universities.