Intelligence Artificielle et Data Sciences

Image Text Similarity using Deep Learning Object Detection and Word Spotting Approach
(Tassadit, 2025-01-21) Billal MOKHTARI; Lilia MAHDID
With the rapid expansion of deep learning, multi-modal models have become increasingly popular for tasks requiring complex data inputs. Content generation (such as image, video, or text generation), as well as recent object detection and segmentation methods, frequently uses Large Language Models (LLMs). This project focuses on enhancing image and text similarity measures, aiming to improve the CLIP (Contrastive Language-Image Pre-training) method by examining the impact of object semantics on image descriptions. Our approach, named ODITS (Object Driven Image and Text Similarity), uses the CLIP model pre-trained with the ViT-B/32 architecture, which is subsequently fine-tuned for our specific purposes. We evaluated the performance of the fine-tuned model using modified metrics, selecting the optimal checkpoint based on precision to minimize false associations between descriptions and images. Our findings indicate that this optimal checkpoint is 10% more precise than the original checkpoint. The weights from this model will be integrated into the components that ODITS shares with CLIP, providing a robust starting point for further optimization. The research component of the ODITS model, including theoretical and preliminary analysis, is also discussed, providing insights into its potential and areas for future development.
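The precision-based checkpoint selection described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the modified metrics, so the code below assumes plain precision over thresholded cosine similarities; the function names, the 0.5 threshold, and the data layout are hypothetical and are not taken from the ODITS work.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def precision(scores, labels, threshold):
    """Fraction of pairs predicted as matching (score >= threshold)
    whose ground-truth label is 1 (a true image/description match)."""
    predicted = [label for score, label in zip(scores, labels) if score >= threshold]
    if not predicted:
        return 0.0  # no positive predictions: define precision as 0
    return sum(predicted) / len(predicted)


def best_checkpoint(results, threshold=0.5):
    """Pick the checkpoint with the highest precision.

    `results` maps a checkpoint name to a (scores, labels) pair, where
    scores are image/text similarities and labels are 0/1 ground truth.
    This layout is an assumption made for illustration.
    """
    return max(results, key=lambda name: precision(*results[name], threshold))
```

Ranking checkpoints by precision rather than recall penalizes false associations between descriptions and images, which matches the selection criterion stated in the abstract.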