Image Text Similarity using Deep Learning Object Detection and Word Spotting Approach
Date
2025-01-21
Authors
Tassadit
Abstract
With the fast expansion of Deep Learning, multi-modal models have become increasingly popular for tasks requiring complex data inputs. Content generation, such as image, video, or text generation, as well as recent object detection and segmentation methods, frequently relies on Large Language Models (LLMs). This project focuses on enhancing image and text similarity measures, aiming to improve the CLIP (Contrastive Language-Image Pre-training) method by examining the impact of object semantics on image descriptions. Our approach, named ODITS (Object Driven Image and Text Similarity), uses the CLIP model pre-trained with the ViT-B/32 architecture, which is subsequently fine-tuned for our specific purposes. We evaluated the performance of the fine-tuned model using modified metrics, selecting the optimal checkpoint based on precision to minimize false associations between descriptions and images. Our findings indicate that this optimal checkpoint is 10% more precise than the original checkpoint. The weights from this model will be integrated into ODITS’s shared components with CLIP, providing a robust starting point for further optimization. The research component of the ODITS model, including theoretical and preliminary analysis, is also discussed, providing insights into its potential and areas for future development.
Keywords
Image and Text Similarity, Multi-Modal models, CLIP, ODITS, Zero-Shot Learning, Large Language Models, Object Detection, Image Segmentation, Text Recognition and Spotting