OpenCLIP Fine-Tuning for Multi-Modal Retrieval on FashionGen
- Focus: Vision-Language Models · Retrieval Training · Loss Design · Caption Augmentation
- Source Code: GitHub
- Paper (PDF): Download
Project Overview
This project studies how different OpenCLIP-style models behave when fine-tuned for image-text retrieval on FashionGen. The work compares multiple architectures and training objectives while examining how modeling choices affect retrieval quality, convergence behavior, and downstream usability.
The goal was not only to improve retrieval metrics, but also to better understand trade-offs between model size, objective design, caption quality, and training dynamics in a practical multimodal setting.
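Retrieval quality in this kind of study is typically reported as recall@K over paired image and text embeddings. As a minimal illustration (not the project's actual evaluation code; the function name and the one-caption-per-image pairing are assumptions), the metric can be sketched with NumPy:

```python
import numpy as np

def recall_at_k(image_embs, text_embs, k=5):
    """Fraction of text queries whose paired image appears in the top-k.

    Assumes row i of image_embs corresponds to row i of text_embs
    (one caption per image) and that embeddings are L2-normalized,
    so the dot product is cosine similarity.
    """
    sims = text_embs @ image_embs.T              # (n_text, n_image) similarities
    ranks = np.argsort(-sims, axis=1)            # best-matching image first
    targets = np.arange(len(sims))[:, None]      # correct image index per query
    hits = (ranks[:, :k] == targets).any(axis=1)
    return hits.mean()

# Toy check: identical embeddings retrieve themselves at rank 1.
embs = np.eye(4)
print(recall_at_k(embs, embs, k=1))              # → 1.0
```

The same function covers the text-to-image direction by swapping the two arguments.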
What I Explored
- Fine-tuned multiple model variants including ViT-B/32, ViT-B/16, and SigLIP2-based setups
- Compared InfoNCE-style contrastive training against BCE-based (SigLIP-style sigmoid) alternatives
- Evaluated caption augmentation strategies and their effect on retrieval behavior
- Analyzed architectural trade-offs and optimization behavior across experiments
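The two objectives compared above differ in how they treat in-batch negatives: InfoNCE normalizes each row of the similarity matrix with a softmax, while the sigmoid/BCE formulation scores every image-text pair as an independent binary classification. A minimal NumPy sketch of both (illustrative only; the temperature, scale, and bias defaults are assumptions, not the project's tuned values):

```python
import numpy as np

def _logsumexp(x, axis=1):
    """Numerically stable log-sum-exp along an axis."""
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def infonce_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE: softmax cross-entropy over in-batch negatives,
    averaged over the image→text and text→image directions."""
    logits = img @ txt.T / temperature             # (B, B) similarity matrix
    idx = np.arange(len(logits))
    loss_i2t = -(logits - _logsumexp(logits))[idx, idx].mean()
    loss_t2i = -(logits.T - _logsumexp(logits.T))[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2

def sigmoid_bce_loss(img, txt, scale=10.0, bias=-10.0):
    """SigLIP-style sigmoid loss: each (image, text) pair is an independent
    binary classification, with positives on the diagonal."""
    logits = img @ txt.T * scale + bias
    targets = np.eye(len(logits))                  # 1 on diagonal, 0 elsewhere
    z = np.where(targets == 1, logits, -logits)    # flip sign for negatives
    return np.logaddexp(0.0, -z).mean()            # stable log(1 + e^{-z})
```

A practical difference this sketch makes visible: the sigmoid loss needs no row-wise normalization, so it decouples the loss from the batch-wide softmax and scales differently with batch size.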
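One common caption-augmentation strategy for product datasets is template-based paraphrasing from item metadata. The sketch below is purely illustrative: the attribute fields and templates are hypothetical and do not reflect FashionGen's actual schema or the augmentation used in this project.

```python
import random

# Hypothetical attribute fields and templates for illustration only.
TEMPLATES = [
    "a photo of a {color} {category}",
    "{brand} {category} in {color}",
    "product shot of a {color} {category} by {brand}",
]

def augment_caption(attrs, rng=random):
    """Generate one alternative caption by filling a randomly chosen
    template with the item's attribute values."""
    template = rng.choice(TEMPLATES)
    return template.format(**attrs)

item = {"color": "navy", "category": "wool coat", "brand": "ExampleBrand"}
print(augment_caption(item, random.Random(0)))
```

Sampling a fresh template each epoch gives the text encoder varied phrasings of the same item, which is one way augmentation can change retrieval behavior.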
Connection to the Demo System
This research project forms the modeling foundation for my broader multimodal retrieval work. The production-oriented Fashion Search Demo extends these ideas into an end-to-end application with retrieval infrastructure, APIs, and an interactive frontend.