OpenCLIP Fine-Tuning for Multi-Modal Retrieval on FashionGen
- Focus: Vision-Language Models · Retrieval Training · Loss Design · Caption Augmentation
- Source Code: GitHub
- Paper (PDF): Download
Project Overview
This project studies how different OpenCLIP-style models behave when fine-tuned for image-text retrieval on FashionGen. The work compares multiple architectures and training objectives while examining how modeling choices affect retrieval quality, convergence behavior, and downstream usability.
The goal was not only to improve retrieval metrics, but also to better understand trade-offs between model size, objective design, caption quality, and training dynamics in a practical multimodal setting.
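Retrieval quality in this kind of study is typically reported as recall@K over paired image and text embeddings. As a minimal illustration (not the project's actual evaluation code; the function name and the one-caption-per-image pairing are assumptions), the metric can be sketched with NumPy:

```python
import numpy as np

def recall_at_k(image_embs, text_embs, k=5):
    """Fraction of text queries whose paired image appears in the top-k.

    Assumes row i of image_embs corresponds to row i of text_embs
    (one caption per image) and that embeddings are L2-normalized,
    so the dot product is cosine similarity.
    """
    sims = text_embs @ image_embs.T              # (n_text, n_image) similarities
    ranks = np.argsort(-sims, axis=1)            # best-matching image first
    targets = np.arange(len(sims))[:, None]      # correct image index per query
    hits = (ranks[:, :k] == targets).any(axis=1)
    return hits.mean()

# Toy check: identical embeddings retrieve themselves at rank 1.
embs = np.eye(4)
print(recall_at_k(embs, embs, k=1))              # → 1.0
```

The same function covers the text-to-image direction by swapping the two arguments.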
What I Explored
- Fine-tuned multiple model variants including ViT-B/32, ViT-B/16, and SigLIP2-based setups
- Compared InfoNCE-style contrastive training against BCE-based (SigLIP-style sigmoid) alternatives
- Evaluated caption augmentation strategies and their effect on retrieval behavior
- Analyzed architectural trade-offs and optimization behavior across experiments
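The two objectives compared above differ in how they treat in-batch negatives: InfoNCE normalizes each row of the similarity matrix with a softmax, while the sigmoid/BCE formulation scores every image-text pair as an independent binary classification. A minimal NumPy sketch of both (illustrative only; the temperature, scale, and bias defaults are assumptions, not the project's tuned values):

```python
import numpy as np

def _logsumexp(x, axis=1):
    """Numerically stable log-sum-exp along an axis."""
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def infonce_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE: softmax cross-entropy over in-batch negatives,
    averaged over the image→text and text→image directions."""
    logits = img @ txt.T / temperature             # (B, B) similarity matrix
    idx = np.arange(len(logits))
    loss_i2t = -(logits - _logsumexp(logits))[idx, idx].mean()
    loss_t2i = -(logits.T - _logsumexp(logits.T))[idx, idx].mean()
    return (loss_i2t + loss_t2i) / 2

def sigmoid_bce_loss(img, txt, scale=10.0, bias=-10.0):
    """SigLIP-style sigmoid loss: each (image, text) pair is an independent
    binary classification, with positives on the diagonal."""
    logits = img @ txt.T * scale + bias
    targets = np.eye(len(logits))                  # 1 on diagonal, 0 elsewhere
    z = np.where(targets == 1, logits, -logits)    # flip sign for negatives
    return np.logaddexp(0.0, -z).mean()            # stable log(1 + e^{-z})
```

A practical difference this sketch makes visible: the sigmoid loss needs no row-wise normalization, so it decouples the loss from the batch-wide softmax and scales differently with batch size.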
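One common caption-augmentation strategy for product datasets is template-based paraphrasing from item metadata. The sketch below is purely illustrative: the attribute fields and templates are hypothetical and do not reflect FashionGen's actual schema or the augmentation used in this project.

```python
import random

# Hypothetical attribute fields and templates for illustration only.
TEMPLATES = [
    "a photo of a {color} {category}",
    "{brand} {category} in {color}",
    "product shot of a {color} {category} by {brand}",
]

def augment_caption(attrs, rng=random):
    """Generate one alternative caption by filling a randomly chosen
    template with the item's attribute values."""
    template = rng.choice(TEMPLATES)
    return template.format(**attrs)

item = {"color": "navy", "category": "wool coat", "brand": "ExampleBrand"}
print(augment_caption(item, random.Random(0)))
```

Sampling a fresh template each epoch gives the text encoder varied phrasings of the same item, which is one way augmentation can change retrieval behavior.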
Connection to the Demo System
This research project forms the modeling foundation for my broader multimodal retrieval work. The production-oriented Fashion Search Demo extends these ideas into an end-to-end application with retrieval infrastructure, APIs, and an interactive frontend.