OpenCLIP Fine-Tuning for Multi-Modal Retrieval on FashionGen

Project Overview

This project studies how different OpenCLIP-style models behave when fine-tuned for image-text retrieval on FashionGen. It compares multiple architectures and training objectives, examining how these modeling choices affect retrieval quality, convergence behavior, and downstream usability.
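As a rough illustration of the kind of objective compared in this setting, here is a minimal NumPy sketch of the symmetric contrastive (InfoNCE) loss used in CLIP-style training. The function name, batch layout, and temperature value are illustrative assumptions, not taken from the project's actual training code:

```python
import numpy as np

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric CLIP-style InfoNCE loss (illustrative sketch).

    Matched image-text pairs share a row index and act as positives;
    every other pairing in the batch acts as a negative.
    """
    # Normalize so dot products become cosine similarities.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    logits = image_embs @ text_embs.T / temperature

    def cross_entropy_diag(l):
        # Cross-entropy where the correct class for row i is column i.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    # Average the image-to-text and text-to-image directions.
    return (cross_entropy_diag(logits) + cross_entropy_diag(logits.T)) / 2
```

In actual training this loss would be computed on GPU tensors with a learnable temperature; the sketch only shows the structure of the objective.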

The goal was not only to improve retrieval metrics, but also to better understand trade-offs between model size, objective design, caption quality, and training dynamics in a practical multimodal setting.
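Retrieval quality in this kind of study is commonly summarized with Recall@K: the fraction of queries whose ground-truth match appears among the top-K ranked results. A minimal sketch, assuming paired image and text embeddings where row i of each matrix corresponds to the same item (the function name and shapes are illustrative, not from the project):

```python
import numpy as np

def recall_at_k(image_embs, text_embs, k=5):
    """Text-to-image Recall@K for index-paired embeddings (illustrative sketch)."""
    # Normalize so dot products become cosine similarities.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    sims = text_embs @ image_embs.T  # (num_texts, num_images) similarity matrix

    # Rank images for each text query by descending similarity; a hit is
    # when the paired image (same row index) lands in the top-k results.
    ranks = np.argsort(-sims, axis=1)
    hits = (ranks[:, :k] == np.arange(len(text_embs))[:, None]).any(axis=1)
    return hits.mean()
```

The same function with the arguments swapped gives image-to-text recall; papers typically report both directions at K = 1, 5, and 10.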

What I Explored

Connection to the Demo System

This research project forms the modeling foundation for my broader multimodal retrieval work. The production-oriented Fashion Search Demo extends these ideas into an end-to-end application with retrieval infrastructure, APIs, and an interactive frontend.

Report Preview