All test sets contain no overlap with the CLIP pre‑training corpus.

5.2 Baselines

| Method | Training regime | Retrieval metric (R@1, avg) |
|--------|-----------------|------------------------------|
| CLIP (global) | Zero‑shot (no fine‑tune) | 24.3% |
| CLIP + linear probe (image+text) | Zero‑shot | 28.1% |
| ALIGN‑ZS (global) | Zero‑shot | 25.6% |
| ViLT‑ZS (global) | Zero‑shot | 22.9% |
| ZC‑SOFTAIM (ours) | Zero‑shot | 34.7% |
| ZC‑SOFTAIM + fine‑tune (10 k pairs) | Semi‑supervised | 41.2% |
```
┌─────────────────────┐
│ Pre‑trained Text    │
│ Backbone (e.g.,     │
│ CLIP‑Text)          │
└─────────┬───────────┘
          │ Text tokens (M) → BERT tokens (d) → T ∈ ℝ^{M×d}
  ┌───────▼───────┐
  │    SOFTAIM    │  (shared linear proj.)
  └───────┬───────┘
          │ Text token embeddings (Ť)
```
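The text branch above can be sketched in a few lines: token embeddings T ∈ ℝ^{M×d} from a frozen backbone pass through one shared linear projection. This is a minimal illustration only; the dimensions, the random weights, and the single-matrix form of the projection are assumptions, not the paper's actual SOFTAIM implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: M text tokens, d-dim backbone output, d_shared joint space
M, d, d_shared = 16, 512, 256

# Token embeddings from the frozen text backbone: T ∈ ℝ^{M×d}
T = rng.standard_normal((M, d))

# Shared linear projection (one weight matrix, reused across modalities;
# only the text branch is shown here)
W = rng.standard_normal((d, d_shared)) / np.sqrt(d)

# Projected token embeddings Ť ∈ ℝ^{M×d_shared}
T_proj = T @ W
print(T_proj.shape)  # (16, 256)
```

In a trained system W would be learned; the scaling by 1/√d merely keeps the random initialization's output variance near 1.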
R@K = Recall at K; higher is better.

| Domain | CLIP (global) | ZC‑SOFTAIM | Δ (absolute) |
|--------|---------------|------------|--------------|
| Medical | 19.8% | 30.5% | +10.7 |
| Satellite | 22.1% | 33.9% | +11.8 |
| Fine‑art | 25.7% | 38.2% | +12.5 |
| E‑Commerce | 27.3% | 41.0% | +13.7 |
| Scientific | 28.4% | 38.5% | +10.1 |
| Average | 24.7% | 34.7% | +10.0 |
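The R@K metric reported above can be computed from a query-by-item similarity matrix as follows. This is a toy sketch: the function name, the 4×4 similarity scores, and the convention that query i's ground-truth item sits at index i are illustrative assumptions, not data from the paper.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of queries whose ground-truth item (index i for
    query i) appears among the top-k retrieved items."""
    # Rank items for each query by descending similarity
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = (topk == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

# Toy similarity matrix: ground-truth matches on the diagonal
sim = np.array([
    [0.9, 0.1, 0.2, 0.0],
    [0.3, 0.7, 0.8, 0.1],  # query 1's true item is only ranked 2nd
    [0.1, 0.0, 0.7, 0.2],
    [0.2, 0.1, 0.0, 0.6],
])
print(recall_at_k(sim, 1))  # 0.75 — queries 0, 2, 3 hit at rank 1
print(recall_at_k(sim, 2))  # 1.0  — query 1 recovered within top-2
```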