From 7eac85b8e52646d14162054846acbf052b2961b7 Mon Sep 17 00:00:00 2001
From: Yanqing0327
Date: Mon, 25 Nov 2024 17:05:01 -0800
Subject: [PATCH] update

---
 README.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 28c1af7..98834e7 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@
 ## **Proposed Method**
 
 ### **CLIPS Pipeline**
-Method Pipeline
+![Method Pipeline](./docs/resources/method.jpg)
 
 Previous works show that noisy, web-crawled image-text pairs may limit vision-language pretraining like CLIP and propose learning with synthetic captions as a promising alternative. Our work continues this effort, introducing two simple yet effective designs to better leverage richly described synthetic captions:
@@ -35,28 +35,28 @@
 ## **Key Results**
 
 ### **Inverse Effect with Synthetic Captions**
-Inverse Effect Visualization
+![Inverse Effect Visualization](./docs/resources/mask_strategy.jpg)
 
 Visualization of four different token reduction strategies. These strategies can improve the model's learning efficiency on synthetic captions to varying degrees. Among these strategies, the sub-caption and block mask perform best.
 
 ---
 
 ### **Zero-Shot Cross-Modal Retrieval**
-Zero-Shot Retrieval Results
+![Zero-Shot Retrieval Results](./docs/resources/retrieval.png)
 
 Our method consistently achieves superior performance across all benchmarks and model sizes, yielding significant improvements over the baselines.
 
 ---
 
 ### **Comparison with State-of-the-Art Methods**
-SOTA Comparison
+![SOTA Comparison](./docs/resources/sota.png)
 
 With increased computational resources and scaling, our best model further achieves 76.4% and 96.6% R@1 text retrieval performance on MSCOCO and Flickr30K respectively, and 57.2% and 83.9% R@1 image retrieval performance on the same datasets, setting new state-of-the-art (SOTA) results.
 
 ---
 
 ### **CLIPS in LLaVA**
-LLaVA Results
+![LLaVA Results](./docs/resources/LLaVA.png)
 
 Replacing OpenAI-CLIP with **CLIPS** significantly boosts LLaVA's performance across various benchmarks.
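
For readers of the README section patched above, here is a minimal sketch of the two best-performing token reduction strategies it names, sub-caption and block mask. This is an illustration only, not the CLIPS implementation: the function names, the sentence-level splitting on ".", the whitespace tokenization, and the `keep_ratio` parameter are all assumptions made for the example.

```python
# Illustrative sketch only -- NOT the CLIPS implementation. It mimics the
# "sub-caption" and "block mask" token reduction strategies described above,
# assuming sentence splitting on "." and whitespace tokenization; the names
# and the keep_ratio parameter are hypothetical.
import random


def sub_caption(caption: str) -> str:
    """Keep a single randomly chosen sentence of a long synthetic caption."""
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    return random.choice(sentences) + "."


def block_mask(tokens: list[str], keep_ratio: float = 0.5) -> list[str]:
    """Keep one contiguous block covering roughly keep_ratio of the tokens."""
    if not tokens:
        return tokens
    keep = max(1, int(len(tokens) * keep_ratio))
    start = random.randint(0, len(tokens) - keep)
    return tokens[start:start + keep]


caption = (
    "A tabby cat sleeps on a sunlit windowsill. Two potted plants sit beside it. "
    "The curtains are partially drawn."
)
print(sub_caption(caption))                                   # one sentence kept
print(" ".join(block_mask(caption.split(), keep_ratio=0.5)))  # contiguous half kept
```

Either reduction shortens a richly described synthetic caption before it is fed to the text encoder, which is the effect the "inverse effect" figure in the patched README visualizes.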