update website
tsunghan-wu committed Oct 18, 2024
1 parent afeb3e0 commit a69aa41
Showing 1 changed file with 12 additions and 11 deletions.
index.html: 23 changes (12 additions & 11 deletions)
@@ -5,18 +5,18 @@
<meta charset="utf-8">
<!-- Meta tags for social media banners, these should be filled in appropriately as they are your "business card" -->
<!-- Replace the content tag with appropriate information -->
<meta name="description" content="SVisual Haystacks: Answering Harder Questions About Sets of Images">
<meta name="description" content="Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark">
<meta property="og:title" content="Visual Haystacks" />
<meta property="og:description" content="Recent advancements in Large Multimodal Models (LMMs) have made significant progress in the field of single-image visual question answering. However, these models face substantial challenges when tasked with queries that span extensive collections of images, similar to real-world scenarios like searching through large photo albums, finding specific information across the internet, or monitoring environmental changes through satellite imagery. This paper explores the task of Multi-Image Visual Question Answering (MIQA): given a large set of images and a natural language query, the task is to generate a relevant and grounded response. We propose a new public benchmark, dubbed Visual Haystacks (VHs), specifically designed to evaluate LMMs capabilities in visual retrieval and reasoning over sets of unrelated images, where we perform comprehensive evaluations demonstrating that even robust closed-source models struggle significantly. Towards addressing these shortcomings, we introduce MIRAGE (Multi-Image Retrieval Augmented Generation), a novel retrieval/QA framework tailored for LMMs that confronts the challenges of MIQA with marked efficiency and accuracy improvements over baseline methods. Our evaluation shows that MIRAGE surpasses closed-source GPT-4o models by up to 11% on the VHs benchmark and offers up to 3.4x improvements in efficiency over text-focused multi-stage approaches." />
<meta property="og:description" content="Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images. Recent advancements like long-context LMMs have allowed them to ingest larger, or even multiple, images. However, the ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering (MIQA), especially in real-world applications like photo album searches or satellite imagery analysis. In this work, we first assess the limitations of current benchmarks for long-context LMMs. We address these limitations by introducing a new vision-centric, long-context benchmark, Visual Haystacks (VHs). We comprehensively evaluate both open-source and proprietary models on VHs, and demonstrate that these models struggle when reasoning across potentially unrelated images, perform poorly on cross-image reasoning, as well as exhibit biases based on the placement of key information within the context window. Towards a solution, we introduce MIRAGE (Multi-Image Retrieval Augmented Generation), an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU -- far surpassing the 1k-image limit of contemporary models. MIRAGE demonstrates up to 13% performance improvement over existing open-source LMMs on VHs, sets a new state-of-the-art on the RetVQA multi-image QA benchmark, and achieves competitive performance on single-image QA with state-of-the-art LMMs." />
<meta property="og:url" content="http://visual-haystacks.github.io" />
<!-- Path to banner image, should be in the path listed below. Optimal dimensions are 1200X630-->
<meta property="og:image" content="/static/images/VHs_logo.png" />
<meta property="og:image:width" content="1200" />
<meta property="og:image:height" content="630" />


<meta name="twitter:title" content="Visual Haystacks: Answering Harder Questions About Sets of Images">
<meta name="twitter:description" content="Recent advancements in Large Multimodal Models (LMMs) have made significant progress in the field of single-image visual question answering. However, these models face substantial challenges when tasked with queries that span extensive collections of images, similar to real-world scenarios like searching through large photo albums, finding specific information across the internet, or monitoring environmental changes through satellite imagery. This paper explores the task of Multi-Image Visual Question Answering (MIQA): given a large set of images and a natural language query, the task is to generate a relevant and grounded response. We propose a new public benchmark, dubbed Visual Haystacks (VHs), specifically designed to evaluate LMMs capabilities in visual retrieval and reasoning over sets of unrelated images, where we perform comprehensive evaluations demonstrating that even robust closed-source models struggle significantly. Towards addressing these shortcomings, we introduce MIRAGE (Multi-Image Retrieval Augmented Generation), a novel retrieval/QA framework tailored for LMMs that confronts the challenges of MIQA with marked efficiency and accuracy improvements over baseline methods. Our evaluation shows that MIRAGE surpasses closed-source GPT-4o models by up to 11% on the VHs benchmark and offers up to 3.4x improvements in efficiency over text-focused multi-stage approaches." />
<meta name="twitter:title" content="Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark">
<meta name="twitter:description" content="Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images. Recent advancements like long-context LMMs have allowed them to ingest larger, or even multiple, images. However, the ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering (MIQA), especially in real-world applications like photo album searches or satellite imagery analysis. In this work, we first assess the limitations of current benchmarks for long-context LMMs. We address these limitations by introducing a new vision-centric, long-context benchmark, Visual Haystacks (VHs). We comprehensively evaluate both open-source and proprietary models on VHs, and demonstrate that these models struggle when reasoning across potentially unrelated images, perform poorly on cross-image reasoning, as well as exhibit biases based on the placement of key information within the context window. Towards a solution, we introduce MIRAGE (Multi-Image Retrieval Augmented Generation), an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU -- far surpassing the 1k-image limit of contemporary models. MIRAGE demonstrates up to 13% performance improvement over existing open-source LMMs on VHs, sets a new state-of-the-art on the RetVQA multi-image QA benchmark, and achieves competitive performance on single-image QA with state-of-the-art LMMs." />
<!-- Path to banner image, should be in the path listed below. Optimal dimensions are 1200X600-->
<meta name="twitter:image" content="static/images/VHs_logo.png">
<meta name="twitter:card" content="Visual Haystack Project Logo: A cartoon character photo sitting on top of a haystack of images.">
@@ -26,7 +26,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1">


<title>Visual Haystacks: Answering Harder Questions About Sets of Images</title>
<title>Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark</title>
<link rel="icon" type="image/x-icon" href="static/images/favicon_io/favicon.ico">
<link rel="apple-touch-icon" sizes="180x180" href="static/images/favicon_io">
<link rel="icon" type="image/png" sizes="32x32" href="static/images/favicon_io/favicon-32x32.png">
@@ -114,22 +114,23 @@ <h1 class="title is-1 publication-title">Visual Haystacks: A Vision-Centric Need
</a>
</span>


<span class="link-block">
<a href="https://huggingface.co/datasets/tsunghanwu/visual_haystacks" target="_blank"
<a href="https://github.com/visual-haystacks/vhs_benchmark" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon" style="vertical-align: middle; font-size: 20px;">🤗</span>
<span>VHs Dataset</span>
<span class="icon">
<i class="fa-brands fa-github"></i>
</span>
<span>VHs Dataset/Toolkit</span>
</a>
</span>

<span class="link-block">
<a href="https://github.com/visual-haystacks/vhs_benchmark" target="_blank"
<a href="https://github.com/visual-haystacks/mirage" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fa-brands fa-github"></i>
</span>
<span>VHs Benchmark Toolkits</span>
<span>MIRAGE's Code/Model</span>
</a>
</span>
</div>
