diff --git a/index.html b/index.html
index bcfb23c..50bb6d8 100644
--- a/index.html
+++ b/index.html
@@ -4,268 +4,282 @@
-      UGround Homepage
+      .sidebar:hover { box-shadow: 0 8px 16px rgba(0, 0, 0, 0.3); }
+      .sidebar h4 { font-size: 18px; text-align: center; margin-bottom: 15px; color: #333; }
+      .sidebar ul { list-style-type: none; padding: 0; }
+      .sidebar ul li { margin-bottom: 12px; }
+      .sidebar ul li a { text-decoration: none; color: #555; font-weight: bold; font-size: 14px; padding: 10px; display: block; transition: background-color 0.3s ease, color 0.3s ease; border-radius: 5px; }
+      .sidebar ul li a:hover { background-color: #ff5722; color: #ffffff; }
+      @media (max-width: 1374px) { .sidebar { display: none; } }
+      table { width: 90%; margin: 20px auto; border-collapse: collapse; text-align: center; /* Horizontally center content */ vertical-align: middle; /* Vertically center content */ font-size: 14px; }
+      th, td { padding: 12px; text-align: center; vertical-align: middle; border: 1px solid #ddd; }
+      th { background-color: #f2f2f2; text-align: center; vertical-align: middle; font-weight: bold; }
+      caption { caption-side: top; text-align: center; font-weight: bold; padding: 10px; font-size: 20px; }
+      /*tr:nth-child(even) {*/
+      /*  background-color: #f9f9f9;*/
+      /*  text-align: center; !* Horizontally center content *!*/
+      /*  vertical-align: middle; !* Vertically center content *!*/
+      /*}*/
+      .highlight { font-weight: bold; color: #ff5722; }
+      .hidden { display: none; }
+      UGround Homepage
+
+
+
+
+

+            Logo
+            UGround

+

+ Navigating the Digital World as Humans Do:
+ Universal Visual Grounding for GUI Agents +

+
Boyu Gou1,
+            Ruohan Wang1,
+            Boyuan Zheng1,
+            Yanan Xie2,
+            Cheng Chang2,
+            Yiheng Shu1,
+            Huan Sun1,
+            Yu Su1
+
+ +
+ +
+            1The Ohio State University      2Orby AI
+ + +
+
+      .center {
+        display: block;
+        margin-left: auto;
+        margin-right: auto;
+        width: 80%;
+      }
+
+
+
+

+          TLDR: 1) low-cost, scalable, and effective synthetic data for GUI visual
+          grounding; 2) UGround, a SOTA GUI visual grounding model; 3) SeeAct-V, a human-like, purely vision-only
+          (modular) GUI agent framework; and 4) the first demonstration of practical SOTA performance for vision-only GUI agents.
+
UGround is a universal visual grounding model that locates the target element of an
action by its pixel coordinates on GUIs. It is trained on 10M elements from 1.3M screenshots and
substantially outperforms previous SOTA GUI visual grounding models on ScreenSpot (web,
mobile, desktop).
+          Unlike prevalent approaches that rely on HTML or accessibility trees for observation or
+          grounding, we propose a generic framework, SeeAct-V, that perceives GUIs entirely
+          visually and performs pixel-level operations on screens. SeeAct-V agents with UGround
+          achieve SOTA performance on five benchmarks, spanning offline agent evaluation (web, mobile,
+          desktop) and online agent evaluation (web, mobile); a minimal grounding-usage sketch follows the teaser figure below:

MY ALT TEXT +

+
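For concreteness, here is a minimal sketch of how a released UGround-V1 checkpoint (Qwen2-VL-based) might be queried for grounding with Hugging Face transformers. The checkpoint id, prompt wording, and the "(x, y)" answer parsing below are illustrative assumptions, not the verified interface; consult the official repository for exact usage.

# Hypothetical grounding call: referring expression -> pixel coordinates.
import re
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "osunlp/UGround-V1-7B"  # assumed checkpoint id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

def ground(screenshot: Image.Image, expression: str) -> tuple[int, int]:
    """Map a natural-language description of a GUI element to (x, y) pixel coordinates."""
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": expression},
    ]}]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[screenshot], return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=32)
    answer = processor.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
    x, y = map(int, re.findall(r"\d+", answer)[:2])  # assumes an "(x, y)"-style answer
    return x, y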

+ + +

+ Updates +

+
    + +
  • 2025/01/03: Qwen2-VL-based UGround-V1 has been released (2B & 7B), with even stronger SOTA
    performance on GUI visual grounding.
  • 2024/10/07: The preprint is now available on arXiv.
  • 2024/08/06: The website is live. The initial manuscript and results are available.
+
+
+

+            Logo
+            SeeAct-V and UGround

+
+
+
+
+ +
+ + + + +

Overview

+

MY ALT TEXT

+
+

+          In this work, we advocate a human-like embodiment for GUI agents that perceive
+          the environment entirely visually and directly perform pixel-level operations on the GUI (the SeeAct-V
+          framework). The key is visual grounding models that can accurately map
+          diverse referring expressions of GUI elements to their coordinates on the GUI across different
+          platforms.
+          We show that a simple recipe, combining web-based synthetic
+          data with a slight adaptation of the LLaVA architecture, is surprisingly effective for
+          training such visual grounding models. We collect the largest dataset for GUI visual grounding
+          to date, containing 10M GUI elements (~95% from the web) and their referring expressions over 1.3M
+          screenshots, and use it to train UGround, a strong universal visual grounding
+          model for GUI agents. A hedged sketch of one SeeAct-V agent step follows below.

+ + +
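To make one SeeAct-V step concrete, here is a hedged sketch of the plan-ground-act loop under assumed interfaces: plan_next_action stands in for a planner call (e.g., GPT-4o) and ground for the UGround query sketched earlier; both names are hypothetical, and pyautogui executes the pixel-level actions.

# Hypothetical single step of a SeeAct-V-style agent: plan -> ground -> act.
import pyautogui  # performs pixel-level operations on the screen

def plan_next_action(task, screenshot, history):
    """Stub for a planner call (e.g., GPT-4o) that proposes the next action
    and a referring expression for its target element."""
    return {"op": "click", "description": "the search box at the top of the page"}

def agent_step(task: str, history: list[str]) -> None:
    screenshot = pyautogui.screenshot()  # raw pixels only; no HTML or accessibility tree
    action = plan_next_action(task, screenshot, history)
    if action["op"] == "click":
        x, y = ground(screenshot, action["description"])  # UGround: description -> (x, y), as sketched above
        pyautogui.click(x, y)
    elif action["op"] == "type":
        pyautogui.typewrite(action["text"])
    history.append(str(action))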
+ +

Live Demo

+ + + + + +

Offline Experiments

+
+ +

The high-quality grounding data synthesized from the web (9M elements from
Web-Hybrid) effectively helps UGround generalize to desktop and mobile UIs,
enabling it to outperform the previous SOTA model SeeClick on every platform and element type on
ScreenSpot. An illustrative training record is sketched below, before the results figure.
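As an illustration, one web-synthesized training record might pair a screenshot with a referring expression and the element's coordinates; the field names below are assumptions for illustration, not the released data format.

# Hypothetical schema of one Web-Hybrid-style training example (field names assumed).
example = {
    "screenshot": "screenshots/page_000042.png",  # rendered web page
    "referring_expression": "the orange 'Sign up' button in the top-right corner",
    "coordinates": (1187, 64),  # ground-truth click point in pixels
}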

MY ALT TEXT

ScreenSpot (Standard Setting)

Grounding Model            | Mobile Text | Mobile Icon/Widget | Desktop Text | Desktop Icon/Widget | Web Text | Web Icon/Widget | Average
GPT-4                      | 22.6 | 24.5 | 20.2 | 11.8 |  9.2 |  8.8 | 16.2
GPT-4o                     | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 |  7.8 | 18.3
MiniGPT-v2                 |  8.4 |  6.6 |  6.2 |  2.9 |  6.5 |  3.4 |  5.7
Groma                      | 10.3 |  2.6 |  4.6 |  4.3 |  5.7 |  3.4 |  5.2
Fuyu                       | 41.0 |  1.3 | 33.0 |  3.6 | 33.9 |  4.4 | 19.5
Qwen-VL                    |  9.5 |  4.8 |  5.7 |  5.0 |  3.5 |  2.4 |  5.2
SeeClick                   | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4
Qwen-GUI                   | 52.4 | 10.9 | 45.9 |  5.7 | 43.0 | 13.6 | 28.6
UGround-V1                 | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3
Qwen2-VL                   | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1
Aguvis-G-7B                | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0
Aguvis-7B                  | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 83.0
OS-Atlas-Base-4B           | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0
OS-Atlas-Base-7B           | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 81.0
ShowUI-G                   | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0
ShowUI                     | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1
Iris                       | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6
Aria-UI                    | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1
UGround-V1-2B (Qwen2-VL)   | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7
UGround-V1-7B (Qwen2-VL)   | 93.0 | 79.9 | 93.8 | 76.4 | 90.9 | 84.0 | 86.3
ScreenSpot (Agent Setting)

Planner | Grounding                | Mobile Text | Mobile Icon/Widget | Desktop Text | Desktop Icon/Widget | Web Text | Web Icon/Widget | Average
GPT-4   | SeeClick                 | 76.6 | 55.5 | 68.0 | 28.6 | 40.9 | 23.3 | 48.8
GPT-4   | UGround-V1               | 90.1 | 70.3 | 87.1 | 55.7 | 85.7 | 64.6 | 75.6
GPT-4o  | Qwen-VL                  | 21.3 | 21.4 | 18.6 | 10.7 |  9.1 |  5.8 | 14.5
GPT-4o  | SeeClick                 | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.4
GPT-4o  | Qwen-GUI                 | 67.8 | 24.5 | 53.1 | 16.4 | 50.4 | 18.5 | 38.5
GPT-4o  | UGround-V1               | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | 81.4
GPT-4o  | OS-Atlas-Base-4B         | 94.1 | 73.8 | 77.8 | 47.1 | 86.5 | 65.3 | 74.1
GPT-4o  | OS-Atlas-Base-7B         | 93.8 | 79.9 | 90.2 | 66.4 | 92.6 | 79.1 | 83.7
GPT-4o  | UGround-V1-2B (Qwen2-VL) | 94.1 | 77.7 | 92.8 | 63.6 | 90.0 | 70.9 | 81.5
GPT-4o  | UGround-V1-7B (Qwen2-VL) | 94.1 | 79.9 | 93.3 | 73.6 | 89.6 | 73.3 | 84.0
Multimodal-Mind2Web (Element Accuracy)

Input            | Planner | Grounding  | Cross-Task | Cross-Website | Cross-Domain | Average
Image + Text     | GPT-4   | Choice     | 46.4 | 38.0 | 42.4 | 42.3
Image + Text     | GPT-4   | SoM        | 29.6 | 20.1 | 27.0 | 25.6
Image (SeeAct-V) | GPT-4   | SeeClick   | 29.7 | 28.5 | 30.7 | 29.6
Image (SeeAct-V) | GPT-4   | UGround-V1 | 45.1 | 44.7 | 44.6 | 44.8
Image (SeeAct-V) | GPT-4o  | SeeClick   | 32.1 | 33.1 | 33.5 | 32.9
Image (SeeAct-V) | GPT-4o  | UGround-V1 | 47.7 | 46.0 | 46.6 | 46.8
AndroidControl (Step Accuracy)

Input            | Planner | Grounding  | High | Low
Text             | GPT-4   | Choice     | 42.1 | 55.0
Image (SeeAct-V) | GPT-4   | SeeClick   | 39.4 | 47.2
Image (SeeAct-V) | GPT-4   | UGround-V1 | 46.2 | 58.0
Image (SeeAct-V) | GPT-4o  | SeeClick   | 41.8 | 52.8
Image (SeeAct-V) | GPT-4o  | UGround-V1 | 48.4 | 62.4
OmniACT

Inputs           | Planner | Grounding  | Action Score
Text             | GPT-4   | DetACT     | 11.6
Image + Text     | GPT-4   | DetACT     | 17.0
Image (SeeAct-V) | GPT-4   | SeeClick   | 28.9
Image (SeeAct-V) | GPT-4   | UGround-V1 | 31.1
Image (SeeAct-V) | GPT-4o  | SeeClick   | 29.6
Image (SeeAct-V) | GPT-4o  | UGround-V1 | 32.8
+ + +

+ +

Online Experiments

+
Mind2Web-Live

Inputs           | Planner | Grounding  | Completion Rate | Task Success Rate
Text             | GPT-4   | Choice     | 44.3 | 21.1
Text             | GPT-4o  | Choice     | 47.6 | 22.1
Image (SeeAct-V) | GPT-4   | UGround-V1 | 50.7 | 23.1
Image (SeeAct-V) | GPT-4o  | UGround-V1 | 50.8 | 19.2
AndroidWorld

Input            | Planner | Grounding  | Task Success Rate
Text             | GPT-4   | Choice     | 30.6
Image + Text     | GPT-4   | SoM        | 25.4
Image (SeeAct-V) | GPT-4   | UGround-V1 | 31.0
Image (SeeAct-V) | GPT-4o  | UGround-V1 | 32.8
+ + +
+

BibTeX

+

+      @article{gou2024uground,
+        title   = {Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
+        author  = {Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
+        journal = {arXiv preprint arXiv:2410.05243},
+        year    = {2024},
+        url     = {https://arxiv.org/abs/2410.05243},
+      }
+
+      @inproceedings{zheng2024seeact,
+        title     = {{GPT}-4{V}(ision) is a Generalist Web Agent, if Grounded},
+        author    = {Zheng, Boyuan and Gou, Boyu and Kil, Jihyung and Sun, Huan and Su, Yu},
+        booktitle = {Proceedings of the 41st International Conference on Machine Learning},
+        pages     = {61349--61385},
+        year      = {2024},
+      }
 
+
@@ -1071,1370 +1119,1349 @@

BibTeX