An AI Agent for Computer Use is an autonomous program that can reason about tasks, plan sequences of actions, and act on a computer or mobile device through clicks, keystrokes and other GUI events, command-line operations, and internal or external API calls. These agents combine perception, decision-making, and control capabilities to interact with digital interfaces and accomplish user-specified goals independently.
A curated list of resources about AI agents for Computer Use, including research papers, projects, frameworks, and tools.
- Anthropic | Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku
- Bill Gates | AI is about to completely change how you use computers
- Ethan Mollick | When you give a Claude a mouse
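In practice, nearly every agent catalogued below implements some variant of the same perceive-plan-act loop. Here is a minimal sketch of that loop, assuming a screenshot-based setup; `query_model` and `execute_action` are hypothetical stand-ins for a real VLM call and a real input-control backend:

```python
# Minimal sketch of the perceive-plan-act loop most computer-use agents share.
# `query_model` and `execute_action` are hypothetical stand-ins.
import pyautogui  # one possible perception/control backend

def query_model(screenshot, goal, history):
    """Hypothetical: send the screenshot + goal to a VLM, get back an action."""
    raise NotImplementedError

def execute_action(action):
    """Hypothetical: dispatch the model's action as clicks/keystrokes."""
    raise NotImplementedError

def run_agent(goal: str, max_steps: int = 20) -> None:
    history = []
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()              # perceive the screen
        action = query_model(screenshot, goal, history)  # plan the next step
        if action == "DONE":                             # model signals completion
            return
        execute_action(action)                           # act on the device
        history.append(action)
```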
## Papers

### Surveys
- GUI Agents: A Survey (Dec. 2024)
  - General survey of GUI agents
- Large Language Model-Brained GUI Agents: A Survey (Nov. 2024)
  - Focus on LLM-based approaches
  - Website
- GUI Agents with Foundation Models: A Comprehensive Survey (Nov. 2024)
  - Comprehensive overview of foundation-model-based GUI agents
### Frameworks & Models
- Large Action Models: From Inception to Implementation (Dec. 2024)
  - Comprehensive framework for developing LAMs that can perform real-world actions beyond language generation
  - Details key stages including data collection, model training, environment integration, grounding, and evaluation
- Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation (Dec. 2024)
  - Reward-guided navigation approach
- SpiritSight Agent: Advanced GUI Agent with One Look (Dec. 2024)
  - Single-shot GUI interaction approach
- AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs (Dec. 2024)
  - Automatic GUI functionality annotation
- Simulate Before Act: Model-Based Planning for Web Agents (Dec. 2024)
  - Model-based planning using LLM world models
- Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery for Foundation Model Internet Agents (Dec. 2024)
  - Autonomous skill discovery framework for web agents
  - Code
- Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents (Dec. 2024)
  - Framework for contextualizing web pages to improve LLM agent decision making
- Digi-Q: Transforming VLMs to Device-Control Agents via Value-Based Offline RL (Dec. 2024)
  - Value-based offline RL for training VLM device-control agents
- Magentic-One (Nov. 2024)
  - Multi-agent system with orchestrator-led coordination
  - Strong performance on GAIA, WebArena, and AssistantBench
- Agent Workflow Memory (Sep. 2024)
  - Workflow memory framework for agents
  - Code
- The Impact of Element Ordering on LM Agent Performance (Sep. 2024)
  - Study of how element ordering affects agent performance
  - Code
- Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents (Aug. 2024)
  - Reasoning and learning framework for autonomous agents
  - Website
- OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models (Aug. 2024)
  - Open platform for web-based agent deployment
  - Code
- Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems (Jul. 2024)
  - Hierarchical architecture with flexible DOM distillation
  - Denoising method for web navigation
- Apple Intelligence Foundation Language Models (Jul. 2024)
  - Vision-language model with Private Cloud Compute
  - Foundation model architecture
- Tree Search for Language Model Agents (Jul. 2024)
  - Multi-step reasoning and planning with best-first tree search for LLM-based agents
- DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning (Jun. 2024)
  - Autonomous reinforcement learning for device control
  - Code
- Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration (Jun. 2024)
  - Multi-agent collaboration for mobile device operation
  - Code
- Octopus Series: On-device Language Models for Computer Control (Apr. 2024)
- AutoWebGLM: Bootstrap and Reinforce a Large Language Model-Based Web Navigating Agent (Apr. 2024)
  - Real-world web navigation with a bilingual benchmark
  - Code
- Cradle: Empowering Foundation Agents towards General Computer Control (Mar. 2024)
  - General computer control, using Red Dead Redemption II as a case study
  - Code
- Android in the Zoo: Chain-of-Action-Thought for GUI Agents (Mar. 2024)
  - Chain-of-Action-Thought framework for Android interaction
  - Code
- ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model (Feb. 2024)
  - Vision-language model for computer control
  - Code
- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement (Feb. 2024)
  - Vision-language model for PC interaction
  - Code
- UFO: A UI-Focused Agent for Windows OS Interaction (Feb. 2024)
  - Specialized for Windows OS interaction
  - Code
- CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation (Feb. 2024)
  - Comprehensive environment perception (CEP) for exhaustive GUI perception
  - Introduces conditional action prediction (CAP) for reliable action response
- Intention-in-Interaction (IN3): Tell Me More! (Feb. 2024)
  - Benchmark for evaluating user intention understanding in agent designs
  - Introduces model experts for robust user-agent interaction
- Dual-View Visual Contextualization for Web Navigation (Feb. 2024)
  - Automatic web navigation from language instructions
  - Key ideas: HTML elements plus visual contextualization
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding (Feb. 2024)
  - Specialized for mobile UI and infographics understanding
  - Visual interface comprehension
- GPT-4V(ision) is a Generalist Web Agent, if Grounded (Jan. 2024)
  - Demonstrates GPT-4V's capabilities for web interaction
  - Code
- Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception (Jan. 2024)
  - Visual perception for mobile device interaction
  - Code
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models (Jan. 2024)
  - End-to-end approach to web interaction
  - Code
- CogAgent: A Visual Language Model for GUI Agents (Dec. 2023)
  - Works across PC and Android platforms
  - Code
- AppAgent: Multimodal Agents as Smartphone Users (Dec. 2023)
  - Focused on smartphone interaction
  - Code
- LASER: LLM Agent with State-Space Exploration for Web Navigation (Sep. 2023)
  - State-space exploration for web navigation
  - Code
- AndroidEnv: A Reinforcement Learning Platform for Android (May 2021)
  - Reinforcement learning platform for Android interaction
  - Code
### UI Grounding
- UI-Pro: A Hidden Recipe for Building Vision-Language Models for GUI Grounding (Dec. 2024)
  - Framework for building VLMs with strong UI-element grounding capabilities
- Grounding Multimodal Large Language Model in GUI World (Dec. 2024)
  - GUI grounding framework with an automated data-collection engine and a lightweight grounding module
- Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms (Oct. 2024)
  - Multimodal LLM for universal UI understanding across diverse platforms
  - Introduces adaptive gridding for high-resolution perception
  - Preprint
- Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (Oct. 2024)
  - Universal visual approach to GUI interaction
  - Code
- OS-ATLAS: Foundation Action Model for Generalist GUI Agents (Oct. 2024)
  - Comprehensive action modeling
  - Code
- OmniParser for Pure Vision Based GUI Agent (Aug. 2024)
  - Vision-based screen-parsing method for UI screenshots
  - Combines fine-tuned interactable-icon detection and functional-description models
  - Code
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Apr. 2024)
  - Mobile UI understanding
  - Code
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents (Jan. 2024)
  - Advanced visual grounding techniques
  - Code
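A mechanic common to the grounding papers above: the model localizes a UI element, and the agent turns that localization into a pointer action. A minimal, self-contained sketch; the normalized-box convention is an assumption, since individual models emit different coordinate formats:

```python
# Hypothetical helper: many grounding models return normalized bounding
# boxes, while a click action needs absolute pixel coordinates.
def bbox_to_click_point(
    bbox: tuple[float, float, float, float],  # (x0, y0, x1, y1) in [0, 1]
    screen_w: int,
    screen_h: int,
) -> tuple[int, int]:
    """Map a normalized bounding box to the pixel at its center."""
    x0, y0, x1, y1 = bbox
    cx = (x0 + x1) / 2 * screen_w
    cy = (y0 + y1) / 2 * screen_h
    return round(cx), round(cy)

# e.g. a model locates a "Submit" button at (0.42, 0.80, 0.58, 0.86)
# on a 1920x1080 screen -> click at (960, 896)
print(bbox_to_click_point((0.42, 0.80, 0.58, 0.86), 1920, 1080))
```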
### Datasets
- OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis (Dec. 2024)
  - Interaction-driven approach to automated GUI trajectory synthesis
  - Introduces reverse task synthesis and a trajectory reward model
  - Code
- AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials (Dec. 2024)
  - Web-tutorial-based trajectory synthesis
- Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale (Sep. 2024)
  - Scalable demonstration generation
- CToolEval: A Chinese Benchmark for LLM-Powered Agent Evaluation (Aug. 2024)
  - Chinese benchmark for agent evaluation
  - Code
- AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? (Jul. 2024)
  - Benchmark for realistic, time-consuming web tasks
  - Code
- ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights (Jun. 2024)
  - Continual learning from trajectories
- Multi-Turn Mind2Web: On the Multi-turn Instruction Following (Feb. 2024)
  - Multi-turn instruction dataset for web agents
  - Code
- Android in the Wild: A Large-Scale Dataset for Android Device Control (Jul. 2023)
  - Large-scale dataset for Android interaction
  - Real-world device-control scenarios
- Mind2Web: Towards a Generalist Agent for the Web (Jun. 2023)
  - Large-scale web interaction dataset
  - Code
- WebShop: Towards Scalable Real-World Web Interaction (Jul. 2022)
  - Dataset for grounded language agents in web interaction
  - Code
- Rico: A Mobile App Dataset for Building Data-Driven Design Applications (Oct. 2017)
  - Mobile app UI dataset
  - Design-focused data collection
### Benchmarks
- A3: Android Agent Arena for Mobile GUI Agents (Jan. 2025)
- Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale (Sep. 2024)
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? (Jul. 2024)
  - Evaluation in data science and engineering workflows
  - Code
- MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents (Jun. 2024)
  - Mobile agent evaluation
  - Code
- AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents (May 2024)
  - Android-focused evaluation
  - Code
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (Apr. 2024)
  - Comprehensive evaluation framework for real computer environments
  - Code
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks (Jan. 2024)
  - Web-focused evaluation
  - Code
- Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction (May 2023)
  - Mobile-focused evaluation framework
  - Code
### Safety
- Attacking Vision-Language Computer Agents via Pop-ups (Nov. 2024)
  - Security analysis of computer agents
  - Code
- EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage (Sep. 2024)
  - Privacy and security analysis
- GuardAgent: Safeguard LLM Agent by a Guard Agent via Knowledge-Enabled Reasoning (Jun. 2024)
  - Safety mechanisms for agents
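The pattern underlying several of these defenses (agent actions pass a policy check before execution) is easy to illustrate. The sketch below is illustrative only, not the method of any paper above, and the action schema is hypothetical:

```python
# Illustrative only: a crude pre-execution guard in the spirit of the
# safety work above. It rejects agent actions whose target URL falls
# outside an allowlist, a cheap mitigation against pop-up and
# environmental-injection attacks. The action dict schema is hypothetical.
from urllib.parse import urlparse

ALLOWED_HOSTS = {"example.com", "docs.example.com"}  # task-specific policy

def is_action_allowed(action: dict) -> bool:
    """Reject navigation/click actions that leave the allowlisted hosts."""
    url = action.get("target_url")
    if url is None:  # non-navigation actions pass this particular check
        return True
    host = urlparse(url).hostname or ""
    return host in ALLOWED_HOSTS

# e.g. an injected pop-up tries to send the agent to an attacker's domain:
assert not is_action_allowed({"type": "click", "target_url": "https://evil.test/login"})
```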
## Projects

### Frameworks & Models
- Framework for building AI agent systems; simplifies the creation of event-driven, distributed, scalable, and resilient agentic applications
- Autonomous GPT-4 agent focused on task automation
- Makes websites accessible to AI agents with vision + HTML extraction; supports multi-tab management and custom actions with LangChain integration
- macOS implementation with Claude integration
- Game automation, a specialized use case
- Ready-to-use implementation with a comprehensive toolset
- Advanced computer control
- Computer control agent focused on task automation
- AI web agent framework with a modular architecture
- macOS-specific tools with Anthropic integration
- Browser automation with GPT-4 Vision integration
- AI-first process automation with multimodal model integration
- Open-source UI interaction framework with cross-platform support
- General-purpose computer control framework with a Python-based, extensible architecture
- Open Source Computer Use by E2B
  - Open-source implementation of computer control capabilities
  - Secure sandboxed environment for AI agents
- Computer control framework with vision-based automation
- AI web agent framework for automating browser-based workflows with LLMs using vision and HTML extraction
- Device operation toolkit and extensible agent framework
- Web page annotation tool with vision-language model support
### UI Grounding
- A small vision-language model for computer and phone automation, based on Florence-2; with only 270M parameters it outperforms much larger models at GUI text and element localization
- A general screen-parsing tool that interprets and converts UI screenshots into a structured format to improve existing LLM-based UI agents
### Environment & Sandbox
- Windows inside a Docker container
- Secure desktop environment and agent-testing platform
- Docker container for running virtual machines using QEMU
### Automation
- Native UI automation implemented in JavaScript/TypeScript
- Cross-platform GUI automation via a Python-based control library (see the example after this list)
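For comparison with the agent-driven approaches above, classic scripted automation looks like this. A minimal sketch assuming the last entry refers to a PyAutoGUI-style Python library; the calls shown are standard PyAutoGUI APIs:

```python
# Minimal sketch of script-level GUI automation with PyAutoGUI.
import pyautogui

pyautogui.FAILSAFE = True               # abort by slamming the mouse into a corner

screen_w, screen_h = pyautogui.size()   # current screen resolution
pyautogui.moveTo(screen_w // 2, screen_h // 2, duration=0.5)  # glide to center
pyautogui.click()                       # left-click at the cursor
pyautogui.write("hello, agent", interval=0.05)  # type with 50 ms between keys
pyautogui.hotkey("ctrl", "s")           # keyboard shortcut (Cmd on macOS differs)
pyautogui.screenshot("after_save.png")  # capture the result for inspection
```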
## Commercial Frameworks & Models
- Commercial computer-control capability integrated with Claude 3.5 models (see the API sketch after this list)
- AI agents that can fully complete tasks in any web environment
- Advanced AI agent for real-world applications; scores 67% on WebVoyager
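A sketch of requesting a computer-use action from Claude, based on the beta interface Anthropic documented at launch; the model name, tool version, and beta flag are assumptions that may have changed since:

```python
# Sketch of Anthropic's computer-use beta (verify strings against current docs).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    betas=["computer-use-2024-10-22"],
    tools=[{
        "type": "computer_20241022",   # built-in screenshot/mouse/keyboard tool
        "name": "computer",
        "display_width_px": 1280,
        "display_height_px": 800,
    }],
    messages=[{"role": "user", "content": "Open the browser and search for weather."}],
)
# The reply contains tool_use blocks (e.g. a click at x, y) that the caller
# must execute against a real or sandboxed display, then return the resulting
# screenshot in a tool_result message to continue the loop.
print(response.content)
```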
## Contributing
We welcome and encourage contributions from the community! Here's how you can help:
- Add new resources: Found a relevant paper, project, or tool? Submit a PR to add it
- Fix errors: Help us correct any mistakes in existing entries
- Improve organization: Suggest better ways to structure the information
- Update content: Keep entries up to date with the latest developments
To contribute:
- Fork the repository
- Create a new branch for your changes
- Submit a pull request with a clear description of your additions/changes
- Post in the X Community to let everyone know about the new resource
For an example of how to format your contribution, please refer to this PR.
Thank you for helping spread knowledge about AI agents for computer use!