An AI Agent for Computer Use is an autonomous program that can reason about tasks, plan sequences of actions, and act on a computer or mobile device through clicks, keystrokes and other GUI events, command-line operations, and internal or external API calls. These agents combine perception, decision-making, and control capabilities to interact with digital interfaces and accomplish user-specified goals independently.
A curated list of resources about AI agents for Computer Use, including research papers, projects, frameworks, and tools.
- Anthropic | Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku
- Bill Gates | AI is about to completely change how you use computers
- Ethan Mollick | When you give a Claude a mouse
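In practice, nearly every agent catalogued below implements some variant of the same perceive-plan-act loop. Here is a minimal sketch of that loop, assuming a screenshot-based setup; `query_model` and `execute_action` are hypothetical stand-ins for a real VLM call and a real input-control backend:

```python
# Minimal sketch of the perceive-plan-act loop most computer-use agents share.
# `query_model` and `execute_action` are hypothetical stand-ins.
import pyautogui  # one possible perception/control backend

def query_model(screenshot, goal, history):
    """Hypothetical: send the screenshot + goal to a VLM, get back an action."""
    raise NotImplementedError

def execute_action(action):
    """Hypothetical: dispatch the model's action as clicks/keystrokes."""
    raise NotImplementedError

def run_agent(goal: str, max_steps: int = 20) -> None:
    history = []
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()              # perceive the screen
        action = query_model(screenshot, goal, history)  # plan the next step
        if action == "DONE":                             # model signals completion
            return
        execute_action(action)                           # act on the device
        history.append(action)
```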
## Papers

### Surveys
- GUI Agents: A Survey (Dec. 2024)
  - General survey of GUI agents
- Large Language Model-Brained GUI Agents: A Survey (Nov. 2024)
  - Focus on LLM-based approaches
  - Website
- GUI Agents with Foundation Models: A Comprehensive Survey (Nov. 2024)
  - Comprehensive overview of foundation-model-based GUI agents
### Frameworks & Models
- Large Action Models: From Inception to Implementation (Dec. 2024)
  - Comprehensive framework for developing LAMs that can perform real-world actions beyond language generation
  - Details key stages including data collection, model training, environment integration, grounding, and evaluation
- Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation (Dec. 2024)
  - Reward-guided navigation approach
- SpiritSight Agent: Advanced GUI Agent with One Look (Dec. 2024)
  - Single-shot GUI interaction approach
- AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs (Dec. 2024)
  - Automatic GUI functionality annotation
- Simulate Before Act: Model-Based Planning for Web Agents (Dec. 2024)
  - Model-based planning using LLM world models
- Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery for Foundation Model Internet Agents (Dec. 2024)
  - Autonomous skill discovery framework for web agents
  - Code
- Learning to Contextualize Web Pages for Enhanced Decision Making by LLM Agents (Dec. 2024)
  - Framework for contextualizing web pages to improve LLM agent decision making
- Digi-Q: Transforming VLMs to Device-Control Agents via Value-Based Offline RL (Dec. 2024)
  - Value-based offline RL for training VLM device-control agents
- Magentic-One (Nov. 2024)
  - Multi-agent system with orchestrator-led coordination
  - Strong performance on GAIA, WebArena, and AssistantBench
- Agent Workflow Memory (Sep. 2024)
  - Workflow memory framework for agents
  - Code
- The Impact of Element Ordering on LM Agent Performance (Sep. 2024)
  - Study of how element ordering affects agent performance
  - Code
- Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents (Aug. 2024)
  - Reasoning and learning framework for autonomous agents
  - Website
- OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models (Aug. 2024)
  - Open platform for web-based agent deployment
  - Code
- Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems (Jul. 2024)
  - Hierarchical architecture with flexible DOM distillation
  - Denoising method for web navigation
- Apple Intelligence Foundation Language Models (Jul. 2024)
  - Vision-language model with Private Cloud Compute
  - Foundation model architecture
- Tree Search for Language Model Agents (Jul. 2024)
  - Multi-step reasoning and planning with best-first tree search for LLM-based agents
- DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning (Jun. 2024)
  - Autonomous reinforcement learning for device control
  - Code
- Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration (Jun. 2024)
  - Multi-agent collaboration for mobile device operation
  - Code
- Octopus Series: On-device Language Models for Computer Control (Apr. 2024)
- AutoWebGLM: Bootstrap and Reinforce a Large Language Model-Based Web Navigating Agent (Apr. 2024)
  - Real-world web navigation with a bilingual benchmark
  - Code
- Cradle: Empowering Foundation Agents towards General Computer Control (Mar. 2024)
  - General computer control, using Red Dead Redemption II as a case study
  - Code
- Android in the Zoo: Chain-of-Action-Thought for GUI Agents (Mar. 2024)
  - Chain-of-Action-Thought framework for Android interaction
  - Code
- ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model (Feb. 2024)
  - Vision-language model for computer control
  - Code
- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement (Feb. 2024)
  - Vision-language model for PC interaction
  - Code
- UFO: A UI-Focused Agent for Windows OS Interaction (Feb. 2024)
  - Specialized for Windows OS interaction
  - Code
- CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation (Feb. 2024)
  - Comprehensive environment perception (CEP) for exhaustive GUI perception
  - Introduces conditional action prediction (CAP) for reliable action response
- Intention-in-Interaction (IN3): Tell Me More! (Feb. 2024)
  - Benchmark for evaluating user intention understanding in agent designs
  - Introduces model experts for robust user-agent interaction
- Dual-View Visual Contextualization for Web Navigation (Feb. 2024)
  - Automatic web navigation from language instructions
  - Key ideas: HTML elements plus visual contextualization
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding (Feb. 2024)
  - Specialized for mobile UI and infographics understanding
  - Visual interface comprehension
- GPT-4V(ision) is a Generalist Web Agent, if Grounded (Jan. 2024)
  - Demonstrates GPT-4V's capabilities for web interaction
  - Code
- Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception (Jan. 2024)
  - Visual perception for mobile device interaction
  - Code
- WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models (Jan. 2024)
  - End-to-end approach to web interaction
  - Code
- CogAgent: A Visual Language Model for GUI Agents (Dec. 2023)
  - Works across PC and Android platforms
  - Code
- AppAgent: Multimodal Agents as Smartphone Users (Dec. 2023)
  - Focused on smartphone interaction
  - Code
- LASER: LLM Agent with State-Space Exploration for Web Navigation (Sep. 2023)
  - State-space exploration for web navigation
  - Code
- AndroidEnv: A Reinforcement Learning Platform for Android (May 2021)
  - Reinforcement learning platform for Android interaction
  - Code
### UI Grounding
- UI-Pro: A Hidden Recipe for Building Vision-Language Models for GUI Grounding (Dec. 2024)
  - Framework for building VLMs with strong UI-element grounding capabilities
- Grounding Multimodal Large Language Model in GUI World (Dec. 2024)
  - GUI grounding framework with an automated data-collection engine and a lightweight grounding module
- Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms (Oct. 2024)
  - Multimodal LLM for universal UI understanding across diverse platforms
  - Introduces adaptive gridding for high-resolution perception
  - Preprint
- Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (Oct. 2024)
  - Universal visual approach to GUI interaction
  - Code
- OS-ATLAS: Foundation Action Model for Generalist GUI Agents (Oct. 2024)
  - Comprehensive action modeling
  - Code
- OmniParser for Pure Vision Based GUI Agent (Aug. 2024)
  - Vision-based screen-parsing method for UI screenshots
  - Combines fine-tuned interactable-icon detection and functional-description models
  - Code
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Apr. 2024)
  - Mobile UI understanding
  - Code
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents (Jan. 2024)
  - Advanced visual grounding techniques
  - Code
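A mechanic common to the grounding papers above: the model localizes a UI element, and the agent turns that localization into a pointer action. A minimal, self-contained sketch; the normalized-box convention is an assumption, since individual models emit different coordinate formats:

```python
# Hypothetical helper: many grounding models return normalized bounding
# boxes, while a click action needs absolute pixel coordinates.
def bbox_to_click_point(
    bbox: tuple[float, float, float, float],  # (x0, y0, x1, y1) in [0, 1]
    screen_w: int,
    screen_h: int,
) -> tuple[int, int]:
    """Map a normalized bounding box to the pixel at its center."""
    x0, y0, x1, y1 = bbox
    cx = (x0 + x1) / 2 * screen_w
    cy = (y0 + y1) / 2 * screen_h
    return round(cx), round(cy)

# e.g. a model locates a "Submit" button at (0.42, 0.80, 0.58, 0.86)
# on a 1920x1080 screen -> click at (960, 896)
print(bbox_to_click_point((0.42, 0.80, 0.58, 0.86), 1920, 1080))
```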
### Datasets
- OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis (Dec. 2024)
  - Interaction-driven approach to automated GUI trajectory synthesis
  - Introduces reverse task synthesis and a trajectory reward model
  - Code
- AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials (Dec. 2024)
  - Web-tutorial-based trajectory synthesis
- Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale (Sep. 2024)
  - Scalable demonstration generation
- CToolEval: A Chinese Benchmark for LLM-Powered Agent Evaluation (Aug. 2024)
  - Chinese benchmark for agent evaluation
  - Code
- AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? (Jul. 2024)
  - Benchmark for realistic, time-consuming web tasks
  - Code
- ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights (Jun. 2024)
  - Continual learning from trajectories
- Multi-Turn Mind2Web: On the Multi-turn Instruction Following (Feb. 2024)
  - Multi-turn instruction dataset for web agents
  - Code
- Android in the Wild: A Large-Scale Dataset for Android Device Control (Jul. 2023)
  - Large-scale dataset for Android interaction
  - Real-world device-control scenarios
- Mind2Web: Towards a Generalist Agent for the Web (Jun. 2023)
  - Large-scale web interaction dataset
  - Code
- WebShop: Towards Scalable Real-World Web Interaction (Jul. 2022)
  - Dataset for grounded language agents in web interaction
  - Code
- Rico: A Mobile App Dataset for Building Data-Driven Design Applications (Oct. 2017)
  - Mobile app UI dataset
  - Design-focused data collection
### Benchmarks
- A3: Android Agent Arena for Mobile GUI Agents (Jan. 2025)
- Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale (Sep. 2024)
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? (Jul. 2024)
  - Evaluation in data science and engineering workflows
  - Code
- MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents (Jun. 2024)
  - Mobile agent evaluation
  - Code
- AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents (May 2024)
  - Android-focused evaluation
  - Code
- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (Apr. 2024)
  - Comprehensive evaluation framework for real computer environments
  - Code
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks (Jan. 2024)
  - Web-focused evaluation
  - Code
- Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction (May 2023)
  - Mobile-focused evaluation framework
  - Code
### Safety
- Attacking Vision-Language Computer Agents via Pop-ups (Nov. 2024)
  - Security analysis of computer agents
  - Code
- EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage (Sep. 2024)
  - Privacy and security analysis
- GuardAgent: Safeguard LLM Agent by a Guard Agent via Knowledge-Enabled Reasoning (Jun. 2024)
  - Safety mechanisms for agents
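The pattern underlying several of these defenses (agent actions pass a policy check before execution) is easy to illustrate. The sketch below is illustrative only, not the method of any paper above, and the action schema is hypothetical:

```python
# Illustrative only: a crude pre-execution guard in the spirit of the
# safety work above. It rejects agent actions whose target URL falls
# outside an allowlist, a cheap mitigation against pop-up and
# environmental-injection attacks. The action dict schema is hypothetical.
from urllib.parse import urlparse

ALLOWED_HOSTS = {"example.com", "docs.example.com"}  # task-specific policy

def is_action_allowed(action: dict) -> bool:
    """Reject navigation/click actions that leave the allowlisted hosts."""
    url = action.get("target_url")
    if url is None:  # non-navigation actions pass this particular check
        return True
    host = urlparse(url).hostname or ""
    return host in ALLOWED_HOSTS

# e.g. an injected pop-up tries to send the agent to an attacker's domain:
assert not is_action_allowed({"type": "click", "target_url": "https://evil.test/login"})
```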
## Projects

### Frameworks & Models
- Framework for building AI agent systems; simplifies the creation of event-driven, distributed, scalable, and resilient agentic applications
- Autonomous GPT-4 agent focused on task automation
- Makes websites accessible to AI agents with vision + HTML extraction; supports multi-tab management and custom actions with LangChain integration
- macOS implementation with Claude integration
- Game automation, a specialized use case
- Ready-to-use implementation with a comprehensive toolset
- Advanced computer control
- Computer control agent focused on task automation
- AI web agent framework with a modular architecture
- macOS-specific tools with Anthropic integration
- Browser automation with GPT-4 Vision integration
- AI-first process automation with multimodal model integration
- Open-source UI interaction framework with cross-platform support
- General-purpose computer control framework with a Python-based, extensible architecture
- Open Source Computer Use by E2B
  - Open-source implementation of computer control capabilities
  - Secure sandboxed environment for AI agents
- Computer control framework with vision-based automation
- AI web agent framework for automating browser-based workflows with LLMs using vision and HTML extraction
- Device operation toolkit and extensible agent framework
- Web page annotation tool with vision-language model support
### UI Grounding
- A small vision-language model for computer and phone automation, based on Florence-2; with only 270M parameters it outperforms much larger models at GUI text and element localization
- A general screen-parsing tool that interprets and converts UI screenshots into a structured format to improve existing LLM-based UI agents
### Environment & Sandbox
- Windows inside a Docker container
- Secure desktop environment and agent-testing platform
- Docker container for running virtual machines using QEMU
### Automation
- Native UI automation implemented in JavaScript/TypeScript
- Cross-platform GUI automation via a Python-based control library (see the example after this list)
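For comparison with the agent-driven approaches above, classic scripted automation looks like this. A minimal sketch assuming the last entry refers to a PyAutoGUI-style Python library; the calls shown are standard PyAutoGUI APIs:

```python
# Minimal sketch of script-level GUI automation with PyAutoGUI.
import pyautogui

pyautogui.FAILSAFE = True               # abort by slamming the mouse into a corner

screen_w, screen_h = pyautogui.size()   # current screen resolution
pyautogui.moveTo(screen_w // 2, screen_h // 2, duration=0.5)  # glide to center
pyautogui.click()                       # left-click at the cursor
pyautogui.write("hello, agent", interval=0.05)  # type with 50 ms between keys
pyautogui.hotkey("ctrl", "s")           # keyboard shortcut (Cmd on macOS differs)
pyautogui.screenshot("after_save.png")  # capture the result for inspection
```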
## Commercial Frameworks & Models
- Commercial computer-control capability integrated with Claude 3.5 models (see the API sketch after this list)
- AI agents that can fully complete tasks in any web environment
- Advanced AI agent for real-world applications; scores 67% on WebVoyager
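A sketch of requesting a computer-use action from Claude, based on the beta interface Anthropic documented at launch; the model name, tool version, and beta flag are assumptions that may have changed since:

```python
# Sketch of Anthropic's computer-use beta (verify strings against current docs).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    betas=["computer-use-2024-10-22"],
    tools=[{
        "type": "computer_20241022",   # built-in screenshot/mouse/keyboard tool
        "name": "computer",
        "display_width_px": 1280,
        "display_height_px": 800,
    }],
    messages=[{"role": "user", "content": "Open the browser and search for weather."}],
)
# The reply contains tool_use blocks (e.g. a click at x, y) that the caller
# must execute against a real or sandboxed display, then return the resulting
# screenshot in a tool_result message to continue the loop.
print(response.content)
```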
## Contributing
We welcome and encourage contributions from the community! Here's how you can help:
- Add new resources: Found a relevant paper, project, or tool? Submit a PR to add it
- Fix errors: Help us correct any mistakes in existing entries
- Improve organization: Suggest better ways to structure the information
- Update content: Keep entries up to date with the latest developments
To contribute:
- Fork the repository
- Create a new branch for your changes
- Submit a pull request with a clear description of your additions/changes
- Post in the X Community to let everyone know about the new resource
For an example of how to format your contribution, please refer to this PR.
Thank you for helping spread knowledge about AI agents for computer use!