chore: added docs for vision #227

Merged 6 commits on Nov 14, 2024
5 changes: 5 additions & 0 deletions .changeset/tender-elephants-whisper.md
@@ -0,0 +1,5 @@
---
"appwright": patch
---

chore: added docs for vision
56 changes: 56 additions & 0 deletions docs/vision.md
@@ -0,0 +1,56 @@
# Vision methods

Appwright provides built-in methods to tap on the screen or extract information from it. These methods use LLM capabilities to perform the action.

## Extract information from the screen

The `query` method allows you to extract information from the screen based on a prompt. Ensure the `OPENAI_API_KEY` environment variable is set to authenticate the API request.

```ts
const text = await device.beta.query("Extract the contact details present in the footer");
```

By default, the `query` method returns a string. You can also specify a Zod schema to get the response in a specific format.

```ts
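import { z } from "zod";

// With a Zod schema as `responseFormat`, the response follows that schema instead of being a plain string.
const isLoginButtonVisible = await device.beta.query(
  "Is the login button visible on the screen?",
  {
    responseFormat: z.boolean(),
  },
);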
```

### Using a custom screenshot

By default, the `query` method retrieves information from the current screen. Alternatively, you can pass a screenshot to run the query against that image instead.

```ts
// `base64ImageString` is a placeholder for your own base64-encoded image string.
const text = await device.beta.query(
  "Extract contact details from the footer of this screenshot.",
  {
    screenshot: base64ImageString,
  },
);
```
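
The `screenshot` option above expects a base64 image string. As a minimal sketch of producing one from raw image bytes in Node.js: the helper name `encodeScreenshot` is hypothetical and not part of Appwright, and it emits plain base64 with no `data:` prefix, which is an assumption since the expected form is not stated here.

```ts
// Hypothetical helper (not part of Appwright): encode raw image bytes as a
// plain base64 string using Node's built-in Buffer.
function encodeScreenshot(imageBytes: Uint8Array): string {
  return Buffer.from(imageBytes).toString("base64");
}
```

In a test, the bytes could come from a file read with Node's `fs.readFileSync`, which returns a `Buffer` that can be passed to this helper directly.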

### Using a different model

By default, the `query` method uses the `gpt-4o-mini` model. You can also specify a different model.

```ts
const text = await device.beta.query(
  "Extract contact details present in the footer",
  {
    model: "gpt-4o",
  },
);
```

## Tap on the screen

The `tap` method allows you to tap on the screen based on a prompt. Ensure the `EMPIRICAL_API_KEY` environment variable is set to authenticate the API request.

```ts
await device.beta.tap("point at the 'Login' button.");
```
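
Since `query` and `tap` depend on different environment variables (`OPENAI_API_KEY` and `EMPIRICAL_API_KEY` respectively), a test suite may want to fail fast when one is missing. The following is a sketch only: `missingVisionEnv` and `REQUIRED_ENV` are hypothetical names, not Appwright APIs, and the environment is passed in as a plain object so the helper stays easy to test.

```ts
// Environment variables named in this document, keyed by the method that needs them.
const REQUIRED_ENV = {
  query: "OPENAI_API_KEY",
  tap: "EMPIRICAL_API_KEY",
} as const;

type VisionMethod = keyof typeof REQUIRED_ENV;

// Hypothetical helper: returns the names of required variables that are unset or empty.
function missingVisionEnv(
  env: Record<string, string | undefined>,
  methods: VisionMethod[],
): string[] {
  return methods
    .map((method) => REQUIRED_ENV[method])
    .filter((name) => !env[name]);
}
```

One possible use is to call `missingVisionEnv(process.env, ["query", "tap"])` in a setup step and throw an error listing the returned names if the array is not empty.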
2 changes: 1 addition & 1 deletion src/vision/index.ts
@@ -38,7 +38,7 @@ export interface AppwrightVision {

/**
* Performs a tap action on the screen based on the provided prompt.
- * Ensure the `VISION_MODEL_ENDPOINT` environment variable is set to authenticate the API request.
+ * Ensure the `EMPIRICAL_API_KEY` environment variable is set to authenticate the API request.
*
* **Usage:**
* ```js