feat: initial boilerplate for contacts scraper #51

Open · wants to merge 3 commits into base: master
18 changes: 18 additions & 0 deletions actors/contact-gpt-scraper/.actor/actor.json
@@ -0,0 +1,18 @@
{
"actorSpecification": 1,
"name": "gpt-scraper",
"title": "GPT Scraper",
"description": "Crawler uses OpenAI API",
"version": "0.0",
"meta": {
"templateId": "ts-crawlee-playwright-chrome"
},
"input": "./input_schema.json",
"readme": "./README.md",
"dockerfile": "../../../shared/Dockerfile",
"changelog":"../../../shared/CHANGELOG.md",
"storages": {
"dataset": "../../../shared/dataset_schema.json"
},
"dockerContextDir": "../../.."
}
192 changes: 192 additions & 0 deletions actors/contact-gpt-scraper/.actor/input_schema.json
@@ -0,0 +1,192 @@
{
"title": "Extended GPT Scraper",
"type": "object",
"description": "The crawler scrapes pages and runs GPT model instructions for each page.",
"schemaVersion": 1,
"properties": {
"startUrls": {
"title": "Start URLs",
"type": "array",
"description": "A static list of URLs to scrape. <br><br>For details, see <a href='https://apify.com/drobnikj/extended-gpt-scraper#start-urls' target='_blank' rel='noopener'>Start URLs</a> in README.",
"prefill": [
{ "url": "https://news.ycombinator.com/" }
],
"editor": "requestListSources"
},
"includeUrlGlobs": {
"title": "Include URLs (globs)",
"type": "array",
"description": "Glob patterns matching URLs of pages that will be included in crawling. Combine them with the link selector to tell the scraper where to find links. You need to use both globs and link selector to crawl further pages.",
"editor": "globs",
"default": [],
"prefill": []
},
"excludeUrlGlobs": {
"title": "Exclude URLs (globs)",
"type": "array",
"description": "Glob patterns matching URLs of pages that will be excluded from crawling. Note that this affects only links found on pages, but not Start URLs, which are always crawled.",
"editor": "globs",
"default": [],
"prefill": []
},
"linkSelector": {
"title": "Link selector",
"type": "string",
"description": "This is a CSS selector that says which links on the page (<code>&lt;a&gt;</code> elements with <code>href</code> attribute) should be followed and added to the request queue. To filter the links added to the queue, use the <b>Pseudo-URLs</b> setting.<br><br>If <b>Link selector</b> is empty, the page links are ignored.<br><br>For details, see <a href='https://apify.com/drobnikj/extended-gpt-scraper#link-selector' target='_blank' rel='noopener'>Link selector</a> in README.",
"editor": "textfield",
"prefill": "a[href]"
},
"initialCookies": {
"title": "Initial cookies",
"type": "array",
"description": "Cookies that will be pre-set to all pages the scraper opens. This is useful for pages that require login. The value is expected to be a JSON array of objects with `name`, `value`, 'domain' and 'path' properties. For example: `[{\"name\": \"cookieName\", \"value\": \"cookieValue\"}, \"domain\": \".domain.com\", \"path\": \"/\"}]`.\n\nYou can use the [EditThisCookie](https://chrome.google.com/webstore/detail/editthiscookie/fngmhnnpilhplaeedifhccceomclgfbg) browser extension to copy browser cookies in this format, and paste it here.",
"default": [],
"prefill": [],
"editor": "json"
},
"openaiApiKey": {
"title": "OpenAI API key",
"type": "string",
"description": "The API key for accessing OpenAI. You can get it from <a href='https://platform.openai.com/account/api-keys' target='_blank' rel='noopener'>OpenAI platform</a>.",
"editor": "textfield",
"isSecret": true
},
"instructions": {
"title": "Instructions for GPT",
"type": "string",
"description": "Instruct GPT how to generate text. For example: \"Summarize this page in three sentences.\"<br><br>You can instruct OpenAI to answer with \"skip this page\", which will skip the page. For example: \"Summarize this page in three sentences. If the page is about Apify Proxy, answer with 'skip this page'.\".",
"prefill": "Gets the post with the most points from the page and returns it as JSON in this format: \npostTitle\npostUrl\npointsCount",
"editor": "textarea"
},
"model": {
"title": "GPT model",
"type": "string",
"description": "Select a GPT model. See <a href='https://platform.openai.com/docs/models/overview' target='_blank' rel='noopener'>models overview</a>. Keep in mind that each model has different pricing and features.",
"editor": "select",
"default": "gpt-3.5-turbo",
"prefill": "gpt-3.5-turbo",
"enum": ["gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-4", "gpt-4-32k", "text-davinci-003", "gpt-4-turbo"],
"enumTitles": ["GPT-3.5 Turbo", "GPT-3.5 Turbo 16k", "GPT-4", "GPT-4 32k", "GTP-3 (davinci)", "GPT-4 Turbo (Preview)"]
},
"targetSelector": {
"title": "Content selector",
"type": "string",
"description": "A CSS selector of the HTML element on the page that will be used in the instruction. Instead of a whole page, you can use only part of the page. For example: \"div#content\".",
"editor": "textfield",
"prefill": ""
},
"removeElementsCssSelector": {
"title": "Remove HTML elements (CSS selector)",
"type": "string",
"description": "A CSS selector matching HTML elements that will be removed from the DOM, before sending it to GPT processing. This is useful to skip irrelevant page content and save on GPT input tokens. \n\nBy default, the Actor removes usually unwanted elements like scripts, styles and inline images. You can disable the removal by setting this value to some non-existent CSS selector like `dummy_keep_everything`.",
"editor": "textarea",
"default": "script, style, noscript, path, svg, xlink",
"prefill": "script, style, noscript, path, svg, xlink"
},
"maxCrawlingDepth": {
"title": "Max crawling depth",
"type": "integer",
"description": "This specifies how many links away from the <b>Start URLs</b> the scraper will descend. This value is a safeguard against infinite crawling depths for misconfigured scrapers.<br><br>If set to <code>0</code>, there is no limit.",
"minimum": 0,
"default": 99999999
},
"maxPagesPerCrawl": {
"title": "Max pages per run",
"type": "integer",
"description": "Maximum number of pages that the scraper will open. 0 means unlimited.",
"minimum": 0,
"default": 10,
"unit": "pages"
},
"skipGptGlobs": {
"title": "Skip GPT processing for Globs",
"type": "array",
"description": "This setting allows you to specify certain page URLs to skip GPT instructions for. Pages matching these glob patterns will only be crawled for links, excluding them from GPT processing. Useful for intermediary pages used for navigation or undesired content.",
"editor": "globs",
"default": [],
"prefill": []
},
"useStructureOutput": {
"sectionCaption": "Formatted output",
"sectionDescription": "By default, the scraper outputs text answers for each page. If you want to get data in a structured format, you can define a JSON schema. The scraper uses [function](https://platform.openai.com/docs/api-reference/chat/create#chat/create-functions), which is called for each page. The function receives the page content and returns the answer in the defined JSON format.",
"title": "Use JSON schema to format answer",
"type": "boolean",
"description": "If true, the answer will be transformed into a structured format based on the schema in the `jsonAnswer` attribute.",
"editor": "checkbox"
},
"schema": {
"title": "Schema",
"type": "object",
"description": "Defines how the output will be stored in structured format using the [JSON Schema[JSON Schema](https://json-schema.org/understanding-json-schema/). Keep in mind that it uses [function](https://platform.openai.com/docs/api-reference/chat/create#chat/create-functions), so by setting the description of the fields and the correct title, you can get better results.",
"prefill": {
"type": "object",
"properties": {
"title": {
"type": "string",
"description": "Page title"
},
"description": {
"type": "string",
"description": "Page description"
}
},
"required": ["title", "description"]
},
"editor": "json"
},
"temperature": {
"sectionCaption": "GPT settings",
"title": "Temperature",
"type": "string",
"description": "Controls randomness: Lowering results in less random completions. As the temperature approaches zero, the model will become deterministic and repetitive. For consistent results, we recommend setting the temperature to 0.",
"editor": "textfield",
"default": "0"
},
"topP": {
"title": "TopP",
"type": "string",
"description": "Controls diversity via nucleus sampling: 0.5 means half of all likelihood-weighted options are considered.",
"editor": "textfield",
"default": "1"
},
"frequencyPenalty": {
"title": "Frequency penalty",
"type": "string",
"description": "How much to penalize new tokens based on their existing frequency in the text so far. Decreases the model's likelihood to repeat the same line verbatim.",
"editor": "textfield",
"default": "0"
},
"presencePenalty": {
"title": "Presence penalty",
"type": "string",
"description": "How much to penalize new tokens based on whether they appear in the text so far. Increases the model's likelihood to talk about new topics.",
"editor": "textfield",
"default": "0"
},
"proxyConfiguration": {
"sectionCaption": "Advanced configuration",
"title": "Proxy configuration",
"type": "object",
"description": "This specifies the proxy servers that will be used by the scraper in order to hide its origin.<br><br>For details, see <a href='https://apify.com/drobnikj/extended-gpt-scraper#proxy-configuration' target='_blank' rel='noopener'>Proxy configuration</a> in README.",
"prefill": { "useApifyProxy": true },
"default": { "useApifyProxy": false },
"editor": "proxy"
},
"pageFormatInRequest": {
"title": "Page format in request",
"type": "string",
"description": "In what format to send the content extracted from the page to the GPT. Markdown will take less space allowing for larger requests, while HTML may help include some information like attributes that may otherwise be omitted.",
"enum": ["HTML", "Markdown"],
"enumTitles": ["HTML", "Markdown"],
"default": "Markdown"
},
"saveSnapshots": {
"title": "Save debug snapshots",
"type": "boolean",
"description": "For each page store its HTML, screenshot and parsed content (markdown/HTML as it was sent to ChatGPT) adding links to these into the output",
"editor": "checkbox",
"default": true
}
},
"required": ["startUrls", "instructions", "openaiApiKey", "model"]
}
13 changes: 13 additions & 0 deletions actors/contact-gpt-scraper/.dockerignore
@@ -0,0 +1,13 @@
# configurations
.idea

# crawlee and apify storage folders
apify_storage
crawlee_storage
storage

# installed files
node_modules

# git folder
.git
23 changes: 23 additions & 0 deletions actors/contact-gpt-scraper/.eslintrc
@@ -0,0 +1,23 @@
{
"root": true,
"env": {
"browser": true,
"es2020": true,
"node": true
},
"extends": [
"@apify/eslint-config-ts"
],
"parserOptions": {
"project": "./tsconfig.json",
"ecmaVersion": 2020
},
"ignorePatterns": [
"node_modules",
"dist",
"**/*.d.ts"
],
"rules": {
"@typescript-eslint/ban-ts-comment": "warn"
}
}
8 changes: 8 additions & 0 deletions actors/contact-gpt-scraper/.gitignore
@@ -0,0 +1,8 @@
# This file tells Git which files shouldn't be added to source control

.DS_Store
.idea
dist
node_modules
apify_storage
storage
108 changes: 108 additions & 0 deletions actors/contact-gpt-scraper/README.md
@@ -0,0 +1,108 @@
# Extended GPT Scraper

Extended GPT Scraper is a powerful tool that leverages OpenAI's API to modify text obtained from a scraper.
You can use the scraper to extract content from a website and then pass that content to the OpenAI API to make the GPT magic happen.

## How does Extended GPT Scraper work?

The scraper first loads the page using [Playwright](https://playwright.dev/), then
it converts the page content into Markdown and runs the GPT instructions against that Markdown content.

If the content doesn't fit into the model's context limit, the scraper truncates it. You can find a message about the truncated content in the log.

## How much does it cost?

There are two costs associated with using GPT Scraper.

### Cost of the OpenAI API

You can find the cost of the OpenAI API on the [OpenAI pricing page](https://openai.com/pricing/).
The cost depends on the model you are using and the length of the content you are sending to the API for scraping.

### Cost of the scraping itself

The cost of the scraper is the same as the cost of [Web Scraper](https://apify.com/apify/web-scraper), because it uses the same browser under the hood.
You can find information about the cost on [the pricing page](https://apify.com/pricing) under the Detailed Pricing breakdown section.
The cost estimates are based on averages and may vary depending on the complexity of the pages you scrape.

## How to use Extended GPT Scraper

To get started with Extended GPT Scraper, you need to set the pages you want to scrape using [**Start URLs**](#start-urls), provide instructions telling the scraper how to handle each page, and supply your OpenAI API key.
NOTE: You can find the OpenAI API key in your [OpenAI dashboard](https://beta.openai.com/account/api-keys).

You can configure both the scraper and GPT using the input configuration to set up a more complex workflow.

## Input configuration

Extended GPT Scraper accepts a number of configuration settings.
These can be entered either manually in the user interface in [Apify Console](https://console.apify.com)
or programmatically in a JSON object using the [Apify API](https://apify.com/docs/api/v2#/reference/actors/run-collection/run-actor).
For a complete list of input fields and their types, please see the Actor's [input schema](https://apify.com/apify/playwright-scraper/input-schema).
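
For illustration, here is a minimal input object containing only the required fields (`startUrls`, `instructions`, `openaiApiKey`, and `model`); the API key is a placeholder you must replace with your own:

```json
{
    "startUrls": [{ "url": "https://news.ycombinator.com/" }],
    "instructions": "Summarize this page in three sentences.",
    "openaiApiKey": "<YOUR_OPENAI_API_KEY>",
    "model": "gpt-3.5-turbo"
}
```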

### Start URLs

The **Start URLs** (`startUrls`) field represents the initial list of page URLs that the scraper will visit. You can enter the URLs one by one or upload a whole group of them using a file.

The scraper supports adding new URLs to scrape on the fly, either using the **[Link selector](#link-selector)** or **[Glob patterns](#glob-patterns)** options.

### Link selector

The **Link selector** (`linkSelector`) field contains a CSS selector that is used to find links to other web pages, i.e. `<a>` elements with the `href` attribute (for example, the prefilled selector `a[href]`).

On every page that is loaded, the scraper looks for all links matching **Link selector**, and checks that the target URL matches one of the [**Glob patterns**](#glob-patterns). If it is a match, it then adds the URL to the request queue so that it's loaded by the scraper later on.

If **Link selector** is empty, the page links are ignored, and the scraper only loads pages specified in **[Start URLs](#start-urls)**.
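
For example, the following input fragment (the selector and URLs are illustrative) follows only links inside a pagination element and, via **[Glob patterns](#glob-patterns)**, enqueues only URLs under `/pages/`:

```json
{
    "linkSelector": "div.pagination a[href]",
    "includeUrlGlobs": [{ "glob": "http://www.example.com/pages/**" }]
}
```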

### Glob patterns

The **Include URLs (globs)** (`includeUrlGlobs`) field specifies which URLs found by **[Link selector](#link-selector)** should be added to the request queue, while **Exclude URLs (globs)** (`excludeUrlGlobs`) removes matching URLs from crawling.

A glob pattern is simply a string with wildcard characters.

For example, a glob pattern `http://www.example.com/pages/**/*` will match all the
following URLs:

- `http://www.example.com/pages/deeper-level/page`
- `http://www.example.com/pages/my-awesome-page`
- `http://www.example.com/pages/something`
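
As a sketch (reusing the illustrative example.com URLs above), you can combine include and exclude patterns, for instance to skip a hypothetical private section of the site:

```json
{
    "includeUrlGlobs": [{ "glob": "http://www.example.com/pages/**" }],
    "excludeUrlGlobs": [{ "glob": "http://www.example.com/pages/private/**" }]
}
```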

### OpenAI API key

The API key for accessing OpenAI. You can get it from <a href='https://platform.openai.com/account/api-keys' target='_blank' rel='noopener'>OpenAI platform</a>.

### Instructions and prompts for GPT

This option tells GPT how to handle page content. For example, you can send the following prompts.

- "Summarize this page in three sentences."
- "Find sentences that contain 'Apify Proxy' and return them as a list."

You can also instruct OpenAI to answer with "skip this page" if you don't want to process all the scraped content, e.g.

- "Summarize this page in three sentences. If the page is about proxies, answer with 'skip this page'.".

### GPT Model

The **GPT Model** (`model`) option specifies which GPT model to use.
You can find more information about the models on the [OpenAI API documentation](https://platform.openai.com/docs/models/overview).
Keep in mind that each model has different pricing and features.

### Max crawling depth

This specifies how many links away from `Start URLs` the scraper will descend.
This value is a safeguard against infinite crawling depths for misconfigured scrapers.

### Max pages per run

The maximum number of pages that the scraper will open. 0 means unlimited.

### Formatted output

If you want to get data in a structured format, you can define a [JSON Schema](https://json-schema.org/understanding-json-schema/) using the **Schema** (`schema`) input option and enable the **Use JSON schema to format answer** (`useStructureOutput`) option.
This schema will be used to format the answer into a structured JSON object, which will be stored in the output in the `jsonAnswer` attribute.
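
For example, the following input (matching the schema prefilled in the input form) extracts a page title and description into `jsonAnswer`:

```json
{
    "useStructureOutput": true,
    "schema": {
        "type": "object",
        "properties": {
            "title": { "type": "string", "description": "Page title" },
            "description": { "type": "string", "description": "Page description" }
        },
        "required": ["title", "description"]
    }
}
```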

### Proxy configuration

The **Proxy configuration** (`proxyConfiguration`) option enables you to set proxies.
The scraper will use them to prevent its detection by target websites.
You can use both [Apify Proxy](https://apify.com/proxy) and custom HTTP or SOCKS5 proxy servers.
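
For example, to route requests through Apify Proxy (the prefilled default), set:

```json
{
    "proxyConfiguration": { "useApifyProxy": true }
}
```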