WebParsy

Franklin's fork:

Added sos-california.yml to experiment with declarative parsing.
Added Dockerfile and docker-compose.yml

Run with:

DEBUG=webparsy:* node cli.js examples/sos-california.yml

OR

docker-compose up --build

WebParsy is a NodeJS library and CLI that lets you scrape websites using Puppeteer (or plain HTTP requests) and YAML definitions.

version: 1
jobs:
  main:
    steps:
      - goto: https://github.com/marketplace?category=code-quality
      - pdf:
          path: Github_Tools.pdf
          format: A4
      - many: 
          as: github_tools
          selector: main .col-lg-9.mt-1.mb-4.float-lg-right a.col-md-6.mb-4.d-flex.no-underline
          element:
            - property:
                selector: a
                type: string
                property: href
                as: url
                transform: absoluteUrl
            - text:
                selector: h3.h4
                type: string
                transform: trim
                as: name
            - text:
                selector: p
                type: string
                transform: trim
                as: description

Returns an array of GitHub's tools and creates a PDF. Example output:

{
  "github_tools": [
    {
      "url": "https://github.com/marketplace/codelingo",
      "name": "codelingo",
      "description": "Your Code, Your Rules - Automate code reviews with your own best practices"
    },
    {
      "url": "https://github.com/marketplace/codebeat",
      "name": "codebeat",
      "description": "Code review expert on demand. Automated for mobile and web"
    },
    ...
  ]
}
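
Because the result is plain JSON, it can be post-processed with ordinary JavaScript. The sketch below works on a hard-coded copy of the first two entries above; the `result` constant and the `listTools` helper are illustrative names, not part of WebParsy.

```javascript
// Hard-coded copy of the example output above, trimmed to two entries.
const result = {
  github_tools: [
    {
      url: 'https://github.com/marketplace/codelingo',
      name: 'codelingo',
      description: 'Your Code, Your Rules - Automate code reviews with your own best practices'
    },
    {
      url: 'https://github.com/marketplace/codebeat',
      name: 'codebeat',
      description: 'Code review expert on demand. Automated for mobile and web'
    }
  ]
}

// Hypothetical helper: renders each tool as a "name: url" line.
function listTools (output) {
  return output.github_tools.map(tool => `${tool.name}: ${tool.url}`)
}

console.log(listTools(result).join('\n'))
```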

Don't panic. There are examples for all WebParsy features in the examples folder. These are as basic as possible to help you get started.

Overview
Browser config
Output
Transform
Types
Steps
- goto Navigate to a URL
- goBack Navigate to the previous page in history
- screenshot Takes a screenshot of the page
- pdf Takes a pdf of the page
- text Gets the text for a given CSS selector
- many Returns an array of elements given their CSS selectors
- title Gets the title for the current page.
- form Fill and submit forms
- html Return HTML code for the page or a DOM element
- click Click on an element
- url Return the current URL
- waitFor Wait for selectors or some time before continuing

Overview

You can use WebParsy either as a CLI from your terminal or as a NodeJS library.

Cli

Install webparsy:

$ npm i webparsy -g

$ webparsy example/_weather.yml --customFlag "custom flag value"
Result:

{
  "title": "Madrid, España Pronóstico del tiempo y condiciones meteorológicas - The Weather Channel | Weather.com",
  "city": "Madrid, España",
  "temp": 18
}

Library

const webparsy = require('webparsy')
// Call this from inside an async function, since it uses await
const parsingResult = await webparsy.init({
  file: 'jobdefinition.yml',
  flags: { ... } // optional
})

Methods:

init(options)

options:

One of yaml, file or string is required.

yaml: A yaml npm module instance of the scraping definition.
string: The YAML definition, as a plain string.
file: The path for the YAML file containing the scraping definition.

Additionally, you can pass a flags object property to input additional values to your scraping process.
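
As a minimal sketch of the string option, the definition can be built inline and passed together with flags. The step list reuses the goto (with flag) and title steps described later in this document; the `scrapeTitle` function and the `websiteUrl` flag name are illustrative, and the snippet assumes the webparsy package is installed locally.

```javascript
// YAML definition kept as a plain string (the `string` option of init).
const definition = [
  'version: 1',
  'jobs:',
  '  main:',
  '    steps:',
  '      - goto:',
  '          flag: websiteUrl',
  '      - title'
].join('\n')

// Illustrative wrapper: webparsy is required lazily, so the package is
// only needed when a scrape actually runs.
async function scrapeTitle (websiteUrl) {
  const webparsy = require('webparsy')
  return webparsy.init({ string: definition, flags: { websiteUrl } })
}

// Usage (needs network access and the webparsy package):
// scrapeTitle('https://example.com').then(console.log)
```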

Browser config

You can set up Chrome's details in the browser property within the main job.

jobs:
  main:
    browser:
      width: 1200
      height: 800
      scaleFactor: 1
      timeout: 60
      delay: 0
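
For readers who know Puppeteer, width, height and scaleFactor resemble the fields of a Puppeteer viewport (width, height, deviceScaleFactor). The sketch below shows one way such a block could be mapped. This is an assumption about intent, not WebParsy's actual code, and `toViewport` is a hypothetical helper. timeout and delay are left out because their units are not stated here.

```javascript
// The browser block above, as a plain object.
const browserConfig = { width: 1200, height: 800, scaleFactor: 1, timeout: 60, delay: 0 }

// Hypothetical mapping onto the shape of a Puppeteer viewport object.
function toViewport (config) {
  return {
    width: config.width,
    height: config.height,
    deviceScaleFactor: config.scaleFactor
  }
}

console.log(toViewport(browserConfig))
```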

Output

In order for WebParsy to get contents, it needs some very basic details. These are:

as the property you want to be returned
selector the CSS selector to extract the HTML or text from

Other optional settings are:

parent Get the parent of the element filtered by a selector.

Example

text:
  selector: .entry-title
  as: entryLink
  parent: a

Transform

When you extract text from a web page, you might want to transform the data before returning it. example

You can use the following - transform methods:

uppercase transforms the result to uppercase
lowercase transforms the result to lowercase
absoluteUrl returns the absolute URL for a link
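
To make the effect of each transform concrete, here is a small JavaScript sketch of what they are described as doing. It is an illustration only, not WebParsy's internal code; the `transforms` object and the `pageUrl` argument are assumptions made for the example.

```javascript
// Illustrative re-implementation of the three documented transforms.
const transforms = {
  uppercase: value => String(value).toUpperCase(),
  lowercase: value => String(value).toLowerCase(),
  // Resolves a possibly relative link against the page it was found on.
  absoluteUrl: (value, pageUrl) => new URL(value, pageUrl).href
}

const pageUrl = 'https://github.com/marketplace?category=code-quality'
console.log(transforms.uppercase('codelingo'))
console.log(transforms.absoluteUrl('/marketplace/codelingo', pageUrl))
```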

Types

When extracting details from a page, you might want them returned in different formats, for example as a number when grabbing temperatures. example

You can use the following values for - type:

string
number
integer
float
fcd transforms to float a numeric string that uses a comma for thousands
fdc transforms to float a numeric string that uses a dot for thousands
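
The difference between fcd and fdc is easiest to see on sample input. The functions below are a sketch of the described behaviour, not WebParsy's own implementation. For fdc, it is assumed that the decimal mark is a comma whenever the dot is the thousands separator.

```javascript
// fcd: comma is the thousands separator, dot is the decimal mark ("1,234.56").
function fcd (text) {
  return parseFloat(String(text).replace(/,/g, ''))
}

// fdc: dot is the thousands separator; the decimal mark is assumed
// to be a comma ("1.234,56").
function fdc (text) {
  return parseFloat(String(text).replace(/\./g, '').replace(',', '.'))
}

console.log(fcd('1,234.56'), fdc('1.234,56'))
```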

Steps

Steps are the list of things the browser must do.

These can be:

goto Navigate to a URL
goBack Navigate to the previous page in history
screenshot Takes a screenshot of the page
pdf Takes a pdf of the page
text Gets the text for a given CSS selector
many Returns an array of elements given their CSS selectors
title Gets the title for the current page.
form Fill and submit forms
html Return HTML code for the page or a DOM element
click Click on an element
url Return the current URL
waitFor Wait for selectors or some time before continuing

goto

The URL to navigate the page to. The URL should include the scheme, e.g. https://. example

- goto: https://example.com

You can also tell WebParsy not to use Puppeteer to browse, and instead do a normal HTTP(S) GET request. This is much faster, but it may not be suitable for websites that require JavaScript. simple example / extended example

Note that some methods (for example: form, click and others) will not be available if you are not browsing using Puppeteer.

- goto:
    url: https://google.com
    method: get

You can also tell WebParsy which URLs to visit via flags (available via CLI and library). Example:

- goto:
    flag: websiteUrl

You can then call webparsy as:

webparsy definition.yml --websiteUrl "https://google.com"

or

webparsy.init({
  file: 'definition.yml',
  flags: { websiteUrl: 'https://google.com' }
})

example
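
The three forms of goto shown in this section (a plain URL, an object with url and method, and an object with flag) can be pictured as resolving to a single URL plus an optional method. The helper below is a hypothetical sketch of that resolution, written for illustration; `resolveGoto` is not a WebParsy function and the error message is invented.

```javascript
// Hypothetical helper: normalises the three documented goto forms into
// a { url, method } pair. `method` is undefined unless the step sets it.
function resolveGoto (step, flags = {}) {
  if (typeof step === 'string') return { url: step, method: undefined }
  if (step.flag !== undefined) {
    const url = flags[step.flag]
    if (!url) throw new Error(`Missing flag: ${step.flag}`)
    return { url, method: step.method }
  }
  return { url: step.url, method: step.method }
}

console.log(resolveGoto('https://example.com'))
console.log(resolveGoto({ url: 'https://google.com', method: 'get' }))
console.log(resolveGoto({ flag: 'websiteUrl' }, { websiteUrl: 'https://google.com' }))
```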

goBack

Navigate to the previous page in history. example

- goBack

screenshot

Takes a screenshot of the page. This triggers Puppeteer's page.screenshot. example

- screenshot:
    path: Github.png

pdf

Takes a PDF of the page. This triggers Puppeteer's page.pdf.

- pdf:
    path: Github.pdf

title

Gets the title for the current page. If no output.as property is defined, the page's title will be returned as { title }. example

- title

many

Returns an array of elements given their CSS selectors. example

Example:

- many:
    as: articles
    selector: main ol.articles-list li.article-item
    element:
      - text:
          selector: .title
          as: title

form

Fill and submit forms. example

Form filling can use values from environment variables. This is useful if you want to keep users' login details secret. In that case, instead of specifying the value as a string, set it as the env property of value. Check the example below or refer to the banking example.

Example:

- form:
    selector: "#tsf"            # form selector
    submit: true               # Submit after filling all details
    fill:                      # array of inputs to fill
      - selector: '[name="q"]' # input selector
        value: test            # input value

Using environment variables

- form:
    selector: "#login"            # form selector
    submit: true                  # Submit after filling all details
    fill:                         # array of inputs to fill
      - selector: '[name="user"]' # input selector
        value:
          env: USERNAME           # process.env.USERNAME
      - selector: '[name="pass"]' 
        value: 
          env: PASSWORD           # process.env.PASSWORD
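
The comments above indicate that an env entry is read from process.env. A minimal sketch of that lookup is shown below, assuming a value is either a literal or an object with an env key. `resolveValue` and `fakeEnv` are hypothetical names used only for this illustration.

```javascript
// Hypothetical helper: a fill value is either a literal or { env: NAME }.
function resolveValue (value, env = process.env) {
  if (value !== null && typeof value === 'object' && 'env' in value) {
    const resolved = env[value.env]
    if (resolved === undefined) {
      throw new Error(`Environment variable ${value.env} is not set`)
    }
    return resolved
  }
  return value
}

// A stand-in for process.env so the example runs without real credentials.
const fakeEnv = { USERNAME: 'alice', PASSWORD: 's3cret' }
console.log(resolveValue({ env: 'USERNAME' }, fakeEnv))
console.log(resolveValue('test', fakeEnv))
```

Keeping credentials in the environment means the YAML definition can be committed or shared without exposing them.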

html

Gets the HTML code. If no selector is specified, it returns the page's full HTML code. If no output.as property is defined, the result will be returned as { html }. example

Example:

- html:
    as: divHtml
    selector: div

click

Click on an element. example

Example:

- click: button.click-me

url

Return the current URL.

Example:

- url:
    as: currentUrl

waitFor

Waits for the specified CSS selector, or for a specific amount of time, before continuing. example

Examples:

- waitFor:
    selector: "#search-results"

- waitFor:
    time: 1000 # Time in milliseconds
