Franklin's fork:
- Added sos-california.yml to experiment with declarative parsing.
- Added Dockerfile and docker-compose.yml
Run with:
DEBUG=webparsy:* node cli.js examples/sos-california.yml
OR
docker-compose up --build
WebParsy is a Node.js library and CLI that lets you scrape websites using Puppeteer (or plain HTTP requests) and YAML definitions.
version: 1
jobs:
  main:
    steps:
      - goto: https://github.com/marketplace?category=code-quality
      - pdf:
          path: Github_Tools.pdf
          format: A4
      - many:
          as: github_tools
          selector: main .col-lg-9.mt-1.mb-4.float-lg-right a.col-md-6.mb-4.d-flex.no-underline
          element:
            - property:
                selector: a
                type: string
                property: href
                as: url
                transform: absoluteUrl
            - text:
                selector: h3.h4
                type: string
                transform: trim
                as: name
            - text:
                selector: p
                type: string
                transform: trim
                as: description
Returns an array of GitHub's tools and creates a PDF. Example output:
{
  "github_tools": [
    {
      "url": "https://github.com/marketplace/codelingo",
      "name": "codelingo",
      "description": "Your Code, Your Rules - Automate code reviews with your own best practices"
    },
    {
      "url": "https://github.com/marketplace/codebeat",
      "name": "codebeat",
      "description": "Code review expert on demand. Automated for mobile and web"
    },
    ...
  ]
}
Don't panic. There are examples for all WebParsy features in the examples folder. These are as basic as possible to help you get started.
- Overview
- Browser config
- Output
- Transform
- Types
- Steps
- goto Navigate to a URL
- goBack Navigate to the previous page in history
- screenshot Takes a screenshot of the page
- pdf Takes a pdf of the page
- text Gets the text for a given CSS selector
- many Returns an array of elements given their CSS selectors
- title Gets the title for the current page.
- form Fill and submit forms
- html Return HTML code for the page or a DOM element
- click Click on an element
- url Return the current URL
- waitFor Wait for selectors or some time before continuing
You can use WebParsy either as a CLI from your terminal or as a Node.js library.
Install webparsy:
$ npm i webparsy -g
$ webparsy example/_weather.yml --customFlag "custom flag value"
Result:
{
"title": "Madrid, España Pronóstico del tiempo y condiciones meteorológicas - The Weather Channel | Weather.com",
"city": "Madrid, España",
"temp": 18
}
const webparsy = require('webparsy')
const parsingResult = await webparsy.init({
  file: 'jobdefinition.yml',
  flags: { ... } // optional
})
options:
One of yaml, file or string is required.
- yaml: A yaml npm module instance of the scraping definition.
- string: The YAML definition, as a plain string.
- file: The path for the YAML file containing the scraping definition.
Additionally, you can pass a flags object property to input additional values to your scraping process.
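The "one of yaml, file or string is required" rule can be sketched as a small check. This is only an illustration: definitionSources is a hypothetical helper, not part of WebParsy's API, and how WebParsy itself reacts when several sources are given is not covered here.

```javascript
// Hypothetical helper (not WebParsy's own code): lists which definition
// sources were passed and fails when none of them is present.
function definitionSources (options = {}) {
  const given = ['yaml', 'file', 'string'].filter((key) => options[key] !== undefined)
  if (given.length === 0) {
    throw new Error('One of yaml, file or string is required')
  }
  return given
}

console.log(definitionSources({ file: 'jobdefinition.yml', flags: {} }))
// prints [ 'file' ]
```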
You can set up Chrome's details in the browser property within the main job.
jobs:
  main:
    browser:
      width: 1200
      height: 800
      scaleFactor: 1
      timeout: 60
      delay: 0
In order for WebParsy to get contents, it needs some very basic details. These are:
- as: the property you want to be returned
- selector: the CSS selector to extract the HTML or text from
Other optional options are:
- parent: get the parent of the element filtered by a selector.
Example
text:
  selector: .entry-title
  as: entryLink
  parent: a
When you extract text from a web page, you might want to transform the data before returning it.
You can use the following transform methods:
- uppercase: transforms the result to uppercase
- lowercase: transforms the result to lowercase
- absoluteUrl: returns the absolute URL for a link
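As a rough illustration of what these transforms do to an extracted value, here is a minimal sketch in plain Node.js. The transforms table and applyTransform are hypothetical stand-ins, not WebParsy's internals. trim is included because the GitHub example above uses it, and absoluteUrl is assumed to resolve a link against the URL of the page it came from.

```javascript
// Hypothetical stand-ins for the documented transforms (not WebParsy's code).
const transforms = {
  uppercase: (value) => String(value).toUpperCase(),
  lowercase: (value) => String(value).toLowerCase(),
  trim: (value) => String(value).trim(),
  // Assumption: a relative link is resolved against the current page URL.
  absoluteUrl: (value, pageUrl) => new URL(value, pageUrl).href
}

function applyTransform (name, value, pageUrl) {
  if (!Object.prototype.hasOwnProperty.call(transforms, name)) {
    throw new Error(`Unknown transform: ${name}`)
  }
  return transforms[name](value, pageUrl)
}

console.log(applyTransform(
  'absoluteUrl',
  '/marketplace/codebeat',
  'https://github.com/marketplace?category=code-quality'
))
// prints "https://github.com/marketplace/codebeat"
```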
When extracting details from a page, you might want them to be returned in different formats, for example as a number when grabbing temperatures.
You can use the following values for type:
- string
- number
- integer
- float
- fcd: transforms to float a string-number that uses comma for thousands
- fdc: transforms to float a string-number that uses dot for thousands
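The difference between fcd and fdc is easiest to see on a concrete value. The sketch below is an assumption about the intended behavior, not WebParsy's implementation: converters and convert are hypothetical names, fcd is taken to mean "comma for thousands, dot for decimals", and fdc the reverse.

```javascript
// Hypothetical converters mirroring the documented type values.
const converters = {
  string: (raw) => String(raw),
  number: (raw) => Number(raw),
  integer: (raw) => parseInt(raw, 10),
  float: (raw) => parseFloat(raw),
  // fcd: comma separates thousands, dot marks decimals ("1,234.56").
  fcd: (raw) => parseFloat(String(raw).replace(/,/g, '')),
  // fdc: dot separates thousands, comma marks decimals ("1.234,56").
  fdc: (raw) => parseFloat(String(raw).replace(/\./g, '').replace(',', '.'))
}

function convert (type, raw) {
  if (!Object.prototype.hasOwnProperty.call(converters, type)) {
    throw new Error(`Unknown type: ${type}`)
  }
  return converters[type](raw)
}

console.log(convert('fcd', '1,234.56')) // prints 1234.56
console.log(convert('fdc', '1.234,56')) // prints 1234.56
```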
Steps are the list of things the browser must do.
These can be:
- goto Navigate to a URL
- goBack Navigate to the previous page in history
- screenshot Takes a screenshot of the page
- pdf Takes a pdf of the page
- text Gets the text for a given CSS selector
- many Returns an array of elements given their CSS selectors
- title Gets the title for the current page.
- form Fill and submit forms
- html Return HTML code for the page or a DOM element
- click Click on an element
- url Return the current URL
- waitFor Wait for selectors or some time before continuing
URL to navigate the page to. The URL should include the scheme, e.g. https://.
- goto: https://example.com
You can also tell WebParsy not to use Puppeteer to browse, and instead do a normal HTTP(S) GET request. This will perform much faster, but it may not be suitable for websites that require JavaScript.
Note that some methods (for example: form, click and others) will not be available if you are not browsing using Puppeteer.
- goto:
    url: https://google.com
    method: get
You can also tell WebParsy which URLs it should visit via flags (available via CLI and library). Example:
- goto:
    flag: websiteUrl
You can then call webparsy as:
webparsy definition.yaml --websiteUrl "https://google.com"
or
webparsy.init({
  file: 'definition.yml',
  flags: { websiteUrl: 'https://google.com' }
})
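The three documented forms of goto (a plain URL, a url property, or a flag name) can be pictured as resolving to a single address. The following is a minimal sketch under that assumption. resolveGotoUrl is a hypothetical helper, not a WebParsy function, and the error it raises for a missing flag is illustrative only.

```javascript
// Hypothetical helper: turns a goto step into the URL to visit.
function resolveGotoUrl (step, flags = {}) {
  if (typeof step === 'string') return step // - goto: https://example.com
  if (step && step.url) return step.url     // - goto: { url: ..., method: get }
  if (step && step.flag) {                  // - goto: { flag: websiteUrl }
    const url = flags[step.flag]
    if (!url) throw new Error(`Missing flag: ${step.flag}`)
    return url
  }
  throw new Error('goto needs a URL, a url property or a flag')
}

console.log(resolveGotoUrl({ flag: 'websiteUrl' }, { websiteUrl: 'https://google.com' }))
// prints "https://google.com"
```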
Navigate to the previous page in history.
- goBack
Takes a screenshot of the page. This triggers Puppeteer's page.screenshot.
- screenshot:
  - path: Github.png
Takes a PDF of the page. This triggers Puppeteer's page.pdf.
- pdf:
  - path: Github.pdf
Gets the title for the current page. If no output.as property is defined, the page's title will be returned as { title }.
- title
Returns an array of elements given their CSS selectors.
Example:
- many:
    as: articles
    selector: main ol.articles-list li.article-item
    element:
      - text:
          selector: .title
          as: title
Fill and submit forms.
Form filling can use values from environment variables. This is useful if you want to keep users' login details secret. In that case, instead of specifying the value as a string, set it as the env property for value. Check the example below or refer to the banking example.
Example:
- form:
    selector: "#tsf"            # form selector
    submit: true                # Submit after filling all details
    fill:                       # array of inputs to fill
      - selector: '[name="q"]'  # input selector
        value: test             # input value
Using environment variables
- form:
    selector: "#login"             # form selector
    submit: true                   # Submit after filling all details
    fill:                          # array of inputs to fill
      - selector: '[name="user"]'  # input selector
        value:
          env: USERNAME            # process.env.USERNAME
      - selector: '[name="pass"]'
        value:
          env: PASSWORD            # process.env.PASSWORD
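A value in fill is therefore either a literal or an object naming an environment variable. One way to picture the lookup is the sketch below. resolveFillValue is a hypothetical helper rather than WebParsy's code, the sample value 'alice' is made up, and throwing on an unset variable is an assumption about sensible behavior.

```javascript
// Hypothetical helper: returns the text to type into a form input.
// `env` defaults to process.env but can be replaced for testing.
function resolveFillValue (value, env = process.env) {
  if (value !== null && typeof value === 'object' && 'env' in value) {
    const resolved = env[value.env]
    if (resolved === undefined) {
      throw new Error(`Environment variable ${value.env} is not set`)
    }
    return resolved
  }
  return String(value) // literal value, e.g. "test"
}

console.log(resolveFillValue({ env: 'USERNAME' }, { USERNAME: 'alice' }))
// prints "alice"
```

Keeping credentials in the environment means the YAML definition can be committed or shared without exposing them.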
Gets the HTML code. If no selector is specified, it returns the page's full HTML code. If no output.as property is defined, the result will be returned as { html }.
Example:
- html:
    as: divHtml
    selector: div
Click on an element.
Example:
- click: button.click-me
Return the current URL.
Example:
- url:
as: currentUrl
Wait for specified CSS selectors, or for a specific amount of time, before continuing.
Examples:
- waitFor:
    selector: "#search-results"
- waitFor:
    time: 1000 # Time in milliseconds
MIT © Jose Constela