The package is divided into two scripts: extract_divtag.py and div_checker.py. Their general purpose is to track which search fields need human supervision, gradually reducing manual intervention when running a crawler. Below is a detailed breakdown of each script.
Input: a CSV file containing the desired fields (e.g. phone, address), with each row representing a different entity. Additionally, the last column must be a "special field" that lists the names of the fields that were modified by a human. This column should be saved after running the crawler and performing human inspection, as it allows extract_divtag.py to keep track of which fields need monitoring. Resources.csv is an example of such an input file.
General Note: if the input file contains a URL that has expired or been altered, the script will crash. Please ensure that the scraped URLs in the input file are up to date and accessible.
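
A minimal sketch of how such an input file could be loaded, assuming a pandas-based reader. The column names ("phone", "address"), the semicolon-separated format of the special field, and the helper name are illustrative assumptions, not the actual layout of Resources.csv:

```python
import pandas as pd

def load_resources(csv_path="Resources.csv"):
    """Load the input CSV and split out the human-modified field names."""
    df = pd.read_csv(csv_path)
    # The last column is the "special field": names of fields a human edited
    # after the crawler ran (here assumed to be separated by semicolons).
    special_col = df.columns[-1]
    records = []
    for _, row in df.iterrows():
        monitored = [f.strip() for f in str(row[special_col]).split(";") if f.strip()]
        records.append({"row": row, "monitored_fields": monitored})
    return records
```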
Output: a JSON file that contains a set of tuples for each location ID. Each tuple contains:
- The desired field
- The HTML element corresponding to the field
- The content of the field
This JSON file can later be used to quickly check whether the tracked fields have changed.
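
A hypothetical sketch of what that JSON layout might look like: each location ID maps to a list of (field, HTML element, content) entries. The key names, the "LOC_001" ID, and the use of CSS selectors to identify the HTML element are illustrative assumptions, not the script's actual schema:

```python
import json

# Placeholder data mirroring the (field, element, content) tuples described above.
tracked = {
    "LOC_001": [
        {"field": "phone", "element": "div.phone", "content": "555-0100"},
        {"field": "address", "element": "span#addr", "content": "1 Main St"},
    ]
}

with open("tracked_fields.json", "w") as f:
    json.dump(tracked, f, indent=2)
```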
Input: a JSON file formatted as detailed above (the output of extract_divtag.py).
Output: command-line output indicating whether the content of each tracked field has remained the same or has been changed by the external host. If the content has changed, the script automatically outputs the new content, which can be integrated with a crawler to reduce human supervision during "refresh updates".
Note: all entities should be mapped through an ID.
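
A rough sketch of the comparison step, assuming requests and BeautifulSoup and the illustrative JSON layout shown above. The url_by_id mapping and the function name are assumptions; div_checker.py may resolve URLs and elements differently:

```python
import json
import requests
from bs4 import BeautifulSoup

def check_tracked_fields(json_path, url_by_id):
    """Compare the stored field contents against the live pages."""
    with open(json_path) as f:
        tracked = json.load(f)
    for loc_id, fields in tracked.items():
        html = requests.get(url_by_id[loc_id], timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for entry in fields:
            node = soup.select_one(entry["element"])
            current = node.get_text(strip=True) if node else None
            if current == entry["content"]:
                print(f"{loc_id}/{entry['field']}: unchanged")
            else:
                # Changed by the external host: report the new content so a
                # crawler can pick it up without human supervision.
                print(f"{loc_id}/{entry['field']}: changed -> {current!r}")
```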
This file contains all the code for the main support vector machine (SVM) ML model. The model takes labeled input generated in a similar manner to extract_divtag.py. Features are extracted from the text content within the HTML tag. Features relating to tag attributes and HTML structure proved to have little to no predictive potential (so they were removed from the model). After hyperparameter tuning, the model reaches 95%+ classification accuracy.
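
A minimal sketch of a text-only SVM pipeline of the kind described above, using scikit-learn. The sample texts, labels, feature settings, and parameter grid are illustrative assumptions; the actual model file may extract features and tune hyperparameters differently:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Text content extracted from HTML tags, labeled with the field it belongs to.
texts = [
    "555-0100", "(212) 555-0199", "+1 303 555 0123",
    "1 Main St, Springfield", "42 Baker Street, London", "500 5th Ave, New York",
    "Open 9am-5pm", "Mon-Fri 08:00-18:00", "Closed on Sundays",
]
labels = ["phone"] * 3 + ["address"] * 3 + ["hours"] * 3

# Only text-derived features: character n-gram TF-IDF feeding an SVM.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("svm", SVC()),
])

# Simple hyperparameter tuning over the SVM's C and kernel.
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```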