The package is divided into two scripts: extract_divtag.py and div_checker.py. Their general purpose is to track which search fields need human supervision, gradually reducing manual intervention when running a crawler. Below is a detailed breakdown of each script.
Input: a CSV file containing the desired fields (e.g. phone, address), with each row representing a different entity. Additionally, the last column must be a "special field" that lists the names of the fields that were modified by a human. This column should be saved after running the crawler and performing human inspection, as it allows extract_divtag.py to keep track of which fields need monitoring. Resources.csv is an example of such an input file.
General Note: if the input file contains a URL that has expired or been altered, the script will crash. Please ensure that the scraped URLs in the input file are up to date and accessible.
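
A minimal sketch of how such an input file could be loaded, assuming a pandas-based reader. The column names ("phone", "address"), the semicolon-separated format of the special field, and the helper name are illustrative assumptions, not the actual layout of Resources.csv:

```python
import pandas as pd

def load_resources(csv_path="Resources.csv"):
    """Load the input CSV and split out the human-modified field names."""
    df = pd.read_csv(csv_path)
    # The last column is the "special field": names of fields a human edited
    # after the crawler ran (here assumed to be separated by semicolons).
    special_col = df.columns[-1]
    records = []
    for _, row in df.iterrows():
        monitored = [f.strip() for f in str(row[special_col]).split(";") if f.strip()]
        records.append({"row": row, "monitored_fields": monitored})
    return records
```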
Output: a JSON file that contains a set of tuples for each location ID. Each tuple contains:
- The desired field
- The HTML element corresponding to the field
- The content of the field
This JSON file can later be used to quickly check whether the tracked fields have changed.
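
A hypothetical sketch of what that JSON layout might look like: each location ID maps to a list of (field, HTML element, content) entries. The key names, the "LOC_001" ID, and the use of CSS selectors to identify the HTML element are illustrative assumptions, not the script's actual schema:

```python
import json

# Placeholder data mirroring the (field, element, content) tuples described above.
tracked = {
    "LOC_001": [
        {"field": "phone", "element": "div.phone", "content": "555-0100"},
        {"field": "address", "element": "span#addr", "content": "1 Main St"},
    ]
}

with open("tracked_fields.json", "w") as f:
    json.dump(tracked, f, indent=2)
```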
Input: a JSON file formatted as detailed above (the output of extract_divtag.py).
Output: command-line output indicating whether the content of each tracked field has remained the same or has been changed by the external host. If the content has changed, the script automatically outputs the new content, which can be integrated with a crawler to reduce human supervision during "refresh updates".
Note: all entities should be mapped through an ID.
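
A rough sketch of the comparison step, assuming requests and BeautifulSoup and the illustrative JSON layout shown above. The url_by_id mapping and the function name are assumptions; div_checker.py may resolve URLs and elements differently:

```python
import json
import requests
from bs4 import BeautifulSoup

def check_tracked_fields(json_path, url_by_id):
    """Compare the stored field contents against the live pages."""
    with open(json_path) as f:
        tracked = json.load(f)
    for loc_id, fields in tracked.items():
        html = requests.get(url_by_id[loc_id], timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for entry in fields:
            node = soup.select_one(entry["element"])
            current = node.get_text(strip=True) if node else None
            if current == entry["content"]:
                print(f"{loc_id}/{entry['field']}: unchanged")
            else:
                # Changed by the external host: report the new content so a
                # crawler can pick it up without human supervision.
                print(f"{loc_id}/{entry['field']}: changed -> {current!r}")
```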
This file contains all the code for the main support vector machine (SVM) ML model. The model takes labeled input generated in a similar manner to extract_divtag.py. Features are extracted from the text content within the HTML tag. Features relating to tag attributes and HTML structure proved to have little to no predictive potential (so they were removed from the model). After hyperparameter tuning, the model reaches 95%+ classification accuracy.
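
A minimal sketch of a text-only SVM pipeline of the kind described above, using scikit-learn. The sample texts, labels, feature settings, and parameter grid are illustrative assumptions; the actual model file may extract features and tune hyperparameters differently:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Text content extracted from HTML tags, labeled with the field it belongs to.
texts = [
    "555-0100", "(212) 555-0199", "+1 303 555 0123",
    "1 Main St, Springfield", "42 Baker Street, London", "500 5th Ave, New York",
    "Open 9am-5pm", "Mon-Fri 08:00-18:00", "Closed on Sundays",
]
labels = ["phone"] * 3 + ["address"] * 3 + ["hours"] * 3

# Only text-derived features: character n-gram TF-IDF feeding an SVM.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("svm", SVC()),
])

# Simple hyperparameter tuning over the SVM's C and kernel.
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```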