Experiment code and data artifacts released alongside the publication of our paper at TheWebConf'21.
We include all original source code and other data essential for reproducing our experiment setup. Each sub-directory in this repository contains further documentation on its contents. A summary:
crawler
: code to visit URLs and collect/archive raw data

crawl-post-processor
: code to post-process raw collected data into aggregate/deduplicated metrics for analysis

experiment-generator
: code to process a top-sites list (Alexa, Tranco) and generate batches of crawl jobs for execution (see the batching sketch after this list)

misc
: assorted utilities and sidecars to support our primary components working across multiple clusters

tranco
: a snapshot of the Tranco top million Web domains as used in the paper's experiment

work-dispatcher
: the work queue producing/consuming framework used in our experiment
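For illustration, the sketch below shows the kind of batching experiment-generator performs on a top-sites list. It is a minimal sketch, not the tool's actual interface: the input path, batch size, job fields, and output naming are all our own assumptions.

```python
#!/usr/bin/env python3
"""Split a top-sites CSV (rank,domain) into fixed-size batches of crawl jobs.

All names here (input path, batch size, job/output format) are hypothetical.
"""
import csv
import json

TOP_SITES_CSV = "tranco/top-1m.csv"  # hypothetical path to the snapshot
BATCH_SIZE = 1000                    # hypothetical batch size

def read_domains(path):
    """Yield (rank, domain) pairs from a Tranco/Alexa-style CSV."""
    with open(path, newline="") as f:
        for rank, domain in csv.reader(f):
            yield int(rank), domain

batch, batch_no = [], 0
for rank, domain in read_domains(TOP_SITES_CSV):
    batch.append({"rank": rank, "url": f"http://{domain}/"})
    if len(batch) == BATCH_SIZE:
        # Flush a full batch of crawl jobs to its own file.
        with open(f"batch-{batch_no:04d}.json", "w") as out:
            json.dump(batch, out)
        batch, batch_no = [], batch_no + 1
if batch:  # flush the final partial batch
    with open(f"batch-{batch_no:04d}.json", "w") as out:
        json.dump(batch, out)
```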
We do not include the Kubernetes/Kustomization resource definition files we used to orchestrate these components, as they are highly specific to our infrastructure. As deployed, they contained significant private information, and the redaction required to release them would render them both undeployable and unenlightening.
A compressed PostgreSQL database backup (SQL script) containing all of the post-processed data from the spring 2020 collection experiment described in the paper is available here.
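If the backup is a gzip-compressed plain SQL script (an assumption; adjust if the archive uses another format), something like the following sketch can restore it into a fresh local database. The file and database names are hypothetical.

```python
#!/usr/bin/env python3
"""Restore the compressed PostgreSQL backup into a local database.

Assumes a gzip-compressed plain SQL script and a running PostgreSQL server;
the archive and database names below are hypothetical placeholders.
"""
import gzip
import shutil
import subprocess

BACKUP = "crawl-data-spring2020.sql.gz"  # hypothetical archive name
DBNAME = "crawl_data"                    # hypothetical target database

# Create an empty target database to restore into.
subprocess.run(["createdb", DBNAME], check=True)

# Decompress the gzip'd SQL script and feed it to psql on stdin
# (psql treats the filename "-" as standard input).
with gzip.open(BACKUP, "rb") as src:
    psql = subprocess.Popen(["psql", "-d", DBNAME, "-f", "-"],
                            stdin=subprocess.PIPE)
    shutil.copyfileobj(src, psql.stdin)
    psql.stdin.close()
    psql.wait()
    if psql.returncode != 0:
        raise SystemExit("psql restore failed")
```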