- Run get_newspaper.sh in data/newspaper/
  This will download and extract a .tsv of a larger Kaggle dataset (https://www.kaggle.com/alvations/old-newspapers)
- Run the filter_raw_data.py script.
  This will extract the relevant data (English-language newspapers) and write it to the newspaper_dataset.pickle file, which can then be read and processed by the helper methods in data_utils.py
- Delete the .tsv file (optional)
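
The filtering and loading steps above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of filter_raw_data.py or data_utils.py: the function names `filter_english` and `load_dataset` are hypothetical, and the column names `Language` and `Text` are assumptions about the .tsv header that should be checked against the downloaded file.

```python
import csv
import pickle


def filter_english(tsv_path, pickle_path, language="English"):
    """Keep the text of rows in the given language and pickle it.

    Assumes the .tsv has a header row with 'Language' and 'Text' columns.
    Returns the number of rows kept.
    """
    texts = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE: treat quote characters inside articles as plain text
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row["Language"] == language:
                texts.append(row["Text"])
    with open(pickle_path, "wb") as f:
        pickle.dump(texts, f)
    return len(texts)


def load_dataset(pickle_path):
    """Read the pickled list of texts back into memory."""
    with open(pickle_path, "rb") as f:
        return pickle.load(f)
```

Once the pickle has been written, the .tsv is no longer needed by this sketch, which is why deleting it is optional. Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from untrusted input.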