The `preprocess.py` script is available at https://github.com/whatevery1says/preprocessing/blob/master/preprocess.py. It handles preprocessing from the command line only. It performs the following algorithm (a minimal sketch follows the list):
- Reads the JSON manifest(s) and runs the specified property through a spaCy `nlp` object.
- Removes properties from the manifest, if specified.
- Generates a table of spaCy nlp features, sorts it, and adds it to the manifest without indexes; the structure is a list of lists.
- Creates a bag of terms dict (not including punctuation and line breaks) and adds it to the manifest.
- Adds any additional specified properties (e.g. stems or ngrams) as lists to the manifest.
- Adds a list of the document's readability scores to the manifest.
- Adds the total word count (skipping punctuation and line breaks) to the manifest.
- Adds the language model metadata.
- Saves the new manifest over the old one.
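A minimal sketch of this pipeline, assuming the small English model and illustrative manifest keys (`features`, `bag_of_words`, `word_count`, and `language_model` are placeholders, not necessarily the script's actual keys). Readability scoring and the optional stem/ngram properties are omitted here:

```python
# Sketch only: round-trips one manifest through the steps listed above.
import json
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")  # assumed small model; see the notes below

def preprocess_manifest(filepath, prop="content_scrubbed"):
    with open(filepath, "r", encoding="utf-8") as f:
        manifest = json.load(f)
    doc = nlp(manifest[prop])
    # Table of spaCy features, sorted, stored as a list of lists (no indexes).
    manifest["features"] = sorted(
        [t.text, t.lemma_, t.pos_, t.tag_, t.dep_] for t in doc
    )
    # Bag of terms, skipping punctuation and line breaks.
    terms = [t.text for t in doc if not (t.is_punct or t.is_space)]
    manifest["bag_of_words"] = dict(Counter(terms))
    # Total word count, with the same tokens skipped.
    manifest["word_count"] = len(terms)
    # Language model metadata supplied by spaCy.
    manifest["language_model"] = nlp.meta
    # Save the new manifest over the old one.
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
```

Calling `preprocess_manifest("data/2010_10_humanities_student_major_5_askreddit.json")` would round-trip a single manifest in place.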
The entire process took 3 to 4 seconds for 11 files on my laptop.
The command line arguments are as follows (a hypothetical `argparse` sketch follows the list):

- `--path` (required): The file path to the directory containing the JSON manifest file. The script walks through subdirectories.
- `--filename` (required): The name of the JSON manifest file, with the `.json` extension.
- `--property` (required): The name of the JSON property to be preprocessed.
- `--add-properties` (optional): A comma-separated list of properties to be added to the manifest file.
- `--remove-properties` (optional): A comma-separated list of properties to be removed from the manifest file.
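For orientation, the flag handling might look roughly like the `argparse` sketch below. This is an assumption about the interface, not the script's actual parsing code; `--filename` is marked required above, but the directory-wide sample commands further down omit it, so the sketch leaves it optional.

```python
# Hypothetical argparse setup mirroring the documented flags.
import argparse

parser = argparse.ArgumentParser(description="Preprocess WE1S JSON manifests.")
parser.add_argument("--path", required=True,
                    help="Directory containing the JSON manifest file(s); "
                         "subdirectories are walked.")
parser.add_argument("--filename",
                    help="Name of the JSON manifest file, with the .json "
                         "extension; the directory-wide examples omit it.")
parser.add_argument("--property", required=True,
                    help="Name of the JSON property to be preprocessed.")
parser.add_argument("--add-properties",
                    help="Comma-separated list of properties to add.")
parser.add_argument("--remove-properties",
                    help="Comma-separated list of properties to remove.")
args = parser.parse_args()  # e.g. args.add_properties, args.remove_properties
```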
Sample commands:

```
python preprocess.py --path=data --filename=2010_10_humanities_student_major_5_askreddit.json --property=content_scrubbed
python preprocess.py --path=data --filename=2010_10_humanities_student_major_5_askreddit.json --property=content --remove-properties=content_scrubbed
```
Sample commands omitting `--filename`:

```
python preprocess.py --path=data --property=content_scrubbed
python preprocess.py --path=data --property=content --remove-properties=content_scrubbed
```
- Some fine tuning may be needed for the language model.
- Try switching to the larger spaCy language models.
- WE1S windowed ngrams need to be added; right now only normal ngrams work (see the sketch after this list).
- To use the large language model, you first need to install it from the command line with `python -m spacy download en_core_web_lg`. I haven't done this because it would take up a lot of space on my laptop for about a 1% improvement in accuracy, but we could do it on the server. Once it is installed, you just change the `model` configuration to `'en_core_web_lg'` in `processing.py`.
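For reference on the ngram note above, "normal" ngrams are just contiguous token windows. A minimal helper, hypothetical rather than the script's actual implementation, might look like:

```python
# Hypothetical: contiguous ("normal") ngrams; WE1S windowed ngrams would
# instead pair terms that co-occur within a wider window of tokens.
def ngrams(tokens, n=2):
    """Return contiguous n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["the", "digital", "humanities"], 2))
# ['the digital', 'digital humanities']
```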