Factor a document-word co-occurrence matrix, scaled with positive pointwise mutual information (PPMI), using singular value decomposition (SVD).
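The PPMI scaling and SVD factorization can be sketched roughly as follows. This is a minimal illustration on a toy count matrix, not the code in main.py: the function names `ppmi` and `svd_embed` are hypothetical, and it uses a dense `numpy.linalg.svd` for brevity, whereas a real run over WikiText would likely need a sparse, truncated solver.

```python
import numpy as np


def ppmi(counts):
    """Scale a document-word count matrix with positive PMI (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)   # per-document totals
    col = counts.sum(axis=0, keepdims=True)   # per-word totals
    expected = row * col / total              # counts expected under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts / expected)
    pmi[~np.isfinite(pmi)] = 0.0              # zero counts give -inf or nan; treat as 0
    return np.maximum(pmi, 0.0)               # clip negative PMI to zero


def svd_embed(matrix, dim):
    """Return dim-dimensional row (document) vectors from a truncated SVD."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :dim] * s[:dim]


# Toy example: 3 documents, 4 words.
counts = np.array([[2, 1, 0, 0],
                   [1, 2, 0, 0],
                   [0, 0, 3, 1]])
doc_vecs = svd_embed(ppmi(counts), dim=2)
print(doc_vecs.shape)
```

Scaling the left singular vectors by the singular values is one common choice for the document vectors; other weightings are possible.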
We use the WikiText dataset.
To extract the documents from WikiText and save them as a JSON file, run:
mkdir data
./parse-wikitext.py wikitext-2-raw/wiki.train.raw data/wikitext-2-raw.docs.json
In the project terminal, run
mkdir vec
./main.py --data data/wikitext-2-raw.docs.json --outpath vec/wikitext-2-raw.vec.txt \
--lower --num-words 1000 --dim 10
for a quick demo. Plots are saved in the folder plots.
To rank the documents based on the vectors, use:
./rank.py vec/wikitext-2-raw.vec.txt > wikitext-2-raw.ranking.txt
Requirements:
numpy
scipy
tqdm
matplotlib
sklearn
bokeh