Code for the Massive Data Mining course at SJTU.
A basic problem implemented in PySpark: read in massive text data and use Map and Reduce to count the occurrences of each word efficiently. Additional tasks include finding the most frequently mentioned word, counting the occurrences of a given word, and counting the words that start with a given letter (a minimal sketch follows below). Got the full score.
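A minimal sketch of the word-count pipeline, assuming a hypothetical input path `input.txt` and using `spark` and the letter `s` as example queries (not the assignment's actual inputs):

```python
# Minimal PySpark word count sketch; the file path and query choices are
# assumptions for illustration, not the exact assignment setup.
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

counts = (
    sc.textFile("input.txt")                 # hypothetical input path
      .flatMap(lambda line: line.split())    # map: split lines into words
      .map(lambda word: (word, 1))           # map: emit (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)       # reduce: sum counts per word
)

# The additional tasks mentioned above:
most_frequent = counts.max(key=lambda kv: kv[1])   # most frequent word
count_of_word = counts.lookup("spark")             # [count] or [] if absent
starting_with_s = (counts
    .filter(lambda kv: kv[0].startswith("s"))      # words starting with 's'
    .map(lambda kv: kv[1])
    .sum())

sc.stop()
```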
Implemented DGIM to approximate the count of 1s in a streaming data file, and compared the approximate count with the exact one. Implemented LSH to estimate the similarity between documents. DGIM gives the right answer; LSH runs without errors, but its results are off. Sketches of both follow.
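A minimal sketch of the DGIM bucket logic under the standard formulation (power-of-two bucket sizes, at most two buckets per size); the class name, window handling, and toy stream are illustrative assumptions:

```python
# DGIM sketch: approximate the count of 1s in the last N bits of a stream.
class DGIM:
    def __init__(self, window_size):
        self.N = window_size
        self.t = 0               # current timestamp
        self.buckets = []        # (end_timestamp, size), newest first

    def update(self, bit):
        self.t += 1
        # drop buckets that have fallen entirely out of the window
        while self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.t, 1))
        # merge: whenever three buckets share a size, combine the two oldest
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 2][1]:
                ts, size = self.buckets[i + 1]   # newer of the two oldest
                self.buckets[i + 1] = (ts, size * 2)
                del self.buckets[i + 2]
            i += 1

    def estimate(self):
        # sum all bucket sizes, but count only half of the oldest bucket
        if not self.buckets:
            return 0
        return sum(s for _, s in self.buckets) - self.buckets[-1][1] // 2

dgim = DGIM(window_size=100)
for bit in [1, 0, 1, 1, 0, 1] * 50:   # toy stream, not the assignment file
    dgim.update(bit)
print(dgim.estimate())                # approximate count of 1s in the window
```

And a minimal sketch of document LSH via MinHash signatures and banding; the hash construction, toy shingle sets, and band/row parameters are assumptions, not the assignment's exact setup:

```python
# MinHash + LSH sketch: build short signatures that preserve Jaccard
# similarity, then band the signatures so similar documents collide.
import random
from collections import defaultdict

def make_hash_funcs(n, prime=4294967311, seed=0):
    rnd = random.Random(seed)
    params = [(rnd.randrange(1, prime), rnd.randrange(prime)) for _ in range(n)]
    # note: Python's built-in hash() of str is salted per process, so
    # signatures are only consistent within a single run
    return [lambda s, a=a, b=b: (a * hash(s) + b) % prime for a, b in params]

def minhash_signature(shingles, hash_funcs):
    # signature[i] is the minimum of h_i over the document's shingle set
    return [min(h(s) for s in shingles) for h in hash_funcs]

def lsh_candidates(signatures, bands, rows):
    # documents agreeing on every row of some band become candidate pairs
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update((ids[i], ids[j])
                     for i in range(len(ids)) for j in range(i + 1, len(ids)))
    return pairs

hs = make_hash_funcs(20)
sigs = {"d1": minhash_signature({"a b", "b c", "c d"}, hs),
        "d2": minhash_signature({"a b", "b c", "c e"}, hs),
        "d3": minhash_signature({"x y", "y z"}, hs)}
print(lsh_candidates(sigs, bands=10, rows=2))   # likely {("d1", "d2")}
```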
A template written by myself. Validated PageRank against the built-in function; Node2vec is implemented by following the instructions. Got the full score. A sketch of the validation idea is below.
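A minimal sketch of validating a hand-written PageRank against a built-in, assuming the built-in refers to `networkx.pagerank`; the toy graph and tolerances are illustrative:

```python
# PageRank by power iteration, checked against the networkx built-in.
import networkx as nx

def pagerank(G, alpha=0.85, tol=1e-10, max_iter=200):
    nodes = list(G.nodes())
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    out_deg = dict(G.out_degree())
    for _ in range(max_iter):
        new = {v: (1.0 - alpha) / n for v in nodes}   # teleport term
        for u in nodes:
            if out_deg[u] == 0:                       # dangling node: spread evenly
                for v in nodes:
                    new[v] += alpha * rank[u] / n
            else:
                share = alpha * rank[u] / out_deg[u]
                for v in G.successors(u):
                    new[v] += share
        done = sum(abs(new[v] - rank[v]) for v in nodes) < tol
        rank = new
        if done:
            break
    return rank

G = nx.DiGraph([(1, 2), (2, 3), (3, 1), (1, 3)])      # toy graph
mine = pagerank(G)
builtin = nx.pagerank(G, alpha=0.85)                  # the built-in reference
assert all(abs(mine[v] - builtin[v]) < 1e-4 for v in G)
```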
A heterogeneous network embedded with Metapath2vec (metapath-guided random walks + Node2vec-style skip-gram + negative sampling; see the sketch below). Placed 3rd in the class's Kaggle competition.
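A minimal sketch of the metapath-guided walk plus skip-gram training, assuming gensim's `Word2Vec` for the skip-gram / negative-sampling stage; the toy author-paper graph, the (A, P, A) metapath, and all parameters are illustrative:

```python
# Metapath2vec sketch: metapath-guided random walks on a heterogeneous
# graph, then skip-gram with negative sampling via gensim.
import random
from gensim.models import Word2Vec

def metapath_walk(neighbors, node_type, start, metapath, walk_len, rng):
    cycle = metapath[:-1]                          # ("A","P","A") -> ("A","P")
    walk = [start]
    for step in range(walk_len - 1):
        want = cycle[(step + 1) % len(cycle)]      # node type required next
        cands = [v for v in neighbors[walk[-1]] if node_type[v] == want]
        if not cands:                              # dead end for this metapath
            break
        walk.append(rng.choice(cands))
    return walk

# Toy author(A)-paper(P) graph.
neighbors = {"a1": ["p1"], "a2": ["p1", "p2"], "a3": ["p2"],
             "p1": ["a1", "a2"], "p2": ["a2", "a3"]}
node_type = {"a1": "A", "a2": "A", "a3": "A", "p1": "P", "p2": "P"}

rng = random.Random(0)
walks = [metapath_walk(neighbors, node_type, a, ("A", "P", "A"), 20, rng)
         for a in ("a1", "a2", "a3") for _ in range(10)]

# Skip-gram (sg=1) with negative sampling (negative=5).
model = Word2Vec(sentences=walks, vector_size=64, window=3,
                 sg=1, negative=5, min_count=0, seed=0)
emb = model.wv["a1"]                               # learned embedding
```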
A homogeneous network embedded with node2vec. The results were not good.
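For reference, a minimal sketch of the node2vec biased walk (return parameter p, in-out parameter q); the graph and parameters are illustrative, and embeddings would again come from skip-gram on the walks as above:

```python
# node2vec biased walk sketch: p penalizes returning to the previous node,
# q controls how far the walk explores outward.
import random
import networkx as nx

def node2vec_walk(G, start, walk_len, p, q, rng):
    walk = [start]
    while len(walk) < walk_len:
        cur = walk[-1]
        nbrs = list(G.neighbors(cur))
        if not nbrs:
            break
        if len(walk) == 1:                    # first step is uniform
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:                     # going back: weight 1/p
                weights.append(1.0 / p)
            elif G.has_edge(x, prev):         # stays near prev: weight 1
                weights.append(1.0)
            else:                             # moving outward: weight 1/q
                weights.append(1.0 / q)
        walk.append(rng.choices(nbrs, weights=weights)[0])
    return walk

G = nx.karate_club_graph()                    # toy graph
rng = random.Random(0)
walks = [node2vec_walk(G, v, 40, p=1.0, q=0.5, rng=rng)
         for v in G for _ in range(5)]
```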