Massive-Data-Mining

Code for the Massive Data Mining course at SJTU.

HW1: Wordcount Problem

A basic problem implemented in PySpark. The program reads a massive text dataset and uses Map and Reduce to count the occurrences of each word in parallel, which saves time. Additional tasks include finding the most frequently mentioned word, looking up the count of a given word, and counting the words that start with a given letter. Received full marks.
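The map and reduce steps can be sketched in plain Python so the example runs without a Spark installation. This is only an illustration, not the homework code: the two sample lines, the helper `map_line`, and the choice to count total occurrences (rather than distinct words) for the "starts with a letter" task are all assumptions. In PySpark the same steps would typically be expressed with the RDD methods `flatMap`, `map`, and `reduceByKey`.

```python
from collections import Counter
from functools import reduce

# Hypothetical sample input; the real homework reads a large text file.
lines = [
    "spark makes big data simple",
    "big data needs big tools",
]


def map_line(line):
    """Map step: turn one line into per-word counts for that line."""
    return Counter(line.lower().split())


# Reduce step: merge the per-line counts into one global count.
word_counts = reduce(lambda a, b: a + b, map(map_line, lines), Counter())

# Follow-up tasks described above.
most_common_word, most_common_count = word_counts.most_common(1)[0]
count_of_data = word_counts["data"]
# Assumption: total occurrences of words starting with "s".
starting_with_s = sum(c for w, c in word_counts.items() if w.startswith("s"))
```

Because merging two counters is associative and commutative, the reduce step can be applied to partitions in any order, which is what lets a framework such as Spark distribute the work.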

HW2: DGIM and LSH

Implemented DGIM to approximate the number of 1s in a streaming data file, and compared the approximate count with the exact one. Implemented LSH to determine the similarity between documents. The DGIM result is correct. The LSH code runs without errors, but its result is not correct.
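For reference, a minimal sketch of the DGIM idea is shown below. It is not the homework implementation: the class name `DGIM`, its method names, and the sample stream are assumptions made for illustration. It keeps at most two buckets of each power-of-two size and estimates the count as the sum of all buckets minus half of the oldest one (rounded so that a size-1 bucket counts fully).

```python
class DGIM:
    """Approximate count of 1s in the last `window` bits of a stream."""

    def __init__(self, window):
        self.window = window
        self.time = 0
        # Buckets as (timestamp of most recent 1, size), newest first.
        self.buckets = []

    def add(self, bit):
        self.time += 1
        # Drop buckets whose most recent 1 has left the window.
        while self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.time, 1))
        # While three buckets share a size, merge the two oldest of them.
        i = 0
        while (i + 2 < len(self.buckets)
               and self.buckets[i][1] == self.buckets[i + 2][1]):
            newer_ts, size = self.buckets[i + 1]
            self.buckets[i + 1] = (newer_ts, size * 2)
            del self.buckets[i + 2]
            i += 1

    def count(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # Only part of the oldest bucket may lie inside the window.
        return total - self.buckets[-1][1] // 2


# Compare the estimate with the exact count on a small sample stream.
stream = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
dgim = DGIM(window=8)
for bit in stream:
    dgim.add(bit)
estimate = dgim.count()
exact = sum(stream[-8:])
```

With this bucket rule the estimate is off by at most 50% of the exact count, while the number of buckets stays logarithmic in the window size.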

HW3: PageRank and Node2vec

Built on a template I wrote myself. PageRank was validated against the built-in function. Node2vec was implemented by following the assignment instructions. Received full marks.
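As an illustration of what a hand-written PageRank looks like, here is a small power-iteration sketch. It is not the homework code: the function name `pagerank`, the damping factor of 0.85, the uniform redistribution of rank from dangling nodes, and the three-node sample graph are assumptions chosen for the example.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=100):
    """Power-iteration PageRank over a list of directed (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without out-links is spread over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n
                    for v in nodes}
        # Each node passes its rank evenly along its out-links.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        diff = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if diff < tol:
            break
    return rank


# A directed 3-cycle: by symmetry every node should get the same rank.
ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a")])
```

A result like this can be checked against a library implementation on the same graph, as long as both use the same damping factor and the same treatment of dangling nodes.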

FINAL: Node classification and link prediction

Node classification

A heterogeneous network, handled with Metapath2vec (random walk + Node2vec + negative sampling). Ranked 3rd in the class Kaggle competition.
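The random-walk stage of Metapath2vec can be sketched as follows. This is an illustration only: the toy author/paper/venue graph, the metapath, and the function `metapath_walk` are hypothetical and are not taken from the competition data or code. Each step picks uniformly among the neighbors whose type matches the next entry of the metapath.

```python
import random

# Hypothetical toy heterogeneous graph: node types and undirected adjacency.
node_type = {
    "a1": "author", "a2": "author",
    "p1": "paper", "p2": "paper",
    "v1": "venue",
}
neighbors = {
    "a1": ["p1"],
    "a2": ["p1", "p2"],
    "p1": ["a1", "a2", "v1"],
    "p2": ["a2", "v1"],
    "v1": ["p1", "p2"],
}


def metapath_walk(start, metapath, walk_length, rng):
    """Random walk whose node types cycle through `metapath`.

    The first type in `metapath` is assumed to match the start node.
    """
    walk = [start]
    while len(walk) < walk_length:
        wanted = metapath[len(walk) % len(metapath)]
        candidates = [n for n in neighbors[walk[-1]]
                      if node_type[n] == wanted]
        if not candidates:
            break  # no neighbor of the required type, stop early
        walk.append(rng.choice(candidates))
    return walk


rng = random.Random(0)
metapath = ["author", "paper", "venue", "paper"]
walk = metapath_walk("a1", metapath, walk_length=9, rng=rng)
```

Walks generated this way are then treated like sentences and fed to a skip-gram model trained with negative sampling to obtain node embeddings; that training step is omitted here.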

Link prediction

A homogeneous network, handled with node2vec. The results were not good.
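For context, the biased second-order walk that node2vec uses to sample node sequences can be sketched as below. This is not the project code: the small sample graph, the function `node2vec_walk`, and the values of the return parameter `p` and in-out parameter `q` are assumptions for illustration. From the current node, stepping back to the previous node gets weight 1/p, stepping to a neighbor of the previous node gets weight 1, and any other step gets weight 1/q.

```python
import random

# Hypothetical undirected graph as adjacency lists.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b"],
}


def node2vec_walk(graph, start, length, p, q, rng):
    """Sample one node2vec-style walk of up to `length` nodes."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = graph[cur]
        if not nbrs:
            break
        if len(walk) == 1:
            # First step has no previous node, so sample uniformly.
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for nxt in nbrs:
            if nxt == prev:
                weights.append(1.0 / p)   # return to the previous node
            elif nxt in graph[prev]:
                weights.append(1.0)       # stay close to the previous node
            else:
                weights.append(1.0 / q)   # move further away
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


rng = random.Random(1)
walk = node2vec_walk(graph, "a", length=10, p=1.0, q=0.5, rng=rng)
```

For link prediction, embeddings learned from such walks are usually combined per node pair (for example by an element-wise product) and passed to a classifier, so both the walk parameters and the pair-scoring step can affect the final result.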
