Massive-Data-Mining

Code for the Massive Data Mining course at SJTU.

HW1: Word Count Problem

A basic problem implemented in PySpark. The program reads a large text dataset and uses Map and Reduce to count the occurrences of each word in parallel, which saves time. Additional tasks include finding the most frequently mentioned word, looking up the count of a given word, and counting the words that start with a given letter. Received full marks.
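The map and reduce steps can be sketched in plain Python as below. This is a minimal stand-in, not the homework code: it mirrors the shape of a PySpark `flatMap` / `map` / `reduceByKey` pipeline without needing a Spark installation. The helper names, the tokenizing regex, and the sample lines are assumptions made for illustration, and "words starting with a letter" is read here as total occurrences rather than distinct words.

```python
import re
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word, like flatMap + map in PySpark."""
    for line in lines:
        for word in re.findall(r"[a-z']+", line.lower()):
            yield (word, 1)


def reduce_phase(pairs):
    """Sum the counts per word, like reduceByKey(lambda a, b: a + b)."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def most_frequent(counts):
    """Return the (word, count) pair with the highest count."""
    return max(counts.items(), key=lambda kv: kv[1])


def count_starting_with(counts, letter):
    """Total occurrences of all words that start with `letter`."""
    return sum(n for w, n in counts.items() if w.startswith(letter.lower()))


# Hypothetical sample input standing in for the large text file.
sample = ["the cat sat on the mat", "The dog ate the bone"]
counts = reduce_phase(map_phase(sample))
```

With `counts` built, `counts.get("cat", 0)` answers the single-word lookup, and the two helpers cover the other additional tasks.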

HW2: DGIM and LSH

Implemented DGIM to approximate the number of 1s in a streaming data file and compared the approximate count with the exact one. Implemented LSH to estimate the similarity between documents. The DGIM result is correct. The LSH code runs without errors, but its result is not correct.
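For readers unfamiliar with DGIM, the following is a minimal sketch of the textbook algorithm, not the homework implementation. It keeps buckets of power-of-two sizes, allows at most two buckets per size, and estimates the count by taking every bucket in full except the oldest, which contributes only half its size (rounded up here). The class name, window size, and sample bit pattern are assumptions.

```python
class DGIM:
    """Approximate count of 1s in the last `window` bits of a stream."""

    def __init__(self, window):
        self.window = window
        self.time = 0
        self.buckets = []  # newest first; each bucket is (end_timestamp, size)

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its end timestamp leaves the window.
        while self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.time, 1))
        # Whenever three buckets share a size, merge the two oldest of them.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            end_ts, size = self.buckets[i + 1]
            self.buckets[i + 1:i + 3] = [(end_ts, 2 * size)]
            i += 1

    def count(self):
        """All buckets in full, except the oldest, which counts for half."""
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        return total - self.buckets[-1][1] // 2


def exact_count(bits, window):
    """Exact number of 1s in the last `window` bits, for comparison."""
    return sum(bits[-window:])


# Hypothetical bit stream standing in for the streaming data file.
bits = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1] * 20
sketch = DGIM(window=50)
for b in bits:
    sketch.add(b)
approx, exact = sketch.count(), exact_count(bits, 50)
```

Comparing `approx` with `exact` reproduces the check described above. The estimate is off by at most half the true count, and the number of buckets grows only logarithmically with the window size.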

HW3: PageRank and Node2vec

Written from my own template. PageRank was validated against the built-in function. Node2vec was implemented by following the assignment instructions. Received full marks.
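As an illustration of what a hand-written PageRank might look like, here is a minimal power-iteration sketch. It is not the homework code, and the README does not say which built-in function was used for validation. The damping factor of 0.85, the tolerance, the uniform handling of dangling nodes, and the toy edge list are all assumptions.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=100):
    """Power-iteration PageRank over a list of directed (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if out_links[v]:
                share = damping * rank[v] / len(out_links[v])
                for w in out_links[v]:
                    new_rank[w] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy graph.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
ranks = pagerank(toy_edges)
```

The scores sum to 1, so they can be compared directly with the output of a library implementation that normalizes the same way.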

FINAL: Node classification and link prediction

Node classification

A heterogeneous network, embedded with Metapath2vec (random walk + Node2vec + negative sampling). Placed 3rd in the class Kaggle competition.
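The random-walk stage of Metapath2vec restricts each step to neighbors of the type the metapath expects next. A minimal sketch of that idea follows. It is not the competition code: the node types ("author", "paper"), the metapath, and the toy graph are hypothetical, since the README does not describe the actual network. The resulting walks would then be fed to a skip-gram model trained with negative sampling, which is not shown.

```python
import random


def metapath_walk(neighbors, node_type, start, metapath, walk_length, rng=random):
    """Random walk that follows `metapath` cyclically.

    neighbors: dict mapping node -> list of adjacent nodes
    node_type: dict mapping node -> type label
    metapath:  list of type labels whose first and last entries match,
               so the pattern can repeat (e.g. author-paper-author)
    """
    if node_type[start] != metapath[0]:
        raise ValueError("start node must have the first type of the metapath")
    cycle = len(metapath) - 1
    walk = [start]
    step = 0
    while len(walk) < walk_length:
        next_type = metapath[(step % cycle) + 1]
        candidates = [v for v in neighbors[walk[-1]] if node_type[v] == next_type]
        if not candidates:
            break  # dead end: no neighbor of the required type
        walk.append(rng.choice(candidates))
        step += 1
    return walk


# Hypothetical toy heterogeneous graph: authors a1..a3, papers p1..p2.
toy_neighbors = {
    "a1": ["p1"], "a2": ["p1", "p2"], "a3": ["p2"],
    "p1": ["a1", "a2"], "p2": ["a2", "a3"],
}
toy_types = {"a1": "author", "a2": "author", "a3": "author",
             "p1": "paper", "p2": "paper"}
walk = metapath_walk(toy_neighbors, toy_types, "a1",
                     ["author", "paper", "author"], 7, rng=random.Random(0))
```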

Link prediction

A homogeneous network, embedded with node2vec. The results were poor.
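For context, node2vec builds its embeddings from second-order biased random walks controlled by a return parameter `p` and an in-out parameter `q`. The sketch below shows only that walk step. It is not the project code, and the values of `p` and `q` and the toy graph are assumptions, because the README does not state the settings used.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=random):
    """One node2vec walk over `adj`, a dict mapping node -> set of neighbors."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break
        if len(walk) == 1:
            # First step has no previous node, so it is uniform.
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:
                weights.append(1.0 / p)   # step back to the previous node
            elif x in adj[prev]:
                weights.append(1.0)       # stay at distance 1 from the previous node
            else:
                weights.append(1.0 / q)   # move outward to distance 2
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


# Hypothetical toy undirected graph.
toy_adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1, 4}, 4: {3}}
n2v_walk = node2vec_walk(toy_adj, start=0, length=10, p=0.5, q=2.0,
                         rng=random.Random(42))
```

A small `q` pushes walks outward and a small `p` keeps them local, so these two settings are a natural first thing to revisit when link prediction results are poor.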