Seraen/EE226-Massive-Data-Mining

Massive-Data-Mining

Code for the Massive Data Mining course at SJTU.

HW1: Wordcount Problem

A basic problem implemented in PySpark. The program reads a large text dataset and uses map and reduce operations to count the occurrences of each word in parallel, which saves time. Additional tasks include finding the most frequently mentioned word, looking up the count of a specific word, and counting the words that start with a given letter. Received full marks.
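The map and reduce steps described above can be sketched in plain Python. This is only an illustration of the idea, not the homework code: the function names (`map_phase`, `reduce_phase`, `most_frequent`, `count_starting_with`) are hypothetical, and whitespace tokenization with lowercasing is an assumption. In PySpark the same shape is commonly expressed on an RDD with `flatMap`, `map`, and `reduceByKey`.

```python
from collections import defaultdict


def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        # Assumption: words are separated by whitespace and compared lowercased.
        for word in line.lower().split():
            yield (word, 1)


def reduce_phase(pairs):
    """Reduce step: sum the emitted counts for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def most_frequent(counts):
    """Return the (word, count) pair with the highest count."""
    return max(counts.items(), key=lambda kv: kv[1])


def count_starting_with(counts, letter):
    """Total number of word occurrences that start with the given letter."""
    return sum(n for word, n in counts.items() if word.startswith(letter))


lines = ["the cat sat", "the dog sat on the mat"]
counts = reduce_phase(map_phase(lines))
print(most_frequent(counts))
print(counts.get("sat", 0))
print(count_starting_with(counts, "t"))
```

Keeping the map step as a generator means the full list of pairs is never held in memory, which loosely mirrors how a distributed job streams records between stages.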

HW2: DGIM and LSH

Implemented DGIM to approximate the number of 1s in a streaming data file, and compared the approximate count with the exact one. Implemented LSH to measure the similarity between documents. The DGIM implementation is correct. The LSH code runs without errors, but its results are wrong.
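For reference, a minimal sketch of DGIM bucket maintenance follows. It is not the homework code: the class name `DGIM`, the window size, and the `(timestamp, size)` bucket layout are assumptions made for illustration. The estimate counts every bucket in full except the oldest, of which only half is counted.

```python
class DGIM:
    """Approximate count of 1s in the last `window_size` bits of a stream."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.time = 0
        # Buckets are (timestamp of most recent 1, size), newest first.
        self.buckets = []

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its timestamp leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window_size:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.time, 1))
        # Whenever three buckets share a size, merge the two oldest of them
        # into one bucket of double size; this may cascade to larger sizes.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            newer_ts, size = self.buckets[i + 1]
            self.buckets[i + 1:i + 3] = [(newer_ts, size * 2)]
            i += 1

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # Count only half of the oldest bucket (rounded up for size 1).
        return total - self.buckets[-1][1] // 2


# Compare the approximate count with the exact one on a synthetic stream.
stream = [1 if i % 3 != 0 else 0 for i in range(500)]
dgim = DGIM(window_size=100)
for b in stream:
    dgim.add(b)
exact = sum(stream[-100:])
print(dgim.estimate(), exact)
```

With this bucket scheme the estimate should stay within 50% of the exact count, which is the kind of comparison the homework describes.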

HW3: PageRank and Node2vec

Written from my own template. PageRank was validated against the built-in function. Node2vec was implemented by following the assignment instructions. Received full marks.
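As an illustration of what a hand-written PageRank might look like before it is checked against a library routine, here is a power-iteration sketch. The function name `pagerank`, the damping factor of 0.85, the tolerance, and the uniform redistribution of rank from dangling nodes are all assumptions, not details taken from the homework.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=100):
    """Power-iteration PageRank over a list of directed (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without out-links is spread over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if out_links[v]:
                share = damping * rank[v] / len(out_links[v])
                for dst in out_links[v]:
                    new_rank[dst] += share
        diff = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if diff < tol:
            break
    return rank


cycle = pagerank([("a", "b"), ("b", "c"), ("c", "a")])
star = pagerank([("b", "a"), ("c", "a")])
print(cycle)
print(star)
```

On the three-node cycle every node ends up with the same rank, which makes it a convenient sanity check before comparing against a library result on a larger graph.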

FINAL: Node classification and link prediction

Node classification

A heterogeneous network, handled with Metapath2vec (random walk + Node2vec + negative sampling). Placed 3rd in the class Kaggle competition.
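The random-walk part of a Metapath2vec-style pipeline can be sketched as a walk that only steps to neighbors of the type the metapath asks for next. The node types (`author`, `paper`), the toy graph, and the function name `metapath_walk` are hypothetical and not taken from the competition data.

```python
import random


def metapath_walk(neighbors, node_type, start, metapath, walk_length, rng):
    """Random walk that follows a repeating metapath of node types.

    `metapath` must start and end with the same type, e.g.
    ["author", "paper", "author"], so the pattern can repeat.
    """
    if metapath[0] != metapath[-1]:
        raise ValueError("metapath must start and end with the same type")
    if node_type[start] != metapath[0]:
        raise ValueError("start node does not match the first metapath type")

    cycle = metapath[:-1]  # repeating unit, e.g. ["author", "paper"]
    walk = [start]
    while len(walk) < walk_length:
        wanted = cycle[len(walk) % len(cycle)]
        candidates = [v for v in neighbors[walk[-1]] if node_type[v] == wanted]
        if not candidates:
            break  # dead end: no neighbor of the required type
        walk.append(rng.choice(candidates))
    return walk


# Hypothetical toy graph: two authors, two papers.
node_type = {"a1": "author", "a2": "author", "p1": "paper", "p2": "paper"}
neighbors = {
    "a1": ["p1", "p2"],
    "a2": ["p1", "p2"],
    "p1": ["a1", "a2"],
    "p2": ["a1", "a2"],
}
walk = metapath_walk(neighbors, node_type, "a1",
                     ["author", "paper", "author"], 5, random.Random(0))
print(walk)
```

Walks produced this way would typically be fed to a skip-gram model trained with negative sampling to obtain node embeddings.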

Link prediction

A homogeneous network, handled with node2vec. The results were not good.
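node2vec builds its embeddings from second-order biased random walks. The sketch below shows one such walk under the usual return parameter `p` and in-out parameter `q`. The function name `node2vec_walk`, the toy graph, and the default values are assumptions, and the sketch omits the embedding training and the link scoring steps.

```python
import random


def node2vec_walk(adj, start, walk_length, p=1.0, q=1.0, rng=None):
    """Second-order biased walk on an undirected graph.

    `adj` maps each node to the set of its neighbors.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:
                weights.append(1.0 / p)   # step back to the previous node
            elif x in adj[prev]:
                weights.append(1.0)       # stay at distance 1 from prev
            else:
                weights.append(1.0 / q)   # move away from prev
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


# Hypothetical toy graph: a triangle (0, 1, 2) with a tail 2-3-4.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
walk = node2vec_walk(adj, start=0, walk_length=8, p=2.0, q=0.5,
                     rng=random.Random(1))
print(walk)
```

A small `q` pushes the walk outward and a large `p` discourages immediate backtracking, so tuning these two values is one of the first things to try when link prediction quality is poor.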
