forked from kiranvodrahalli/cos521
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathoutline.txt
90 lines (65 loc) · 3.14 KB
/
outline.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
Things to add:
abstract
clear definition of problem and motivation
adapt proposal -> introduction stuff
background work: cm-sketch, hokusai (algorithm), specific twitter trending work: ?
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6137387&tag=1
(this is the closest we could find)
(to the best of our knowledge, this work has not been done before)
to note: our goal was to make a real time query algorithm that USES
the history (in condensed form) to help understand current time
we're remembering the past with less accuracy to save space (realistic model)
in order to get good estimates of present trending
this means we are not trying to query the past the way Hokusai is.
why we don't use raw frequency counts for the past hour:
there are some hashtags that are constantly present (for example, #porn, #gamesinsight, #teamfollowback (a way to get Twitter followers fast))
these should not be counted as 'trending', so we need to filter these out
with the history. what remains is 'what is trending'
how did we approach it
overview
data structures
description
-- fib heap
-- queue
-- frequency hash table
-- History (own subsection)
algorithms
description
- make sure changes in implementation are reflected in algorithm
-- querying the (History)
-- updating the data structure
------ adding to the data structure
----------- frequency hash table update/ heap update / (update history every time block - mention why we can add, etc, some of the theory from the Hokusai paper)
------ removing from the data structure (our snapshot of the present)
----------- every time unit of the present, we remove outdated frequencies
-- top k hashtags (real time)
runtime analysis
space analysis
accuracy analysis
-- key function for heap discussion
-- a guiding metric in our choice of heap key function was to
get some spread, i.e. not everythign was 1.0 in priority.
(rationale: if everything is 1.0, everything is trending, that's no good)
-- method of choosing, analysis (talk about in design choices as well)
-- freq/ history
-- freq/ (history)^p: p = 1.3? : this is not good, everything goes to 1
-- different parameter choices affect error: more error makes more random -> more spread
-- the space-saving actually has the side-effect of adding some smoothing error, which made
the increased space efficiency actually beneficial for the trend-detection application
-- (with higher error it worked better; when we got more precise, it worked worse)
design choices for data structure
-- data structure modification (History vs Hokusai)
-- fibonacci heap
-- heap key function
justification
results
naive algorithm serves as comparison
description of naive
the results
figure.jpg: September 24th, 2014, 6:00 Leftmost two Japanese hashtags are for 2pm and their album Go Crazy (apparently a remix version released on the 24th)
figure2.jpg: September 29 2014, 3:00 See stuff about Snowden and security trending, along with ferguson, ebola, cowboys, and manchester united
figure3.jpg doesn't have much, forgot why I included it
figure2 is probably the most interesting of the three
future work
references
http://newsoffice.mit.edu/2012/predicting-twitter-trending-topics-1101