Contains the TribeFlow (previously node-sherlock) source code.
The python dependencies are:
- Mpi4Py
- numpy
- scipy
- cython
- pandas
- plac
You will also need to install and setup:
- OpenMP
Easy way: Install Anaconda Python and set it up as your default environment.
Hard way: Use pip or your package manager to install the dependencies.
pip install numpy
pip install scipy
pip install cython
pip install pandas
pip install mpi4py
pip install plac
Use your package manager (apt on Ubuntu, Homebrew on a Mac) to install OpenMP and MPI. These are the managers I tested with; other environments should work as well.
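After installing, a quick way to confirm that the Python dependencies are visible to your interpreter is a small check like the one below. This is only a sketch: the function name missing_dependencies is made up for this example, and the check covers the Python packages only, not OpenMP or the MPI runtime.

```python
import importlib.util

# Import names of the Python dependencies listed above.
# Assumption: Mpi4Py is imported as "mpi4py" and cython as "Cython".
DEPENDENCIES = ["mpi4py", "numpy", "scipy", "Cython", "pandas", "plac"]


def missing_dependencies(names=DEPENDENCIES):
    """Return the names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All Python dependencies were found.")
```

An empty result only means the packages are importable; it does not verify that mpi4py was built against a working MPI installation.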
Simply type make
Either use python install
to install the package or just use it from
the package folder using the
How to parse datasets: Use the scripts/
script. It has a help option (-h).
For command line help:
$ python scripts/ -h
$ python -h
Running with mpi
$ mpiexec -np 4 python [OPTIONS]
Running TribeFlow from other python code:
Check the file
Converting the Trace
Let's assume we have a trace like the Last.FM trace from Oscar Celma. In this example, each line is of the form:
userid \t timestamp \t musicbrainz-artist-id \t artist-name \t
musicbrainz-track-id \t track-name
For instance:
user_000001 2009-05-01T09:17:36Z c74ee320-1daa-43e6-89ee-f71070ee9e8f
Impossible Beings 952f360d-d678-40b2-8a64-18b4fa4c5f8Dois Pólos
First, we want to convert this file to our input format. We do this with the
script. Let's have a look at the options from
this script:
$ python scripts/ -h
usage: [-h] [-d DELIMITER] [-l LOOPS] [-r SORT] [-f FMT]
original_trace tstamp_column hypernode_column
positional arguments:
original_trace The name of the original trace
tstamp_column The column of the time stamp
hypernode_column The column of the time hypernode
obj_node_column The column of the object node
optional arguments:
-h, --help show this help message and exit
-d DELIMITER, --delimiter DELIMITER
The delimiter
-l LOOPS, --loops LOOPS
Consider loops
-r SORT, --sort SORT Sort the trace
-f FMT, --fmt FMT The format of the date in the trace
-s SCALE, --scale SCALE
Scale the time by this value
-k SKIP_HEADER, --skip_header SKIP_HEADER
Skip these first k lines
-m MEM_SIZE, --mem_size MEM_SIZE
Memory Size (the markov order is m - 1)
The positional (required) arguments are:
- original_trace is the input file
- hypernode_column represents the users (called hypernodes since it can be playlists as well)
- tstamp_column the column of the time stamp
- obj_node_column the objects of interest
We can convert the file with the following line:
python scripts/ scripts/test_parser.dat 1 0 2 -d$'\t' \
-f'%Y-%m-%dT%H:%M:%SZ' > trace.dat
Here, we are saying that column 1 holds the timestamps, column 0 the users, and column 2 the
objects (artist ids). The delimiter -d is a tab. The timestamp format -f is
%Y-%m-%dT%H:%M:%SZ.
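To make the column choices concrete, here is a minimal sketch of how one line of the Last.FM trace splits into timestamp, user and object. It is not the parser script itself and does not reproduce its output format; parse_line and its keyword arguments are names invented for this illustration.

```python
from datetime import datetime, timezone

# Same format string that is passed to -f in the command above.
TSTAMP_FMT = "%Y-%m-%dT%H:%M:%SZ"


def parse_line(line, tstamp_col=1, hypernode_col=0, obj_col=2,
               delimiter="\t", fmt=TSTAMP_FMT):
    """Return (seconds since the epoch, hypernode, object) for one trace line."""
    fields = line.rstrip("\n").split(delimiter)
    # The trailing Z marks UTC, so attach the UTC timezone explicitly.
    when = datetime.strptime(fields[tstamp_col], fmt).replace(tzinfo=timezone.utc)
    return when.timestamp(), fields[hypernode_col], fields[obj_col]


if __name__ == "__main__":
    example = "\t".join([
        "user_000001",
        "2009-05-01T09:17:36Z",
        "c74ee320-1daa-43e6-89ee-f71070ee9e8f",
        "Impossible Beings",
    ])
    print(parse_line(example))
```

Changing obj_col to another column (for instance the track id) would model transitions between tracks instead of artists.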
Adding memory
Use the -m argument to increase the burst (B parameter in the paper) size.
python scripts/ scripts/test_parser.dat 1 0 2 -d$'\t' \
-f'%Y-%m-%dT%H:%M:%SZ' -m 3 > trace.dat
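As a rough illustration of what a memory size of m means, the sketch below groups each user's events into windows of m consecutive objects, so the last object in a window can be conditioned on the m - 1 before it. This is an assumption about the general idea only; the layout the parser script actually writes may differ, and memory_windows is a hypothetical helper.

```python
from collections import defaultdict, deque


def memory_windows(events, mem_size):
    """Yield (user, window) pairs, where window holds mem_size consecutive objects.

    events is an iterable of (user, object) pairs already sorted by time.
    """
    # One bounded queue per user keeps only the most recent mem_size objects.
    recent = defaultdict(lambda: deque(maxlen=mem_size))
    for user, obj in events:
        window = recent[user]
        window.append(obj)
        if len(window) == mem_size:
            yield user, tuple(window)


if __name__ == "__main__":
    trace = [("u", "a"), ("u", "b"), ("v", "x"), ("u", "c"), ("u", "d")]
    for user, window in memory_windows(trace, 3):
        print(user, window)
```

Users with fewer than m events produce no window in this sketch.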
Learning the Model
The example below runs TribeFlow with the same options used for every result in the paper. The parameters are:
- -np 20 Number of cores for execution.
- 100 topics.
- output.h5 model file.
- --kernel eccdf The kernel heuristic for inter-event time estimation. ECCDF-based, as described in the paper. A Student's t kernel is also available.
- --residency_priors 1 99 The priors for the inter-event time estimation.
- --leaveout 0.3 Fraction of transitions to leave out.
- --num_iter 2000 Number of iterations.
- --num_batches 20 Number of split/merge moves.
The example below uses 20 cores:
$ mpiexec -np 20 python trace.dat 100 output.h5 \
--kernel eccdf --residency_priors 1 99 \
--leaveout 0.3 --num_iter 2000 --num_batches 20
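For readers unfamiliar with the term, the eccdf kernel above is named after the empirical complementary CDF: the fraction of observed values that exceed a given value. The sketch below computes only that quantity for a list of inter-event times. How TribeFlow combines it with the residency priors is described in the paper and is not reproduced here; the function name eccdf is chosen for this example.

```python
from bisect import bisect_right


def eccdf(samples):
    """Return a function x -> fraction of samples strictly greater than x."""
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one sample")

    def survival(x):
        # bisect_right counts how many samples are <= x.
        return (n - bisect_right(ordered, x)) / n

    return survival


if __name__ == "__main__":
    inter_event_times = [1, 2, 2, 5]
    survival = eccdf(inter_event_times)
    for x in (0, 1, 2, 5):
        print(x, survival(x))
```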
The mean reciprocal rank script will generate predictions and save them to the given files. Just run:
$ PYTHONPATH=. python scripts/ output.h5 rss.dat predictions.dat
Here, output.h5 is the trained model.
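As a reminder of what the metric measures, here is a minimal sketch of mean reciprocal rank: for each held-out transition, take 1 / rank of the true object in the predicted ranking, then average. This is the textbook definition, not the code of the script above, and the layout of rss.dat and predictions.dat is not assumed here.

```python
def mean_reciprocal_rank(rankings, true_items):
    """Average of 1 / rank of the true item; items missing from a ranking count as 0."""
    if len(rankings) != len(true_items):
        raise ValueError("rankings and true_items must have the same length")
    if not true_items:
        raise ValueError("need at least one prediction")
    total = 0.0
    for ranking, truth in zip(rankings, true_items):
        if truth in ranking:
            # Ranks start at 1, list positions at 0.
            total += 1.0 / (ranking.index(truth) + 1)
    return total / len(true_items)


if __name__ == "__main__":
    rankings = [["a", "b", "c"], ["b", "a"], ["x"]]
    truths = ["a", "a", "z"]
    print(mean_reciprocal_rank(rankings, truths))
```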
Other useful scripts
Similar to the script above, you can use the scripts:
to print a summary of the topics with most likely
to print either an O by O matrix or a Z by Z
to generate the Z by Z matrix in the ISMIR jazz
to generate the Miles Davis plot in the ISMIR jazz paper
Below we have the list of datasets explored in the paper. We also curated links to various other timestamped datasets that can be exploited by TribeFlow and future efforts.
Datasets used in the paper:
- LastFM-1k
- LastFM-Our
- FourSQ This dataset was removed from the original website. Still available on archive. Other, more recent, FourSQ datasets are available. See below.
- Brightkite
- Yes
List of other, some more recent, datasets that can be explored by TribeFlow.
Basically, anything with users (playlists, actors, etc also work), objects and timestamps.
In the example
folder we have some sub-sampled datasets that can be used to
better understand the method.
The current version of the code may not be the exact version used in any of the papers that employ TribeFlow. However, and most importantly, I am tagging the commits closest to each paper. Please check the tags if you want to run the exact version of TribeFlow used in a given paper.
- HMMs - or any other implementation
- Gravity Model
- StagesModel