Feature/aaron #2

Open
wants to merge 5 commits into
base: master
36 changes: 29 additions & 7 deletions doc/bmc_article/bmc_article.tex
@@ -123,7 +123,7 @@
\email{Rutger A Vos\correspondingauthor - [email protected]}%
\and
Aaron Steele\correspondingauthor$^2$%
\email{Aaron Steele\correspondingauthor - [email protected]}
\email{Aaron Steele\correspondingauthor - [email protected]}
}


@@ -135,7 +135,7 @@

\address{%
\iid(1)Naturalis Biodiversity Center, Einsteinweg 2, Leiden, the Netherlands\\
\iid(2)UC Berkeley, Berkeley, USA
\iid(2)University of California Berkeley, Berkeley, USA
}%

\maketitle
@@ -329,7 +329,7 @@ \section*{Results and Discussion}
need to be scalars we concatenate the keys with | and the values with ,
(for example). Here's the result we would then emit:

A => 1,1 # the first integer is the node ID, the second its tip count
A => 1,1 % the first integer is the node ID, the second its tip count
C => 2,1
A|C => 3,2
A|C => 4,2
@@ -363,10 +363,32 @@ \section*{Results and Discussion}
- performance

% this describes at a high level Aaron's code
\subsection*{Name of the Clojure implementation}
- using the clojure implementation
- web front-end
- performance
\subsection*{Clojure}

Our implementation is written in Clojure, a dynamic programming language that
compiles to bytecode and runs on the Java Virtual Machine. Clojure can call
Java frameworks such as Apache Hadoop directly, which makes it well suited to
implementing distributed MapReduce algorithms. The implementation also builds
on Cascalog, a Clojure data processing library for querying ``Big Data'' on
Hadoop, either on a cluster or on a local machine from the interactive
Clojure REPL.

\subsubsection*{Implementation details}

As input, our implementation takes two files: the phylogenetic tree, which
has been transformed and labelled in a post-order traversal from the tips to
the root, and a file containing the tips from which to prune. The output is
the taxon bipartition table described above. The algorithm first maps each
node to its tip, then combines and merges the resulting tips to create the
final bipartition table. The MapReduce job can be launched from the command
line on a Hadoop cluster or interactively from the REPL.
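
The map and merge steps described above can be sketched in plain Python. This is only an illustrative sketch, not the Clojure/Cascalog code itself. It assumes the tree is available as a child-to-parent map of post-order node IDs and that the second input file lists the tips to retain. The names \texttt{PARENT}, \texttt{TIPS}, \texttt{map\_tip}, \texttt{reduce\_node} and \texttt{bipartition\_table} are hypothetical, and the toy tree ((A,B),(C,D)) is invented for the example.

```python
from collections import defaultdict


def map_tip(tip_id, name, parent):
    """Map step: emit (node_id, tip_name) for a tip and each of its ancestors."""
    node = tip_id
    while node is not None:
        yield node, name
        node = parent.get(node)  # None once we step past the root


def reduce_node(node_id, names):
    """Reduce step: join the tip names with | and the values with a comma."""
    key = "|".join(sorted(names))
    return key, f"{node_id},{len(names)}"


def bipartition_table(parent, tips, keep):
    """Group the mapped pairs by node ID, then reduce each group to one row."""
    grouped = defaultdict(set)
    for tip_id, name in tips.items():
        if name in keep:  # assumption: 'keep' holds the tips to retain
            for node_id, tip_name in map_tip(tip_id, name, parent):
                grouped[node_id].add(tip_name)
    return [reduce_node(n, grouped[n]) for n in sorted(grouped)]


# Hypothetical toy tree ((A,B),(C,D)) with post-order node IDs; node 7 is the root.
PARENT = {1: 3, 2: 3, 3: 7, 4: 6, 5: 6, 6: 7}
TIPS = {1: "A", 2: "B", 4: "C", 5: "D"}

for key, value in bipartition_table(PARENT, TIPS, keep={"A", "C", "D"}):
    print(f"{key} => {value}")
```

With tip B left out, the sketch emits one row per remaining node, in the same \texttt{key => node ID,tip count} form as the example earlier in this section. On Hadoop the grouping by node ID would be done by the shuffle phase rather than by an in-memory dictionary.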

\subsubsection*{Runtime performance}

Here, give a brief overview of how performance improves as the input data
gets larger, since the Hadoop overhead is then eclipsed. Also mention that
combining other sources of Big Data, such as spatial data via GADM native
Java bindings, taxonomy synonyms, etc., could be done much faster than with
serial methods.

%%%%%%%%%%%%%%%%%%%%%%
\section*{Conclusions}