-
Notifications
You must be signed in to change notification settings - Fork 2
Home
Exelixi is a distributed framework for running genetic algorithms at scale. The framework is based on Apache Mesos and the code is mostly implemented in Python.
Why build yet another framework for this purpose? Apache Hadoop would be quite a poor fit, due to requirements for in-memory iteration. Apache Spark could fit the problem more closely, in terms of iterative tasks. However, task overhead can become high in proportion to tasks being performed ("small file problem"), plus there is a considerable amount of configuration required at scale. Server-side operations and coprocessors in Apache Cassandra or Apache HBase might also provide a good fit for GA processing, but both of those also require lots of configuration. Moreover, many of the features for these more heavyweight frameworks are not needed, plus it helps to have some lightweight messaging available among the shards -- which these existing frameworks lack.
On the one hand, Exelixi provides the basis for a tutorial for building distributed frameworks in Apache Mesos. On the other hand, it provides a general-purpose GA platform that emphasizes scalability and fault tolerance, while leveraging the wealth of available Python analytics packages.
More details are given below about customizing this framework for solving specific GA problems. The following instructions will help you get started quickly, running Exelixi either on Apache Mesos or in standalone mode.
- Apache Mesos 0.14.0 rc4
- Python version 2.7, with Anaconda as the recommended platform
- Python Setuptools
- Python Protobuf
- Python Gevent
Usage for Apache Mesos launch
First, launch an Apache Mesos cluster. The following instructions are based on using the Elastic Mesos service, which uses Ubuntu Linux servers running on Amazon AWS. However, the basic outline of steps should apply in the general case.
Once you have confirmation that your cluster is running --
Elastic Mesos sends you an email messages with a list of masters and slaves --
then use ssh
to login on any of the masters:
ssh -A -l ubuntu <master-public-ip>
You must install the Python bindings for Apache Mesos, In this instance the Apache Mesos version is 0.14.0-rc4, so you must install the Python egg for that exact release. Also, you need to install the Exelixi source.
On the master, download the master
branch of the Exelixi code repo on GitHub and install the required libraries:
wget https://github.com/ceteri/exelixi/archive/master.zip ; \
unzip master.zip ; \
cd exelixi-master ; \
./bin/local_install.sh
You can test the installation at any point simply by attempting to import the mesos
package into Python:
python -c 'import mesos'
If there is no ImportError
exception thrown, then your installation should be complete.
Next, run the installation commands on each of the slaves:
python ./src/exelixi.py -n localhost:5050 | ./bin/install.sh
Great, ready to roll! Now launch the Framework, which in turn launches the Executors remotely on slave nodes. In the following case, it runs on two slave nodes:
python ./src/exelixi.py -m localhost:5050 -e 2
If everything runs successfully, the log should conclude with a final line:
all tasks done, and all messages received; exiting
See a GitHub gist for an example of a successful run.
To get started quickly on a single node (i.e., your laptop) in standalone mode simply follow two steps. First, launch one Executor locally:
nohup ./src/exelixi.py -p 9311 &
Then launch a Framework to run the default GA as an example:
./src/exelixi.py -s localhost:9311
Note that there are trade-offs regarding standalone mode. Pros: simple to test the customization of a GA quickly, without requiring an Apache Mesos cluster. Cons: difficult to configure and manage a GA in production at scale, since it lacks features for security and fault tolerance.
In general a GA is a search heuristic that mimics the process of natural selection in biological evolution. This approach is used to generate candidate solutions for optimization and search problems, especially in cases where the parameter space is large and complex. Note that genetic algorithms belong to a larger class of evolutionary algorithms, and have an important sub-class of genetic programming which is used to synthesize computer programs that perform a user-defined task.
Effectively, a GA can be applied for partial automation of "think out of the box" ideation in preliminary design. While the candidate solutions obtained from a GA may not be used directly, they inform domain experts how to derive novel designs from first principles, thereby accelerating design iterations substantially. In terms of relationship to machine learning, this approach approximates a stochastic gradient descent where the parameter space is quite large and a differentiable objective function may not be feasible.
In a GA, a Population of candidate solutions (called Individuals) to an optimization problem gets evolved toward improved solutions. Each candidate solution has a feature set -- i.e., its "chromosomes", if you will -- which can be altered and recominbed. The fitness for each Individual gets evaluated (or approximated) using a fitness function applied to its feature set.
Evolution starts with a set of randomly generated Individuals, then iterates through successive generations. A stochastic process called selection preserves the Individuals with better fitness as parents for the next generation. Some get randomly altered, based on a mutation operation. Pairs of parents selected at random (with replacement) from the Population are used to "breed" new Individuals, based on a crossover operation.
The algorithm terminates when the Population reaches some user-defined condition. For example:
- acceptable fitness for some Individual
- threshold aggregate error for the Population overall
- maximum number of generations iterated
- maximum number of Individuals evalutated
FeatureFactory: a base class for configuration and customization of the GA problem to be solved, which generates and evaluates feature sets
Individual: an candidate solution, represented by a feature set plus a fitness value obtained by applying a fitness function to that feature set
Population: a collection of Individuals, which in turn breed other Individuals
FossilRecord: an archive of Individuals that did not survive, persisted to durable storage and used to limit ergodic behaviors in search -- and also used for analysis after an algorithm run terminates
Executor: a service running on a slave node in the cluster, responsible for computing shards of the Population
Framework: a long-running process that maintains state for the system parameters and models parameters, obtains resources for the Executors, coordinates Executors through successive generations, and reports results; also handles all of the user interaction
To implement a GA in Exelixi,
subclass the FeatureFactory class (copy/edit src/run.py
) to customize the following operations:
- handle serializing/deserializing a feature set
- randomly generate a feature set
- calculate (or approximate) a fitness function
- mutate a feature set
- crossover a pair of parents to produce a child
- test the terminating condition
Then customize the model parameters:
- n_gen: maximum number of generations
- n_pop: maximum number of "live" Individuals at any point
- max_pop: maximum number of Individuals explored in the feature space during an algorithm run
- term_limit: a threshold used for testing the terminating condition
- hist_granularity: number of decimal places in fitness values used to construct the fitness histogram
- selection_rate: fraction of "most fit" Individuals selected as parents in each generation
- mutation_rate: random variable for applying mutation to an Individual retained for diversity
In general, the other classes cover most use cases and rarely need modifications.
An Individual represents a candidate solution.
Individuals get persisted in durable storage as key/value pairs.
The value consists of a tuple [fitness value, generation, feature set]
and a unique key is constructed from a SHA-3 digest of the JSON representing the feature set.
Let's consider how to persist an Individual in the Fossil Record given:
- a UUID as a job's unique prefix, e.g.,
048e9fae50c311e3a4cd542696d7e175
- a unique key, e.g.,
BC19234D
- a fitness value, e.g.,
0.5654
- a generation number, e.g.,
231
- JSON representing a feature set, e.g.,
(1, 5, 2)
In that case, the Individual would be represented in tab-separated format (TSV) as the pair:
hdfs://048e9fae50c311e3a4cd542696d7e175/0b799066c39a673d84133a484c2bf9a6b55eae320e33e0cc7a4ade49, [0.5654, 231, [1, 5, 2]]
A Framework is a long-running process that:
- parses command-line options from the user
- n_exe: number of allocated Executors
- exe_url: download URL for Executor tarball (customized Python classes)
- generates a UUID for each attempted algorithm run
- maintains operational state (e.g., system parameters) in Zookeeper
- list of Executor endpoints from Apache Mesos
- prefix: unique directory prefix in HDFS based on generated UUID
- current_gen: current generation count
- receives logical state (e.g., model parameters) from customized Python classes
- initializes the pool of Executors
- iterates through the phases of each generation (selection/mutation, breeding, evaluation)
- restores state for itself or for any Executor after a failure
- enumerates results at any point -- including final results after an algorithm run terminates
Resources allocated for each Executor must be sufficient to support a Population shard of n_pop / n_exe Individuals.
An Executor is a service running on a Apache Mesos slave that:
- maintains operational state (e.g., system parameters) in memory
- implements an in-memory distributed cache backed by HDFS (with write-behinds and checkpoints)
- provides a lookup service for past/present Individuals in the feature space via a bloom filter
- generates a shard as a pool of "live" Individuals at initialization or recovery
- maintains a shard of "live" Individuals in memory
- enumerates the Individuals in the shard of the Population at any point
- calculates a partial histogram for the distribution of fitness
- shuffles the local Population among neighboring Executors via a hash ring
- applies a filter to "live" Individuals to select parents for the next generation
- handles mutation, breeding, and evaluation of "live" Individuals
- persists serialized Individuals to durable storage (write-behinds)
- recovers a shard from the last checkpoint, after failures
The lookup service which implements the distributed hash table in turn leverages a hash ring to distribute Individuals among neighboring shards of the Population and a bloom filter for a fast, space-efficient, probabilistic set membership function which has no false negatives but allows rare false positives. The hash ring helps to "shuffle" the genes of the Population among different shards, to enhance the breeding pair selection. In essence, this aspect allows for GA problems to scale-out horizontally. The bloom filter introduces a small rate of false positives in the Individual lookup (data loss), as a trade-off for large performance gains. This also forces a pre-defined limit on the number of Individuals explored in the feature space during an algorithm run.
REST services and internal tasks for the Executors are implement using gevents, a coroutine library that provides concurrency on top of libevent.
Effectively, a GA implements a stochastic process over a content addressable memory, to optimize a non-convex search space. Given use of HDFS for distributed storage, then much of the architecture resembles a distributed hash table which tolerates data loss.
Note that feature set serialization (key construction) only needs to be performed once per Individual, and the fitness function calculation only needs to be performed on "live" Individuals in the current Population. Consequently, if mutation is considered as "replacement", then there is a limited amount of mutable state in the Individuals. This allows for some measure of idempotence in the overall data collection, e.g., append-only updates to HDFS, which can be used to reconstruct state following a node or process failure.
Also, the algorithm is tolerant of factors that often hinder distributed systems:
- eventual consistency in the durable storage
- data loss of partial solutions, e.g., a bloom filter false positive, or when an Executor fails, etc.
In the latter case when an Executor process is lost, the Framework can simply launch another Executor on the cluster (via Apache Mesos) and have it restore its shard of Individuals from its last good checkpoint. In general, limited amounts of data loss serve to add stochastic aspects to the search, and may help accelerate evolution.
Heartfelt kudos to Bill Worzel, Niklas Nielsen, Jason Dusek, Alek Storm, Erich Nachbar, Tobi Knaup, Flo Leibert.
- 2013-11-23 first successful launch of customized framework/executor on Elastic Mesos
- 2013-11-21 running one master/one slave only (e.g., on a laptop)
- 2013-11-26 integrate the required Apache Mesos methods to launch/manage a remote service
- articulate all of the REST endpoint services
- handle remote reify for Individuals
- support for multiple Executors in the hash ring
- shard checkpoint to HDFS
- shard recovery from HDFS
- saving/recovering Framework state in Zookeeper
- optimize bloom filter settings as a function of the max_pop and n_exe parameters
-
Makefile
to build tarball for Executor downloads - automate Anaconda installations on the cluster