Skip to content

Latest commit

 

History

History
executable file
·
144 lines (94 loc) · 7.03 KB

README.adoc

File metadata and controls

executable file
·
144 lines (94 loc) · 7.03 KB

fcmaesray - a multi-node Python 3 gradient-free optimization library

fcmaesray extends fcmaes for multi-node execution. It uses the cluster support of the ray library.

Features

  • fcmaesray is focused on optimization problems hard to solve.

  • Using a cluster the load computing expensive objective functions involving simulations or solving differential equations can be distributed.

  • During optimization good solutions are transfered between nodes so that information is shared.

  • A single ray actor executes the fcmaes coordinated retry on each node.

  • Local node interprocess communication is shared memory based as in fcmaes which is faster than using ray for this purpose.

  • Minimal message transfer overhead, only local improvements are broadcasted to the other nodes.

  • Alternatively you can simultaneously optimize a list of problem variants to find out which variant delivers the best solution. For instance the TandEM problem where the variants correspond to planet sequences. Bad variants are successively filtered out. The variants are distributed over all available nodes.

clustered retry

AWS, Azure, GCP and Kubernetes

To run fcmaesray on a cloud cluster follow the corresponding ray instructions:

  • Adapt the ray docker creation script (Create Docker Image) to include fcmaesray.

  • Check Launching Cloud Clusters how to launch cloud clusters. You should adapt the cloud configuration to use a fixed number of nodes (no autoscaling).

Optimization algorithms

See fcmaes. Default algorithm is a sequence of a random choice of state of the art differential evolution variants including GCL-DE from Mingcheng Zuo and CMA-ES all implemented in C++. Other algorithms from scipy and NLopt can be used and arbitrary choice/sequence expressions are supported.

Installation

  • pip install fcmaesray

Since ray doesn’t support Windows, use the single node fcmaes if you are on Windows or use the

The Linux subsystem can read/write NTFS, so you can do your development on a NTFS partition. Just the Python call is routed to Linux.

Usage

Usage is similar to scipy.optimize.minimize.

Local cluster multi-node coordinated parallel retry

To start a ray cluster manually:

  • Make sure fcmaesray is installed at each node ('pip install fcmaesray')

  • Additionally your objective function needs to be installed on each node.

  • Then call 'ray start --head --num-cpus=1' on the head node.

Make sure to replace <address> with the value printed by the command on the head node. <address> should look like '192.168.0.67:6379'. On all worker nodes execute.

  • 'ray start --address=<address> --num-cpus=1'.

  • adapt <address> also in the ray.init command in your code.

We use "--num-cpus=1" because only one actor is deployed at each node controlling the local coordinated retry. This uses (as for single node fcmaes) shared memory to exchange solutions between local processes. So ray needs only one CPU per node. Your code looks like:

import ray
from fcmaes.optimizer import logger
from fcmaesray import rayretry

ray.init(address = "<address>")
# ray.init() # for single node tests on your laptop

ret = rayretry.minimize(fun, bounds, logger=logger())

rayretry.minimize has many parameters for fine tuning, but in most of the cases the default settings work well. See rayexamples.py for more example code.

raytandem.py shows how multiple variants of the same problem can be distributed over multiple nodes. MULTI explains how the TandEM multi-problem can be solved utilizing all cluster nodes.

Log output of the parallel retry

The log output of the coordinated parallel retry is compatible to the single node execution log and contains the following rows:

  • time (in sec)

  • evaluations / sec (not supported for multi-node execution)

  • number of retries - optimization runs (not supported for multi-node execution)

  • total number of evaluations in all retries (not supported for multi-node execution)

  • best value found so far

  • worst value in the retry store

  • number of entries in the retry store

  • list of the best 20 function values in the retry store

  • best solution (x-vector) found so far

The master node retry store shown only contains solution improvements exchanged by the worker nodes.

Dependencies

Runtime:

Optional dependencies:

  • NLopt: NLopt. Install with 'pip install nlopt'.

Performance

On a five node (4 x AMD 3950x + 1 x AMD 2990WX) local CPU cluster using Anaconda 2020.2 for Linux, ray version 0.8.6 and fcmaes version 1.1.12 the parallel coordinated retry mechanism solves ESAs 26-dimensional Messenger full problem in about 20 minutes on average.

The Messenger full benchmark models a multi-gravity assist interplanetary space mission from Earth to Mercury. In 2009 the first good solution (6.9 km/s) was submitted. It took more than five years to reach 1.959 km/s and three more years until 2017 to find the optimum 1.958 km/s. The picture below shows the progress of the whole science community since 2009:

Fsc

The following picture shows the best score reached over time for 40 runs limited to 1500 sec using the five node cluster above:

multi node coordinated parallel retry6

33 out of these 40 runs reached a score ⇐ 2.0, 7 needed more than 1500 sec:

multi node coordinated parallel retry2

To reproduce execute rayexamples.py on a similar cluster.

For comparison: MXHCP paper shows that using 1000 cores of the the Hokudai Supercomputer using Intel Xeon Gold 6148 CPU’s with a clock rate of 2.7 GHz Messenger Full can be solved in about 1 hour using the MXHCP algorithm.