We implemented code to perform distributed GLM at https://github.com/OpenChai/BIDMach_OC/tree/max/glm. On this branch, the Master actively collects modelmats from the Workers rather than collecting on a timeout. The Master should also stop automatically once the Workers have finished all passes over their data.
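For context, by "actively collect" we mean a pull loop on the Master roughly like the sketch below. This is a minimal illustration with hypothetical names (WorkerHandle, donePasses, fetchModelmat), not the actual BIDMach_OC code:

trait WorkerHandle {
  def donePasses: Boolean // has this Worker finished all npasses?
  def fetchModelmat(): Array[Float] // pull the Worker's current model
}

def collectLoop(workers: Seq[WorkerHandle], reduce: Seq[Array[Float]] => Unit): Unit = {
  // Keep pulling models until every Worker reports completion,
  // instead of collecting on a fixed timeout.
  while (!workers.forall(_.donePasses)) {
    reduce(workers.map(_.fetchModelmat()))
  }
}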
Our issue is that our Workers run out of memory when processing the entire Criteo dataset (about 12GB uncompressed), causing them to crash. To confirm this, we ssh into a worker while it's running and monitor the machine's memory using vmstat. Available memory steadily decreases to zero (even across multiple passes) until the Worker crashes.
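For reference, this is roughly how we watch memory on a worker (standard vmstat usage; the hostname placeholder and 5-second interval are arbitrary):

ssh <worker-host>
vmstat -S M 5   # print memory stats in MB every 5 seconds; watch the "free" column steadily drop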
We preprocess the dataset by splitting it into 91 parts, then run distributed GLM on 80 of those parts for 10 passes. We're currently testing this with the 4 workers of bidcluster3, so each worker i handles a subset of 20 parts. (Parameters for worker i: nstart=i*20, nend=(i+1)*20, npasses=10)
In an attempt to find the potential memory leak, we used jmap -dump to dump the heap of a running Worker process. It appears that the class FMat occupies the most space, with many references from scala.reflect.io.FileZipArchive$FileEntry$1. This leads us to suspect that memory isn't being properly reused/recycled between successive passes over the data.
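For reference, the dump can be taken with a standard jmap invocation along these lines (the live flag and PID placeholder here are ours):

jmap -dump:live,format=b,file=criteo_dump.bin <worker-pid>   # find the Worker's PID with jps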
The Java heap dump is on bidcluster3-slave-i-afd73976 at /home/ec2-user/criteo_dump.bin; you can spin up a server to view the heap dump with jhat -J-d64 -J-mx12g criteo_dump.bin (jhat serves on http://localhost:7000 by default). From there, scroll down to the histogram of memory usage by class and browse class references.
Steps to run distributed GLM on Criteo and to reproduce this issue:
Boot all bidcluster3 machines. ssh into the master machine.
Navigate to /opt/BIDMach_Spark and run python scripts/start_cluster.py
Navigate to /opt/BIDMach and check out the max/glm git branch. Then run:
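A hypothetical sketch of those two commands (the worker-launch script name is a placeholder, not necessarily the real one):

./scripts/start_workers.sh   # hypothetical helper: starts the BIDMach Worker daemon on each worker
./bidmach                    # BIDMach launcher: drops you into a Scala shell on the Master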
The first line will start a BIDMach Worker daemon ready to run GLM on each worker. The second line drops you into a Scala shell for the Master.
In the Master Scala shell, paste and run the following commands:
m.parCall((w) => {
  // Assign each Worker a contiguous range of dataset parts based on its machine index.
  val totalNumSamples = 80;
  val numWorkers = 4; // TODO: don't hardcode this
  val workerNumSamples = totalNumSamples / numWorkers;
  val nstart = w.imach * workerNumSamples;
  val nend = Math.min((w.imach + 1) * workerNumSamples, totalNumSamples);
  val fgopts = w.learner.opts.asInstanceOf[GLM.FGOptions];
  fgopts.nstart = nstart;
  fgopts.nend = nend;
  // Return a summary string so the Master can print each Worker's assignment.
  "imach: %d, nstart: %d, nend: %d, npasses: %d\n" format (w.imach, nstart, nend, w.learner.opts.npasses)
})
m.parCall((w) => { w.learner.paused = true; "done"})
m.parCall((w) => { w.learner.train; "not reached" }) // this will hang, just wait for it to timeout
m.startLearners
m.startUpdates
The first command configures each Worker to process only its subset of the Criteo dataset. The next two commands start the Learner on each Worker in a paused state. The last two commands cause the Master to signal each Worker to unpause its Learner and begin mixing modelmats (through Kylix).
This process works successfully on smaller datasets but seems to break with the 12GB Criteo dataset. Normally, the Master should stop automatically once each Worker has completed its passes, but after several minutes of running on Criteo, the Master stops getting responses from its Workers, indicating that the Workers have run out of memory.
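A quick way to distinguish JVM-heap growth from native (off-heap) growth while the job runs is standard jstat monitoring (the PID and interval below are placeholders):

jstat -gcutil <worker-pid> 5000   # old-gen occupancy (O column) and GC counts every 5s; steady O growth across passes points to a heap-side leak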