We implemented code to perform distributed GLM at https://github.com/OpenChai/BIDMach_OC/tree/max/glm. On this branch, the Master actively collects modelmats from the Workers rather than collecting on a timeout. The Master should also stop automatically once the Workers have finished all passes over their data.
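For context, by "actively collect" we mean a pull loop on the Master roughly like the sketch below. This is a minimal illustration with hypothetical names (WorkerHandle, donePasses, fetchModelmat), not the actual BIDMach_OC code:

trait WorkerHandle {
  def donePasses: Boolean // has this Worker finished all npasses?
  def fetchModelmat(): Array[Float] // pull the Worker's current model
}

def collectLoop(workers: Seq[WorkerHandle], reduce: Seq[Array[Float]] => Unit): Unit = {
  // Keep pulling models until every Worker reports completion,
  // instead of collecting on a fixed timeout.
  while (!workers.forall(_.donePasses)) {
    reduce(workers.map(_.fetchModelmat()))
  }
}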
Our issue is that our Workers run out of memory when processing the entire Criteo dataset (about 12GB uncompressed), causing them to crash. To confirm this, we ssh into a worker while it's running and monitor the machine's memory using vmstat. Available memory steadily decreases to zero (even across multiple passes) until the Worker crashes.
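For reference, this is roughly how we watch memory on a worker (standard vmstat usage; the hostname placeholder and 5-second interval are arbitrary):

ssh <worker-host>
vmstat -S M 5   # print memory stats in MB every 5 seconds; watch the "free" column steadily drop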
We preprocess the dataset by splitting it into 91 parts, then run distributed GLM on 80 of those parts for 10 passes. We're currently testing this with the 4 workers of bidcluster3, so each worker i handles a subset of 20 parts. (Parameters for worker i: nstart=i*20, nend=(i+1)*20, npasses=10)
In an attempt to find the potential memory leak, we used jmap -dump to dump the heap of a running Worker process. It appears that the class FMat occupies the most space, with many references from scala.reflect.io.FileZipArchive$FileEntry$1. This leads us to suspect that memory isn't being properly reused/recycled between successive passes over the data.
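For reference, the dump can be taken with a standard jmap invocation along these lines (the live flag and PID placeholder here are ours):

jmap -dump:live,format=b,file=criteo_dump.bin <worker-pid>   # find the Worker's PID with jps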
The Java heap dump is on bidcluster3-slave-i-afd73976 at /home/ec2-user/criteo_dump.bin; you can spin up a server to view the heap dump with jhat -J-d64 -J-mx12g criteo_dump.bin (jhat serves on http://localhost:7000 by default). From there, scroll down to the histogram of memory usage by class and browse class references.
Steps to run distributed GLM on Criteo and to reproduce this issue:
Boot all bidcluster3 machines. ssh into the master machine.
Navigate to /opt/BIDMach_Spark and run python scripts/start_cluster.py
Navigate to /opt/BIDMach and check out the max/glm git branch. Then run:
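A hypothetical sketch of those two commands (the worker-launch script name is a placeholder, not necessarily the real one):

./scripts/start_workers.sh   # hypothetical helper: starts the BIDMach Worker daemon on each worker
./bidmach                    # BIDMach launcher: drops you into a Scala shell on the Master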
The first line will start a BIDMach Worker daemon ready to run GLM on each worker. The second line drops you into a Scala shell for the Master.
In the Master Scala shell, paste and run the following commands:
m.parCall((w) => {
  // Assign each Worker a contiguous range of dataset parts based on its machine index.
  val totalNumSamples = 80;
  val numWorkers = 4; // TODO: don't hardcode this
  val workerNumSamples = totalNumSamples / numWorkers;
  val nstart = w.imach * workerNumSamples;
  val nend = Math.min((w.imach + 1) * workerNumSamples, totalNumSamples);
  val fgopts = w.learner.opts.asInstanceOf[GLM.FGOptions];
  fgopts.nstart = nstart;
  fgopts.nend = nend;
  // Return a summary string so the Master can print each Worker's assignment.
  "imach: %d, nstart: %d, nend: %d, npasses: %d\n" format (w.imach, nstart, nend, w.learner.opts.npasses)
})
m.parCall((w) => { w.learner.paused = true; "done"})
m.parCall((w) => { w.learner.train; "not reached" }) // this will hang, just wait for it to timeout
m.startLearners
m.startUpdates
The first command configures each Worker to process only its subset of the Criteo dataset. The next two commands start the Learner on each Worker in a paused state. The last two commands cause the Master to signal each Worker to unpause its Learner and begin mixing modelmats (through Kylix).
This process works successfully on smaller datasets but seems to break with the 12GB Criteo dataset. Normally, the Master should stop automatically once each Worker has completed its passes, but after several minutes of running on Criteo, the Master stops getting responses from its Workers, indicating that the Workers have run out of memory.
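A quick way to distinguish JVM-heap growth from native (off-heap) growth while the job runs is standard jstat monitoring (the PID and interval below are placeholders):

jstat -gcutil <worker-pid> 5000   # old-gen occupancy (O column) and GC counts every 5s; steady O growth across passes points to a heap-side leak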