Question #3
Comments
Thanks for your interest, @thistleknot.
That, I believe, enables a single node to host 16 processes. You can have multiple such lines, which will allow you to achieve what you are looking for. Unfortunately we are not currently developing this project (though we still use it), so adding a multi-node configuration won't be on the todo list.
I thought about using lxc clustering to achieve this
On Wed, Mar 31, 2021, 10:43 AM Artem Polyakov wrote:
Thanks for your interest, @thistleknot.
This project was created as a development tool for Slurm. This is what I
was using it for primarily. So the emphasis was on a single node.
What is your goal here? Do you want to work with containers? If containers
are not a must, then Slurm has a multihost feature (don't remember the name
precisely). I have an example slurm.conf that configures it:
https://github.com/artpol84/poc/blob/master/slurm/multihost_conf/slurm.conf
Note the line
NodeName=cn[1-16] NodeAddr=localhost Port=32221-32236 CPUs=4
That, I believe, enables a single node to host 16 processes. You can have
multiple such lines, which will allow you to achieve what you are looking for.
Unfortunately we are not currently developing this project (though we still
use it), so adding a multi-node configuration won't be on the todo list.
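As a rough illustration of the "multiple such lines" idea, here is a minimal sketch; the second node range, port range, and partition name are made up rather than taken from the linked example, and $SLURM_PATH is the install prefix the guide already refers to:

    # Hypothetical slurm.conf fragment: two blocks of virtual nodes on one host,
    # each slurmd instance distinguished by its own port (the same
    # multiple-slurmd style setup as the linked example).
    cat >> $SLURM_PATH/etc/slurm.conf <<'EOF'
    NodeName=cn[1-16]  NodeAddr=localhost Port=32221-32236 CPUs=4
    NodeName=cn[17-32] NodeAddr=localhost Port=32237-32252 CPUs=4
    PartitionName=debug Nodes=cn[1-32] Default=YES State=UP
    EOF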
If you happen to extend this project to multiple nodes, I'll be happy to integrate that here. But I'm not sure it would be efficient enough, as I'm not an expert with Docker.
You asked earlier what my intention was. The nice thing about LXC containers is that from whatever node you run lxc ls -a, it acts as if everything is on a single machine, which means the containers can be hosted across nodes. I haven't really used it yet, but I thought about moving my LXC storage pool to a distributed volume (like GlusterFS) and running the containers from that, so I'm not choking my head node with data throughput. Anyways, that's neither here nor there, but the plan was to use the cluster solution. I could have my containers spread across nodes, but with your software it would still look like everything was on the head node.
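For what it's worth, a rough sketch of that plan, assuming an already-initialized LXD cluster; the Gluster server, volume, pool, container, and member names are all placeholders, and I haven't verified a Gluster-backed pool with clustered LXD:

    # On each cluster member: mount the shared Gluster volume
    mount -t glusterfs gluster1:/lxdvol /mnt/gluster
    # Create a dir-backed storage pool on top of the shared mount
    lxc storage create gpool dir source=/mnt/gluster/lxd
    # Launch a container on a specific cluster member, backed by the shared pool
    lxc launch images:centos/8 slurmd1 --target node2 --storage gpool
    # From any member, the whole cluster looks like one machine
    lxc ls -a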
Out of curiosity, have you managed to get SLXC to work for you? It's a bit tricky to set up initially, but it works pretty stably afterwards.
I mean in a single-node installation.
Soon. I plan on getting it up this week.
I was hoping to use the RPM-provided installation of slurmctld, which installs slurmctld in /usr/sbin, but your guide refers to $SLURM_PATH/var and $SLURM_PATH/etc, so it seems you are suggesting I use a compiled installation with an /opt path defined instead? Normally, if I install slurm-slurmctld, the config ends up under /etc/slurm/.
https://www.thegeekdiary.com/how-to-install-an-rpm-package-into-a-different-directory-in-centos-rhel-fedora/
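In case the RPM route is still attractive, a small sketch of what that link describes; it only works if the package was built relocatable, and the package file name and prefix are placeholders:

    # Check whether the package is relocatable ("Relocations" should not say "(not relocatable)")
    rpm -qpi slurm-slurmctld-*.rpm | grep -i relocation
    # If it is, install it under a custom prefix
    rpm -ivh --prefix=/opt/slurm slurm-slurmctld-*.rpm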
I'm using the snap-packaged LXD, which makes things a bit difficult (everything lives under /var/lib/snapd/snap/lxd/); for one, I don't have a dnsmasq.conf to edit.
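In case it helps, with the snap-packaged LXD the dnsmasq settings are usually injected through the managed bridge's config keys rather than by editing dnsmasq.conf directly; a sketch, assuming the default lxdbr0 bridge and an example DHCP option:

    # Show the managed networks and the current bridge config
    lxc network list
    lxc network show lxdbr0
    # Pass extra dnsmasq directives via raw.dnsmasq instead of editing dnsmasq.conf
    lxc network set lxdbr0 raw.dnsmasq "dhcp-option=option:dns-server,192.168.1.10"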
Dang it; to do this I'd have to download and build the latest releases of munge and slurm.
I need better instructions. When I attempt to compile munge and slurm with --prefix, I can do munge, but that doesn't install munge-devel. When I go to compile slurm, it wants munge-devel, but I can't build/install munge-devel without rpm-build, and when I attempt an rpm-build with ./configure --prefix=/opt/munge.xxx it fails during the build.
Again, the point of this project was to allow me and my team to develop for Slurm. In this situation you only want to build Slurm from sources. The munge version I was able to use successfully is 0.5.11; see the main project README.
Note that the above version was patched, but the link seems to be invalid and I don't remember what the issue was back then. I reported it to the munge developer, though, and I believe it was fixed. I was building from sources: not from the src.rpm, but from the tarball.
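A minimal sketch of that tarball-based build, assuming install prefixes under /opt; the version numbers and paths are just examples, and the --enable-multiple-slurmd flag is only my assumption, needed for the multihost example mentioned earlier:

    # Build munge from its release tarball into its own prefix
    tar xjf munge-0.5.11.tar.bz2 && cd munge-0.5.11
    ./configure --prefix=/opt/munge && make && make install
    cd ..
    # Build Slurm from its tarball, pointing configure at the munge prefix
    tar xjf slurm-*.tar.bz2 && cd slurm-*/
    ./configure --prefix=/opt/slurm --with-munge=/opt/munge --enable-multiple-slurmd
    make && make install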
I'm wondering how resource control works in these containers. At least the cgroup stuff seems not to work:
Any ideas how to fix this?
Hi @jelmd, thank you for your interest in the project. If so, it seems like this requires nesting of cgroups, which appears to be supported based on the description here:
We've not seen this issue because, for the Slurm development purposes SLXC was created for, we were not touching cgroups and this plugin was disabled.
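For reference, a sketch of the two directions touched on here: either keep the cgroup-based plugins out of the in-container Slurm config (as was done for SLXC's original development use), or allow cgroup nesting on the container itself. The particular plugin choices, the container name, and the use of LXD's security.nesting key are assumptions on my part:

    # Option A: run slurmd inside the container without cgroup plugins
    cat >> $SLURM_PATH/etc/slurm.conf <<'EOF'
    ProctrackType=proctrack/pgid
    TaskPlugin=task/none
    JobAcctGatherType=jobacct_gather/none
    EOF

    # Option B (LXD): allow nested cgroups for the container running slurmd
    lxc config set slurmd1 security.nesting true
    lxc restart slurmd1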
Hi @artpol84, thanks for your answer. I did some more experiments and found out that it seems to work more or less if the right settings are made. But the documentation is so shallow and confusing (and sometimes IMHO even wrong)... I guess the mentioned errors are really cleanup bugs in v23.2.5: I found out that they get triggered when the job is finished and relate to the corresponding job cgroups.
Use case: I actually do everything using LXCs (not any Docker nonsense). So especially for our DL users we create projects and dedicate one or more LXCs to those projects, with the appropriate number of GPUs (as needed / as available on the bare metal). This works really well and users are happy; however, sometimes students do not use their LXCs 24/7 and others would like to have some more GPUs available from time to time. So the idea came up to give Slurm a try. Unfortunately isolation is an issue (and the current level of comfort as well), so I'm trying to dig a little bit deeper ...
Is it possible to use this across nodes (rather than one node)?
I have 3 nodes:
1x with 8 cores
2x with 4 cores
What I'd like to do is set up slurmctld so that it talks to slurmd running in 4 LXC containers (limited to 4 cores each) spread across these 3 nodes, for a total of 16 cores.
I've read of a way to manually create a cgroups directory, but when I run jobs the node goes down. I'm not sure what you are doing to resolve the cgroups issue.
My OS of choice is Oracle Linux (a RHEL-compatible flavor).