Instance Memory calculation #17
Comments
When we made these calculations, we didn't have any Google credits to do a full survey to base them on. I did some calculations based on a bunch of AWS nodes and eyeballed a relationship. For AWS, we didn't include C5 nodes in the mix, so I guess they're just outside the bounds of what works. I'll try to put together the data we collected and the code to visualise it in the wiki in this repo so that we can re-evaluate.
I was hitting problems with […]. To work around this, I've tweaked the […]:

$ diff -u /opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py.orig /opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py
--- /opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py.orig 2023-06-15 21:26:00.303448073 +0000
+++ /opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py 2023-06-15 21:21:49.035548872 +0000
@@ -96,7 +96,7 @@
     return {
         s: {
             "memory": d["MemoryInfo"]["SizeInMiB"]
-            - int(math.pow(d["MemoryInfo"]["SizeInMiB"], 0.7) * 0.9 + 500),
+            - int(math.pow(d["MemoryInfo"]["SizeInMiB"], 0.7) * 0.9 + 1000),
             "cores_per_socket": d["VCpuInfo"].get(
                 "DefaultCores", d["VCpuInfo"]["DefaultVCpus"]
             ),

Before this change (30965MB of memory):
After (30465MB of memory, so effectively -500):
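For reference, the effect of the patched constant can be reproduced outside of citc with a small sketch. The function name `aws_slurm_memory` and the `offset` parameter are made up for illustration; only the arithmetic comes from the diff. The 32 GiB input is an assumption, since the instance type isn't stated in this comment, but it is the advertised size for which the expression yields the two figures quoted.

```python
import math


def aws_slurm_memory(size_in_mib: int, offset: int = 500) -> int:
    """Memory figure given by the expression in the diff.

    `offset` is the constant the patch changes (500 before, 1000 after).
    """
    reserved = int(math.pow(size_in_mib, 0.7) * 0.9 + offset)
    return size_in_mib - reserved


# Assumption: the node advertises 32 GiB (32768 MiB).
advertised = 32 * 1024

before = aws_slurm_memory(advertised, offset=500)
after = aws_slurm_memory(advertised, offset=1000)

print(before, after, before - after)  # 30965 30465 500
```

Because the offset is added inside `int(...)` as a whole number, raising it by 500 lowers the result by exactly 500, which matches the "effectively -500" observation.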
I recently tried to start up a c5.12xlarge instance on AWS and found that /mnt/shared/etc/slurm.conf claims the instance should have RealMem=94992. When the node comes up, however, slurmctld.log shows that it has less memory than slurm.conf indicates, so Slurm rejects the node (puts it in the DRAIN state).
This led me to the following calculation of the expected RealMem for AWS:
python-citc/citc/aws.py
Line 104 in c32b80a
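As a sanity check, here is a minimal sketch of how that expression arrives at RealMem=94992. It assumes the AWS expression is the one with the 500 MiB constant shown in the diff elsewhere in this thread, and that a c5.12xlarge advertises 96 GiB (98304 MiB). The helper name `reserved_mib` is hypothetical and not part of citc.

```python
import math


def reserved_mib(size_in_mib: int, offset: int = 500) -> int:
    """Amount subtracted from the advertised memory (hypothetical helper)."""
    return int(math.pow(size_in_mib, 0.7) * 0.9 + offset)


# Assumption: c5.12xlarge advertises 96 GiB, i.e. 98304 MiB.
advertised = 96 * 1024

reserved = reserved_mib(advertised)
real_mem = advertised - reserved

print(real_mem)  # 94992
print(f"{reserved} MiB held back ({reserved / advertised:.1%} of advertised)")
```

So the estimate holds back only about 3312 MiB on this instance size. If the node reports less than 94992, the configured value is higher than what the node actually offers, which is the situation described above.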
Contrast this with the GCP memory calculation:
python-citc/citc/google.py
Line 106 in c32b80a
It appears that the AWS config is attempting to estimate how much memory will actually be available (versus what is advertised), but the code for GCP is drastically underestimating.
Heavily underestimating the amount of available memory makes Slurm more tolerant of nodes that don't quite meet their advertised claims; however, it can cause issues when jobs request a specific amount of memory. The equations for the two clouds should probably be consistent, but I think the estimates need to be more conservative (lower) than what AWS currently calculates, as shown by the c5.12xlarge example above.