Hq as light scheduler #795
Changes from 35 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,7 +4,7 @@ set -eux | |
home="/home/${NB_USER}" | ||
|
||
# Untar home archive file to restore home directory if it is empty | ||
if [[ $(ls -A ${home} | wc -l) = "0" ]]; then | ||
if [ ! -e $home/.FLAG_HOME_INITIALIZED ]; then | ||
if [[ ! -f $HOME_TAR ]]; then | ||
echo "File $HOME_TAR does not exist!" | ||
exit 1 | ||
|
@@ -15,12 +15,20 @@ if [[ $(ls -A ${home} | wc -l) = "0" ]]; then | |
fi | ||
|
||
echo "Extracting $HOME_TAR to $home" | ||
# NOTE: tar reports the following errors when deployed to k8s, but at the moment they do not cause any issue | ||
# tar: .: Cannot utime: Operation not permitted | ||
# tar: .: Cannot change mode to rwxr-s---: Operation not permitted | ||
tar -xf $HOME_TAR -C "$home" | ||
|
||
echo "Copying directory '$QE_APP_FOLDER' to '$AIIDALAB_APPS'" | ||
cp -r "$QE_APP_FOLDER" "$AIIDALAB_APPS" | ||
else | ||
echo "$home folder is not empty!" | ||
ls -lrta "$home" | ||
fi | ||
|
||
if [ -d $AIIDALAB_APPS/quantum-espresso ]; then | ||
echo "Quantum ESPRESSO app does exist" | ||
else | ||
echo "Copying directory '$QE_APP_FOLDER' to '$AIIDALAB_APPS'" | ||
cp -r "$QE_APP_FOLDER" "$AIIDALAB_APPS" | ||
fi | ||
Comment on lines +27 to +32:
This does the trick, because in the k8s deployment, tarring the empty files like
||
|
||
set +eux |
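The flag-file guard introduced in this diff can be sketched as a small standalone function. This is only an illustration under assumptions: `init_home_once` is a hypothetical name, and creating the flag with `touch` after extraction is an assumption, since the diff does not show where `.FLAG_HOME_INITIALIZED` is created (it may well ship inside the archive).

```shell
#!/bin/bash
# Sketch: extract the home archive only once, guarded by a flag file.
# init_home_once is a hypothetical helper, not part of the PR.
init_home_once() {
    home_dir="$1"
    archive="$2"
    flag="${home_dir}/.FLAG_HOME_INITIALIZED"

    # A previous run already initialized this home directory.
    if [ -e "$flag" ]; then
        echo "Home already initialized, skipping extraction"
        return 0
    fi

    if [ ! -f "$archive" ]; then
        echo "File $archive does not exist!" >&2
        return 1
    fi

    echo "Extracting $archive to $home_dir"
    tar -xf "$archive" -C "$home_dir" || return 1

    # Assumption: the flag is created here, only after a successful extraction.
    touch "$flag"
}
```

Called as `init_home_once "/home/${NB_USER}" "$HOME_TAR"`, a second container start would skip the extraction even if the volume already contains other entries.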
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
#!/bin/bash | ||
|
||
set -x | ||
|
||
# computer | ||
verdi computer show ${HQ_COMPUTER} || verdi computer setup \ | ||
--non-interactive \ | ||
--label "${HQ_COMPUTER}" \ | ||
--description "local computer with hyperqueue scheduler" \ | ||
--hostname "localhost" \ | ||
--transport core.local \ | ||
--scheduler hyperqueue \ | ||
--work-dir /home/${NB_USER}/aiida_run/ \ | ||
--mpirun-command "mpirun -np {num_cpus}" | ||
|
||
verdi computer configure core.local "${HQ_COMPUTER}" \ | ||
--non-interactive \ | ||
--safe-interval 5.0 | ||
|
||
# disable the localhost which is set in base image | ||
verdi computer disable localhost aiida@localhost |
I didn't look into this file too deeply; you might want to get other eyes on this.
Sure, please do. I am not in a hurry to get this merged.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
#!/bin/bash | ||
|
||
set -x | ||
|
||
# NOTE: this cgroup folder hierarchy is based on cgroup v2. | ||
# If the container is started on a system that uses cgroup v1, the image build procedure will fail. | ||
# Since the image is mostly for the demo server, where we know the machine and OS, I assume | ||
# it has cgroup v2 (> Kubernetes v1.25). | ||
# We only build the image for the demo server, so users do not need the new cgroup version. | ||
# Developers, however, should update their cgroup version to v2. | ||
# See: https://kubernetes.io/docs/concepts/architecture/cgroups/#using-cgroupv2 | ||
|
||
# Compute the memory limit from the runtime | ||
MEMORY_LIMIT=$(cat /sys/fs/cgroup/memory.max) | ||
|
||
if [ "$MEMORY_LIMIT" = "max" ]; then | ||
MEMORY_LIMIT=4096 | ||
echo "No memory limit set, use 4GiB" | ||
else | ||
MEMORY_LIMIT=$(echo "scale=0; $MEMORY_LIMIT / (1024 * 1024)" | bc) | ||
echo "Memory Limit: ${MEMORY_LIMIT} MiB" | ||
fi | ||
|
||
# Compute number of cpus allocated to the container | ||
CPU_LIMIT=$(awk '{print $1}' /sys/fs/cgroup/cpu.max) | ||
CPU_PERIOD=$(awk '{print $2}' /sys/fs/cgroup/cpu.max) | ||
|
||
if [ "$CPU_PERIOD" -ne 0 ]; then | ||
CPU_NUMBER=$(echo "scale=2; $CPU_LIMIT / $CPU_PERIOD" | bc) | ||
echo "Number of CPUs allocated: $CPU_NUMBER" | ||
|
||
# For the HQ setting, round down to an integer number of CPUs; the remainder is left for system tasks | ||
CPU_LIMIT=$(echo "scale=0; $CPU_LIMIT / $CPU_PERIOD" | bc) | ||
else | ||
# If there is no limit (e.g. a local OCI runtime without a CPU limit set), use all CPUs | ||
CPU_LIMIT=$(nproc) | ||
echo "No CPU limit set" | ||
fi | ||
|
||
# Start hq server with a worker | ||
run-one-constantly hq server start 1>$HOME/.hq-stdout 2>$HOME/.hq-stderr & | ||
run-one-constantly hq worker start --cpus=${CPU_LIMIT} --resource "mem=sum(${MEMORY_LIMIT})" --no-detect-resources & | ||
|
||
# Reset the default memory_per_machine and default_mpiprocs_per_machine | ||
# c.set_default_mpiprocs_per_machine = ${CPU_LIMIT} | ||
# c.set_default_memory_per_machine = ${MEMORY_LIMIT} | ||
|
||
# Same as the original localhost: set the job poll interval to 2.0 secs | ||
# In addition, set the default mpiprocs and memory per machine | ||
# TODO: this runs every time the container starts; we need a lock file to prevent it. | ||
job_poll_interval="2.0" | ||
computer_name=${HQ_COMPUTER} | ||
python -c " | ||
from aiida import load_profile; from aiida.orm import load_computer; | ||
load_profile(); | ||
load_computer('${computer_name}').set_minimum_job_poll_interval(${job_poll_interval}) | ||
load_computer('${computer_name}').set_default_mpiprocs_per_machine(${CPU_LIMIT}) | ||
load_computer('${computer_name}').set_default_memory_per_machine(${MEMORY_LIMIT}) | ||
" |
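One detail of the cgroup v2 parsing above may deserve a second look: when no CPU limit is set, `cpu.max` contains the literal quota `max` followed by a non-zero period (for example `max 100000`), so a check on the period alone would not detect the unlimited case. A minimal sketch of parsing helpers that handle `max` explicitly follows. The function names are hypothetical, and the floor of one CPU is an assumption rather than something the PR specifies.

```shell
#!/bin/bash
# Sketch: derive worker resources from the contents of cgroup v2 control files.
# cpus_from_cpu_max and mem_mib_from_memory_max are hypothetical helper names.

# Argument: contents of cpu.max, i.e. "<quota> <period>" (quota is "max" when unlimited).
cpus_from_cpu_max() {
    quota=${1%% *}
    period=${1#* }
    if [ "$quota" = "max" ] || [ -z "$period" ] || [ "$period" -eq 0 ]; then
        # No limit set: fall back to all CPUs visible to the container.
        nproc
        return
    fi
    # Integer division rounds down; the remainder is left for system tasks.
    cpus=$(( quota / period ))
    # Assumption: never report fewer than one CPU.
    if [ "$cpus" -lt 1 ]; then
        cpus=1
    fi
    echo "$cpus"
}

# Arguments: contents of memory.max (bytes or "max"), optional fallback in MiB.
mem_mib_from_memory_max() {
    limit="$1"
    fallback_mib="${2:-4096}"
    if [ "$limit" = "max" ]; then
        echo "$fallback_mib"
    else
        echo $(( limit / 1024 / 1024 ))
    fi
}
```

In the script these could be used as `CPU_LIMIT=$(cpus_from_cpu_max "$(cat /sys/fs/cgroup/cpu.max)")` and `MEMORY_LIMIT=$(mem_mib_from_memory_max "$(cat /sys/fs/cgroup/memory.max)")`. Taking the file contents as an argument keeps the helpers testable without a cgroup v2 host.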
Not a huge fan of this solution since it seems brittle (user can remove this file).
I am somewhat confused, why does the previous one not work anymore?
(feel free to ignore, this is just me rambling :D)
Two problems with the previous one:
1. `0` and the right side is `\0` as str.
2. The `lost+found` folder exists before this script runs.

So another solution I tried was `if [ $(ls -A ${home} | wc -l ) -lt 1 ]; then`, but it is more brittle I assume :-p
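For comparison, the `lost+found` issue raised in this thread could also be handled by an emptiness check that ignores that one entry. This is only a sketch of a third option, not what the PR does; `home_is_uninitialized` is a hypothetical name.

```shell
#!/bin/bash
# Sketch: treat a directory as uninitialized when it holds nothing
# except an optional lost+found entry (created on freshly formatted volumes).
# home_is_uninitialized is a hypothetical helper, not part of the PR.
home_is_uninitialized() {
    # List at most one top-level entry other than lost+found.
    first_entry=$(find "$1" -mindepth 1 -maxdepth 1 ! -name 'lost+found' | head -n 1)
    [ -z "$first_entry" ]
}
```

It would replace the test as `if home_is_uninitialized "$home"; then ...`. Unlike the flag file, it cannot be defeated by removing a marker, but any stray file written to the volume before the first start would make the home look initialized.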