Feature/ubuntu docker container (#220)
* move al2 files into new directory

* add ubuntu dockerfile image scripts

* update ecr deploy script to upload both ubuntu and al2 containers

* set default number of cpus and memory for orchestration script

* updates to slurm initialization for aws

* change default conda env name

This update includes:
- Creation scripts for a new Ubuntu-flavored Docker container version of the IMI
- Updates and scripts needed to create and run the IMI on a new prototype AMI on AWS that uses Ubuntu 24.04 instead of the old Ubuntu 18.04, which has reached end of support
- A few minor updates and fixes for running on AWS
laestrada authored May 17, 2024
1 parent 43d87a6 commit c4fde97
Showing 45 changed files with 863 additions and 175 deletions.
15 changes: 12 additions & 3 deletions .github/workflows/ecr_deploy.yml
@@ -29,13 +29,22 @@ jobs:
       with:
         registry-type: public
 
+      - name: Build, tag, and push al2 image to Amazon ECR
+        env:
+          REGISTRY: ${{ steps.login-ecr-public.outputs.registry }}
+          REGISTRY_ALIAS: w1q7j9l2
+          REPOSITORY: imi-al2-docker-image
+          IMAGE_TAG: latest
+        run: |
+          docker build -f resources/containers/al2/Dockerfile -t $REGISTRY/$REGISTRY_ALIAS/$REPOSITORY:$IMAGE_TAG . --platform=linux/amd64
+          docker push $REGISTRY/$REGISTRY_ALIAS/$REPOSITORY:$IMAGE_TAG
-      - name: Build, tag, and push image to Amazon ECR
+      - name: Build, tag, and push ubuntu image to Amazon ECR
         env:
           REGISTRY: ${{ steps.login-ecr-public.outputs.registry }}
           REGISTRY_ALIAS: w1q7j9l2
-          REPOSITORY: imi-docker-image
+          REPOSITORY: imi-ubuntu-docker-image
           IMAGE_TAG: latest
         run: |
-          docker build -f resources/containers/Dockerfile -t $REGISTRY/$REGISTRY_ALIAS/$REPOSITORY:$IMAGE_TAG . --platform=linux/amd64
+          docker build -f resources/containers/ubuntu/Dockerfile -t $REGISTRY/$REGISTRY_ALIAS/$REPOSITORY:$IMAGE_TAG . --platform=linux/amd64
           docker push $REGISTRY/$REGISTRY_ALIAS/$REPOSITORY:$IMAGE_TAG
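For reference, the two images this workflow publishes correspond to the pull commands documented in the container docs below:

    docker pull public.ecr.aws/w1q7j9l2/imi-al2-docker-image:latest
    docker pull public.ecr.aws/w1q7j9l2/imi-ubuntu-docker-image:latest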
4 changes: 2 additions & 2 deletions config.yml
@@ -181,8 +181,8 @@ OutputPath: "/home/ubuntu/imi_output_dir"
 DataPath: "/home/ubuntu/ExtData"
 
 ## Conda environment file
-CondaFile: "/home/ubuntu/miniconda/etc/profile.d/conda.sh"
-CondaEnv: "geo"
+CondaFile: "/home/ubuntu/.bashrc"
+CondaEnv: "imi_env"
 
 ## Download initial restart file from AWS S3?
 ## NOTE: Must have AWS CLI enabled
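A note on the new defaults: the IMI now initializes conda from the user's shell profile rather than a hard-coded miniconda path. A minimal sketch of how these two settings are presumably consumed (assuming CondaFile is sourced before the environment is activated):

    source /home/ubuntu/.bashrc   # CondaFile: makes the conda command available
    conda activate imi_env        # CondaaEnv: the new default environment name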
38 changes: 17 additions & 21 deletions docs/source/advanced/imi-docker-container.rst
Expand Up @@ -40,9 +40,17 @@ the section on `Using Singularity instead of Docker <#using-singularity-instead-
-----------------
Pulling the image
-----------------
To run the container you will first need to pull the image from our cloud repository::
To run the container you will first need to pull the image from our cloud repository. There are two flavors of the
IMI docker container using a base operating system of ubuntu or amazon linux 2

$ docker pull public.ecr.aws/w1q7j9l2/imi-docker-image:latest
For Amazon Linux 2::

$ docker pull public.ecr.aws/w1q7j9l2/imi-al2-docker-image:latest


For Ubuntu::

$ docker pull public.ecr.aws/w1q7j9l2/imi-ubuntu-docker-image:latest

-------------------------------
Setting up the compose.yml file
@@ -57,17 +65,7 @@ allows you to more easily configure the IMI and save the output directory to you
 IMI input data
 --------------
 The IMI needs input data in order to run the inversion. If you do not have the necessary input data available
-locally then you will need to give the IMI container access to S3 on AWS, where the input data is available. This
-can be done by specifying your
-`aws credentials <https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html#envvars-set>`__ in
-the ``environment`` section of the compose.yml file. Eg:::
-
-    environment:
-      - AWS_ACCESS_KEY_ID=your_access_key_id
-      - AWS_SECRET_ACCESS_KEY=your_secret_access_key
-      - AWS_DEFAULT_REGION=us-east-1
-
-Note: these credentials are sensitive, so do not post them publicly in any repository.
+locally then it will be automatically downloaded from AWS, where the input data is publicly available.
 
 If you already have the necessary input data available locally, then you can mount it to the IMI container in the
 `volumes` section of the compose.yml file without setting your aws credentials. Eg:::
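(The example block is collapsed in this view. A minimal sketch of such a mount, assuming the al2 flavor's home directory and a hypothetical local data path:)

    volumes:
      - /local/path/to/ExtData:/home/al2/ExtData # mount local input data (hypothetical paths)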
@@ -82,6 +80,7 @@ In order to access the files from the inversion it is best to mount a volume fro
 container. This allows the results of the inversion to persist after the container exits. We recommend making a
 dedicated IMI output directory using `mkdir`.::
 
+    # Note: replace /home/al2 with /home/ubuntu if using the ubuntu flavor
     volumes:
       - /local/output/dir/imi_output:/home/al2/imi_output_dir # mount output directory
       - /local/container/config.yml:/home/al2/integrated_methane_inversion/config.yml # mount desired config file
@@ -108,6 +107,7 @@ will replace the ``StartDate`` and ``EndDate`` in the IMI config.yml file.
 To apply a config.yml file from your local system to the docker container, specify it in your compose.yml file as a
 volume. Then set the ``IMI_CONFIG_PATH`` environment variable to point to that path. Eg:::
 
+    # Note: replace /home/al2 with /home/ubuntu if using the ubuntu flavor
     volumes:
       - /local/path/to/config.yml:/home/al2/integrated_methane_inversion/config.yml # mount desired config file
     environment:
@@ -135,10 +135,6 @@ This is an example of what a fully filled out compose.yml file looks like:::
     environment:
       # comment out any environment vars you do not need for your system
       - IMI_CONFIG_PATH=config.yml # path starts from /home/al2/integrated_methane_inversions
-      ## ***** DO NOT push aws credentials to any public repositories *****
-      - AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
-      - AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
-      - AWS_DEFAULT_REGION=us-east-1


Running the IMI
@@ -157,16 +153,16 @@ all env variables and volumes via flags.
 
 Using Singularity instead of Docker
 ===================================
-We use Docker `Docker <https://docs.docker.com/get-started/overview/>`__ to containerize the IMI, but the docker
-containers can also be run using `Singularity <https://docs.sylabs.io/guides/3.5/user-guide/introduction.html>`__.
+We use `Docker <https://docs.docker.com/get-started/overview/>`__ to containerize the IMI, but it may also
+be possible to use other container engines, like `Singularity <https://docs.sylabs.io/guides/3.5/user-guide/introduction.html>`__.
 Singularity is a container engine designed to run on HPC systems and local clusters, as some clusters do not allow
 Docker to be installed.
+Note: using Singularity to run the IMI is untested and may not work as expected.
 
 First pull the image:::
 
-    $ singularity pull public.ecr.aws/w1q7j9l2/imi-docker-image:latest
+    $ singularity pull public.ecr.aws/w1q7j9l2/imi-ubuntu-docker-image:latest
 
 Then run the image:::
 
-    $ singularity run imi-docker-repository_latest.sif
+    $ singularity run imi-ubuntu-docker-image_latest.sif
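Note: depending on the Singularity version, pulling from a Docker registry such as public ECR may need an explicit ``docker://`` URI, e.g.:::

    $ singularity pull docker://public.ecr.aws/w1q7j9l2/imi-ubuntu-docker-image:latest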
61 changes: 26 additions & 35 deletions envs/aws/slurm/base_slurm.conf
@@ -2,40 +2,35 @@
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
#ControlAddr=
#BackupController=
#BackupAddr=
ClusterName=cluster
# SlurmctldHost=cluster
#SlurmctldHost=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=128
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
# ProctrackType=proctrack/cgroup
ProctrackType=proctrack/linuxproc
#Prolog=
#PrologFlags=
@@ -45,24 +40,21 @@ ProctrackType=proctrack/linuxproc
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
#SlurmctldPidFile=/home/centos/slurm/slurmctld.pid
#SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPidFile=/run/slurm/slurmctld.pid
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
#SlurmdPidFile=/home/centos/slurm/slurmd.pid
SlurmdPidFile=/run/slurm/slurmd.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
SlurmdUser=root
# SlurmdUser=ubuntu
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
TaskPluginParam=Sched
# TaskPlugin=task/affinity,task/cgroup
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
@@ -94,12 +86,10 @@ Waittime=0
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
SelectType=select/cons_tres
#
#
# JOB PRIORITY
@@ -120,29 +110,28 @@ SelectTypeParameters=CR_Core
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompParams=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
Expand All @@ -157,4 +146,6 @@ SlurmdDebug=3
#SuspendTime=
#
#
# COMPUTE NODES
# NodeName=linux[1-32] CPUs=1 State=UNKNOWN
# PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
10 changes: 10 additions & 0 deletions envs/aws/slurm/cgroup.conf
@@ -0,0 +1,10 @@
###
# Slurm cgroup support configuration file.
###
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
CgroupPlugin=cgroup/v1
IgnoreSystemd=no
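Since the slurm.conf above keeps ProctrackType=proctrack/linuxproc and only comments out the cgroup task plugin, this file mainly prepares cgroup support. A quick way to confirm what slurm actually loaded after a restart (a sketch using standard commands):

    scontrol show config | grep -i -E 'ProctrackType|TaskPlugin'
    cat /etc/slurm/cgroup.conf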
4 changes: 2 additions & 2 deletions envs/aws/slurm/configure_slurm.py
@@ -10,7 +10,7 @@
 import subprocess
 
 slurm_info = (
-    subprocess.run(["/usr/sbin/slurmd", "-C"], stdout=subprocess.PIPE)
+    subprocess.run(["slurmd", "-C"], stdout=subprocess.PIPE)
     .stdout.decode("utf-8")
     .split()
 )
@@ -22,7 +22,7 @@
         slurm_info[0],
         slurm_info[1],
         slurm_info[6],
-        "CoresPerSocket" + slurm_info[1][4:],
+        "SocketsPerBoard" + slurm_info[1][4:],
         "ThreadsPerCore=1 State=UNKNOWN",
     ]
 )
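For context, ``slurmd -C`` prints the node's hardware line (NodeName=..., CPUs=..., Boards=..., SocketsPerBoard=..., CoresPerSocket=..., ThreadsPerCore=..., RealMemory=...), so indexes 0, 1, and 6 pick out NodeName, CPUs, and RealMemory, and the CPU count is reused for SocketsPerBoard. A hedged sketch of the line this script appends, for a hypothetical 8-vCPU node:

    NodeName=ip-172-31-0-1 CPUs=8 RealMemory=31892 SocketsPerBoard=8 ThreadsPerCore=1 State=UNKNOWN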
Expand Down
20 changes: 15 additions & 5 deletions envs/aws/slurm/initialize_slurm
@@ -11,12 +11,22 @@
 # (4) edit it as you see fit (eg. @reboot /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/initialize_slurm)
 # (5) save and overwrite the existing file eg. /var/spool/cron/crontabs/root
 # (6) $ exit # to exit root user access
 
 # configure slurm
 cp /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/base_slurm.conf /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/new_slurm.conf
-/home/ubuntu/miniconda/bin/python /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/configure_slurm.py
-cp /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/new_slurm.conf /etc/slurm-llnl/slurm.conf
+python3 /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/configure_slurm.py
+
+# put newly generated slurm.conf in correct locations
+cp /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/new_slurm.conf /etc/slurm/slurm.conf
 rm /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/new_slurm.conf
-/usr/sbin/service slurmd restart
-/usr/sbin/service slurmctld restart
+
+# also add a cgroup.conf file
+cp /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/cgroup.conf /etc/slurm/cgroup.conf
+
+# start munge and slurm services
+service munge restart
+service slurmctld restart
+service slurmd restart
 
 # Fix issue when switching instance types where node claims to be drained
 scontrol update nodename=$HOSTNAME state=idle
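Tying the header comments together: registering this script to run at boot (steps 1-6 above) amounts to one @reboot entry in root's crontab, e.g.:

    $ sudo crontab -e
    # then add this line and save:
    @reboot /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/initialize_slurm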