Feature/ubuntu docker container (#220)
* move al2 files into new directory

* add ubuntu dockerfile image scripts

* update ecr deploy script to upload both ubuntu and al2 containers

* set default number of cpus and memory for orchestration script

* updates to slurm initialization for aws

* change default conda env name

This update includes:
- Creation scripts for a new Ubuntu-flavored Docker container version of the IMI
- Updates and scripts needed to create and run the IMI on a new prototype AMI on AWS that uses Ubuntu 24.04 instead of the old Ubuntu 18.04, which has reached end of support
- A few minor updates and fixes for running on AWS
laestrada authored May 17, 2024
1 parent 43d87a6 commit c4fde97
Showing 45 changed files with 863 additions and 175 deletions.
15 changes: 12 additions & 3 deletions .github/workflows/ecr_deploy.yml
@@ -29,13 +29,22 @@ jobs:
       with:
         registry-type: public
 
+      - name: Build, tag, and push al2 image to Amazon ECR
+        env:
+          REGISTRY: ${{ steps.login-ecr-public.outputs.registry }}
+          REGISTRY_ALIAS: w1q7j9l2
+          REPOSITORY: imi-al2-docker-image
+          IMAGE_TAG: latest
+        run: |
+          docker build -f resources/containers/al2/Dockerfile -t $REGISTRY/$REGISTRY_ALIAS/$REPOSITORY:$IMAGE_TAG . --platform=linux/amd64
+          docker push $REGISTRY/$REGISTRY_ALIAS/$REPOSITORY:$IMAGE_TAG
-      - name: Build, tag, and push image to Amazon ECR
+      - name: Build, tag, and push ubuntu image to Amazon ECR
         env:
           REGISTRY: ${{ steps.login-ecr-public.outputs.registry }}
           REGISTRY_ALIAS: w1q7j9l2
-          REPOSITORY: imi-docker-image
+          REPOSITORY: imi-ubuntu-docker-image
           IMAGE_TAG: latest
         run: |
-          docker build -f resources/containers/Dockerfile -t $REGISTRY/$REGISTRY_ALIAS/$REPOSITORY:$IMAGE_TAG . --platform=linux/amd64
+          docker build -f resources/containers/ubuntu/Dockerfile -t $REGISTRY/$REGISTRY_ALIAS/$REPOSITORY:$IMAGE_TAG . --platform=linux/amd64
           docker push $REGISTRY/$REGISTRY_ALIAS/$REPOSITORY:$IMAGE_TAG
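For reference, the two images this workflow publishes correspond to the pull commands documented in the container docs below:

    docker pull public.ecr.aws/w1q7j9l2/imi-al2-docker-image:latest
    docker pull public.ecr.aws/w1q7j9l2/imi-ubuntu-docker-image:latest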
4 changes: 2 additions & 2 deletions config.yml
@@ -181,8 +181,8 @@ OutputPath: "/home/ubuntu/imi_output_dir"
 DataPath: "/home/ubuntu/ExtData"
 
 ## Conda environment file
-CondaFile: "/home/ubuntu/miniconda/etc/profile.d/conda.sh"
-CondaEnv: "geo"
+CondaFile: "/home/ubuntu/.bashrc"
+CondaEnv: "imi_env"
 
 ## Download initial restart file from AWS S3?
 ## NOTE: Must have AWS CLI enabled
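A note on the new defaults: the IMI now initializes conda from the user's shell profile rather than a hard-coded miniconda path. A minimal sketch of how these two settings are presumably consumed (assuming CondaFile is sourced before the environment is activated):

    source /home/ubuntu/.bashrc   # CondaFile: makes the conda command available
    conda activate imi_env        # CondaaEnv: the new default environment name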
38 changes: 17 additions & 21 deletions docs/source/advanced/imi-docker-container.rst
Expand Up @@ -40,9 +40,17 @@ the section on `Using Singularity instead of Docker <#using-singularity-instead-
-----------------
Pulling the image
-----------------
To run the container you will first need to pull the image from our cloud repository::
To run the container you will first need to pull the image from our cloud repository. There are two flavors of the
IMI docker container using a base operating system of ubuntu or amazon linux 2

$ docker pull public.ecr.aws/w1q7j9l2/imi-docker-image:latest
For Amazon Linux 2::

$ docker pull public.ecr.aws/w1q7j9l2/imi-al2-docker-image:latest


For Ubuntu::

$ docker pull public.ecr.aws/w1q7j9l2/imi-ubuntu-docker-image:latest

-------------------------------
Setting up the compose.yml file
@@ -57,17 +65,7 @@ allows you to more easily configure the IMI and save the output directory to you
 IMI input data
 --------------
 The IMI needs input data in order to run the inversion. If you do not have the necessary input data available
-locally then you will need to give the IMI container access to S3 on AWS, where the input data is available. This
-can be done by specifying your
-`aws credentials <https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html#envvars-set>`__ in
-the ``environment`` section of the compose.yml file. Eg:::
-
-    environment:
-      - AWS_ACCESS_KEY_ID=your_access_key_id
-      - AWS_SECRET_ACCESS_KEY=your_secret_access_key
-      - AWS_DEFAULT_REGION=us-east-1
-
-Note: these credentials are sensitive, so do not post them publicly in any repository.
+locally then it will be automatically downloaded from AWS, where the input data is publicly available.
 
 If you already have the necessary input data available locally, then you can mount it to the IMI container in the
 `volumes` section of the compose.yml file without setting your aws credentials. Eg:::
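(The example block is collapsed in this view. A minimal sketch of such a mount, assuming the al2 flavor's home directory and a hypothetical local data path:)

    volumes:
      - /local/path/to/ExtData:/home/al2/ExtData # mount local input data (hypothetical paths)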
@@ -82,6 +80,7 @@ In order to access the files from the inversion it is best to mount a volume fro
 container. This allows the results of the inversion to persist after the container exits. We recommend making a
 dedicated IMI output directory using `mkdir`.::
 
+    # Note: replace /home/al2 with /home/ubuntu if using the ubuntu flavor
     volumes:
       - /local/output/dir/imi_output:/home/al2/imi_output_dir # mount output directory
       - /local/container/config.yml:/home/al2/integrated_methane_inversion/config.yml # mount desired config file
@@ -108,6 +107,7 @@ will replace the ``StartDate`` and ``EndDate`` in the IMI config.yml file.
 To apply a config.yml file from your local system to the docker container, specify it in your compose.yml file as a
 volume. Then set the ``IMI_CONFIG_PATH`` environment variable to point to that path. Eg:::
 
+    # Note: replace /home/al2 with /home/ubuntu if using the ubuntu flavor
     volumes:
       - /local/path/to/config.yml:/home/al2/integrated_methane_inversion/config.yml # mount desired config file
     environment:
@@ -135,10 +135,6 @@ This is an example of what a fully filled out compose.yml file looks like:::
     environment:
       # comment out any environment vars you do not need for your system
       - IMI_CONFIG_PATH=config.yml # path starts from /home/al2/integrated_methane_inversions
-      ## ***** DO NOT push aws credentials to any public repositories *****
-      - AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
-      - AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
-      - AWS_DEFAULT_REGION=us-east-1


Running the IMI
@@ -157,16 +153,16 @@ all env variables and volumes via flags.
 
 Using Singularity instead of Docker
 ===================================
-We use Docker `Docker <https://docs.docker.com/get-started/overview/>`__ to containerize the IMI, but the docker
-containers can also be run using `Singularity <https://docs.sylabs.io/guides/3.5/user-guide/introduction.html>`__.
+We use `Docker <https://docs.docker.com/get-started/overview/>`__ to containerize the IMI, but it may also
+be possible to use other container engines, like `Singularity <https://docs.sylabs.io/guides/3.5/user-guide/introduction.html>`__.
 Singularity is a container engine designed to run on HPC systems and local clusters, as some clusters do not allow
 Docker to be installed.
+Note: using Singularity to run the IMI is untested and may not work as expected.
 
 First pull the image:::
 
-    $ singularity pull public.ecr.aws/w1q7j9l2/imi-docker-image:latest
+    $ singularity pull public.ecr.aws/w1q7j9l2/imi-ubuntu-docker-image:latest
 
 Then run the image:::
 
-    $ singularity run imi-docker-repository_latest.sif
+    $ singularity run imi-ubuntu-docker-image_latest.sif
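Note: depending on the Singularity version, pulling from a Docker registry such as public ECR may need an explicit ``docker://`` URI, e.g.:::

    $ singularity pull docker://public.ecr.aws/w1q7j9l2/imi-ubuntu-docker-image:latest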
61 changes: 26 additions & 35 deletions envs/aws/slurm/base_slurm.conf
@@ -2,40 +2,35 @@
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
#ControlAddr=
#BackupController=
#BackupAddr=
ClusterName=cluster
# SlurmctldHost=cluster
#SlurmctldHost=
#
AuthType=auth/munge
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#MaxJobId=67043328
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#JobSubmitPlugins=lua
#KillOnBadExit=0
#LaunchType=launch/slurm
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
MailProg=/bin/mail
#MaxJobCount=10000
#MaxStepCount=40000
#MaxTasksPerNode=128
#MaxTasksPerNode=512
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
# ProctrackType=proctrack/cgroup
ProctrackType=proctrack/linuxproc
#Prolog=
#PrologFlags=
@@ -45,24 +40,21 @@ ProctrackType=proctrack/linuxproc
#PropagateResourceLimitsExcept=
#RebootProgram=
ReturnToService=1
#SallocDefaultCommand=
#SlurmctldPidFile=/home/centos/slurm/slurmctld.pid
#SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPidFile=/run/slurm/slurmctld.pid
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
#SlurmdPidFile=/home/centos/slurm/slurmd.pid
SlurmdPidFile=/run/slurm/slurmd.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
SlurmdUser=root
# SlurmdUser=ubuntu
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/affinity
TaskPluginParam=Sched
# TaskPlugin=task/affinity,task/cgroup
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFS=/tmp
@@ -94,12 +86,10 @@ Waittime=0
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core
SelectType=select/cons_tres
#
#
# JOB PRIORITY
@@ -120,29 +110,28 @@ SelectTypeParameters=CR_Core
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#AccountingStoreFlags=
#JobCompHost=
#JobCompLoc=
#JobCompParams=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
#JobContainerType=job_container/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#DebugFlags=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
Expand All @@ -157,4 +146,6 @@ SlurmdDebug=3
#SuspendTime=
#
#
# COMPUTE NODES
# NodeName=linux[1-32] CPUs=1 State=UNKNOWN
# PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
10 changes: 10 additions & 0 deletions envs/aws/slurm/cgroup.conf
@@ -0,0 +1,10 @@
###
# Slurm cgroup support configuration file.
###
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
CgroupPlugin=cgroup/v1
IgnoreSystemd=no
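Since the slurm.conf above keeps ProctrackType=proctrack/linuxproc and only comments out the cgroup task plugin, this file mainly prepares cgroup support. A quick way to confirm what slurm actually loaded after a restart (a sketch using standard commands):

    scontrol show config | grep -i -E 'ProctrackType|TaskPlugin'
    cat /etc/slurm/cgroup.conf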
4 changes: 2 additions & 2 deletions envs/aws/slurm/configure_slurm.py
@@ -10,7 +10,7 @@
 import subprocess
 
 slurm_info = (
-    subprocess.run(["/usr/sbin/slurmd", "-C"], stdout=subprocess.PIPE)
+    subprocess.run(["slurmd", "-C"], stdout=subprocess.PIPE)
     .stdout.decode("utf-8")
     .split()
 )
@@ -22,7 +22,7 @@
         slurm_info[0],
         slurm_info[1],
         slurm_info[6],
-        "CoresPerSocket" + slurm_info[1][4:],
+        "SocketsPerBoard" + slurm_info[1][4:],
         "ThreadsPerCore=1 State=UNKNOWN",
     ]
 )
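For context, ``slurmd -C`` prints the node's hardware line (NodeName=..., CPUs=..., Boards=..., SocketsPerBoard=..., CoresPerSocket=..., ThreadsPerCore=..., RealMemory=...), so indexes 0, 1, and 6 pick out NodeName, CPUs, and RealMemory, and the CPU count is reused for SocketsPerBoard. A hedged sketch of the line this script appends, for a hypothetical 8-vCPU node:

    NodeName=ip-172-31-0-1 CPUs=8 RealMemory=31892 SocketsPerBoard=8 ThreadsPerCore=1 State=UNKNOWN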
Expand Down
20 changes: 15 additions & 5 deletions envs/aws/slurm/initialize_slurm
@@ -11,12 +11,22 @@
 # (4) edit it as you see fit (eg. @reboot /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/initialize_slurm)
 # (5) save and overwrite the existing file eg. /var/spool/cron/crontabs/root
 # (6) $ exit # to exit root user access
 
 # configure slurm
 cp /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/base_slurm.conf /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/new_slurm.conf
-/home/ubuntu/miniconda/bin/python /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/configure_slurm.py
-cp /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/new_slurm.conf /etc/slurm-llnl/slurm.conf
+python3 /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/configure_slurm.py
+
+# put newly generated slurm.conf in correct locations
+cp /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/new_slurm.conf /etc/slurm/slurm.conf
 rm /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/new_slurm.conf
-/usr/sbin/service slurmd restart
-/usr/sbin/service slurmctld restart
+
+# also add a cgroup.conf file
+cp /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/cgroup.conf /etc/slurm/cgroup.conf
+
+# start munge and slurm services
+service munge restart
+service slurmctld restart
+service slurmd restart
 
 # Fix issue when switching instance types where node claims to be drained
 scontrol update nodename=$HOSTNAME state=idle
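Tying the header comments together: registering this script to run at boot (steps 1-6 above) amounts to one @reboot entry in root's crontab, e.g.:

    $ sudo crontab -e
    # then add this line and save:
    @reboot /home/ubuntu/integrated_methane_inversion/envs/aws/slurm/initialize_slurm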