HuntsmanCancerInstitute/AwsApps

AwsApps

Genomic data focused toolkit for working with AWS services (e.g. S3 and EC2). Includes exhaustive JUnit testing for each app.

u0028003$ java -jar ~/Code/AwsApps/target/S3Copy_0.6.jar 

**************************************************************************************
**                                 S3 Copy : Sept 2024                              **
**************************************************************************************
SC copies AWS S3 objects, unarchiving them as needed, within the same or different
accounts or downloads them to your local computer. Run this as a daemon with -l or run
repeatedly until complete. To upload files to S3, use the AWS CLI. 

To use the app:
Create a ~/.aws/credentials file with your access and secret info, chmod
  600 the file and keep it private. Use a txt editor or the AWS CLI configure
  command, see https://aws.amazon.com/cli   Example ~/.aws/credentials file:
      [default]
      aws_access_key_id = AKIARHBDRGYUIBR33RCJK6A
      aws_secret_access_key = BgDV2UHZv/T5ENs395867ueESMPGV65HZMpUQ
Repeat these entries for multiple accounts replacing the word 'default' with a single
unique account name. Temporary credentials with a session attribute also work.

Required:
-j Provide a comma delimited string of copy jobs or a txt file with one per line.
      A copy job consists of a full S3 URI as the source and a destination separated
      by '>', e.g. 's3://source/tumor.cram > s3://destination/collabTumor.cram' or
      folders 's3://source/alignments/tumor > s3://destination/Collab/' or local
      's3://source/alignments/tumor > .' Note, the trailing '/' is required in the
      S3 destination for a recursive copy or when the local folder doesn't exist.

Optional/ Defaults:
-u Perform a real run, defaults to listing the actions that would be taken.
-r Perform a recursive unarchive and copy, defaults to an exact source key match.
      WARNING: this will unarchive all files in the path and restore them to active S3.
      This can be costly! Use with care.
-e Email address(es) to send status messages, comma delimited, no spaces. Note,
      the sendmail app must be configured on your system. Test it:
      echo 'Subject: Hello' | sendmail [email protected]
-x Expedite archive retrieval, increased cost $0.03/GB vs $0.01/GB, 1-5min vs 3-12hr, 
      defaults to standard.
-l Execute every hour (standard) or minute (expedited) until complete
-t Maximum threads to utilize, defaults to 8
-p AWS credentials profile, defaults to 'default'
-n Number of days to keep restored files in S3, defaults to 1
-a Print instructions for copying files between different accounts

Example: java -Xmx10G -jar pathTo/S3Copy_x.x.jar -e [email protected] -p obama -l
   -j 's3://source/Logs.zip>s3://destination/,s3://source/normal > ~/Downloads/' -r
**************************************************************************************
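
The -j copy job format described above can be sketched in a few lines of Java. This is only an illustration of the 'source > destination' convention, not the parser S3Copy actually uses; the class name `CopyJobSketch` and its methods are hypothetical, and it assumes that whitespace around '>' is optional, as the example command suggests.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of S3Copy: splits a -j argument into copy jobs. */
final class CopyJobSketch {
    final String source;
    final String destination;

    CopyJobSketch(String source, String destination) {
        this.source = source;
        this.destination = destination;
    }

    /** Parses a comma delimited list of 'source > destination' jobs. */
    static List<CopyJobSketch> parse(String jobs) {
        List<CopyJobSketch> parsed = new ArrayList<>();
        for (String job : jobs.split(",")) {
            String[] parts = job.split(">");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected 'source > destination' but saw: " + job);
            }
            // Whitespace around '>' is treated as optional (assumption).
            String src = parts[0].trim();
            String dest = parts[1].trim();
            if (!src.startsWith("s3://")) {
                throw new IllegalArgumentException("Source must be a full S3 URI: " + src);
            }
            parsed.add(new CopyJobSketch(src, dest));
        }
        return parsed;
    }

    /** True when the destination is another S3 location rather than a local path. */
    boolean isS3Destination() {
        return destination.startsWith("s3://");
    }

    public static void main(String[] args) {
        String jobs = "s3://source/Logs.zip>s3://destination/,s3://source/normal > ~/Downloads/";
        for (CopyJobSketch job : parse(jobs)) {
            System.out.println(job.source + " -> " + job.destination
                    + (job.isS3Destination() ? " (S3)" : " (local)"));
        }
    }
}
```

Rejecting a malformed job before any copying starts mirrors the app's default dry run, where actions are listed before -u performs them.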
u0028003$ java -jar ~/Code/AwsApps/target/VersionManager_0.5.jar 

**************************************************************************************
**                             AWS S3 Version Manager : Sept 2024                   **
**************************************************************************************
Bucket versioning in S3 protects objects from being deleted or overwritten by hiding
the original when 'deleting' or overwriting an existing object. Use this tool to
delete these hidden S3 objects and any delete markers from your buckets. Use the
options to select particular redundant objects to delete in a dry run, review the
actions, and rerun it with the -r option to actually delete them. This app will not
delete any isLatest=true object.

WARNING! This app has the potential to destroy precious data. TEST IT on a
pilot system before deploying in production. Although extensively unit tested, this
app is provided with no guarantee of proper function.

To use the app:
1) Enable S3 Object versioning on your bucket.
2) Install and configure the aws cli with your region, access and secret keys.
   Temporary credentials with a session attribute also work. See 
   https://aws.amazon.com/cli. 
3) Use cli commands like 'aws s3 rm s3://myBucket/myObj.txt' or the AWS web Console to
   'delete' particular objects. Then run this app to actually delete them.

Required Parameters:
-b Versioned S3 bucket name

Optional Parameters:
-r Perform a real run, defaults to a dry run where no objects are deleted
-c Credentials profile name, defaults to 'default'
-a Minimum age, in days, of object to delete, defaults to 30
-s Object key suffixes to delete, comma delimited, no spaces
-p Object key prefixes to delete, comma delimited, no spaces
-v Verbose output
-t Maximum threads to use, defaults to 8

Example: java -Xmx10G -jar pathTo/VersionManager_X.X.jar -b mybucket-vm-test 
     -s .cram,.bam,.gz,.zip -a 7 -c MiloLab

**************************************************************************************
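
The selection rules in the help menu above (never touch isLatest=true, honor -a, -s, and -p) can be sketched as a single predicate. This is a minimal illustration, not VersionManager's code: the class `VersionFilterSketch` is hypothetical, and it assumes that an empty suffix or prefix list means no restriction and that, when both lists are given, a key must match both.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of the selection rules, not VersionManager's actual code. */
final class VersionFilterSketch {

    /** Returns true when an object version passes every filter described in the help menu. */
    static boolean isDeletionCandidate(String key, boolean isLatest, Instant lastModified,
            Instant now, int minAgeDays, List<String> suffixes, List<String> prefixes) {
        // The app states it never deletes an isLatest=true object.
        if (isLatest) return false;
        // -a: skip versions younger than the minimum age in days.
        if (Duration.between(lastModified, now).toDays() < minAgeDays) return false;
        // -s and -p: an empty list means no restriction, and both must match
        // when both are supplied (assumptions of this sketch).
        return matchesAny(key, suffixes, true) && matchesAny(key, prefixes, false);
    }

    private static boolean matchesAny(String key, List<String> patterns, boolean asSuffix) {
        if (patterns.isEmpty()) return true;
        for (String pattern : patterns) {
            if (asSuffix ? key.endsWith(pattern) : key.startsWith(pattern)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Instant tenDaysAgo = now.minus(Duration.ofDays(10));
        List<String> suffixes = Arrays.asList(".cram", ".bam", ".gz", ".zip");
        List<String> noPrefixes = Collections.emptyList();
        System.out.println(isDeletionCandidate("Patient1/tumor.cram", false, tenDaysAgo,
                now, 7, suffixes, noPrefixes));
    }
}
```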

See Misc/WorkingWithTheAWSJobRunner.pdf for details.

u0028003$ java -jar -Xmx1G ~/Code/AwsApps/target/JobRunner_0.3.jar 

**************************************************************************************
**                              AWS Job Runner : January 2021                       **
**************************************************************************************
JR is an app for running bash scripts on AWS EC2 nodes. It downloads and uncompresses
your resource bundle and looks for xxx.sh_JR_START files in your S3 Jobs directories.
For each, it copies over the directory contents, executes the associated xxx.sh
script, and transfers back the results. This is repeated until no unrun jobs are
found. Launch many EC2 JR nodes, each running an instance of the JR, to process
hundreds of jobs in parallel. Use spot requests and hibernation to reduce costs.
Upon termination, JR will cancel the spot request and kill the instance.

To use:
1) Install and configure the aws cli on your local workstation, see
     https://aws.amazon.com/cli/
2) Upload a [default] aws credential file containing a single set of region,
     aws_access_key_id, and aws_secret_access_key info into a private bucket, e.g.
     aws s3 cp ~/.aws/credentials s3://my-jr/aws.cred.txt 
3) Generate a secure timed URL for the credentials file, e.g. valid for 72hr (259200 sec):
     aws --region us-west-2  s3 presign s3://my-jr/aws.cred.txt  --expires-in 259200
4) Upload a zip archive containing resources needed to run your jobs into S3, e.g.
     aws s3 cp ~/TNRunnerResourceBundle.zip s3://my-jr/TNRunnerResourceBundle.zip
     This will be copied into the /JRDir/ directory and then unzipped.
5) Upload script and job files into a 'Jobs' directory on S3, e.g.
     aws s3 cp ~/JRJobs/A/ s3://my-jr/Jobs/A/ --recursive
6) Optional, upload bash script files ending with JR_INIT.sh and/or JR_TERM.sh. These
     are executed by JR before and after running the main bash script.  Use these to
     copy in sample specific resources, e.g. fastq/ cram/ bam files, and to run
     post job clean up.
7) Upload a file named XXX_JR_START to let the JobRunner know the bash script named
     XXX is ready to run, e.g.
     aws s3 cp s3://my-jr/emptyFile s3://my-jr/Jobs/A/dnaAlignQC.sh_JR_START
8) Launch the JobRunner.jar on one or more JR configured EC2 nodes. See
     https://ri-confluence.hci.utah.edu/x/gYCgBw

Job Runner Required Options:
-c URL to your secure timed config credentials file.
-r S3URI to your zipped resource bundle.
-j S3URI to your root Jobs directory containing folders with job scripts to execute.
-l S3URI to your Log folder for node logs.

Default Options:
-d Directory on the local worker node, full path, in which resources and job files
     will be processed, defaults to /JRDir/
-a AWS credentials directory, defaults to ~/.aws/
-t Terminate the EC2 node upon program exit, defaults to leaving it running. 
-w Minutes to wait looking for jobs before exiting, defaults to 10.
-x Replace S3 job directories with processed analysis, defaults to syncing local with
     S3. WARNING, if selected, don't place any files in these S3 jobs directories that
     cannot be replaced. JR will delete them.
-v Verbose debugging output.

Example: java -jar -Xmx1G JobRunner.jar -x -t 
     -r s3://my-jr/TNRunnerResourceBundle.zip
     -j s3://my-jr/Jobs/
     -l s3://my-jr/NodeLogs/
     -c 'https://my-jr.s3.us-west-2.amazonaws.com/aws.cred.txt?X-AmRun...'

**************************************************************************************
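
The xxx.sh_JR_START convention described above can be illustrated with a short Java sketch that maps start markers back to the scripts they announce. It is not the JobRunner's implementation: the class `JobStartSketch` is hypothetical, it works on a plain list of key names rather than calling S3, and it assumes a marker only counts when the matching script is present in the same listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch, not JobRunner code: finds scripts flagged as ready to run. */
final class JobStartSketch {
    static final String START_SUFFIX = "_JR_START";

    /** Returns the keys of bash scripts that have a matching xxx.sh_JR_START marker. */
    static List<String> readyScripts(List<String> keys) {
        Set<String> present = new HashSet<>(keys);
        List<String> ready = new ArrayList<>();
        for (String key : keys) {
            if (!key.endsWith(START_SUFFIX)) continue;
            // Strip the marker suffix to recover the script key, e.g. dnaAlignQC.sh
            String script = key.substring(0, key.length() - START_SUFFIX.length());
            // Requiring the script itself to exist is an assumption of this sketch.
            if (script.endsWith(".sh") && present.contains(script)) {
                ready.add(script);
            }
        }
        return ready;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
                "Jobs/A/dnaAlignQC.sh",
                "Jobs/A/dnaAlignQC.sh_JR_START",
                "Jobs/B/rnaAlign.sh");
        System.out.println(readyScripts(keys));
    }
}
```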
u0028003$ java -jar -Xmx1G ~/Code/AwsApps/target/GSync_0.6.jar 
**************************************************************************************
**                                   GSync : June 2020                              **
**************************************************************************************
GSync pushes files with a particular extension that exceed a given size and age to 
Amazon's S3 object store. Associated genomic index files are also moved. Once 
correctly uploaded, GSync replaces the original file with a local txt placeholder file 
containing information about the S3 object. Files are restored or deleted by modifying
the name of the placeholder file. Symbolic links are ignored.

WARNING! This app has the potential to destroy precious genomic data. TEST IT on a
pilot system before deploying in production. BACKUP your local files and ENABLE S3
Object Versioning before running.  This app is provided with no guarantee of proper
function.

To use the app:
1) Create a new S3 bucket dedicated solely to this purpose. Use it for nothing else.
2) Enable S3 Object Lock and Versioning on the bucket to assist in preventing
   accidental object overwriting. Add lifecycle rules to
   AbortIncompleteMultipartUpload and move objects to Glacier Deep Archive.
3) It is a good policy when working on AWS S3 to limit your ability to accidentally
   delete buckets and objects. To do so, create and assign yourself to an AWS Group 
   called AllExceptS3Delete with a custom permission policy that denies s3:Delete*:
   {"Version": "2012-10-17", "Statement": [
      {"Effect": "Allow", "Action": "*", "Resource": "*"},
      {"Effect": "Deny", "Action": "s3:Delete*", "Resource": "*"} ]}
   For standard upload and download gsyncs, assign yourself to the AllExceptS3Delete
   group. When you need to delete or update objects, switch to the Admin group, then
   switch back. Accidental overwrites are OK since object versioning is enabled.
   To add another layer of protection, apply object legal locks via the aws cli.
4) Create a ~/.aws/credentials file with your access, secret, and region info, chmod
   600 the file and keep it private. Use a txt editor or the aws cli configure
   command, see https://aws.amazon.com/cli   Example ~/.aws/credentials file:
   [default]
   aws_access_key_id = AKIARHBDRGYUIBR33RCJK6A
   aws_secret_access_key = BgDV2UHZv/T5ENs395867ueESMPGV65HZMpUQ
   region = us-west-2
5) Execute GSync to upload large old files to S3 and replace them with a placeholder
   file named xxx.S3.txt
6) To download and restore an archived file, rename the placeholder
   xxx.S3.txt.restore and run GSync.
7) To delete an S3 archived file, its placeholder, and any local files, rename the
   placeholder xxx.S3.txt.delete and run GSync.
   Before executing, switch the GSync/AWS user to the Admin group.
8) Placeholder files may be moved, see -u
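
The placeholder renaming conventions in the steps above amount to a small lookup from file name to requested action. The Java sketch below shows that mapping only; the class `PlaceholderActionSketch` and its `Action` names are hypothetical and are not taken from GSync's source.

```java
/** Hypothetical sketch, not GSync code: maps a placeholder file name to the requested action. */
final class PlaceholderActionSketch {

    enum Action { KEEP, RESTORE, DELETE, NOT_A_PLACEHOLDER }

    /** Classifies a file name using the xxx.S3.txt naming convention. */
    static Action classify(String fileName) {
        // Check the longer, more specific endings before the plain placeholder ending.
        if (fileName.endsWith(".S3.txt.restore")) return Action.RESTORE;
        if (fileName.endsWith(".S3.txt.delete")) return Action.DELETE;
        if (fileName.endsWith(".S3.txt")) return Action.KEEP;
        return Action.NOT_A_PLACEHOLDER;
    }

    public static void main(String[] args) {
        String[] names = {"tumor.cram.S3.txt", "tumor.cram.S3.txt.restore",
                "tumor.cram.S3.txt.delete", "tumor.cram"};
        for (String name : names) {
            System.out.println(name + " : " + classify(name));
        }
    }
}
```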

Required:
-d One or more local directories with the same parent to sync. This parent dir
     becomes the base key in S3, e.g. BucketName/Parent/.... Comma delimited, no
     spaces, see the example.
-b Dedicated S3 bucket name

Optional:
-f File extensions to consider, comma delimited, no spaces, case sensitive. Defaults
     to '.bam,.cram,.gz,.zip'
-a Minimum days old for archiving, defaults to 120
-g Minimum gigabyte size for archiving, defaults to 5
-r Perform a real run, defaults to just listing the actions that would be taken.
-k Delete local files that were successfully uploaded.
-u Update S3 Object keys to match current placeholder paths.
-c Recreate deleted placeholder files using info from orphaned S3 Objects.
-q Quiet, suppress verbose output.
-e Email addresses to send gsync messages, comma delimited, no spaces.
-s Smtp host, defaults to hci-mail.hci.utah.edu
-x Execute every 6 hrs until complete, defaults to just once, good for downloading
    latent glacier objects.

Example: java -Xmx20G -jar pathTo/GSync_X.X.jar -r -u -k -b hcibioinfo_gsync_repo 
     -q -a 90 -g 1 -d /Repo/DNA,/Repo/RNA,/Repo/Fastq -e [email protected]

**************************************************************************************
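
The -f, -a, and -g options above combine into one archiving test per file. The Java sketch below shows how such a test might look; it is not GSync's implementation. The class `ArchiveCandidateSketch` is hypothetical, and it assumes a gigabyte means 1024^3 bytes and that the minimums are inclusive.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch, not GSync code: decides whether a file qualifies for archiving. */
final class ArchiveCandidateSketch {
    // Treating a gigabyte as 1024^3 bytes is an assumption of this sketch.
    static final long BYTES_PER_GB = 1024L * 1024L * 1024L;

    /** Defaults taken from the help menu: -f extensions, -g 5, -a 120. */
    static final List<String> DEFAULT_EXTENSIONS = Arrays.asList(".bam", ".cram", ".gz", ".zip");
    static final double DEFAULT_MIN_GB = 5;
    static final int DEFAULT_MIN_DAYS = 120;

    /** True when the name, size, and age all meet the minimums (inclusive, by assumption). */
    static boolean qualifies(String name, long sizeBytes, long ageDays,
            List<String> extensions, double minGb, int minDays) {
        boolean extensionMatch = false;
        for (String extension : extensions) {
            // endsWith is case sensitive, matching the -f description.
            if (name.endsWith(extension)) {
                extensionMatch = true;
                break;
            }
        }
        return extensionMatch && sizeBytes >= minGb * BYTES_PER_GB && ageDays >= minDays;
    }

    public static void main(String[] args) {
        System.out.println(qualifies("tumor.bam", 6 * BYTES_PER_GB, 200,
                DEFAULT_EXTENSIONS, DEFAULT_MIN_GB, DEFAULT_MIN_DAYS));
    }
}
```

Symbolic links, which GSync ignores, would need to be filtered out before a check like this is applied.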