# AwsApps
Genomic data focused toolkit for working with AWS services (e.g. S3 and EC2). Includes exhaustive JUnit testing for each app.

<pre>
u0028003$ java -jar -Xmx1G ~/Code/AwsApps/target/S3Copy_0.3.jar
**************************************************************************************
** S3 Copy : Feb 2024 **
**************************************************************************************
S3 Copy (SC) copies AWS S3 objects, unarchiving them as needed, within the same or different
accounts or downloads them to your local computer. Run this as a daemon with -l or run
repeatedly until complete. To upload files to S3, use the AWS CLI.

To use the app:
Create a ~/.aws/credentials file with your access, secret, and region info, chmod
600 the file and keep it private. Use a text editor or the AWS CLI configure
command; see https://aws.amazon.com/cli. Example ~/.aws/credentials file:
[default]
aws_access_key_id = AKIARHBDRGYUIBR33RCJK6A
aws_secret_access_key = BgDV2UHZv/T5ENs395867ueESMPGV65HZMpUQ
region = us-west-2
Repeat these entries for multiple accounts replacing the word 'default' with a single
unique account name.

Required:
-j Provide a comma delimited string of copy jobs or a txt file with one per line.
A copy job consists of a full S3 URI as the source and a destination separated
by '>', e.g. 's3://source/tumor.cram > s3://destination/collabTumor.cram' or
folders 's3://source/alignments/tumor > s3://destination/Collab/' or local
's3://source/alignments/tumor > .' Note, the trailing '/' is required in the
S3 destination for a recursive copy or when the local folder doesn't exist.

Optional/Defaults:
-d Perform a dry run to list the actions that would be taken
-r Perform a recursive copy, defaults to an exact source key match
-e Email address(es) to send status messages, comma delimited, no spaces. Note,
the sendmail app must be configured on your system. Test it:
echo 'Subject: Hello' | sendmail [email protected]
-x Expedite archive retrieval, increased cost $0.03/GB vs $0.01/GB, 1-5min vs 3-12hr,
defaults to standard.
-l Execute every hour (standard) or minute (expedited) until complete
-t Maximum threads to utilize, defaults to 8
-p AWS credentials profile, defaults to 'default'
-n Number of days to keep restored files in S3, defaults to 1
-a Print instructions for copying files between different accounts

Example: java -Xmx10G -jar pathTo/S3Copy_x.x.jar -e [email protected] -p obama -d -l
-j 's3://source/Logs.zip>s3://destination/,s3://source/normal > ~/Downloads/' -r
**************************************************************************************
</pre>

<pre>
u0028003$ java -jar -Xmx1G ~/Code/AwsApps/target/VersionManager_0.2.jar
**************************************************************************************
** AWS S3 Version Manager : August 2023 **
**************************************************************************************
Bucket versioning in S3 protects objects from being deleted or overwritten by hiding
the original when 'deleting' or overwriting an existing object. Use this tool to
delete these hidden S3 objects and any delete markers from your buckets. Use the
options to select particular redundant objects to delete in a dry run, review the
actions, and rerun it with the -r option to actually delete them. This app will not
delete any isLatest=true object.

WARNING! This app has the potential to destroy precious data. TEST IT on a
pilot system before deploying in production. Although extensively unit tested, this
app is provided with no guarantee of proper function.

To use the app:
1) Enable S3 Object versioning on your bucket.
2) Install and configure the aws cli with your region, access and secret keys. See
https://aws.amazon.com/cli
3) Use cli commands like 'aws s3 rm s3://myBucket/myObj.txt' or the AWS web Console to
'delete' particular objects. Then run this app to actually delete them.

Required Parameters:
-b Versioned S3 bucket name

Optional Parameters:
-r Perform a real run, defaults to a dry run where no objects are deleted
-c Credentials profile name, defaults to 'default'
-a Minimum age, in days, of object to delete, defaults to 30
-s Object key suffixes to delete, comma delimited, no spaces
-p Object key prefixes to delete, comma delimited, no spaces
-v Verbose output
-t Maximum threads to use, defaults to 8

Example: java -Xmx10G -jar pathTo/VersionManager_X.X.jar -b mybucket-vm-test
-s .cram,.bam,.gz,.zip -a 7 -c MiloLab

**************************************************************************************
</pre>

See [Misc/WorkingWithTheAWSJobRunner.pdf](https://github.com/HuntsmanCancerInstitute/AwsApps/blob/master/Misc/WorkingWithTheAWSJobRunner.pdf) for details.
<pre>
u0028003$ java -jar -Xmx1G ~/Code/AwsApps/target/JobRunner_0.3.jar

**************************************************************************************
...
Example: java -jar -Xmx1G JobRunner.jar -x -t
-c 'https://my-jr.s3.us-west-2.amazonaws.com/aws.cred.txt?X-AmRun...'

**************************************************************************************
</pre>

<pre>
u0028003$ java -jar -Xmx1G ~/Code/AwsApps/target/GSync_0.6.jar
**************************************************************************************
** GSync : June 2020 **
**************************************************************************************
GSync pushes files with a particular extension that exceed a given size and age to
Amazon's S3 object store. Associated genomic index files are also moved. Once
correctly uploaded, GSync replaces the original file with a local txt placeholder file
containing information about the S3 object. Files are restored or deleted by modifying
the name of the placeholder file. Symbolic links are ignored.

WARNING! This app has the potential to destroy precious genomic data. TEST IT on a
pilot system before deploying in production. BACKUP your local files and ENABLE S3
Object Versioning before running. This app is provided with no guarantee of proper
function.

To use the app:
1) Create a new S3 bucket dedicated solely to this purpose. Use it for nothing else.
2) Enable S3 Object Locking and Versioning on the bucket to assist in preventing
accidental object overwriting. Add lifecycle rules to AbortIncompleteMultipartUpload
and to transition objects to Glacier Deep Archive.
3) It is a good policy when working on AWS S3 to limit your ability to accidentally
delete buckets and objects. To do so, create and assign yourself to an AWS Group
called AllExceptS3Delete with a custom permission policy that denies s3:Delete*:
{"Version": "2012-10-17", "Statement": [
{"Effect": "Allow", "Action": "*", "Resource": "*"},
{"Effect": "Deny", "Action": "s3:Delete*", "Resource": "*"} ]}
For standard upload and download gsyncs, assign yourself to the AllExceptS3Delete
group. When you need to delete or update objects, switch to the Admin group, then
switch back. Accidental overwrites are OK since object versioning is enabled.
To add another layer of protection, apply object legal holds via the AWS CLI.
4) Create a ~/.aws/credentials file with your access, secret, and region info, chmod
600 the file and keep it private. Use a text editor or the AWS CLI configure
command; see https://aws.amazon.com/cli. Example ~/.aws/credentials file:
[default]
aws_access_key_id = AKIARHBDRGYUIBR33RCJK6A
aws_secret_access_key = BgDV2UHZv/T5ENs395867ueESMPGV65HZMpUQ
region = us-west-2
5) Execute GSync to upload large old files to S3 and replace them with a placeholder
file named xxx.S3.txt
6) To download and restore an archived file, rename the placeholder
xxx.S3.txt.restore and run GSync.
7) To delete an S3 archived file, its placeholder, and any local files, rename the
placeholder xxx.S3.txt.delete and run GSync.
Before executing, switch the GSync/AWS user to the Admin group.
8) Placeholder files may be moved, see -u

Required:
-d One or more local directories with the same parent to sync. This parent dir
becomes the base key in S3, e.g. BucketName/Parent/.... Comma delimited, no
spaces, see the example.
-b Dedicated S3 bucket name

Optional:
-f File extensions to consider, comma delimited, no spaces, case sensitive. Defaults
to '.bam,.cram,.gz,.zip'
-a Minimum days old for archiving, defaults to 120
-g Minimum gigabyte size for archiving, defaults to 5
-r Perform a real run, defaults to just listing the actions that would be taken.
-k Delete local files that were successfully uploaded.
-u Update S3 Object keys to match current placeholder paths.
-c Recreate deleted placeholder files using info from orphaned S3 Objects.
-q Quiet verbose output.
-e Email addresses to send gsync messages, comma delimited, no spaces.
-s SMTP host, defaults to hci-mail.hci.utah.edu
-x Execute every 6 hrs until complete, defaults to just once, good for downloading
latent glacier objects.

Example: java -Xmx20G -jar pathTo/GSync_X.X.jar -r -u -k -b hcibioinfo_gsync_repo
-q -a 90 -g 1 -d /Repo/DNA,/Repo/RNA,/Repo/Fastq -e [email protected]

**************************************************************************************
</pre>
