diff --git a/README.md b/README.md index 121446a..fd9b0e5 100644 --- a/README.md +++ b/README.md @@ -33,9 +33,192 @@ Users who join these GitHub teams agree to use the NASA Openscapes Hub only for Running large or parallel jobs over large geographic bounding boxes or over long temporal extents should be cleared with the NASA Openscapes Team by submitting an issue to this repo. - ## Removal From the NASA Openscapes Hub The NASA Openscapes Hub is a shared, limited resource that incurs real costs. Users are granted access under the terms above and are removed at the end of those limits. Users that haven’t accessed the Hub in more than six months are also removed for security purposes. We will do our best to alert users before they lose access to the NASA Openscapes Hub. However, we reserve the right to remove users at any time for any reason. Users that violate the terms of access or incur large Cloud costs without prior permission from the NASA Openscapes Team will be removed immediately. + +## Data Storage in the NASA Openscapes Hub + +Storing large amounts of data in the cloud can incur significant ongoing costs if not done optimally. We are charged daily for data stored in our Hub. We are developing technical strategies and policies to reduce storage costs that will keep the Openscapes 2i2c Hub a shared resource for us all to use, while also providing reusable strategies for other admins. + +The Hub uses an [EC2](https://aws.amazon.com/ec2/) compute instance, with the `$HOME` directory (`/home/jovyan/` in Python images and `/home/rstudio/` in R +images) mounted to [AWS Elastic File System (EFS)](https://aws.amazon.com/efs/) +storage. This drive is really handy because it persists across server +restarts and is a great place to store your code. However, the `$HOME` directory +should not be used to store data: EFS storage is expensive, and it can also be quite +slow to read from and write to. + +To that end, the Hub provides every user access to two [AWS +S3](https://aws.amazon.com/s3/) buckets - a "scratch" bucket for short-term +storage, and a "persistent" bucket for longer-term storage. AWS S3 buckets are +like online storage containers, accessible through the internet, where you can +store and retrieve files. S3 buckets offer fast reads and writes, and their storage costs are +much lower than those of your `$HOME` directory. All major +cloud providers offer a similar storage service - S3 is Amazon's version, +Google's is "Google Cloud Storage", and Microsoft's is "Azure Blob Storage". + +These buckets are accessible only when you are working inside the Hub; you can +access them using the following environment variables (a short example follows the list): + +- `$SCRATCH_BUCKET` pointing to `s3://openscapeshub-scratch/[your-username]` + - Scratch buckets are designed for storing temporary files, e.g. + intermediate results. Objects stored in a scratch bucket are removed 7 days +after their creation. +- `$PERSISTENT_BUCKET` pointing to `s3://openscapeshub-persistent/[your-username]` + - Persistent buckets are designed for storing data that is used consistently + throughout the lifetime of a project. There is no automatic purging of + objects in persistent buckets, so it is the responsibility of the Hub + admin and/or Hub users to delete objects when they are no longer needed to + minimize cloud billing costs.
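For example, you can copy a file into your persistent bucket and retrieve it later using the AWS CLI, which is available in the Hub as `awsv2`. This is a minimal sketch (`results.nc` is a hypothetical file name); the tutorial linked in the next section covers this workflow in more detail:

```
# copy a (hypothetical) output file from your $HOME directory to your persistent bucket
awsv2 s3 cp results.nc $PERSISTENT_BUCKET/results.nc

# list what you have stored there
awsv2 s3 ls $PERSISTENT_BUCKET/

# copy it back into your current working directory later
awsv2 s3 cp $PERSISTENT_BUCKET/results.nc .
```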
+ +### Using S3 Bucket Storage + +Please see the short tutorial in the Earthdata Cloud Cookbook on [Using S3 +Bucket Storage in NASA-Openscapes +Hub](https://nasa-openscapes.github.io/earthdata-cloud-cookbook/how-tos/using-s3-storage.html). + +### Data retention and archiving policy + +User `$HOME` directories will be retained for six months after their last use. +After a home directory has been idle for six months, it will be [archived to our +"archive" S3 bucket and removed](#how-to-archive-old-home-directories). If a +user requests their archive back, an admin can restore it for them. + +Once a user's home directory archive has been stored in the archive bucket for an +additional six months, it will be permanently deleted. After +this it can no longer be retrieved. + +In addition to these policies, admins will keep an eye on the +[Home Directory Usage Dashboard](https://grafana.openscapes.2i2c.cloud/d/bd232539-52d0-4435-8a62-fe637dc822be/home-directory-usage-dashboard?orgId=1) +in Grafana. When a user's home directory grows to over 100 GB, we +will contact them and work with them to reduce its size - by removing large +unnecessary files and moving the rest to the appropriate S3 +bucket (e.g., `$PERSISTENT_BUCKET`). + +## The `_shared` directory + +[The `_shared` directory](https://infrastructure.2i2c.org/topic/infrastructure/storage-layer/#shared-directories) +is a place where instructors can put workshop materials +for participants to access. It is mounted as `/home/jovyan/shared`, and is _read +only_ for all users. For those with admin access to the Hub, it is also mounted +as a writeable directory at `/home/jovyan/shared-readwrite`. + +This directory will follow the same policies as users' home directories: after +six months, contents will be archived to the "archive" S3 bucket (more below). +After an additional six months, the archive will be deleted. + +### How to archive old home directories (admin) + +To start, you will need to be an admin of the Openscapes JupyterHub so that +the `allusers` directory is mounted in your home directory. This directory contains +all users' home directories, and you have full read-write access to it. + +#### Finding large `$HOME` directories + +Look at the [Home Directory Usage +Dashboard](https://grafana.openscapes.2i2c.cloud/d/bd232539-52d0-4435-8a62-fe637dc822be/home-directory-usage-dashboard?orgId=1) +in Grafana to see the directories that haven't been used in a long time and/or +are very large. + +You can also view and sort users' directories by size in the Hub with the +following command, though this takes a while because it has to summarize _a lot_ +of files and directories. This will show the 30 largest home directories: + +``` +du -h --max-depth=1 /home/jovyan/allusers/ | sort -h -r | head -n 30 +``` + +#### Authenticate with the S3 archive bucket + +We have created an AWS IAM user called `archive-homedirs` with appropriate +permissions to write to the `openscapeshub-prod-homedirs-archive` bucket. +Get access keys for this user from the AWS console, and use these keys to +authenticate in the Hub. + +In the terminal, type: + +``` +awsv2 configure +``` + +Enter the access key and secret key at the prompts, and set the default region to +`us-west-2`.
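For reference, the `awsv2 configure` prompts look like the following; the key values are placeholders for the `archive-homedirs` access keys, and the output format can be left at its default:

```
awsv2 configure
# AWS Access Key ID [None]: <access key ID for archive-homedirs>
# AWS Secret Access Key [None]: <secret access key for archive-homedirs>
# Default region name [None]: us-west-2
# Default output format [None]:
```

As an optional extra check (not part of the original instructions), once the environment variables in the next step have been unset, running `awsv2 sts get-caller-identity` should report the `archive-homedirs` IAM user.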
You will also need to temporarily unset some AWS environment variables that have +been configured to authenticate with NASA S3 storage; these will be reset the next +time you log in: + +``` +unset AWS_ROLE_ARN +unset AWS_WEB_IDENTITY_TOKEN_FILE +``` + +Test to make sure you can access the archive bucket: + +``` +# test s3 access: +awsv2 s3 ls s3://openscapeshub-prod-homedirs-archive/archives/ +touch test123.txt +awsv2 s3 mv test123.txt s3://openscapeshub-prod-homedirs-archive/archives/ +awsv2 s3 rm s3://openscapeshub-prod-homedirs-archive/archives/test123.txt +``` + +#### Setting up and running the archive script + +We use a [Python script](scripts/archive-home-dirs.py), [developed by +@yuvipanda](https://github.com/2i2c-org/features/issues/32), that reproducibly +archives a list of users' directories into a specified S3 bucket. + +Copy the script into your home directory in the Hub or, even better, clone this +repo. + +In the Hub as of 2024-05-17, a couple of dependencies for the script are +missing; you can install them before running the script: + +``` +pip install escapism + +# I had solver errors with pigz so needed to use the classic solver. +# Also, the installation of pigz required a machine with >= 3.7GB memory +conda install conda-forge::pigz --solver classic +``` + +Create a text file, with one username per line, listing the users whose home +directories you would like to archive to S3. It will look like: + +``` +username1 +username2 +# etc... +``` + +Finally, run the script from the terminal, changing the parameter values as required: + +``` +python3 archive-home-dirs.py \ + --archive-name="archive-$(date +'%Y-%m-%d')" \ + --basedir=/home/jovyan/allusers/ \ + --bucket-name=openscapeshub-prod-homedirs-archive \ + --object-prefix="archives/" \ + --usernames-file=users-to-archive.txt \ + --temp-path=/home/jovyan/archive-staging/ +``` + +Omitted in the above example, but available to use, is the `--delete` flag, +which will delete the user's home directory once the archive is complete. + +If you don't use the `--delete` flag, first verify that the archive was successfully +completed, and then remove the user's home directory manually. + +By default, archives (`.tar.gz`) are created in your `/tmp` directory before +upload to the S3 bucket. The `/tmp` directory is cleared out when you shut down +the Hub. However, `/tmp` has limited space (80GB shared by up to four users on a +single node), so if you are archiving many large directories, you will likely +need to specify a location in your `$HOME` directory by passing a path to the +`--temp-path` argument (as in the example above). The script will endeavour to clean +up after itself and remove the `.tar.gz` file after uploading, but double-check that +directory when you are finished, or you may end up with copies of all of the other +users' home directories in your own `$HOME`!
diff --git a/scripts/archive-home-dirs.py b/scripts/archive-home-dirs.py new file mode 100644 index 0000000..6a32c94 --- /dev/null +++ b/scripts/archive-home-dirs.py @@ -0,0 +1,294 @@ +#!/usr/bin/env python3 +""" +Archive home directories onto object storage (like S3 or GCS). +Designed to be run manually, and takes care to not delete anything without a lot of +confirmation. + +Original script by Yuvi Panda (@yuvipanda), 2i2c.
+https://github.com/2i2c-org/features/issues/32#issue-2221427520 +""" + +import hashlib +import string +import sys +import shutil +import os +import argparse +import boto3 +from botocore.exceptions import ClientError +from escapism import escape +from pathlib import Path +from contextlib import contextmanager +import tempfile +import time +import subprocess +from functools import cache + +@cache +def get_tar_command() -> str: + """ + Return the tar command to use. + + We use `gnu` tar for compressing files, and Mac OS ships with bsd tar by + default. We detect this, and tell users to get gnu tar if needed for local + testing. Should not be an issue when running on containers. + """ + + out = subprocess.check_output(["tar", "--version"]).decode() + if out.startswith("tar (GNU tar)"): + return "tar" + else: + # We may be on Mac OS, and GNU Tar is not installed by default + # It can be installed from homebrew with `brew install gnu-tar`, + # which provides `gtar` + if shutil.which("gtar"): + return "gtar" + else: + print("Could not find GNU Tar on the system", file=sys.stderr) + print( + "If on Mac OS, please install gnu-tar with the following command (if using homebrew) and try again", + file=sys.stderr, + ) + print("brew install gnu-tar", file=sys.stderr) + sys.exit(1) + +def validate_homes_exist(basedir: Path, usernames: list[str], ignore_missing: bool): + """ + Validate that all given homedirectories for users exist + """ + + errors = [] + + for username in usernames: + escaped_username = escape( + username, safe=set(string.ascii_lowercase + string.digits), escape_char="-" + ).lower() + + # We should still protect against directory traversal attacks + user_home = (basedir / escaped_username).absolute() + + if basedir not in user_home.parents: + errors.append( + f"{user_home} refers to a directory outside of {basedir}, can not be archived" + ) + + if not user_home.exists() and not ignore_missing: + errors.append( + f"{username}'s home directory does not exist inside {basedir}, {user_home} not found" + ) + + if errors: + print( + "The following errors were found when trying to validate that all user home directories exist", + file=sys.stderr, + ) + print("\n".join(errors), file=sys.stderr) + sys.exit(1) + +@contextmanager +def archive_dir(dir_path: Path, archive_name: str, temp_path: str): + """ + Archive given directory reproducibly to out_path + """ + + start_time = time.perf_counter() + + with tempfile.TemporaryDirectory(dir=temp_path) as d: + target_file = Path(d) / (archive_name + ".tar.gz") + cmd = [ + get_tar_command(), + f"--directory={dir_path}", + "--sort=name", + "--numeric-owner", + "--create", + "--use-compress-program=pigz", + f"--file={target_file}", + ] + ["."] + env = os.environ.copy() + # Set GZip / pigz option to not write timestamp so we get consistent hashes + env["GZIP"] = "-n" + try: + # Capture output and fail explicitly on non-0 error code + # Primarily to get rid of tar: Removing leading `/' from member names + subprocess.check_output(cmd, stderr=subprocess.STDOUT, env=env) + except subprocess.CalledProcessError as e: + print(f"Executing {e.cmd} failed with code {e.returncode}", file=sys.stderr) + print(f"stdout: {e.stdout}", file=sys.stderr) + print(f"stderr: {e.stderr}", file=sys.stderr) + sys.exit(1) + duration = time.perf_counter() - start_time + + file_size_gb = target_file.stat().st_size / 1024 / 1024 / 1024 + print( + f"Tarballing {dir_path.name} to {archive_name}.tar.gz ({file_size_gb:0.3f} GB) took {duration:0.2f}s" + ) + + yield target_file + +def 
sha256_file(filepath: Path) -> str: + + h = hashlib.sha256() + b = bytearray(128*1024) + mv = memoryview(b) + with open(filepath, "rb", buffering=0) as f: + for n in iter(lambda : f.readinto(mv), 0): + h.update(mv[:n]) + return h.hexdigest() + + +def archive_user( + s3_client, + basedir: Path, + username: str, + archive_name: str, + bucket_name: str, + prefix: str, + ignore_missing: bool, + delete: bool, + temp_path: str +): + + escaped_username = escape( + username, safe=set(string.ascii_lowercase + string.digits), escape_char="-" + ).lower() + + homedir = basedir / escaped_username + + if ignore_missing and not homedir.exists(): + print(f"User {username} does not exist, skipping archival") + return + + print(f"Archiving {username}") + with archive_dir(homedir, f"{escaped_username}-{archive_name}", temp_path) as archived_file: + # Make sure the object key has the same extension as the compressed file we have + object_name = os.path.join(prefix, username, archive_name) + "".join( + archived_file.suffixes + ) + sha256sum = sha256_file(archived_file) + try: + head_response = s3_client.head_object(Bucket=bucket_name, Key=object_name) + # If we are here, it means that the file *does* exist + if head_response["Metadata"].get("sha256sum") == sha256sum: + # We have already uploaded this, and the hashes match! + needs_upload = False + else: + # This file exists, *but hashes do not match!* + # This is an error condition, and we abort so we don't overwrite user files + print(head_response) + print("AAAAAAAAAAA", file=sys.stderr) + sys.exit(1) + except ClientError as e: + if e.response.get("Error", {}).get("Code") == "404": + # Does not exist, needs to be uploaded + needs_upload = True + else: + # Some other issue, let's just fail + raise + if needs_upload: + start_time = time.perf_counter() + print(f"Uploading {username}...") + s3_client.upload_file( + archived_file, + bucket_name, + object_name, + ExtraArgs={"Metadata": {"sha256sum": sha256sum}}, + ) + duration = time.perf_counter() - start_time + print(f"Upload for {username} complete in {duration:0.2f}s") + else: + if delete: + start_time = time.perf_counter() + print(f"Already uploaded, going to delete {username}") + shutil.rmtree(homedir) + duration = time.perf_counter() - start_time + print(f"Already uploaded, deleted {username} in {duration:0.2f}s") + else: + print(f"Username already uploaded, skipping.") + +def main(): + + argparser = argparse.ArgumentParser() + + argparser.add_argument( + "--archive-name", + help="Name for user home directory in", + required=True, + ) + + argparser.add_argument( + "--basedir", + help="Base directory containing user home directories", + required=True, + ) + + argparser.add_argument( + "--object-store", + choices=("s3",), + default="s3", + help="Type of object store to upload files to", + ) + + argparser.add_argument( + "--bucket-name", + help="Name of object storage bucket to upload archived files to", + required=True, + ) + + argparser.add_argument( + "--object-prefix", + help="Prefix to use before username when uploading archives", + default="a/", + ) + + argparser.add_argument( + "--usernames-file", + help="File with list of usernames to archive, one per line", + required=True, + ) + + argparser.add_argument( + "--ignore-missing", + help="Ignore missing user home directories", + action="store_true", + ) + + argparser.add_argument( + "--delete", + help="Delete home directories after uploading", + action="store_true" + ) + + argparser.add_argument( + "--temp-path", + help="Location to write the archive 
before uploading; default _tempdir_ uses the default tempdir.", + default=None + ) + + args = argparser.parse_args() + + basedir = Path(args.basedir).absolute() + usernames = [] + with open(args.usernames_file) as f: + for line in f: + if line.startswith("#"): + continue + usernames.append(line.strip()) + + validate_homes_exist(basedir, usernames, args.ignore_missing) + + s3_client = boto3.client("s3") + for username in usernames: + archive_user( + s3_client, + basedir, + username, + args.archive_name, + args.bucket_name, + args.object_prefix, + args.ignore_missing, + args.delete, + args.temp_path + ) + +if __name__ == "__main__": + main()