
Commit 0df2cd3

Merge pull request #98 from slaclab/prod

push to prod

yee379 authored Dec 27, 2024
2 parents fa8fcc8 + 0e5fbca commit 0df2cd3
Showing 7 changed files with 117 additions and 70 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -6,6 +6,7 @@ and the Rubin observatory. The S3DF infrastructure is optimized for
data analytics and is characterized by large, massive-throughput,
high-concurrency storage systems.

**December 26th 8:00am PST: ALL S3DF services are currently DOWN/unavailable. We are investigating and will provide an update later today.**

## Quick Reference

52 changes: 28 additions & 24 deletions accounts-and-access.md
@@ -3,37 +3,41 @@
## How to get an account :id=access

If you are a SLAC employee, affiliated researcher, or experimental
facility user, you are eligible for an S3DF account. ***S3DF authentication requires a SLAC UNIX account. The legacy SDF 1.0 environment required a SLAC Active Directory (Windows) account. These are not the same authentication system.***


1. If you don't already have a SLAC UNIX account (the credentials used to log in to SLAC UNIX clusters such as `rhel6-64` and `centos7`), you will need to acquire one by following these instructions. **If you already have an active SLAC UNIX account, skip to step 2**:
   * Affiliated users/experimental facility users: Obtain a SLAC ID via the [Scientific Collaborative Researcher Registration process](https://it.slac.stanford.edu/identity/scientific-collaborative-researcher-registration) form (SLAC employees should already have a SLAC ID number).
   * Take the appropriate cybersecurity SLAC training course via the [SLAC training portal](https://slactraining.slac.stanford.edu/how-access-the-web-training-portal):
     * All lab users and non-SLAC/Stanford employees: "CS100: Cyber Security for Laboratory Users Training".
     * All SLAC/Stanford employees or term employees of SLAC or the University: "CS200: Cyber Security Training for Employees".
     * Depending on your role, you may be required to take additional cybersecurity training. Consult your supervisor or SLAC Point of Contact (POC) for more details.
   * Ask your [SLAC POC](contact-us.md#facpoc) to submit a ticket to SLAC IT requesting a UNIX account. In your request indicate your SLAC ID and your preferred account name (include a second choice in case your preferred username is unavailable).
2. Register your SLAC UNIX account in S3DF:
   * Log into the [Coact S3DF User Portal](https://s3df.slac.stanford.edu/coact) using your SLAC UNIX account via the "Log in with S3DF (unix)" option.
   * Click on "Repos" in the menu bar.
   * Click the "Request Access to Facility" button and select a facility from the dropdown.
   * Include your affiliation and other contextual information for your request in the "Notes" field, then submit.
   * A czar for the S3DF facility you requested access to will review your request. **Once approved by a facility czar**, the registration process should be completed in about 1 hour.

?> To access files and folders in facilities such as Rubin and LCLS, you will need to ask your
SLAC POC to add your username to the [POSIX
group](contact-us.md#facpoc) that manages access to that facility's
storage space. In the future, access to facility storage will be part of the S3DF registration process in Coact.
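
Once your POC has added you, you can verify the membership with standard Linux tools (a quick sketch; note that a fresh login may be needed before new group membership shows up, and `rubin_users` is just an example group name from the facility table in [contact-us](contact-us.md#facpoc)):

```bash
# List all POSIX groups your account currently belongs to
id -nG

# Show the membership of a specific facility group, e.g. rubin_users
getent group rubin_users
```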


?> SLAC IT is currently working on providing federated access to SLAC
resources, which will enable authentication to SLAC computing systems
with a user's home institution account rather than a SLAC account.
Federated authentication is expected to be available in late 2024.

## Managing your UNIX account password

You can change your password via [the SLAC UNIX self-service password update site](https://unix-password.slac.stanford.edu/).

If you have forgotten your password and need to reset it, [please contact the IT Service Desk](https://it.slac.stanford.edu/support).

Make sure you comply with all SLAC training and cybersecurity requirements to avoid having your account disabled. You will be notified of these requirements via email.


## How to connect
@@ -68,6 +72,6 @@ use applications like Jupyter, you can also launch a web-based
terminal using OnDemand:\
[`https://s3df.slac.stanford.edu/ondemand`](https://s3df.slac.stanford.edu/ondemand).\
You can find more information about using OnDemand in the [OnDemand
reference](interactive-compute.md#ondemand).
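
If you prefer a plain terminal over the browser, an SSH login looks like the following (a minimal sketch: `s3dflogin.slac.stanford.edu` is assumed here as the S3DF login host, and `<unix-username>` stands for your SLAC UNIX account):

```bash
# Connect to an S3DF login node with your SLAC UNIX credentials
ssh <unix-username>@s3dflogin.slac.stanford.edu
```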

![S3DF users access](assets/S3DF_users_access.png)
4 changes: 2 additions & 2 deletions batch-compute.md
@@ -8,7 +8,7 @@ that the compute resources available in S3DF are fairly and
efficiently shared and distributed for all users. This page describes
S3DF-specific Slurm information. If you haven't used Slurm before, you
can find general information on using this workflow manager in our
[Slurm reference FAQ](reference.md#slurm-faq).
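
As a quick orientation, a minimal batch script looks like the following (a sketch only: the partition name `milano` is taken from the node table below, and `<repo>` stands for whichever repo/account your jobs are banked against):

```bash
#!/bin/bash
#SBATCH --partition=milano    # CPU partition from the node table below
#SBATCH --account=<repo>      # repo/account used for banking (illustrative name)
#SBATCH --job-name=hello
#SBATCH --output=hello-%j.out
#SBATCH --time=00:05:00
#SBATCH --ntasks=1

echo "Hello from $(hostname)"
```

Submit it with `sbatch hello.sh` and check its state with `squeue -u $USER`.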

## Clusters & Repos

@@ -84,7 +84,7 @@ cluster/partition.
| milano | Milan 7713 | 120 | 480 GB | - | - | 300 GB | 136 |
| ampere | Rome 7542 | 112 (hyperthreaded) | 952 GB | Tesla A100 (40GB) | 4 | 14 TB | 42 |
| turing | Intel Xeon Gold 5118 | 40 (hyperthreaded) | 160 GB | NVIDIA GeForce 2080Ti | 10 | 300 GB | 27 |
| ada | AMD EPYC 9454 | 72 (hyperthreaded) | 702 GB | NVIDIA L40S | 10 | 21 TB | 6 |
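
For example, to grab an interactive shell with a single GPU on one of the A100 nodes listed above (a sketch; replace `<repo>` with your own repo/account):

```bash
srun --partition=ampere --account=<repo> --gpus=1 --pty /bin/bash
```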

### Banking

42 changes: 42 additions & 0 deletions changelog.md
@@ -1,5 +1,44 @@
# Status & Outages

## Support during Winter Shutdown

S3DF will remain operational over the Winter shutdown (Dec 21st 2024 to Jan 5th 2025). Staff will be taking time off as per SLAC guidelines, and S3DF resources will continue to be managed remotely if operations are interrupted. Response times will vary depending on the criticality of the issue, as detailed below.

**Contacting S3DF staff for issues:**
Users should email [email protected] for ALL issues (critical and non-critical), providing full details of the problem (including what resources were being used, the impact, and any other information that may be useful in resolving the issue).
For critical issues, we will post status updates to the #comp-sdf Slack channel as they are worked on.
[This S3DF status web-page](https://s3df.slac.stanford.edu/#/changelog) will also carry updates on current issues.
If a critical issue has not been responded to within 2 hours of being reported, please contact your [Facility Czar](https://s3df.slac.stanford.edu/#/contact-us) for escalation.

**Critical issues** will be responded to as we become aware of them, except during Dec 24-25 and Dec 31-Jan 1, when they will be handled as soon as possible depending on staff availability.
* Critical issues are defined as full (system-wide) outages that impact:
  * Access to S3DF resources, including:
    * All SSH logins
    * All IANA interactive resources
    * B50 compute resources (*)
    * Bullet Cluster
  * Access to all of the S3DF storage:
    * Home directories
    * Group, Data and Scratch filesystems
    * B50 Lustre, GPFS and NFS storage (*)
  * Batch system access to S3DF compute resources
  * S3DF Kubernetes vClusters
  * VMware clusters:
    * S3DF virtual machines
    * B50 virtual machines (*)
* Critical issues for other SCS-managed systems and services supporting experiments will be handled in conjunction with the experiment as appropriate. This includes:
  * LCLS workflows
  * Rubin USDF resources
  * CryoEM workflows
  * Fermi workflows

(*) B50 resources are also dependent on SLAC-IT resources being available.

**Non-critical issues** will be responded to in the order they were received in the ticketing system when normal operations resume after the Winter Shutdown. Non-critical issues include:
* Individual node outages in the compute or interactive pool
* Variable or unexpected performance issues for compute, storage or networking resources
* Batch job errors (that do not impact overall batch system scheduling)
* Tape restores and data transfer issues

## Outages

### Current
@@ -10,6 +49,9 @@

|When |Duration | What |
| --- | --- | --- |
|Dec 10 2024|Ongoing (unplanned)|StaaS GPFS disk array outage (partial /gpfs/slac/staas/fs1 unavailability)|
| Dec 3 2024 | 1 hr (planned) | Mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes.|
|Nov 18 2024|8 days (unplanned)|StaaS GPFS disk array outage (partial /gpfs/slac/staas/fs1 unavailability)|
|Oct 21 2024 |10 hrs (planned)| Upgrade to all S3DF Weka clusters. We do NOT anticipate service interruptions.|
|Oct 3 2024 |1.5 hrs (unplanned)| Storage issue impacted home directory access and SSH logins|
|Jul 10 2024 |4 days (planned)| Urgent electrical maintenance was required in SRCF datacenter|
35 changes: 22 additions & 13 deletions contact-us.md
@@ -21,21 +21,30 @@ S3DF and you don't see your facility in this table.

|Facility | PoC | Primary POSIX group|
|--- |--- |--- |
|Rubin | James Chiang, Adam Bolton | rubin_users |
|SuperCDMS | Concetta Cartaro | cdms |
|LCLS | [email protected] | ps-users |
|MLI| Daniel Ratner | mli |
|Neutrino| Kazuhiro Terao | nu |
|AD | Greg White | cd |
|SUNCAT | Johannes Voss| suncat-norm |
|Fermi | Seth Digel, Nicola Omodei| glast-pipeline |
|EPPTheory | Tom Rizzo | theorygrp |
|FACET | Nathan Majernik | facet |
|DESC | Heather Kelly | desc |
|KIPAC | Marcelo Alvarez | ki |
|RFAR | David Bizzozero | rfar |
|SIMES | Tom Devereaux, Brian Moritz | simes |
|CryoEM | Patrick Pascual | cryo-data |
|SSRL | Riti Sarangi | ssrl |
|LDMX | Omar Moreno | ldmx |
|HPS | Mathew Graham | hps |
|EXO | Brian Mong | exo |
|ATLAS | Wei Yang, Michael Kagan | atlas |
|CDS | Ernest Williams | cds |
|SRS | Tony Johnson | srs |
|FADERS | Ryan Herbst | faders |
|TOPAS | Joseph Perl | topas |
|RP | Thomas Frosio | esh-rp |
|Projects | Yemi Adesanya, Ryan Herbst | - |
|SCS | Omar Quijano, Yee Ting Li, Gregg Thayer | - |