Merge pull request #98 from slaclab/prod
push to prod
Showing 7 changed files with 117 additions and 70 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,44 @@ | ||
# Status & Outages | ||
|
||
## Support during Winter Shutdown | ||
|
||
S3DF will remain operational over the Winter shutdown (Dec 21st 2024 to Jan 5th 2025). Staff will be taking time off as per SLAC guidelines. S3DF resources will continue to be managed remotely if there are interruptions to operations. Response times for issues will vary, depending on the criticality of the issue as detailed below. | ||
|
||
**Contacting S3DF staff for issues:** | ||
Users should email [email protected] for ALL issues (critical and non-critical) providing full details of the problem (including what resources were being used, the impact and other information that may be useful in resolving the issue). | ||
We will post status updates for critical issues to the #comp-sdf Slack channel as they are being worked on. | ||
[This S3DF status web-page](https://s3df.slac.stanford.edu/#/changelog) will also have any updates on current issues. | ||
If a critical issue has not received a response within 2 hours of being reported, please contact your [Facility Czar](https://s3df.slac.stanford.edu/#/contact-us) for escalation. | ||
|
||
**Critical issues** will be responded to as we become aware of them, except during Dec 24-25 and Dec 31-Jan 1, when they will be handled as soon as possible depending on staff availability. | ||
* Critical issues are defined as full (system-wide) outages that impact: | ||
    * Access to S3DF resources including | ||
        * All SSH logins | ||
        * All IANA interactive resources | ||
        * B50 compute resources(*) | ||
        * Bullet Cluster | ||
    * Access to all of the S3DF storage | ||
        * Home directories | ||
        * Group, Data and Scratch filesystems | ||
        * B50 Lustre, GPFS and NFS storage(*) | ||
    * Batch system access to S3DF Compute resources | ||
    * S3DF Kubernetes vClusters | ||
    * VMware clusters | ||
        * S3DF virtual machines | ||
        * B50 virtual machines(*) | ||
* Critical issues for other SCS-managed systems and services for Experimental system support will be managed in conjunction with the experiment as appropriate. This includes | ||
    * LCLS workflows | ||
    * Rubin USDF resources | ||
    * CryoEM workflows | ||
    * Fermi workflows | ||
(*) B50 resources are also dependent on SLAC-IT resources being available. | ||
|
||
**Non-critical issues** will be responded to in the order they were received in the ticketing system when normal operations resume after the Winter Shutdown. Non-critical issues include: | ||
* Individual node outages in the compute or interactive pool | ||
* Variable or unexpected performance issues for compute, storage or networking resources | ||
* Batch job errors (that do not impact overall batch system scheduling) | ||
* Tape restores and data transfer issues | ||
|
||
## Outages | ||
|
||
### Current | ||
|
@@ -10,6 +49,9 @@ | |
|
||
|When |Duration | What | | ||
| --- | --- | --- | | ||
|Dec 10 2024|Ongoing (unplanned)|StaaS GPFS disk array outage (partial /gpfs/slac/staas/fs1 unavailability)| | ||
| Dec 3 2024 | 1 hr (planned) | Mandatory upgrade of the Slurm controller, the database, and the client components on all batch nodes, Kubernetes nodes, and interactive nodes. | ||
|Nov 18 2024|8 days (unplanned)|StaaS GPFS disk array outage (partial /gpfs/slac/staas/fs1 unavailability)| | ||
|Oct 21 2024 |10 hrs (planned)| Upgrade to all S3DF Weka clusters. We do NOT anticipate service interruptions. | ||
|Oct 3 2024 |1.5 hrs (unplanned)| Storage issue impacted home directory access and SSH logins | ||
|Jul 10 2024 |4 days (planned)| Urgent electrical maintenance is required in SRCF datacenter | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -21,21 +21,30 @@ S3DF and you don't see your facility in this table. | |
|
||
 |Facility | PoC | Primary POSIX group| | ||
 |--- |--- |--- | | ||
-|Rubin | Richard Dubois | rubin_users | | ||
+|Rubin | James Chiang, Adam Bolton | rubin_users | | ||
 |SuperCDMS | Concetta Cartaro | cdms | | ||
 |LCLS | [email protected] | ps-users | | ||
-|MLI| Daniel Ratner | - | | ||
-|Neutrino| Kazuhiro Terao | - | | ||
-|AD | Greg White | - | | ||
+|MLI| Daniel Ratner | mli | | ||
+|Neutrino| Kazuhiro Terao | nu | | ||
+|AD | Greg White | cd | | ||
 |SUNCAT | Johannes Voss| suncat-norm | | ||
-|Fermi | Richard Dubois| glast-pipeline | | ||
+|Fermi | Seth Digel, Nicola Omodei| glast-pipeline | | ||
 |EPPTheory | Tom Rizzo | theorygrp | | ||
-|FACET | Nathan Majernik | - | | ||
-|DESC | Tom Glanzman | desc | | ||
-|KIPAC | Stuart Marshall | ki | | ||
+|FACET | Nathan Majernik | facet | | ||
+|DESC | Heather Kelly | desc | | ||
+|KIPAC | Marcelo Alvarez | ki | | ||
 |RFAR | David Bizzozero | rfar | | ||
-|SIMES | Tom Devereaux | - | | ||
-|CryoEM | Yee Ting Li | - | | ||
-|SSRL | Riti Sarangi | - | | ||
-|LDMX | Omar Moreno | - | | ||
-|HPS | Omar Moreno | - | | ||
+|SIMES | Tom Devereaux, Brian Moritz | simes | | ||
+|CryoEM | Patrick Pascual | cryo-data | | ||
+|SSRL | Riti Sarangi | ssrl | | ||
+|LDMX | Omar Moreno | ldmx | | ||
+|HPS | Mathew Graham | hps | | ||
+|EXO | Brian Mong | exo | | ||
+|ATLAS | Wei Yang, Michael Kagan | atlas | | ||
+|CDS | Ernest Williams | cds | | ||
+|SRS | Tony Johnson | srs | | ||
+|FADERS | Ryan Herbst | faders | | ||
+|TOPAS | Joseph Perl | topas | | ||
+|RP | Thomas Frosio | esh-rp | | ||
+|Projects | Yemi Adesanya, Ryan Herbst | - | | ||
+|SCS | Omar Quijano, Yee Ting Li, Gregg Thayer | - | |