From 3b2d8eec7a25d78ddfac945e019abac0d345e1cc Mon Sep 17 00:00:00 2001
From: YemBot
Date: Sun, 29 Dec 2024 11:07:24 -0800
Subject: [PATCH 01/16] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 129088e..bde899b 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@ and the Rubin observatory. The S3DF infrastructure is optimized for data analytics and is characterized by large, massive throughput, high concurrency storage systems.
 
-**December 26th 8:00am PST: ALL S3DF services are currently DOWN/unavailable. We are investigating and will provide an update later today.**
+**December 29th 11:00am PST: S3DF and services are available. However, a few of the interactive nodes remain down as we still have to address some network issues related to them. Users will still be able to access interactive resources via the "iana" pool if facility-specific resources are unavailable. As always, please do continue to report issues via email to s3df-help@slac.stanford.edu**
 
 ## Quick Reference
 

From f42de60449bb23fa4e92844c6662461980fc2000 Mon Sep 17 00:00:00 2001
From: YemBot
Date: Sun, 29 Dec 2024 11:15:10 -0800
Subject: [PATCH 02/16] Update changelog.md

---
 changelog.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/changelog.md b/changelog.md
index 4976cd1..bd5d09c 100644
--- a/changelog.md
+++ b/changelog.md
@@ -49,7 +49,8 @@ If critical issues are not responded to within 2 hours of reporting the issue pl
 
 |When |Duration | What |
 | --- | --- | --- |
-|Dec 10 2024|Ongoing (unplanned)|StaaS GPFS disk array outage (partial /gpfs/slac/staas/fs1 unavailability)|
+|Dec 26 2024| 1 days (unplanned)|One of our core networking switches in the data center failed and had to be replaced. The fall-out from this impacted other systems and services on S3DF. Staff worked through the night on stabilization of the network devices and connections as well as recovery of the storage subsystem.|
+|Dec 10 2024|(unplanned)|StaaS GPFS disk array outage (partial /gpfs/slac/staas/fs1 unavailability)|
 | Dec 3 2024 | 1 hr (planned) | Mandatory upgrade of the slurm controller, the database, and the client components on all batch nodes, kubernetes nodes, and interactive nodes.
 |Nov 18 2024|8 days (unplanned)|StaaS GPFS disk array outage (partial /gpfs/slac/staas/fs1 unavailability)|
 |Oct 21 2024 |10 hrs (planned)| Upgrade to all S3DF Weka clusters. We do NOT anticipate service interruptions.

From 5d533ec7477ba4b8fef3cbe770d5023b14e52cd2 Mon Sep 17 00:00:00 2001
From: YemBot
Date: Thu, 2 Jan 2025 13:38:39 -0800
Subject: [PATCH 03/16] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index bde899b..7f89a5a 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@ and the Rubin observatory. The S3DF infrastructure is optimized for data analytics and is characterized by large, massive throughput, high concurrency storage systems.
 
-**December 29th 11:00am PST: S3DF and services are available. However, a few of the interactive nodes remain down as we still have to address some network issues related to them. Users will still be able to access interactive resources via the "iana" pool if facility-specific resources are unavailable. As always, please do continue to report issues via email to s3df-help@slac.stanford.edu**
+**January 2nd 1:30pm PST: ALL S3DF services remain DOWN (except for login bastions). The team is actively troubleshooting. We will provide another update at 6pm PST today.**
 
 ## Quick Reference
 

From 3484c2d642284d3bbdd47dab7d148c4074dfaab0 Mon Sep 17 00:00:00 2001
From: YemBot
Date: Thu, 2 Jan 2025 13:43:36 -0800
Subject: [PATCH 04/16] Update changelog.md

---
 changelog.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/changelog.md b/changelog.md
index bd5d09c..c6cec88 100644
--- a/changelog.md
+++ b/changelog.md
@@ -43,6 +43,8 @@ If critical issues are not responded to within 2 hours of reporting the issue pl
 
 ### Current
 
+**January 2nd 1:30pm PST: ALL S3DF services remain DOWN (except for login bastions). The team is actively troubleshooting. We will provide another update at 6pm PST today.**
+
 ### Upcoming
 
 ### Past

From e02c00c2b37b59c1fcf4662b934473489cd37a7a Mon Sep 17 00:00:00 2001
From: YemBot
Date: Thu, 2 Jan 2025 18:14:13 -0800
Subject: [PATCH 05/16] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 7f89a5a..21cefc5 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@ and the Rubin observatory. The S3DF infrastructure is optimized for data analytics and is characterized by large, massive throughput, high concurrency storage systems.
 
-**January 2nd 1:30pm PST: ALL S3DF services remain DOWN (except for login bastions). The team is actively troubleshooting. We will provide another update at 6pm PST today.**
+**January 2nd 6:00pm PST: ALL S3DF services remain DOWN (except for login bastions). The team is actively troubleshooting. We will provide another update at 10am PST tomorrow.**
 
 ## Quick Reference
 

From eaba02f0051749d384fed183bbe6516853443e18 Mon Sep 17 00:00:00 2001
From: YemBot
Date: Thu, 2 Jan 2025 18:16:00 -0800
Subject: [PATCH 06/16] Update changelog.md

---
 changelog.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/changelog.md b/changelog.md
index c6cec88..080745c 100644
--- a/changelog.md
+++ b/changelog.md
@@ -43,7 +43,7 @@ If critical issues are not responded to within 2 hours of reporting the issue pl
 
 ### Current
 
-**January 2nd 1:30pm PST: ALL S3DF services remain DOWN (except for login bastions). The team is actively troubleshooting. We will provide another update at 6pm PST today.**
+**January 2nd 6:00pm PST: ALL S3DF services remain DOWN (except for login bastions). The team is actively troubleshooting. We will provide another update at 10am PST tomorrow.**
 
 ### Upcoming
 

From e170a4e7ca5a6877ebaac6c0aad17d3ab2463fa7 Mon Sep 17 00:00:00 2001
From: YemBot
Date: Fri, 3 Jan 2025 09:36:18 -0800
Subject: [PATCH 07/16] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 21cefc5..55716e7 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@ and the Rubin observatory. The S3DF infrastructure is optimized for data analytics and is characterized by large, massive throughput, high concurrency storage systems.
 
-**January 2nd 6:00pm PST: ALL S3DF services remain DOWN (except for login bastions). The team is actively troubleshooting. We will provide another update at 10am PST tomorrow.**
+**January 3rd 9:35am PST: Today the team is working to address issues on the internal S3DF network. As a result, all S3DF services remain DOWN. Our next update will be at 3pm PST today.**
 
 ## Quick Reference
 

From 5d8d4627c8444f3196bca7625ecbeae005324830 Mon Sep 17 00:00:00 2001
From: YemBot
Date: Fri, 3 Jan 2025 09:37:49 -0800
Subject: [PATCH 08/16] Update changelog.md

---
 changelog.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/changelog.md b/changelog.md
index 080745c..30a516f 100644
--- a/changelog.md
+++ b/changelog.md
@@ -43,7 +43,7 @@ If critical issues are not responded to within 2 hours of reporting the issue pl
 
 ### Current
 
-**January 2nd 6:00pm PST: ALL S3DF services remain DOWN (except for login bastions). The team is actively troubleshooting. We will provide another update at 10am PST tomorrow.**
+**January 3rd 9:35am PST: Today the team is working to address issues on the internal S3DF network. As a result, all S3DF services remain DOWN. Our next update will be at 3pm PST today.**
 
 ### Upcoming
 

From 3992e421df19cbb39f724d7c7ac23325c9f1ae44 Mon Sep 17 00:00:00 2001
From: YemBot
Date: Fri, 3 Jan 2025 14:55:23 -0800
Subject: [PATCH 09/16] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 55716e7..65a0e51 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@ and the Rubin observatory. The S3DF infrastructure is optimized for data analytics and is characterized by large, massive throughput, high concurrency storage systems.
 
-**January 3rd 9:35am PST: Today the team is working to address issues on the internal S3DF network. As a result, all S3DF services remain DOWN. Our next update will be at 3pm PST today.**
+**January 3rd 2:55pm PST: We have made some updates to the S3DF network. We continue to work on stabilizing storage. All S3DF services remain DOWN. Our next update will be at 8pm PST today.**
 
 ## Quick Reference
 

From 1f5639412fe91fab897bbb04aa558eee50c68336 Mon Sep 17 00:00:00 2001
From: YemBot
Date: Fri, 3 Jan 2025 14:57:31 -0800
Subject: [PATCH 10/16] Update changelog.md

---
 changelog.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/changelog.md b/changelog.md
index 30a516f..c079931 100644
--- a/changelog.md
+++ b/changelog.md
@@ -43,7 +43,7 @@ If critical issues are not responded to within 2 hours of reporting the issue pl
 
 ### Current
 
-**January 3rd 9:35am PST: Today the team is working to address issues on the internal S3DF network. As a result, all S3DF services remain DOWN. Our next update will be at 3pm PST today.**
+**January 3rd 2:55pm PST: We have made some updates to the S3DF network. We continue to work on stabilizing storage. All S3DF services remain DOWN. Our next update will be at 8pm PST today.**
 
 ### Upcoming
 

From 2c7565732e967da57a631cd656d0245dee743498 Mon Sep 17 00:00:00 2001
From: YemBot
Date: Fri, 3 Jan 2025 18:46:51 -0800
Subject: [PATCH 11/16] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 65a0e51..55b1120 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@ and the Rubin observatory. The S3DF infrastructure is optimized for data analytics and is characterized by large, massive throughput, high concurrency storage systems.
 
-**January 3rd 2:55pm PST: We have made some updates to the S3DF network. We continue to work on stabilizing storage. All S3DF services remain DOWN. Our next update will be at 8pm PST today.**
+**January 3rd 6:45pm PST: Our team will continue to work throughout the weekend to resolve S3DF storage issues. All S3DF services remain DOWN. Our next update will be at 9am PST on Monday.**
 
 ## Quick Reference
 

From ff835d490f88ffea61c8c08167d40fbe07769294 Mon Sep 17 00:00:00 2001
From: YemBot
Date: Fri, 3 Jan 2025 18:48:05 -0800
Subject: [PATCH 12/16] Update changelog.md

---
 changelog.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/changelog.md b/changelog.md
index c079931..2f71354 100644
--- a/changelog.md
+++ b/changelog.md
@@ -43,7 +43,7 @@ If critical issues are not responded to within 2 hours of reporting the issue pl
 
 ### Current
 
-**January 3rd 2:55pm PST: We have made some updates to the S3DF network. We continue to work on stabilizing storage. All S3DF services remain DOWN. Our next update will be at 8pm PST today.**
+**January 3rd 6:45pm PST: Our team will continue to work throughout the weekend to resolve S3DF storage issues. All S3DF services remain DOWN. Our next update will be at 9am PST on Monday.**
 
 ### Upcoming
 

From cff0404856f6165a5cec2ef63649620bbd1c5d16 Mon Sep 17 00:00:00 2001
From: YemBot
Date: Mon, 6 Jan 2025 08:42:54 -0800
Subject: [PATCH 13/16] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 55b1120..1f0eceb 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@ and the Rubin observatory. The S3DF infrastructure is optimized for data analytics and is characterized by large, massive throughput, high concurrency storage systems.
 
-**January 3rd 6:45pm PST: Our team will continue to work throughout the weekend to resolve S3DF storage issues. All S3DF services remain DOWN. Our next update will be at 9am PST on Monday.**
+**January 6th 8:40am PST: All S3DF services are back UP. Users with k8s workloads should check for any lingering issues (stale file handles) and report to s3df-help@slac.stanford.edu. Thank you for your patience.**
 
 ## Quick Reference
 

From 7681c824157587af6067042b5d0d6f4f5293390a Mon Sep 17 00:00:00 2001
From: YemBot
Date: Mon, 6 Jan 2025 08:44:51 -0800
Subject: [PATCH 14/16] Update changelog.md

---
 changelog.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/changelog.md b/changelog.md
index 2f71354..e357531 100644
--- a/changelog.md
+++ b/changelog.md
@@ -43,7 +43,7 @@ If critical issues are not responded to within 2 hours of reporting the issue pl
 
 ### Current
 
-**January 3rd 6:45pm PST: Our team will continue to work throughout the weekend to resolve S3DF storage issues. All S3DF services remain DOWN. Our next update will be at 9am PST on Monday.**
+**January 6th 8:40am PST: All S3DF services are back UP. Users with k8s workloads should check for any lingering issues (stale file handles) and report to s3df-help@slac.stanford.edu. Thank you for your patience.**
 
 ### Upcoming
 

From ef01cebd704c169f9c249de1ba9a31f0f3f3fc3a Mon Sep 17 00:00:00 2001
From: "Micha R. Okun"
Date: Mon, 6 Jan 2025 13:00:13 -0800
Subject: [PATCH 15/16] deploy to pages on changes to "main" branch

---
 .github/workflows/static.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/static.yml b/.github/workflows/static.yml
index b9c9160..26304ed 100644
--- a/.github/workflows/static.yml
+++ b/.github/workflows/static.yml
@@ -4,7 +4,7 @@ name: Deploy static content to Pages
 on:
   # Runs on pushes targeting the default branch
   push:
-    branches: ["gh-pages"]
+    branches: ["main", "gh-pages"]
 
   # Allows you to run this workflow manually from the Actions tab
   workflow_dispatch:

From 1aaa862917188a922cac29ba20313ca27409c92f Mon Sep 17 00:00:00 2001
From: "Micha R. Okun"
Date: Thu, 9 Jan 2025 17:06:00 -0800
Subject: [PATCH 16/16] run Pages deployment workflow on `prod` branch pushes (in addition to `main` and `gh-pages`)

---
 .github/workflows/static.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/static.yml b/.github/workflows/static.yml
index 26304ed..b715553 100644
--- a/.github/workflows/static.yml
+++ b/.github/workflows/static.yml
@@ -4,7 +4,7 @@ name: Deploy static content to Pages
 on:
   # Runs on pushes targeting the default branch
   push:
-    branches: ["main", "gh-pages"]
+    branches: ["main", "prod", "gh-pages"]
 
   # Allows you to run this workflow manually from the Actions tab
   workflow_dispatch: