
Commit

Merge pull request #3366 from GeorgianaElena/30mchecklist
Improve docs about log inspection
GeorgianaElena authored Nov 23, 2023
2 parents 72bdcb4 + 5008904 commit 7db10b3
Showing 6 changed files with 256 additions and 6 deletions.
2 changes: 1 addition & 1 deletion docs/howto/troubleshoot/index.md
@@ -5,7 +5,7 @@ issues that may arise.

```{toctree}
:maxdepth: 2
logs.md
logs/index.md
ssh.md
prometheus.md
cilogon-user-accounts.md
docs/howto/troubleshoot/logs.md → docs/howto/troubleshoot/logs/cloud-logs.md
@@ -1,8 +1,6 @@
# Look at logs to troubleshoot issues
(howto-troubleshoot:cloud-logs)=

Looking at and interpreting logs produced by various components is the easiest
way to debug most issues, and should be the first place to look at when issues
are reported.
# Cloud-specific logging

This page describes how to look at various logs in different cloud providers.

@@ -32,6 +30,7 @@ logs are kept for 30 days, and are searchable.

### Common queries

(howto-troubleshoot:gcp-autoscaler-logs)=
#### Kubernetes autoscaler logs

You can find scale up or scale down events by looking for decision events
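If you prefer the command line over the console, a rough equivalent can be fetched with the `gcloud` CLI. The block below is only a sketch: it assumes a GKE cluster with autoscaler visibility logging enabled, and the `jsonPayload.decision` field name is an assumption about the log format.

```bash
# Sketch: list recent cluster-autoscaler decision events via the gcloud CLI.
# Assumes a GKE cluster and that the active gcloud project hosts it; the
# jsonPayload.decision field is an assumption about the visibility log format.
gcloud logging read \
  'logName:"cluster-autoscaler-visibility" AND jsonPayload.decision:*' \
  --limit=20 --format=json
```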
@@ -107,6 +106,7 @@ special characters, highly recommend using the script instead - escaping
errors can be frustrating!
```

(howto-troubleshoot:gcloud-dask-gateway-logs)=
#### Look at dask-gateway logs

The following query will show logs from all the components of dask-gateway -
@@ -153,3 +153,39 @@ labels.k8s-pod/component="singleuser-server"
resource.labels.namespace_name="<namespace>"
textPayload=~"some-string"
```
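The same query can also be run from the `gcloud` CLI if that is more convenient. This is a sketch and assumes the active gcloud project is the one hosting the cluster; replace `<namespace>` and the search string as needed.

```bash
# Sketch: run the user-server log query above via the gcloud CLI.
# Assumes the active gcloud project hosts the cluster.
gcloud logging read \
  'labels."k8s-pod/component"="singleuser-server" AND resource.labels.namespace_name="<namespace>" AND textPayload=~"some-string"' \
  --limit=50
```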

## Microsoft Azure

On Azure, the logs produced by all containers and other components are sent to [Azure Monitor](https://learn.microsoft.com/en-us/azure/azure-monitor/overview), if the service is configured.

### Accessing Azure Monitor

Go to the [Azure Monitor Container Insights](https://portal.azure.com/#view/Microsoft_Azure_Monitoring/AzureMonitoringBrowseBlade/~/containerInsights) section in your browser.

Check if your cluster is in the list of monitored clusters.

If it is, you can go to the [logs section](https://portal.azure.com/#view/Microsoft_Azure_Monitoring/AzureMonitoringBrowseBlade/~/logs) and run queries on this data.

Otherwise, Azure Monitor was not set up for your cluster and you cannot access container logs from the portal.

```{note}
Azure Monitor is not configured on any of the 2i2c Azure clusters (i.e. the utoronto `hub-cluster`).
```
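On clusters where Container Insights *is* enabled, a query can also be run from the command line. The sketch below is hedged: the workspace ID placeholder and the `ContainerLogV2` table and column names are assumptions about how the Log Analytics workspace is set up.

```bash
# Sketch: query container logs from a Log Analytics workspace with the az CLI.
# The workspace ID and the ContainerLogV2 table/columns are assumptions --
# check what your workspace actually exposes.
az monitor log-analytics query \
  --workspace "<log-analytics-workspace-id>" \
  --analytics-query "ContainerLogV2 | where PodName startswith 'jupyter-' | order by TimeGenerated desc | take 100"
```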

## Amazon AWS

On AWS, the logs produced by all containers and other components are sent to [Amazon CloudWatch](https://aws.amazon.com/cloudwatch), if the service is configured.

### Accessing CloudWatch

Go to the [Amazon CloudWatch](https://console.aws.amazon.com/cloudwatch) console in your browser.

Check whether `Application Insights` was configured for the desired AWS account and cluster.

If it was, you can go to the [logs section](https://ca-central-1.console.aws.amazon.com/cloudwatch/home?region=ca-central-1#logsV2:logs-insights) and run queries on the data generated by the cluster.

Otherwise, CloudWatch was not set up for your cluster and you cannot access the logs from the console.

```{note}
Amazon CloudWatch is not configured on the 2i2c Amazon clusters.
```
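Where CloudWatch *is* configured, logs can also be pulled from the command line. The sketch below assumes the usual Container Insights log group naming convention; check which log groups actually exist for your cluster before relying on it.

```bash
# Sketch: fetch recent application log events with the AWS CLI.
# The log group name follows the common Container Insights convention and is
# an assumption -- list your log groups first if unsure.
aws logs filter-log-events \
  --log-group-name "/aws/containerinsights/<cluster-name>/application" \
  --filter-pattern "error" \
  --max-items 100
```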
19 changes: 19 additions & 0 deletions docs/howto/troubleshoot/logs/common-errors.md
@@ -0,0 +1,19 @@
# Common errors and what logs to check

Depending on the error experienced, specific logs can provide more information about the underlying issue.

## 5xx errors during login or server start

These kinds of errors are reported by the hub, so checking the [hub pod logs](howto-troubleshoot:hub-pod-logs) might provide more insight into why they are happening.
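For example, with the cluster and hub names exported as described on the kubectl logging page, the hub logs can be fetched with:

```bash
# Assumes CLUSTER_NAME and HUB_NAME are exported, e.g.
# export CLUSTER_NAME=2i2c; export HUB_NAME=staging
deployer debug component-logs $CLUSTER_NAME $HUB_NAME hub
```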

## Scaling issues

If any scaling-related errors are reported, then the first thing to check is the cluster `autoscaler` logs from the [cloud console](howto-troubleshoot:gcp-autoscaler-logs) or through [kubectl](howto-troubleshoot:kubectl-autoscaler-logs).

### Dask issues

If users are experiencing issues related to Dask, then:
- something might be going on with `dask-gateway`, and the logs of the pods related to this service might have more useful info.
  You can look at dask-gateway logs either with [kubectl](howto-troubleshoot:kubectl-dask-gateway-logs) or [from the cloud console](howto-troubleshoot:gcloud-dask-gateway-logs); see the example commands after this list.
- there might be a connectivity issue, and the traefik logs might help.
  Traefik pod logs are available from `kubectl` using the commands described in [this troubleshooting section](howto-troubleshoot:kubectl-traefik-logs).
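For example, the relevant `deployer` commands (described in detail on the kubectl logging page) are:

```bash
# Assumes CLUSTER_NAME and HUB_NAME are exported.
deployer debug component-logs $CLUSTER_NAME $HUB_NAME dask-gateway-api
deployer debug component-logs $CLUSTER_NAME $HUB_NAME dask-gateway-controller
deployer debug component-logs $CLUSTER_NAME $HUB_NAME traefik
```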
15 changes: 15 additions & 0 deletions docs/howto/troubleshoot/logs/index.md
@@ -0,0 +1,15 @@
# Look at logs to troubleshoot issues

Looking at and interpreting logs produced by various components is the easiest
way to debug most issues, and should be the first place to look when issues
are reported.

This page describes how to look at various logs in different cloud providers or
by using cloud-agnostic `kubectl` and `deployer` commands.

```{toctree}
:maxdepth: 2
cloud-logs
kubectl-logs
common-errors
```
180 changes: 180 additions & 0 deletions docs/howto/troubleshoot/logs/kubectl-logs.md
@@ -0,0 +1,180 @@
(howto-troubleshoot:kubectl-logs)=
# Kubectl logging

This page describes how to look at various logs, either through `deployer` commands that wrap the most common `kubectl` commands or by using `kubectl` directly.

## Look at logs via deployer sub-commands

There are `deployer debug` sub-commands that wrap the most relevant `kubectl logs` arguments, letting you check logs conveniently with a single command.

````{tip}
You can export the cluster's and hub's names as environment variables to use the copy-pasted commands in the sections below directly.
```bash
export CLUSTER_NAME=2i2c; export HUB_NAME=staging
```
````

### Look at hub component logs

The logs of each JupyterHub component can be fetched with the `deployer debug component-logs` command, run once per component.

These commands are standalone and **don't require** running `deployer use-cluster-credentials` beforehand.

```{tip}
1. The `--no-follow` flag
You can pass `--no-follow` to each of the deployer commands below to show logs only up to the current point in time and then stop.
2. The `--previous` flag
If the pod has restarted due to an error, you can pass `--previous` to look at the logs of the pod prior to the last restart.
```

(howto-troubleshoot:hub-pod-logs)=
#### Hub pod logs
```bash
deployer debug component-logs $CLUSTER_NAME $HUB_NAME hub
```

#### Proxy pod logs
```bash
deployer debug component-logs $CLUSTER_NAME $HUB_NAME proxy
```

(howto-troubleshoot:kubectl-traefik-logs)=
#### Traefik pod logs
```bash
deployer debug component-logs $CLUSTER_NAME $HUB_NAME traefik
```

(howto-troubleshoot:kubectl-dask-gateway-logs)=
### Look at dask-gateway logs

Display the logs from dask-gateway's most important component pods.

#### Dask-gateway-api pod logs
```bash
deployer debug component-logs $CLUSTER_NAME $HUB_NAME dask-gateway-api
```

#### Dask-gateway-controller pod logs
```bash
deployer debug component-logs $CLUSTER_NAME $HUB_NAME dask-gateway-controller
```

### Look at a specific user's logs

Display logs from the notebook pod of a given user with the following command:

```bash
deployer debug user-logs $CLUSTER_NAME $HUB_NAME <username>
```

Note that you don't need the *escaped* username with this command.
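For example, for a hypothetical user `user@example.org` (the username here is just an illustration):

```bash
# Hypothetical username -- note it is passed unescaped.
deployer debug user-logs $CLUSTER_NAME $HUB_NAME user@example.org
```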

## Look at logs via kubectl

### Pre-requisites

Get the name of the cluster you want to debug and export it as an environment variable. Then use the `deployer` to gain `kubectl` access to this specific cluster.

Example:

```bash
export CLUSTER_NAME=2i2c;
deployer use-cluster-credentials $CLUSTER_NAME
```

(howto-troubleshoot:kubectl-autoscaler-logs)=
### Kubernetes autoscaler logs

You can find scale-up or scale-down events by looking for decision events:

```bash
kubectl describe -n kube-system configmap cluster-autoscaler-status
```
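To narrow the output down to the scaling decisions, a grep along these lines may help; the `ScaleUp`/`ScaleDown` section names are assumptions about how the autoscaler formats its status.

```bash
# Sketch: show only the scale-up / scale-down status sections.
# The ScaleUp/ScaleDown section names are assumptions about the status format.
kubectl describe -n kube-system configmap cluster-autoscaler-status | grep -E -A 3 "ScaleUp|ScaleDown"
```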

### Kubernetes node events and status

1. Running nodes and their status
```bash
kubectl get nodes
```

2. Get a node's events from the past 1h
```bash
kubectl get events --field-selector involvedObject.kind=Node --field-selector involvedObject.name=<some-node-name>
```
3. Describe a node and any related events
```bash
kubectl describe node <some-node-name> --show-events=true
```
### Kubernetes pod events and status
```{tip}
The following commands require passing the namespace where a specific pod is running. Usually this namespace is the same as the hub name.
```
1. Running pods in a namespace and their status
```bash
kubectl get pods -n <namespace>
```
2. Running pods in all namespaces of the cluster and their status
```bash
kubectl get pods --all-namespaces
```
3. Get a pod's events from the past 1h
```bash
kubectl get events --field-selector involvedObject.kind=Pod --field-selector involvedObject.name=<some-pod-name>
```

4. Describe a pod and any related events
```bash
kubectl describe pod <some-pod-name> --show-events=true
```

### Kubernetes pod logs
You can access any pod's logs by using the `kubectl logs` command. Below are some of the most common debugging commands; a combined example using these flags is shown at the end of this page.
```{tip}
1. The `--follow` flag
You can pass the `--follow` flag to each of the `kubectl logs` commands below to stream the logs as they happen; otherwise, they will only be shown up to the current point in time and then stop.
2. The `--previous` flag
If the pod has restarted due to an error, you can pass `--previous` to look at the logs of the pod prior to the last restart.
3. The `--tail` flag
With the `--tail=<number>` flag you can limit the output to the given number of recent log lines; otherwise, all log lines are shown.
4. The `--since` flag
This flag can be used like `--since=1h` to only return logs newer than a relative duration such as 5s, 2m, or 3h.
```
1. Print the logs of a pod
```bash
kubectl logs <pod_name> --namespace <pod_namespace>
```
2. Print the logs for a container in a pod
```bash
kubectl logs -c <container_name> <pod_name> --namespace <pod_namespace>
```
3. View the logs for a previously failed pod
```bash
kubectl logs --previous <pod_name> --namespace <pod_namespace>
```
4. View the logs for all containers in a pod
```bash
kubectl logs <pod_name> --all-containers --namespace <pod_namespace>
```
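
Putting the flags from the tip above together, a combined sketch looks like this:

```bash
# Combined sketch: stream up to 200 recent lines from the past hour of a pod
# and keep following new output.
kubectl logs <pod_name> --namespace <pod_namespace> --since=1h --tail=200 --follow
```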
2 changes: 1 addition & 1 deletion docs/index.md
@@ -88,7 +88,7 @@ topic/access-creds/index.md
topic/infrastructure/index.md
topic/monitoring-alerting/index.md
topic/features.md
topic/resource-allocations.md
topic/resource-allocation.md
```

## Reference
