
Commit

Merge pull request #3366 from GeorgianaElena/30mchecklist
Improve docs about log inspection
GeorgianaElena authored Nov 23, 2023
2 parents 72bdcb4 + 5008904 commit 7db10b3
Showing 6 changed files with 256 additions and 6 deletions.
2 changes: 1 addition & 1 deletion docs/howto/troubleshoot/index.md
@@ -5,7 +5,7 @@ issues that may arise.

```{toctree}
:maxdepth: 2
logs.md
logs/index.md
ssh.md
prometheus.md
cilogon-user-accounts.md
docs/howto/troubleshoot/logs.md → docs/howto/troubleshoot/logs/cloud-logs.md
@@ -1,8 +1,6 @@
# Look at logs to troubleshoot issues
(howto-troubleshoot:cloud-logs)=

Looking at and interpreting logs produced by various components is the easiest
way to debug most issues, and should be the first place to look at when issues
are reported.
# Cloud-specific logging

This page describes how to look at various logs in different cloud providers.

@@ -32,6 +30,7 @@ logs are kept for 30 days, and are searchable.

### Common queries

(howto-troubleshoot:gcp-autoscaler-logs)=
#### Kubernetes autoscaler logs

You can find scale up or scale down events by looking for decision events
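If you prefer the command line over the console, a rough equivalent can be fetched with the `gcloud` CLI. The block below is only a sketch: it assumes a GKE cluster with autoscaler visibility logging enabled, and the `jsonPayload.decision` field name is an assumption about the log format.

```bash
# Sketch: list recent cluster-autoscaler decision events via the gcloud CLI.
# Assumes a GKE cluster and that the active gcloud project hosts it; the
# jsonPayload.decision field is an assumption about the visibility log format.
gcloud logging read \
  'logName:"cluster-autoscaler-visibility" AND jsonPayload.decision:*' \
  --limit=20 --format=json
```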
@@ -107,6 +106,7 @@ special characters, highly recommend using the script instead - escaping
errors can be frustrating!
```

(howto-troubleshoot:gcloud-dask-gateway-logs)=
#### Look at dask-gateway logs

The following query will show logs from all the components of dask-gateway -
@@ -153,3 +153,39 @@ labels.k8s-pod/component="singleuser-server"
resource.labels.namespace_name="<namespace>"
textPayload=~"some-string"
```
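The same query can also be run from the `gcloud` CLI if that is more convenient. This is a sketch and assumes the active gcloud project is the one hosting the cluster; replace `<namespace>` and the search string as needed.

```bash
# Sketch: run the user-server log query above via the gcloud CLI.
# Assumes the active gcloud project hosts the cluster.
gcloud logging read \
  'labels."k8s-pod/component"="singleuser-server" AND resource.labels.namespace_name="<namespace>" AND textPayload=~"some-string"' \
  --limit=50
```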

## Microsoft Azure

On Azure, the logs produced by all containers and other components are sent to [Azure Monitor](https://learn.microsoft.com/en-us/azure/azure-monitor/overview), if the service is configured.

### Accessing Azure Monitor

Go to the [Azure Monitor Container Insights](https://portal.azure.com/#view/Microsoft_Azure_Monitoring/AzureMonitoringBrowseBlade/~/containerInsights) section in your browser.

Check if your cluster is in the list of monitored clusters.

If it is, you can go to the [logs section](https://portal.azure.com/#view/Microsoft_Azure_Monitoring/AzureMonitoringBrowseBlade/~/logs) and run queries on this data.

Otherwise, Azure Monitor was not set up for your cluster and you cannot access container logs from the portal.

```{note}
Azure Monitor is not configured on any of the 2i2c Azure clusters (i.e. the utoronto `hub-cluster`).
```
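On clusters where Container Insights *is* enabled, a query can also be run from the command line. The sketch below is hedged: the workspace ID placeholder and the `ContainerLogV2` table and column names are assumptions about how the Log Analytics workspace is set up.

```bash
# Sketch: query container logs from a Log Analytics workspace with the az CLI.
# The workspace ID and the ContainerLogV2 table/columns are assumptions --
# check what your workspace actually exposes.
az monitor log-analytics query \
  --workspace "<log-analytics-workspace-id>" \
  --analytics-query "ContainerLogV2 | where PodName startswith 'jupyter-' | order by TimeGenerated desc | take 100"
```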

## Amazon AWS

On AWS, the logs produced by all containers and other components are sent to [Amazon CloudWatch](https://aws.amazon.com/cloudwatch), if the service is configured.

### Accessing CloudWatch

Go to the [Amazon CloudWatch](https://console.aws.amazon.com/cloudwatch) console in your browser.

Check whether `Application Insights` was configured for the desired AWS account and cluster.

If it was, you can go to the [logs section](https://ca-central-1.console.aws.amazon.com/cloudwatch/home?region=ca-central-1#logsV2:logs-insights) and run queries on the data generated by the cluster.

Otherwise, CloudWatch was not set up for your cluster and you cannot access the logs from the console.

```{note}
Amazon CloudWatch is not configured on the 2i2c Amazon clusters.
```
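Where CloudWatch *is* configured, logs can also be pulled from the command line. The sketch below assumes the usual Container Insights log group naming convention; check which log groups actually exist for your cluster before relying on it.

```bash
# Sketch: fetch recent application log events with the AWS CLI.
# The log group name follows the common Container Insights convention and is
# an assumption -- list your log groups first if unsure.
aws logs filter-log-events \
  --log-group-name "/aws/containerinsights/<cluster-name>/application" \
  --filter-pattern "error" \
  --max-items 100
```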
19 changes: 19 additions & 0 deletions docs/howto/troubleshoot/logs/common-errors.md
@@ -0,0 +1,19 @@
# Common errors and what logs to check

Depending on the error experienced, specific logs can provide more information about the underlying issue.

## 5xx errors during login or server start

These kinds of errors are reported by the hub, so checking the [hub pod logs](howto-troubleshoot:hub-pod-logs) might provide more insight into why they are happening.
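For example, with the cluster and hub names exported as described on the kubectl logging page, the hub logs can be fetched with:

```bash
# Assumes CLUSTER_NAME and HUB_NAME are exported, e.g.
# export CLUSTER_NAME=2i2c; export HUB_NAME=staging
deployer debug component-logs $CLUSTER_NAME $HUB_NAME hub
```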

## Scaling issues

If any scaling-related errors are reported, then the first thing to check is the cluster `autoscaler` logs from the [cloud console](howto-troubleshoot:gcp-autoscaler-logs) or through [kubectl](howto-troubleshoot:kubectl-autoscaler-logs).

### Dask issues

If users are experiencing issues related to Dask, then:
- something might be going on with `dask-gateway`, and the logs of the pods related to this service might have more useful info.
  You can look at dask-gateway logs either with [kubectl](howto-troubleshoot:kubectl-dask-gateway-logs) or [from the cloud console](howto-troubleshoot:gcloud-dask-gateway-logs); see the example commands after this list.
- there might be a connectivity issue, and the traefik logs might help.
  Traefik pod logs are available from `kubectl` using the commands described in [this troubleshooting section](howto-troubleshoot:kubectl-traefik-logs).
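For example, the relevant `deployer` commands (described in detail on the kubectl logging page) are:

```bash
# Assumes CLUSTER_NAME and HUB_NAME are exported.
deployer debug component-logs $CLUSTER_NAME $HUB_NAME dask-gateway-api
deployer debug component-logs $CLUSTER_NAME $HUB_NAME dask-gateway-controller
deployer debug component-logs $CLUSTER_NAME $HUB_NAME traefik
```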
15 changes: 15 additions & 0 deletions docs/howto/troubleshoot/logs/index.md
@@ -0,0 +1,15 @@
# Look at logs to troubleshoot issues

Looking at and interpreting logs produced by various components is the easiest
way to debug most issues, and should be the first place to look when issues
are reported.

This page describes how to look at various logs in different cloud providers or
by using cloud-agnostic `kubectl` and `deployer` commands.

```{toctree}
:maxdepth: 2
cloud-logs
kubectl-logs
common-errors
```
180 changes: 180 additions & 0 deletions docs/howto/troubleshoot/logs/kubectl-logs.md
@@ -0,0 +1,180 @@
(howto-troubleshoot:kubectl-logs)=
# Kubectl logging

This page describes how to look at various logs, either through `deployer` commands that wrap the most common `kubectl` commands or by using `kubectl` directly.

## Look at logs via deployer sub-commands

There are `deployer debug` sub-commands that wrap the most relevant `kubectl logs` arguments, letting you check logs conveniently with a single command.

````{tip}
You can export the cluster's and hub's names as environment variables to use the copy-pasted commands in the sections below directly.
```bash
export CLUSTER_NAME=2i2c; export HUB_NAME=staging
```
````

### Look at hub component logs

The logs of each JupyterHub component can be fetched with the `deployer debug component-logs` command, run once per component.

These commands are standalone and **don't require** running `deployer use-cluster-credentials` beforehand.

```{tip}
1. The `--no-follow` flag
You can pass `--no-follow` to each of the deployer commands below to show logs only up to the current point in time and then stop.
2. The `--previous` flag
If the pod has restarted due to an error, you can pass `--previous` to look at the logs of the pod prior to the last restart.
```

(howto-troubleshoot:hub-pod-logs)=
#### Hub pod logs
```bash
deployer debug component-logs $CLUSTER_NAME $HUB_NAME hub
```

#### Proxy pod logs
```bash
deployer debug component-logs $CLUSTER_NAME $HUB_NAME proxy
```

(howto-troubleshoot:kubectl-traefik-logs)=
#### Traefik pod logs
```bash
deployer debug component-logs $CLUSTER_NAME $HUB_NAME traefik
```

(howto-troubleshoot:kubectl-dask-gateway-logs)=
### Look at dask-gateway logs

Display the logs from dask-gateway's most important component pods.

#### Dask-gateway-api pod logs
```bash
deployer debug component-logs $CLUSTER_NAME $HUB_NAME dask-gateway-api
```

#### Dask-gateway-controller pod logs
```bash
deployer debug component-logs $CLUSTER_NAME $HUB_NAME dask-gateway-controller
```

### Look at a specific user's logs

Display logs from the notebook pod of a given user with the following command:

```bash
deployer debug user-logs $CLUSTER_NAME $HUB_NAME <username>
```

Note that you don't need the *escaped* username with this command.
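For example, for a hypothetical user `user@example.org` (the username here is just an illustration):

```bash
# Hypothetical username -- note it is passed unescaped.
deployer debug user-logs $CLUSTER_NAME $HUB_NAME user@example.org
```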

## Look at logs via kubectl

### Pre-requisites

Get the name of the cluster you want to debug and export it as an environment variable. Then use the `deployer` to gain `kubectl` access to this specific cluster.

Example:

```bash
export CLUSTER_NAME=2i2c;
deployer use-cluster-credentials $CLUSTER_NAME
```

(howto-troubleshoot:kubectl-autoscaler-logs)=
### Kubernetes autoscaler logs

You can find scale-up or scale-down events by looking for decision events:

```bash
kubectl describe -n kube-system configmap cluster-autoscaler-status
```
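To narrow the output down to the scaling decisions, a grep along these lines may help; the `ScaleUp`/`ScaleDown` section names are assumptions about how the autoscaler formats its status.

```bash
# Sketch: show only the scale-up / scale-down status sections.
# The ScaleUp/ScaleDown section names are assumptions about the status format.
kubectl describe -n kube-system configmap cluster-autoscaler-status | grep -E -A 3 "ScaleUp|ScaleDown"
```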

### Kubernetes node events and status

1. Running nodes and their status
```bash
kubectl get nodes
```

2. Get a node's events from the past 1h
```bash
kubectl get events --field-selector involvedObject.kind=Node --field-selector involvedObject.name=<some-node-name>
```
3. Describe a node and any related events
```bash
kubectl describe node <some-node-name> --show-events=true
```
### Kubernetes pod events and status
```{tip}
The following commands require passing the namespace where a specific pod is running. Usually this namespace is the same as the hub name.
```
1. Running pods in a namespace and their status
```bash
kubectl get pods -n <namespace>
```
2. Running pods in all namespaces of the cluster and their status
```bash
kubectl get pods --all-namespaces
```
3. Get a pod's events from the past 1h
```bash
kubectl get events --field-selector involvedObject.kind=Pod --field-selector involvedObject.name=<some-pod-name>
```

4. Describe a pod and any related events
```bash
kubectl describe pod <some-pod-name> --show-events=true
```

### Kubernetes pod logs
You can access any pod's logs by using the `kubectl logs` command. Below are some of the most common debugging commands; a combined example using these flags is shown at the end of this page.
```{tip}
1. The `--follow` flag
You can pass the `--follow` flag to each of the `kubectl logs` commands below to stream the logs as they happen; otherwise, they will only be shown up to the current point in time and then stop.
2. The `--previous` flag
If the pod has restarted due to an error, you can pass `--previous` to look at the logs of the pod prior to the last restart.
3. The `--tail` flag
With the `--tail=<number>` flag you can limit the output to the given number of recent log lines; otherwise, all log lines are shown.
4. The `--since` flag
This flag can be used like `--since=1h` to only return logs newer than a relative duration such as 5s, 2m, or 3h.
```
1. Print the logs of a pod
```bash
kubectl logs <pod_name> --namespace <pod_namespace>
```
2. Print the logs for a container in a pod
```bash
kubectl logs -c <container_name> <pod_name> --namespace <pod_namespace>
```
3. View the logs for a previously failed pod
```bash
kubectl logs --previous <pod_name> --namespace <pod_namespace>
```
4. View the logs for all containers in a pod
```bash
kubectl logs <pod_name> --all-containers --namespace <pod_namespace>
```
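
Putting the flags from the tip above together, a combined sketch looks like this:

```bash
# Combined sketch: stream up to 200 recent lines from the past hour of a pod
# and keep following new output.
kubectl logs <pod_name> --namespace <pod_namespace> --since=1h --tail=200 --follow
```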
2 changes: 1 addition & 1 deletion docs/index.md
@@ -88,7 +88,7 @@ topic/access-creds/index.md
topic/infrastructure/index.md
topic/monitoring-alerting/index.md
topic/features.md
topic/resource-allocations.md
topic/resource-allocation.md
```

## Reference
