---
layout: default
nav_order: 3
parent: Resources
grand_parent: Maintainer Docs
title: Telemetry
---

# {{ page.title }}
{:.no_toc}

## Overview
{:.no_toc}

The RAPIDS team collects build-time and test-time telemetry from our CI jobs.
These data are used to identify potential speed regressions and optimizations
that might be worthwhile.

### Intended audience
{: .no_toc }

Operations
{: .label .label-purple}

Developers
{: .label .label-green}

## Table of contents
{: .no_toc .text-delta }

1. TOC
{:toc}

## Introduction

This is an overview of how telemetry data is collected, how it is transmitted and recorded, and how visualizations can be obtained using Grafana.

## Collecting data

We can collect telemetry data from multiple sources. Right now, the main
source of timing information is the GitHub Actions job metadata, which we
obtain using the [GitHub Actions REST
API](https://docs.github.com/en/rest/actions/workflow-jobs?apiVersion=2022-11-28#list-jobs-for-a-workflow-run-attempt).
This metadata describes each job, which generally corresponds to a call to a
reusable workflow from our shared-workflows repository, and itemizes each step
within any given job.
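
As a rough sketch of what that API returns, the following Python snippet
(illustrative only, not the production parser; the owner, repo, run, and
attempt values are placeholders) lists job and step timings for a workflow run
attempt:

```
import os

import requests

# Placeholder values for illustration.
OWNER, REPO, RUN_ID, ATTEMPT = "rapidsai", "cudf", 123456789, 1

# Endpoint documented at the link above:
# GET /repos/{owner}/{repo}/actions/runs/{run_id}/attempts/{attempt_number}/jobs
url = (
    f"https://api.github.com/repos/{OWNER}/{REPO}"
    f"/actions/runs/{RUN_ID}/attempts/{ATTEMPT}/jobs"
)
resp = requests.get(
    url,
    headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    },
    params={"per_page": 100},
)
resp.raise_for_status()
for job in resp.json()["jobs"]:
    print(job["name"], job["started_at"], job["completed_at"])
    for step in job["steps"]:
        print("  ", step["name"], step["started_at"], step["completed_at"])
```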

The general implementation for telemetry is as follows:

1. Create and stash the initial environment variables that later jobs load.
2. In shared-workflows, load those variables, add per-job metadata via `OTEL_RESOURCE_ATTRIBUTES`, run the workflow, and stash any variables that the final job should collect.
3. On a self-hosted node, parse the GitHub Actions job metadata and associate each job with the attributes passed in from step 2. Send the data to Tempo via a Fluent Bit agent that is also used to aggregate logs.

Steps 1 and 3 happen as jobs in the "top-level repository" - the workflow that initiates all subsequent workflows.
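
For reference, `OTEL_RESOURCE_ATTRIBUTES` follows the standard OpenTelemetry
convention of comma-separated key=value pairs. A minimal Python sketch of
reading it back into a dictionary (the example value is illustrative):

```
import os

# Example value; real runs carry attributes set by the workflows.
os.environ.setdefault(
    "OTEL_RESOURCE_ATTRIBUTES", "rapids.cuda=12.5,rapids.arch=amd64"
)
attributes = dict(
    pair.split("=", 1)
    for pair in os.environ["OTEL_RESOURCE_ATTRIBUTES"].split(",")
    if "=" in pair
)
print(attributes)  # {'rapids.cuda': '12.5', 'rapids.arch': 'amd64'}
```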

### Code changes for projects to add telemetry

These are the general steps necessary to add telemetry to a repository that
does not currently have it. Changes to shared-workflows are implementation
details and are described separately below. In the top-level workflow, such as
`cudf/.github/workflows/pr.yaml`:

1. Add a job for telemetry-setup, and add that job name to the pr-builder `needs` collection.
```
jobs:
  # Please keep pr-builder as the top job here
  pr-builder:
    needs:
      - check-nightly-ci
      - changed-files
      ... <other job names> ...
      - telemetry-setup
  telemetry-setup:
    continue-on-error: true
    runs-on: ubuntu-latest
    env:
      OTEL_SERVICE_NAME: 'pr-cudf'
    steps:
      - name: Telemetry setup
        if: ${{ vars.TELEMETRY_ENABLED == 'true' }}
        uses: rapidsai/shared-actions/telemetry-dispatch-stash-base-env-vars@main
```

`OTEL_SERVICE_NAME` serves as one identifier in the telemetry data. It is not strictly necessary.

2. Add `telemetry-setup` as a `needs` entry for every job at the top of the dependency tree. The purpose is to make the telemetry variables available to any job that may use them - even the checks.yaml job. The jobs that generally need this entry are:

* checks
* changed-files
* devcontainer

3. Add an `ignored_pr_jobs` entry so that the checks workflow does not require the final telemetry job:

```
checks:
  secrets: inherit
  needs: telemetry-setup
  uses: rapidsai/shared-workflows/.github/workflows/checks.yaml@branch-25.02
  with:
    enable_check_generated_files: false
    ignored_pr_jobs: "telemetry-summarize"
```

Syntax for the `ignored_pr_jobs` is [space-separated within the quotes](https://github.com/rapidsai/shared-workflows/blob/branch-25.02/.github/workflows/checks.yaml#L30).

4. Run the parsing and submission script job as the final job - after `pr-builder`:

```
telemetry-summarize:
  # This job must use a self-hosted runner to record telemetry traces.
  runs-on: linux-amd64-cpu4
  needs: pr-builder
  if: ${{ vars.TELEMETRY_ENABLED == 'true' && !cancelled() }}
  continue-on-error: true
  steps:
    - name: Telemetry summarize
      uses: rapidsai/shared-actions/telemetry-dispatch-summarize@main
```

### Implementation details

The "meat" implementation of telemetry lives primarily in the `shared-actions`
repository. As the middle layer, the reusable workflows in the shared-workflows
repo download the stashed environment variables from the top-level
`telemetry-setup`, load them in the local worker's environment, run the command,
and stash possible variables of interest, which capture metadata for that
particular job. You should not generally need to change shared-actions or
shared-workflows when maintaining other RAPIDS repositories. This section is
only necessary for someone looking to change or improve something like:

* Adding new metadata to associate with a job
* Adding telemetry from new sources, such as sccache statistics or build logs
* fixing one or more of what is surely multitudinous bugs

#### Shared-actions

[Shared-actions](https://github.com/rapidsai/shared-actions) uses a pattern we
call "dispatch actions": an action clones the shared-actions repo so that
subsequent steps can refer to local files in the clone. The top-level folders
in the shared-actions repo whose names start with `telemetry-dispatch` are all
dispatch actions.

Most of these actions start with the `load-then-clone` action, which downloads
the initial file containing the basic environment variables that should be
propagated. The main exception is `telemetry-dispatch-stash-base-env-vars`,
which populates that environment variable file in the first place. We pass an
environment variable file instead of action parameters to avoid a lot of
parameter-forwarding bloat.
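
As a loose illustration of the stash/load idea (the file name and variable list
here are hypothetical; the real actions stash and fetch this file through CI
artifacts):

```
import os

# Hypothetical file name and variable list, for illustration only.
STASH_FILE = "telemetry-env-vars"
VARS = ["OTEL_SERVICE_NAME", "OTEL_RESOURCE_ATTRIBUTES"]

# Producer side: write selected variables to a dotenv-style file.
with open(STASH_FILE, "w") as f:
    for name in VARS:
        if name in os.environ:
            f.write(f"{name}={os.environ[name]}\n")

# Consumer side: load the file back into the process environment.
with open(STASH_FILE) as f:
    for line in f:
        name, _, value = line.rstrip("\n").partition("=")
        os.environ[name] = value
```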

The ability to refer to local files is employed in the
[`telemetry-impls/summarize`](https://github.com/rapidsai/shared-actions/tree/main/telemetry-impls/summarize)
action, where a Python script parses job metadata and sends it using the
OpenTelemetry Python SDK. If some value there seems to be coming from nowhere,
it is probably being passed in via an environment variable file.
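
For a sense of what that script does, here is a hedged sketch (not the actual
summarize implementation) that turns a job's start and end times into an
OpenTelemetry span with explicit timestamps using the Python SDK:

```
from datetime import datetime

from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor


def to_ns(timestamp: str) -> int:
    """Convert an ISO-8601 timestamp from the GitHub API to epoch nanoseconds."""
    dt = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return int(dt.timestamp()) * 1_000_000_000  # second granularity suffices here


# Console exporter for demonstration; the real pipeline ships spans to Tempo.
provider = TracerProvider(resource=Resource.create({"service.name": "pr-cudf"}))
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
tracer = provider.get_tracer("telemetry-sketch")

# Illustrative job record; the real script reads these from the REST API.
job = {
    "name": "build-cpp",
    "started_at": "2025-01-23T01:02:03Z",
    "completed_at": "2025-01-23T01:10:03Z",
}
span = tracer.start_span(job["name"], start_time=to_ns(job["started_at"]))
span.end(end_time=to_ns(job["completed_at"]))
```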

#### Shared-workflows

At this level, we load the base environment variables, add our own, and then
run our process. The environment variables that we load ensure that if any
build tool natively supports OpenTelemetry, it has the information necessary to
send that data (the job needs to run on a self-hosted runner).

```
jobs:
  build:
    steps:
      - name: Telemetry setup
        uses: rapidsai/shared-actions/telemetry-dispatch-setup@main
        continue-on-error: true
        if: ${{ vars.TELEMETRY_ENABLED == 'true' }}
        with:
          extra_attributes: "rapids.operation=build-cpp,rapids.package_type=conda,rapids.cuda=${{ matrix.CUDA_VER }},rapids.py=${{ matrix.PY_VER }},rapids.arch=${{ matrix.ARCH }},rapids.linux=${{ matrix.LINUX_VER }}"
```

Passing in the `extra_attributes` parameter appends these comma-separated
key=value pairs to the `OTEL_RESOURCE_ATTRIBUTES` environment variable. This is
how we associate matrix values, such as CUDA version, architecture, and Python
version, with a particular job.
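
The append itself is equivalent to something like the following sketch (the
exact handling in `telemetry-dispatch-setup` may differ):

```
import os

extra = "rapids.operation=build-cpp,rapids.cuda=12.5"  # from extra_attributes
existing = os.environ.get("OTEL_RESOURCE_ATTRIBUTES", "")
os.environ["OTEL_RESOURCE_ATTRIBUTES"] = f"{existing},{extra}" if existing else extra
```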

As implemented today, the `telemetry-dispatch-setup` action uploads the
environment variable file before running the action's command. That saves us
from adding a separate upload step in our shared-workflows workflows, but it
also means that if the command changes any telemetry-related environment
variable, the change will not be reflected in the stored metadata.

TODO: Seems like we should move the upload AFTER the action command.

## Receiving and storing data

## Viewing captured data
