Commit

doc: move information unrelated to nearone infra to book (near#11072)
This information is potentially interesting to our opensource
contributors as well so it should not be hidden away in our private
docs.
nagisa authored Apr 18, 2024
1 parent 2a3e0b5 commit ccf3873
Showing 2 changed files with 108 additions and 10 deletions.
Binary file added docs/images/span-details.png
118 changes: 108 additions & 10 deletions docs/practices/workflows/otel_traces.md
@@ -48,11 +48,46 @@ harder to extract signal in the future. Keep this trade off in mind.
[Loki]: https://grafana.com/oss/loki/
[Jaeger]: https://www.jaegertracing.io/


### Spans

We have a [style guide section on the use of Spans](../style.md#spans); please make yourself
familiar with it.

Every `tracing::debug_span!()` creates a new span, and usually it is attached to its parent
automatically.

However, a few corner cases exist.

* `do_apply_chunks()` starts 4 sub-tasks in parallel and waits for their completion. To make it
work, the parent span is passed explicitly to the sub-tasks.
* Messages to actix workers. If you do nothing, the traces are limited to work done in a
single actor, which is very restrictive and not useful enough. To work around that, each actix
message gets an `opentelemetry::Context` attached. That context carries the information
about the parent span. This mechanism is the reason you see annoying `.with_span_context()`
function calls whenever you send a message to an actix Actor.
* Inter-process tracing is theoretically available, but I have never tested it. The plan was to
test it as soon as the Canary images get updated 😭 Therefore it most likely doesn’t work. Each
`PeerMessage` is injected with `TraceContext` (1, 2); the receiving node extracts that context,
and all spans generated in handling that message should be parented to the trace from the other node.
* Some spans are created using `info_span!()` but they are few and mostly for the logs. Exporting
only info-level spans doesn’t give any useful tracing information in Grafana.

* `actix::Actor::handle()` deserves a special note. The design choice was to provide a macro that
  lets us easily annotate every implementation of `actix::Actor::handle()`. This macro sets the
  following span attributes:

  * `actor` to the name of the struct that implements `actix::Actor`
  * `handler` to the name of the message struct

  It also lets you provide more span attributes. In the example, `ClientActor` specifies `msg_type`,
  which in all cases is identical to `handler`.
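
To make the actix corner case more concrete, here is a minimal sketch of the idea using only the
Rust standard library. It does not use the real `tracing`, `opentelemetry` or actix APIs:
`SpanContext`, `WithSpanContext`, `SpanRecord`, `handle` and `send_and_handle` are hypothetical
names, and a plain thread with a channel stands in for an actor. The only point is that the
sender's span context travels inside the message, so the handler's span can be parented to it.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for the parent-span information that the real code
/// carries in an `opentelemetry::Context`.
#[derive(Clone, Debug, PartialEq)]
struct SpanContext {
    trace_id: u64,
    span_id: u64,
}

/// Hypothetical envelope: a message plus the span context of its sender.
#[derive(Debug)]
struct WithSpanContext<T> {
    msg: T,
    parent: SpanContext,
}

/// What the receiving side records for the span it opens.
#[derive(Debug, PartialEq)]
struct SpanRecord {
    handler: String,
    trace_id: u64,
    parent_span_id: u64,
}

/// Opens the handler's "span" as a child of the sender's span instead of
/// starting a fresh, unrelated trace.
fn handle(envelope: WithSpanContext<String>) -> SpanRecord {
    SpanRecord {
        handler: envelope.msg,
        trace_id: envelope.parent.trace_id,
        parent_span_id: envelope.parent.span_id,
    }
}

/// Sends `msg` to a worker thread together with `parent` and returns the
/// span record produced by the worker.
fn send_and_handle(parent: SpanContext, msg: String) -> SpanRecord {
    let (tx, rx) = mpsc::channel::<WithSpanContext<String>>();
    let worker = thread::spawn(move || handle(rx.recv().expect("sender dropped")));
    tx.send(WithSpanContext { msg, parent }).expect("worker is gone");
    worker.join().expect("worker panicked")
}
```

Without the envelope, the worker would have no way to know which trace it belongs to, which is
the "traces are limited to work done in a single actor" situation described above.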

## Configuration

[The Tracing documentation page in Nearone's
Outline](https://nearone.getoutline.com/doc/tracing-in-grafana-RgJUJZF2C0) documents the steps
necessary to start moving the trace data from the node to Nearone's Grafana Cloud instance. Once
you set up your nodes, you can use the explore page to verify that the traces are coming through.

![Image displaying the Grafana explore page interacting with the grafana-nearinc-traces data
@@ -78,7 +113,9 @@ and invoke `sudo pkill -HUP neard`. Double check that the collector is running a
<blockquote style="background: rgba(255, 200, 0, 0.1); border: 5px solid rgba(255, 200, 0, 0.4);">

**Good to know**: You can modify the event/span/log targets you’re interested in just like when
setting the `RUST_LOG` environment variable, including target filters. If you're setting verbose
levels, consider selecting specific targets you're interested in too. This will help to keep trace
ingest costs down.

For more information about the dynamic settings refer to `core/dyn-configs` code in the repository.

@@ -104,11 +141,47 @@ wish.

Now that the data is arriving in the databases, it is time to visualize the data to determine
what you want to know about the node. The only general advice I have here is to check that the data
source is indeed Tempo or Loki.

### Explore

Initial exploration is best done with Grafana's Explore tool or some other mechanism to query and
display individual traces.

The query builder available in Grafana makes the process quite straightforward to start with, but
is also somewhat limited. The underlying [TraceQL has many more
features](https://grafana.com/docs/tempo/latest/traceql/) that are not available through the
builder. For example, you can query data in a somewhat relational manner: the query below selects
only spans named `process_receipt` that take longer than 50ms when run as part of `new_chunk`
processing for shard 3.

```
{ name="new_chunk" && span.shard_id = "3" } >> { name="process_receipt" && duration > 50ms }
```
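
If you need the same query for several shards or span names (for example, one dashboard panel per
shard), it can be convenient to generate the query text. The helper below is only a sketch:
`slow_descendants_query` is a hypothetical name, not something that exists in the repository, and
it does no escaping of its inputs.

```rust
/// Builds a TraceQL query selecting `child` spans slower than `min_ms`
/// milliseconds that are descendants (`>>`) of a `parent` span on the given
/// shard. Inputs are inserted verbatim, so they must not contain quotes.
fn slow_descendants_query(parent: &str, shard_id: u64, child: &str, min_ms: u64) -> String {
    format!(
        "{{ name=\"{parent}\" && span.shard_id = \"{shard_id}\" }} >> \
         {{ name=\"{child}\" && duration > {min_ms}ms }}"
    )
}
```

Calling `slow_descendants_query("new_chunk", 3, "process_receipt", 50)` reproduces the query shown
above.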

<blockquote style="background: rgba(255, 200, 0, 0.1); border: 5px solid rgba(255, 200, 0, 0.4);">

**Good to know**: When querying, keep in mind the "Options" dropdown that allows you to specify the
limit of results and the format in which these results are presented! In particular, the
"Traces/Spans" toggle will affect the durations shown in the result table.

</blockquote>

Once you click on a span of interest, Grafana will open a view of the trace that contains
said span, where you can inspect both the overall trace and the properties of the span:

![Image displaying a specific trace with two of the spans expanded to show their
details](../../images/span-details.png)

### Dashboards

Once you have arrived at an interesting query, you may be inclined to create a dashboard that
summarizes the data without having to dig into individual traces and spans.

As an example, the author was interested in checking the execution speed before and after a change
in a component. To make the comparison visual, the span of interest was graphed using the histogram
visualization in order to obtain the following result. In this graph, the X axis is the span
duration and the Y axis is the number of spans that took that long to complete.

<div id="image-comparison">
<img src="../../images/compile-and-load-before.png" class="before" />
@@ -131,5 +204,30 @@ visualization of the baseline performance.
}
</style>

In general, most of the panels work with tracing results directly, but some of the most interesting
ones do not. It is necessary to experiment with certain options and settings to have Grafana panels
start showing data. Some notable examples:

1. Time series – a “Prepare time series” data transformation with “Multi-frame time series” has to
be added;
2. Histogram – make sure to use the “spans” table format option;
3. Heatmap – set the “Calculate from data” option to “Yes”;
4. Bar chart – works out of the box, but the X axis is never readable.

You can also add a panel that shows all the trace events in a log-like representation using the log
or table visualization.

### Multiple nodes

One frequently asked question is whether Grafana lets you distinguish between nodes that export
tracing information.

The answer is yes.

In addition to span attributes, each span has resource attributes. There you'll find properties
like `node_id` which uniquely identify a node.

* `account_id` is the `account_id` from `validator_key.json`;
* `chain_id` is taken from `genesis.json`;
* `node_id` is the public key from `node_key.json`;
* `service.name` is `account_id` if that is available, otherwise it is `node_id`.
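
The fallback rule for `service.name` can be sketched as follows. This is an illustration of the
rule as described in the list above, not the node's actual code: the `ResourceAttributes` struct
and the `resource_attributes` function are hypothetical, and the values are assumed to have
already been read from `validator_key.json`, `genesis.json` and `node_key.json`.

```rust
/// Hypothetical container for the resource attributes listed above.
#[derive(Debug, PartialEq)]
struct ResourceAttributes {
    account_id: Option<String>,
    chain_id: String,
    node_id: String,
    service_name: String,
}

/// Assembles the resource attributes from values that were already loaded
/// from the node's configuration files.
fn resource_attributes(
    account_id: Option<&str>,
    chain_id: &str,
    node_id: &str,
) -> ResourceAttributes {
    ResourceAttributes {
        account_id: account_id.map(str::to_owned),
        chain_id: chain_id.to_owned(),
        node_id: node_id.to_owned(),
        // `service.name` is the account id when one is available and falls
        // back to the node id otherwise.
        service_name: account_id.unwrap_or(node_id).to_owned(),
    }
}
```

In practice this means that a node without a validator key shows up in Grafana under its
`node_id`, while a validator shows up under its account name.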
