doc: move information unrelated to nearone infra to book (near#11072)

This information is potentially interesting to our opensource contributors as well so it should not be hidden away in our private docs.
kapenacunmin · Apr 18, 2024 · ccf3873 · ccf3873
1 parent 2a3e0b5
commit ccf3873
Show file tree

Hide file tree

Showing 2 changed files with 108 additions and 10 deletions.
diff --git a/docs/images/span-details.png b/docs/images/span-details.png
diff --git a/docs/practices/workflows/otel_traces.md b/docs/practices/workflows/otel_traces.md
@@ -48,11 +48,46 @@ harder to extract signal in the future. Keep this trade off in mind.
 [Loki]: https://grafana.com/oss/loki/
 [Jaeger]: https://www.jaegertracing.io/
 
+
+### Spans
+
+We have a [style guide section on the use of Spans](../style.md#spans), please make yourself
+familiar with it.
+
+Every `tracing::debug_span!()` creates a new span, and usually it is attached to its parent
+automatically.
+
+However, a few corner cases exist.
+
+* `do_apply_chunks()` starts 4 sub-tasks in parallel and waits for their completion. To make it
+work, the parent span is passed explicitly to the sub-tasks.
+* Messages to actix workers. If you do nothing, that the traces are limited to work done in a
+single actor. But that is very restrictive and not useful enough. To workaround that, each actix
+message gets attached `opentelemetry::Context`. That context somehow represents the information
+about the parent span. This mechanism is the reason you see annoying `.with_span_context()`
+function calls whenever you send a message to an actix Actor.
+* Inter-process tracing is theoretically available, but I have never tested it. The plan was to
+test it as soon as the Canary images get updated 😭 Therefore it most likely doesn’t work. Each
+`PeerMessage` is injected with `TraceContext` (1, 2) and the receiving node extracts that context
+and all spans generated in handling that message should be parented to the trace from another node.
+* Some spans are created using `info_span!()` but they are few and mostly for the logs. Exporting
+only info-level spans doesn’t give any useful tracing information in Grafana.
+
+* `actix::Actor::handle()` deserves a special note. The design choice was to provide a macro that
+lets us easily annotate every implementation of `actix::Actor::handle()`. This macro sets the
+following span attributes:
+
+   * `actor` to the name of the struct that implements actix::Actor
+   * `handler` to the name of the message struct
+
+   And it lets you provide more span attributes. In the example, ClientActor specifies `msg_type`,
+   which in all cases is identical to `handler`.
+
 ## Configuration
 
-[The OTLP documentation page in our terraform
-repository](https://github.com/PagodaPlatform/tf-near-node/blob/main/doc/otlp.md) documents the
-steps necessary to start moving the trace data from the node to our Grafana Cloud instance. Once
+[The Tracing documentation page in nearone's
+Outline](https://nearone.getoutline.com/doc/tracing-in-grafana-RgJUJZF2C0) documents the steps
+necessary to start moving the trace data from the node to Nearone's Grafana Cloud instance. Once
 you set up your nodes, you can use the explore page to verify that the traces are coming through.
 
 ![Image displaying the Grafana explore page interacting with the grafana-nearinc-traces data
@@ -78,7 +113,9 @@ and invoke `sudo pkill -HUP neard`. Double check that the collector is running a
 <blockquote style="background: rgba(255, 200, 0, 0.1); border: 5px solid rgba(255, 200, 0, 0.4);">
 
 **Good to know**: You can modify the event/span/log targets you’re interested in just like when
-setting the `RUST_LOG` environment variable.
+setting the `RUST_LOG` environment variable, including target filters. If you're setting verbose
+levels, consider selecting specific targets you're interested in too. This will help to keep trace
+ingest costs down.
 
 For more information about the dynamic settings refer to `core/dyn-configs` code in the repository.
 
@@ -104,11 +141,47 @@ wish.
 
 Now that the data is arriving into the databases, it is time to visualize the data to determine
 what you want to know about the node. The only general advise I have here is to check that the data
-source is indeed tempo or loki. Try out visualizations other than time series. For example the
-author was interested in checking the execution speed before and after a change in a component.
-To make the comparison visual, the span of interest was graphed using the histogram visualization
-in order to obtain the following result. Note: you can hover your mouse on the image to see the
-visualization of the baseline performance.
+source is indeed tempo or loki.
+
+### Explore
+
+Initial exploration is best done with Grafana's Explore tool or some other mechanism to query and
+display individual traces.
+
+The query builder available in Grafana makes the process quite straightforward to start with, but
+is also somewhat limited. Underlying [TraceQL has many more
+features](https://grafana.com/docs/tempo/latest/traceql/) that are not available through the
+builder. For example, you can query data in somewhat of a relational manner, such as this query
+below queries only spans named `process_receipt` that take 50ms when run as part of `new_chunk`
+processing for shard 3!
+
+```
+{ name="new_chunk" && span.shard_id = "3" } >> { name="process_receipt" && duration > 50ms }
+```
+
+<blockquote style="background: rgba(255, 200, 0, 0.1); border: 5px solid rgba(255, 200, 0, 0.4);">
+
+**Good to know**: When querying, keep in mind the "Options" dropdown that allows you to specify the
+limit of results and the format in which these results are presented! In particular, the
+"Traces/Spans" toggle will affect the durations shown in the result table.
+
+</blockquote>
+
+Once you click on a span of interest, Grafana will open you a view with the trace that contains
+said span, where you can inspect both the overall trace and the properties of the span:
+
+![Image displaying a specific trace with two of the spans expanded to show their
+details](../../images/span-details.png)
+
+### Dashboards
+
+Once you have arrived at an interesting query, you may be inclined to create a dashboard that
+summarizes the data without having to dig into individual traces and spans.
+
+As an example the author was interested in checking the execution speed before and after a change
+in a component. To make the comparison visual, the span of interest was graphed using the histogram
+visualization in order to obtain the following result. In this graph the Y axis displays the number
+of occurrences for spans that took X-axis long to complete.
 
 <div id="image-comparison">
 <img src="../../images/compile-and-load-before.png" class="before" />
@@ -131,5 +204,30 @@ visualization of the baseline performance.
 }
 </style>
 
+In general most of the panels work with tracing results directly but some of the most interesting
+ones do not. It is necessary to experiment with certain options and settings to have grafana panels
+start showing data. Some notable examples:
+
+1. Time series – a “Prepare time series” data transformation with “Multi-frame time series” has to
+   be added;
+2. Histogram – make sure to use "spans" table format option;
+3. Heatmap - set “Calculate from data” option to “Yes”;
+4. Bar chart – works out of the box, but x axis won't be readable ever.
+
 You can also add a panel that shows all the trace events in a log-like representation using the log
-visualization.
+or table visualization.
+
+### Multiple nodes
+
+One frequently asked question is whether Grafana lets you distinguish between nodes that export
+tracing information.
+
+The answer is yes.
+
+In addition to span attributes, each span has resource attributes. There you'll find properties
+like `node_id` which uniquely identify a node.
+
+* `account_id` is the `account_id` from `validator_key.json`;
+* `chain_id` is taken from `genesis.json`;
+* `node_id` is the public key from `node_key.json`;
+* `service.name` is `account_id` if that is available, otherwise it is `node_id`.