Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/improve traceroute #159

Merged
merged 47 commits into from
Aug 7, 2024
Merged

Feat/improve traceroute #159

merged 47 commits into from
Aug 7, 2024

Conversation

niklastreml
Copy link
Collaborator

@niklastreml niklastreml commented Jul 29, 2024

Motivation

This PR aims to improve the traceroute check.

Closes #112

Relates to #111

Changes

Feature: Retrieves IP addresses of routers when icmp is available

We have now removed the dependency on github.com/aeden/traceroute by building our own implementation. This gives us more flexibility in how the check runs, mainly allowing us to heavily parallelize it and instrument it. The core of this feature can be found in traceroute.go

Performance: Concurrency for everything

Previously, the check would run every target and every hop of that target serially, causing a single cycle, even with low amounts of targets and hop counts, to take time on the order of multiple minutes.

This pr addresses this, by running all targets and their hops in parallel. For this to work properly, we need to ensure that we don't mix up ICMP of different execution threads, that are coming in at the same time. This is easily done by reading the payload of the incoming ICMP Time Exceeded packet. The payload of this packet will include the ip header + 64 bits of the payload of the packet that caused this Time Exceeded message. Those 64 bits just about contain the tcp packets source and destination ports. We use this functionality by assigning a random portnumber to every thread, relying on the OS to ensure that no two threads have the same port. When an ICMP packet comes we check the payload for the port and discard it, if it doesn't match the threads assigned port. For now every thread gets its own ICMP socket, to keep logic simple. If this causes issues in the future we should refactor it, so there is only one shared socket.

Fix: Shutdown hangs forever

The pr addresses an issue in the checks shutdown logic, where an unbuffered channel caused the checks shutdown routine to block forever.

Metrics: Prometheus metrics

It doesn't make sense to try and use prometheus to capture any of the more detailed traceroute metrics since those would be of pretty high cardinality and would essentially cause prometheus to explode.

I've still added some metrics, that I thought made sense:

$ curl -s localhost:8080/metrics | grep traceroute
# HELP sparrow_traceroute_check_duration_ms
# TYPE sparrow_traceroute_check_duration_ms gauge
sparrow_traceroute_check_duration_ms{target="google.com"} 415
sparrow_traceroute_check_duration_ms{target="telekom.de"} 232
# HELP sparrow_traceroute_minimum_hops
# TYPE sparrow_traceroute_minimum_hops gauge
sparrow_traceroute_minimum_hops{target="google.com"} 14
sparrow_traceroute_minimum_hops{target="telekom.de"} 4

Metrics: API Metrics

There are more detailed metrics available through the json API, which roughly matches the output provided by traceroute when run in a terminal:

curl -s localhost:8080/v1/metrics/traceroute | head -n18
{
  "data": {
    "telekom.de": {
      "MinHops": 4,
      "Hops": {
        "1": [
          {
            "Latency": 14102685,
            "Addr": {
              "IP": "10.74.11.36",
              "Port": 80,
              "Zone": ""
            },
            "Name": "telekom.de.",
            "Ttl": 1,
            "Reached": false
          }
        ],
    ...

Development: Added a local test setup for traceroute

To be able to actually test this thing atleast manually I've provided a kathara-based test setup, which is documented here. This should make it pretty easy to test and debug this. We can maybe even use this tool for e2e tests in CI

Testing: Added E2E tests for the traceroute check

This test sets up a kathara network and starts sparrow on one system, tracerouting to the others webserver. It then checks the prometheus metrics and the metrics api to ensure that the data collected is what is expected. Look at this for some example output.

Guide for reviewing

I have moved the logic of the check (setup, shutdown, implementing the interface and such) into check.go (prev. traceroute.go). traceroute.go now contains the implementation of the actual logic. In general I tried to keep my changes constrained to that file, by only building a new implementation of the tracerouteFactory type, so as to make reviewing the code easier. It's probably best to review traceroute.go completely start to finish, ignoring the diff on github and diffing the check.go against traceroute.go of the main branch.

What's next

Setup opentelemetry

We want to expose the more detailed in a standardized format, that easily be consumed by tools like grafana. OpenTelemetry should be fine and simple to implement, as the go-sdk is nice to use and the protocol is also well supported. Once that's doen we can create otel traces for the traceroute hops, that can be visualized easily.

Support ICMP and UDP

Currently we only support TCP as the probing protocol. After this is merged, I will do a small refactor and make the probing protocol swappable. UDP should be simple to implement (basically just change the string in the constructor of the dialer, the rest of the logic is prety similar to TCP). ICMP might be a little tricky, since we need to ensure, that we can differentiate packets sent by different threads, like tcp, but this time, we can't rely on the OS to ensure no duplicate ports being used.

Tests done

  • Unit tests succeeded
  • E2E tests succeeded

TODO

  • I've assigned this PR to myself
  • I've labeled this PR correctly

Adds a couple of features to the traceroute check:
- address of every hop
- dns name, if available, of the hops
- latency for that hop

There have also been major performance improvements:
- All targets are now checked concurrently
- every hop is checked concurrently

So now, instead of waiting for the slowest hop of every target in
series, we parallelized as much as possible. We should now only have to
wait for the slowest target + dns resolution. In my testing (with a
local setup), the time per run with a few target was cut down from a minute-ish
to a few seconds.

Signed-off-by: Niklas Treml <[email protected]>
This is what I used to test traceroute locally. the tool is called
kathara and it allows setting up simple or complex networks completely
based on docker containers. I am in awe of how amazing this is

Signed-off-by: Niklas Treml <[email protected]>
Signed-off-by: Niklas Treml <[email protected]>
Signed-off-by: Niklas Treml <[email protected]>
Signed-off-by: Niklas Treml <[email protected]>
Signed-off-by: Niklas Treml <[email protected]>
Signed-off-by: Niklas Treml <[email protected]>
Signed-off-by: Niklas Treml <[email protected]>
Signed-off-by: Niklas Treml <[email protected]>
Signed-off-by: Niklas Treml <[email protected]>
Signed-off-by: Niklas Treml <[email protected]>
@niklastreml niklastreml added feature Introduces a new feature area/checks Issues/PRs related to Checks labels Jul 29, 2024
@niklastreml niklastreml requested a review from y-eight July 29, 2024 09:00
@niklastreml niklastreml self-assigned this Jul 29, 2024
Copy link
Member

@y-eight y-eight left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First review done

pkg/checks/traceroute/check.go Outdated Show resolved Hide resolved
pkg/checks/traceroute/traceroute.go Outdated Show resolved Hide resolved
pkg/checks/traceroute/traceroute.go Outdated Show resolved Hide resolved
pkg/checks/traceroute/traceroute.go Outdated Show resolved Hide resolved
pkg/checks/traceroute/traceroute.go Outdated Show resolved Hide resolved
@niklastreml niklastreml requested a review from y-eight July 30, 2024 09:46
Copy link
Collaborator

@lvlcn-t lvlcn-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First review, overall good implementation only some minor things to think about/improve 👍

.github/workflows/e2e_checks.yml Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
pkg/checks/traceroute/traceroute.go Show resolved Hide resolved
pkg/checks/traceroute/traceroute.go Outdated Show resolved Hide resolved
pkg/checks/traceroute/traceroute.go Outdated Show resolved Hide resolved
pkg/checks/traceroute/check.go Show resolved Hide resolved
pkg/checks/traceroute/traceroute.go Outdated Show resolved Hide resolved
pkg/checks/traceroute/traceroute.go Show resolved Hide resolved
pkg/checks/traceroute/traceroute.go Outdated Show resolved Hide resolved
.github/workflows/e2e_checks.yml Show resolved Hide resolved
Signed-off-by: Niklas Treml <[email protected]>
Signed-off-by: Niklas Treml <[email protected]>
Signed-off-by: Niklas Treml <[email protected]>
Signed-off-by: Niklas Treml <[email protected]>
Signed-off-by: Niklas Treml <[email protected]>
@niklastreml niklastreml requested a review from lvlcn-t August 5, 2024 08:10
Copy link
Collaborator

@lvlcn-t lvlcn-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@niklastreml niklastreml merged commit 24ae314 into main Aug 7, 2024
8 checks passed
@niklastreml niklastreml deleted the feat/improve-traceroute branch August 7, 2024 09:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/checks Issues/PRs related to Checks feature Introduces a new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature: Use tcp for traceroute
3 participants