Feat/improve traceroute #159
Adds a couple of features to the traceroute check:
- address of every hop
- DNS name of every hop, if available
- latency of every hop
There have also been major performance improvements:
- all targets are now checked concurrently
- every hop is checked concurrently
So now, instead of waiting for the slowest hop of every target in series, we parallelized as much as possible. We should now only have to wait for the slowest target + DNS resolution. In my testing (with a local setup), the time per run with a few targets was cut down from a minute-ish to a few seconds. Signed-off-by: Niklas Treml <[email protected]>
This is what I used to test traceroute locally. The tool is called Kathara, and it allows setting up simple or complex networks completely based on Docker containers. I am in awe of how amazing this is. Signed-off-by: Niklas Treml <[email protected]>
…eroute waas enabled Signed-off-by: Niklas Treml <[email protected]>
First review done
…ntended Signed-off-by: Niklas Treml <[email protected]>
First review, overall good implementation only some minor things to think about/improve 👍
LGTM
Motivation
This PR aims to improve the traceroute check.
Closes #112
Relates to #111
Changes
Feature: Retrieves IP addresses of routers when ICMP is available
We have now removed the dependency on github.com/aeden/traceroute by building our own implementation. This gives us more flexibility in how the check runs, mainly allowing us to heavily parallelize and instrument it. The core of this feature can be found in traceroute.go.
Performance: Concurrency for everything
Previously, the check ran every target and every hop of that target serially, so a single cycle took on the order of multiple minutes, even with few targets and low hop counts.
This PR addresses this by running all targets and their hops in parallel. For this to work properly, we need to ensure that we don't mix up ICMP packets that belong to different execution threads and arrive at the same time. This is easily done by reading the payload of the incoming ICMP Time Exceeded packet. The payload of this packet includes the IP header plus the first 64 bits of the payload of the packet that caused this Time Exceeded message. For a TCP probe, those 64 bits are the start of the TCP header, which is just enough to contain the source and destination ports. We use this by assigning a random port number to every thread, relying on the OS to ensure that no two threads have the same port. When an ICMP packet comes in, we check the payload for the port and discard the packet if it doesn't match the thread's assigned port. For now, every thread gets its own ICMP socket to keep the logic simple. If this causes issues in the future, we should refactor it so there is only one shared socket.
Fix: Shutdown hangs forever
The PR addresses an issue in the check's shutdown logic, where an unbuffered channel caused the check's shutdown routine to block forever.
Metrics: Prometheus metrics
It doesn't make sense to try and use Prometheus to capture any of the more detailed traceroute metrics, since those would be of pretty high cardinality and would essentially cause Prometheus to explode.
I've still added some metrics that I thought made sense:
Metrics: API Metrics
There are more detailed metrics available through the JSON API, which roughly match the output provided by traceroute when run in a terminal:
Development: Added a local test setup for traceroute
To be able to actually test this thing, at least manually, I've provided a Kathara-based test setup, which is documented here. This should make it pretty easy to test and debug this. We can maybe even use this tool for e2e tests in CI.
Testing: Added E2E tests for the traceroute check
This test sets up a Kathara network and starts sparrow on one system, tracerouting to the other's web server. It then checks the Prometheus metrics and the metrics API to ensure that the data collected is what is expected. Look at this for some example output.
Guide for reviewing
I have moved the logic of the check (setup, shutdown, implementing the interface and such) into check.go (prev. traceroute.go). traceroute.go now contains the implementation of the actual logic. In general I tried to keep my changes constrained to that file, by only building a new implementation of the tracerouteFactory type, so as to make reviewing the code easier. It's probably best to review traceroute.go completely start to finish, ignoring the diff on GitHub, and to diff check.go against the traceroute.go of the main branch.
What's next
Setup opentelemetry
We want to expose the more detailed metrics in a standardized format that can easily be consumed by tools like Grafana. OpenTelemetry should be fine and simple to implement, as the Go SDK is nice to use and the protocol is also well supported. Once that's done, we can create OTel traces for the traceroute hops that can be visualized easily.
Support ICMP and UDP
Currently we only support TCP as the probing protocol. After this is merged, I will do a small refactor and make the probing protocol swappable. UDP should be simple to implement (basically just change the string in the constructor of the dialer; the rest of the logic is pretty similar to TCP). ICMP might be a little tricky, since we need to ensure that we can differentiate packets sent by different threads, as with TCP, but this time we can't rely on the OS to ensure that no duplicate ports are used.
Tests done
TODO