Make metadata pod lookups more resilient to short lived processes #2094

Open: ddelnano wants to merge 2 commits into main

ddelnano (Member) commented Jan 21, 2025

Summary: Make metadata pod lookups more resilient to short-lived processes

This continues the work started in #1989. Because the local_addr column is populated for client-side traces, it can serve as a fallback lookup for those traces. This doesn't cover every case of missing short-lived processes (#1638), but it provides more coverage than before.

Relevant Issues: #1638

Type of change: /kind bugfix

Test Plan: Verified the following

  • Compared performance with and without this change using src/e2e_test/vizier/exectime:exectime. The change has a minor performance impact, but it closes a gap in situations that previously caused users to distrust Pixie's instrumentation.
# Performance baseline
$ ./exectime benchmark -a testing.getcosmic.ai:443 -c <cluster_id> 2>&1 | tee baseline_for_simple_udf_swap_e20880ffd.txt
# Performance of this change
$ ./exectime benchmark -a testing.getcosmic.ai:443 -c <cluster_id> 2>&1 | tee simple_udf_swap_cd217c05c.txt

simple_udf_swap_cd217c05c.txt
baseline_for_simple_udf_swap_e20880ffd.txt

  • Ran for i in $(seq 0 1000); do curl http://google.com/$i; sleep 2; done within a pod and verified that all traces are shown with this change, whereas a significant number of traces are missed without it. See the before and after screenshots below:

[Screenshot: vizier-0 14 14-curl-with-missing-data]
[Screenshot: traces-with-new-fallback]

Changelog Message: Fix a class of cases where Pixie previously missed protocol traces from short-lived connections

This makes the df.ctx['pod'] syntactic sugar try a second pod name
lookup if the default upid -> pod name lookup fails. That failure
is common for pods with short-lived processes, so a pod IP
based lookup (local_addr) is attempted when the first lookup fails.

Signed-off-by: Dom Del Nano <[email protected]>
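
The two-step lookup described in the commit message can be sketched roughly as follows. This is a minimal illustration in plain Python, not Pixie's actual implementation: the dictionaries, the `resolve_pod_name` function, the sample values, and the choice of `None` for "not found" are all hypothetical and stand in for the metadata state that the real lookup consults.

```python
from typing import Optional

# Hypothetical stand-ins for metadata state; not Pixie's real data structures.
UPID_TO_POD = {"upid-1": "default/frontend"}
POD_IP_TO_POD = {"10.0.0.7": "default/curl-job"}


def resolve_pod_name(upid: str, local_addr: str) -> Optional[str]:
    """Resolve a trace's pod name: try the UPID first, then the pod IP."""
    pod = UPID_TO_POD.get(upid)
    if pod is not None:
        return pod
    # Fallback: the process behind the UPID may already have exited, but for
    # client-side traces local_addr holds the pod's own IP, which can still
    # be mapped to a pod name.
    return POD_IP_TO_POD.get(local_addr)
```

In this sketch a trace whose process has already exited still resolves to a pod as long as its local_addr matches a known pod IP; a trace that fails both lookups stays unresolved, which mirrors the note that not every permutation in #1638 is covered.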
@ddelnano ddelnano requested a review from a team as a code owner January 21, 2025 23:52