You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Initially spotted in #244 (comment), opening a separate issue since the fix might be more involved.
Running the profiler next to a Java process for more than an hour and noticed two things: 1. After one hour of runtime, stub frames have as function name UNRESOLVED, since their metadata is submitted only once since the addrToStubNameID never expires see: https://github.com/DataDog/opentelemetry-ebpf-profiler/blob/0599c7c681ee15f6c394c9b3acdf2dd45c7d8279/interpreter/hotspot/instance.go#L240 2. (not very critical) Once per hour, in a single reporting cycle, some of the frames would have as function name UNREPORTED, since the expiry logic expired an actively used file ID. The next reporting cycle fixes this, so this only affects one profile every hour.
Expected behavior: We shouldn't see UNRESOLVED / UNREPORTED frames after one hour.
Run the profiler via sudo ./ebpf-profiler -collection-agent=127.0.0.1:11000 -disable-tls -reporter-interval=1m
Wait for one hour (1h) after interpreted frames are seen for a Java service
Stub frames consistently have as function name UNRESOLVED (1.)
For a single profile every hour, some frames have UNREPORTED as their function name. (2.)
Potential fixes
The frames cache is an LRU cache of fileID -> map[libpf.AddressOrLineno]sourceInfo. The internal map has no limit on size.
We could use the newly added GetAndRefreshKeys() in go-freelru so that actively used entry in the frames cache are not expired after an hour. This would fix both issues, with the tradeoff that, for actively used files, entries in the inner map[libpf.AddressOrLineno]sourceInfo would never get GCed. This sounds reasonable since if a file is actively used, it would make sense to keep all potential line numbers in cache (I wouldn't expect the memory overhead to be significant due to rarely used line numbers).
Alternatively, if we want to ensure that the size of the inner map[libpf.AddressOrLineno]sourceInfo is capped, we could turn it into an LRU.
The text was updated successfully, but these errors were encountered:
Using GetAndRefreshKeys() for outer cache is a good idea.
What concerns me a bit is that for long/forever running large executables there is currently the potential of increasing the inner map without limit. Leaf frames can be seen at any opcode address (a matter of probability).
Using an LRU instead of map for the inner caches to have a cap sound reasonable.
The problem I see is the fixed size of the LRU. The LRU is not able to grow and so we pre-allocate a relatively large amount of possibly never used memory for every fileid.
To mitigate that we could write a thin wrapper around the LRU (ResizableLRU) that has a cap but starts with e.g. 32 pre-allocated entries (starting size needs to be determined). Adding an element on a full LRU would lead to a resize until the cap is reached.
We also need to extend the eviction of expired items for the inner caches (search for r.frames.PurgeExpired() in otlp_reporter.go.
Similar to the reported issue #248 around lifetime of executables, the same issue happens for frames.
So keep cached frame information alive.
Signed-off-by: Florian Lehner <[email protected]>
What happened?
Initially spotted in #244 (comment), opening a separate issue since the fix might be more involved.
Running the profiler next to a Java process for more than an hour and noticed two things:
1. After one hour of runtime, stub frames have as function name
UNRESOLVED
, since their metadata is submitted only once since theaddrToStubNameID
never expires see: https://github.com/DataDog/opentelemetry-ebpf-profiler/blob/0599c7c681ee15f6c394c9b3acdf2dd45c7d8279/interpreter/hotspot/instance.go#L2402. (not very critical) Once per hour, in a single reporting cycle, some of the frames would have as function name
UNREPORTED
, since the expiry logic expired an actively used file ID. The next reporting cycle fixes this, so this only affects one profile every hour.Expected behavior: We shouldn't see
UNRESOLVED
/UNREPORTED
frames after one hour.Reproducing
On latest main (4f405f2):
UNRESOLVED
(1.)UNREPORTED
as their function name. (2.)Potential fixes
The
frames
cache is an LRU cache offileID
->map[libpf.AddressOrLineno]sourceInfo
. The internal map has no limit on size.We could use the newly added
GetAndRefreshKeys()
ingo-freelru
so that actively used entry in theframes
cache are not expired after an hour. This would fix both issues, with the tradeoff that, for actively used files, entries in the innermap[libpf.AddressOrLineno]sourceInfo
would never get GCed. This sounds reasonable since if a file is actively used, it would make sense to keep all potential line numbers in cache (I wouldn't expect the memory overhead to be significant due to rarely used line numbers).Alternatively, if we want to ensure that the size of the inner
map[libpf.AddressOrLineno]sourceInfo
is capped, we could turn it into an LRU.The text was updated successfully, but these errors were encountered: