
BatchProcessor shutdown issue #2478

Open
cijothomas opened this issue Dec 27, 2024 · 7 comments
Comments

@cijothomas (Member) commented Dec 27, 2024

BatchProcessor relies on a channel to pass messages, which include telemetry items (Log/Span) as well as control signals like Shutdown/Flush. If the channel is full, Shutdown signals get lost. For example, suppose the channel already holds 500 items and the thread is in a blocking export() call; during this time, 2000 more items arrive. Any new items are dropped, which is by design, but if a shutdown is initiated during this window, that message is lost as well. When the thread gets back to processing the queue, it drains the normal telemetry items, but the shutdown signal is lost forever. Even when the shutdown message is not lost, it lands at the last position in the queue, so it is processed only after all other messages, potentially after multiple export() calls.
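A minimal standalone sketch of the failure mode (with a hypothetical `Message` enum, not the SDK's internal type): once the bounded channel is full, the Shutdown control message is dropped just like any telemetry item.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::time::Duration;

// Hypothetical stand-in for the processor's internal message type.
enum Message {
    Export(String),
    Shutdown,
}

fn main() {
    // Bounded channel standing in for the batch processor's queue.
    let (tx, rx) = sync_channel::<Message>(2);

    // The worker is "blocked in export()", so nothing is being consumed.
    tx.try_send(Message::Export("log-1".into())).unwrap();
    tx.try_send(Message::Export("log-2".into())).unwrap();

    // Queue is full: the Shutdown control message is silently dropped.
    if let Err(TrySendError::Full(_)) = tx.try_send(Message::Shutdown) {
        println!("shutdown signal lost: queue is full");
    }

    // When the worker resumes, it only ever sees telemetry items.
    while let Ok(msg) = rx.recv_timeout(Duration::from_millis(10)) {
        match msg {
            Message::Export(item) => println!("exporting {item}"),
            Message::Shutdown => println!("shutting down"),
        }
    }
}
```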

We may need to use a separate channel for control flow (shutdown/flush) to mitigate the issue. This may not be easy to eliminate fully, because a blocking export call is still being made, and anything that occurs during that call is hard to respond to promptly.
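One possible shape of that mitigation, sketched with std::sync::mpsc and hypothetical type names (not the SDK's actual types): the worker drains the control channel first, and on shutdown it also drains whatever telemetry is already queued.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};
use std::time::Duration;

// Hypothetical names, purely illustrative.
enum Control {
    FlushRequested,
    ShutdownRequested,
}
struct LogRecord; // placeholder for a log/span payload

fn worker_loop(control_rx: Receiver<Control>, data_rx: Receiver<LogRecord>) {
    loop {
        // Control messages take priority and cannot be crowded out by telemetry.
        match control_rx.try_recv() {
            Ok(Control::ShutdownRequested) => {
                // Drain telemetry that arrived before the shutdown, then exit.
                while let Ok(_record) = data_rx.try_recv() {
                    // export(_record)
                }
                return;
            }
            Ok(Control::FlushRequested) => {
                // export the current batch
            }
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => {}
        }

        // Wait briefly for telemetry so control messages are still seen promptly.
        if let Ok(_record) = data_rx.recv_timeout(Duration::from_millis(100)) {
            // buffer _record; export when the batch is full
        }
    }
}
```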

PeriodicReader has a similar problem, but it is less severe: the Shutdown message is not processed right away, but it is reacted to as soon as the blocking export is done.

cijothomas added this to the Logging SDK Stable milestone on Dec 27, 2024
@lalitb (Member) commented Dec 27, 2024

Good point. I observed this while writing unit tests for the dedicated-thread span processor. A separate control channel for shutdown/flush looks like a good idea, with the control channel being read first. However, there may then be log events in the data queue that arrived before the shutdown message; we need to make sure those are also drained and exported, in addition to the ones already in the internal vector.

@utpilla (Contributor) commented Jan 3, 2025

There is another issue with the current implementation of BatchLogProcessor. The background thread is woken up for every log record that gets emitted. This is not ideal. We want the background thread to have low utilization so that it doesn't aggressively compete with the user's application threads. We only want to wake up the background thread on these conditions:

  1. The number of log records in the channel has reached max_export_batch_size.
  2. The scheduled_delay interval has elapsed.
  3. force_flush is called.
  4. shut_down is called.

I have created #2494 to show one possible approach to handle both the shutdown issue and the background thread's utilization issue.
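Purely as an illustration (not the approach taken in #2494), here is a rough sketch of waking the worker only on those four conditions, using a hypothetical wake-signal channel plus an atomic counter; all names below are made up for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{Receiver, RecvTimeoutError, SyncSender};
use std::sync::Arc;
use std::time::Duration;

// Hypothetical names, purely illustrative.
enum Wake {
    BatchFull,
    Flush,
    Shutdown,
}

struct Handle {
    queued: Arc<AtomicUsize>,
    max_export_batch_size: usize,
    wake_tx: SyncSender<Wake>,
}

impl Handle {
    // emit() does NOT wake the worker per record; it only signals once the
    // batch-size threshold is crossed.
    fn emit(&self) {
        let n = self.queued.fetch_add(1, Ordering::Relaxed) + 1;
        if n >= self.max_export_batch_size {
            let _ = self.wake_tx.try_send(Wake::BatchFull);
        }
    }
}

fn worker_loop(wake_rx: Receiver<Wake>, scheduled_delay: Duration) {
    loop {
        // Sleep until scheduled_delay elapses or an explicit wake signal arrives.
        match wake_rx.recv_timeout(scheduled_delay) {
            Ok(Wake::BatchFull) | Err(RecvTimeoutError::Timeout) => {
                // export the current batch
            }
            Ok(Wake::Flush) => {
                // export everything currently queued
            }
            Ok(Wake::Shutdown) => {
                // final export, then exit
                return;
            }
            Err(RecvTimeoutError::Disconnected) => return,
        }
    }
}
```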

@ThomsonTan (Contributor) commented Jan 6, 2025

For the missing shutdown/force_flush, could we just use a CondVar instead of a separate channel? It should be fine to aggregate multiple flushes into one.

Ideally, we could replace the channel here with our own circular buffer, as in the C++/C# implementations, to avoid potentially excessive synchronization.
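A rough sketch of the CondVar idea, with hypothetical flag names and the telemetry queue itself elided: repeated force_flush calls just re-set a single pending bit, so multiple flush requests naturally coalesce into one.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

// Hypothetical control flags, purely illustrative.
#[derive(Default)]
struct ControlState {
    flush_requested: bool,
    shutdown_requested: bool,
}

#[derive(Default)]
struct Control {
    state: Mutex<ControlState>,
    cv: Condvar,
}

impl Control {
    fn request_flush(&self) {
        self.state.lock().unwrap().flush_requested = true;
        self.cv.notify_one();
    }

    fn request_shutdown(&self) {
        self.state.lock().unwrap().shutdown_requested = true;
        self.cv.notify_one();
    }
}

fn worker_loop(control: Arc<Control>, scheduled_delay: Duration) {
    loop {
        let guard = control.state.lock().unwrap();
        // Wake on notify or after scheduled_delay; spurious wakeups are fine
        // because the flags are re-checked on every iteration.
        let (mut state, _timed_out) = control
            .cv
            .wait_timeout(guard, scheduled_delay)
            .unwrap();
        if state.shutdown_requested {
            // final export, then exit
            return;
        }
        if state.flush_requested {
            state.flush_requested = false;
            // export the current batch now
        }
        // Otherwise the timer fired: export on schedule.
    }
}
```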

@ThomsonTan (Contributor)

There could also be another issue with BatchProcessor. Since it spawns its own thread and the other threads are not blocked before it is ready, a shutdown or log call will fail before that. I think we can add an extra synchronization when creating the background thread to notify the user thread that the BatchProcessor is ready.
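A minimal sketch of that extra synchronization, using a hypothetical spawn_worker constructor: the spawned thread signals readiness over a one-shot channel before entering its loop, so the creating thread only returns once the worker is up.

```rust
use std::sync::mpsc::channel;
use std::thread;

// Hypothetical constructor: block the caller until the background thread
// has finished its setup and signalled readiness.
fn spawn_worker() -> thread::JoinHandle<()> {
    let (ready_tx, ready_rx) = channel::<()>();
    let handle = thread::spawn(move || {
        // ... create channels, buffers, etc. ...
        ready_tx.send(()).expect("creator is waiting for the ready signal");
        // ... enter the processing loop ...
    });
    // The user thread only gets the processor back once the worker is ready.
    ready_rx.recv().expect("worker thread failed to start");
    handle
}
```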

@lalitb (Member) commented Jan 6, 2025

> Since it spawns its own thread and the other threads are not blocked before it is ready, a shutdown or log call will fail before that.

I don’t think this would really be an issue. In the current design, the background thread for the BatchLogProcessor is ready to handle requests as soon as the LoggerProvider (or OpenTelemetryTracingBridge) is fully initialized. It’s important for the application to make sure that shutdown or force_flush are only called after everything is properly set up.

If those methods are called too early, they’d essentially run on Noop objects, which should be expected behavior.

@cijothomas (Member, Author)

> > Since it spawns its own thread and the other threads are not blocked before it is ready, a shutdown or log call will fail before that.
>
> I don’t think this would really be an issue. In the current design, the background thread for the BatchLogProcessor is ready to handle requests as soon as the LoggerProvider (or OpenTelemetryTracingBridge) is fully initialized. It’s important for the application to make sure that shutdown or force_flush are only called after everything is properly set up.
>
> If those methods are called too early, they’d essentially run on Noop objects, which should be expected behavior.

> It’s important for the application to make sure that shutdown or force_flush are only called after everything is properly set up.

Is that something users have to worry about? They can call shutdown only on an instance of TracerProvider, and they won't have a TracerProvider unless the processors (including Batch) are already set up, which would not happen without the required threads being spun up.

@lalitb (Member) commented Jan 7, 2025

> Is that something users have to worry about? They can call shutdown only on an instance of TracerProvider, and they won't have a TracerProvider unless the processors (including Batch) are already set up, which would not happen without the required threads being spun up.

Yes, I was thinking about global::shutdown_tracer_provider() being invoked on another thread while the main thread is still initializing the TracerProvider. We don't have that scenario anymore :)
