Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MDC context is lost when calling asynchronous operations with executor #2142

Closed
getaceres opened this issue Nov 10, 2020 · 7 comments
Closed
Labels
bug This issue is a bug. closed-for-staleness

Comments

@getaceres
Copy link

When executing asynchronous, non blocking operations, MDC context is lost after receiving the result, although an executor that copies the MDC context is used to run the operation.

Describe the bug

A proof of concept can be found at https://github.com/getaceres/aws-mdc-test
There are two classes, one with a ThreadPool that copies the MDC context between calls and another one with the test. The test first executes some ficticious work in the executor to check that the MDC context is conserved, then it executes getQueueUrl, createQueue and receiveMessages operations blocking with a join in SQS to check that the MDC is conserved after obtaining the result.
The problematic part is executing the receiveMessages operation in a non-blocking way: The MDC context is lost and in the response the correlation identifier is lost for good for the rest of the processing chain.

Expected Behavior

The provided executor gets used and so the MDC context is conserved.

Current Behavior

Looking at the log:

2020-11-10 19:22:25,426 INFO  [pool-1-thread-1] [correlation:b5037b4f-8a10-4cf3-bc63-b807cfd27be2] o.e.a.MdcDeleteAwsTest.98: Sleeping for 1 second 
2020-11-10 19:22:25,769 INFO  [           main] [correlation:b5037b4f-8a10-4cf3-bc63-b807cfd27be2] o.e.a.MdcDeleteAwsTest.33: Trying to find queue test-queue 
2020-11-10 19:22:26,432 INFO  [pool-1-thread-2] [correlation:b5037b4f-8a10-4cf3-bc63-b807cfd27be2] o.e.a.MdcDeleteAwsTest.72: Finished doing work with correlation b5037b4f-8a10-4cf3-bc63-b807cfd27be2 
2020-11-10 19:22:27,128 ERROR [aws-java-sdk-NettyEventLoop-0-4] [correlation:] o.e.a.MdcPropagationPool$RunnableWrapperWithMdc.61: Received a null MDC map 
2020-11-10 19:22:27,132 INFO  [           main] [correlation:b5037b4f-8a10-4cf3-bc63-b807cfd27be2] o.e.a.MdcDeleteAwsTest.48: Queue test-queue found at URL http://localhost:4566/000000000000/test-queue 
2020-11-10 19:22:30,232 ERROR [aws-java-sdk-NettyEventLoop-0-2] [correlation:] o.e.a.MdcPropagationPool$RunnableWrapperWithMdc.61: Received a null MDC map 
2020-11-10 19:22:30,233 INFO  [pool-1-thread-5] [correlation:] o.e.a.MdcDeleteAwsTest.91: Received asynchronous receive response with 0 messages and correlation null 
2020-11-10 19:22:30,234 ERROR [aws-java-sdk-NettyEventLoop-0-10] [correlation:] o.e.a.MdcPropagationPool$RunnableWrapperWithMdc.61: Received a null MDC map 
2020-11-10 19:22:30,235 INFO  [pool-1-thread-4] [correlation:b5037b4f-8a10-4cf3-bc63-b807cfd27be2] o.e.a.MdcDeleteAwsTest.87: Got blocking receive response with 0 messages and correlation b5037b4f-8a10-4cf3-bc63-b807cfd27be2 

It seems that a Netty executor is used to execute the HTTP calls, even when netty-nio-client is not in the classpath. After returning the result, the MDC context is lost.

Steps to Reproduce

Download and run the project at https://github.com/getaceres/aws-mdc-test

Possible Solution

Right now the only workaround that I can think of is to always use blocking calls, which negates the advantage of the asynchronous API.
Another possibility is to use the provided executor to execute the HTTP calls as well.
A Reactive API also could help since it can receive an executor at publishing/subscription time and use it in a non-blocking way.

Context

I'm getting a series of SQS messages from a queue. They contain an attribute with a correlation identifier that was generated by the sending service to identify the whole transaction between the different services involved in a business operation for an user. For debugging and troubleshooting purposes, this correlation identifier is used in every subsequent log operation relative to the business transaction.
This works for blocking calls to SQS even when further work is done asynchronously using a similar executor to the one provided in the example. However we've found this problem when we tried to move to the asynchronous API.

Your Environment

  • AWS Java SDK version used: 2.15.20
  • JDK version used: openjdk version "11.0.8" 2020-07-14
  • Operating System and version: MacOS Catalina 10.15.17
@getaceres getaceres added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Nov 10, 2020
@debora-ito
Copy link
Member

@getaceres I'm not familiar with MDC context, I'll refer this to the Java SDK team and ask them to take a look.

@debora-ito debora-ito added needs-discussion This issue/PR requires more discussion with community. and removed needs-triage This issue or PR still needs to be triaged. labels Nov 13, 2020
@dagnir
Copy link
Contributor

dagnir commented Nov 25, 2020

Hi @getaceres,

It seems that a Netty executor is used to execute the HTTP calls, even when netty-nio-client is not in the classpath. After returning the result, the MDC context is lost.

Netty is being used here because it's the default async HTTP client used by the SDK when using an async service client.

The SdkAdvancedAsyncClientOption.FUTURE_COMPLETION_EXECUTOR is only used when completing the future returned by the SDK when invoking the async operation:

// Offload the completion of the future returned from this stage onto
// the future completion executor
responseHandlerFuture.whenCompleteAsync((r, t) -> {
if (t == null) {
responseFuture.complete(r);
} else {
responseFuture.completeExceptionally(t);
}
}, futureCompletionExecutor);

In this case it does not mean at the SDK will use the executor to do any work required to execute the request.

We see that a Netty thread is submitting the task onto the SdkAdvancedAsyncClientOption.FUTURE_COMPLETION_EXECUTOR because in the snippet linked above, the completion stage is being invoked from the context of a Netty event loop because the SDK is using the Netty HTTP client under the hood.

@debora-ito debora-ito removed the needs-discussion This issue/PR requires more discussion with community. label Dec 21, 2020
@debora-ito
Copy link
Member

@getaceres marking this to auto close soon, let us know if you have more questions.

@debora-ito debora-ito added the closing-soon This issue will close in 4 days unless further comments are made. label Dec 22, 2020
@github-actions github-actions bot added closed-for-staleness and removed closing-soon This issue will close in 4 days unless further comments are made. labels Dec 27, 2020
@caiyo
Copy link

caiyo commented Jul 29, 2021

@dagnir Is there any way to propagate the MDC context to logs that are logged from within the Netty event loop? I have a Spring boot application where i want all aws sdk logs to have trace ids on them, But i can't figure out anyway to do this

aws-sdk-java-automation added a commit that referenced this issue Sep 9, 2022
…fb1689861

Pull request: release <- staging/bea6ab8b-b330-4cc1-8ca8-94bfb1689861
@electronic-dk
Copy link

electronic-dk commented Apr 14, 2023

@debora-ito is it possible to continue the discussion here or is it better to open a new issue? There's one nearly identical to this, but it's also closed - #1613

Basically the issue is that since aws sdk v2 uses netty nio, all thread locals are lost after any async sdk method is completed, that includes spring security context, mdc, servlet request attributes or anything else the developer might have put into a thread local. Unfortunately, there seems to be no way to tell the sdk to preserve this context or extend the sdk in a way that makes it possible. This makes it really hard to use aws sdk v2 in a spring mvc environment without resorting to migrating to kotlin coroutines or some hacky workarounds.
Specifying SdkAdvancedAsyncClientOption.FUTURE_COMPLETION_EXECUTOR is also not a solution, since this executor is used after the control has returned from the netty thread pool so the context is already lost at this point.

@dasmowenator
Copy link

Is there some reason why this was closed? This is still a real issue

@electronic-dk
Copy link

I went ahead and opened a new issue #5242 as I think this one is not monitored on account of it being closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. closed-for-staleness
Projects
None yet
Development

No branches or pull requests

6 participants