-
Notifications
You must be signed in to change notification settings - Fork 475
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JAMES-3715] upgrade to netty 4.1.72.Final #886
Conversation
Awesome work! I will of course be reviewing this and extensively test this in the coming days (QA + perf tests) Thanks! Ps: don t you mind opening a JIRA ticket for your work? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks overall good to me. Tiny details mostly.
I would be interested to know if we could access netty metrics like the one exposed via MX easily.
We need a little rebase to solve the conflict in the removed JMX code.
On my side I will build, QA and performance test this.
We also need to open a JIRA: https://issues.apache.org/jira/browse/JAMES
@ottoka I was not able to determine if the change still enable client TLS authentication, you might want to have a look at this.
Regards, Benoit
protocols/netty/src/main/java/org/apache/james/protocols/netty/AbstractAsyncServer.java
Outdated
Show resolved
Hide resolved
protocols/netty/src/main/java/org/apache/james/protocols/netty/AbstractAsyncServer.java
Outdated
Show resolved
Hide resolved
...ols/netty/src/main/java/org/apache/james/protocols/netty/AbstractChannelPipelineFactory.java
Show resolved
Hide resolved
...ocols-imap4/src/test/java/org/apache/james/imapserver/netty/ImapRequestFrameDecoderTest.java
Outdated
Show resolved
Hide resolved
...ocols-imap4/src/test/java/org/apache/james/imapserver/netty/ImapRequestFrameDecoderTest.java
Outdated
Show resolved
Hide resolved
/** | ||
* {@link OrderedMemoryAwareThreadPoolExecutor} subclass which expose statistics via JMX | ||
*/ | ||
public class JMXEnabledOrderedMemoryAwareThreadPoolExecutor extends OrderedMemoryAwareThreadPoolExecutor implements JMXEnabledOrderedMemoryAwareThreadPoolExecutorMBean { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for the removal of the JMX protocol metrics too
Maybe we could document how to access similar Netty metrics?
protected ChannelPipelineFactory createPipelineFactory(final ChannelGroup group) { | ||
protected AbstractChannelPipelineFactory createPipelineFactory(final ChannelGroup group) { | ||
|
||
return new AbstractChannelPipelineFactory(0,0, 0, group, createFrameHandlerFactory()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return new AbstractChannelPipelineFactory(0,0, 0, group, createFrameHandlerFactory()) { | |
return new AbstractChannelPipelineFactory(0, 0, 0, group, createFrameHandlerFactory()) { |
And idem we likely should extract these 0 magic numbers to named constants
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a new constructor to AbstractChannelPipelineFactory without timeout and rate limit options (corresponding handlers are disabled)
server/task/task-memory/src/test/java/org/apache/james/task/SerialTaskManagerWorkerTest.java
Outdated
Show resolved
Hide resolved
Quick QA
When I send a multi MB email using Thunderbird, the message is well appended in the INBOX However when I open to read it, it takes forever for Thunderbird to read it. As if the process hanged. log: Fun fact: for one of the test message restarting Thunderbird helped, not for the other one.
Thunderbird hangs when saving the draft. Here are James logs (DEBUG)
This indicates a mis-management of IMAP literal management in the frame decoder. 
Thanks a lot for this contribution, it's very appreciated :)
I've fixed the But I can't figure out why thunderbird's use of APPEND didn't work. I've tried it on command line using openssl, and it worked. Do you have any idea? |
From what I see, the configuration in AbstractConfigurableAsyncServer remains unchanged, and everybody else is using Encryption.getContext().createSSLEngine(). So it should still be fine. |
Cool!
You might be sending a single frame
While TB might send
Then
Also there's a notion of synchronized litterals ie What I see is that we might want to ensure the IMAP TCP layer is tested for both prior to this changeset. IMAPServerTest only covers non-synchronized litterals (ie I can devote some time writing such unit tests but likely not before monday... IMAPServerTest new tests to add(Separated pull request)
|
I'm working on this solution:
In one test case I had a problem, but I can't reproduce it again. So I'm testing more before committing (on Monday). |
@normanmaurer you might be interested in this too |
IMAP APPEND should now work (as far I can see). Another problem raised while executing DistributedPOP3ServerTest#linesStartingWithDotShouldBeWellHandled, maybe related with a changed memory handling with netty4. Fix with my next commit. |
Hello. I succeeded to pull a bit more time for testing... This looks overall in better shape! This time the handling of IMAP APPEND in STARTTLS works just fine! I still noticed a few things looking at the log file:
And a few more non released buffers...
I still get a few of those:
Also I might be missing something here... Connecting manually in IMAP SSL (port 993) works great however when doing so with Thunderbird I get the following error...
Here is what OpenSSL have to say about it:
Cheers, Benoit |
For the last issue, since I have been looking at these far too often lately: |
I agree with @ottoka , in my experience as well Thunderbird can be very insisting on the certificate. I don't think it's related to this PR :) |
Already tried, unsuccessfully. What is weird to me is that from the stacktrace, it seems to be James aborting the SSL connection... |
BTW if you rebase on master and solve the ongoing conflict, we shall be able to run the build on the ASF infra @ff-wl |
Fixed discarded pipeline error
…rrectly: extra CRLF in test DistributedPOP3ServerTest#linesStartingWithDotShouldBeWellHandled
…ppendsShouldBeRejected
From my point of view there are no open issues. The failed test org.apache.james.blob.cassandra.CassandraBlobStoreClOneTest.readShouldPublishMissRatioClOneMetric is not related to this PR |
It is not and I started a new build. Regards |
Hello, I did took the time to retest your work this evening, built the latest status of your pull request, started the resulting James and played with Thunderbird. I noticed that big messages (> MB) tend to load forever in Thunderbird. Opening the error console I got this: Wich to be honnest is not very helpful to me. Debugging, I would really suspect IMAP partial FETCH is involed in this, which Thunderbird uses to download such messages... Here is a simple exchange to reproduce the issue...
I tested the same commands on master and it worked great. I think we should try to add unit tests for partial fetch... I did not have time today to further identify, nor test things tonight, but will definitly continue having a look at this tomorrow morning. Regards, Benoit |
Hello, FYI we discussed quickly on the mailing list and we decided to release a 3.7.0 version prior to merging the Netty 4 upgrade. The rationals is to ensure stability and make sure it gets deployed and used before being included in a release. CF https://www.mail-archive.com/[email protected]/msg71619.html . With partial fetchs and compression we clearly showed we did not master fully the topic and that the TCP level test suite for IMAP (at least) misses the coverage to identify such defects... I will start working on the (long overdue) 3.7.0 today, we should then be able to swap to Also I am very satisfied and confident about the status of this work. Personally, I would like us to take the time to do a regular performance test campaign on this change (using Gatling) to further validates performance benefits. This should also take 1-2 days. Cheers, Benoit |
protocols/netty/src/main/java/org/apache/james/protocols/netty/AbstractAsyncServer.java
Outdated
Show resolved
Hide resolved
Before this parameter was ignored, we can use it to size Netty server child executor instead.
Netty 4 no longer allows setting internal threads and instead relies on event loop groups. Removing unused methods...
…ExecutorThreads Computation are now conducted on the child threads of the Netty4 Server bootstraps. On Netty 3 those computations were conducted on a thread pool yet separated from the Netty 'workers' which could be seen as unnecessary context switches - Netty worker threads were merely decoding and encoding the traffic.
This is dead code since the Netty 4 migration. Remove. We now have easier to exploit metrics (JMX, HTTP endpoint, logs, ES reportings) that expose more interesting applicative level metrics.
Hello, Here is a status of our ongoing testing work. We encounter several issues testing the Netty 4 migration on our pre-production environment. (@Arsnael is involved on it too) We did not yet reach the point where we could actually play our performance tests. Issue 1: Partially written SELECT QRESYNC responsesWhen using evolution MUA (evolution in debug mode logs the full IMAP exchange which is convenient) and QRESYNC is enabled, for one mailbox of the account an error is encountered. Evolution complains the response was truncated... Here is the exchange:
You can see there is no 'OK' response. I try to reproduce locally but working with QRESYNC is horrible. We still need to conduct regression tests to know if this happens also on Netty 3... ISSUE 2: Thunderbird & IDLE not valid in this stateTo reproduce, we open Thunderbird and switch mailboxes quickly. After a few quick mailbox switches, thunderbird complains the mailbox can't be opened and says
Environment: distributed James in a cloud setup (k8s hosted at OVH). We don't reproduce locally. We did not yet manage to get a traffic capture. I suspect concurrency issues: one IMAP request could be processed before the previous one thus don't benefit from state changes of previous commands? To be honnset this is still unclear to me... I did spend a few hour reproducing by writing unit tests sending multiple commands at once but failed reproducing... I have a blind bet about this one: this changeset added the @Shareable annotation on a couple of handlers including the Imap handler, allowing concurrency on this handler might lead to incorrect handling in the context of a connected, stateful protocol. We need still to try such a change out... ISSUE 3: Gatling listOur performance tests lists mailboxes, append a few message, selects a mailbox, then fetches a few messages; however the listing always fails - gatling cannot find an INBOX. We reliably reproduce running gatling & james locally.
Using telnet, it seems we are getting "out of order" responses on the wire:
Obviously we should be having the OK comming after the untagged responses... Would that mean we need to "await" or "chain" channel futures when we write responses to ensure a correct order? Issue 4: Blocking on netty IO event loop (Suspission)I suspect we currently run everything straight on the io event loop (woken up when channels receive reads/writes). I suspect we might have no choices but to run handlers performing DB query operations (but likely also synchronous channel read/writes) on a separate thread pool as it was done before. Local benchmarks conducted so far would fail at detecting the impact of "blocking on the event loop" as the concurrency level is low (less than 8 concurrent connections). I wonder how the current set up will behave with high concurrency levels (hundreds of concurrent connections). Actions
Apparently, this changeset might keep us busy for still quite some time... Regards, Benoit |
Follow up regarding #886 (comment) Issue 1We reproduced the bug on master thus this is not a regression. This bug will thus be handled separately cf https://issues.apache.org/jira/projects/JAMES/issues/JAMES-3722 Issue 2We confirm the ordering of the actual writes on the channel is not necessarily the same as those specified in the code (channel.writeAndFlush(xxx)). We solved the issue by awaiting that the write is done in I'd like also to try something relying on future chaining, relying on Issue 3No progress yet. Issue 4No progress yet. Issue 5More buffer leaks. @Arsnael spotted this:
We need to be sure we run with What's weird to me is that it seems allocated by SSLHandler thus logically the SSLHandler is responsible of releasing the buffer, not us. Also from what I could see |
The SSL-Handler allocates the Buffer (after decryption) and puts it into the handler queue. Then the next handler is responsible to release the buffer or "fire" it to the next handler. The last handler must release it. With the advanced leak detection level you will see the handling of the responsibility. |
Answer picked up from this issue. (Thanks Scottmitch@github). Netty can provide "ordering" in the following situations:
It is subtle but important thing to remember as it can cause painful bugs. If some channel.write() calls are coming from application threads and some are coming from EventLoop then the writes can seem out of order from application's perspective. See more details on the linked issue. In the context of list, untagged response are delegated to reactor, which in the case of the distributed James server is scheduled on an Elastic thread. In
Note that processResult sends untagged responses. So the conclusions are:
Something like (
I will propose those changes on a separate pull request, that @Arsnael will be able to try, and, once validated, that you would be able to cherry-pick. Good weekend, Cheers |
sorry for the absence (I had to do other tasks). |
Hello @ff-wl ! No worry, we'd been actively working on our side too! That is days I want to make a status statement regarding our progress! #903 is not that critical I'd rather not put it on the critical path. Idem for #898 . We will rebase them later. Remaining blockers
Please note that 3.7.0 being released when addressing the above points we can merge ;-) We are making slow progress, but methodically, adding critical pieces of testing for the IMAP stack, significantly improving its reliability. Best regards, Benoit TELLIER |
I just fixed the IMAP IDLE + COMPRESS issue. An issue in an attempt to prevent an array copy, offset was ignored. Performance wise Netty4 stack seems slightly slower than Netty3 stack but this do not seems like a blocker to me, we can keep on investigating in the coming weeks, and less flushes + native EPOLL + SSL would anyway outweight that. BeforeAfterOn the performance side I also did proceed to flame graph analysis. For the record: Help welcome on performance investigations. Regarding the merge, I will rebundle the two PRs together. You don't need to take an action. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be merged at the same time than #908
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing to add. Thank you for your great effort!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor review comments! Happy to send in pull request to fix some of these :-)
protocols/netty/src/main/java/org/apache/james/protocols/netty/AbstractAsyncServer.java
Show resolved
Hide resolved
protocols/netty/src/main/java/org/apache/james/protocols/netty/AbstractAsyncServer.java
Show resolved
Hide resolved
protocols/netty/src/main/java/org/apache/james/protocols/netty/AbstractAsyncServer.java
Show resolved
Hide resolved
} | ||
|
||
if (workerGroup != null) { | ||
workerGroup.shutdownGracefully(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bad code - should either wait for promises to finish (i.e. blocking code, not recommended) or return promise.
Example blocking code; (not recommended)
EventLoopGroup workerGroup = this.workerGroup;
if (workerGroup != null) {
this.workerGroup = null;
workerGroup.shutdownGracefully().syncUninterruptibly();
}
EventLoopGroup bossGroup = this.bossGroup;
if (bossGroup != null) {
this.bossGroup = null;
bossGroup.shutdownGracefully().syncUninterruptibly();
}
Alternatively and better - unbind()
should return a promise with all the combined promises so that the caller can sync()
if required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CF #898 to fix this. Won't fix here in my opinion.
protocols/netty/src/main/java/org/apache/james/protocols/netty/AbstractAsyncServer.java
Show resolved
Hide resolved
|
||
/** | ||
* {@link IdleStateHandler} implementation which disconnect the {@link Channel} after a configured | ||
* idle timeout. Be aware that this handle is not thread safe so it can't be shared across pipelines | ||
*/ | ||
@ChannelHandler.Sharable | ||
public class TimeoutHandler extends IdleStateHandler { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Redundant implementation - just use the Netty provided implementation - io.netty.handler.timeout.ReadTimeoutHandler
- and delete this class!
protocols/netty/src/main/java/org/apache/james/protocols/netty/ChannelGroupHandler.java
Show resolved
Hide resolved
...ty/src/main/java/org/apache/james/protocols/netty/AllButStartTlsLineBasedChannelHandler.java
Show resolved
Hide resolved
protocols/netty/src/main/java/org/apache/james/protocols/netty/BasicChannelUpstreamHandler.java
Show resolved
Hide resolved
...ols/netty/src/main/java/org/apache/james/protocols/netty/AbstractChannelPipelineFactory.java
Show resolved
Hide resolved
Hello @glennosss ! Thanks for the precious feedback. I added those suggestions to #924 |
I've migrated Apache James to use Netty 4.1.72.Final
The code compiles and almost all unit tests are running (some failed due to my test environment)
Since this is a big code change, reviewers, testers and co-workers are welcome.
I'm still working on this, maybe some minor commits will come.