
[JAMES-3715] upgrade to netty 4.1.72.Final #886

Closed
wants to merge 14 commits into from

Conversation

ff-wl
Contributor

@ff-wl ff-wl commented Feb 16, 2022

I've migrated Apache James to use Netty 4.1.72.Final

The code compiles and almost all unit tests pass (some failed due to my test environment).

Since this is a big code change, reviewers, testers and co-workers are welcome.

I'm still working on this, so some minor commits may still come.

@ff-wl ff-wl marked this pull request as draft February 16, 2022 15:59
@chibenwa
Contributor

Awesome work!

I will of course review this and test it extensively in the coming days (QA + perf tests)

Thanks!

PS: would you mind opening a JIRA ticket for your work?

Contributor

@chibenwa chibenwa left a comment

Looks overall good to me. Tiny details mostly.

I would be interested to know if we could easily access Netty metrics like the ones exposed via JMX.

We need a little rebase to solve the conflict in the removed JMX code.

On my side I will build, QA and performance test this.

We also need to open a JIRA: https://issues.apache.org/jira/browse/JAMES

@ottoka I was not able to determine if the change still enables client TLS authentication; you might want to have a look at this.

Regards, Benoit

/**
* {@link OrderedMemoryAwareThreadPoolExecutor} subclass which expose statistics via JMX
*/
public class JMXEnabledOrderedMemoryAwareThreadPoolExecutor extends OrderedMemoryAwareThreadPoolExecutor implements JMXEnabledOrderedMemoryAwareThreadPoolExecutorMBean {
Contributor

👍 for the removal of the JMX protocol metrics too

Maybe we could document how to access similar Netty metrics?

protected ChannelPipelineFactory createPipelineFactory(final ChannelGroup group) {
protected AbstractChannelPipelineFactory createPipelineFactory(final ChannelGroup group) {

return new AbstractChannelPipelineFactory(0,0, 0, group, createFrameHandlerFactory()) {
Contributor

Suggested change
return new AbstractChannelPipelineFactory(0,0, 0, group, createFrameHandlerFactory()) {
return new AbstractChannelPipelineFactory(0, 0, 0, group, createFrameHandlerFactory()) {

Likewise, we should probably extract these 0 magic numbers into named constants.

Contributor Author

Added a new constructor to AbstractChannelPipelineFactory without timeout and rate-limit options (the corresponding handlers are disabled).

@chibenwa
Contributor

chibenwa commented Feb 17, 2022

Quick QA

  • openssl s_client -connect 172.17.0.2:993 🍏
  • Thunderbird SMTP SSL sending EMail 🍏
  • Thunderbird IMAP SSL email access - small email - (listing emails and opening it) 🍏
  • Thunderbird IMAP SSL email access - large mail - (listing emails and opening it) 🔴

When I send a multi-MB email using Thunderbird, the message is appended to the INBOX correctly.

However, when I open it, Thunderbird takes forever to read it, as if the process hung.

log:

reading_big_mail_in_tb.log

Fun fact: for one of the test messages, restarting Thunderbird helped, but not for the other one.

  • Thunderbird IMAP SSL save the draft (IMAP APPEND) 🔴

Thunderbird hangs when saving the draft.

Here are James logs (DEBUG)

08:38:32.172 [INFO ] o.a.j.i.n.ImapChannelUpstreamHandler - Connection established from 172.17.0.1
08:38:32.206 [DEBUG] i.n.h.s.SslHandler - [id: 0xc7662865, L:/172.17.0.2:993 - R:172.17.0.1/172.17.0.1:39360] HANDSHAKEN: protocol:TLSv1.3 cipher suite:TLS_AES_128_GCM_SHA256
08:38:32.210 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded inbound message org.apache.james.imap.message.request.CapabilityRequest@434fb3b2 that reached at the tail of the pipeline. Please check your pipeline configuration.
08:38:32.210 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded message pipeline : [sslHandler, groupHandler, idleHandler, timeoutHandler, connectionLimitHandler, connectionPerIpLimitHandler, framer, connectionCountHandler, chunkWriteHandler, requestDecoder, coreHandler, DefaultChannelPipeline$TailContext#0]. Channel : [id: 0xc7662865, L:/172.17.0.2:993 - R:172.17.0.1/172.17.0.1:39360].
08:38:32.212 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded inbound message AuthenticateRequest{authType=PLAIN} that reached at the tail of the pipeline. Please check your pipeline configuration.
08:38:32.213 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded message pipeline : [sslHandler, groupHandler, idleHandler, timeoutHandler, connectionLimitHandler, connectionPerIpLimitHandler, framer, connectionCountHandler, chunkWriteHandler, lineHandler0, requestDecoder, coreHandler, DefaultChannelPipeline$TailContext#0]. Channel : [id: 0xc7662865, L:/172.17.0.2:993 - R:172.17.0.1/172.17.0.1:39360].
08:38:32.222 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded inbound message org.apache.james.imap.message.request.CapabilityRequest@656f939a that reached at the tail of the pipeline. Please check your pipeline configuration.
08:38:32.222 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded message pipeline : [sslHandler, groupHandler, idleHandler, timeoutHandler, connectionLimitHandler, connectionPerIpLimitHandler, framer, connectionCountHandler, chunkWriteHandler, requestDecoder, coreHandler, DefaultChannelPipeline$TailContext#0]. Channel : [id: 0xc7662865, L:/172.17.0.2:993 - R:172.17.0.1/172.17.0.1:39360].
08:38:32.224 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded inbound message EnableRequest{capabilities=[Capability{value=UTF8=ACCEPT}]} that reached at the tail of the pipeline. Please check your pipeline configuration.
08:38:32.224 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded message pipeline : [sslHandler, groupHandler, idleHandler, timeoutHandler, connectionLimitHandler, connectionPerIpLimitHandler, framer, connectionCountHandler, chunkWriteHandler, requestDecoder, coreHandler, DefaultChannelPipeline$TailContext#0]. Channel : [id: 0xc7662865, L:/172.17.0.2:993 - R:172.17.0.1/172.17.0.1:39360].

...
Few seconds pass
...

08:38:52.240 [INFO ] o.a.j.i.n.ImapChannelUpstreamHandler - Connection closed for 172.17.0.1
08:38:52.247 [INFO ] o.a.j.i.n.ImapChannelUpstreamHandler - Connection established from 172.17.0.1
08:38:52.300 [DEBUG] i.n.h.s.SslHandler - [id: 0xef5e4890, L:/172.17.0.2:993 - R:172.17.0.1/172.17.0.1:39362] HANDSHAKEN: protocol:TLSv1.3 cipher suite:TLS_AES_128_GCM_SHA256
08:38:52.304 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded inbound message org.apache.james.imap.message.request.CapabilityRequest@6139c2fb that reached at the tail of the pipeline. Please check your pipeline configuration.
08:38:52.304 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded message pipeline : [sslHandler, groupHandler, idleHandler, timeoutHandler, connectionLimitHandler, connectionPerIpLimitHandler, framer, connectionCountHandler, chunkWriteHandler, requestDecoder, coreHandler, DefaultChannelPipeline$TailContext#0]. Channel : [id: 0xef5e4890, L:/172.17.0.2:993 - R:172.17.0.1/172.17.0.1:39362].
08:38:52.305 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded inbound message AuthenticateRequest{authType=PLAIN} that reached at the tail of the pipeline. Please check your pipeline configuration.
08:38:52.306 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded message pipeline : [sslHandler, groupHandler, idleHandler, timeoutHandler, connectionLimitHandler, connectionPerIpLimitHandler, framer, connectionCountHandler, chunkWriteHandler, lineHandler0, requestDecoder, coreHandler, DefaultChannelPipeline$TailContext#0]. Channel : [id: 0xef5e4890, L:/172.17.0.2:993 - R:172.17.0.1/172.17.0.1:39362].
08:38:52.314 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded inbound message org.apache.james.imap.message.request.CapabilityRequest@243b50b8 that reached at the tail of the pipeline. Please check your pipeline configuration.
08:38:52.314 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded message pipeline : [sslHandler, groupHandler, idleHandler, timeoutHandler, connectionLimitHandler, connectionPerIpLimitHandler, framer, connectionCountHandler, chunkWriteHandler, requestDecoder, coreHandler, DefaultChannelPipeline$TailContext#0]. Channel : [id: 0xef5e4890, L:/172.17.0.2:993 - R:172.17.0.1/172.17.0.1:39362].
08:38:52.316 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded inbound message EnableRequest{capabilities=[Capability{value=UTF8=ACCEPT}]} that reached at the tail of the pipeline. Please check your pipeline configuration.
08:38:52.316 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded message pipeline : [sslHandler, groupHandler, idleHandler, timeoutHandler, connectionLimitHandler, connectionPerIpLimitHandler, framer, connectionCountHandler, chunkWriteHandler, requestDecoder, coreHandler, DefaultChannelPipeline$TailContext#0]. Channel : [id: 0xef5e4890, L:/172.17.0.2:993 - R:172.17.0.1/172.17.0.1:39362].
08:38:57.972 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded inbound message org.apache.james.imap.message.request.NoopRequest@669668ac that reached at the tail of the pipeline. Please check your pipeline configuration.
08:38:57.973 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded message pipeline : [sslHandler, groupHandler, idleHandler, timeoutHandler, connectionLimitHandler, connectionPerIpLimitHandler, framer, connectionCountHandler, chunkWriteHandler, requestDecoder, coreHandler, DefaultChannelPipeline$TailContext#0]. Channel : [id: 0x95a2e755, L:/172.17.0.2:993 - R:172.17.0.1/172.17.0.1:39358].
08:38:57.982 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded inbound message GetQuotaRootRequest{mailboxName=INBOX} that reached at the tail of the pipeline. Please check your pipeline configuration.
08:38:57.982 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded message pipeline : [sslHandler, groupHandler, idleHandler, timeoutHandler, connectionLimitHandler, connectionPerIpLimitHandler, framer, connectionCountHandler, chunkWriteHandler, requestDecoder, coreHandler, DefaultChannelPipeline$TailContext#0]. Channel : [id: 0x95a2e755, L:/172.17.0.2:993 - R:172.17.0.1/172.17.0.1:39358].
08:38:57.993 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded inbound message org.apache.james.imap.message.request.IdleRequest@2ce9352a that reached at the tail of the pipeline. Please check your pipeline configuration.
08:38:57.993 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded message pipeline : [sslHandler, groupHandler, idleHandler, timeoutHandler, connectionLimitHandler, connectionPerIpLimitHandler, framer, connectionCountHandler, chunkWriteHandler, lineHandler0, requestDecoder, coreHandler, DefaultChannelPipeline$TailContext#0]. Channel : [id: 0x95a2e755, L:/172.17.0.2:993 - R:172.17.0.1/172.17.0.1:39358].

This indicates mismanagement of IMAP literals in the frame decoder.

![Screenshot from 2022-02-17 15-34-18](https://user-images.githubusercontent.com/6928740/154436806-40aa23b4-a10d-4e93-be14-bacefd68c2a3.png)

  • Thunderbird SMTP STARTTLS sending EMail
  • Thunderbird IMAP STARTTLS email access (listing emails and opening it)
  • Thunderbird IMAP STARTTLS save the draft (IMAP APPEND)

I also had this worrying resource management related log

08:45:10.266 [INFO ] o.a.j.p.n.BasicChannelUpstreamHandler - Connection established from 172.17.0.1
08:45:10.294 [DEBUG] i.n.h.s.SslHandler - [id: 0xb73464bf, L:/172.17.0.2:465 - R:172.17.0.1/172.17.0.1:57380] HANDSHAKEN: protocol:TLSv1.3 cipher suite:TLS_AES_128_GCM_SHA256
08:45:10.297 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded inbound message PooledSlicedByteBuf(ridx: 0, widx: 19, cap: 19/19, unwrapped: PooledUnsafeDirectByteBuf(ridx: 19, widx: 19, cap: 41)) that reached at the tail of the pipeline. Please check your pipeline configuration.
08:45:10.298 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded message pipeline : [sslHandler, groupHandler,  connectionLimit, connectionPerIpLimit, framer, chunkHandler, timeoutHandler, coreHandler, DefaultChannelPipeline$TailContext#0]. Channel : [id: 0xb73464bf, L:/172.17.0.2:465 - R:172.17.0.1/172.17.0.1:57380].
08:45:10.332 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded inbound message PooledSlicedByteBuf(ridx: 0, widx: 37, cap: 37/37, unwrapped: PooledUnsafeDirectByteBuf(ridx: 37, widx: 37, cap: 59)) that reached at the tail of the pipeline. Please check your pipeline configuration.
08:45:10.332 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded message pipeline : [sslHandler, groupHandler,  connectionLimit, connectionPerIpLimit, framer, chunkHandler, timeoutHandler, coreHandler, DefaultChannelPipeline$TailContext#0]. Channel : [id: 0xb73464bf, L:/172.17.0.2:465 - R:172.17.0.1/172.17.0.1:57380].
08:45:10.361 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded inbound message PooledSlicedByteBuf(ridx: 0, widx: 37, cap: 37/37, unwrapped: PooledUnsafeDirectByteBuf(ridx: 37, widx: 37, cap: 59)) that reached at the tail of the pipeline. Please check your pipeline configuration.
08:45:10.361 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded message pipeline : [sslHandler, groupHandler,  connectionLimit, connectionPerIpLimit, framer, chunkHandler, timeoutHandler, coreHandler, DefaultChannelPipeline$TailContext#0]. Channel : [id: 0xb73464bf, L:/172.17.0.2:465 - R:172.17.0.1/172.17.0.1:57380].
08:45:10.364 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded inbound message PooledSlicedByteBuf(ridx: 0, widx: 21, cap: 21/21, unwrapped: PooledUnsafeDirectByteBuf(ridx: 21, widx: 21, cap: 43)) that reached at the tail of the pipeline. Please check your pipeline configuration.
08:45:10.364 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded message pipeline : [sslHandler, groupHandler,  connectionLimit, connectionPerIpLimit, framer, chunkHandler, timeoutHandler, coreHandler, DefaultChannelPipeline$TailContext#0]. Channel : [id: 0xb73464bf, L:/172.17.0.2:465 - R:172.17.0.1/172.17.0.1:57380].
08:45:10.366 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded inbound message PooledSlicedByteBuf(ridx: 0, widx: 6, cap: 6/6, unwrapped: PooledUnsafeDirectByteBuf(ridx: 6, widx: 6, cap: 28)) that reached at the tail of the pipeline. Please check your pipeline configuration.
08:45:10.366 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded message pipeline : [sslHandler, groupHandler,  connectionLimit, connectionPerIpLimit, framer, chunkHandler, timeoutHandler, lineHandler1, coreHandler, DefaultChannelPipeline$TailContext#0]. Channel : [id: 0xb73464bf, L:/172.17.0.2:465 - R:172.17.0.1/172.17.0.1:57380].
08:45:10.618 [ERROR] i.n.u.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
Created at:
	io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:402)
	io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
	io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
	io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:116)
	io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:538)
	io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
	io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1371)
	io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1234)
	io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1283)
	io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
	io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
	io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
	io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
	io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
	io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
	io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
	io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
	io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	java.base/java.lang.Thread.run(Unknown Source)

🟠 Netty debug logs complain about discarded messages and ask us to check our pipeline configuration. Does this indicate we are missing some kind of terminal element in the Netty pipeline?

  • IMAP IDLE seems to still work 🍏
  • Check the IMAP timeout configs are well applied....
  • Ensure IMAP compression can be enabled
  • Ensure STARTTLS injections are not possible for IMAP, SMTP and POP3 cf https://nostarttls.secvuln.info/
  • Ensure POP3 still works (Thunderbird)
  • Ensure ManageSieve still works (Thunderbird plugin)

@ff-wl ff-wl changed the title upgrade to netty 4.1.72.Final [JAMES-3715] upgrade to netty 4.1.72.Final Feb 17, 2022
Contributor

@Arsnael Arsnael left a comment

Read it. It looks overall good, nothing to add. Just need to test it more and fix the potential issues :)

Thanks a lot for this contribution, it's very appreciated :)

@ff-wl
Contributor Author

ff-wl commented Feb 17, 2022

I've fixed the Discarded inbound message issue.

But I can't figure out why Thunderbird's use of APPEND didn't work. I tried it on the command line using openssl, and it worked. Do you have any idea?

@ottoka
Contributor

ottoka commented Feb 17, 2022

@ottoka I was not able to determine if the change still enables client TLS authentication; you might want to have a look at this.

From what I see, the configuration in AbstractConfigurableAsyncServer remains unchanged, and everybody else is using Encryption.getContext().createSSLEngine(). So it should still be fine.

@chibenwa
Contributor

chibenwa commented Feb 18, 2022

I've fixed the Discarded inbound message issue.

Cool!

But I can't figure out why Thunderbird's use of APPEND didn't work. I tried it on the command line using openssl, and it worked. Do you have any idea?

You might be sending a single frame

a0 APPEND INBOX {312}
xxxxx

While TB might send

a0 APPEND INBOX {312}

Then

xxxxx

Also, there's a notion of synchronized literals, i.e. {312} vs {312+}, which are not handled the same way by the IMAP / Netty stack. You might be using one, and TB the other.

What I see is that we might want to ensure the IMAP TCP layer is tested for both prior to this changeset. IMAPServerTest only covers non-synchronized literals (i.e. {312}). We thus need to write unit tests for small/big synchronized literals.

I can devote some time to writing such unit tests, but likely not before Monday...

IMAPServerTest new tests to add

(Separate pull request; a rough sketch follows the list below)

  • Small synchronized literal, aka {1312+}, when fitting in the memory limit
  • Big synchronized literal, aka {13123121+}, when fitting in the memory limit
  • Big synchronized literal, aka {13123121+}, when not fitting in the memory limit
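
A rough, hypothetical sketch of such a test (host, port, credentials and the assertion style are assumptions, not the actual IMAPServerTest API): send the APPEND command line and the literal payload as two separate writes, the way Thunderbird does, and then check the tagged response.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class AppendLiteralProbe {
        public static void main(String[] args) throws Exception {
            try (Socket socket = new Socket("127.0.0.1", 143)) {
                OutputStream out = socket.getOutputStream();
                BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
                in.readLine(); // server greeting
                out.write("a0 LOGIN user password\r\n".getBytes(StandardCharsets.US_ASCII));
                out.flush();
                in.readLine(); // a0 OK LOGIN completed.

                String body = "Subject: literal test\r\n\r\nsmall body\r\n";
                // First frame: only the command line announcing the literal.
                out.write(("a1 APPEND INBOX {" + body.length() + "+}\r\n")
                    .getBytes(StandardCharsets.US_ASCII));
                out.flush();
                // Second frame: the literal payload itself, then CRLF to end the command.
                out.write((body + "\r\n").getBytes(StandardCharsets.US_ASCII));
                out.flush();

                String line;
                while ((line = in.readLine()) != null && !line.startsWith("a1 ")) {
                    // skip untagged responses
                }
                System.out.println(line); // expected: "a1 OK APPEND completed."
            }
        }
    }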

@ff-wl
Contributor Author

ff-wl commented Feb 18, 2022

But I can't figure out why Thunderbird's use of APPEND didn't work. I tried it on the command line using openssl, and it worked. Do you have any idea?

SwitchableLineBasedFrameDecoder.disableFraming() is not working correctly.

I'm working on this solution:

    public synchronized void disableFraming(ChannelHandlerContext ctx) {
        this.framingEnabled = false;

        ByteBuf spareBytes = internalBuffer().retainedDuplicate();
        internalBuffer().clear();

        ctx.fireChannelRead(spareBytes);
    }

In one test case I had a problem, but I can't reproduce it again. So I'm testing more before committing (on Monday).

@chibenwa
Contributor

@normanmaurer you might be interested in this too

@ff-wl
Contributor Author

ff-wl commented Feb 21, 2022

IMAP APPEND should now work (as far as I can see).

Another problem arose while executing DistributedPOP3ServerTest#linesStartingWithDotShouldBeWellHandled, maybe related to the changed memory handling in Netty 4.

It will be fixed in my next commit.

@chibenwa
Contributor

Hello.

I managed to pull a bit more time for testing...

This looks overall in better shape! This time the handling of IMAP APPEND in STARTTLS works just fine!

I still noticed a few things looking at the log file:

19:41:00.724 [ERROR] o.a.j.p.n.BasicChannelUpstreamHandler - Unable to process request
java.lang.NullPointerException: null
	at org.apache.james.protocols.netty.BasicChannelUpstreamHandler.channelInactive(BasicChannelUpstreamHandler.java:138)
	at org.apache.james.smtpserver.netty.SMTPChannelUpstreamHandler.channelInactive(SMTPChannelUpstreamHandler.java:65)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:277)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.handler.stream.ChunkedWriteHandler.channelInactive(ChunkedWriteHandler.java:137)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:389)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:354)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at org.apache.james.protocols.netty.ConnectionPerIpLimitUpstreamHandler.channelInactive(ConnectionPerIpLimitUpstreamHandler.java:87)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
	at org.apache.james.protocols.netty.ConnectionLimitUpstreamHandler.channelInactive(ConnectionLimitUpstreamHandler.java:64)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:389)
	at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:354)
	at io.netty.handler.ssl.SslHandler.channelInactive(SslHandler.java:1073)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
	at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:831)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)

And a few more non-released buffers...

19:42:48.705 [ERROR] i.n.u.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
Created at:
	io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:402)
	io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
	io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
	io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:116)
	io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:538)
	io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
	io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1371)
	io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1234)
	io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1283)
	io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
	io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
	io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
	io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
	io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
	io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
	io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
	io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
	io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	java.base/java.lang.Thread.run(Thread.java:829)
19:42:48.706 [ERROR] i.n.u.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
Created at:
	io.netty.buffer.SimpleLeakAwareByteBuf.unwrappedDerived(SimpleLeakAwareByteBuf.java:143)
	io.netty.buffer.SimpleLeakAwareByteBuf.readRetainedSlice(SimpleLeakAwareByteBuf.java:67)
	io.netty.handler.codec.LineBasedFrameDecoder.decode(LineBasedFrameDecoder.java:116)
	org.apache.james.protocols.netty.AllButStartTlsLineBasedChannelHandler.decode(AllButStartTlsLineBasedChannelHandler.java:61)
	io.netty.handler.codec.LineBasedFrameDecoder.decode(LineBasedFrameDecoder.java:84)
	io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
	io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
	io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1371)
	io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1234)
	io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1283)
	io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
	io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
	io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
	io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
	io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
	io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
	io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
	io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
	io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	java.base/java.lang.Thread.run(Thread.java:829)

I still get a few of those:

20:00:49.966 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded inbound message PooledSlicedByteBuf(ridx: 0, widx: 6, cap: 6/6, unwrapped: PooledUnsafeDirectByteBuf(ridx: 6, widx: 6, cap: 28)) that reached at the tail of the pipeline. Please check your pipeline configuration.
20:00:49.966 [DEBUG] i.n.c.DefaultChannelPipeline - Discarded message pipeline : [sslHandler, groupHandler,  connectionLimit, connectionPerIpLimit, framer, chunkHandler, timeoutHandler, coreHandler, DefaultChannelPipeline$TailContext#0]. Channel : [id: 0xe1a7e0d8, L:/127.0.0.1:465 ! R:localhost/127.0.0.1:57294].

Also, I might be missing something here... Connecting manually in IMAP SSL (port 993) works great; however, when doing so with Thunderbird I get the following error...

19:53:21.589 [WARN ] o.a.j.i.n.ImapChannelUpstreamHandler - Error while processing imap request
javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
	at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:131)
	at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:117)
	at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:336)
	at java.base/sun.security.ssl.Alert$AlertConsumer.consume(Alert.java:293)
	at java.base/sun.security.ssl.TransportContext.dispatch(TransportContext.java:185)
	at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:172)
	at java.base/sun.security.ssl.SSLEngineImpl.decode(SSLEngineImpl.java:681)
	at java.base/sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:636)
	at java.base/sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:454)
	at java.base/sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:433)
	at java.base/javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:637)
	at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:295)
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1341)
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1234)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1283)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
	... 17 common frames omitted
Wrapped by: io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:477)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)

Here is what OpenSSL has to say about it:

$ openssl s_client -connect 127.0.0.1:993
CONNECTED(00000003)
---
Certificate chain
 0 s:C = FR, ST = Unknown, L = Puteaux, O = James, OU = Linagora, CN = Benoit Tellier
   i:C = FR, ST = Unknown, L = Puteaux, O = James, OU = Linagora, CN = Benoit Tellier
---
Server certificate
-----BEGIN CERTIFICATE-----
...
-----END CERTIFICATE-----
subject=C = FR, ST = Unknown, L = Puteaux, O = James, OU = Linagora, CN = Benoit Tellier

issuer=C = FR, ST = Unknown, L = Puteaux, O = James, OU = Linagora, CN = Benoit Tellier

---
No client certificate CA names sent
Peer signing digest: SHA256
Peer signature type: RSA-PSS
Server Temp Key: X25519, 253 bits
---
SSL handshake has read 1425 bytes and written 363 bytes
Verification error: certificate has expired
---
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
Server public key is 2048 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
Early data was not sent
Verify return code: 10 (certificate has expired)
---
---
Post-Handshake New Session Ticket arrived:
SSL-Session:
    Protocol  : TLSv1.3
    Cipher    : TLS_AES_256_GCM_SHA384
    Session-ID: F8BBBD9BEA21D5203EF578E281A7635594CADCDB2A1A1247EDE3354980EDC5E7
    Session-ID-ctx: 
    Resumption PSK: D25D69A576858FF0BA13618CAF628AA43BD98066F1CC9E53C98229BF14F18FA3045806DD249E1D079F18F86BA252A00E
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    TLS session ticket lifetime hint: 86400 (seconds)
    TLS session ticket:
    0000 - 52 d5 b7 3e b6 1f 04 36-e4 58 34 db ef d6 ff 8f   R..>...6.X4.....
    0010 - a4 28 bb 95 8f 11 f5 6e-f9 81 b7 fc e4 c5 f3 a2   .(.....n........

    Start Time: 1645447845
    Timeout   : 7200 (sec)
    Verify return code: 10 (certificate has expired)
    Extended master secret: no
    Max Early Data: 0
---
read R BLOCK
* OK JAMES IMAP4rev1 Server interview1-HP-ProBook-440-G6 is ready.
a0 LOGIN [email protected] 123456
a0 OK LOGIN completed.

Cheers,

Benoit

@ottoka
Contributor

ottoka commented Feb 21, 2022

For the last issue, since I have been looking at these far too often lately:
Verification error: certificate has expired
Verify return code: 10 (certificate has expired)
These are bad if the client cares about actually verifying the certificate, as I assume Thunderbird does. I suggest you create/install a fresh certificate and try again.

@Arsnael
Contributor

Arsnael commented Feb 22, 2022

I agree with @ottoka; in my experience as well, Thunderbird can be very insistent about the certificate. I don't think it's related to this PR :)

@chibenwa
Contributor

I suggest you create/install a fresh certificate and try again.

Already tried, unsuccessfully.

What is weird to me is that from the stacktrace, it seems to be James aborting the SSL connection...

@chibenwa
Contributor

BTW if you rebase on master and solve the ongoing conflict, we shall be able to run the build on the ASF infra @ff-wl

@ff-wl
Contributor Author

ff-wl commented Feb 23, 2022

From my point of view there are no open issues. The failed test org.apache.james.blob.cassandra.CassandraBlobStoreClOneTest.readShouldPublishMissRatioClOneMetric is not related to this PR

@chibenwa
Contributor

It is not and I started a new build.

Regards

@ff-wl ff-wl marked this pull request as ready for review February 23, 2022 14:17
@chibenwa
Contributor

chibenwa commented Feb 23, 2022

Hello,

I took the time to retest your work this evening: I built the latest state of your pull request, started the resulting James, and played with Thunderbird.

I noticed that big messages (> MB) tend to load forever in Thunderbird. Opening the error console, I got this:

Screenshot from 2022-02-23 20-24-20

Which, to be honest, is not very helpful to me.

Debugging, I really suspect IMAP partial FETCH, which Thunderbird uses to download such messages, is involved in this...

Here is a simple exchange to reproduce the issue...

# Establish a TLS connection
openssl s_client -connect 127.0.0.1:993
# Login + SELECT
a0 login [email protected] 123456
a1 SELECT INBOX
# Fetch the size of message no 1 - convenient to play with afterward... Size is 664 bytes
a2 FETCH 1 RFC822.SIZE
# Fetching the whole body succeed
a3 FETCH 1 BODY[]
# Fetching the body from offset 400 works
a4 FETCH 1 BODY[]<400>
# We can also successfully fetch from offset 400 with a length going beyond the message size
a5 FETCH 1 BODY[]<400.400>
# We can get it with exactly the right offset and limit
a6 FETCH 1 BODY[]<400.264>
# But as soon as we drop below we hang forever
a7 FETCH 1 BODY[]<400.263>

I tested the same commands on master and it worked great.

I think we should try to add unit tests for partial fetch...

I did not have time today to investigate further, nor test things tonight, but I will definitely continue having a look at this tomorrow morning.

Regards,

Benoit

@chibenwa
Contributor

chibenwa commented Mar 1, 2022

Hello,

FYI we discussed this quickly on the mailing list and decided to release a 3.7.0 version prior to merging the Netty 4 upgrade. The rationale is to ensure stability and make sure the current code gets deployed and used before this is included in a release. Cf. https://www.mail-archive.com/[email protected]/msg71619.html . With partial fetches and compression we clearly showed we do not fully master the topic and that the TCP-level test suite for IMAP (at least) lacks the coverage to identify such defects...

I will start working on the (long overdue) 3.7.0 today, we should then be able to swap to 3.8.0-SNAPSHOT and merge this very soon.

Also, I am very satisfied and confident about the status of this work. Personally, I would like us to take the time to do a regular performance test campaign on this change (using Gatling) to further validate the performance benefits. This should also take 1-2 days.

Cheers,

Benoit

chibenwa added 5 commits March 1, 2022 12:24
Before, this parameter was ignored; we can now use it to size the Netty server
child executor instead.
Netty 4 no longer allows setting internal threads and instead relies
on event loop groups.

Removing unused methods...
…ExecutorThreads

Computations are now conducted on the child threads of the Netty 4 server bootstraps.

On Netty 3, those computations were conducted on a thread pool separate
from the Netty 'workers', which could be seen as unnecessary context switches -
Netty worker threads were merely decoding and encoding the traffic.
This is dead code since the Netty 4 migration. Remove.

We now have easier to exploit metrics (JMX, HTTP endpoint, logs, ES reportings)
that expose more interesting applicative level metrics.
@chibenwa
Contributor

chibenwa commented Mar 4, 2022

Hello,

Here is a status update on our ongoing testing work. We encountered several issues testing the Netty 4 migration on our pre-production environment (@Arsnael is involved in it too).

We have not yet reached the point where we could actually run our performance tests.

Issue 1: Partially written SELECT QRESYNC responses

When using the Evolution MUA (Evolution in debug mode logs the full IMAP exchange, which is convenient) and QRESYNC is enabled, an error is encountered for one mailbox of the account. Evolution complains the response was truncated...

Here is the exchange:

[imapx:I] I/O: 'I00104 SELECT Spam (QRESYNC (203407991 88 2:37 (1,10,28 2,11,29)))'
[imapx:I] I/O: '* FLAGS (\Answered \Deleted \Draft \Flagged \Seen Junk NonJunk $Forwarded)'
[imapx:I] I/O: '* 36 EXISTS'
[imapx:I] I/O: '* 0 RECENT'
[imapx:I] I/O: '* OK [UIDVALIDITY 203407991] UIDs valid'
[imapx:I] I/O: '* OK [UNSEEN 1] MailboxMessage 1 is first unseen'
[imapx:I] I/O: '* OK [PERMANENTFLAGS (\Answered \Deleted \Draft \Flagged \Seen Junk NonJunk $Forwarded \*)] Limited'
[imapx:I] I/O: '* OK [HIGHESTMODSEQ 124] Highest'
[imapx:I] I/O: '* OK [UIDNEXT 38] Predicted next UID'
[imapx:I] I/O: '* VANISHED (EARLIER) '
[imapx:I] I/O: ''

You can see there is no 'OK' response.

I tried to reproduce locally, but working with QRESYNC is horrible. We still need to conduct regression tests to know whether this also happens on Netty 3...

ISSUE 2: Thunderbird & IDLE not valid in this state

To reproduce, we open Thunderbird and switch mailboxes quickly. After a few quick mailbox switches, Thunderbird complains the mailbox can't be opened and says the server answered: IDLE command invalid in this state. All subsequent IMAP requests fail and Thunderbird needs to be restarted to get back to a clean state. It seems as if James logs out the session without closing the channel. It's unclear why this happens...

The current operation in inbox did not succeed. The mail server for account name [user_mail] responded: IDLE failed. Command not valid in this state.

Environment: distributed James in a cloud setup (k8s hosted at OVH). We don't reproduce locally.

We did not yet manage to get a traffic capture.

I suspect concurrency issues: one IMAP request could be processed before the previous one and thus not benefit from the state changes of previous commands? To be honest, this is still unclear to me... I did spend a few hours trying to reproduce by writing unit tests sending multiple commands at once, but failed...

I have a blind bet about this one: this changeset added the @Sharable annotation on a couple of handlers including the IMAP handler, and allowing concurrency on this handler might lead to incorrect handling in the context of a connected, stateful protocol. We still need to try such a change out...
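
A rough illustration of that suspected change (a sketch under assumptions, not the actual James classes): drop @Sharable on the stateful handler and let every connection get its own instance from the initializer.

    import io.netty.channel.ChannelInboundHandlerAdapter;
    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.socket.SocketChannel;

    public class ImapChannelInitializerSketch extends ChannelInitializer<SocketChannel> {
        @Override
        protected void initChannel(SocketChannel channel) {
            // One non-@Sharable handler instance per channel, so per-connection
            // IMAP state (selected mailbox, IDLE, ...) is never shared.
            channel.pipeline().addLast("coreHandler", createImapHandler());
        }

        private ChannelInboundHandlerAdapter createImapHandler() {
            // Placeholder for the real, stateful IMAP handler construction.
            return new ChannelInboundHandlerAdapter();
        }
    }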

ISSUE 3: Gatling list

Our performance test lists mailboxes, appends a few messages, selects a mailbox, then fetches a few messages; however, the listing always fails - Gatling cannot find an INBOX. We reliably reproduce this running Gatling & James locally.

================================================================================
2022-03-01 04:53:35                                         505s elapsed
---- Requests ------------------------------------------------------------------
> Global                                                   (OK=26316  KO=8768  )
> Connect                                                  (OK=2516   KO=0     )
> login                                                    (OK=2516   KO=0     )
> heavyScenario / append                                   (OK=3761   KO=0     )
> lightScenario / list                                     (OK=0      KO=1252  )
> heavyScenario / list                                     (OK=0      KO=7516  )
> lightScenario / select                                   (OK=1251   KO=0     )
> heavyScenario / select                                   (OK=7514   KO=0     )
> heavyScenario / fetch                                    (OK=7508   KO=0     )
> lightScenario / fetch                                    (OK=1250   KO=0     )
---- Errors --------------------------------------------------------------------
> Unable to find folder 'INBOX' in                                 8768 (100.0%)

---- ImapPlatformValidation ----------------------------------------------------
[---------------------------------------------------------------           ]  0%
          waiting: 475    / active: 2525   / done: 0     
================================================================================

CF https://github.com/linagora/james-gatling/blob/master/src/it/scala-2.12/org/apache/james/gatling/imap/PlatformValidationIT.scala

Using telnet, it seems we are getting "out of order" responses on the wire:

A1 list "" "*"
A1 OK LIST completed.
* LIST (\HasNoChildren) "." "INBOX"
* LIST (\HasNoChildren) "." "Sent"
* LIST (\HasNoChildren) "." "Trash"

Obviously, the OK should come after the untagged responses...

Would that mean we need to "await" or "chain" channel futures when we write responses to ensure a correct order?

Issue 4: Blocking on the Netty IO event loop (suspicion)

I suspect we currently run everything straight on the IO event loop (woken up when channels receive reads/writes). I suspect we might have no choice but to run handlers performing DB query operations (but likely also synchronous channel reads/writes) on a separate thread pool, as was done before. Local benchmarks conducted so far would fail to detect the impact of "blocking on the event loop" as the concurrency level is low (less than 8 concurrent connections). I wonder how the current setup will behave with high concurrency levels (hundreds of concurrent connections).

Actions

  • Evaluate if the QRESYNC issue is a regression compared to Netty 3
  • Unit tests reproducing the QRESYNC issue (hard)
  • Capture what James thinks it sends for the QRESYNC issue?
  • Unit test reproducing the LIST ordering issue
  • Try whether awaiting after writes solves the LIST response ordering issue
  • Try the @Sharable change for the IMAP handler and see if it has an impact on the IDLE TB issue

Apparently, this changeset might keep us busy for quite some time still...

Regards,

Benoit

@chibenwa
Contributor

chibenwa commented Mar 4, 2022

Follow up regarding #886 (comment)

Issue 1

We reproduced the bug on master, so this is not a regression. This bug will thus be handled separately, cf. https://issues.apache.org/jira/projects/JAMES/issues/JAMES-3722

Issue 2

We confirm that the ordering of the actual writes on the channel is not necessarily the same as the one specified in the code (channel.writeAndFlush(xxx)).

We solved the issue by awaiting write completion in ChannelImapResponseWriter (which is a terrible practice). @Arsnael will share that version of the code with us.

I'd also like to try something relying on future chaining - relying on channel.writeAndFlush(data, previousFuture) and carrying over the futures of the pending writes to enforce ordering without blocking. More on this later...
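
For reference, a minimal sketch of that future-chaining idea (an assumption, not code from this PR): each response is only written once the previous write has completed, so ordering is preserved even when writes originate from different threads.

    import io.netty.channel.Channel;
    import io.netty.channel.ChannelFuture;
    import io.netty.channel.ChannelPromise;

    public class OrderedChannelWriter {
        private ChannelFuture lastWrite;

        public synchronized ChannelFuture writeOrdered(Channel channel, Object response) {
            ChannelPromise promise = channel.newPromise();
            if (lastWrite == null) {
                // First write: issue it directly.
                channel.writeAndFlush(response, promise);
            } else {
                // Defer this write until the previous one has completed.
                ChannelFuture previous = lastWrite;
                previous.addListener(ignored -> channel.writeAndFlush(response, promise));
            }
            lastWrite = promise;
            return promise;
        }
    }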

Issue 3

No progress yet.

Issue 4

No progress yet.

Issue 5

More buffer leaks. @Arsnael spotted this:

09:15:59.804 [ERROR] i.n.u.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.       
Recent access records:                                                                     
Created at:                                                                                
        io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:402)                                   
        io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)                                                 
        io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)           
        io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:116) 
        io.netty.handler.ssl.SslHandler.allocate(SslHandler.java:2247)                                                                 
        io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1336)                       
        io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1234)          
        io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1283)                                                     
        io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
        io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)                                
        io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)                                     
        io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)                
        io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)           
        io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)              
        io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)          
        io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)                           
        io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
        io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
        io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
        io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
        io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
        io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
        io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        java.base/java.lang.Thread.run(Unknown Source)

We need to be sure we run with -Dio.netty.leakDetectionLevel=advanced cc @Arsnael

What's weird to me is that it seems to be allocated by SslHandler, so logically SslHandler is responsible for releasing the buffer, not us. Also, from what I could see, ByteToMessageDecoder manages the buffers for ImapRequestFrameDecoder ... Should we try opening a bug report upstream with the Netty project?

@ff-wl
Contributor Author

ff-wl commented Mar 4, 2022

What's weird to me is that it seems to be allocated by SslHandler, so logically SslHandler is responsible for releasing the buffer, not us. Also, from what I could see, ByteToMessageDecoder manages the buffers for ImapRequestFrameDecoder ... Should we try opening a bug report upstream with the Netty project?

The SslHandler allocates the buffer (after decryption) and puts it into the handler queue. The next handler is then responsible for releasing the buffer or "firing" it to the next handler. The last handler must release it. With the advanced leak detection level you will see how this responsibility is handed over.
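
As a minimal sketch of that "last handler" responsibility (an assumption, not James code): a terminal handler based on SimpleChannelInboundHandler releases the message once channelRead0 returns, so nothing reaches the pipeline tail unreleased.

    import io.netty.buffer.ByteBuf;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.SimpleChannelInboundHandler;

    public class ConsumingTailHandler extends SimpleChannelInboundHandler<ByteBuf> {
        @Override
        protected void channelRead0(ChannelHandlerContext ctx, ByteBuf msg) {
            // Consume the payload here; no explicit msg.release() is needed because
            // SimpleChannelInboundHandler releases the message after this method returns.
        }
    }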

@chibenwa
Contributor

chibenwa commented Mar 5, 2022

https://stackoverflow.com/questions/9082101/do-messages-sent-via-nettys-channel-write-preserve-their-order-when-begin-sen

Answer picked up from this issue. (Thanks Scottmitch@github).

Netty can provide "ordering" in the following situations:

  • You are doing all writes from the EventLoop thread; OR
  • You are doing no writes from the EventLoop thread (i.e. all writes are being done in other thread(s)).

It is a subtle but important thing to remember, as it can cause painful bugs. If some channel.write() calls are coming from application threads and some are coming from the EventLoop, then the writes can appear out of order from the application's perspective. See more details on the linked issue.

In the context of LIST, untagged responses are delegated to Reactor, which in the case of the distributed James server is scheduled on an elastic thread.

In ListProcessor:

        getMailboxManager().search(
                MailboxQuery.builder()
                    .userAndNamespaceFrom(basePath)
                    .expression(new PrefixedRegex(
                        basePath.getName(),
                        ModifiedUtf7.decodeModifiedUTF7(mailboxName),
                        mailboxSession.getPathDelimiter()))
                    .build(), Minimal, mailboxSession)
            // processResult sends untagged notification from an elastic thread
            .doOnNext(metaData -> processResult(responder, isRelative, metaData, getMailboxType(session, metaData.getPath())))
            .then()
            .block();

   // OK is currently sent from the event loop.
   okComplete(request, responder);

Note that processResult sends untagged responses.

So the conclusions are:

  • We don't need to await writes nor chain futures
  • The fix is similar to the one for the performance problems mentioned above: the solution is to switch ImapChannelUpstreamHandler out of the event loop.

Something like (IMAPServer):

public class IMAPServer extends AbstractConfigurableAsyncServer implements ImapConstants, IMAPServerMBean, NettyConstants {
    private EventExecutorGroup eventExecutors;
    // ...

    @Override
    public void doConfigure(HierarchicalConfiguration<ImmutableNode> configuration) throws ConfigurationException {
        // ...
        eventExecutors = new DefaultEventExecutorGroup(maxExecutorThreads, NamedThreadFactory.withName("imapserver"));
    }
    // ...
    
    @Override
    protected AbstractChannelPipelineFactory createPipelineFactory() {
        return new AbstractChannelPipelineFactory(getFrameHandlerFactory()) {
            // ...
            @Override
            public void initChannel(Channel channel) {
                //...
                pipeline.addLast(eventExecutors, CORE_HANDLER, createHandler());
            }

        };
    }
    // ...
}

I will propose those changes in a separate pull request, which @Arsnael will be able to try and, once validated, you will be able to cherry-pick.

Good weekend,

Cheers

@ff-wl
Contributor Author

ff-wl commented Mar 16, 2022

Sorry for the absence (I had to do other tasks).
What is the current status? Should I cherry-pick the commits from #903?

@chibenwa
Contributor

chibenwa commented Mar 17, 2022

sorry for the absence (I had to do other tasks).
What is the current status? Should I cherry-pick the commits from #903?

Hello @ff-wl !

No worries, we've been actively working on our side too! For days I've been meaning to write a status statement about our progress!

#903 is not that critical; I'd rather not put it on the critical path. Same for #898. We will rebase them later.

Remaining blockers

Please note that, 3.7.0 being released, once the above points are addressed we can merge ;-)

We are making slow but methodical progress, adding critical pieces of testing for the IMAP stack and significantly improving its reliability.

Best regards,

Benoit TELLIER

@chibenwa
Contributor

chibenwa commented Mar 17, 2022

I just fixed the IMAP IDLE + COMPRESS issue: in an attempt to prevent an array copy, the offset was ignored.
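
A hypothetical illustration of that bug class (not the actual James code): when a ByteBuf is backed by an array, the readable bytes do not start at index 0, so the offset has to be honoured when handing the array to a consumer such as a Deflater.

    import java.util.zip.Deflater;

    import io.netty.buffer.ByteBuf;

    public final class CompressInput {
        static void feed(Deflater deflater, ByteBuf buf) {
            if (buf.hasArray()) {
                int offset = buf.arrayOffset() + buf.readerIndex();
                int length = buf.readableBytes();
                // Passing 0 instead of 'offset' here would feed stale or unrelated
                // bytes into the compressed stream.
                deflater.setInput(buf.array(), offset, length);
            }
        }
    }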

Performance-wise, the Netty 4 stack seems slightly slower than the Netty 3 stack, but this does not seem like a blocker to me; we can keep investigating in the coming weeks, and fewer flushes + native EPOLL + SSL would outweigh that anyway.

Before

Screenshot from 2022-03-17 16-09-44

After

Screenshot from 2022-03-17 16-09-35

On the performance side I also did a flame graph analysis. For the record:

netty4-prod-2.zip

Help welcome on performance investigations.

Regarding the merge, I will rebundle the two PRs together. You don't need to take any action.

Contributor

@chibenwa chibenwa left a comment

To be merged at the same time as #908

Member

@quantranhong1999 quantranhong1999 left a comment

Nothing to add. Thank you for your great effort!

Contributor

@glennosss glennosss left a comment

Minor review comments! Happy to send in a pull request to fix some of these :-)

}

if (workerGroup != null) {
workerGroup.shutdownGracefully();
Contributor

Bad code - it should either wait for the promises to finish (i.e. blocking code, not recommended) or return a promise.

Example blocking code (not recommended):

        EventLoopGroup workerGroup = this.workerGroup;
        if (workerGroup != null) {
            this.workerGroup = null;
            workerGroup.shutdownGracefully().syncUninterruptibly();
        }

        EventLoopGroup bossGroup = this.bossGroup;
        if (bossGroup != null) {
            this.bossGroup = null;
            bossGroup.shutdownGracefully().syncUninterruptibly();
        }

Alternatively, and better: unbind() should return a promise combining all the individual promises so that the caller can sync() if required.
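
A hedged sketch of that non-blocking variant (names are assumptions, not the PR's code): start both graceful shutdowns and return a promise that completes once the worker group, then the boss group, have shut down.

    import io.netty.channel.EventLoopGroup;
    import io.netty.util.concurrent.DefaultPromise;
    import io.netty.util.concurrent.GlobalEventExecutor;
    import io.netty.util.concurrent.Promise;

    public final class ServerShutdown {
        public static Promise<Void> shutdownAsync(EventLoopGroup bossGroup, EventLoopGroup workerGroup) {
            Promise<Void> done = new DefaultPromise<>(GlobalEventExecutor.INSTANCE);
            // Error handling omitted for brevity: complete once both groups are down.
            workerGroup.shutdownGracefully().addListener(workerDone ->
                bossGroup.shutdownGracefully().addListener(bossDone ->
                    done.setSuccess(null)));
            return done;
        }
    }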

Contributor

CF #898 to fix this. Won't fix here in my opinion.


/**
* {@link IdleStateHandler} implementation which disconnect the {@link Channel} after a configured
* idle timeout. Be aware that this handle is not thread safe so it can't be shared across pipelines
*/
@ChannelHandler.Sharable
public class TimeoutHandler extends IdleStateHandler {
Contributor

Redundant implementation - just use the Netty provided implementation - io.netty.handler.timeout.ReadTimeoutHandler - and delete this class!
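
A minimal sketch of that suggestion (pipeline name and timeout value are assumptions): Netty's built-in ReadTimeoutHandler raises a ReadTimeoutException and closes the channel when nothing is read within the configured delay.

    import io.netty.channel.ChannelPipeline;
    import io.netty.handler.timeout.ReadTimeoutHandler;

    public final class TimeoutWiring {
        static void addTimeout(ChannelPipeline pipeline, int timeoutSeconds) {
            pipeline.addLast("timeoutHandler", new ReadTimeoutHandler(timeoutSeconds));
        }
    }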

@chibenwa
Contributor

Hello @glennosss !

Thanks for the valuable feedback. I added those suggestions to #924

@chibenwa
Contributor

Closing. To be continued at #924.

If there are any further comments, I will address them in #924

Thanks for the time and energy of everyone involved!

@chibenwa chibenwa closed this Mar 21, 2022