
Client auto-reconnect after connection failure is broken in 1.5.3 #57

Open
VictorVisoki opened this issue Mar 20, 2019 · 11 comments

@VictorVisoki

With a Proteus client of version 1.5.3, the client does not reconnect to the Proteus server after a connection failure, for example when the server is restarted.

Clients on version 1.5.2 work fine. Moreover, on a machine with multiple clients, both 1.5.2 and 1.5.3, only the 1.5.2 clients survived a Proteus server restart (restarted more than twice to verify).

In addition, the ability to use Proteus in integration tests (bringing up a test container with a Proteus server) is also broken. Again, going back to 1.5.2 makes it work fine.

The ability to reconnect to the server is critical for applications in production, which makes the new 1.5.3 version unusable.

@robertroeser
Contributor

Hi @VictorVisoki

Can you try with 1.5.5?
https://github.com/netifi-proteus/proteus-java/releases/tag/1.5.5

Thanks,
Robert

@VictorVisoki
Author

Hey Robert,
the entire application uses Spring to work with Proteus,
and as far as I've seen there is no 1.5.5 version of proteus-spring yet.
It would be difficult for me to try it in production without the Spring support.
I may be able to try the integration test.

@robertroeser
Contributor

You can use proteus-java 1.5.5 with proteus-spring 1.5.3 - but yes, it looks like proteus-spring still transitively pulls in proteus-java 1.5.3.

I will get this fixed.
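
Until then, a dependency override along these lines should work as a stopgap. This is only a sketch in Gradle's Kotlin DSL; the io.netifi.proteus:proteus-client coordinates are an assumption, so check the actual artifact names in your dependency tree:

configurations.all {
    resolutionStrategy {
        // Force the newer proteus-java over the 1.5.3 that proteus-spring pulls in transitively.
        force("io.netifi.proteus:proteus-client:1.5.5")
    }
}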

@VictorVisoki
Author

Thanks.
Update: I tried to run the integration test with 1.5.5 - it doesn't work.
As for the bigger issue, the auto reconnect after failure -
I won't be able to test that without Spring.

@robertroeser
Contributor

can you share what your integration test does so I can take a look?

@robertroeser
Contributor

also - are you running the 1.5.3 broker?

@VictorVisoki
Author

yes, the broker is 1.5.3

@robertroeser
Contributor

Hi @VictorVisoki ,

I forked the proteus spring example, and created a branch: https://github.com/robertroeser/proteus-spring/tree/retry

I changed the client to continually retry sending messages. If the broker is down it reconnects. I restarted the broker several times, and the connections on both the client and the server automatically re-established, and messages started flowing again. This was with 1.5.3.
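
Roughly, the retry loop looks like the sketch below. It is not the exact code in the branch: the one-second interval and the logging are assumptions, and HelloServiceClient/HelloRequest/HelloResponse are the generated stubs from the example project.

import reactor.core.publisher.Flux
import reactor.core.publisher.Mono
import java.time.Duration

// Keep sending a request every second; when the broker is down the call fails,
// the error is swallowed, and the next tick retries once the connection comes back.
fun sendForever(client: HelloServiceClient) {
    Flux.interval(Duration.ofSeconds(1))
        .concatMap { tick ->
            client.sayHello(HelloRequest.newBuilder().setName("tick $tick").build())
                .doOnNext { response -> println("reply: ${response.message}") }
                .onErrorResume { error ->
                    println("broker unreachable, will retry: ${error.message}")
                    Mono.empty<HelloResponse>()
                }
        }
        .blockLast()
}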

I'm going to need some more context to track down the issue you're having.

@VictorVisoki
Author

VictorVisoki commented Mar 21, 2019

I have a k8s cluster in which the Proteus broker runs as one of the pods.
The IP of the broker is dynamic and can be different after each restart (it's not localhost like in your example),
so of course I connect using a hostname.

When I run this setup with proteus-spring 1.5.3 and restart the Proteus pod a couple of times,
the connection is not restored.
The clients keep throwing this:

2019-03-19 12:35:51.132 INFO [ parallel-1] .p.DefaultProteusBrokerService selecting socket WeightedClientTransportSupplier{', supplierId=2, errorPercentage=Ewma(value=6.633688616271088E-72, age=1969229), latency=Ewma(value=0.0, age=1083497048), socketAddress=127.0.0.1:8001, activeConnections=0, selectedCounter=1} with weight 6.633688616271088E-72
2019-03-19 12:35:51.148 ERROR [actor-tcp-nio-3] .r.WeightedReconnectingRSocket error trying to broker
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:8001
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:1.8.0_181]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:1.8.0_181]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327) ~[netty-transport-4.1.33.Final.jar:4.1.33.Final]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[netty-transport-4.1.33.Final.jar:4.1.33.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:665) ~[netty-transport-4.1.33.Final.jar:4.1.33.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:612) ~[netty-transport-4.1.33.Final.jar:4.1.33.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:529) ~[netty-transport-4.1.33.Final.jar:4.1.33.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:491) [netty-transport-4.1.33.Final.jar:4.1.33.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:905) [netty-common-4.1.33.Final.jar:4.1.33.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Caused by: java.net.ConnectException: Connection refused
... 10 more

When I switch back to 1.5.2 it just works fine.

@VictorVisoki
Author

About the integration test, here's what I'm doing (in Kotlin):

@Testcontainers
class ProteusExampleTest {

    companion object {

        // Test container that brings up a Proteus broker (1.5.3).
        @Container
        @JvmField
        val proteusBroker = TestProteus()
    }

    // Minimal HelloService implementation that records the incoming message
    // and answers with a fixed reply.
    private class TestHelloService(private val replyMessage: String) : HelloService {

        lateinit var receivedMessage: String

        override fun sayHello(message: HelloRequest, metadata: ByteBuf): Mono<HelloResponse> {
            receivedMessage = message.name
            return Mono.just(HelloResponse.newBuilder().setMessage(replyMessage).build())
        }
    }

    @Test
    fun testConnection() {
        val serversGroupName = "testservices"
        val messageFromService = "I am Proteus server"
        val messageFromClient = "I am Proteus Client"

        // Server
        val testHelloService = TestHelloService(messageFromService)
        getProteus(serversGroupName).addService(
            HelloServiceServer(
                testHelloService,
                Optional.empty(),
                Optional.empty()
            )
        )

        // Client
        val connection = getProteus("testClientsGroup").group(serversGroupName)
        val helloServiceClient = HelloServiceClient(connection)

        val replyFromServer = helloServiceClient.sayHello(
            HelloRequest.newBuilder().setName(messageFromClient).build()
        ).block()!!.message

        assertEquals(messageFromService, replyFromServer)
        assertEquals(messageFromClient, testHelloService.receivedMessage)
    }

    private fun getProteus(groupName: String) =
        Proteus.builder()
            .host(proteusBroker.serverHost)
            .port(proteusBroker.tcpPort)
            .accessKey(proteusBroker.accessKey)
            .accessToken(proteusBroker.accessToken)
            .group(groupName)
            .poolSize(2)
            .build()
}

When the client version is 1.5.3, it simply hangs...
Again, with 1.5.2 it works fine.
TestProteus is just a test container that brings up the Proteus broker 1.5.3.

@robertroeser
Contributor

Hi @VictorVisoki ,

Thanks for writing back. This makes sense now. There was a bug in 1.5.2, since fixed, that would have allowed this to work. At a high level, because of the way Docker works, the Proteus broker needs a private address that it binds to and a public address that it advertises for clients to connect to.
For instance, a broker in a Docker container would bind to 127.0.0.1:8001 but tell clients to connect to it on 10.0.0.1:8001. This is necessary for clustering to work properly. We have more documentation and a Helm chart coming out in our next release.
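
To make the split concrete, here is a tiny, purely illustrative Kotlin sketch of the idea. BrokerAddresses and the POD_IP environment variable (injected via the Kubernetes Downward API) are assumptions for illustration, not the broker's actual configuration API:

// Illustrative only: the broker binds to a local/private address but advertises
// a publicly reachable one so other brokers and clients can connect to it.
data class BrokerAddresses(val bindAddress: String, val publicAddress: String)

fun resolveBrokerAddresses(): BrokerAddresses {
    // Inside the container the broker can bind to all local interfaces...
    val bind = "0.0.0.0"
    // ...while advertising the pod IP, assumed here to be injected as POD_IP via
    // the Kubernetes Downward API (fieldRef: status.podIP) in the deployment YAML.
    val advertised = System.getenv("POD_IP") ?: "127.0.0.1"
    return BrokerAddresses(bind, advertised)
}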

In the meantime I have created an example application that runs on Kubernetes.

Here is a link to the application:
https://github.com/robertroeser/example-app/

Here is the YAML I used to configure the brokers. It automatically sets the broker’s public address so that other brokers and clients can find it:
https://github.com/robertroeser/example-app/blob/master/scripts/k8s/acme/broker.yaml

Additionally please take a look at a service:
https://github.com/robertroeser/example-app/blob/master/scripts/k8s/acme/flight.yaml

This allows the service to connect to a broker and then receive a stream of the other available brokers.
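
For the client side, a sketch like the following (using the same builder calls as the test above) gives the client a stable initial address; the Service name, namespace, port, and credential types are assumptions about your cluster. Once connected, the client learns about the rest of the brokers from that stream:

// Hypothetical client setup: dial the broker through a stable Kubernetes Service
// DNS name instead of a pod IP, so broker pod restarts don't change the address
// the client connects to first.
fun connectViaService(accessKey: Long, accessToken: String) =
    Proteus.builder()
        .host("proteus-broker.default.svc.cluster.local") // assumed Service name/namespace
        .port(8001)                                       // assumed broker TCP port
        .accessKey(accessKey)
        .accessToken(accessToken)
        .group("testClientsGroup")
        .build()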

I believe this information should help get the application working correctly. Let me know if you have other questions.

Thanks,
Robert
