
Add timeout for long polling connections and reconnect if connection breaks #32

Open

wants to merge 20 commits into master
Conversation

starstuck
Contributor

I have added polling reconnect on periodic timeouts and errors.

When running long sessions I ran into situations where the connection would break. In those situations the browser would still be able to send messages via skewer.log to the skewer-repl buffer, but commands wouldn't make it the other way. I am not entirely sure this fixes all issues with dropped connections, but it seemed to help in my cases. What do you think about these changes?

@skeeto
Owner

skeeto commented Sep 25, 2013

The likely cause is the browser timing out the connection if it's sitting idle too long. When this happens you can reconnect at any time just by calling skewer() in the browser console. How long does a session have to be before you experience this? In my experience, Firefox seems to never time out and Chromium will time out after maybe 6 hours or so.

Automatically reconnecting would be nice so that the user doesn't need to worry about it. However, I'm definitely not satisfied with your changes for a few reasons:

  • This aborts the connection and restarts it every 5 seconds. This sort of defeats the purpose of a long poll.
  • It increases the complexity of the skewer() function, which I think is already complex enough with its internal asynchronous loop.
  • It's a very specific fix to skewer(), when I think it could be generally added to skewer.getJSON() and skewer.postJSON().

What I'd like to see instead is an option for getJSON() and postJSON(), probably as a config object argument, that tells these functions to retry the request again and again until it finally succeeds. The retry should happen pretty much immediately the first time, then there should be a back-off factor that slows down the retries until it's a minute or more in between (maxing out).

skewer.getJSON(skewer.host + '/skewer/get', callback, {retry: true});

The retry option could also accept a number instead that limits the number of retries. More properties could control the back-off factor and such. Maybe the callback should even be rolled into the config object as a success property -- though the API should be backwards compatible. I'm thinking long-term with this.
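A backwards-compatible signature could be sketched roughly like this. All names and defaults here (normalizeOptions, success, retry) are assumptions for illustration, not the actual skewer API: the positional callback keeps working, while a `success` property on the config object takes over when present.

```javascript
// Hypothetical normalizer for a backwards-compatible getJSON/postJSON
// signature: callback may be passed positionally (old API) or as a
// `success` property on the config object (proposed API).
function normalizeOptions(callback, config) {
    var options = {};
    // Copy recognized config properties, if a config object was given
    if (config) {
        for (var key in config) {
            if (Object.prototype.hasOwnProperty.call(config, key)) {
                options[key] = config[key];
            }
        }
    }
    // The positional callback (old style) applies only if no `success` given
    if (typeof callback === 'function' && !options.success) {
        options.success = callback;
    }
    // `retry` may be true (unlimited) or a number (maximum attempts)
    if (options.retry === undefined) {
        options.retry = false;
    }
    return options;
}
```

With this shape, `skewer.getJSON(url, callback, {retry: true})` and a future `skewer.getJSON(url, {success: callback, retry: 5})` would both resolve to the same internal options object.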

For detecting when a retry should occur, definitely don't just abort and retry. Rely on events for this, like onreadystatechange, onerror, or maybe even ontimeout (not sure if this ever fires with xhr.timeout = 0) and really try to avoid a polling timer. Testing this out is a problem. I'm not seeing any way to simulate a browser timeout, so I don't know what it looks like. Looks like it depends on the OS settings? Waiting hours between tests isn't practical.
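The back-off described above — a nearly immediate first retry, then multiplicatively growing delays capped around a minute — might compute its delay schedule like this (the constants are assumptions, not anything from skewer's code):

```javascript
// Sketch of the proposed retry back-off. The first retry fires
// immediately; each later delay grows by a constant factor until it
// maxes out at one minute between attempts.
function retryDelay(attempt) {
    var initial = 100;     // ms before the second attempt (assumed)
    var factor = 4;        // back-off multiplier (assumed)
    var max = 60 * 1000;   // cap: one minute between retries
    if (attempt <= 0) {
        return 0;          // retry immediately the first time
    }
    var delay = initial * Math.pow(factor, attempt - 1);
    return Math.min(delay, max);
}
```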

@skeeto
Owner

skeeto commented Sep 25, 2013

Oh yeah, I also want to say thanks for looking into this!

@starstuck
Contributor Author

I see the point of having the error handling logic outside of the main skewer() function.

I still have some concerns about handling only connection errors. It worked fine for me when I was using an Apache proxy on top of Emacs, which was returning 502 (Proxy Timeout) responses after 3 minutes of idle connection. With the changes I just made it effectively behaved like the previous algorithm, except with a longer interval.

This will not work if you restart your HTTP server or if you connect through NAT gateways which silently drop inactive connections. I can imagine that in some situations you may get stalled TCP connections on busy Wi-Fi when using skewer. I am mostly concerned about mobile device usage, as it is difficult to open a console there and manually call skewer().

I was thinking that on top of retrying on errors, we could have a server-initiated ping every 2-3 minutes. That would act as a sort of keep-alive packet for routers and proxies along the way. It would also be useful information for the server, keeping the client list more up to date, which alone may improve connection reliability. After adding a systematic ping, we could also add a watchdog on the client which would reconnect if there was no response within some reasonable time. Alternatively the ping could be initiated by the client to keep the server side simple; the effect on connection reliability would be the same.

The only drawback would be some overhead on the wire, but with reasonable timeouts it should not be a big issue. Something between 2-3 minutes seems like a reasonable compromise.

What is your opinion on implementing that approach? I would be happy to prepare a proof of concept and run some tests in different setups.

If you don't like that, the only better solution I can come up with right now is implementing the communication channel via WebSocket. With WebSocket the keep-alive payload could be kept to a minimum, so network overhead would be negligible. The only problem with WS is that it is not supported on older browsers and Android devices. It would mean that we would have to support falling back to long polling as described above, which in turn means more code to maintain.

Oh, there is one more thing. You have suggested retrying the last request. I am not sure that is always the right thing to do. On POST requests I was usually getting the error after the data was already delivered, while we were waiting for a command from the server. If we keep the current logic, where POST is reused for long polling as well, we would have to have smarter retry logic: if the error was a timeout, a 502, or similar, retry only as a GET request; for anything else, retry the original request. Alternatively we could keep them separate and use GET for long polling and POST for delivering messages to the server. In that case, after each command from the server the client would send a new GET request to resume the long-polling connection and a separate POST request with just the payload, and POSTs would always be re-sent on errors. I know it adds significant network overhead, but it keeps the client logic simpler.

I also think that we should retry on 5xx errors, but not on 4xx. If access is forbidden we most probably will not be allowed any time soon, contrary to when the server is temporarily down; in the second case there is some point in retrying after a while. Maybe on 3xx we should follow the redirect to the other location for full compatibility with HTTP? What do you think?
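As a rough sketch, that status-class policy could be a single dispatch function. The return values, and treating status 0 (a dropped connection, where the XHR never got a response) as retryable, are assumptions for illustration:

```javascript
// Sketch of the status-class retry policy: retry on 5xx (server
// temporarily down), give up on 4xx (unlikely to become allowed soon),
// follow 3xx redirects, and retry on status 0 (network error).
function retryPolicy(status) {
    if (status === 0) {
        return 'retry';        // connection error, no HTTP response at all
    }
    if (status >= 500 && status < 600) {
        return 'retry';        // server error, may recover
    }
    if (status >= 400 && status < 500) {
        return 'abort';        // client error, retrying won't help
    }
    if (status >= 300 && status < 400) {
        return 'redirect';     // follow the new location
    }
    return 'done';             // 2xx: success
}
```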

@skeeto
Owner

skeeto commented Sep 27, 2013

Tomasz Szarstuk [email protected] writes:

This will not work if you restart your HTTP server or if you connect
through NAT gateways which silently drop inactive connections. I can
imagine that in some situations you may get stalled TCP connections on
busy Wi-Fi when using skewer.

What happens to the XHR when the TCP connection is silently dropped?
Does it trigger the error event or is it effectively undetectable? If
it's detectable I'd really like to avoid a keep-alive ping.

I am mostly concerned about mobile device usage, as it is difficult to
open a console there and manually call skewer().

Good point.

If you don't like that, the only better solution I can come up with
right now is implementing the communication channel via WebSocket.
With WebSocket the keep-alive payload could be kept to a minimum, so
network overhead would be negligible. The only problem with WS is that
it is not supported on older browsers and Android devices. It would
mean that we would have to support falling back to long polling as
described above, which in turn means more code to maintain.

Some months ago I also considered adding WebSocket support, making it
the primary connection method, and leaving long polling as a fallback
for older browsers or other limited scenarios. There already exists a
nice WebSocket library that could immediately be put to use for this:

https://github.com/ahyatt/emacs-websocket

I've hesitated to do it because it complicates the end-user
configuration. They would need to worry about two ports and two servers,
one for simple-httpd (serving /skewer) and another for the WebSocket
server.

Oh, there is one more thing. You have suggested retrying the last
request. I am not sure that is always the right thing to do. On POST
requests I was usually getting the error after the data was already
delivered, while we were waiting for a command from the server. If we
keep the current logic, where POST is reused for long polling as well,
we would have to have smarter retry logic: if the error was a timeout,
a 502, or similar, retry only the GET request; for anything else,
retry the original request.

That's a really good point. I hadn't thought of that.

Alternatively we could keep them separate and use GET for long polling
and POST for delivering messages to the server. In that case after
each command from the server the client would send a new GET request
to resume the long-polling connection and a separate POST request with
just the payload. In that case POSTs would always be re-sent on
errors. I know it adds significant network overhead, but it keeps the
client logic simpler.

Good idea, I think this is the right direction to go. Requiring two TCP
connections is reasonable -- it's already doing so for skewer.log() --
and the additional overhead per Emacs response is acceptable. The
current poll-on-POST seems to be incompatible with timeouts so it
really leaves us with no other choice.

I also think that we should retry on 5xx errors, but not on 4xx. If
access is forbidden we most probably will not be allowed any time
soon, contrary to when the server is temporarily down; in the second
case there is some point in retrying after a while. Maybe on 3xx we
should follow the redirect to the other location for full
compatibility with HTTP? What do you think?

This is a good idea. When talking to Emacs it should only get a 40x
response if Skewer was disabled but the web server continued running.
This could happen if Emacs was killed and restarted but Skewer hadn't
autoloaded yet, and with auto-reconnecting it would become likely to
happen.

Implementing any of these changes will be risky for Skewer's stability
since it's modifying the core of Skewer. Particularly with XHR,
different browsers are going to behave differently, so I want to spend
time testing and being cautious about it before merging anything into
master.

Moving forward I'd like to go with your idea of long-polling only with
GET, with POST responses on a second channel. I really like your queue
idea, so Emacs should always be expecting a collection of objects on
POST, not just a single response. For handling timeouts, if you can show
me that pinging is truly necessary I'll accept that, with a configurable
keep-alive window.

@skeeto
Owner

skeeto commented Oct 2, 2013

Sorry, I've been busy so I didn't have a chance to look into this until tonight. Should I take a look at long-polling-reconnect again now?

@starstuck
Contributor Author

Maybe give me a few days to finish it up. I got it working well in my environment. I would like to spend some time testing it in different configurations and add the original post and get methods for API compatibility. I think it would be best for you to look into the proposal once it is confirmed to perform well.

@starstuck
Contributor Author

Sorry for taking a long time with these changes. I was busy with other projects. I have refactored a few bits of my original proposal. Following your suggestion I have decided to leave the original skewer.getJSON and skewer.postJSON for compatibility and implemented a separate, reliable, queued message channel for communicating between the JS and Emacs parts of skewer.

The solution is highly inspired by the BOSH protocol. It keeps 2 persistent connections, of which one is almost always kept in a polling state, so the server can send messages as soon as possible. At the same time the other connection is for the client to send messages immediately, as soon as it has anything to send.

To keep it efficient the polling uses 5 minute timeouts to force a change of tokens. From my experience this is a reliable value; otherwise proxies and browsers may kill the pending connection. Also when the connection is broken, the browser will re-establish it as soon as it becomes available.
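The always-pending side of the channel can be pictured as a loop in which exactly one request is outstanding at any time: it resolves either with messages or on the long-poll timeout, and in both cases the client immediately re-issues it. A minimal sketch, where `transport` is a hypothetical stand-in for the real XHR machinery:

```javascript
// Sketch of the polling half of a BOSH-style channel. `transport(cb)` is
// assumed to issue one long-poll request and invoke cb with an array of
// messages (possibly empty, on timeout). The loop re-polls immediately
// after every completed request so the server always has a pending one.
function pollLoop(transport, onMessages, state) {
    state = state || {polls: 0, running: true};
    if (!state.running) {
        return state;          // channel was closed; stop re-polling
    }
    state.polls++;
    transport(function (messages) {
        if (messages && messages.length) {
            onMessages(messages);
        }
        pollLoop(transport, onMessages, state);  // resume polling at once
    });
    return state;
}
```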

I had to change the structure inside skewer-clients to include a connection id, which is unique per client session (page visit). Only with this can we track and close unnecessary pending connections on the server. A side effect is that skewer-list-clients now shows more reliable info.

I guess that because of the massive change since my first proposal, you may find it more convenient to review all the changes compared to master, rather than against the previous ones.

In my tests the solution seems to perform very well on desktop and mobile devices, even over unreliable networks like 3G.

I have already implemented bulk message sending from the browser to Emacs. If you like the idea I could add it to messages sent from Emacs to the browser. Being able to combine many messages into a single request improves performance a lot if you have many skewer.log statements in your code. I have not noticed Emacs sending many messages in a short period of time, though.

What do you think? Do you like this approach? Any suggestion for further improvements?

@skeeto
Owner

skeeto commented Feb 13, 2014

Thanks for working on this! It will take some time for me to review/test all of it.

@skeeto
Owner

skeeto commented Feb 18, 2014

Why did you set a Keep-Alive: timeout=300, max=2 header?

@skeeto
Owner

skeeto commented Feb 18, 2014

I really like this new protocol, especially having a client ID like that. It's looking pretty obvious I didn't know what the heck I was doing before! This is much cleaner now.

I took your code and ran with it in my own branch (of the same name). I refactored the client interface into generic functions (EIEIO) and extracted all the BOSH-specific code into skewer-bosh.el. My long-term goal is to add seamless support for WebSocket clients -- and potentially even node.js clients. The core skewer functions will remain unaware of what kind of client they're handling.

One significant change was moving the request queue from a global queue to a per-BOSH-client queue. A websocket client doesn't need an explicit queue, just as other implementations probably don't, plus this eliminates some race conditions that were around before. Client keys (which I've renamed to IDs on Emacs' side of things) allow this per-client queue, since each client is much better tracked. The client ID maps to a specific EIEIO object, so the channel servlet can always get the associated client object (skewer-get-client).
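The ID-to-queue mapping described above can be sketched as follows. The real code is Emacs Lisp (EIEIO objects in skeeto's branch); this JavaScript version is only an illustration of the structure, and all names here are hypothetical:

```javascript
// Illustrative per-client registry: each client ID maps to its own
// object holding a pending-request queue and a last-seen timestamp,
// instead of all clients sharing one global queue.
function ClientRegistry() {
    this.clients = {};   // id -> {queue: [...], lastSeen: timestamp}
}

// Fetch (or lazily create) the client object for an ID -- the analogue
// of looking a client up by its key in the channel servlet.
ClientRegistry.prototype.getClient = function (id) {
    if (!this.clients[id]) {
        this.clients[id] = {queue: [], lastSeen: Date.now()};
    }
    return this.clients[id];
};

// Queue a request for one specific client only.
ClientRegistry.prototype.enqueue = function (id, request) {
    this.getClient(id).queue.push(request);
};

// Pop the next pending request, refreshing the last-seen time;
// returns undefined when that client's queue is empty.
ClientRegistry.prototype.next = function (id) {
    var client = this.getClient(id);
    client.lastSeen = Date.now();
    return client.queue.shift();
};
```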

One issue that's sort of unresolved is handling disconnects. I made a change that drops the client if it sees its process has disconnected (skewer-next), but this may be premature (and may cause the client to miss requests.)

Later on I'm going to run it through the gauntlet of browsers I have available to make sure they all work properly, then merge it into master when you've signed off on this. When I eventually add websocket support it will prefer websocket connections, falling back onto BOSH if needed (or if explicitly configured to prefer BOSH).

With the BOSH code separated it should be easier to follow the protocol implementation and ensure it's working properly, containing all that stateful logic. Could you take a look at my changes to your code, specifically in skewer-bosh.el?

@skeeto
Owner

skeeto commented Feb 19, 2014

One problem I've noticed is that IE8 doesn't support setRequestHeader(), so the client key doesn't work. However, it may just be time to drop IE8 support anyway with XP going away in April.

@starstuck
Contributor Author

I like the way you have separated each communication channel's logic very much.

Answering your previous question: the Keep-Alive header is required to instruct browsers not to close the TCP connection after a request. The goal is to keep 2 sockets open for a long time, to avoid the overhead of establishing a TCP connection with every request. The timeout is the default for most browsers, so we may expect most proxies and NAT gateways to honor it. It can be further tuned, possibly as a customization option. We also want 2 active connections so the browser is able to send an answer (make a 2nd request) while Emacs is still holding the first one (the one waiting for a command from the server).

I need to investigate the idea of Emacs terminating disconnected clients some more. We may need to be more tolerant here. What I liked in the original solution was that requests from Emacs remained in the queue for a long time, so even when the connection was broken, when the client resumed it could pick up all queued tasks and execute them. This was very useful when working with mobiles or devices over dodgy Wi-Fi networks where disconnects were quite frequent. I am not even sure whether that is still the case. Let me play around with it to see how it performs.

You may have noticed that I have added support for connection .close() from the client side. I found it quite useful when I was working with several browsers but wanted to execute some commands in only a few of them.

I like the idea of having WebSocket support. It could be an even better channel for low latency communication. I guess having a BOSH fallback makes sense, because even if you have support in the browser and in Emacs, some proxies may still block you from using it.

If you like, I am very happy to help you out with the WebSocket implementation, especially the browser part. I guess it is a matter for a separate discussion/ticket.

I am afraid there is no easy workaround for IE8, because even if we pass the client key somewhere in the body, there will still be a problem with cross-domain requests. Are you OK with dropping IE8 support or should we consider moving the client identifier to the body?

@skeeto
Owner

skeeto commented Feb 19, 2014

The timeout is the default for most browsers, so we may expect most proxies and NAT gateways to honor it.

How did you arrive at a timeout of 300 seconds? Is that a standard value? I noticed the clients hang around properly for much longer than 5 minutes, so is this automatically retried as needed? I'm curious about what exact problem this header solves -- i.e. what would go wrong if it wasn't present.

I need to investigate the idea of Emacs terminating disconnected clients some more. We may need to be more tolerant here.

How about we extend that timeout to apply here? I'm now keeping track of each client's last-seen time. If the client is disconnected and we haven't seen them in the configurable timeout, then drop the client. I don't want to keep it around forever because it's building up a queue of commands that will never get processed.
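The drop policy described here reduces to a single predicate over the client's connection state and last-seen time. A minimal sketch, where the timeout value and all names are assumptions (and the real logic would live on the Emacs side, in Elisp):

```javascript
// Sketch of the stale-client drop policy: a disconnected client is kept
// around -- its command queue intact -- until a configurable timeout has
// elapsed since it was last seen, then dropped so the queue does not
// grow forever. The 5-minute default is an assumption.
var clientTimeout = 5 * 60 * 1000;   // ms; would be customizable

function shouldDropClient(client, now) {
    // Connected clients are never dropped; disconnected ones only after
    // the last-seen timestamp is older than the timeout.
    return !client.connected && (now - client.lastSeen) > clientTimeout;
}
```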

You may have noticed that I have added support for connection .close() from client side.

Yup, I did. You'll see I continued this in the websockets backend. Do you think it's worth trying to call this when the page is closing?

If you like, I am very happy to help you out with the WebSocket implementation, especially the browser part. I guess it is a matter for a separate discussion/ticket.

I actually did most of this last night. You can find it in the websockets branch. If the websocket ever disconnects or fails to connect for any reason, the BOSH method is immediately attempted instead. I was testing both it and BOSH out last night. One issue I noticed is that the Android 4.1 stock browser pretends to support websockets, but does not. The websocket sits in a "connecting" state forever (never triggering the BOSH fallback), making it hard to detect misbehavior. Safari 5 (the latest version Apple actually offers) also has broken websockets and, fortunately, it fails quickly, but it's noisy on Emacs' side from emacs-websocket's warnings.
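Guarding the WebSocket attempt with an explicit timer is one way to handle a socket that hangs in "connecting" forever, as described above for the Android 4.1 stock browser: race the connection attempt against a timeout, and fall back to BOSH on any failure. This is only a sketch of the idea, not the code in the websockets branch; the connectors are passed in as promise-returning functions and all names are illustrative:

```javascript
// Race the WebSocket connection attempt against a timer so that a
// socket stuck in "connecting" still triggers the BOSH fallback.
function connectWithFallback(tryWebSocket, tryBosh, timeoutMs) {
    var timedOut = new Promise(function (resolve, reject) {
        setTimeout(function () {
            reject(new Error('websocket connect timed out'));
        }, timeoutMs);
    });
    return Promise.race([tryWebSocket(), timedOut])
        .catch(function () {
            // Any failure, including the timeout, falls back to BOSH.
            return tryBosh();
        });
}
```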

Are you OK with dropping IE8 support or should we consider moving client identifier to body?

Yeah, I think it's probably time to do so. While it's officially supported by MS for another 6 years, the most trafficked parts of the Internet don't work with it anymore.
