Occasional 502 errors in prod #17

Open
justingrayston opened this issue Aug 3, 2020 · 8 comments

Comments

@justingrayston
Collaborator

Difficult to pin down the specifics on this, but it could be related to issue #13 or vice versa.

It looks like there are temporarily no instances available to serve traffic. Notably, it does seem to happen when there is a short burst of traffic, such as one or two new users suddenly and quickly using the marquee tool. Adding the warmup request has improved the situation.

Option 1 (not sure what option 2 is yet) is to reduce the concurrency on the API calls, as it may be forcing the only serving instance to fail before the warmup request has made the next instance ready.
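For reference, a rough sketch of what the warmup hook can look like, assuming the service is a Flask app on App Engine standard (warmup requests also need `inbound_services: warmup` enabled in app.yaml); the initialisation below is just a placeholder for whatever is actually slow to set up:

```python
# Rough sketch of an App Engine warmup handler, assuming a Flask app.
# GAE sends GET /_ah/warmup to a new instance before routing user
# traffic to it, so expensive initialisation can happen here rather
# than on the first real request.
from flask import Flask

app = Flask(__name__)

_heavy_state = None  # placeholder for models, clients, big imports, etc.


def _initialise():
    """Do the slow setup once per instance (placeholder implementation)."""
    global _heavy_state
    if _heavy_state is None:
        _heavy_state = object()  # stand-in for the real initialisation
    return _heavy_state


@app.route("/_ah/warmup")
def warmup():
    _initialise()
    return "", 200
```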

@roybgardner

@justingrayston Any logs or stack traces?

@justingrayston
Collaborator Author

The 502 indicates there are no instances available to serve, which is odd because we run a minimum number of instances in prod.

There are multiple logs - working out which ones relate to the 502 isn't obvious.

I just noted that we have occasional 500 errors at similar times, but all the log says is:
2020-08-04 20:29:09.816 BST ERROR:root:'32'

@justingrayston
Collaborator Author

Actually, looking at an example trace, it looks like the process timed out; the ERROR:root:'32' above is not related and is a different error. The last few 502s ran for longer than 5 minutes before failing. I guess this is someone sending an overly large image for cluster analysis; meanwhile the instance will be sent other traffic and eventually die, making it unavailable to serve the original request. (5 minutes is within the GAE timeout period of 10 minutes, so it isn't that.)

I think adding a maximum-size check, to keep things healthy and respond to the user, may be a good idea.
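Something like this hypothetical sketch (route and names are placeholders, not the actual main.py code): bail out early with a client error rather than letting the instance churn on a huge image:

```python
# Hypothetical sketch of the "max size catch" (route and names are
# placeholders, not the actual main.py code). The idea is to reject an
# oversized upload with a clear client error instead of letting the
# instance grind on it for minutes and eventually become unavailable.
from flask import Flask, jsonify, request

app = Flask(__name__)

MAX_IMAGE_BYTES = 10 * 1024 * 1024  # assumed budget, tune to the real limit


@app.route("/analyse", methods=["POST"])
def analyse():
    # Content-Length is cheap to check before reading any image data.
    if (request.content_length or 0) > MAX_IMAGE_BYTES:
        return jsonify(error="image too large for cluster analysis"), 413

    # ... hand off to the existing cluster analysis code ...
    return jsonify(status="ok"), 200
```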

justingrayston added a commit that referenced this issue Aug 4, 2020
By reducing the concurrency it should allow the longer-running
tasks more chance to complete.

This isn't a fix, more to see if it reduces the frequency of 502s
while a proper fix is put in place.

Relates to #17
@justingrayston
Collaborator Author

We already have a limit on pixel size 🤔 so it isn't that.

@justingrayston
Collaborator Author

I managed to reproduce the 502 error in dev (unfortunately, even after the concurrency was reduced).

@justingrayston
Collaborator Author

Increasing min instances, so there is spare headroom for traffic spikes, and resolving issue #13 seem so far to have stopped this. I will leave this open while we monitor over the next few days.

@roybgardner

@justingrayston Is there a sensible limit on the size of an image? The function get_source_image_pixels() in main.py has to process the entire image before determining that pixel_limit is exceeded.
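One possible shape for that check, just a sketch assuming Pillow is in use (the helper name is hypothetical, not the existing get_source_image_pixels()): Image.open() only reads the header, so width × height can be compared against pixel_limit before decoding any pixel data:

```python
# Sketch of a cheap pre-check, assuming Pillow is available. Image.open()
# is lazy: it reads the header to get the dimensions without decoding the
# pixel data, so the pixel_limit check can happen before any heavy work.
import io

from PIL import Image


def exceeds_pixel_limit(image_bytes: bytes, pixel_limit: int) -> bool:
    with Image.open(io.BytesIO(image_bytes)) as img:
        width, height = img.size  # header only; pixels not decoded yet
        return width * height > pixel_limit
```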

@justingrayston
Collaborator Author

Added googleartsculture/workbench#16 in workbench so we don't even send stuff we know could break.
