Occasional 502 errors in prod #17

Open
justingrayston opened this issue Aug 3, 2020 · 8 comments

Comments

@justingrayston
Collaborator

Difficult to pin down the specifics on this, but it could be related to issue #13 or vice versa.

It looks like there are temporarily no instances available to serve traffic. Notably, it does seem to happen when there is a short burst of traffic, such as one or two new users suddenly and quickly using the marquee tool. Adding the warmup request has improved the situation.

Option 1 (not sure what option 2 is yet) is to reduce the concurrency on the API calls, as it may be forcing the only serving instance to fail before the warmup request has made the next instance ready.
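For reference, a rough sketch of what the warmup hook can look like, assuming the service is a Flask app on App Engine standard (warmup requests also need `inbound_services: warmup` enabled in app.yaml); the initialisation below is just a placeholder for whatever is actually slow to set up:

```python
# Rough sketch of an App Engine warmup handler, assuming a Flask app.
# GAE sends GET /_ah/warmup to a new instance before routing user
# traffic to it, so expensive initialisation can happen here rather
# than on the first real request.
from flask import Flask

app = Flask(__name__)

_heavy_state = None  # placeholder for models, clients, big imports, etc.


def _initialise():
    """Do the slow setup once per instance (placeholder implementation)."""
    global _heavy_state
    if _heavy_state is None:
        _heavy_state = object()  # stand-in for the real initialisation
    return _heavy_state


@app.route("/_ah/warmup")
def warmup():
    _initialise()
    return "", 200
```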

@roybgardner

@justingrayston Any logs or stack traces?

@justingrayston
Collaborator Author

The 502 indicates there are no instances available to serve, which is odd because we run a minimum number of instances in prod.

There are multiple logs - working out which ones relate to the 502 isn't obvious.

I just noted that we have occasional 500 errors at similar times, but all the log says is:
2020-08-04 20:29:09.816 BST ERROR:root:'32'

@justingrayston
Collaborator Author

Actually, looking at an example trace, it looks like the process timed out; the ERROR:root:'32' above is not related and is a different error. The last few 502s ran for longer than 5 minutes before failing. I guess this is someone sending an overly large image for cluster analysis; meanwhile the instance will be sent other traffic and eventually die, making it unavailable to serve the original request. (5 minutes is within the GAE timeout period of 10 minutes, so it isn't that.)

I think adding a maximum-size check, to keep things healthy and respond to the user, may be a good idea.
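Something like this hypothetical sketch (route and names are placeholders, not the actual main.py code): bail out early with a client error rather than letting the instance churn on a huge image:

```python
# Hypothetical sketch of the "max size catch" (route and names are
# placeholders, not the actual main.py code). The idea is to reject an
# oversized upload with a clear client error instead of letting the
# instance grind on it for minutes and eventually become unavailable.
from flask import Flask, jsonify, request

app = Flask(__name__)

MAX_IMAGE_BYTES = 10 * 1024 * 1024  # assumed budget, tune to the real limit


@app.route("/analyse", methods=["POST"])
def analyse():
    # Content-Length is cheap to check before reading any image data.
    if (request.content_length or 0) > MAX_IMAGE_BYTES:
        return jsonify(error="image too large for cluster analysis"), 413

    # ... hand off to the existing cluster analysis code ...
    return jsonify(status="ok"), 200
```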

justingrayston added a commit that referenced this issue Aug 4, 2020
By reducing the concurrency it should allow the longer-running
tasks more chance to complete.

This isn't a fix, more to see if it reduces the frequency of 502s
while a proper fix is put in place.

Relates to #17
@justingrayston
Collaborator Author

We already have a limit on pixel size 🤔 so it isn't that.

@justingrayston
Collaborator Author

I managed to reproduce the 502 error in dev (unfortunately, even after the concurrency was reduced).

@justingrayston
Collaborator Author

Increasing min instances, so there is spare headroom for traffic spikes, and resolving issue #13 seem so far to have stopped this. I will leave this open while we monitor over the next few days.

@roybgardner

@justingrayston Is there a sensible limit on the size of an image? The function get_source_image_pixels() in main.py has to process the entire image before determining that pixel_limit is exceeded.
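One possible shape for that check, just a sketch assuming Pillow is in use (the helper name is hypothetical, not the existing get_source_image_pixels()): Image.open() only reads the header, so width × height can be compared against pixel_limit before decoding any pixel data:

```python
# Sketch of a cheap pre-check, assuming Pillow is available. Image.open()
# is lazy: it reads the header to get the dimensions without decoding the
# pixel data, so the pixel_limit check can happen before any heavy work.
import io

from PIL import Image


def exceeds_pixel_limit(image_bytes: bytes, pixel_limit: int) -> bool:
    with Image.open(io.BytesIO(image_bytes)) as img:
        width, height = img.size  # header only; pixels not decoded yet
        return width * height > pixel_limit
```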

@justingrayston
Collaborator Author

Added googleartsculture/workbench#16 in workbench so we don't even send stuff we know could break.
