Occasional 502 errors in prod #17
Comments
@justingrayston Any logs or stack traces?
The 502 indicates there are no instances available to serve, which is odd because we run a minimum number of instances in prod. There are multiple logs, and working out which ones relate to the 502 isn't obvious. I did note that we have occasional 500 errors at similar times, but all they say is:
Actually, looking at an example trace, it looks like the process timed out. Given the above, I think adding a max size catch to keep things healthy and still respond to the user may be an idea.
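A minimal sketch of what that catch could look like, assuming a Python service using Pillow; the limit value and the names (`MAX_PIXELS`, `check_image_size`) are hypothetical, not taken from the codebase:

```python
from PIL import Image

# Hypothetical cap; the real limit would depend on what the service can
# comfortably process within the request deadline.
MAX_PIXELS = 25_000_000  # ~25 megapixels


def check_image_size(path):
    """Reject images that are too large to process before they time out."""
    with Image.open(path) as img:
        width, height = img.size
    if width * height > MAX_PIXELS:
        # Fail fast with a clear error instead of letting the worker run
        # past the request deadline and surface as a 502/500.
        raise ValueError(
            f"Image is {width}x{height} ({width * height} px), "
            f"which exceeds the {MAX_PIXELS} px limit"
        )
```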
Reducing the concurrency should give the longer-running tasks more chance to complete. This isn't a fix; it's more to see if it reduces the frequency of 502s while a proper fix is put in place. Relates to #17
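For reference, if the service runs on App Engine standard (the mentions of warmup requests and min instances suggest this, but it's an assumption), per-instance concurrency is set in app.yaml; the value below is illustrative only, not the one used in this change:

```yaml
# app.yaml (illustrative value, assuming App Engine standard)
automatic_scaling:
  # Fewer concurrent requests per instance gives long-running marquee
  # requests more headroom, at the cost of scaling out sooner.
  max_concurrent_requests: 5
```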
We already have a limit on pixel size 🤔 so it isn't that.
I managed to reproduce the 502 error in dev (unfortunately, even after the concurrency had been reduced).
Increasing min instances so there is spare headroom for traffic spikes, together with resolving issue #13, so far seems to have stopped this. I will leave this open while we monitor over the next few days.
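Again assuming App Engine standard, the min instances change would live in app.yaml alongside the concurrency setting; the value here is only a placeholder:

```yaml
# app.yaml (illustrative value, assuming App Engine standard)
automatic_scaling:
  # Keep spare warm capacity so a short burst of new users doesn't land
  # on a single saturated instance while a new one spins up.
  min_instances: 2
```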
@justingrayston Is there a sensible limit on the size of an image? The function
Added googleartsculture/workbench#16 in workbench so we don't even send stuff we know could break. |
Difficult to pin down the specifics on this, but it could be related to issue #13 or vice versa.
It looks like no instances are temporarily available to serve traffic. Notably, it does seem to happen when there is a short burst of traffic, akin to a new user or a couple of new users who suddenly and quickly use the marquee tool. Adding the warmup request has improved the situation.
Option 1 (not sure what option 2 is yet) is to reduce the concurrency on the API calls, as it may be forcing the only serving instance to fail before the warmup request has made the next instance ready.
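For completeness, a minimal sketch of a warmup handler on App Engine standard; the Flask setup is an assumption about the stack, but the `/_ah/warmup` route (together with `inbound_services: warmup` in app.yaml) is the standard mechanism:

```python
from flask import Flask

app = Flask(__name__)


@app.route("/_ah/warmup")
def warmup():
    # App Engine calls this before routing live traffic to a new instance
    # (requires `inbound_services: - warmup` in app.yaml), so any expensive
    # startup work (loading models, opening connections) can happen here
    # instead of on a user's first request.
    return "", 200
```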