
Selecting all for particular search results in error #3701

Closed
saseestone opened this issue Nov 29, 2023 · 8 comments · Fixed by #3820

saseestone commented Nov 29, 2023

We received a report from Charles Fosselman in East Asia that he tried to "Select all" for this search: https://searchworks.stanford.edu/?f%5Baccess_facet%5D%5B%5D=Online&per_page=100&q=Zhongguo+li+shi&search_field=search

And received an error.
[Screenshot of the error message]

Cory/Chris indicated in Gryphon Core that Access team members will need to look into it.

Noting two details:

  • We seem to return an error for every item selected, so the only way to get away from the error message is to exit SearchWorks and restart the session.
  • The records are still added to the user's selection queue; we just return an error. In a different session from the selection action, the user can email the selections if they are logged into SearchWorks. If not, the user also runs into our reCaptcha issue. See Emailing Selections not working for guest users #3700

Jira issue with original feedback: SW-4254

@saseestone saseestone added the bug label Nov 29, 2023

jcoyne commented Jan 23, 2024

It appears that the front end here is overloading the server with requests. We'll need to increase backend capacity or modify the frontend so that it doesn't try to fire thousands of requests simultaneously. We could do this by limiting the feature to a reasonable number of requests or by queueing the requests. If we did the latter, we'd need to add a fair amount of complexity to ensure the queue is drained before the user navigates away from the page.

[Screenshot attached: 2024-01-23 at 9:21:36 AM]
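
To make the "queue the requests" option concrete, here is a minimal TypeScript sketch of a client-side queue that caps how many POSTs are in flight at once. The /selections/ endpoint is taken from the measurements later in this thread; the function name, signature, and concurrency limit are illustrative assumptions, not the actual SearchWorks code.

// Minimal sketch of a client-side request queue with bounded concurrency.
// Assumption: each selected document is POSTed to /selections/<id>, per the
// request counts reported later in this thread. Names are illustrative.

async function postWithConcurrencyLimit(
  urls: string[],
  maxInFlight = 5,
): Promise<Response[]> {
  const results: Response[] = new Array(urls.length);
  let next = 0;

  // Each worker repeatedly claims the next URL until none remain.
  // JavaScript is single-threaded, so `next++` needs no locking.
  async function worker(): Promise<void> {
    while (next < urls.length) {
      const i = next++;
      results[i] = await fetch(urls[i], { method: "POST" });
    }
  }

  // Start at most maxInFlight workers instead of 100 simultaneous requests.
  await Promise.all(
    Array.from({ length: Math.min(maxInFlight, urls.length) }, () => worker()),
  );
  return results;
}

// Hypothetical usage for a 100-item results page:
// await postWithConcurrencyLimit(docIds.map((id) => `/selections/${id}`));

As noted above, a real implementation would also need to keep the user from navigating away before the queue drains (for example with a beforeunload guard), which is where the extra complexity comes in.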

@corylown

I'm wondering if the load balancer has something to do with this problem. If I bypass the load balancer and go directly to one of the sw-webapp-* VMs then the select-all function works fine. There's a long discussion about connection issues with the load balancer on this issue, could be related: https://github.com/sul-dlss/operations-tasks/issues/3519

@julianmorley

If you can track down any of those 503s to a specific app server, it's unlikely that this is the same issue we're seeing with DSA - there, the app servers never even get the traffic. @jcoyne's thought that this is a capacity issue looks more likely to me absent other evidence.

I don't know how those app servers are configured, but there's likely headroom available just from a config change.

(Although if 'select all' here is generating 1,187 unique GETs to the backend server from a single client, that seems less desirable from a scalability perspective. That's an arms race I don't think we can win with server configs.)


jcoyne commented Jan 25, 2024

@julianmorley we determined the max number is actually 100 requests.

@julianmorley

@jcoyne I just tried it - I see what you mean. I think your first assumption that this is blatting the app server is right, though. Those 503s came back fast; it looks like whichever server the LB sent this client to ran out of available connections.


corylown commented Jan 26, 2024

@julianmorley @jcoyne this was simple enough to test and confirm that the POST requests resulting in the 503 response are never making it to any of the sw-webapp-* VMs.

I tailed each of the Apache request logs on the 5 sw-webapp-* VMs and clicked the select all widget in SearchWorks. My browser recorded a total of 100 POST requests to /selections/* which resulted in 76 successful requests with a 200 response and 24 failed requests with the 503 response.

At the same time on each of the SearchWorks prod VMs I tailed the request logs: tail -f SearchWorks_access_ssl.log | grep 'POST /selections/'. Across the VMs there were 76 POST requests recorded with a 200 response. 24 of the requests sent by my browser were never recorded in the Apache request log on any of the VMs. The 24 missing requests correspond to the requests that returned a 503.

I guess there could be conditions where a request makes it to the VM, but something goes wrong and it's never recorded in the request log. But I'm not sure how to determine that.
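
For anyone repeating this test, here is a rough TypeScript sketch of the browser-side half: fire the same burst of POSTs and tally the responses by status code. The per-document /selections/ URL shape and the docIds array are assumptions based on the counts reported above, not existing SearchWorks code.

// Hypothetical reproduction of the browser-side half of this test.
async function tallyBurst(docIds: string[]): Promise<Map<number, number>> {
  // Fire every POST at once, mirroring the current "select all" behavior.
  const responses = await Promise.all(
    docIds.map((id) => fetch(`/selections/${id}`, { method: "POST" })),
  );
  // Count responses by HTTP status code.
  const counts = new Map<number, number>();
  for (const res of responses) {
    counts.set(res.status, (counts.get(res.status) ?? 0) + 1);
  }
  return counts; // e.g. Map { 200 => 76, 503 => 24 } in the run described above
}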


julianmorley commented Jan 26, 2024

Were the requests that got through spread evenly amongst the 5 VMs? The LB is set to balance via 'least connections', so I'd expect the 76 that got through to be on at least 2, preferably 4 or 5 VMs.

The 503s were definitely coming from the LB:

HTTP/1.0 503 Service Unavailable
Retry-After: 2
Server: BigIP
Connection: Keep-Alive
Content-Length: 0

Anyhow, I tried this out with a different search; select all worked fine with 20 and 50 items, but gave a 500 Internal Server Error (that actually made it to the backend servers) with 100 items. That, plus the response header from the BigIP suggesting a retry, indicates you're hitting the load balancer's anti-DDoS protection: too many instantaneous connections from a single client. 20 is fine, 50 is probably fine, 100 is not. I don't know why my test search gets through to the app servers (where it fails, because of load) but the OP's does not.

(My search was https://searchworks.stanford.edu/?per_page=100&q=nelson&search_field=search)

Retries of the 503s worked fine, so that's more fuel for the "too many requests too fast" theory. I'd suggest trying to batch those requests client-side so you're not generating 100 simultaneous requests to the same URL.

EDIT: subsequent trial of my 'nelson' test case gave 503s, identical to OP. I'm thinking that's the LB's anti-DOS deciding that I'm a naughty person after initially allowing my tomfoolery.
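
Since the BigIP 503s carry a Retry-After: 2 header and manual retries succeeded, one option on the client side is to honor that header before resending. A minimal TypeScript sketch; the helper name, retry cap, and default delay are assumptions, not existing SearchWorks behavior.

// Sketch: retry a POST when the load balancer answers 503, waiting for the
// number of seconds given in Retry-After (the BigIP above sends 2).
async function postWithRetry(url: string, maxRetries = 3): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url, { method: "POST" });
    // Only retry load-balancer 503s, and give up after maxRetries attempts.
    if (res.status !== 503 || attempt >= maxRetries) {
      return res;
    }
    const delaySeconds = Number(res.headers.get("Retry-After") ?? "2");
    await new Promise((resolve) => setTimeout(resolve, delaySeconds * 1000));
  }
}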

@corylown

Thanks @julianmorley, that's helpful, and it makes complete sense that we're hitting some kind of anti-DOS protection at the load balancer. And yes, the 76 successful POST requests were nicely distributed across the 5 VMs. With this info we'll look at changes in the app so we're not firing off so many requests for this feature.

@jcoyne jcoyne self-assigned this Jan 29, 2024
jcoyne added a commit that referenced this issue Jan 29, 2024