
Selecting all for particular search results in error #3701

Closed
saseestone opened this issue Nov 29, 2023 · 8 comments · Fixed by #3820

saseestone commented Nov 29, 2023

We received a report from Charles Fosselman in East Asia that he tried to "Select all" for this search: https://searchworks.stanford.edu/?f%5Baccess_facet%5D%5B%5D=Online&per_page=100&q=Zhongguo+li+shi&search_field=search

And received an error.
[Screenshot of the error message]

Cory/Chris indicated in Gryphon Core that Access team members will need to look into it.

Noting two details:

  • We seem to return an error for every item selected, so the only way to get away from the error message is to exit SearchWorks and restart the session.
  • The records are still added to the user's selection queue; we just return an error. In a different session from the selection action, the user can email the selections if they are logged into SearchWorks. If not, the user also runs into our reCaptcha issue. See Emailing Selections not working for guest users #3700

Jira issue with original feedback: SW-4254

@saseestone saseestone added the bug label Nov 29, 2023

jcoyne commented Jan 23, 2024

It appears that the front end here is overloading the server with requests. We'll need to increase backend capacity or modify the frontend so that it doesn't try to fire thousands of requests simultaneously. We could do this by limiting the feature to a reasonable number of requests or by queueing the requests. If we did the latter, we'd need to add a fair amount of complexity to ensure the queue is drained before the user navigates away from the page.

[Screenshot attached: 2024-01-23 at 9:21:36 AM]
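
To make the "queue the requests" option concrete, here is a minimal TypeScript sketch of a client-side queue that caps how many POSTs are in flight at once. The /selections/ endpoint is taken from the measurements later in this thread; the function name, signature, and concurrency limit are illustrative assumptions, not the actual SearchWorks code.

// Minimal sketch of a client-side request queue with bounded concurrency.
// Assumption: each selected document is POSTed to /selections/<id>, per the
// request counts reported later in this thread. Names are illustrative.

async function postWithConcurrencyLimit(
  urls: string[],
  maxInFlight = 5,
): Promise<Response[]> {
  const results: Response[] = new Array(urls.length);
  let next = 0;

  // Each worker repeatedly claims the next URL until none remain.
  // JavaScript is single-threaded, so `next++` needs no locking.
  async function worker(): Promise<void> {
    while (next < urls.length) {
      const i = next++;
      results[i] = await fetch(urls[i], { method: "POST" });
    }
  }

  // Start at most maxInFlight workers instead of 100 simultaneous requests.
  await Promise.all(
    Array.from({ length: Math.min(maxInFlight, urls.length) }, () => worker()),
  );
  return results;
}

// Hypothetical usage for a 100-item results page:
// await postWithConcurrencyLimit(docIds.map((id) => `/selections/${id}`));

As noted above, a real implementation would also need to keep the user from navigating away before the queue drains (for example with a beforeunload guard), which is where the extra complexity comes in.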

@corylown

I'm wondering if the load balancer has something to do with this problem. If I bypass the load balancer and go directly to one of the sw-webapp-* VMs then the select-all function works fine. There's a long discussion about connection issues with the load balancer on this issue, could be related: https://github.com/sul-dlss/operations-tasks/issues/3519

@julianmorley

If you can track down any of those 503s to a specific app server, it's unlikely that this is the same issue we're seeing with DSA - there, the app servers never even get the traffic. @jcoyne's thought that this is a capacity issue looks more likely to me absent other evidence.

I don't know how those app servers are configured, but there's likely headroom available just from a config change.

(Although if 'select all' here is generating 1,187 unique GETs to the backend server from a single client, that seems less desirable from a scalability perspective. That's an arms race I don't think we can win with server configs.)


jcoyne commented Jan 25, 2024

@julianmorley we determined the max number is actually 100 requests.

@julianmorley

@jcoyne I just tried it - I see what you mean. I think your first assumption that this is blatting the app server is right, though. Those 503s came back fast; it looks like whichever server the LB sent this client to ran out of available connections.


corylown commented Jan 26, 2024

@julianmorley @jcoyne this was simple enough to test and confirm that the POST requests resulting in the 503 response are never making it to any of the sw-webapp-* VMs.

I tailed each of the Apache request logs on the 5 sw-webapp-* VMs and clicked the select all widget in SearchWorks. My browser recorded a total of 100 POST requests to /selections/* which resulted in 76 successful requests with a 200 response and 24 failed requests with the 503 response.

At the same time on each of the SearchWorks prod VMs I tailed the request logs: tail -f SearchWorks_access_ssl.log | grep 'POST /selections/'. Across the VMs there were 76 POST requests recorded with a 200 response. 24 of the requests sent by my browser were never recorded in the Apache request log on any of the VMs. The 24 missing requests correspond to the requests that returned a 503.

I guess there could be conditions where a request makes it to the VM, but something goes wrong and it's never recorded in the request log. But I'm not sure how to determine that.
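
For anyone repeating this test, here is a rough TypeScript sketch of the browser-side half: fire the same burst of POSTs and tally the responses by status code. The per-document /selections/ URL shape and the docIds array are assumptions based on the counts reported above, not existing SearchWorks code.

// Hypothetical reproduction of the browser-side half of this test.
async function tallyBurst(docIds: string[]): Promise<Map<number, number>> {
  // Fire every POST at once, mirroring the current "select all" behavior.
  const responses = await Promise.all(
    docIds.map((id) => fetch(`/selections/${id}`, { method: "POST" })),
  );
  // Count responses by HTTP status code.
  const counts = new Map<number, number>();
  for (const res of responses) {
    counts.set(res.status, (counts.get(res.status) ?? 0) + 1);
  }
  return counts; // e.g. Map { 200 => 76, 503 => 24 } in the run described above
}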


julianmorley commented Jan 26, 2024

Were the requests that got through spread evenly amongst the 5 VMs? The LB is set to balance via 'least connections', so I'd expect the 76 that got through to be on at least 2, preferably 4 or 5 VMs.

The 503s were definitely coming from the LB:

HTTP/1.0 503 Service Unavailable
Retry-After: 2
Server: BigIP
Connection: Keep-Alive
Content-Length: 0

Anyhow, I tried this out with a different search; select all worked fine with 20 and 50 items, but gave a 500 Internal Server Error (that actually made it to the backend servers) with 100 items. That, plus the response header from the BigIP suggesting a retry, indicates you're hitting the load balancer's anti-DDoS protection: too many instantaneous connections from a single client. 20 is fine, 50 is probably fine, 100 is not. I don't know why my test search gets through to the app servers (where it fails, because of load) but the OP's does not.

(My search was https://searchworks.stanford.edu/?per_page=100&q=nelson&search_field=search)

Retries of the 503s worked fine, so that's more fuel for the "too many requests too fast" theory. I'd suggest trying to batch those requests client-side so you're not generating 100 simultaneous requests to the same URL.

EDIT: subsequent trial of my 'nelson' test case gave 503s, identical to OP. I'm thinking that's the LB's anti-DOS deciding that I'm a naughty person after initially allowing my tomfoolery.
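
Since the BigIP 503s carry a Retry-After: 2 header and manual retries succeeded, one option on the client side is to honor that header before resending. A minimal TypeScript sketch; the helper name, retry cap, and default delay are assumptions, not existing SearchWorks behavior.

// Sketch: retry a POST when the load balancer answers 503, waiting for the
// number of seconds given in Retry-After (the BigIP above sends 2).
async function postWithRetry(url: string, maxRetries = 3): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url, { method: "POST" });
    // Only retry load-balancer 503s, and give up after maxRetries attempts.
    if (res.status !== 503 || attempt >= maxRetries) {
      return res;
    }
    const delaySeconds = Number(res.headers.get("Retry-After") ?? "2");
    await new Promise((resolve) => setTimeout(resolve, delaySeconds * 1000));
  }
}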

@corylown

Thanks @julianmorley, that's helpful, and it makes complete sense that we're hitting some kind of anti-DOS protection at the load balancer. And yes, the 76 successful POST requests were nicely distributed across the 5 VMs. With this info we'll look at changes in the app so we're not firing off so many requests for this feature.

@jcoyne jcoyne self-assigned this Jan 29, 2024
jcoyne added a commit that referenced this issue Jan 29, 2024