Sanitizing the 'rows' request parameter results in no documents #2

Open
sjtower opened this issue Aug 2, 2017 · 6 comments

Comments

@sjtower

sjtower commented Aug 2, 2017

I have a SolrCloud setup with 16 shards.

I've set up the request sanitizer to limit rows to 1000 with the following in solrconfig.xml:

<str name="sanitize">rows=>1000:1000</str>

This works as expected and limits rows to 1000. However, the rows sanitization also appears to affect the start request parameter.

When I query this URL I see a valid response containing documents:
http://solr-901:8983/solr/journals_dev/select?fl=id&fq=doc_type:full&q=*:*&rows=1000&start=15000&wt=json
However, when I query this URL I see a response containing no documents:
http://solr-901:8983/solr/journals_dev/select?fl=id&fq=doc_type:full&q=*:*&rows=1000&start=16000&wt=json

Notice that the only difference is the start value.

I have determined that this behavior is dictated by the number of shards multiplied by the rows sanitization limit. In my case, 16 shards x 1000 row limit means I get no results when I query with start >= 16,000.
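
The arithmetic can be sketched as follows. This is a simplified model rather than Solr code: it assumes each shard's response is capped at `row_limit`, that every shard has at least that many matching documents, and the function name `docs_returned` is hypothetical.

```python
def docs_returned(start, rows, num_shards, row_limit):
    """Documents left for the requested page in a simplified model.

    Assumption: each shard returns at most row_limit documents and has
    at least that many matches, so the merged list available for paging
    holds num_shards * row_limit documents in total.
    """
    merged_total = num_shards * row_limit
    return max(0, min(rows, merged_total - start))


print(docs_returned(15000, 1000, 16, 1000))  # 1000
print(docs_returned(16000, 1000, 16, 1000))  # 0
```

Under these assumptions the two URLs above fall on either side of the 16 x 1000 boundary, which matches what I observe.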

Is this expected behavior, and is there any way I can work around it? We use paging on our website and this will affect any searches that go beyond result 16,000. We still need to limit rows, though.

Thanks!

@janhoy
Contributor

janhoy commented Aug 2, 2017

Do you not see this behavior without the component active?

@sjtower
Author

sjtower commented Aug 2, 2017

Correct, I do not see this behavior if the component is inactive.

@janhoy
Contributor

janhoy commented Aug 7, 2017

The only thing I can think of is that, when doing distributed paging, each shard is somehow asked for an increased number of rows and the "entry point" shard then limits the response. I will need to debug and read some more code to validate this. Can you attempt some debug logging (bin/solr start -v) and see if you get any hints about what is going on?

It could be that the Request sanitizer should only do its magic for top-level requests and not do anything for local shard requests?
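
One way to picture that idea, as a sketch only: Solr marks the sub-requests it sends to individual shards with `isShard=true`, so a sanitizer could leave those untouched. The function below is a Python illustration of the logic, not the plugin's actual implementation; `sanitize_rows` and the plain-dict parameters are assumptions made for the example.

```python
def sanitize_rows(params, max_rows=1000):
    """Cap 'rows' on top-level requests only (hypothetical helper).

    params is a plain dict of request parameters as strings.
    Requests carrying isShard=true are assumed to be internal
    shard sub-requests and are returned unchanged.
    """
    if params.get("isShard", "false").lower() == "true":
        return dict(params)
    sanitized = dict(params)
    # Solr's default for a missing 'rows' parameter is 10.
    if int(sanitized.get("rows", "10")) > max_rows:
        sanitized["rows"] = str(max_rows)
    return sanitized


top_level = sanitize_rows({"q": "*:*", "rows": "5000", "start": "16000"})
shard_level = sanitize_rows({"q": "*:*", "rows": "17000", "isShard": "true"})
print(top_level["rows"])    # 1000
print(shard_level["rows"])  # 17000
```

With this shape, the limit still applies to what a client can ask for, while the larger internal requests needed for merging pass through.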

@sjtower
Author

sjtower commented Aug 25, 2017

Hi @janhoy, we've determined that a single shard will be sufficient for our SolrCloud implementation. If I have some free time in the coming weeks I can take another look at this and help debug it, but I'm afraid I can't spend much more time on it now. The issue should be easy to reproduce with two shards and a low rows sanitization value, say 100. I'm happy to provide our Solr configs. Thanks!

@janhoy
Contributor

janhoy commented Aug 24, 2021

@sjtower did you ever dig into this?

Thinking some more, requesting 1000 hits from offset 16000 means that each shard has to be asked for up to 17,000 rows (start + rows), since in the worst case all hits could come from a single shard. So a fix would likely be to disable the plugin on the distributed requests somehow.
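
The worst case (every top hit living on one shard) can be illustrated with a toy simulation. This is not Solr's merge code: shards are plain lists of scores sorted in descending order, and `per_shard_rows` stands in for whatever `rows` value each shard ends up honoring. All names are hypothetical.

```python
import heapq
from itertools import islice


def fetch_page(shards, start, rows, per_shard_rows):
    """Merge the top per_shard_rows scores of each shard, then slice the page.

    shards: list of lists, each sorted by descending score (toy data).
    """
    partials = [shard[:per_shard_rows] for shard in shards]
    merged = heapq.merge(*partials, reverse=True)
    return list(islice(merged, start, start + rows))


# Skewed data: all of the best 100 hits are on the first shard.
shard_a = list(range(200, 100, -1))  # scores 200..101
shard_b = list(range(100, 0, -1))    # scores 100..1
shards = [shard_a, shard_b]

# Shard responses capped at 10 rows: only 20 documents reach the merge step.
capped = fetch_page(shards, start=50, rows=10, per_shard_rows=10)
# Shard responses of start + rows = 60 rows: the page can be filled correctly.
full = fetch_page(shards, start=50, rows=10, per_shard_rows=60)

print(capped)  # []
print(full)    # [150, 149, 148, 147, 146, 145, 144, 143, 142, 141]
```

In this toy data the capped variant returns an empty page as soon as start reaches the number of merged documents, which is the same symptom reported above.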

On the other hand, if you do deep paging like this, then CursorMark is perhaps a better alternative, and it may not cause the same behavior.
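
For reference, cursor-based paging follows this pattern: send `cursorMark=*` together with a sort that includes the uniqueKey field, then pass each response's `nextCursorMark` back until it stops changing. The sketch below assumes `id` is the uniqueKey and takes a caller-supplied `fetch` function (hypothetical) that performs the HTTP request and returns the parsed JSON response. Whether this avoids the interaction with the sanitizer has not been verified here.

```python
def iter_cursor_pages(fetch, base_params, rows=1000, sort="id asc"):
    """Yield lists of documents using Solr cursorMark paging.

    fetch: callable taking a params dict and returning the parsed
           JSON response as a dict (hypothetical, supplied by caller).
    sort:  must include the collection's uniqueKey field; 'id' is an
           assumption here.
    """
    cursor = "*"
    while True:
        params = dict(base_params, rows=rows, sort=sort, cursorMark=cursor)
        # cursorMark cannot be combined with a non-zero start offset.
        params.pop("start", None)
        response = fetch(params)
        docs = response["response"]["docs"]
        if docs:
            yield docs
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            break  # cursor did not advance: no more results
        cursor = next_cursor
```

Because each request stays at `rows` documents and carries no growing offset, a 1000-row limit would not need to be exceeded at any page depth under this scheme.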

@sjtower
Author

sjtower commented Aug 24, 2021

I was never able to dig into this, apologies.
