IDs are tokenized during search gives strange results (bad tokenizing on colons) #157

Open
kltm opened this issue Nov 3, 2014 · 9 comments

Comments

@kltm
Member

kltm commented Nov 3, 2014

This is the trackable issue for:

http://jira.geneontology.org/browse/GO-624

The current statement of the issue is:

This is considered an issue because it seems unlikely this gene would have the observed associations with GO:0007072.

Possibilities to consider are that something is wrong with the search such that the "go" bits in the synonyms are matching (although why so few then), or that there is a hiccup in the ontology and these really are in the closure.

@kltm
Member Author

kltm commented Nov 3, 2014

Ugh. It is looking like the "go" bits now. For example, take:

http://amigo2.berkeleybop.org/amigo/gene_product/RGD:1308769

any search starting with "go:" removes all annotations. Looking at:

http://amigo.geneontology.org/amigo/gene_product/FB:FBgn0000535

all attempts at filtering with "go:" string fail--nothing is filtered.

Direct response with debug at:

http://golr.berkeleybop.org/select?defType=edismax&qt=standard&indent=on&wt=json&rows=10&start=0&fl=*,score&facet=true&facet.mincount=1&facet.sort=count&json.nl=arrarr&facet.limit=25&hl=true&hl.simple.pre=%3Cem%20class=%22hilite%22%3E&fq=document_category:%22annotation%22&fq=bioentity:%22FB:FBgn0000535%22&facet.field=source&facet.field=assigned_by&facet.field=aspect&facet.field=evidence_type_closure&facet.field=panther_family_label&facet.field=qualifier&facet.field=taxon_closure_label&facet.field=annotation_class_label&facet.field=regulates_closure_label&facet.field=annotation_extension_class_closure_label&q=go:0022008&qf=annotation_class^2&qf=annotation_class_label_searchable^1&qf=bioentity^2&qf=bioentity_label_searchable^1&qf=bioentity_name_searchable^1&qf=annotation_extension_class^2&qf=annotation_extension_class_label_searchable^1&qf=reference^1&qf=panther_family_searchable^1&qf=panther_family_label_searchable^1&qf=bioentity_isoform^1&qf=regulates_closure^1&qf=regulates_closure_label_searchable^1&debugQuery=true

Quoting the string prevents this. As well, you can see in the results that it is tokenizing on the colon.
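
To illustrate the effect, here is a toy sketch in Python. It is not Solr's analyzer: `toy_tokenize` and `toy_match` are hypothetical stand-ins that split on non-alphanumeric characters, which is roughly what the debug output suggests is happening to the unquoted query.

```python
import re

def toy_tokenize(text):
    """Hypothetical stand-in for a search tokenizer:
    lowercase the text and split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]

def toy_match(query, field_value):
    """True if any query token also appears as a token of the field value."""
    return bool(set(toy_tokenize(query)) & set(toy_tokenize(field_value)))

# The unquoted ID falls apart into a prefix token and a number token.
print(toy_tokenize("go:0022008"))  # ['go', '0022008']

# The stray "go" token then matches any other text containing "go".
print(toy_match("go:0022008", "GO:0007072"))  # True
```

Under these assumptions, the match comes from the shared "go" prefix rather than from the ID itself.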

@kltm changed the title from "Strange results when search AmiGO" to "IDs are tokenized during search gives trange results (bad tokenizing on colons)" on Nov 3, 2014
@kltm
Member Author

kltm commented Nov 3, 2014

This seems to be the same issue as #93. However, since the explanation is clearer here, I'm going to mark the earlier one as a dupe (although it should be read to get more background).

The current takeaway is that this is an issue, that there are a fair number of colon-related issues in Solr, and that it is probably not worth ripping up the plumbing right before we switch to Solr 4.x (which may have fixed this case or have slightly different issues, see: berkeleybop/bbop-js#16).

The current workaround for this is that in the case of ID search in free text (which was considered a marginal case initially, but not now), one can use quotes to force the correct behaviour.

@cmungall
Member

cmungall commented Nov 3, 2014

Can we not have the API intercept these queries and auto-quote them?
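
As a rough sketch of what such auto-quoting could look like (in Python for illustration, not the actual bbop-js code): the function name `auto_quote` and the `CURIE_LIKE` pattern are hypothetical, and the pattern assumes an ID is a prefix, a colon, and a local part with no whitespace or quotes.

```python
import re

# Assumed ID shape (hypothetical): prefix, colon, local part without spaces or quotes.
CURIE_LIKE = re.compile(r'^[A-Za-z][\w.]*:[^\s"]+$')

def auto_quote(query):
    """Wrap each unquoted, colon-containing token in double quotes."""
    # Keep already-quoted phrases as single parts; split the rest on whitespace.
    parts = re.findall(r'"[^"]*"|\S+', query)
    quoted = []
    for part in parts:
        if CURIE_LIKE.match(part):
            part = '"' + part + '"'
        quoted.append(part)
    return " ".join(quoted)

print(auto_quote("kinase go:0022008"))  # kinase "go:0022008"
print(auto_quote('"GO:0008750"'))       # "GO:0008750"
```

This only changes what is sent to Solr. Whether a quoted ID then matches still depends on how the target fields are analyzed on the Solr side.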

@kltm
Member Author

kltm commented Nov 3, 2014

So specifically, to propose a possible fix, you might add something to the consumer search function (https://kltm.github.io/bbop-js/docs/files/golr/manager-js.html#bbop.golr.manager.set_comfy_query): any "token" containing a colon would be automatically quoted at this stage so that it is not split further downstream. I'm not wild about this approach here, mainly because 1) there seem to be actual problems with what our version of Solr is doing with the colons and 2) I believe the tokenizer we're using for searchables is eliminating them anyways. I'm not immediately sure how to work around these except by revisiting from the backend up. For example, take:

http://a2-proxy1.stanford.edu/solr/select?defType=edismax&qt=standard&indent=on&wt=json&rows=10&start=0&fl=annotation_class,description,source,synonym,alternate_id,annotation_class_label,score,id&facet=true&facet.mincount=1&facet.sort=count&json.nl=arrarr&facet.limit=25&hl=true&hl.simple.pre=%3Cem%20class=%22hilite%22%3E&fq=document_category:%22ontology_class%22&facet.field=source&facet.field=subset&facet.field=regulates_closure_label&facet.field=is_obsolete&qf=annotation_class^3&qf=annotation_class_label_searchable^5.5&qf=description_searchable^1&qf=comment_searchable^0.5&qf=synonym_searchable^1&qf=alternate_id^1&qf=regulates_closure^1&qf=regulates_closure_label_searchable^1&debugQuery=true&q=%22GO:0008750%22

you can see that it /mostly/ removed the colon from existence in the parsed query, meaning that there is certainly no match (this would likely be due to the search tokenizer we're using on the Solr end for "_searchable"s). Trying a couple of ways to URL-encode that ahead of time doesn't help, and makes the parsed query even weirder; moreover, even if you could, I don't believe anything would match anyways.

I think the easiest approach would be to switch to Solr 4.6, where more of this is fixed, and take out a lot of these super annoying search issues in the process.

@kltm changed the title from "IDs are tokenized during search gives trange results (bad tokenizing on colons)" to "IDs are tokenized during search gives strange results (bad tokenizing on colons)" on Jan 19, 2016
@kltm modified the milestone: wishlist on Jul 27, 2016
@cmungall
Member

Also from http://jira.geneontology.org/browse/GO-1428

Seems odd that this search returns no result:
http://amigo.geneontology.org/amigo/medial_search?q=S000000031
would have expected it to return this entry:
http://amigo.geneontology.org/amigo/gene_product/SGD:S000000031

@kltm
Member Author

kltm commented Dec 16, 2016

This is the expected behavior given the tokenizing issue. Now that the work has been done for the new tokenizing with GOlr in the monarch stack, we just need to port it over to AmiGO by updating bbop-manager-golr.

@cmungall
Member

We're running into this again, see geneontology/helpdesk#99

This is really key: people expect to be able to search with the non-prefixed part of the ID. Do we still need to change bbop-manager-golr? Isn't this just a matter of adding the unprefixed form as something Solr searches on?
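
A minimal sketch of the "unprefixed form" idea, assuming it would run at load time to populate an extra searchable field. The function name `id_search_forms` is hypothetical, and no particular GOlr field name is implied.

```python
def id_search_forms(curie):
    """Return the full ID plus its unprefixed local part, if it has one."""
    forms = [curie]
    prefix, sep, local = curie.partition(":")
    # Only add the local part when there really is a "prefix:local" shape.
    if sep and prefix and local:
        forms.append(local)
    return forms

print(id_search_forms("SGD:S000000031"))  # ['SGD:S000000031', 'S000000031']
```

Indexing both forms would let a search for S000000031 find SGD:S000000031 without the user having to know the prefix.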

@Antonialock

Hi, any news on this? We would really like to do some analysis for a paper that we want to submit ASAP @ValWood

@ValWood
Copy link

ValWood commented Mar 1, 2018

@Antonialock This isn't our primary issue. This is a side issue.
