Rewrite OT data_pipeline scoring logic using Spark #1

Closed
eric-czech opened this issue Oct 23, 2019 · 7 comments
@eric-czech (Collaborator)

Hey @hammer, I think this is a good point to give you an update and start a conversation about where to go next with this.

I've now got a working pipeline script that replicates the OT scores at the association (i.e. target + disease) level, though I haven't checked per-source scores yet. The validation notebook shows a quick analysis: 979,752 of 979,765 (99.99%) of the scores are identical to those that end up in Elasticsearch (ES).
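
Roughly, the comparison in the notebook amounts to a join like the sketch below; the paths and the target_id / disease_id / score column names are placeholder assumptions, not the notebook's actual code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ScoreValidation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("score-validation").getOrCreate()

    // Scores computed by the Spark rewrite vs. those exported from ES.
    val computed = spark.read.parquet("gs://bucket/spark_association_scores")
    val expected = spark.read.parquet("gs://bucket/es_association_scores")

    // One row per (target, disease) association; counting exact matches
    // is where the 979,752 of 979,765 figure comes from.
    computed.alias("c")
      .join(expected.alias("e"), Seq("target_id", "disease_id"))
      .withColumn("identical", col("c.score") === col("e.score"))
      .groupBy("identical")
      .count()
      .show()
  }
}
```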

This only recreates the scores downstream of the initial data_pipeline processing steps (i.e. schema validation, gene id lookups, field renamings, and other fiddly bits), but I was pleased to see that, starting from about 14M evidence strings stored as JSON, it takes only ~60 seconds to calculate the scores. The same step in the OT data_pipeline took ~50 minutes (yikes) when I ran it this afternoon to compare results. Starting from Parquet instead of JSON is an obvious potential improvement too, so I'll certainly check whether that buys me anything, particularly since so many of the fields in the evidence strings are never used.
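
To make the Parquet idea concrete, a minimal sketch of what I have in mind; the paths and the pruned column list are assumptions (the real evidence schema has many more fields):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object EvidenceToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("evidence-to-parquet").getOrCreate()

    // One-time conversion of the raw JSON evidence strings.
    spark.read.json("gs://bucket/evidence/*.json")
      .write.mode("overwrite").parquet("gs://bucket/evidence_parquet")

    // Scoring runs then read only the columns they need; with Parquet's
    // columnar layout this is a true prune rather than a full JSON parse.
    val evidence = spark.read.parquet("gs://bucket/evidence_parquet")
      .select(
        col("target.id").as("target_id"),
        col("disease.id").as("disease_id"),
        col("sourceID").as("source_id"),
        col("scores.association_score").as("resource_score")
      )
    evidence.printSchema()
  }
}
```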

This is without EuropePMC, so I don't know what impact that will have yet. I would have expected the runtime to be some function of the number of raw evidence strings, but that isn't quite right, since one step of this pipeline explodes each record based on a list of disease ids it carries (the method for this is source dependent). That's why I say 14M records even though there are barely 1M raw evidence strings without EuropePMC.
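
The explosion step itself is simple in Spark terms. A hedged sketch, where disease_ids as the list field name is an assumption and the real per-source logic varies:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ExplodeEvidence {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("explode-evidence").getOrCreate()

    val evidence = spark.read.parquet("gs://bucket/evidence_parquet")

    // One output row per (evidence record, disease id) pair; this is how
    // <1M raw records become ~14M scored rows, so row count after the
    // explode, not raw evidence count, is what drives the runtime.
    val exploded = evidence.withColumn("disease_id", explode(col("disease_ids")))
    println(s"rows after explode: ${exploded.count()}")
  }
}
```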

Let me know if you see any other obvious avenues for improvement; otherwise, I'm feeling good about working backwards from here and slowly subsuming more and more of what's in data_pipeline.

@hammer (Contributor) commented Oct 23, 2019

Nice! Will give it a look tomorrow.

@hammer (Contributor) commented Oct 23, 2019

BTW, are you okay making this repo public?

@eric-czech (Collaborator, Author)

Yeah, for sure, I've got no objections to that.

@hammer (Contributor) commented Oct 24, 2019

Have you done any analysis on the 13 records that are scored differently by your code vs. OT?

@hammer (Contributor) commented Oct 24, 2019

So far this looks pretty sane! I think it's a great idea to see if we can start working our way back to the raw evidence files. As mentioned previously, if we can alter the existing data pipeline code to serialize evidence strings after fix_evidence but before score_evidence, that may be a good next target.

@eric-czech (Collaborator, Author) commented Oct 24, 2019

As for the validation discrepancies, it turns out that I had been writing the Scala pipeline against an export of evidence objects from a couple of weeks ago and comparing it to a data_pipeline run from yesterday. Despite both runs pointing at the same gcloud files (via the mrtarget data configuration), the raw resource_score values for a small number of records from a few of the sources changed. For phewas_catalog, sysbio, and expression_atlas specifically, I think they updated the files backing the pipeline. I had assumed those would be static between runs, but apparently not.

After re-exporting the evidence objects from the latest data_pipeline run and running the same Spark code, everything is equivalent now: Validation Notebook (updated)

@eric-czech eric-czech changed the title Rewrite OT data_pipeline on Spark Rewrite OT data_pipeline scoring logic using Spark Oct 28, 2019
@eric-czech (Collaborator, Author)

Everything related to scoring is now complete, with the final changes in https://github.com/related-sciences/ot-scoring/tree/644509ad2ac77c78bf9c9b7f08122dac0354cc32/src/main/scala/com/relatedsciences/opentargets/pipeline. This gives equivalence to OT (within epsilon = 1e-6) when recomputing the association scores from the base resource scores (of which there are multiple for each evidence string). The Validation Notebook has the most recent results.
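
For reference, the recomputation plus the tolerance check amounts to something like the sketch below. It uses the harmonic-sum aggregation OT documents for association scores (each per-record score divided by its squared rank after a descending sort); the column names, paths, and the simple cap at 1.0 are assumptions rather than a copy of the repo's code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AssociationScoring {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("association-scoring").getOrCreate()
    import spark.implicits._

    val resourceScores = spark.read.parquet("gs://bucket/resource_scores")

    // Harmonic sum: sort each association's resource scores descending,
    // weight the i-th score by 1 / i^2, and sum.
    val associations = resourceScores
      .groupBy("target_id", "disease_id")
      .agg(collect_list("resource_score").as("scores"))
      .as[(String, String, Seq[Double])]
      .map { case (target, disease, scores) =>
        val harmonicSum = scores.sorted(Ordering[Double].reverse)
          .zipWithIndex
          .map { case (s, i) => s / math.pow(i + 1, 2) }
          .sum
        (target, disease, math.min(harmonicSum, 1.0))
      }
      .toDF("target_id", "disease_id", "score")

    // Equivalence check against the OT-produced scores within epsilon = 1e-6;
    // an empty result means the two pipelines agree.
    val expected = spark.read.parquet("gs://bucket/ot_association_scores")
    associations.alias("c")
      .join(expected.alias("e"), Seq("target_id", "disease_id"))
      .filter(abs(col("c.score") - col("e.score")) > lit(1e-6))
      .show()
  }
}
```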

I'll call this issue complete and open more specific issues for tasks related to going further back in the pipeline: schema validation, field renaming, invalid record assessment, etc.
