Rewrite OT data_pipeline scoring logic using Spark #1

Closed
eric-czech opened this issue Oct 23, 2019 · 7 comments
@eric-czech (Collaborator)

Hey @hammer, I think this is a good point to give you an update and start a conversation about where to go next with this.

I've now got a working pipeline script that replicates the OT scores at the association (i.e. target + disease) level, though I haven't checked per-source scores yet. The validation notebook shows a quick analysis: 979,752 of 979,765 (99.99%) of the scores are identical to those that end up in Elasticsearch (ES).
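
Roughly, the comparison in the notebook amounts to a join like the sketch below; the paths and the target_id / disease_id / score column names are placeholder assumptions, not the notebook's actual code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ScoreValidation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("score-validation").getOrCreate()

    // Scores computed by the Spark rewrite vs. those exported from ES.
    val computed = spark.read.parquet("gs://bucket/spark_association_scores")
    val expected = spark.read.parquet("gs://bucket/es_association_scores")

    // One row per (target, disease) association; counting exact matches
    // is where the 979,752 of 979,765 figure comes from.
    computed.alias("c")
      .join(expected.alias("e"), Seq("target_id", "disease_id"))
      .withColumn("identical", col("c.score") === col("e.score"))
      .groupBy("identical")
      .count()
      .show()
  }
}
```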

This only recreates the scores downstream of the initial data_pipeline processing steps (i.e. schema validation, gene id lookups, field renamings, and other fiddly bits), but I was pleased to see that, starting from about 14M evidence strings stored as JSON, it takes only ~60 seconds to calculate the scores. The same step in the OT data_pipeline took ~50 minutes (yikes) when I ran it this afternoon to compare results. Starting from Parquet instead of JSON is an obvious potential improvement too, so I'll certainly check whether that buys me anything, particularly since so many of the fields in the evidence strings are never used.
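
To make the Parquet idea concrete, a minimal sketch of what I have in mind; the paths and the pruned column list are assumptions (the real evidence schema has many more fields):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object EvidenceToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("evidence-to-parquet").getOrCreate()

    // One-time conversion of the raw JSON evidence strings.
    spark.read.json("gs://bucket/evidence/*.json")
      .write.mode("overwrite").parquet("gs://bucket/evidence_parquet")

    // Scoring runs then read only the columns they need; with Parquet's
    // columnar layout this is a true prune rather than a full JSON parse.
    val evidence = spark.read.parquet("gs://bucket/evidence_parquet")
      .select(
        col("target.id").as("target_id"),
        col("disease.id").as("disease_id"),
        col("sourceID").as("source_id"),
        col("scores.association_score").as("resource_score")
      )
    evidence.printSchema()
  }
}
```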

This is without EuropePMC, so I don't know what impact that will have yet. I would have expected the runtime to be some function of the number of raw evidence strings, but that isn't quite right, since one step of this pipeline explodes each record based on a list of disease ids it carries (the method for this is source dependent). That's why I say 14M records even though there are barely 1M raw evidence strings without EuropePMC.
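
The explosion step itself is simple in Spark terms. A hedged sketch, where disease_ids as the list field name is an assumption and the real per-source logic varies:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ExplodeEvidence {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("explode-evidence").getOrCreate()

    val evidence = spark.read.parquet("gs://bucket/evidence_parquet")

    // One output row per (evidence record, disease id) pair; this is how
    // <1M raw records become ~14M scored rows, so row count after the
    // explode, not raw evidence count, is what drives the runtime.
    val exploded = evidence.withColumn("disease_id", explode(col("disease_ids")))
    println(s"rows after explode: ${exploded.count()}")
  }
}
```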

Let me know if you see any other obvious avenues for improvement; otherwise, I'm feeling good about working backwards from here and slowly subsuming more and more of what's in data_pipeline.

@hammer (Contributor) commented Oct 23, 2019

Nice! Will give it a look tomorrow.

@hammer (Contributor) commented Oct 23, 2019

BTW, are you okay making this repo public?

@eric-czech (Collaborator, Author)

Yeah, for sure, I've got no objections to that.

@hammer (Contributor) commented Oct 24, 2019

Have you done any analysis on the 13 records that are scored differently by your code vs. OT?

@hammer (Contributor) commented Oct 24, 2019

So far this looks pretty sane! I think it's a great idea to see if we can start working our way back to the raw evidence files. As mentioned previously, if we can alter the existing data pipeline code to serialize evidence strings after fix_evidence but before score_evidence, that may be a good next target.

@eric-czech (Collaborator, Author) commented Oct 24, 2019

As for the validation discrepancies, it turns out that I had been writing the Scala pipeline against an export of evidence objects from a couple of weeks ago and comparing it to a data_pipeline run from yesterday. Despite both runs pointing at the same gcloud files (via the mrtarget data configuration), the raw resource_score values for a small number of records from a few of the sources changed. For phewas_catalog, sysbio, and expression_atlas specifically, I think they updated the files backing the pipeline. I had assumed those would be static between runs, but apparently not.

After re-exporting the evidence objects from the latest data_pipeline run and running the same Spark code, everything is equivalent now: Validation Notebook (updated)

@eric-czech eric-czech changed the title Rewrite OT data_pipeline on Spark Rewrite OT data_pipeline scoring logic using Spark Oct 28, 2019
@eric-czech (Collaborator, Author)

Everything related to scoring is now complete, with the final changes in https://github.com/related-sciences/ot-scoring/tree/644509ad2ac77c78bf9c9b7f08122dac0354cc32/src/main/scala/com/relatedsciences/opentargets/pipeline. This gives equivalence to OT (within epsilon = 1e-6) when recomputing the association scores from the base resource scores (of which there are multiple for each evidence string). The Validation Notebook has the most recent results.
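
For reference, the recomputation plus the tolerance check amounts to something like the sketch below. It uses the harmonic-sum aggregation OT documents for association scores (each per-record score divided by its squared rank after a descending sort); the column names, paths, and the simple cap at 1.0 are assumptions rather than a copy of the repo's code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AssociationScoring {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("association-scoring").getOrCreate()
    import spark.implicits._

    val resourceScores = spark.read.parquet("gs://bucket/resource_scores")

    // Harmonic sum: sort each association's resource scores descending,
    // weight the i-th score by 1 / i^2, and sum.
    val associations = resourceScores
      .groupBy("target_id", "disease_id")
      .agg(collect_list("resource_score").as("scores"))
      .as[(String, String, Seq[Double])]
      .map { case (target, disease, scores) =>
        val harmonicSum = scores.sorted(Ordering[Double].reverse)
          .zipWithIndex
          .map { case (s, i) => s / math.pow(i + 1, 2) }
          .sum
        (target, disease, math.min(harmonicSum, 1.0))
      }
      .toDF("target_id", "disease_id", "score")

    // Equivalence check against the OT-produced scores within epsilon = 1e-6;
    // an empty result means the two pipelines agree.
    val expected = spark.read.parquet("gs://bucket/ot_association_scores")
    associations.alias("c")
      .join(expected.alias("e"), Seq("target_id", "disease_id"))
      .filter(abs(col("c.score") - col("e.score")) > lit(1e-6))
      .show()
  }
}
```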

I'll call this issue complete and open more specific issues for tasks related to going further back in the pipeline: schema validation, field renaming, invalid record assessment, etc.
