Rewrite OT data_pipeline scoring logic using Spark #1
Comments
Nice! Will give it a look tomorrow.
BTW, are you okay with making this repo public?
Yeah, for sure, I've got no objections to that.
Have you done any analysis on the 13 records that are scored differently by your code vs. OT?
So far this looks pretty sane! I think it's a great idea to see if we can start working our way back to the raw evidence files. As mentioned previously, if we can alter the existing data pipeline code to serialize evidence strings after …
As for the validation discrepancies, it turns out that I had been writing the Scala pipeline against an export of evidence objects from a couple of weeks ago and comparing it to a data_pipeline run from yesterday. Despite the fact that both runs were pointing at the same gcloud files (via the mrtarget data configuration), raw resource_scores for a small number of records from a few of the sources changed. For phewas_catalog, sysbio, and expression_atlas specifically, I think they updated the files backing the pipeline; I had assumed those would be static between runs but apparently not. After re-exporting the evidence objects from the latest data_pipeline run and running the same Spark code, everything is equivalent now: Validation Notebook (updated)
Everything related to scoring for this is complete now with the final changes in https://github.com/related-sciences/ot-scoring/tree/644509ad2ac77c78bf9c9b7f08122dac0354cc32/src/main/scala/com/relatedsciences/opentargets/pipeline. This gives me equivalence to OT (with epsilon = 1e-6) when recomputing the association scores from the base resource scores (of which there are multiple for each evidence string). The Validation Notebook has the most recent results. I'll call this issue complete and open other, more specific issues for tasks related to going back further in the pipeline: schema validation, field renaming, invalid record assessment, etc.
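For anyone reading along, the recomputation is essentially an aggregation of per-evidence resource scores within each (target, disease) pair. A minimal sketch of a harmonic-sum style aggregation in Spark is below; the input path and column names are illustrative placeholders, not the actual repo schema, and the real OT scoring also applies per-source weighting and normalization on top of this.

```scala
// Sketch only (spark-shell style, assumes an active `spark` session);
// path and column names are hypothetical.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// One row per (evidence string, disease) with its resource score
val evidence = spark.read.parquet("gs://<bucket>/evidence_scores")
  .select($"target_id", $"disease_id", $"score")

// Rank scores within each (target, disease) pair and down-weight by 1/rank^2,
// i.e. a harmonic sum over the descending scores
val byAssociation = Window.partitionBy($"target_id", $"disease_id").orderBy($"score".desc)

val associationScores = evidence
  .withColumn("rank", row_number().over(byAssociation))
  .withColumn("weighted", $"score" / ($"rank" * $"rank"))
  .groupBy($"target_id", $"disease_id")
  .agg(sum($"weighted").as("harmonic_sum"))
```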
Hey @hammer, I think this is a good point to give you an update and start a conversation about where to go next with this.
I've now got a working pipeline script that replicates the OT scores at the association (i.e. target + disease) level, but I haven't checked per-source scores yet. This validation notebook, though, shows a quick analysis: 979,752 of 979,765 (99.99%) scores are identical to those that end up in ES.
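The comparison itself is just a join-and-count; a rough sketch of what that looks like is below. The paths and column names are placeholders, not the actual notebook code.

```scala
// spark-shell style sketch: join Spark-computed association scores against the
// ES export and count how many match (within a small epsilon).
import org.apache.spark.sql.functions._

val sparkScores = spark.read.parquet("gs://<bucket>/spark_association_scores")
  .select(col("target_id"), col("disease_id"), col("score").as("score_spark"))
val esScores = spark.read.parquet("gs://<bucket>/es_association_scores")
  .select(col("target_id"), col("disease_id"), col("score").as("score_es"))

val joined = sparkScores.join(esScores, Seq("target_id", "disease_id"))
val eps = 1e-6
val matching = joined.filter(abs(col("score_spark") - col("score_es")) <= eps).count()
println(s"$matching of ${joined.count()} association scores match within $eps")
```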
This is only recreating the scores downstream of the initial data_pipeline processing steps (i.e. schema validation, gene id lookups, field renamings, other fiddly bits, etc.), but I was pleased to see that starting from about 14M evidence strings stored as JSON, it only takes ~60 seconds to calculate the scores. The same step in OT data_pipeline took ~50 minutes (yikes) when I ran it this afternoon in order to compare the results. Starting from Parquet instead of JSON is an obvious potential improvement too, so I'll certainly check whether that buys me anything, particularly given that so many of the fields in the evidence strings are not being used.
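As a rough illustration of what that Parquet experiment might look like (paths and field names below are placeholders, not the actual evidence schema):

```scala
// One-time conversion of the evidence JSON export to Parquet (spark-shell style)
import org.apache.spark.sql.functions.col

val evidenceJson = spark.read.json("gs://<bucket>/evidence_strings/*.json.gz")
evidenceJson.write.mode("overwrite").parquet("gs://<bucket>/evidence_strings_parquet")

// Later runs read only the handful of columns scoring actually needs;
// Parquet's columnar layout means the unused fields are never parsed.
val evidence = spark.read.parquet("gs://<bucket>/evidence_strings_parquet")
  .select(
    col("target.id").as("target_id"),
    col("disease.id").as("disease_id"),
    col("sourceID").as("source_id"),
    col("scores.association_score").as("resource_score"))
```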
This is w/o EuropePMC, so I don't know what impact that will have yet. I would have thought it would be some function of the number of evidence strings, but that's not quite right, since a step of this pipeline involves exploding those records out based on a list of disease ids in the evidence records (though the method for this is source-dependent). That's why I say 14M records even though there are barely even 1M raw evidence strings w/o EuropePMC.
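For illustration, that explode step amounts to something like the following (column names are hypothetical, and as noted the real logic varies by source):

```scala
// spark-shell style sketch; assumes `evidence` carries an array column of disease ids
import org.apache.spark.sql.functions.{col, explode}

val exploded = evidence
  .withColumn("disease_id", explode(col("disease_ids")))
  .drop("disease_ids")
// ~1M raw evidence strings (w/o EuropePMC) fan out to ~14M rows here
```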
Let me know if you see any other obvious avenues for improvement, otherwise I'm feeling good about working backwards from here and slowly subsuming more and more of what's in data_pipeline.