Investigate differences in genetic ancestry imputation in seqr data when upgrading the imputed genetic ancestry model from gnomAD v2+CMG to v4. Among the GATK WES v23 callset, we've seen modest differences in classification, back and forth on 'Middle Eastern' <-> 'Other', between the two models. I posted a fairly large Slack message about this in #cmg-analysis this week, which I'll post below:
Hi analysts and analysis team! Apologies for the spam but some modestly analysis-relevant seqr news below:
The seqr team have been making some improvements to our sample QC code (which produces per-sample flags as well as imputed genetic ancestry), both to automate it with future loading and to get it to run on 🐉 DRAGEN samples. Concerning the imputed genetic ancestry, we are updating the model we use for this from the previous Random Forest model & loadings from gnomAD v2 (+ CMG samples) to gnomAD v4. The prior model is very old and was written with some extremely outdated (and now deprecated) versions of software packages, which makes it unreadable in recent versions of Python and Hail. One-off runs are still possible, but this poses a “nigh-insurmountable technical barrier” to automation. Plus, v4 shoouuld just be better.
Working from the most recent Whole Exome Sequencing (WES) callset with ~19,000 samples, we’ve compared the imputed genetic ancestry from the prior model (v2+CMG) to the recent one (v4). Of note is the fact that there are CMG samples in this Exome callset, which may bias the results. I’d be happy to exclude them and look again, if people would like.
Then there is a matrix comparing how individuals change from the prior model (as rows) to the recent model (as cols), down below. Of note is the shuffling of mid<->oth imputation, where 899 previously mid samples are now oth, and conversely 513 previously oth samples are now mid. More generally, the v4 model is more willing to label samples as oth that were not labelled oth previously.
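For anyone wanting to reproduce this kind of rows-vs-cols comparison on their own sample list, here is a minimal sketch using pandas. It is not the seqr QC code: the column names `pop_v2_cmg` and `pop_v4`, the helper names, and the toy data are all made up for illustration.

```python
import pandas as pd


def ancestry_transition_matrix(df, old_col="pop_v2_cmg", new_col="pop_v4"):
    """Cross-tabulate prior-model labels (rows) against new-model labels (cols).

    The column names are hypothetical placeholders, not seqr field names.
    """
    return pd.crosstab(df[old_col], df[new_col])


def reclassified_counts(matrix):
    """Return {(old_label, new_label): count} for non-zero off-diagonal cells."""
    return {
        (old, new): int(matrix.loc[old, new])
        for old in matrix.index
        for new in matrix.columns
        if old != new and matrix.loc[old, new] > 0
    }


# Toy data only; these are not real samples or real counts.
samples = pd.DataFrame(
    {
        "sample_id": ["s1", "s2", "s3", "s4", "s5", "s6"],
        "pop_v2_cmg": ["mid", "mid", "oth", "nfe", "oth", "mid"],
        "pop_v4": ["oth", "mid", "mid", "nfe", "oth", "oth"],
    }
)

matrix = ancestry_transition_matrix(samples)
changes = reclassified_counts(matrix)
print(matrix)
print(changes)
```

The diagonal of the matrix holds samples whose label did not change; the off-diagonal cells (for example mid -> oth) are the reclassifications. To look again without the CMG samples, one could filter the DataFrame before calling `ancestry_transition_matrix`, assuming a column identifying the cohort is available.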
In general, gnomAD v4 is a much more diverse dataset than gnomAD v2+CMG. v4 has ~3000 people of mid genetic ancestry, while gnomAD v2 has none (labelled) and only ~350 reported from the CMG Gleeson cohort. Further, CMG Gleeson was pretty opaque (!?) about how these IDs were recorded, and the samples might’ve been collected in close proximity to each other or potentially in the same place. v4 may just be classifying these samples differently because it doesn’t have any training samples from that exact setting. Overall, (Mike Wilson quote) v4 should be better moving forward at classifying data.