Duplicate SNPs can be reported for a particular locus in a file (e.g., two SNPs with the same position on the same chromosome, potentially with different genotypes). In one of the example files, there are 234 SNPs where another SNP has the same locus:
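import pandas as pd
from snps import SNPs

s = SNPs("resources/662.23andme.340.txt.gz")

# Chromosome by chromosome, collect every SNP that shares a position with another SNP
duplicates = []
for chrom in s.snps.chrom.unique():
    temp = s.snps.loc[s.snps.chrom == chrom]
    duplicates.append(temp.loc[temp.pos.duplicated(keep=False)])

# DataFrame.append was removed in pandas 2.0, so concatenate instead
df = pd.concat(duplicates)
print(len(df))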
Several of these SNPs have "internal IDs", but there are cases where two different rsids are reported for the same locus. Additionally, the genotype reported is not always the same for a particular locus.
To remedy this, I suggest adding a deduplicate_loci parameter to the SNPs object and renaming the deduplicate parameter to deduplicate_rsids.
Then, if deduplicate_loci=True (default), deduplicate the loci in this manner:
Chromosome by chromosome, find SNPs with duplicate positions. Then, for each duplicate position, sort the SNP rsids / internal IDs and keep the last in the normalized SNPs dataframe; this will ensure internal IDs are replaced with rsids (if available) and also keep IDs with higher numbers (potentially newer). In this process, add the kept rsid as a key to a _duplicate_loci dict, where the value is a list of any deduplicated loci (str rsids). Additionally, merge the genotypes (filling in any nans), record loci with discrepant genotypes as discrepant_loci (similar to discrepant_XY), and mark the genotype of each discrepant locus as nan.
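To make the proposal concrete, here is a rough sketch of the steps above using plain pandas. It is not code from the snps library: the function name deduplicate_loci, its return values, and the toy dataframe are hypothetical, and it assumes a dataframe indexed by unique rsid with chrom, pos, and genotype columns. It also uses a plain string sort, which puts "rs" IDs after "i" internal IDs but only approximates "higher numbers last".

```python
import numpy as np
import pandas as pd


def deduplicate_loci(snps):
    """Hypothetical helper: collapse SNPs that share a (chrom, pos) locus.

    Assumes `snps` is indexed by unique rsid and has `chrom`, `pos`,
    and `genotype` columns.
    """
    duplicate_loci = {}   # kept rsid -> list of rsids dropped at that locus
    discrepant_loci = []  # kept rsids whose duplicates disagreed on genotype
    to_drop = []
    result = snps.copy()

    # Every row that shares its chromosome and position with another row
    dups = snps.loc[snps.duplicated(subset=["chrom", "pos"], keep=False)]

    for _, group in dups.groupby(["chrom", "pos"], sort=False):
        # Plain string sort: "rs..." sorts after "i...", so rsids win
        ids = sorted(group.index)
        kept, dropped = ids[-1], ids[:-1]
        duplicate_loci[kept] = dropped
        to_drop.extend(dropped)

        # Merge genotypes: ignore nans, accept a single agreed value
        genotypes = group["genotype"].dropna().unique()
        if len(genotypes) == 1:
            result.loc[kept, "genotype"] = genotypes[0]
        else:
            result.loc[kept, "genotype"] = np.nan
            if len(genotypes) > 1:
                discrepant_loci.append(kept)

    return result.drop(index=to_drop), duplicate_loci, discrepant_loci


# Toy data (made up for illustration, not taken from the example file)
snps = pd.DataFrame(
    {
        "chrom": ["1", "1", "1", "1", "2", "2"],
        "pos": [100, 100, 200, 200, 100, 300],
        "genotype": ["AA", np.nan, "AG", "GG", "CC", "TT"],
    },
    index=pd.Index(["i1", "rs5", "rs7", "rs8", "rs9", "rs10"], name="rsid"),
)

deduped, duplicate_loci, discrepant_loci = deduplicate_loci(snps)
print(deduped)
print(duplicate_loci)
print(discrepant_loci)
```

In the toy data, i1 and rs5 share a locus and only one of them has a genotype, so rs5 is kept with that genotype; rs7 and rs8 disagree, so rs8 is kept with a nan genotype and recorded as discrepant. Note that this sketch compares genotype strings literally, so "AG" and "GA" would count as discrepant unless the genotypes are normalized first.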
Finally, loci should optionally be deduplicated after each merge (add deduplicate_loci parameter to merge as well).