Skip to content

Calculate multi-allelism in genomic variants, enacted using a snakemake pipeline

Notifications You must be signed in to change notification settings

CraigJAnderson/lce_ma

Repository files navigation

The following was used to calculate multi-allelism in WGS from liver nodules derived from C3H mice exposed to a single dose of diethylnitrosamine: https://www.nature.com/articles/s41586-020-2435-1

How to run on LSF Scheduler

bsub -J over_lord -W 72:00 -o log.out -e log.error 'snakemake -pr -j 800 --cluster-config cluster.json --cluster "bsub -W {cluster.run_time} -M {cluster.memory} -R {cluster.resources} -e {cluster.error} -o {cluster.output} "' 

The idea is that one can provide a different config file for independent cohorts using the --configfile command, which defaults to config_c3h.yaml.

Output files and fields:

.map
The counts of each of the types of variants in a sample, reporting the total proportion of multi-allelic variants, as well as the null estimate. This is based upon the mean of 100 random draws of variant positions seen in the population, exclusing sample specific variant sites
nodule,MA.count,BA.count,proportion.multiallelic.sites,mean.proportion.control.sites.with.unexpected.additional.allele
 4584 26989 0.145187 0.00127799

.allele_count_bed
A file of genomic coordinates that also provide information as to whether a variant is multi-allelic (MA) or biallelic (BA), as well as the counts of each of the alleles.
chromosome,start.bp,end.bp,Annotation[Varianttype.Acount.Ccount.Gcount.Tcount]
1	140510	140511	BA,26,8,0,0

.MA_combos
Identifies proximal variants to define their haplotypes
Location.and.mutation.class.allele1,Reference.allele1,location.and.mutation.class.allele2,Reference.allele2,bp.distance,combination.code,number.of.combinations,specific.haplotype.counts
1:18994504_A/C  A       1:18994839.0_A/T        A       335.0   011     3       (('A,A', 8), ('C,A', 2), ('A,T', 1))
combination codes are as follows:
100: only evidence of ALT alleles in phase
101: evidence of ALT alleles in phase, as well as the REF1 and ALT2
110: evidence of ALT alleles in phase, as well as the ALT1 and REF2
011: ALT alleles occur out of phase
111: evidence of ALT alleles occurring in phase, as well as out of phase
NOTE- there are other varieties, such as 010, 001 and 000, these reflect instance where we can't equivically place one, the other or both ALT alleles with respect to each other and are therefore not included in subsequent analyses.

.combos_txt
Grades proximal combinations of markers as those the proportion of those impossible under a classical model of clonal expansion.
Nodule,grade1,grade2,grade3,grade4,grade5,(grade3+4+5/1+2+3+4+5) ##the script should have output the proportion of multiallelic sites on the end, but this isn't happening for some reason.
 71 14 0 2 5 0.076087
NOTES ON GRADES:
each pair of markers is given a grade defined by a 4 digit code referring to: ALT1+ALT2,REF1+REF2,REF1+ALT2,ALT1+REF2.
Grade 1 = 1,1,0,0
Grade 2 = 1,1,1,0 & 1,1,0,1
Grade 3 = >=1,1,0,>=1 & >=1,1,>=1,0
Grade 4 = 1,1,1,1
Grade 5 = >=1,1,>=1,>=1 or >=1,1,0,0
NOTES: only considers the following mutation haplotypes occuring as pairs of variants between 3 and 150 bp of each other: 100,111,101,110

.fnod_bed
A genomic coordinate file of sister chromatid exchange locations and their strand asymmetry 
chromosome,start.bp,end.bp,segment.specific.identifier.with.strand.asymmetry.score
1       22780   88690149        1_1:-0.02432676881172

.sce_bed
as the fnod_bed, but the proportion of multi-allelic variants is calculated for each segment
chromosome,start.bp,end.bp,chromosome,segment.specific.identifier,strand.asymmetry.score,proportion.multi-allelic.sites
1 22780 88690149 1 1 -0.02432676881172 0.0483019

.sce_bed2
as the sce_bed, but the relative coordinates for plotting are given.
chromosome,start.bp,end.bp,chromosome,segment.specific.identifier,strand.asymmetry.score,proportion.multi-allelic.sites,relative.start.site,relative.end.site
1 22780 88690149 1 1 -0.02432676881172 0.0483019 9.11447e-06 0.0354857

.timing
the proportion of segments that are multiallelic is used to infer the generation after exposure that a clonal population initiated tumouregenesis. I provide annotations to easily perform a Fisher's exact test to see if it's earlier or later than expected. We set the threshold proportion for calling a segment multi-allelic at 0.04
nodule,proportion.multialleic.segments,timing.subgroup[1=early,0=late],drivers,braf[1=true,0=false],egfr[1=true,0=false],hras[1=true,0=false],kras[1=true,0=false]
 0.638889 0     Kras;Ctnnb1     0       0       0       1

About

Calculate multi-allelism in genomic variants, enacted using a snakemake pipeline

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published