Skip to content

maf_filter.py

Suleyman Vural edited this page Jun 23, 2020 · 2 revisions

maf_filter.py

Inputs:

input_file = sys.argv[1] Input maf file

roslin_version_string = sys.argv[2]

is_impact = True if sys.argv[3]=='True' else False

analyst_file = sys.argv[4] Output file name

portal_file = sys.argv[5] Output file name

Outputs:

Generates two output files, analyst_file and portal_file, file names taken from command line arguments

Differences in output files:

Headers:

# The analyst MAF needs all the same columns as the input MAF (from vcf2maf, ngs-filters, etc.)

analyst_header = '\t'.join(header) + '\n'

# The portal MAF can be minimized since Genome Nexus re-annotates it when HGVSp_Short column is missing

header[header.index('HGVSp_Short')] = 'Amino_Acid_Change' #replace 'HGVSp_Short' -> 'Amino_Acid_Change'

portal_header = '\t'.join(header[0:45]) + '\n'

Variants:

The portal file includes fill-out lines and skips silent muts, genes without Entrez IDs, and intronic events

Filters summary:

Keep a variant if

filter column is ‘PASS’ or 'common_variant' or (is_impact and only_ccs_filters) and not reported uniquely by Pindel

and not MuTect-Rescue event unless it is_impact (IMPACT/HemePACT projects)

and not splice region variants in non-coding genes

and not splice region variants that are >3bp into introns

and

(consequence column matches one of these

['missense_','stop_','frameshift_','splice_','inframe_','protein_altering_','start_','synonymous_','coding_sequence_','transcript_','exon_','initiator_codon_','disruptive_inframe_','conservative_missense_','rare_amino_acid_','mature_miRNA_','TFBS_']

or

variant is on 'TERT' gene and position 1295141 between 1295340

)

and not (is_impact and Chromosome == 'MT')

tad ← 't_alt_count'

tdp ← 't_depth'

if tdp < 20 or

tad < 8 or

tumor_vaf < 0.02 or

(line[hotspot_col] == 'FALSE' and (tad < 10 or tumor_vaf < 0.05)):

and if is_impact:

filter out

else:

if line[filter_col] == 'PASS':line[filter_col] = "dmp_filter"

else :line[filter_col]+= ";dmp_filter"

Filters detailed:

#1 Identify if the variant is removed by one or more ccs filters and nothing else

only_ccs_filters = True

filters = re.split(';|,', line[filter_col])

for filter in filters:

if filter != "mq55" and filter != "nm2" and filter != "asb" and filter != "nad3":

only_ccs_filters = False

break

#2

Skip any that failed false-positive filters, except common_variant

Skip all events reported uniquely by Pindel

A variant is removed if it doesn’t satisfy this condition and kept if it doesn’t fail in the following conditions

(

line[filter_col] == 'PASS' or

line[filter_col] == 'common_variant' or

(is_impact and only_ccs_filters)

)

and line[set_col] != 'Pindel': <- Skip all events reported uniquely by Pindel

 Skip MuTect-Rescue events for all but IMPACT/HemePACT projects

            if line[set_col] == 'MuTect-Rescue' and not is_impact

      continue

Skip splice region variants in non-coding genes, 

Skip splice region variants that are >3bp into introns

            splice_dist = 0

            if re.match(r'splice_region_variant', line[csq_col]) is not None:

                if re.search(r'non_coding_', line[csq_col]) is not None:

                    continue

                # Parse the complex HGVSc format to determine the distance from the splice junction

                m = re.match(r'[nc]\.\d+[-+](\d+)_\d+[-+](\d+)|[nc]\.\d+[-+](\d+)', line[hgvsc_col])

                if m is not None:

                    # For indels, use the closest distance to the nearby splice junction

                    splice_dist = min(int(d) for d in [x for x in m.group(1,2,3) if x is not None])

                    if splice_dist > 3:

                        Continue

Skip all non-coding events except interesting ones like TERT promoter mutations

            csq_keep = ['missense_', 'stop_', 'frameshift_', 'splice_', 'inframe_', 'protein_altering_', 'start_', 'synonymous_', 'coding_sequence_', 'transcript_', 'exon_', 'initiator_codon_', 'disruptive_inframe_', 'conservative_missense_', 'rare_amino_acid_', 'mature_miRNA_', 'TFBS_']

            if 

              re.match(r'|'.join(csq_keep), line[csq_col]) is not None or # if variant matches one of those in the csq_keep list

              (line[gene_col] == <strong>'TERT'</strong> and int(line[pos_col]) >= <strong>1295141</strong> and int(line[pos_col]) &lt;= <strong>1295340</strong>):

                # Skip reporting MT muts in IMPACT, 

                # and apply the DMP's depth/allele-count/VAF cutoffs as hard filters in IMPACT, 

                # and soft filters in non-IMPACT

                if is_impact and line[chrom_col] == 'MT':

                    continue

        tad ←  't_alt_count'

    tdp ←  't_depth'

                tumor_vaf = float(tad) / float(tdp) if tdp != 0 else 0

                if tdp &lt; <strong>20</strong> or tad &lt;<strong> 8</strong> or 

                   tumor_vaf &lt;<strong> 0.02</strong> or 

                   (line[hotspot_col] == <strong>'FALSE'</strong> and (tad &lt; 10 or tumor_vaf &lt; 0.05)):

                    if is_impact:

                        continue

                    else:

                        line[filter_col] = "dmp_filter" if line[filter_col] == 'PASS' else line[filter_col] + ";dmp_filter"

                # The portal also skips silent muts, genes without Entrez IDs, and intronic events

                if re.match(r'synonymous_|stop_retained_', line[csq_col]) is None and 

                   line[entrez_id_col] != 0 and 

                   splice_dist &lt;= 2:

                    dict_portal_kept.setdefault(key, []).append('\t'.join(line[0:45]))

                    dict_analyst_kept.setdefault(key, []).append('\t'.join(line))

                # tag this events in analysis maf as "skipped_by_portal" in column "Mutation_Status"

                else:

                    line[mut_status_col] = "skipped_by_portal"

                    dict_analyst_kept.setdefault(key, []).append('\t'.join(line))
Clone this wiki locally