-
Notifications
You must be signed in to change notification settings - Fork 2
maf_filter.py
input_file = sys.argv[1] Input maf file
roslin_version_string = sys.argv[2]
is_impact = True if sys.argv[3]=='True' else False
analyst_file = sys.argv[4] Output file name
portal_file = sys.argv[5] Output file name
Generates two output files, analyst_file and portal_file, file names taken from command line arguments
Differences in output files:
Headers:
# The analyst MAF needs all the same columns as the input MAF (from vcf2maf, ngs-filters, etc.)
analyst_header = '\t'.join(header) + '\n'
# The portal MAF can be minimized since Genome Nexus re-annotates it when HGVSp_Short column is missing
header[header.index('HGVSp_Short')] = 'Amino_Acid_Change' #replace 'HGVSp_Short' -> 'Amino_Acid_Change'
portal_header = '\t'.join(header[0:45]) + '\n'
Variants:
The portal file includes fill-out lines and skips silent muts, genes without Entrez IDs, and intronic events
Keep a variant if
filter column is ‘PASS’ or 'common_variant' or (is_impact and only_ccs_filters) and not reported uniquely by Pindel
and not MuTect-Rescue event unless it is_impact (IMPACT/HemePACT projects)
and not splice region variants in non-coding genes
and not splice region variants that are >3bp into introns
and
(consequence column matches one of these
['missense_','stop_','frameshift_','splice_','inframe_','protein_altering_','start_','synonymous_','coding_sequence_','transcript_','exon_','initiator_codon_','disruptive_inframe_','conservative_missense_','rare_amino_acid_','mature_miRNA_','TFBS_']
or
variant is on 'TERT' gene and position 1295141 between 1295340
)
and not (is_impact and Chromosome == 'MT')
tad ← 't_alt_count'
tdp ← 't_depth'
if tdp < 20 or
tad < 8 or
tumor_vaf < 0.02 or
(line[hotspot_col] == 'FALSE' and (tad < 10 or tumor_vaf < 0.05)):
and if is_impact:
filter out
else:
if line[filter_col] == 'PASS':line[filter_col] = "dmp_filter"
else :line[filter_col]+= ";dmp_filter"
#1 Identify if the variant is removed by one or more ccs filters and nothing else
only_ccs_filters = True
filters = re.split(';|,', line[filter_col])
for filter in filters:
if filter != "mq55" and filter != "nm2" and filter != "asb" and filter != "nad3":
only_ccs_filters = False
break
#2
Skip any that failed false-positive filters, except common_variant
Skip all events reported uniquely by Pindel
A variant is removed if it doesn’t satisfy this condition and kept if it doesn’t fail in the following conditions
(
line[filter_col] == 'PASS' or
line[filter_col] == 'common_variant' or
(is_impact and only_ccs_filters)
)
and line[set_col] != 'Pindel': <- Skip all events reported uniquely by Pindel
Skip MuTect-Rescue events for all but IMPACT/HemePACT projects
if line[set_col] == 'MuTect-Rescue' and not is_impact
continue
Skip splice region variants in non-coding genes,
Skip splice region variants that are >3bp into introns
splice_dist = 0
if re.match(r'splice_region_variant', line[csq_col]) is not None:
if re.search(r'non_coding_', line[csq_col]) is not None:
continue
# Parse the complex HGVSc format to determine the distance from the splice junction
m = re.match(r'[nc]\.\d+[-+](\d+)_\d+[-+](\d+)|[nc]\.\d+[-+](\d+)', line[hgvsc_col])
if m is not None:
# For indels, use the closest distance to the nearby splice junction
splice_dist = min(int(d) for d in [x for x in m.group(1,2,3) if x is not None])
if splice_dist > 3:
Continue
Skip all non-coding events except interesting ones like TERT promoter mutations
csq_keep = ['missense_', 'stop_', 'frameshift_', 'splice_', 'inframe_', 'protein_altering_', 'start_', 'synonymous_', 'coding_sequence_', 'transcript_', 'exon_', 'initiator_codon_', 'disruptive_inframe_', 'conservative_missense_', 'rare_amino_acid_', 'mature_miRNA_', 'TFBS_']
if
re.match(r'|'.join(csq_keep), line[csq_col]) is not None or # if variant matches one of those in the csq_keep list
(line[gene_col] == <strong>'TERT'</strong> and int(line[pos_col]) >= <strong>1295141</strong> and int(line[pos_col]) <= <strong>1295340</strong>):
# Skip reporting MT muts in IMPACT,
# and apply the DMP's depth/allele-count/VAF cutoffs as hard filters in IMPACT,
# and soft filters in non-IMPACT
if is_impact and line[chrom_col] == 'MT':
continue
tad ← 't_alt_count'
tdp ← 't_depth'
tumor_vaf = float(tad) / float(tdp) if tdp != 0 else 0
if tdp < <strong>20</strong> or tad <<strong> 8</strong> or
tumor_vaf <<strong> 0.02</strong> or
(line[hotspot_col] == <strong>'FALSE'</strong> and (tad < 10 or tumor_vaf < 0.05)):
if is_impact:
continue
else:
line[filter_col] = "dmp_filter" if line[filter_col] == 'PASS' else line[filter_col] + ";dmp_filter"
# The portal also skips silent muts, genes without Entrez IDs, and intronic events
if re.match(r'synonymous_|stop_retained_', line[csq_col]) is None and
line[entrez_id_col] != 0 and
splice_dist <= 2:
dict_portal_kept.setdefault(key, []).append('\t'.join(line[0:45]))
dict_analyst_kept.setdefault(key, []).append('\t'.join(line))
# tag this events in analysis maf as "skipped_by_portal" in column "Mutation_Status"
else:
line[mut_status_col] = "skipped_by_portal"
dict_analyst_kept.setdefault(key, []).append('\t'.join(line))