We recommend running TOGA which incorporates an improved version of the gene loss detection pipeline and offers many additional features, including gene annotation and orthology inference.
In order to achieve high specificity, this pipeline implements a number of steps that overcome assembly and alignment issues, and address evolutionary exon-intron structure changes in genes that are conserved.
The starting point of the pipeline is a multiple genome alignment. The output is a list of lost genes.
Genome assemblies can contain regions of poor sequence quality, where the probability of sequencing errors is substantially elevated. These sequencing errors can mimic gene-inactivating mutations. To avoid falsely calling these errors as inactivating mutations, we employ a quality masking procedure. The script mafQualityMasking.pl masks every base of low sequencing quality by replacing it with an “N” character.
mafQualityMasking.pl -input Input_maf_file -output Output_maf_file
Genome assembly gaps indicate regions where parts of the real genome are missing in the given assembly. Missing sequence can comprise exons or even entire genes, which mimics exon or gene deletions. To avoid falsely calling such regions as deletions, mafFilterElines.pl uses the assembly gap annotation to remove all deletions or regions of unaligning sequence from the alignment that overlap assembly gaps.
mafFilterElines.pl -input Input_maf_file -output Output_maf_file -reference ReferenceSpecies
-alignment alignment
Genome alignments can incorrectly align a gene to a processed pseudogene or paralog in the query species in case the ortholog is missing in the query assembly, which is especially problematic for lower-quality assemblies. Therefore, we filter the genome alignment by removing aligning blocks that likely come from a paralog or processed pseudogene alignment. This is achieved by first identifying chain blocks that are likely orthologous (we call them whiteListed chains) versus chain blocks that likely represent paralogs or processed pseudogene chains (we call them blackListed chains). This cleaning is achieved by a two step process: First, we convert the coordinates of these chains to bed format:
filterMafsForWhiteListedChainsPPS.pl ReferenceSpecies QuerySpecies geneChainFilteringDir
outputDirectoryWhite outputDirectoryBlack
where geneChainFilteringDir is a directory containing chain blocks that have been classified as whiteListed or blackListed, outputDirectoryWhite is a directory containing bed files of whiteListed chains and outputDirectoryBlack is a directory containing bed files of blackListed chains.
Subsequently, we overlap the coordinates of these chains with the coordinates of aligning sequence in the maf blocks and filter out sequences that come from paralogs/processed pseudogenes
filterMafsForWhiteListedChains.pl -in Input_maf_file -out Output_maf_file -ref ReferenceSpecies
-whiteListedChainsDir WhiteListedChainsDirectory -blackListedChainsDir BlackListedChainsDirectory
where WhiteListedChainsDirectory BlackListedChainsDirectory is produced by filterMafsForWhiteListedChainsPPS.pl above.
Ambiguous alignments, where several suboptimal solutions exist, can result in spurious frameshifting indels and splice site mutations. Additionally, splice site positions can shift in evolution. To avoid these sequence alignment issues and to detect shifted splice sites, we use CESAR (Coding Exon-Structure Aware Realigner) to realign the query sequence that is aligned to an exon of the reference, taking reading frame and splice site information into account. This step involves determining intron length for every exon in the genome alignment to identify cases of intron deletion. Subsequently, exons where the intervening introns are deleted are aligned together as a single unit. All other exons are aligned individually.
cesarRealign.pl speciesList allTranscriptsList ReferenceSpecies bigBedIndex clade [human|mouse] jobsPrefix
where speciesList is a file listing query species (one species per line), allTranscriptsList is a file listing all the genes/transcripts in modified genePred format (full path needs to be specified for this file), bigBedIndex is the indexed alignment (.bb) file, and jobsPrefix specifies a prefix of the generated job files.
cesarRealign.pl will produce three jobs file: $jobsPrefix_short, $jobsPrefix_medium, $jobsPrefix_long. Each job in the $jobsPrefix_short typically finish within an hour, $jobsPrefix_medium need something between 4 to 8 hours, and $jobsPrefix_long run longer.
These jobs need to be executed on a compute cluster. Afterwards, merging the alignment files produced by CESAR to produce one file that contains alignments for all exons:
MergeBDB.perl realignedExons/realigned. allExons_CESAR.BDB
The tabular format is a data structure that is used by geneLossPipeline to output mutations in the pairwise alignments (for reference-query species pair for every gene). createTabFiles.pl does this
createTabFiles.pl -input geneList -ref RefSpecies -out OutputDirectory -log logFile
-index allExons_CESAR.BDB
where geneList is a file listing the genes in modified genePred format.
Parameters:
-v|verbose
-species [Optional: instead of all species in the genome alignment, restrict to one
or few species: a comma separated string]
-pMammals [run only for placental mammals in the input alignment. This information
is contained in GeneLoss.config]
-alignment [Optional but mandatory if you specify pMammals]
-sameSS [Only process genes where the entire gene in the query is on one
scaffold/chromosome and a single strand]
-excludeGenes [Optional: a list of genes that should be excluded, one gene per line]
This step involves detection of mutations in pairwise alignments (for reference-query species pair for every gene)
geneLossPipeline.pl -input gene/geneListFile -out Output_Directory -alignment Alignment
-ref ReferenceSpecies -tabDir DirectoryWithTabularFiles
Parameters:
-fmt [Optional, if set to 'lst', an input file needs to be specified where each line
is one gene/transcript]
-species [Optional: a comma separated file listing the species (the default behaviour is
to look for every species for which there is a tabular file)]
-pMammals [run only for placental mammals in the input alignment. This information
is contained in GeneLoss.config]
-EDE [Exclude Dubious Events - basically frameshifts that are flanked by a base of
low quality or frameshifting insertions where inserts have a base of low quality,
default is report all events]
-excludeFPI [exclude FramePreserving events]
-all [Run 3 modules - Indel, InframeStopCodon and SpliceSite, otherwise you could specify
-indel for frameshift, -ifsc for inflame stop codons and -ssm for splice site mutations]
Afterwards merge the mutation profile for every gene to produce one file:
MergeBDB.perl Output_Directory/geneLossPipe. geneLossPipeAllGenes.BDB
geneLossCaller.pl genesList GeneLossPipeBDB annotationFile species
ReadingFrameThreshold exonGroupsDir u12IntronList outFile
where genesList is a file listing one gene per line (the first field is the gene name, the second is a string of principal isoforms which are separated by comma), GeneLossPipeBDB is produced in Step 6, annotationFile is a the file with the modified genePred format, ReadingFrameThreshold is the fraction of the remaining intact reading frame (0.6 was used for our study), exonGroupsDir is produced by cesarRealign.pl, and u12IntronList is a file listing coordinates of U12 introns in bed format.
Optional parameters:
-compFS [Count frameshifts as inactivating mutations even if they are compensated]
-verbose
-longIndel [Treat every indel longer than 50 bp as an inactivating mutation]