Mutalyzer FAQ
Mutalyzer is described in:
Lefter M et al. Mutalyzer 2: Next Generation HGVS Nomenclature Checker. Bioinformatics (2021).
Mutalyzer checks sequence variant descriptions against a given reference sequence and, if necessary, tries to correct them according to the standard human sequence variant nomenclature of the Human Genome Variation Society (HGVS). Descriptions of its functionality can be found in the Mutalyzer Documentation.
Most Mutalyzer web pages show an example of the input data accepted. Additional descriptions of input data formats can be found in the Mutalyzer Documentation.
No, Mutalyzer only checks sequence variant descriptions. Sequence variant descriptions can be generated from sequence traces using third-party software, e.g., MutationSurveyor.
Why does Mutalyzer only accept GenBank Accession Numbers, other files in GenBank format or LRG Accession Numbers?
Mutalyzer's Reference Sequence Parser has been developed to extract sequence and annotation information from reference sequence records in GenBank or Locus Reference Genomic (LRG) format. The Reference Sequence Parser will not work properly with other formats.
Why does the Mutalyzer Name Checker not work directly with GenBank Accession Numbers starting with NC_ or NT_?
GenBank Accession Numbers starting with NC_ or NT_ refer to contigs of smaller sequences, potentially interspersed with gaps. Mutalyzer will try to retrieve the underlying sequences to check the sequence variant, but it may lose track of the corresponding positions due to the different levels of assembly and return errors. Users are advised to circumvent this problem by using the Reference File Loader when (part of) these NC_ or NT_ references are used. The Mutalyzer Exercise provides more detailed information.
Why does the Mutalyzer Name Checker not accept positions outside exons when using a transcript reference sequence?
The Mutalyzer Name Checker uses the reference sequence to check for the presence of the nucleotides at the positions specified. Since promoter sequences, intron sequences and intergenic sequences are not included in a coding DNA reference sequence (e.g., RefSeq NM_ or XM_ records) or non-coding DNA reference sequence (e.g., RefSeq NR_ or XR_ records), Mutalyzer is unable to check these and will issue an "Out of bounds" or "Intronic position given for a non-genomic reference sequence" error. We strongly suggest using genomic reference sequences (e.g., RefSeq NG_ or Locus Reference Genomic (LRG) records) to describe changes in promoter sequences, intron sequences and intergenic sequences. Use the Reference File Loader with the appropriate gene symbol, organism and flanking sequences to retrieve a suitable genomic reference sequence.
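The distinction between transcript and genomic references can be sketched as a simple prefix check. This is only an illustration: the prefix lists are taken from the answer above, the function names are hypothetical, and nothing here reflects how Mutalyzer itself classifies references.

```python
# Illustrative only: prefixes come from the FAQ text, not from Mutalyzer's code.
GENOMIC_PREFIXES = ("NG_", "LRG_")
TRANSCRIPT_PREFIXES = ("NM_", "XM_", "NR_", "XR_")


def reference_kind(accession: str) -> str:
    """Return 'genomic', 'transcript' or 'unknown' based on the accession prefix."""
    acc = accession.strip().upper()
    if acc.startswith(GENOMIC_PREFIXES):
        return "genomic"
    if acc.startswith(TRANSCRIPT_PREFIXES):
        return "transcript"
    return "unknown"


def can_describe_intronic(accession: str) -> bool:
    """Intron, promoter and intergenic positions need a genomic reference."""
    return reference_kind(accession) == "genomic"
```

A caller could use `can_describe_intronic` to decide up front whether to switch to a genomic reference before submitting an intronic description.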
The Mutalyzer Name Checker checks the annotation of the genomic reference sequence for information about the genes, their exons and protein coding sequence. When these features are not annotated, Mutalyzer is unable to use the coding DNA numbering scheme and will return an error. Please note that Mutalyzer also checks for non-coding transcripts, for which the non-coding DNA numbering scheme will be applied in the absence of a start codon. The HGVS sequence variation nomenclature guidelines do not yet provide guidance on this issue.
Although any file in GenBank format can be used, curated RefSeq sequences are preferred (see HGVS Reference Sequence discussion). Most gene variant databases (LSDBs) specify reference sequences for the genes of interest. In many cases, these will be coding DNA reference sequences. If you want to check changes in promoter sequences, intron sequences and intergenic sequences, you should ask the curator of the gene variant database to provide a suitable genomic DNA reference sequence.
If you are the curator of a gene variant database in need of an appropriate genomic reference sequence, you can use the options of the Reference File Loader to select a genomic reference sequence. More information about the selection and modification of reference sequences can be found in the Mutalyzer Exercise.
I am using the correct RefSeq Accession number. Why is the position of most or all sequence variants on transcript or protein level different from what I expected?
Every GenBank file has an Accession number and a version number (e.g., AB026906.1). If the version number has not been specified, Mutalyzer will retrieve the latest version of this file. The sequence annotation of the latest version may differ from that of an earlier version, leading to new or changed transcripts and protein sequence information, which Mutalyzer automatically uses to describe the variants. Most gene variant databases (LSDBs) specify the accession numbers of reference sequences for the genes of interest, but they should also include the version number to prevent unexpected Mutalyzer results. You can check the influence of the version number on Mutalyzer's analysis by specifying a previous version number. If this solves the problem, please ask the curator of the gene variant database to specify the correct version of the reference sequence.
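To spot unversioned references before submitting them, an identifier can be split into its accession and version parts. The sketch below assumes the common "letters, digits, optional .version" layout; the regular expression and the function name are illustrative and are not part of Mutalyzer.

```python
import re

# Assumed layout: letters/underscores, then digits, then an optional ".version".
_ACCESSION_RE = re.compile(r"^(?P<accession>[A-Za-z_]+\d+)(?:\.(?P<version>\d+))?$")


def split_accession(identifier: str):
    """Return (accession, version); version is None when it was not specified."""
    match = _ACCESSION_RE.match(identifier.strip())
    if match is None:
        raise ValueError(f"Unrecognised accession format: {identifier!r}")
    version = match.group("version")
    return match.group("accession"), int(version) if version is not None else None
```

A result with `None` as the version signals that the latest version of the record would be retrieved, so the outcome may change when the record is updated.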
The Mutalyzer Position Converter uses a mapping database, which only contains RefSeq transcript identifiers. We are working on including RefSeq Gene and LRG mapping coordinates in this database.
Why does the Mutalyzer Position Converter not use the last version of a gene-specific RefSeq transcript?
The Mutalyzer Position Converter uses a mapping database, which only contains RefSeq transcript identifiers from a file provided by the NCBI. In general, the release of new RefSeq transcript versions is independent of their incorporation in this file or in new RefSeq Gene versions. A delay of several weeks can be expected.
The Mutalyzer Position Converter checks a mapping database containing the exon and CDS coordinates for all RefSeq transcript identifiers available from NCBI's MapViewer. If the chromosomal position of the variant is more than 5000 nucleotides from the nearest transcript, Mutalyzer will return: "No transcripts found in mutation region".
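The 5000-nucleotide rule can be sketched as a distance check against transcript intervals. The `Transcript` type, the coordinates and the function names below are hypothetical; only the threshold comes from the answer above, and the actual lookup in the mapping database may work differently.

```python
from typing import List, NamedTuple


class Transcript(NamedTuple):
    name: str   # hypothetical identifier
    start: int  # first chromosomal position covered (inclusive)
    end: int    # last chromosomal position covered (inclusive)


MAX_DISTANCE = 5000  # threshold stated in the FAQ


def distance_to(transcript: Transcript, position: int) -> int:
    """Distance in nucleotides from a position to a transcript (0 if inside)."""
    if position < transcript.start:
        return transcript.start - position
    if position > transcript.end:
        return position - transcript.end
    return 0


def transcripts_near(position: int, transcripts: List[Transcript],
                     max_distance: int = MAX_DISTANCE) -> List[Transcript]:
    """Transcripts no more than max_distance nucleotides away from position."""
    return [t for t in transcripts if distance_to(t, position) <= max_distance]
```

An empty result corresponds to the situation in which the "No transcripts found in mutation region" message is returned.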
The Mutalyzer 2 SNP Converter converts a dbSNP rsId to HGVS notation. The HGVS sequence variation description listed in dbSNP will be returned. The sequence variation description has not been checked by Mutalyzer's Name Checker. The SNP Converter does not map SNPs to any specified reference sequence.
Mutalyzer checks if it gets a GenBank flat file in both cases. Locus Reference Genomic (LRG) records are not supposed to be modified by users and are therefore not accepted from other sources. Other formats cannot be processed correctly and will generate an error.
Why do large deletions seem shorter using coding DNA position numbering than genomic position numbering?
The difference in deletion size is caused by our intention to reflect the effect of variations on transcript level, when coding DNA position numbering is used. Therefore, ranges of deleted nucleotides are limited to positions present in the coding DNA reference sequence. A warning listing the number of splice sites affected is shown to alert the user.
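The size difference can be illustrated by counting only the deleted positions that fall inside exons. This is a simplified sketch with made-up coordinates; the function name and the exon list are assumptions, and Mutalyzer's own calculation may differ in detail.

```python
def deletion_sizes(del_start, del_end, exons):
    """Return (genomic_length, exonic_length) for an inclusive deletion range.

    exons is a list of inclusive (start, end) pairs in genomic coordinates.
    """
    genomic_length = del_end - del_start + 1
    exonic_length = 0
    for exon_start, exon_end in exons:
        # Inclusive overlap between the deletion and this exon.
        overlap = min(del_end, exon_end) - max(del_start, exon_start) + 1
        if overlap > 0:
            exonic_length += overlap
    return genomic_length, exonic_length
```

For exons at 100-200 and 500-600, a deletion of positions 150-550 removes 401 genomic nucleotides but only 102 nucleotides present in the transcript.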
Why is a C>T substitution using coding DNA position numbering reported as a G>A substitution using genomic position numbering?
According to the standard human sequence variant nomenclature, numbering of a genomic reference sequence starts at the first nucleotide of the forward (+) strand and coding DNA numbering starts with the A of the start codon ATG. A genomic reference sequence may contain one or more genes transcribed in the opposite orientation. For those genes, coding DNA numbering will use the reverse (-) strand of the genomic sequence. As a consequence, coding DNA descriptions will use the reverse complement of those nucleotides specified in genomic descriptions.
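The strand effect is easy to reproduce with a reverse-complement helper. The function names below are illustrative; the sketch handles plain A, C, G and T only.

```python
# Complement table for unambiguous DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def substitution_on_other_strand(ref: str, alt: str):
    """Express a substitution (ref>alt) on the opposite strand."""
    return reverse_complement(ref), reverse_complement(alt)
```

For a gene on the reverse strand, `substitution_on_other_strand("C", "T")` gives `("G", "A")`, which matches the C>T versus G>A difference described above.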
Yes, Mutalyzer can check sequence variant descriptions from other organisms as long as a proper reference sequence (e.g., GenBank) is provided and the HGVS sequence variation nomenclature guidelines are applied. Mutalyzer will check the reference sequence annotation to determine which codon table should be used for proper translation of coding sequences.
Click Toggle File Format Help to get more information. The easiest way to create a batch checker file is to download one of the Example files. Please right-click the link and select "Save as" to download the example file for modification. Open the batchtestold.txt or batchtestnew.txt file with your favorite spreadsheet program. In Excel, the Text Import Wizard window will open to guide you through the import procedure. Click "Finish" to finalize the import. The Excel spreadsheet created from batchtestold.txt should have three columns and a header row containing the column names. You can add your data to the batchtest.txt file by typing or pasting the required information into the appropriate fields. When you are finished, select "Save as" from the File menu and save your file under a different name (without spaces!) as type: Text (Tab delimited) (*.txt). In case of problems with Excel, follow the separate import steps: verify the selections in step 1 before clicking the "Next" button to go to step 2, click "Next" again to go to step 3, and click "Finish" to finalize the import. If you experience problems with new Excel, OpenOffice or LibreOffice spreadsheet formats, please let us know.
The previous batch checker was relatively sensitive to unexpected file and file name formats. Using the new file format should solve most problems. Users of the old file format are advised to check the following:
File format: The tab-delimited text file should contain a header row (the first line) containing the words AccNo Genesymbol Mutation separated by a single tab.
File format: The file should be a tab-delimited text file. Excel users can import the file to check its format. The file should have three columns with the appropriate column names, AccNo Genesymbol Mutation, respectively. The variant information should be present in the corresponding columns.
The Genesymbol column/field may be left blank, when the reference sequence contains only one gene, but a tab should be present. Please note that Mutalyzer does not check the correctness of the gene symbol in this case.
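As an alternative to a spreadsheet program, a file in the old three-column format can be written with a short script. This is a minimal sketch: the function names are hypothetical, and the variant description in the usage note is a made-up placeholder rather than a real variant.

```python
import csv
import io

# Header words as listed in the FAQ for the old batch format.
HEADER = ["AccNo", "Genesymbol", "Mutation"]


def write_batch_file(rows, handle):
    """Write (acc_no, gene_symbol, mutation) rows as tab-delimited text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for acc_no, gene_symbol, mutation in rows:
        # A blank gene symbol still produces its own (empty) tab-separated field.
        writer.writerow([acc_no, gene_symbol or "", mutation])


def batch_text(rows):
    """Return the batch file content as a string."""
    buffer = io.StringIO()
    write_batch_file(rows, buffer)
    return buffer.getvalue()
```

For example, `batch_text([("AB026906.1", "", "c.274G>T")])` yields a header line followed by one data line in which the empty Genesymbol field is still delimited by tabs. To save to disk, pass a file opened with `open(path, "w", newline="")` to `write_batch_file` and choose a file name without spaces.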
If you have any comments or suggestions be sure to let us know!
[email protected]