-
Notifications
You must be signed in to change notification settings - Fork 9
6: Sequence Selection
- Overview
- Selecting Sequences
- Enforcing Minimum Sequence/Taxon Requirement
- Creating a Table of Accession Numbers From All Loci
- Calculating the Possible Number of Supermatrix Combinations
The sequence selection stage of SuperCRUNCH allows you to construct different types of datasets, including traditional species-level supermatrices as well as population-level datasets. There are a variety of methods for sorting and filtering sequences. These methods are perhaps most important for constructing supermatrices, where a single representative sequence often must be chosen from a pool of available sequences for each taxon for each locus. For example, the longest sequence can be selected or a random selection can be made, and a minimum length filter can be applied. Additional filters include requiring an error-free reading frame to retain a coding sequence, or only including sequences which have associated voucher information. The latter option, which only retains vouchered sequences, is particularly useful for constructing phylogeographic datasets. Extensive documentation for sequence selection is available in the Selecting Sequences section below.
Following the sequence selection step, there will likely be a reduction in the number of sequences in the filtered locus-specific fasta files (especially for supermatrices). At this point, there may be loci for which there simply aren't enough data to make them worth including. For a supermatrix, this translates to the number of species present for a locus, but for population-level datasets this is the actual number of retained sequences. It is easy to set a minimum threshold for the number of taxa/sequences and apply it to all loci, removing those which fall below the threshold. More details are provided in the Enforcing Minimum Sequence/Taxon Requirement section below.
A major challenge of working with GenBank data is organizing the accession numbers for publication, which allows the dataset to be reproduced. This process can be performed automatically using SuperCRUNCH, which will produce a table containing all taxa and their corresponding accession numbers for all included sequences. More information is provided in the Creating a Table of Accession Numbers From All Loci below.
Finally, one of the main motivations behind SuperCRUNCH was to construct supermatrix datasets from GenBank sequences in an objective, repeatable manner. For most organisms, it would be impossible to build identical supermatrix datasets using subjective criteria. When multiple sequences are available for taxa across loci, the number of possible supermatrices is staggering. As a fun exercise, I created a tool which calculates this number based on the dataset used to generate the supermatrix. I encourage you to try it out! Details are provided in the Calculating the Possible Number of Supermatrix Combinations below.
The selection of sequences ultimately determines the type of dataset produced. There are two fundamentally different datasets produced by this step:
This type of dataset results from selecting one representative sequence per taxon per locus. The taxa in these datasets can include species and/or subspecies. For a given taxon, this requires selecting a representative sequence for each locus if multiple sequences are available. For a complete supermatrix, this would mean each locus would contain the same number of sequences, which is equal to the number of species included. Species-level supermatrices are typically used for phylogenetic analyses, to reconstruct the evolutionary relationships among the species included in the supermatrix.
This type of dataset results from retaining all available sequences per taxon. It is population-level or phylogeographic sampling. These datasets can contain high variation in the number of sequences per locus. A population-level dataset can also be composed of vouchered samples of a single species or a group of closely related species. Although the samples may represent the same species, they come from distinct vouchered specimens. The voucher code will result in an identical label for that sample across loci. This allows sequences to be assigned to the same sample, which can be used to produce phylogeographic supermatrices. Population-level datasets can be used for phylogenetic, phylogeographic, and population genetic analyses.
The Filter_Seqs_and_Species.py
module can be used to produce both of the above datasets. The filtering options associated with both types of datasets is explained in greater detail below.
The Filter_Seqs_and_Species.py
module is used to filter sequences to produce distinct types of datasets, including species-level supermatrices and population-level datasets. This decision is set using the -s
flag, with -s oneseq
producing a species-level supermatrix and -s allseqs
producing a population-level dataset. A list of taxa to include must be provided (-t
). By default, subspecies names are included in the searches, but subspecies can be 'turned off' using the --no_subspecies
flag. For more information on how taxon searches with and without subspecies work, please see the section here. Filter_Seqs_and_Species.py
is designed to work with a directory of locus-specific fasta files. To be recognized, the input fasta files should be labeled as NAME.fasta
or NAME.fa
. The NAME
portion should not contain any periods or spaces, but can contain underscores. A subset of files in the input directory can be analyzed using the --onlyinclude
flag, which requires the full path to a simple text file containing the names of the files to include (one file name per line).
The available filtering options differ depending on which dataset is selected, so they will be discussed separately to prevent confusion.
For the species-level supermatrix, a representative sequence must be selected for each taxon for each locus. There are several options for choosing a sequence, which are controlled by a combination of the required -f
and -m
flags, and the optional --randomize
and --vouchered
flags.
-
-f
: The main options for choosing a sequence are by length (-f length
) or by translation (-f translate
). For any type of locus, the sequences can be sorted by length and the longest sequence selected (-f length
). For coding loci, the sequences can be sorted by length and then translated into all reading frames (-f translate
). The longest sequence with a valid reading frame (no internal stop codons) will be selected. If no sequences have a valid reading frame, the longest sequence will be selected (rather than exclude the taxon for that locus). For the-f translate
option, the translation table can be specified using the--table
flag. If your dataset contains a mix of loci that need to be filtered differently (e.g., coding mtDNA vs. coding nucDNA vs. noncoding), the--onlyinclude
flag can be used. This requires the full path to a text file containing the names of the fasta files (one file name per line) to be processed for a particular run. A separate output directory (-o
) should be specified for each distinct run. -
-m
: The required-m
flag sets a minimum base pair length required to keep sequences. That is, for any sequence to be selected using the-f length
or-f translate
method, it must be longer than the base pair value set by the-m
flag. If no sequences are long enough, the taxon will be excluded for that locus. -
--randomize
: This option will change how sequences are initially sorted, resulting in a random choice for sequence selection. When used with-f length
, a random sequence will be selected from the set available, but that sequence must be greater than the minimum base pair length set by the-m
flag. When used with-f translate
, a random sequence will be selected from only the sequences which pass translation, but that sequence must be greater than the minimum base pair length set by the-m
flag. The randomize option can be used to generate different permutations of the supermatrix, otherwise the same supermatrix will be recovered every time using a specific combination of the-f
and-m
flags. These different sequence combinations for the supermatrix can be used to test the robustness of species relationships. -
--vouchered
: If this flag is used, only sequences that have a voucher tag will be considered for the above filtering options. The voucher field is present in the original description line (voucher
,isolate
, orstrain
), and produces aVoucher_[ID]
tag in the description line during theParse_Loci.py
module. This is generally not recommended for creating species-level supermatrices, but could be useful under certain circumstances.
For population-level datasets, there is no need to select one sequence per taxon per locus - all sequences are retained. However, there are still some filters that can be used to ensure only high-quality sequences are retained. The required -f
and -m
flags, as well as the optional --vouchered
flag, can be used to control this filtering. The optional --randomize
flag will have no effect on population-level datasets.
-
-f
: The main options for choosing a sequence are by length (-f length
) or by translation (-f translate
). For population-level datasets, only the translation option will have any effect. If the-f length
option is used, all sequences will be retained. If the-f translate
option is used, only sequences that pass translation will be retained. For the-f translate
option, the translation table can be specified using the--table
flag. If your dataset contains a mix of loci that need to be filtered differently (e.g., coding mtDNA vs. coding nucDNA vs. noncoding), the--onlyinclude
flag can be used. This requires the full path to a text file containing the names of the fasta files (one file name per line) to be processed for a particular run. A separate output directory (-o
) should be specified for each distinct run. -
-m
: The required-m
flag sets a minimum base pair length required to keep sequences. For population-level datasets, this will simply eliminate sequences that are shorter than the base pair value set by the-m
flag. -
--vouchered
: If this flag is used, only sequences that have a voucher tag will be retained. The voucher field is present in the original description line (voucher
,isolate
, orstrain
), and produces aVoucher_[ID]
tag in the description line during theParse_Loci.py
module. This option is extremely useful for reconstructing phylogeographic datasets, especially if the starting sequences come from a single study. The voucher code will result in an identical label for that sample across loci. This allows sequences to be assigned to the same sample, which can be used to produce cohesive phylogeographic datasets and phylogeographic supermatrices.
python Filter_Seqs_and_Species.py -i <input directory> -o <output directory> -s <choice> -f <dataset choice> -m <minimum base pair length> -t <taxon file>
Required: The full path to a directory which contains the fasta files to filter by species and sequence.
Required: The full path to an existing directory to write output files.
Required: Select one representative sequence per taxon or select all sequences available per taxon. Choices = oneseq, allseqs.
Required: Strategy for filtering sequence data, particularly for selecting one representative sequence using the -s 'oneseq' option. If the
-s allseqs
option is desired, use-f length
. Choices = translate, length.
Required: An integer for the minimum number of base pairs required to keep a sequence (ex. 150).
Required: The full path to a text file containing all taxon names to cross-reference in the fasta file(s).
Optional: Ignore subspecies labels in both the taxon names file and the fasta file.
Optional: For taxa with multiple sequences, shuffle order randomly. Overrides initial sorting by length for sequence selection step.
Optional: Select only sequences that contain a Voucher_[ID] tag in the description line (inserted by
Parse_Loci.py
module).
Required for
-f translate
: Specifies translation table. Choices = standard, vertmtdna, invertmtdna, yeastmtdna, plastid, or any integer 1-31.
Optional: The full path to a text file containing the names of the fasta files to process for this run.
Optional: Show less output while running (useful when filtering many, many loci).
python Filter_Seqs_and_Species.py -i Filter-seqs/Input/ -o Filter-seqs/Output/ -s oneseq -f length -m 400 -t Filter-seqs/Taxon-List.txt
Above command will create a species-level supermatrix using the length method with a minimum base pair length of 400, using the taxon list
Taxon-List.txt
, with subspecies allowed.
python Filter_Seqs_and_Species.py -i Filter-seqs/Input/ -o Filter-seqs/Output/ -s oneseq -f length -m 400 -t Filter-seqs/Taxon-List.txt --randomize --no_subspecies
Above command will create a species-level supermatrix using the length + randomize method (a true random selection) with a minimum base pair length of 400, using the taxon list
Taxon-List.txt
with the--no_subspecies
option.
python Filter_Seqs_and_Species.py -i Filter-seqs/Input/ -o Filter-seqs/Output/ -s oneseq -f translate -m 200 -t Filter-seqs/Taxon-List.txt --no_subspecies --table standard --onlyinclude Filter-seqs/Nuc_File_List.txt --quiet
Above command will create a species-level supermatrix using the translation method with standard code and a minimum base pair length of 200, using the taxon list
Taxon-List.txt
with the--no_subspecies
option. It will only run for the fasta files included in theNuc_File_List.txt
file. The--quiet
flag will make the screen output less verbose.
python Filter_Seqs_and_Species.py -i Filter-seqs/Input/ -o Filter-seqs/Output/ -s allseqs -f length -m 400 -t Filter-seqs/Taxon-List.txt
Above command will create a population-level dataset using the length method with a minimum base pair length of 400, using the taxon list
Taxon-List.txt
, with subspecies allowed.
python Filter_Seqs_and_Species.py -i Filter-seqs/Input/ -o Filter-seqs/Output/ -s allseqs -f translate -m 200 -t Filter-seqs/Taxon-List.txt --no_subspecies --table vertmtdna --onlyinclude Filter-seqs/mtDNA_File_List.txt
Above command will create a population-level dataset using the translation method with vertebrate mitochondrial code and a minimum base pair length of 200, using the taxon list
Taxon-List.txt
with the--no_subspecies
option. It will only run for the fasta files included in themtDNA_File_List.txt
file.
Several outputs are created in the output directory specified. A directory called /Results
is created, which contains several additional directories.
-
Directory
/Results/Accession-Number-Files
- For each locus, this directory contains two files:NAME_accession_list_by_species.txt
andNAME_accessions_for_BatchEntrez.txt
.
The NAME_accession_list_by_species.txt
file contains a list of accession numbers for all the sequences available for a given taxon. Here are the partial contents of such a file:
Acanthosaura armata AB266452.1, NC_014175.1
Acanthosaura crucigera AB031963.1
Acanthosaura lepidogaster KR092427.1
Agama impalearis AB271238.1, KT005791.1
Amblyrhynchus cristatus KR350814.1, KR350815.1, KR350816.1, KR350817.1, KR350823.1, KR350824.1, KR350825.1, KR350826.1, KR350827.1, KR350828.1, KR350829.1, KR350830.1, KR350831.1, KR350832.1, KR350833.1, KR350834.1, KR350835.1, KR350836.1, KR350837.1, KT277937.1, NC_028031.1
The LOCUS_accessions_for_BatchEntrez.txt
file contains a simple list of all accession numbers for all the sequences available for all taxa. Here are the partial contents of such a file:
AB266452.1
NC_014175.1
AB031963.1
KR092427.1
AB271238.1
KT005791.1
KR350814.1
KR350815.1
KR350816.1
KR350817.1
KR350823.1
KR350824.1
KR350825.1
KR350826.1
KR350827.1
KR350828.1
KR350829.1
KR350830.1
KR350831.1
KR350832.1
KR350833.1
KR350834.1
KR350835.1
KR350836.1
KR350837.1
KT277937.1
NC_028031.1
-
Directory
/Results/Taxon-Log-Files
- For each locus, this directory contains a summary fileNAME_species_log.txt
that contains the following columns:-
Taxon - The taxon name for this sequence.
-
Acession - The accession number of the sequence selected for this taxon (species-level supermatrix) or simply the sequence in that row (population-level dataset).
-
SeqLength - The length of the sequence corresponding to this accession number.
-
Vouchered - A
yes
orno
as to whether the sequence contained a Voucher_[ID] tag. -
PassedTranslation - A
yes
orno
as to whether the sequence passed translation for-f translate
or aNA
for-f length
. -
SeqsAvailable - The number of alternative sequences available for this taxon.
-
Here is an example from using the -s oneseq
and -f length
options:
Taxon Acession SeqLength Vouchered PassedTranslation SeqsAvailable
Acanthosaura armata AB266452.1 833 no NA 2
Acanthosaura crucigera AB031963.1 360 no NA 1
Acanthosaura lepidogaster KR092427.1 834 no NA 1
Agama impalearis KT005791.1 338 yes NA 2
Amblyrhynchus cristatus KT277937.1 944 no NA 21
Here is an example from using the -s oneseq
and -f length
options:
Taxon Acession SeqLength Vouchered PassedTranslation SeqsAvailable
Acanthosaura lepidogaster JF806003.1 630 no yes 1
Agama agama DQ340698.1 688 yes yes 4
Amblyrhynchus cristatus KR350716.1 620 yes yes 3
Amphibolurus muricatus KF871470.1 718 yes yes 61
Amphibolurus norrisi KF871494.1 718 yes yes 1
Here is an example from using the -s allseqs
and -f length
options:
Callisaurus draconoides bogerti DQ001804.1 1087 yes NA 2
Callisaurus draconoides bogerti EU543757.1 1087 no NA 2
Callisaurus draconoides brevipes DQ001803.1 1082 yes NA 1
Callisaurus draconoides carmenensis DQ001780.1 1087 yes NA 17
Callisaurus draconoides carmenensis DQ001782.1 1087 yes NA 17
Callisaurus draconoides carmenensis DQ001784.1 1087 yes NA 17
Callisaurus draconoides carmenensis DQ001785.1 1087 yes NA 17
Callisaurus draconoides carmenensis DQ001790.1 1087 yes NA 17
Callisaurus draconoides carmenensis DQ001793.1 1087 yes NA 17
Callisaurus draconoides carmenensis DQ001781.1 1086 yes NA 17
Callisaurus draconoides carmenensis EU543755.1 1083 no NA 17
Callisaurus draconoides carmenensis DQ001783.1 1076 yes NA 17
Callisaurus draconoides carmenensis DQ001786.1 1076 yes NA 17
Callisaurus draconoides carmenensis DQ001787.1 1076 yes NA 17
Callisaurus draconoides carmenensis DQ001788.1 1076 yes NA 17
Callisaurus draconoides carmenensis DQ001789.1 1076 yes NA 17
Callisaurus draconoides carmenensis DQ001791.1 1076 yes NA 17
Callisaurus draconoides carmenensis DQ001792.1 1076 yes NA 17
Callisaurus draconoides carmenensis DQ001794.1 1076 yes NA 17
Callisaurus draconoides carmenensis DQ001795.1 1076 yes NA 17
-
Directory
/Results/Filtered-Fasta-Files
- For each locus, this directory contains a fasta file with the sequences that passed the sequence selection filtering. If the-s oneseq
was used, the file will be calledNAME_oneseq.fasta
. If the-s allseqs
option was used, the file will be calledNAME_allseqs.fasta
. These files should be used for the downstream steps.
Following the sequence selection step, there will likely be a reduction in the number of sequences in the filtered locus-specific fasta files (especially for supermatrices). At this point, there may be loci for which there simply aren't enough data to make them worth including. For a supermatrix, this translates to the number of species present for a locus, but for population-level datasets this is the actual number of retained sequences. It is easy to set a minimum threshold for the number of taxa/sequences and apply it to all loci, removing those which fall below the threshold.
The purpose of the Fasta_Filter_by_Min_Seqs.py
module is to examine fasta files for the number of sequences contained within. If taxa are represented by a single sequence, this is equivalent to the number of taxa. Running the script with only the -i
and --min_seqs
flags will simply print the number of fasta files passing and failing the minimum sequence filter. The minimum number of sequences is an integer specified using the --min_seqs
flag. Running this module with the -i
, --min_seqs
, and the optional -o
flags will copy the fasta files that pass the minimum sequence filter to the output directory specified with the -o
flag.
Fasta_Filter_by_Min_Seqs.py
is designed to work with a directory of locus-specific fasta files. To be recognized, the input fasta files should be labeled as NAME.fasta
or NAME.fa
. The NAME
portion should not contain any periods or spaces, but can contain underscores.
To simply show results on screen (no files copied):
python Fasta_Filter_by_Min_Seqs.py -i <input directory> --min_seqs <minimum sequences required>
To show results and copy files passing filter to output directory:
python Fasta_Filter_by_Min_Seqs.py -i <input directory> --min_seqs <minimum sequences required> -o <output directory>
Required: The full path to a directory of fasta files.
Required: The minimum number of taxa/sequences required.
Optional: The full path to an existing directory to copy files passing the filter.
python Fasta_Filter_by_Min_Seqs.py -i bin/Min-Seqs/Input --min_seqs 50
Above command will assess how many fasta files present in the input directory
bin/Min-Seqs/Input
have greater than or equal to 50 sequences, and print this to the screen.
python Fasta_Filter_by_Min_Seqs.py -i bin/Min-Seqs/Input --min_seqs 50 -o bin/Min-Seqs/Output
Above command will assess how many fasta files present in the input directory
bin/Min-Seqs/Input
have greater than or equal to 50 sequences, print this to the screen, and copy the files that passed to the designated output directorybin/Min-Seqs/Output
.
If the module is run without the output directory flag (-o
), no outputs are produced. The information is simply printed to the screen. If the output directory flag (-o
) is specified, the following directory will be created in the output directory:
-
Directory
/Filtered-Fasta-Files
- This directory will contain all the fasta files that passed the minimum sequence threshold.
Here is an example of the information printed to the screen:
python Fasta_Filter_by_Min_Seqs.py -i bin/07-Filter-seqs/Output/Results/Filtered-Fasta-Files --min_seqs 100
Found 67 total fasta files.
Processing...
56 fasta files passed the minimum sequence filter (>=100 seqs).
11 fasta files failed the minimum sequence filter (>=100 seqs).
--------------------------------------------------------------------------------------
Finished. Elapsed time : 0:00:02.740741 (H:M:S)
--------------------------------------------------------------------------------------
If the number of loci retained is too low, it is trivial to re-run the module with different minimum sequence thresholds to find an appropriate value. Here is the same dataset with a lower threshold:
python Fasta_Filter_by_Min_Seqs.py -i bin/07-Filter-seqs/Output/Results/Filtered-Fasta-Files --min_seqs 30
Found 67 total fasta files.
Processing...
61 fasta files passed the minimum sequence filter (>=30 seqs).
6 fasta files failed the minimum sequence filter (>=30 seqs).
--------------------------------------------------------------------------------------
Finished. Elapsed time : 0:00:00.030574 (H:M:S)
--------------------------------------------------------------------------------------
After the ideal value is found, the -o
flag can be used to copy the fasta files passing the filter to the relevant output directory.
Organizing GenBank accession numbers for large projects can be exceptionally challenging. SuperCRUNCH includes a module called Make_Acc_Table.py
that automates this process, resulting in a formatted table with rows representing species, subspecies, or vouchered specimens, and the columns representing loci. This table will only be created from a dataset which contains one sequence per locus for each taxon. If multiple sequences exist for a given taxon (species or vouchered specimen) for a given locus (as with some population-level datasets), this will result in an error.
The Make_Acc_Table.py
module is a tool for extracting taxon and accession number information from a directory of fasta files, for the purpose of creating a table of GenBank accession numbers.
A comprehensive list of taxa is collected across all fasta files in the directory. The taxon names recovered are two-part by default (e.g., Genus species). To include three-part names (e.g., Genus species subspecies), the optional -s
flag can be used. This requires the full path to a taxon names file. This taxon names file can contain a mix of two-part and three-part names, and can be the same file used for earlier steps, but only the three-part names are used to aid the searches in Make_Acc_Table.py
. If this is a vouchered data set (produced from the Filter_Seqs_and_Species.py
module using the --voucherize
option), the --voucherize
flag should also be used here to construct the taxon labels with the voucher information present. This will produce an accession table where each entry is a taxon + voucher label, rather than just a taxon label. Examples of both types are shown below. If a vouchered data set is run without the --voucherize
flag, an error will be thrown indicating duplicate taxon labels are present within a single fasta file, or that multiple sequences are available for the taxon.
The final table is written with species, subspecies, or vouchered specimen labels in the first column (sorted alphabetically) and the remaining columns represent the loci (also sorted alphabetically by fasta file name). The rows will either be filled with accession numbers if the sequence is present, or with dashes if there is missing data.
This tool can be used on unaligned or aligned fasta files, as long as they contain full description lines for sequences.
Make_Acc_Table.py
is designed to work with a directory of locus-specific fasta files. To be recognized, the input fasta files should be labeled as NAME.fasta
or NAME.fa
. The NAME
portion should not contain any periods or spaces, but can contain underscores.
python -i <input directory> -o <output directory>
Required: The full path to a directory which contains the input fasta files.
Required: The full path to an existing directory to write output files.
Optional: If subspecies names are to be included, this should be the full path to a text file containing all subspecies names to cross-reference in the fasta file. This can be the same taxon list used in earlier steps.
Optional: If a 'Voucher_[ID]' entry is included in the sequence description, append the ID component to the end of the species/subspecies label (e.g., Hyperolius_nitidulus_MVZ236473). The Voucher_[ID] field is generated by the
Parse_Loci.py
module.
python Make_Acc_Table.py -i bin/Acc-table/Input -o bin/Acc-table/Output
Above command will create an accession table from all fasta files present in
bin/Acc-table/Input
and write the outputs to the designated output directory. By default, subspecies labels will be ignored.
python Make_Acc_Table.py -i bin/Acc-table/Input -o bin/Acc-table/Output -s bin/Acc-table/Taxon-list.txt
Above command will create an accession table from all fasta files present in
bin/Acc-table/Input
and write the outputs to the designated output directory. The taxon labels will include any subspecies labels present in the fileTaxon-list.txt
.
python Make_Acc_Table.py -i bin/Acc-table/Input -o bin/Acc-table/Output --voucherize
Above command will create an accession table from all fasta files present in
bin/Acc-table/Input
and write the outputs to the designated output directory. The taxon labels will include the voucher codes, so that the same species can be represented by many uniquely labeled samples.
Several outputs are created in the output directory specified:
-
File
Taxon_Summary.txt
- A two-column file which contains the taxon label and the number of sequences present in the supermatrix. In other words, the taxon and the number of loci for that taxon.
Example output for a species-level supermatrix dataset:
Taxon Loci
Acanthocercus_adramitanus 4
Acanthocercus_annectens 6
Acanthocercus_atricollis 6
Acanthocercus_cyanogaster 6
Acanthocercus_yemensis 8
Acanthosaura_armata 7
Acanthosaura_capra 2
Acanthosaura_crucigera 5
Acanthosaura_lepidogaster 44
Example output for a population-level vouchered dataset:
Taxon Loci
Trachylepis_aurata_ZFMK75837 4
Trachylepis_aurata_ZFMK84085 4
Trachylepis_punctulata_MBUR00322 4
Trachylepis_punctulata_MCZA38906 4
Trachylepis_punctulata_MCZA38924 4
Trachylepis_sulcata_AMB4288 4
Trachylepis_sulcata_AMB4291 1
Trachylepis_sulcata_AMB4596 4
Trachylepis_sulcata_AMB4620 4
Trachylepis_sulcata_AMB4693 3
Trachylepis_sulcata_AMB4766 4
Trachylepis_sulcata_AMB4767 4
Trachylepis_sulcata_AMB4782 4
-
File
GenBank_Accession_Table.txt
- The actual table of taxa and accession numbers.
Example output for a species-level supermatrix dataset:
Taxon 12S 16S ADNP AHR AKAP9 BACH1
Acanthocercus adramitanus - KU097507.1 - - - -
Acanthocercus annectens - JX668129.1 - - - -
Acanthocercus atricollis - JX668132.1 - - - -
Acanthocercus cyanogaster - JX668135.1 - - - -
Acanthocercus yemensis - JX668140.1 - - - -
Acanthosaura armata AB266452.1 AB266452.1 - - - -
Acanthosaura capra - - - - - -
Acanthosaura crucigera AB031963.1 MG935713.1 - - - -
Acanthosaura lepidogaster KR092427.1 KR092427.1 JF818219.1 JF818274.1 JF805815.1 JF805886.1
Example output for a population-level vouchered dataset:
Taxon EXPH5_extracted_allseqs KIF24_extracted_allseqs ND2_extracted_allseqs RAG1_extracted_allseqs
Trachylepis aurata ZFMK75837 GU931406.1 GU931476.1 GU931545.1 GU931613.1
Trachylepis aurata ZFMK84085 GU931407.1 GU931477.1 GU931546.1 GU931614.1
Trachylepis punctulata MBUR00322 GU931414.1 GU931543.1 GU931612.1 GU931679.1
Trachylepis punctulata MCZA38906 GU931413.1 GU931542.1 GU931611.1 GU931678.1
Trachylepis punctulata MCZA38924 GU931415.1 GU931541.1 GU931610.1 GU931677.1
Trachylepis sulcata AMB4288 HQ829667.1 HQ829699.1 HQ829735.1 HQ829771.1
Trachylepis sulcata AMB4291 - - HQ829736.1 -
Trachylepis sulcata AMB4596 GU931439.1 GU931483.1 GU931552.1 GU931620.1
Trachylepis sulcata AMB4620 GU931440.1 GU931484.1 GU931553.1 GU931621.1
Trachylepis sulcata AMB4693 - HQ829700.1 HQ829737.1 HQ829772.1
Trachylepis sulcata AMB4766 GU931437.1 GU931485.1 GU931554.1 GU931622.1
Trachylepis sulcata AMB4767 GU931438.1 GU931486.1 GU931555.1 GU931623.1
Trachylepis sulcata AMB4782 GU931442.1 GU931487.1 GU931556.1 GU931624.1
When creating a supermatrix, if each taxon had only one sequence per locus then the possible number of supermatrix combinations is one. If there are multiple sequences available per locus, then the number of possible combinations can be incredibly high. The Infer_Supermatrix_Combinations.py
module can be used to calculate the exact number of sequence combinations for a given supermatrix, given a set of sequences available for filtering. In other words, it will tell you how many possible supermatrices can be constructed based on the starting sequences.
A tool for calculating how many supermatrix combinations are available, given the number of filtered sequences available for each taxon for each locus. If all taxa have only one sequence available, the answer is one, but if taxa have multiple sequences available, this number will be extremely large. Relies on the NAME_species_log.txt
files produced from the Filter_Seqs_and_Species.py
module, which are located in the Results/Taxon-Log-Files/
output directory, to calculate the number of sequences available per taxon. The log files for all loci should be present in the input directory for the calculation to be accurate. No output files are created, rather the information is logged to the screen. Note - Infer_Supermatrix_Combinations.py
is unable to process population-level vouchered datasets, it only works for species/subspecies level supermatrices.
python -i <input directory>
Required: The full path to a directory which contains all the
NAME_species_log.txt
files. This should be theResults/Taxon-Log-Files/
output directory from theFilter_Seqs_and_Species.py
module.
python Infer_Supermatrix_Combinations.py -i bin/07-Filter-seqs/Output/Results/Taxon-Log-Files
Above command will use the
NAME_species_log.txt
files in the input directory to calculate the number of supermatrix combinations.
No outputs are created - the results are written to the screen.
Here is an example from a very large supermatrix:
Found 67 loci to examine.
Parsing information in 12S_extracted_contfiltered_species_log.txt
Parsing information in 16S_extracted_contfiltered_species_log.txt
Parsing information in ADNP_extracted_species_log.txt
Parsing information in AHR_extracted_species_log.txt
Parsing information in AKAP9_extracted_species_log.txt
Parsing information in AMEL_extracted_species_log.txt
Parsing information in BACH1_extracted_species_log.txt
Parsing information in BACH2_extracted_species_log.txt
Parsing information in BDNF_extracted_species_log.txt
Parsing information in BHLHB2_extracted_species_log.txt
Parsing information in BMP2_extracted_species_log.txt
Parsing information in CAND1_extracted_species_log.txt
Parsing information in CARD4_extracted_species_log.txt
Parsing information in CILP_extracted_species_log.txt
Parsing information in CMOS_extracted_species_log.txt
Parsing information in CO1_extracted_species_log.txt
Parsing information in CXCR4_extracted_species_log.txt
Parsing information in CYTB_extracted_species_log.txt
Parsing information in DLL1_extracted_species_log.txt
Parsing information in DNAH3_extracted_species_log.txt
Parsing information in ECEL1_extracted_species_log.txt
Parsing information in ENC1_extracted_species_log.txt
Parsing information in EXPH5_extracted_species_log.txt
Parsing information in FSHR_extracted_species_log.txt
Parsing information in FSTL5_extracted_species_log.txt
Parsing information in GALR1_extracted_species_log.txt
Parsing information in GHSR_extracted_species_log.txt
Parsing information in GPR37_extracted_species_log.txt
Parsing information in HLCS_extracted_species_log.txt
Parsing information in INHIBA_extracted_species_log.txt
Parsing information in KIAA2018_extracted_species_log.txt
Parsing information in KIF24_extracted_species_log.txt
Parsing information in LRRN1_extracted_species_log.txt
Parsing information in LZTS1_extracted_species_log.txt
Parsing information in MC1R_extracted_species_log.txt
Parsing information in MKL1_extracted_species_log.txt
Parsing information in MLL3_extracted_species_log.txt
Parsing information in MSH6_extracted_species_log.txt
Parsing information in MXRA5_extracted_species_log.txt
Parsing information in ND1_extracted_species_log.txt
Parsing information in ND2_extracted_species_log.txt
Parsing information in ND4_extracted_species_log.txt
Parsing information in NGFB_extracted_species_log.txt
Parsing information in NKTR_extracted_species_log.txt
Parsing information in NOS1_extracted_species_log.txt
Parsing information in NT3_extracted_species_log.txt
Parsing information in PDC_extracted_species_log.txt
Parsing information in PNN_extracted_species_log.txt
Parsing information in PRLR_extracted_species_log.txt
Parsing information in PTGER4_extracted_species_log.txt
Parsing information in PTPN_extracted_species_log.txt
Parsing information in R35_extracted_species_log.txt
Parsing information in RAG1p1_extracted_species_log.txt
Parsing information in RAG1p2_extracted_species_log.txt
Parsing information in RAG2_extracted_species_log.txt
Parsing information in REV3L_extracted_species_log.txt
Parsing information in RHO_extracted_species_log.txt
Parsing information in SLC30A1_extracted_species_log.txt
Parsing information in SLC8A1_extracted_species_log.txt
Parsing information in SLC8A3_extracted_species_log.txt
Parsing information in SNCAIP_extracted_species_log.txt
Parsing information in SOCS5_extracted_species_log.txt
Parsing information in TRAF6_extracted_species_log.txt
Parsing information in UBN1_extracted_species_log.txt
Parsing information in VCPIP_extracted_species_log.txt
Parsing information in ZEB2_extracted_species_log.txt
Parsing information in ZFP36L1_extracted_species_log.txt
Found 59,219 total sequences for 1,426 taxa.
Number of possible supermatrix combinations (unwieldy integer) =
9,155,211,937,028,784,139,400,729,807,279,775,488,020,851,133,423,680,100,069,848,449,615,
916,157,495,787,793,718,608,810,311,404,937,233,000,214,247,836,252,455,768,476,628,861,
064,620,200,711,350,158,661,811,743,694,197,397,688,387,145,249,835,198,884,996,018,769,
083,230,608,987,218,568,872,312,142,336,848,250,842,612,365,321,530,772,903,030,234,569,
416,127,858,213,723,379,748,832,537,328,917,920,565,791,037,157,172,944,443,004,391,099,
462,043,175,267,678,305,264,769,042,131,976,141,366,968,096,832,712,391,298,765,975,927,
494,157,940,235,330,313,905,291,769,437,463,196,103,934,323,818,989,284,191,519,710,411,
868,781,740,939,402,237,791,189,661,891,854,367,398,490,891,577,453,810,697,124,756,345,
828,950,561,902,614,727,247,462,339,434,248,981,829,916,772,790,684,231,918,401,408,679,
804,593,518,264,530,597,506,699,156,559,180,838,083,697,452,342,693,019,916,934,095,427,
926,284,239,186,778,760,537,631,648,689,778,685,028,875,780,934,811,195,903,383,297,325,
404,674,321,132,173,191,772,706,976,318,676,694,222,478,165,122,633,462,198,760,623,367,
676,347,049,039,942,766,779,185,599,670,242,800,590,746,443,028,479,628,344,624,129,788,
986,052,267,044,293,545,467,128,252,676,303,437,334,660,672,457,382,059,030,736,971,425,
711,493,163,346,019,711,361,032,544,725,739,717,915,551,166,357,417,654,655,771,689,510,
332,660,513,586,187,033,518,591,298,795,741,174,413,719,973,333,787,649,349,315,999,314,
507,241,608,217,798,993,354,659,849,179,712,089,626,837,565,629,250,942,040,077,001,508,
191,915,175,670,077,378,115,091,776,926,928,482,860,747,377,191,043,304,997,916,940,735,
024,551,373,227,109,328,366,639,344,090,890,543,448,467,140,380,695,265,233,745,430,615,
945,648,409,727,981,469,031,706,286,879,836,149,901,114,757,957,593,600,527,674,727,143,
180,796,022,929,116,545,586,123,339,554,137,359,960,085,564,193,538,396,534,200,757,633,
184,200,669,679,266,085,615,357,765,708,228,346,907,859,684,495,445,211,346,974,768,841,
479,305,103,785,350,888,613,151,991,051,292,954,691,705,073,714,299,347,256,083,000,951,
211,085,722,425,097,834,216,271,773,699,005,376,686,144,260,411,771,234,840,626,714,359,
270,153,124,657,096,088,661,332,547,307,732,121,596,668,803,070,074,096,449,199,006,045,
247,706,986,924,718,606,717,585,336,204,217,255,444,742,327,718,898,432,886,194,233,760,
704,248,970,931,681,940,763,022,191,757,491,596,631,731,120,222,921,779,768,634,166,891,
835,935,791,116,498,217,263,382,315,692,521,267,910,732,306,133,095,900,014,405,539,127,
600,704,170,287,276,712,651,906,937,442,866,872,818,288,101,368,989,453,287,779,969,318,
819,898,191,343,359,904,988,879,067,810,763,926,073,380,355,658,042,061,087,634,040,016,
432,163,443,052,335,929,991,158,397,475,416,986,689,641,305,337,653,559,321,620,263,933,
413,094,202,049,855,727,193,018,278,271,867,141,665,326,736,449,502,598,153,467,584,350,
105,687,056,319,894,338,714,617,001,215,119,927,747,047,282,846,350,464,508,644,061,072,
057,356,052,375,610,897,167,524,000,976,693,378,953,815,577,177,018,947,825,769,466,688,
174,297,878,405,414,929,748,731,885,698,125,477,079,134,891,381,149,885,821,579,355,476,
251,275,796,728,042,602,310,223,150,781,666,995,693,268,999,220,016,697,340,239,326,593,
524,625,965,198,899,500,437,113,743,375,571,212,933,192,821,896,601,719,186,689,705,541,
886,040,385,798,403,663,116,845,223,343,872,067,947,547,542,713,178,802,957,834,364,493,
489,291,999,989,116,740,589,281,208,790,782,373,145,513,713,699,734,975,346,637,798,772,
073,453,971,876,231,105,322,349,886,484,465,571,212,532,870,605,116,947,646,621,165,966,
290,219,238,774,388,367,531,756,483,014,845,991,343,688,549,851,631,099,727,693,531,221,
675,011,436,766,560,256,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000,000.
Number of possible supermatrix combinations (scientific notation) = 9.16E+3219, or 9.16*10^3219.
That's an insane number.
Last updated: September, 2019
For SuperCRUNCH v1.2