3 Batch download sequence data!
Youtube video tutorial: Batch download sequence data with PrimerMiner
Batch downloading sequence data in only 3 simple steps!
- The taxa of interest have to be specified in a .csv table (comma separated)
- Download parameters are specified in a txt file which is generated by running batch_config("config.txt"). For example, you can specify which marker to download data for and whether trimming should be applied.
- Now you can start the download by running batch_download("taxa_table.csv", "config.txt"). Get a coffee, this might take a while =)
You have to specify the taxonomic groups you would like to download sequences for. It's recommended to only download data at Order and Family level, using the Latin names. Other taxonomic levels might work, but might download too much data (millions of sequences) or are not unique, like Genus names. You can take a look at the taxa_small.csv table that comes with the package tutorial!
The first column, Order, specifies the group you want to download data for. Typically this is an Order, but you can also enter family or subfamily names if you are interested in these. All data and processed files will be downloaded into folders named after the groups given in the Order column. If you want, you can specify a subset of taxa for every given group in the second column, Family. For example, if you are interested in sequences for the Order Coleoptera (beetles) but only care about aquatic beetles, you can specify only aquatic families in the second column. All sequences will be downloaded into the folder "Coleoptera", but only data for the groups specified in the second column will be downloaded and used.
Note: There is currently a bug that causes the script to crash if the second column is empty. To work around it, enter a taxon name that does not exist in the second column.
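As a minimal sketch, a taxa table with the two columns described above could be written from R as shown here. The column names Order and Family come from the text; the file name taxa_table.csv, the chosen taxa, and the one-row-per-group layout are illustrative assumptions. How several families for the same group are laid out is not shown here, so check taxa_small.csv from the tutorial for the layout PrimerMiner actually expects.

```r
# Hypothetical example table: one row per group, each with a family in the
# second column (the second column is never left empty, see the note above).
taxa <- data.frame(
  Order  = c("Coleoptera", "Plecoptera"),
  Family = c("Dytiscidae", "Perlidae"),
  stringsAsFactors = FALSE
)

# Write a comma-separated file without row names or quotes.
write.csv(taxa, "taxa_table.csv", row.names = FALSE, quote = FALSE)

# Read the file back to confirm the two-column layout.
check <- read.csv("taxa_table.csv", stringsAsFactors = FALSE)
print(check)
```

Reading the file back is only a sanity check that the header and separators look as intended before starting a long download.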
All download parameters are configured using a .txt file which can be generated by running batch_config("config.txt") in R. Please generate a new configuration file if you are using a newer version of PrimerMiner (backward compatibility is not guaranteed).
The configuration file contains short comments, explaining each function. The default settings should be fine for typical use scenarios, and sequences are downloaded from NCBI and BOLD for the cytochrome oxidase I (COI) gene which is the standard for DNA barcoding of animals.
The following parameters could be particularly interesting:
- T = TRUE and F = FALSE: change these parameters to turn functions on and off.
- Marker = c("COi", "CO1", "COXi", "COX1"): here you specify which marker you want to download data for. Make sure to consider different writing styles, and keep the brackets as in c("marker name 1", "marker name 2"), separating each style with a comma. To download data for the ribosomal 16S marker use, for example, Marker = c("16S", "large ribosomal subunit RNA", "s-rRNA"). If you are not downloading COI sequence data, make sure to turn off downloading from BOLD using download_bold = F. If your marker is not present on mitochondrial genomes, you can turn off downloading mitogenomes as well using download_mt = F.
- Download = T and Merge_and_Cluster_data = T: these can be used to first download data without clustering (e.g. to manually check or modify sequences). Then you can turn downloading off and cluster the changed sequences. This way you can also add your own sequences: just add them to the respective raw sequence file, such as "Coleoptera_mito.fasta". Please note that all files will be overwritten if you download data a second time into the same location!
- Skip_if_complete = T: once all data for a group has been downloaded and clustered, it will not be downloaded and processed again. This is useful if you download a lot of data and the script gets interrupted by a timeout on the BOLD or NCBI servers (which can happen sometimes). In such a case you can just start batch_download() again and the script will continue downloading where it was stopped.
- clipping_left_GB = 0, clipping_right_GB = 0, clipping_right_bold = 0 and clipping_left_bold = 0: when downloading partial sequences from NCBI and BOLD you can apply clipping on both sides, as some people do not trim away the primer regions in their Sanger sequences! For example, most COI sequences are generated for the Folmer region. If you are developing primers within the Folmer region, you can set the clipping to 0. However, if your goal is to optimise Folmer-based primers, make sure to add at least 26 base pairs of clipping to remove untrimmed primers from your dataset. For mitochondrial genomes this is not necessary, as a 100 bp buffer is added to the flanking regions of your marker of interest (add_mt = 100). NEW: With PrimerMiner 0.8, trimming can be applied at a later stage with selectivetrim() (which does not trim the region amplified by the primers), and thus trimming is set to 0 bp by default when batch downloading sequences.
- vsearchpath = "integrated": when using the Vsearch that is provided with the PrimerMiner package, you want to make sure that the right operating system was detected (operating_system = "MacOSX" or "Linux"). If you want to use your local installation of Vsearch instead (not recommended and usually not necessary), provide the path to the executable here, or the name you would use to run Vsearch in your Terminal.
- id = 0.97: this is the clustering threshold. With default settings, sequences with at least 97% similarity (i.e. up to 3% difference) are clustered together into one OTU (Operational Taxonomic Unit). In most cases one OTU represents one species. You can change this to 0.98 or even 0.99 if you want to account for more diversity within a species. However, I highly recommend leaving this on the default of 0.97!
- threshold = "Majority": the consensus sequence of each OTU alignment is generated using the most abundant base at each position. If you want to account for more variability within the OTU, you could set this to, for example, 0.1 to consider all bases with > 10% abundance at each position. However, I recommend keeping this on "Majority", especially if you want to use the resulting OTUs for primer evaluation! A threshold = 0.1 is a better way of accounting for additional sequence variability than setting e.g. id = 0.98.
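The difference between threshold = "Majority" and a numeric threshold such as 0.1 can be illustrated on a toy alignment. The following base R sketch is not PrimerMiner's own code: the alignment, the helper bases_at() and the "/" notation for multiple bases are made up for illustration, and it only mirrors the rule described above (most abundant base versus all bases above a given abundance).

```r
# Toy alignment of five equal-length sequences from one hypothetical OTU.
aln <- c("ACGT",
         "ACGT",
         "ACGA",
         "ATGA",
         "ACGT")

# Character matrix: one row per sequence, one column per alignment position.
mat <- do.call(rbind, strsplit(aln, ""))

# Bases kept at one alignment column (hypothetical helper).
# threshold = "Majority": only the most abundant base (first one on ties).
# numeric threshold: every base whose relative abundance exceeds it.
bases_at <- function(column, threshold = "Majority") {
  freq <- table(column) / length(column)
  if (identical(threshold, "Majority")) {
    names(freq)[which.max(freq)]
  } else {
    keep <- as.vector(freq) > threshold
    paste(sort(names(freq)[keep]), collapse = "/")
  }
}

majority <- paste(apply(mat, 2, bases_at), collapse = "")
variable <- apply(mat, 2, bases_at, threshold = 0.1)

print(majority)
print(variable)
```

In this toy case the majority consensus is "ACGT", while the 0.1 threshold additionally reports T at position 2 (20% abundance) and A at position 4 (40% abundance).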
To start the batch download, simply run the command batch_download("taxa_table.csv", "config.txt"), giving the path to your taxa table and configuration file. PrimerMiner will create folders for the respective taxa in your current working directory (set this in R with setwd("path/to/folder"), and use getwd() to display your current working directory). PrimerMiner will inform you about what it is currently doing and report when it has finished downloading and clustering all requested groups.
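Put together, a complete session could look like the sketch below. It only uses the calls named in this chapter (setwd(), getwd(), batch_config(), batch_download()). The folder name is a hypothetical example, and the sketch assumes that PrimerMiner is installed and that taxa_table.csv has already been placed in that folder.

```r
library(PrimerMiner)

# Hypothetical working folder; the per-group output folders are created here.
work_dir <- file.path(path.expand("~"), "Desktop", "primerminer_run")
dir.create(work_dir, recursive = TRUE, showWarnings = FALSE)
setwd(work_dir)
getwd()  # display the current working directory

# Write a configuration file with the default settings.
# Edit config.txt in a text editor before continuing if needed.
batch_config("config.txt")

# Download and cluster all groups listed in the taxa table.
# This can take a long time.
batch_download("taxa_table.csv", "config.txt")
```

If the download is interrupted, running the last line again should resume where it stopped, provided Skip_if_complete = T is set in the configuration file.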
The most interesting file in the folder for each group is Groupname_all_cons_cluster_Majority.fasta, which can then be used to generate alignments / plots in e.g. Geneious or other software specifically developed for primer design. See the next chapter on how to use the output files for primer generation.
Note: Please don't run PrimerMiner in the Downloads folder on Mac OS X, as it will crash for an unknown reason. PrimerMiner should run fine anywhere else on your computer (for example, your Desktop or Documents folder).