SegataLab/repophlan
This is pretty much a work in progress.

The pipeline currently consists of the following scripts (reported with default settings):

*** Generate the underlying taxonomy

    $ ./generate_taxonomy.py --output taxonomy.txt --output_red taxonomy_reduced.txt \
          --pickle taxonomy.pkl | tee generate_taxonomy.log

[ ~3 hours ]. This script downloads, processes, and saves the NCBI taxonomy in a standard format with a fixed number of taxonomic levels.

*** Download the viral genomes (fna,ffn,faa) with corresponding taxonomy

    $ ./repophlan_get_viruses.py --taxonomy taxonomy_reduced.txt --out_dir \
          repophlan_viruses --out_summary repophlan_viruses.txt | tee repophlan_viruses.log

[ ~1 hour ]. Downloads and saves all sequences available for viruses (from RefSeq). The files are saved in the 'repophlan_viruses' folder, and the taxonomy of each downloaded set of files is written to 'repophlan_viruses.txt'.

*** Download the microbial (bacteria+archaea) genomes (fna,ffn,faa,frn) with corresponding taxonomy

    $ ./repophlan_get_microbes.py --taxonomy taxonomy_reduced.txt --out_dir \
          microbes --nproc 20 --out_summary repophlan_microbes.txt \
          | tee repophlan_microbes.log

[ ~6 hours ]. Downloads and saves all available sequences for microbes. It uses multiple parallel connections to speed up the retrieval. Common exceptions, including the NCBI ftp temporarily rejecting connections because of load problems, are handled gracefully with a 'delay and retry' policy.

IMPORTANT: running more than 20 processes in parallel (--nproc above 20) exceeds the number of connections allowed by NCBI and causes long delays.

*** Download the single-celled Eukaryotes genomes (i.e. fungi and protozoa, with fna,ffn,faa) with corresponding taxonomy

    $ ./repophlan_get_euks.py --taxonomy taxonomy_reduced.txt --out_dir_fungi fungi \
          --out_dir_protozoa protozoa --out_summary_fungi repophlan_fungi.txt \
          --out_summary_protozoa repophlan_protozoa.txt | tee repophlan_euks.log

As of Nov 8 2013, RepoPhlAn retrieves:
* 4958 viruses (each with fna, ffn, faa files)
* 12277 microbes (each with fna, ffn, frn, faa files)

For microbes, four files are generated for each genome:
* .fna : the genome in multifasta format (one or more contigs)
* .ffn : all the protein coding genes (no rRNA, tRNA)
* .faa : all the proteins
* .frn : all the non-coding genes (rRNA, tRNA)

Several aspects can be optimized, and with a bit of refining about 20% more microbes could be retrieved.

Known issues:
* For almost all assemblies, additional informative files can be automatically downloaded:
  - .asn (probably not useful)
  - .gbk (genbank file for the assembly and for each scaffold)
  - .gff (annotations in almost free text format)
  - .ptt (tab-separated file including COG assignments and product names)
  - .rnt (annotation of rRNA and tRNAs)
  - .rpt (few metadata)
  - .val (binary file, not sure what's inside)
* About 30 strains are missing in the generated taxonomy file. These are the representatives of sets of strains with multiple assemblies. It should be an easy fix in generate_taxonomy.py
* For some microbes and euks, only a subset of the four file types is present. These files are downloaded, but the assembly will not appear in repophlan_microbes.txt, for consistency. However, some downstream analyses do not actually need all four files.
* Several inconsistencies are present in the following file: ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt
  Entries with inconsistencies are skipped (with a 'try catch' policy), but a few of these inconsistencies could actually be handled.
* It would be great to plug a gene-to-function mapping (e.g. KEGG, CAZY, ...) into RepoPhlAn
* Both the taxonomy and the retrieved files need extensive testing

http://www.standardsingenomics.com/content/9/1/20#B11
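The 'delay and retry' policy mentioned for the microbes download could look roughly like the sketch below. This is not the code used in repophlan_get_microbes.py: the function name retry_with_delay, its parameters, and the exponential backoff with jitter are all assumptions chosen for illustration.

```python
import random
import time


def retry_with_delay(func, retries=5, base_delay=1.0, max_delay=60.0,
                     exceptions=(OSError, EOFError), sleep=time.sleep):
    """Call func(); if it raises one of `exceptions`, wait and try again.

    Hypothetical helper, not part of RepoPhlAn. The wait doubles after
    each failed attempt (capped at max_delay) and gets a little random
    jitter so that parallel workers do not all reconnect at once.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return func()
        except exceptions:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, 0.1 * delay))
```

A download function would be wrapped as retry_with_delay(lambda: fetch(path)), where fetch is a placeholder for whatever performs the transfer. When the transfer goes through the standard ftplib module, passing exceptions=ftplib.all_errors covers the errors that module normally raises.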