This is pretty much a work in progress. The pipeline currently consists of the
following scripts (reported with default settings):
*** Generate the underlying taxonomy
$ ./generate_taxonomy.py --output taxonomy.txt --output_red taxonomy_reduced.txt \
--pickle taxonomy.pkl | tee generate_taxonomy.log
[ ~3 hours ]. This script downloads, processes, and saves the NCBI taxonomy in
a standard format with a fixed number of taxonomic levels.
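As an illustration of what "a fixed number of taxonomic levels" can mean in
practice, here is a minimal sketch. The rank list, the placeholder string and
the function name are assumptions made for this example; they are not taken
from generate_taxonomy.py, which may use different levels and labels.

```python
# Hypothetical illustration, not the logic of generate_taxonomy.py.
# Assumed set of ranks; the real script may use a different set.
RANKS = ["superkingdom", "phylum", "class", "order",
         "family", "genus", "species"]


def fixed_lineage(lineage, placeholder="unclassified"):
    """Map a {rank: name} dict onto RANKS, filling any missing rank."""
    return [lineage.get(rank, placeholder) for rank in RANKS]


# A lineage with some ranks missing still yields len(RANKS) fields.
example = {
    "superkingdom": "Bacteria",
    "phylum": "Proteobacteria",
    "genus": "Escherichia",
    "species": "Escherichia coli",
}
print("|".join(fixed_lineage(example)))
```

Every record then has the same number of fields, which keeps downstream
parsing simple.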
*** Download the viral genomes (fna,ffn,faa) with corresponding taxonomy
$ ./repophlan_get_viruses.py --taxonomy taxonomy_reduced.txt --out_dir \
repophlan_viruses --out_summary repophlan_viruses.txt | tee repophlan_viruses.log
[ ~1 hour ]. Downloads and saves all sequences available for viruses (from
RefSeq). The files are saved in the 'repophlan_viruses' folder and the taxonomy
of each downloaded set of files is in 'repophlan_viruses.txt'.
*** Download the microbial (bacteria+archaea) genomes (fna,ffn,faa,frn) with
corresponding taxonomy
$ ./repophlan_get_microbes.py --taxonomy taxonomy_reduced.txt --out_dir \
microbes --nproc 20 --out_summary repophlan_microbes.txt \
| tee repophlan_microbes.log
[ ~6 hours ]. Downloads and saves all available sequences for microbes. It uses
multiple parallel connections to speed up the retrieval. Common exceptions,
including the NCBI ftp temporarily rejecting connections because of load
problems, are handled gracefully with a 'delay and retry' policy.
IMPORTANT: running more than 20 processes in parallel exceeds the number of
connections allowed by NCBI and causes long delays.
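A 'delay and retry' policy of this kind can be sketched as follows. This is a
generic illustration, not the code used in repophlan_get_microbes.py: the
function name, the default number of attempts and the growing delay are all
assumptions.

```python
import time


def retry(func, attempts=5, delay=1.0, backoff=2.0,
          exceptions=(OSError,), sleep=time.sleep):
    """Call func(); on a listed exception, wait and try again.

    The wait grows by `backoff` after each failure. The last failure is
    re-raised so the caller can log or skip the item.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff
```

With the standard ftplib module, a download could be wrapped as
retry(lambda: fetch(path), exceptions=ftplib.all_errors), where fetch is a
hypothetical helper that performs a single transfer.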
*** Download the single-celled eukaryote genomes (i.e. fungi and protozoa, with
fna,ffn,faa) with corresponding taxonomy
$ ./repophlan_get_euks.py --taxonomy taxonomy_reduced.txt --out_dir_fungi fungi \
--out_dir_protozoa protozoa --out_summary_fungi repophlan_fungi.txt \
--out_summary_protozoa repophlan_protozoa.txt | tee repophlan_euks.log
As of Nov 8 2013, RepoPhlAn retrieves:
* 4958 viruses (each with fna, ffn, faa files)
* 12277 microbes (each with fna, ffn, frn, faa files)
For microbes, four files are generated for each genome:
* .fna : the genome in multifasta format (one or more contigs)
* .ffn : all the protein coding genes (no rRNA, tRNA)
* .faa : all the proteins
* .frn : all the non-coding genes (rRNA, tRNA)
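A quick way to check which of the four files are present for a given genome is
sketched below. It assumes one folder per genome holding files with these
extensions; that layout, and the function name, are assumptions for the example
and may not match what the download scripts produce.

```python
from pathlib import Path

# The four file types listed above.
REQUIRED = (".fna", ".ffn", ".faa", ".frn")


def missing_extensions(genome_dir, required=REQUIRED):
    """Return the required extensions with no matching file in genome_dir."""
    present = {p.suffix for p in Path(genome_dir).iterdir() if p.is_file()}
    return [ext for ext in required if ext not in present]
```

An empty list means the genome is complete; otherwise the returned extensions
are the ones still missing.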
Several aspects can be optimized, and with a bit of refinement about 20% more
microbes could be retrieved.
Known issues:
* for almost all assemblies, additional informative files can be downloaded
automatically:
- .asn (probably not useful)
- .gbk (genbank file for the assembly and for each scaffold)
- .gff (annotations in almost free text format)
- .ptt (tab-separated file including COG assignments and product names)
- .rnt (annotation of rRNA and tRNAs)
- .rpt (few metadata)
- .val (binary file, not sure what's inside)
* about 30 strains are missing in the generated taxonomy file. These are
the representatives of sets of strains with multiple assemblies. It should
be an easy fix in generate_taxonomy.py.
* for some microbes and euks only a subset of the four file types is present.
These files are downloaded, but the assembly is left out of
repophlan_microbes.txt for consistency. However, some downstream analyses do
not actually need all four files.
* several inconsistencies are present in the following file:
ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt
Entries with inconsistencies are skipped (with a 'try/except' policy), but a
few of these inconsistencies could actually be handled.
* it would be great to plug a gene-to-function mapping (e.g. KEGG, CAZy, ...)
into RepoPhlAn
* need extensive testing of both the taxonomy and the retrieved files
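The skip-on-error handling mentioned above for prokaryotes.txt can be sketched
like this. The column names ("name", "taxid") and the validation rules are
hypothetical and chosen only for the example; they are not the real header of
the NCBI report nor the checks performed by the pipeline.

```python
import csv
import io


def parse_report(handle, required=("name", "taxid")):
    """Parse a tab-separated report, skipping rows that fail validation.

    Returns (kept, skipped); skipped holds (line number, error) pairs so
    that the rejected entries can be reviewed later.
    """
    kept, skipped = [], []
    reader = csv.DictReader(handle, delimiter="\t")
    # Line 1 is the header, so data rows start at line 2.
    for lineno, row in enumerate(reader, start=2):
        try:
            for col in required:
                if not row[col]:
                    raise ValueError("empty field: " + col)
            row["taxid"] = int(row["taxid"])
        except (KeyError, ValueError, TypeError) as exc:
            skipped.append((lineno, repr(exc)))
            continue
        kept.append(row)
    return kept, skipped


# Tiny made-up input: one valid row, one bad taxid, one truncated row.
text = "name\ttaxid\nE. coli\t562\nbroken\t-\nshort\n"
kept, skipped = parse_report(io.StringIO(text))
print(len(kept), "kept,", len(skipped), "skipped")
```

Recording the skipped line numbers makes it easier to see which
inconsistencies are frequent enough to be worth handling explicitly.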
http://www.standardsingenomics.com/content/9/1/20#B11