-
Notifications
You must be signed in to change notification settings - Fork 26
SCG sets
This will be formatted and all code added soon, but here is the gist of it for now.
-
I took all 17,929 Pfams from the current release and filtered out those that didn’t cover more than 50% (on average) of the underlying protein sequences that went into building that Pfams HMM (this ensure there will be no two Pfams from the same source protein included). This left 8,924 Pfams.
-
I then downloaded all “complete” bacterial genomes currently available from NCBI (11,405 as accessed 09-DEC-2018) and ran the HMM search counting how many hits were reported for each pfam, for each genome.
-
I then only kept Pfams with only 1 hit in greater than or equal to 90% of all genomes to make the Bacterial SCG-set, and did the same for all Phyla with greater than 99 genomes, and all Proteobacterial classes with greater than 99 genomes.
So, for instance, if I were making a tree of all the Staphylococcus genomes I could get my hands on, I would use the Firmicutes SCG-set.
The same was done for Archaea (320 genomes as accessed 15-DEC-2018).
To make the "Universal" set in order to have an HMM set for treeing across all 3 domains, pfams were retained for each domain if they were detected in exactly 1 copy in greater than 50% of the total genomes for that domain. Then the intersection of those pfams from all 3 domains were combined. This left 19 HMMs, and is useful for treeing, but not useful for estimating completeness/redundancy.
Here are the currently available HMM-sets:
HMM set | Number of gene targets |
---|---|
Actinobacteria | 138 |
Alphaproteobacteria | 117 |
Archaea | 76 |
Bacteria | 74 |
Bacteroidetes | 90 |
Betaproteobacteria | 203 |
Chlamydiae | 286 |
Cyanobacteria | 251 |
Epsilonproteobacteria | 260 |
Firmicutes | 119 |
Gammaproteobacteria | 172 |
Proteobacteria | 119 |
Tenericutes | 99 |
Universal | 19 |
Here are presented all of the steps taken to generate the bacterial gene-set.
###### This log holds the command-line steps take to generate the Pfam list that was used to search all genomes. ######
## downloading Pfam release 32.0
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
## Filtering Pfam-HMMs by coverage of underlying proteins that built the model
# some unique Pfam-HMMs hold distinct domains from the same proteins
# this is not ideal when the purpose is to identify single-copy genes
# for completeness/redundancy estimates or for phylogenomics
# To help alleviate that (or most likely eliminate it completely), i only
# considered Pfam-HMMs that covered on average more than 50% of the
# underlying proteins that built the model.
# Pfam provides this information in the table downloaded here:
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/database_files/pfamA.txt.gz
gunzip pfamA.txt.gz
# cutting down to just pfam ID and coverage
cut -f 1,36 pfamA.txt > All_pfam_avg_covs.tsv
# filtering to keep only those with greater than 50% coverage
awk ' $2 > 50 { print $1 } ' All_pfam_avg_covs.tsv > coverage-filtered-pfam-IDs-no-version-info
## making subset hmm file
# hmmfetch won't work without the version numbers attached, which are not in the Pfam IDs from the "pfamA.txt" summary table
# so creating a list of all with the current version numbers attached:
grep "^ACC" Pfam-A.hmm | tr -s " " "\t" | cut -f2 > all-pfam-IDs
# pulling out targets:
for i in $(cat coverage-filtered-pfam-IDs-no-version-info); do grep -m1 "$i" all-pfam-IDs; done > coverage-filtered-pfam-IDs
# now generating subset:
hmmfetch -f Pfam-A.hmm coverage-filtered-pfam-IDs > Pfam-A_coverage_filtered.hmm
###### That subset, "Pfam-A_coverage_filtered.hmm", is what was used to search the bacterial and archaeal genomes. ######
## downloading bacterial genome info with E-Direct (this returned 11,405 with this search on 09-DEC-2018) ##
esearch -query '"bacteria"[filter] AND "Complete genome"[filter] AND latest[filter] NOT anomalous[filter] AND "has annotation"[Properties]' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession Taxid SpeciesTaxid Organism AssemblyStatus RefSeq_category FtpPath_GenBank FtpPath_RefSeq FtpPath_Assembly_rpt > Complete_genome_bacteria_ncbi_info.tsv
# adding header
cat <(echo "AssemblyAccession Taxid SpeciesTaxid Organism AssemblyStatus RefSeq_category FtpPath_GenBank FtpPath_RefSeq FtpPath_Assembly_rpt" | tr " " "\t") Complete_genome_bacteria_ncbi_info.tsv > t && mv t Complete_genome_bacteria_ncbi_info.tsv
# getting all genome accs
tail -n +2 Complete_genome_bacteria_ncbi_info.tsv | cut -f1 > all_genome_accs
### building up links to download genes in parallel, splitting which have genbank accs and which have refseq accs because that determines which ftp path to use ###
grep "^GCA_" Complete_genome_bacteria_ncbi_info.tsv > genbanks
grep "^GCF_" Complete_genome_bacteria_ncbi_info.tsv > refseqs
cut -f1,7 genbanks > genbank_accs_and_base_links # making 2-column table of accessions and base links
cut -f 2 genbank_accs_and_base_links > genbank_targets_p1 # getting base link alone
cut -f10 -d "/" genbank_targets_p1 > genbank_targets_p2 # to build the filename to the genes explicitly, need this component
paste -d "/" genbank_targets_p1 genbank_targets_p2 | sed 's/$/_protein.faa.gz/' > genbank_targets # finishing the full path to the genes download
cut -f1 genbank_accs_and_base_links | sed 's/$/_protein.faa.gz/' > genbank_names # creating file names to store the downloads
# doing the same as above but for the refseq accessions
cut -f1,8 refseqs > refseq_accs_and_base_links
cut -f 2 refseq_accs_and_base_links > refseq_targets_p1
cut -f10 -d "/" refseq_targets_p1 > refseq_targets_p2
paste -d "/" refseq_targets_p1 refseq_targets_p2 | sed 's/$/_protein.faa.gz/' > refseq_targets
cut -f1 refseq_accs_and_base_links | sed 's/$/_protein.faa.gz/' > refseq_names
# sticking both sets together
cat genbank_names refseq_names > all_names
cat genbank_targets refseq_targets > all_targets
### downloading in parallel ###
parallel --xapply -j 30 curl -L {1} -o {2} :::: all_targets :::: all_names
### unzipping, renaming headers of each to have the assembly accession for easy parsing after hmm search, and catting all genes together into one file ###
bash unzip_rename_cat_genes.sh all_genome_accs
# total genes
grep -c ">" all_genes.faa
42829397
### splitting to be able to search in parallel ###
mkdir split_seqs
cd split_seqs
split -d -l 856588 ../all_genes.faa sub_
cd ../
### running hmmsearch with >= 50% coverage-filtered Pfams
time find split_seqs/ -name "sub*" | parallel -I% --max-args 1 -j 50 hmmsearch --cut_ga --cpu 2 --tblout %_pfam_hits.tab Pfam-A_coverage_filtered.hmm % > /dev/null
### combining all hmm-results files ###
cat split_seqs/*.tab > All_bacterial_pfam_hmm_results.tab
### counting how many hits to each pfam for each genome ###
# getting all coverage-filtered pfam IDs
grep "^ACC" Pfam-A_coverage_filtered.hmm | tr -s " " "\t" | cut -f2 > coverage_filtered_pfam_ids
python pfam_counting.py -p coverage_filtered_pfam_ids -g all_genome_accs -H All_bacterial_pfam_hmm_results.tab -o All_bacterial_pfam_hmm_counts.tsv
## this output table holds all genomes as rows, all include pfams as columns, and counts in the cells
###### "All_bacterial_pfam_hmm_counts.tsv" read into R for parsing ######
### in R ###
tab <- read.table("All_bacterial_pfam_hmm_counts.tsv", header=T, row.names=1, sep="\t")
head(tab[,1:10])
dim(tab)
all_genomes <- row.names(tab)
# getting how many genomes have exactly 1 hit for each pfam
num_genomes_with_1_hit_for_each_pfam <- apply(tab, 2, function(x) sum(x == 1))
# getting percentages of genomes with exactly 1 hit for each pfam
perc_genomes_with_1_hit_for_each_pfam <- round(num_genomes_with_1_hit_for_each_pfam / length(all_genomes) * 100, 2)
# getting pfam ids for only those where 90% or greater of genomes have exactly 1 hit
target_pfam_ids <- names(perc_genomes_with_1_hit_for_each_pfam[perc_genomes_with_1_hit_for_each_pfam >= 90])
# writing out
write(target_pfam_ids, "Bacterial_SCG_pfam_IDs.txt")
### out of R ###
### then the hmm file was filtered down to these target IDs
hmmfetch -f Pfam-A_coverage_filtered.hmm Bacterial_SCG_pfam_IDs.txt > Bacteria.hmm
Home -- What is GToTree? -- Installation -- Example Usage -- User Guide -- SCG-sets -- Things to Consider
- Home
- What is GToTree?
- Installation
- Example usage
- User Guide
- SCG-sets
- Things to consider