SCG sets

This will be formatted and all code added soon, but here is the gist of it for now.

I took all 17,929 Pfams from the current release and filtered out those that didn’t cover more than 50% (on average) of the underlying protein sequences that went into building that Pfams HMM (this ensure there will be no two Pfams from the same source protein included). This left 8,924 Pfams.
I then downloaded all “complete” bacterial genomes currently available from NCBI (11,405 as accessed 09-DEC-2018) and ran the HMM search counting how many hits were reported for each pfam, for each genome.
I then only kept Pfams with only 1 hit in greater than or equal to 90% of all genomes to make the Bacterial SCG-set, and did the same for all Phyla with greater than 99 genomes, and all Proteobacterial classes with greater than 99 genomes.

So, for instance, if I were making a tree of all the Staphylococcus genomes I could get my hands on, I would use the Firmicutes SCG-set.

The same was done for Archaea (320 genomes as accessed 15-DEC-2018).

To make the "Universal" set in order to have an HMM set for treeing across all 3 domains, pfams were retained for each domain if they were detected in exactly 1 copy in greater than 50% of the total genomes for that domain. Then the intersection of those pfams from all 3 domains were combined. This left 19 HMMs, and is useful for treeing, but not useful for estimating completeness/redundancy.

Here are the currently available HMM-sets:

HMM set	Number of gene targets
Actinobacteria	138
Alphaproteobacteria	117
Archaea	76
Bacteria	74
Bacteroidetes	90
Betaproteobacteria	203
Chlamydiae	286
Cyanobacteria	251
Epsilonproteobacteria	260
Firmicutes	119
Gammaproteobacteria	172
Proteobacteria	119
Tenericutes	99
Universal	19

Code

Here are presented all of the steps taken to generate the bacterial gene-set.

Downloading and filtering Pfams

###### This log holds the command-line steps take to generate the Pfam list that was used to search all genomes. ######
## downloading Pfam release 32.0
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz

## Filtering Pfam-HMMs by coverage of underlying proteins that built the model
# some unique Pfam-HMMs hold distinct domains from the same proteins
# this is not ideal when the purpose is to identify single-copy genes
# for completeness/redundancy estimates or for phylogenomics
# To help alleviate that (or most likely eliminate it completely), i only
# considered Pfam-HMMs that covered on average more than 50% of the
# underlying proteins that built the model.
# Pfam provides this information in the table downloaded here:

wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/database_files/pfamA.txt.gz
gunzip pfamA.txt.gz

# cutting down to just pfam ID and coverage
cut -f 1,36 pfamA.txt > All_pfam_avg_covs.tsv

# filtering to keep only those with greater than 50% coverage
awk ' $2 > 50 { print $1 } ' All_pfam_avg_covs.tsv > coverage-filtered-pfam-IDs-no-version-info

## making subset hmm file
# hmmfetch won't work without the version numbers attached, which are not in the Pfam IDs from the "pfamA.txt" summary table

# so creating a list of all with the current version numbers attached:
grep "^ACC" Pfam-A.hmm | tr -s " " "\t" | cut -f2 > all-pfam-IDs

# pulling out targets:
for i in $(cat coverage-filtered-pfam-IDs-no-version-info); do grep -m1 "$i" all-pfam-IDs; done > coverage-filtered-pfam-IDs

# now generating subset:
hmmfetch -f Pfam-A.hmm coverage-filtered-pfam-IDs > Pfam-A_coverage_filtered.hmm

###### That subset, "Pfam-A_coverage_filtered.hmm", is what was used to search the bacterial and archaeal genomes. ######

Scanning all complete bacterial genomes from NCBI

## downloading bacterial genome info with E-Direct (this returned 11,405 with this search on 09-DEC-2018) ##
esearch -query '"bacteria"[filter] AND "Complete genome"[filter] AND latest[filter] NOT anomalous[filter] AND "has annotation"[Properties]' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession Taxid SpeciesTaxid Organism AssemblyStatus RefSeq_category FtpPath_GenBank FtpPath_RefSeq FtpPath_Assembly_rpt > Complete_genome_bacteria_ncbi_info.tsv

 # adding header
cat <(echo "AssemblyAccession Taxid SpeciesTaxid Organism AssemblyStatus RefSeq_category FtpPath_GenBank FtpPath_RefSeq FtpPath_Assembly_rpt" | tr " " "\t") Complete_genome_bacteria_ncbi_info.tsv > t && mv t Complete_genome_bacteria_ncbi_info.tsv

 # getting all genome accs
tail -n +2 Complete_genome_bacteria_ncbi_info.tsv | cut -f1 > all_genome_accs

### building up links to download genes in parallel, splitting which have genbank accs and which have refseq accs because that determines which ftp path to use ###
grep "^GCA_" Complete_genome_bacteria_ncbi_info.tsv > genbanks
grep "^GCF_" Complete_genome_bacteria_ncbi_info.tsv > refseqs

cut -f1,7 genbanks > genbank_accs_and_base_links # making 2-column table of accessions and base links
cut -f 2 genbank_accs_and_base_links > genbank_targets_p1 # getting base link alone
cut -f10 -d "/" genbank_targets_p1 > genbank_targets_p2 # to build the filename to the genes explicitly, need this component
paste -d "/" genbank_targets_p1 genbank_targets_p2 | sed 's/$/_protein.faa.gz/' > genbank_targets # finishing the full path to the genes download
cut -f1 genbank_accs_and_base_links | sed 's/$/_protein.faa.gz/' > genbank_names # creating file names to store the downloads


# doing the same as above but for the refseq accessions
cut -f1,8 refseqs > refseq_accs_and_base_links
cut -f 2 refseq_accs_and_base_links > refseq_targets_p1
cut -f10 -d "/" refseq_targets_p1 > refseq_targets_p2
paste -d "/" refseq_targets_p1 refseq_targets_p2 | sed 's/$/_protein.faa.gz/' > refseq_targets
cut -f1 refseq_accs_and_base_links | sed 's/$/_protein.faa.gz/' > refseq_names


# sticking both sets together
cat genbank_names refseq_names > all_names
cat genbank_targets refseq_targets > all_targets


### downloading in parallel ###
parallel --xapply -j 30 curl -L {1} -o {2} :::: all_targets :::: all_names

###  unzipping, renaming headers of each to have the assembly accession for easy parsing after hmm search, and catting all genes together into one file ###
bash unzip_rename_cat_genes.sh all_genome_accs

# total genes
grep -c ">" all_genes.faa
42829397

### splitting to be able to search in parallel ###
mkdir split_seqs
cd split_seqs
split -d -l 856588 ../all_genes.faa sub_
cd ../

### running hmmsearch with >= 50% coverage-filtered Pfams
time find split_seqs/ -name "sub*" | parallel -I% --max-args 1 -j 50 hmmsearch --cut_ga --cpu 2 --tblout %_pfam_hits.tab Pfam-A_coverage_filtered.hmm % > /dev/null


### combining all hmm-results files ###
cat split_seqs/*.tab > All_bacterial_pfam_hmm_results.tab

### counting how many hits to each pfam for each genome ###

# getting all coverage-filtered pfam IDs
grep "^ACC" Pfam-A_coverage_filtered.hmm | tr -s " " "\t" | cut -f2 > coverage_filtered_pfam_ids

python pfam_counting.py -p coverage_filtered_pfam_ids -g all_genome_accs -H All_bacterial_pfam_hmm_results.tab -o All_bacterial_pfam_hmm_counts.tsv
  ## this output table holds all genomes as rows, all include pfams as columns, and counts in the cells

###### "All_bacterial_pfam_hmm_counts.tsv" read into R for parsing  ######
### in R ###
tab <- read.table("All_bacterial_pfam_hmm_counts.tsv", header=T, row.names=1, sep="\t")

head(tab[,1:10])
dim(tab)

all_genomes <- row.names(tab)

# getting how many genomes have exactly 1 hit for each pfam
num_genomes_with_1_hit_for_each_pfam <- apply(tab, 2, function(x) sum(x == 1))

# getting percentages of genomes with exactly 1 hit for each pfam
perc_genomes_with_1_hit_for_each_pfam <- round(num_genomes_with_1_hit_for_each_pfam / length(all_genomes) * 100, 2)

# getting pfam ids for only those where 90% or greater of genomes have exactly 1 hit
target_pfam_ids <- names(perc_genomes_with_1_hit_for_each_pfam[perc_genomes_with_1_hit_for_each_pfam >= 90])

# writing out
write(target_pfam_ids, "Bacterial_SCG_pfam_IDs.txt")
### out of R ###

### then the hmm file was filtered down to these target IDs
hmmfetch -f Pfam-A_coverage_filtered.hmm Bacterial_SCG_pfam_IDs.txt > Bacteria.hmm

Home -- What is GToTree? -- Installation -- Example Usage -- User Guide -- SCG-sets -- Things to Consider

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SCG sets

Code

Downloading and filtering Pfams

Scanning all complete bacterial genomes from NCBI

Clone this wiki locally