SCG sets

This will be formatted and all code added soon, but here is the gist of it for now.

I took all 17,929 Pfams from the current release and filtered out those that didn’t cover more than 50% (on average) of the underlying protein sequences that went into building that Pfams HMM (this ensure there will be no two Pfams from the same source protein included). This left 8,924 Pfams.
I then downloaded all “complete” bacterial genomes currently available from NCBI (11,405 as accessed 09-DEC-2018) and ran the HMM search counting how many hits were reported for each pfam, for each genome.
I then only kept Pfams with only 1 hit in greater than or equal to 90% of all genomes to make the Bacterial SCG-set, and did the same for all Phyla with greater than 99 genomes, and all Proteobacterial classes with greater than 99 genomes.

So, for instance, if I am making a tree of all the Staphylococcus genomes I can get my hands on, I would use the Firmicutes SCG-set.

Here are the currently available HMM-sets:

   Actinobacteria.hmm                (138 genes)
   Alphaproteobacteria.hmm           (117 genes)
   Archaea.hmm                        (76 genes)
   Bacteria.hmm                       (74 genes)
   Bacteroidetes.hmm                  (90 genes)
   Betaproteobacteria.hmm            (203 genes)
   Chlamydiae.hmm                    (286 genes)
   Cyanobacteria.hmm                 (251 genes)
   Epsilonproteobacteria.hmm         (260 genes)
   Firmicutes.hmm                    (119 genes)
   Gammaproteobacteria.hmm           (172 genes)
   Proteobacteria.hmm                (119 genes)
   Tenericutes.hmm                    (99 genes)
   Universal_Hug_et_al.hmm            (15 genes)

Code

Downloading and filtering Pfams

###### This log holds the command-line steps take to generate the Pfam list that was used to search all genomes. ######
## downloading Pfam release 32.0
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz

## Filtering Pfam-HMMs by coverage of underlying proteins that built the model
# some unique Pfam-HMMs hold distinct domains from the same proteins
# this is not ideal when the purpose is to identify single-copy genes
# for completeness/redundancy estimates or for phylogenomics
# To help alleviate that (or most likely eliminate it completely), i only
# considered Pfam-HMMs that covered on average more than 50% of the
# underlying proteins that built the model.
# Pfam provides this information in the table downloaded here:

wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/database_files/pfamA.txt.gz
gunzip pfamA.txt.gz

# cutting down to just pfam ID and coverage
cut -f 1,36 pfamA.txt > All_pfam_avg_covs.tsv

# filtering to keep only those with greater than 50% coverage
awk ' $2 > 50 { print $1 } ' All_pfam_avg_covs.tsv > coverage-filtered-pfam-IDs-no-version-info

## making subset hmm file
# hmmfetch won't work without the version numbers attached, which are not in the Pfam IDs from the "pfamA.txt" summary table

# so creating a list of all with the current version numbers attached:
grep "^ACC" Pfam-A.hmm | tr -s " " "\t" | cut -f2 > all-pfam-IDs

# pulling out targets:
for i in $(cat coverage-filtered-pfam-IDs-no-version-info); do grep -m1 "$i" all-pfam-IDs; done > coverage-filtered-pfam-IDs

# now generating subset:
hmmfetch -f Pfam-A.hmm coverage-filtered-pfam-IDs > Pfam-A_coverage_filtered.hmm

###### That subset, "Pfam-A_coverage_filtered.hmm", is what was used to search the bacterial and archaeal genomes. ######

Scanning all complete bacterial genomes from NCBI

## downloading bacterial genome info with E-Direct (this returned 11,405 with this search on 09-DEC-2018) ##
esearch -query '"bacteria"[filter] AND "Complete genome"[filter] AND latest[filter] NOT anomalous[filter] AND "has annotation"[Properties]' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession Taxid SpeciesTaxid Organism AssemblyStatus RefSeq_category FtpPath_GenBank FtpPath_RefSeq FtpPath_Assembly_rpt > Complete_genome_bacteria_ncbi_info.tsv

 # adding header
cat <(echo "AssemblyAccession Taxid SpeciesTaxid Organism AssemblyStatus RefSeq_category FtpPath_GenBank FtpPath_RefSeq FtpPath_Assembly_rpt" | tr " " "\t") Complete_genome_bacteria_ncbi_info.tsv > t && mv t Complete_genome_bacteria_ncbi_info.tsv

 # getting all genome accs
tail -n +2 Complete_genome_bacteria_ncbi_info.tsv | cut -f1 > all_genome_accs

### building up links to download genes in parallel, splitting which have genbank accs and which have refseq accs because that determines which ftp path to use ###
grep "^GCA_" Complete_genome_bacteria_ncbi_info.tsv > genbanks
grep "^GCF_" Complete_genome_bacteria_ncbi_info.tsv > refseqs

cut -f1,7 genbanks > genbank_accs_and_base_links # making 2-column table of accessions and base links
cut -f 2 genbank_accs_and_base_links > genbank_targets_p1 # getting base link alone
cut -f10 -d "/" genbank_targets_p1 > genbank_targets_p2 # to build the filename to the genes explicitly, need this component
paste -d "/" genbank_targets_p1 genbank_targets_p2 | sed 's/$/_protein.faa.gz/' > genbank_targets # finishing the full path to the genes download
cut -f1 genbank_accs_and_base_links | sed 's/$/_protein.faa.gz/' > genbank_names # creating file names to store the downloads


# doing the same as above but for the refseq accessions
cut -f1,8 refseqs > refseq_accs_and_base_links
cut -f 2 refseq_accs_and_base_links > refseq_targets_p1
cut -f10 -d "/" refseq_targets_p1 > refseq_targets_p2
paste -d "/" refseq_targets_p1 refseq_targets_p2 | sed 's/$/_protein.faa.gz/' > refseq_targets
cut -f1 refseq_accs_and_base_links | sed 's/$/_protein.faa.gz/' > refseq_names


# sticking both sets together
cat genbank_names refseq_names > all_names
cat genbank_targets refseq_targets > all_targets


### downloading in parallel ###
parallel --xapply -j 30 curl -L {1} -o {2} :::: all_targets :::: all_names

###  unzipping, renaming headers of each to have the assembly accession for easy parsing after hmm search, and catting all genes together into one file ###
bash unzip_rename_cat_genes.sh all_genome_accs

# total genes
grep -c ">" all_genes.faa
42829397

### splitting to be able to search in parallel ###
mkdir split_seqs
cd split_seqs
split -d -l 856588 ../all_genes.faa sub_
cd ../

### running hmmsearch with >= 50% coverage-filtered Pfams
time find split_seqs/ -name "sub*" | parallel -I% --max-args 1 -j 50 hmmsearch --cut_ga --cpu 2 --tblout %_pfam_hits.tab Pfam-A_coverage_filtered.hmm % > /dev/null


### combining all hmm-results files ###
cat split_seqs/*.tab > All_bacterial_pfam_hmm_results.tab

### counting how many hits to each pfam for each genome ###

# getting all coverage-filtered pfam IDs
grep "^ACC" Pfam-A_coverage_filtered.hmm | tr -s " " "\t" | cut -f2 > coverage_filtered_pfam_ids

python pfam_counting.py -p coverage_filtered_pfam_ids -g all_genome_accs -H All_bacterial_pfam_hmm_results.tab -o All_bacterial_pfam_hmm_counts.tsv
  ## this output table holds all genomes as rows, all include pfams as columns, and counts in the cells

###### "All_bacterial_pfam_hmm_counts.tsv" read into R for parsing, see "Pfam-HMM-parsing.R" ######

### then the hmm file was filtered down to these target IDs
hmmfetch -f Pfam-A_coverage_filtered.hmm Bacterial_SCG_pfam_IDs.txt > Lee_Bacterial_SCGs.hmm

Home -- What is GToTree? -- Installation -- Example Usage -- User Guide -- SCG-sets -- Things to Consider

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SCG sets

Code

Downloading and filtering Pfams

Scanning all complete bacterial genomes from NCBI

Clone this wiki locally