Copyright Lee H. Bergstrand and Jackson M. Tsuji, 2025
backblast
automates the use of NCBI BLASTP to search for genes or gene clusters within bacterial genomes.
Non-orthologous genes are filtered out by identifying and extracting only bidirectional best BLASTP hits using a graph-based algorithm.
backblast
then visualizes the results of bidirectional BLASTP in a convenient gene heatmap coupled to a genome phylogeny.
The bidirectional BLASTP-based filtering algorithm is illustrated below:
Example gene heatmap visualization:
(From Spasov, Tsuji, et al., 2020, doi:10.1038/s41396-020-0650-2)
- OS: runs on linux (e.g., Ubuntu) and MacOS (tested on Sonoma 14 and Sequoia 15)
- Hardware: most modern computers (e.g., with basic CPU, >=4 GB RAM, and >=4 GB of free storage) should be able to run BackBLAST without issue. The only exception is if you create a genome tree within the pipeline, in which case you'll need a fair amount of CPU and time to calculate large trees.
- Software: miniconda or miniforge needs to be installed (must be set to the
osx-64
channel for MacOS). (Because conda is used during the run to install additional tools for the workflow, you will need an internet connection the first time you run BackBLAST.)
Installation is currently semi-manual. We hope to make an automated conda install in the future.
Run the following code in your command line (e.g., Terminal) to install BackBLAST:
# Download the repo
git clone https://github.com/LeeBergstrand/BackBLAST_Reciprocal_BLAST.git
cd BackBLAST_Reciprocal_BLAST
# git checkout develop # optionally go to a specific branch
# git checkout v2.0.0-beta1 # optionally go to a specific release
# Create the conda env
conda env create -n backblast --file=environment.yml
# Copy the key repo contents into a conda share folder
conda activate backblast
mkdir -p ${CONDA_PREFIX}/share/backblast
cp -r * ${CONDA_PREFIX}/share/backblast
# Remove the original repo
cd ..
rm -rf BackBLAST_Reciprocal_BLAST
# Add instructions to export the BackBLAST folder to the PATH when the repo activates
mkdir -p ${CONDA_PREFIX}/etc/conda/activate.d
if [[ ! -f ${CONDA_PREFIX}/etc/conda/activate.d/env_vars.sh ]]; then
echo '#!/bin/sh' > ${CONDA_PREFIX}/etc/conda/activate.d/env_vars.sh
fi
echo "export PATH=\${PATH}:${CONDA_PREFIX}/share/backblast:${CONDA_PREFIX}/share/backblast/scripts" \
>> ${CONDA_PREFIX}/etc/conda/activate.d/env_vars.sh
chmod 755 ${CONDA_PREFIX}/etc/conda/activate.d/env_vars.sh
# Re-activate the repo to apply the changes
conda deactivate
conda activate backblast
Now you should be good to go! Run backblast -h
to get started.
# Set up the run
backblast setup query.faa query_genome.faa subject_dir output_dir
# Then edit output_dir/config.yaml
# You can also edit output_dir/gene_metadata.tsv and output_dir/genome_metadata.tsv to make the plot look better
# Start the run
backblast run output_dir/config.yaml output_dir
# All done! You can iteratively refine the plot from here as you'd like.
See more detailed instructions below.
Get beautiful heatmaps using BackBLAST by following the step-by-step instructions below.
Prepare the following files and folders as input for BackBLAST:
query.faa
: multi-fasta file containing the query proteins that you want to search for. These should be copied and pasted from thequery_genome.faa
file below (the FastA headers for each protein need to be the same as in that file).query_genome.faa
: multi-fasta file containing all proteins encoded by the query bacterial genome.subject_dir
: a folder containing the subjects for the search. Each subject should be the protein-coding genes (as amino acids) for a bacterial genome. Save each subject as multi-fasta file (likequery_genome.faa
above). By default, the extension of each file needs to be.faa
, but this can be changed in the tool settings if desired.output_dir
: an empty folder for saving the output of BackBLAST.
You are then ready to run BackBLAST, using either method 2a or 2b below.
Gives you the ability to customize the parameters of your run.
backblast setup query.faa query_genome.faa subject_dir output_dir
This makes several configuration files in the output_dir
.
Open and edit output_dir/config.yaml
(e.g., in a text editor) to customize the settings for your run. Some key settings include:
- Thresholds for bidirectional BLASTP: you can set the e-value, percent identity, and query coverage cutoffs for the BLASTP search.
- Phylogenetic tree: you can also set whether you want to auto-generate a genome-based phylogenetic tree during your run (takes time)
or if you want to supply a custom pre-generated phylogenetic tree.
- If you provide your own phylogenetic tree, the genome IDs in the tree need to match the file names of the genomes in the
subject_dir
. In addition, all genomes in the phylogenetic tree need to be present in thesubject_dir
, or else BackBLAST will fail. If some genomes are in thesubject_dir
but are missing in the tree, then BackBLAST will still proceed but will drop those genomes from the analysis.) - Alternatively, you can also choose to not add a phylogenetic tree to the heatmap.
- If you provide your own phylogenetic tree, the genome IDs in the tree need to match the file names of the genomes in the
- Tree rooting: one other helpful setting is related to tree rooting. You can choose to root the tree automatically by midpoint or provide the name of one of the genomes that you want to serve as the root of the tree. Alternatively, you can choose to use the current tree topology as-is.
- Plot width and height: you can adjust the final width and height of the heatmap as well. It might take a few rounds of running BackBLAST to find the "perfect" width and height for your heatmap (see below).
- Take a look at the full config file to see other advanced settings.
You also have the opportunity (optionally) to edit the tab-separeted tables output_dir/gene_metadata.tsv
and output_dir/genome_metadata.tsv
to improve data visualization. You can open these in a text editor or a table editor (like Excel).
- In
gene_metadata.tsv
, you can provide human-readable names for the query genes you want to search for. You can also customize the order that the genes will be shown in the heatmap by changing how they are sorted in this file. If you delete rows, then those genes will be removed removed from the heatmap during the run. - In
genome_metadata.tsv
, you can provide human-readable names for the genomes in the run. The sort order of the genomes in this file only impacts the order that the genomes are plotted in if you choose to not use a phylogenetic tree (in the config file above). Otherwise, the sort order is determined using the phylogenetic tree.
backblast run output_dir/config.yaml output_dir
This should generate a series of blast and heatmap-related files in the output folder.
Then, take a look at the key output files heatmap/BackBLAST_heatmap.tsv
and heatmap/BackBLAST_heatmap.pdf
.
The first of these files shows the final bidirectional BLASTP results as a tab-separated table. The second is the final heatmap.
If you aren't satisfied with the results, you can selectively delete run files from the output folder, tweak the settings files above, and then re-run BackBLAST (via step 2a-iii) to re-generate the desired files with the new run settings.
Three common things to change are:
- If your BLASTP results are too stringent or not stringent enough: delete the
blast
folder, tweak the key BLASTP settings (in the config file), and then re-runbackblast run
. This will redo everything from theblast
step and will overwrite the old heatmap files. - If your visualization doesn't look right (e.g., the height and width aren't optimal, the genome/gene names aren't ideal, or the genes aren't sorted
as you'd like), then delete the
heatmap
folder, tweak the config and metadata files (mentioned above), and re-runbackblast run
. BackBLAST will then use the oldblast
results and just generate a new heatmap. - If you want to use a different phylogenetic tree, then point to a new phylogenetic tree file in the config file, then re-run
backblast run
. BackBLAST will use the oldblast
results but should overwrite the old heatmap files with new ones using the updated tree.
The PDF heatmap will not be perfect (e.g., the dashed lines between the genome names and tree tips won't be perfectly aligned), but you can open the heatmap in a PDF editor like Inkscape and clean it up or customize it as desired.
Gets the job done without any custom settings. This command skips the setup setup (above) and just runs the pipeline. You can set some of the key config settings via optional flags.
backblast auto [OPTIONS] query.faa query_genome.faa subject_dir output_dir
# See the different OPTIONS for customizing your run in the full settings at the bottom of this README.
The level of fine-tuning possible using this method is much less than in Method 2a, but it is a nice way to quickly see some initial bidirectional BLASTP results.
In addition, you can iteratively tweak the run results and re-run using backblast run
just like described in step 2a-iv above.
Once done, you can then edit the PDF heatmap as described in step 2a-v above.
One disadvantage of BackBLAST is that it has trouble handling paralogous genes (e.g., gene duplicates).
- If the query genome contains paralogs of your query gene, this can sometimes cause BackBLAST to miss real functional gene hits
for your query gene in some subject genomes. If the gene in the subject genome happens to be a closer match to one of the paralogs
in the query genome than to your query, then even if the gene in the subject genome is a valid functional gene for the process you are
interested in, it will be screened out by the BackBLAST algorithm and not reported. This can lead to an under-estimation of the true
functional gene content of subject genomes.
- As a partial workaround, we recommend to check for paralogs of your query genes in your query genome before running BackBLAST so you are aware if there could be paralog-related issues. One easy way to check for paralogs is to run BLASTP for each of your query genes against the query genome and look for high-scoring hits. You can then watch for unexpected results when running BackBLAST. If you are not hitting real functional genes that you want to detect in the subject genomes due to having paralogs in the query genome, then you can mask out (i.e., delete) the paralogs of your query gene in the query genome file and re-run BackBLAST. Make sure you know what you are doing if you do this, though... in some use cases, you really want the paralogs to be there to filter out non-true hits to your query.
- If the subject genomes contain paralogs (or multiple copies) of the query, then BackBLAST will not report this information. It will just
show the best hit to your query gene and ignore the other hits. This is non-ideal if you want to know how many paralogs are in the subject genomes.
- There is no built-in workaround for this in BackBLAST at the moment.
Try a test run from inside the repo with:
mkdir -p testing/outputs
# Make sure backblast is added to your PATH before running the test
backblast run testing/inputs/config.yaml testing/outputs --notemp
# See if the output files looks as expected
cmp testing/outputs/blast/combine_blast_tables/blast_tables_combined.csv \
testing/outputs_expected/blast/combine_blast_tables/blast_tables_combined.csv
cmp testing/outputs/heatmap/BackBLAST_heatmap.tsv \
testing/outputs_expected/heatmap/BackBLAST_heatmap.tsv
# Clean up test if everything looks good
rm -r testing/outputs
We hope that BackBLAST is helpful for you! If you use BackBLAST in your research, please cite the following paper, which describes the initial version of BackBLAST:
Bergstrand LH, Cardenas E, Holert J, Van Hamme JD, Mohn WW. Delineation of steroid-degrading microorganisms through comparative genomic analysis. mBio 7:10.1128/mbio.00166-16.
Help messages (e.g., from running backblast -h
) are pasted below. These show some of the advanced options possible in
the command line (e.g., for backblast auto
).
backblast
backblast: pipeline to search for and visualize gene homologs across multiple genomes.
Please specify a run mode for the main workflow for further usage instructions:
1. setup: for setting up pre-run configuration files
2. run: to start a run using configuration files
3. auto: to skip setup and run the pipeline end-to-end with default values
Or run a specific step in the workflow manually:
search: performs reciprocal BLASTP on given inputs
remove_duplicates: removes duplicate hits from reciprocal BLASTP results
create_blank_results: creates a blank output BLAST table for entries with no reciprocal BLASTP hits
combine_tables: combines a set of input BLAST tables
generate_heatmap: generates a heatmap from input BLAST and phylogeny data
Advanced parameters (use with care):
-U utils_dir: Path to the BackBLAST utility script directory for running specific workflow steps
[default: /Users/jmtsuji/mambaforge/envs/backblast/share/backblast/scripts]
backblast setup
backblast setup: module for setting up a BackBLAST run.
Usage: backblast setup [OPTIONS] query_filepath query_genome_filepath subject_genome_directory output_directory
Positional arguments (required):
query_filepath: path to the query predicted protein sequences from the query genome, FastA format
query_genome_filepath: path to the predicted proteins of the entire query genome, FastA format
subject_genome_directory: directory containing predicted proteins of all subject genomes (FastA format).
One genome per file, with extension 'faa' (or specify -x)
output_directory: directory where config ('config.yaml') and metadata templates ('genome_metadata.tsv', 'gene_metadata.tsv') will be created
Optional arguments:
-t phylogenetic_tree_newick: path to the pre-calculated phylogenetic tree [default: 'subjects' - auto-calculate tree]
Specify 'NA' to skip tree generation and plot the heatmap alone.
-b bootstrap_cutoff: numerical value (e.g., 80) under which bootstrap values will not be displayed [default: NA]
-r root_name: Exact name of the tree tip label at the desired root of the tree [default: NA; will skip rooting]
-e evalue: e-value cutoff for reciprocal BLASTP [default: 1e-40]
-p pident: percent identity cutoff for reciprocal BLASTP [default: 25]
-c qcov: percent query coverage cutoff for reciprocal BLASTP [default: 50]
-x genome_extension: extension for predicted protein files of subject genomes [default: faa]
-@ threads: maximum threads to use for any process [default: 1]
Advanced parameters (use with care):
-T template_config: Path to the template config file used in setup [default: /Users/jmtsuji/mambaforge/envs/backblast/share/backblast/snakemake/template_config.yaml]
-U utils_dir: Path to the BackBLAST utility script directory [default: /Users/jmtsuji/mambaforge/envs/backblast/share/backblast/scripts]
Note: Currently does NOT support whitespaces in any input variables.
backblast run
backblast run: module for initializing a BackBLAST run.
Usage: backblast run [OPTIONS] config_filepath run_directory [SNAKEMAKE_ARGUMENTS]
Positional arguments (required):
config_filepath: path to the config file generated using the 'setup' module
run_directory: the directory where BackBLAST results out to be output
Optional arguments:
-P conda_prefix: path to where the conda envs should be stored [default: '.snakemake/conda' in the run_directory]
-j jobs: Number of processing threads available for the run [default: 1]
**Should be no lower than the 'threads' setting in the config file**
Advanced parameters (use with care):
-S snakefile: Path to the Snakefile used to run BackBLAST [default: /Users/jmtsuji/mambaforge/envs/backblast/share/backblast/snakemake/Snakefile]
-C use_conda: specify either 'True' or 'False' for whether or not each job should be run in
its own conda env [default: True]
If 'False' is set, then all dependencies need to be installed in the main environment
where BackBLAST is running. Could be tricky.
Snakemake arguments:
Any flags added at the end of the command will be passed directly to snakemake, e.g., --notemp
Note: Currently does NOT support whitespaces in any input variables.
backblast auto
backblast auto: module for setting up a BackBLAST run AND starting with some defaults.
Usage: backblast auto [OPTIONS] query_filepath query_genome_filepath subject_genome_directory output_directory [SNAKEMAKE_ARGUMENTS]
Positional arguments (required):
query_filepath: path to the query predicted protein sequences from the query genome, FastA format
query_genome_filepath: path to the predicted proteins of the entire query genome, FastA format
subject_genome_directory: directory containing predicted proteins of all subject genomes (FastA format).
One genome per file, with extension 'faa' (or specify -x)
output_directory: directory where config ('config.yaml') and metadata templates ('genome_metadata.tsv', 'gene_metadata.tsv') will be created
Optional arguments:
-t phylogenetic_tree_newick: path to the pre-calculated phylogenetic tree [default: 'subjects' - auto-calculate tree]
Specify 'NA' to skip tree generation and plot the heatmap alone.
-b bootstrap_cutoff: numerical value (e.g., 80) under which bootstrap values will not be displayed [default: NA]
-r root_name: Exact name of the tree tip label at the desired root of the tree [default: NA; will skip rooting]
-e evalue: e-value cutoff for reciprocal BLASTP [default: 1e-40]
-p pident: percent identity cutoff for reciprocal BLASTP [default: 25]
-c qcov: percent query coverage cutoff for reciprocal BLASTP [default: 50]
-x genome_extension: extension for predicted protein files of subject genomes [default: faa]
-P conda_prefix: path to where the conda envs should be stored [default: '.snakemake/conda' in the run_directory] -@ threads: maximum threads to use for any process [default: 1]
-j jobs: Number of processing threads available for the run [default: 1]
**Should be no lower than the 'threads' setting in the config file**
Advanced parameters (use with care):
-T template_config: Path to the template config file used in setup [default: /Users/jmtsuji/mambaforge/envs/backblast/share/backblast/snakemake/template_config.yaml]
-S snakefile: Path to the Snakefile used to run BackBLAST [default: /Users/jmtsuji/mambaforge/envs/backblast/share/backblast/snakemake/Snakefile]
-C use_conda: specify either 'True' or 'False' for whether or not each job should be run in
its own conda env [default: True]
If 'False' is set, then all dependencies need to be installed in the main environment
where BackBLAST is running. Could be tricky.
Snakemake arguments:
Any flags added at the end of the command will be passed directly to snakemake, e.g., --notemp
Note: Currently does NOT support whitespaces in any input variables.