This walkthrough covers assembling metatranscriptomic reads, viral binning, clustering, and classification. By the end, you will have genomes, proteins, gene models, clusters, and more.
Please refer to the end-to-end metagenomics workflow for details on annotating viral genes.
- Preprocess reads and get directory set up
- Assemble reads, map reads to assembly, calculate assembly statistics, and index the assembly
- Recover viruses from metatranscriptomic assemblies
- Cluster genomes and proteins
- Classify viral genomes
Conda Environment: conda activate VEBA. Use this environment for intermediate scripts.
Refer to the downloading and preprocessing reads workflow. At this point, it's assumed you have the following:
- A file with each of your identifiers on a separate line (e.g., identifiers.list, but you can call it whatever you want)
- A directory for all of your logs called logs/
- A directory of preprocessed reads: veba_output/preprocess/${ID}/output/cleaned_1.fastq.gz and veba_output/preprocess/${ID}/output/cleaned_2.fastq.gz, where ${ID} represents the identifiers in identifiers.list
Here we are going to assemble all of the reads using rnaSPAdes.
# Set the number of threads to use for each sample. Let's use 4
N_JOBS=4
# This is not the default output directory and is customized for metatranscriptomics
OUT_DIR=veba_output/transcript_assembly
# Just making sure your log directory is available
mkdir -p logs
# Iterate through the identifiers and run the assembly command
for ID in $(cat identifiers.list); do
# Set a useful name for log files
N="assembly__${ID}";
rm -f logs/${N}.*
# Get forward and reverse reads
R1=veba_output/preprocess/${ID}/output/cleaned_1.fastq.gz
R2=veba_output/preprocess/${ID}/output/cleaned_2.fastq.gz
# Set up command
CMD="source activate VEBA && veba --module assembly --params \"-1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS} -P rnaspades.py\""
# Either run this command directly or submit it via SunGridEngine/SLURM
done
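The loop above only constructs each command string; it never executes it. Below is a hedged sketch of dispatching one command for a single hypothetical sample. The local `eval` redirection and the SLURM `sbatch` flags are assumptions about your environment, not VEBA requirements, so adjust them to your scheduler.

```shell
# Build the assembly command for a single (hypothetical) sample
ID="sample_1"
N_JOBS=4
OUT_DIR=veba_output/transcript_assembly
N="assembly__${ID}"
R1=veba_output/preprocess/${ID}/output/cleaned_1.fastq.gz
R2=veba_output/preprocess/${ID}/output/cleaned_2.fastq.gz
CMD="source activate VEBA && veba --module assembly --params \"-1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS} -P rnaspades.py\""
echo "${CMD}"

# Run locally, capturing stdout/stderr in the log directory:
#   eval "${CMD}" > logs/${N}.o 2> logs/${N}.e

# Or submit to SLURM (memory/partition flags will depend on your cluster):
#   sbatch -J ${N} -c ${N_JOBS} -o logs/${N}.o -e logs/${N}.e --wrap="${CMD}"
```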
The following output files will be produced for each sample:
- featurecounts.tsv.gz - featureCounts output for contig-level counts
- mapped.sorted.bam - Sorted BAM
- mapped.sorted.bam.bai - Sorted BAM index
- genes_to_transcripts.tsv - Gene identifier to transcript identifier
- transcripts.fasta - Transcript assemblies
- transcripts.fasta.*.bt2 - Bowtie2 index of transcripts
- transcripts.fasta.saf - SAF formatted file for transcript-level counts with featureCounts
- seqkit_stats.tsv.gz - Assembly statistics
The main one we need is transcripts.fasta, which we will use for binning.
rnaSPAdes transcript identifiers have the following format:
NODE_1_length_117535_cov_30.642667_g0_i0
where g0 refers to the predicted gene and i0 refers to the isoform transcript for g0. Node, length, and coverage refer to the backend node in the assembly graph, transcript length, and transcript coverage, respectively. This can be useful for tools like TransDecoder that take a --gene_trans_map argument (TransDecoder.LongOrfs) and a --single_best_only flag (TransDecoder.Predict), as we implemented when investigating coral holobiomes. Although VEBA has a wrapper script for TransDecoder and automated homology searches, this tutorial will use Prodigal-GV instead.
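As a minimal sketch of that identifier convention, a gene-to-transcript mapping in the [id_gene]&lt;tab&gt;[id_transcript] form TransDecoder's --gene_trans_map expects can be derived from the fasta headers with standard shell tools. The mock fasta below is purely illustrative; in practice you would parse your transcripts.fasta.

```shell
# Mock rnaSPAdes-style headers for illustration
printf ">NODE_1_length_117535_cov_30.642667_g0_i0\nACGT\n>NODE_2_length_500_cov_10.5_g0_i1\nACGT\n" > transcripts.mock.fasta

# Strip the leading ">", pull the gene token (g<N>) out of each identifier,
# and emit [id_gene]<tab>[id_transcript]
grep "^>" transcripts.mock.fasta \
    | sed 's/^>//' \
    | awk '{ g = $1; sub(/.*_g/, "g", g); sub(/_i[0-9]+$/, "", g); print g "\t" $1 }' \
    > gene_trans_map.mock.tsv

cat gene_trans_map.mock.tsv
```

Here both mock transcripts are isoforms of the same predicted gene, so both rows map to g0.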
We use a similar approach to the metagenomics workflow with geNomad and CheckV, but using the assembled transcripts instead. Again, the criteria for high-quality viral genomes are described by the CheckV author in this Bitbucket issue (#38).
N_JOBS=4
for ID in $(cat identifiers.list); do
N="binning-viral__${ID}";
rm -f logs/${N}.*
FASTA=veba_output/transcript_assembly/${ID}/output/transcripts.fasta
BAM=veba_output/transcript_assembly/${ID}/output/mapped.sorted.bam
CMD="source activate VEBA && veba --module binning-viral --params \"-f ${FASTA} -b ${BAM} -n ${ID} -p ${N_JOBS} -m 1500 -o veba_output/binning/viral -a genomad\""
# Either run this command directly or submit it via SunGridEngine/SLURM
done
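Once binning finishes, a quick sanity check is to tally how many viral MAGs were recovered per sample from each bins.list. The mock directory, identifiers, and MAG names below are stand-ins for real binning output, used here only so the loop has something to count.

```shell
# Mock output layout standing in for a real binning run (illustrative only)
mkdir -p veba_output/binning/viral/sample_1/output
printf "sample_1__VIRUS__1\nsample_1__VIRUS__2\n" > veba_output/binning/viral/sample_1/output/bins.list
echo "sample_1" > identifiers.mock.list

# Tally viral MAGs per sample; samples without a bins.list are skipped
while read -r ID; do
    BINS=veba_output/binning/viral/${ID}/output/bins.list
    if [ -f "${BINS}" ]; then
        printf "%s\t%s\n" "${ID}" "$(wc -l < "${BINS}")"
    fi
done < identifiers.mock.list
```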
The following output files will be produced for each sample:
- binned.list - List of binned contigs
- bins.list - List of MAG identifiers
- checkv_results.filtered.tsv - Filtered CheckV output
- featurecounts.orfs.tsv.gz - ORF-level counts table (if a --bam file is provided)
- genome_statistics.tsv - Genome assembly statistics
- gene_statistics.cds.tsv - Gene sequence statistics (CDS)
- genomes/ - MAG subdirectory
- genomes/[id_genome].fa - MAG assembly fasta
- genomes/[id_genome].faa - MAG protein fasta
- genomes/[id_genome].ffn - MAG CDS fasta
- genomes/[id_genome].gff - MAG gene models
- genomes/identifier_mapping.tsv - Identifier mapping between [id_orf, id_contig, id_mag]
- scaffolds_to_bins.tsv - Identifier mapping between [id_contig, id_mag]
- unbinned.fasta - Fasta of unbinned contigs that have passed length thresholding
- unbinned.list - List of unbinned contigs
To analyze these data, we are going to generate some counts tables and we want a single set of features to compare across all samples. To achieve this, we are going to cluster the genomes into species-level clusters (SLC) and the proteins into SLC-specific protein clusters (SSPC). Further, this clustering is dual purpose as it alleviates some of the bias from the curse(s) of dimensionality with dimensionality reduction via feature compression - a type of feature engineering.
# We need to generate a table with the following fields:
# [organism_type]<tab>[id_sample]<tab>[id_mag]<tab>[genome]<tab>[proteins]
# The table can include additional columns to the right (e.g., [cds]<tab>[gene_models]), but the first 5 columns are mandatory.
compile_genomes_table.py -i veba_output/binning/ > veba_output/misc/genomes_table.tsv
# Let's use an intermediate number of threads for this step
N_JOBS=12
# Set up command
CMD="source activate VEBA && veba --module cluster --params \"-i veba_output/misc/genomes_table.tsv -o veba_output/cluster -p ${N_JOBS}\""
# Either run this command directly or submit it via SunGridEngine/SLURM
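Before launching the clustering module, it can be worth verifying that every row of the compiled table has at least the 5 mandatory tab-separated columns. The row below is a hypothetical example, not real compile_genomes_table.py output, and the file name is illustrative.

```shell
# One illustrative row: [organism_type] [id_sample] [id_mag] [genome] [proteins]
printf "viral\tsample_1\tsample_1__VIRUS__1\tgenomes/v1.fa\tgenomes/v1.faa\n" > genomes_table.mock.tsv

# Report any rows with fewer than 5 tab-separated fields
awk -F'\t' 'NF < 5 { print "Malformed row " NR ": " $0 }' genomes_table.mock.tsv
echo "Rows checked: $(wc -l < genomes_table.mock.tsv)"
```

No "Malformed row" lines means the table meets the minimum column requirement.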
The following output files will be produced for each domain:
The default genome cluster prefix is VSLC-1, where V refers to the capitalized first letter of the domain (e.g., Viral). The default protein cluster prefix is VSLC-1_SSPC-1.
Results are written to veba_output/cluster/global and veba_output/cluster/local.
- global/feature_compression_ratios.tsv - Feature compression ratios for each domain
- global/genome_clusters.tsv - Machine-readable table for genome clusters
[id_genome_cluster, number_of_components, number_of_samples_of_origin, components, samples_of_origin]
- global/identifier_mapping.genomes.tsv - Identifier mapping for genomes
[id_genome, organism_type, sample_of_origin, id_genome_cluster, number_of_proteins, number_of_singleton_protein_clusters, ratio_of_protein_cluster_are_singletons]
- global/identifier_mapping.proteins.tsv - Identifier mapping for proteins
[id_protein, organism_type, id_genome, sample_of_origin, id_genome_cluster, id_protein_cluster]
- global/identifier_mapping.scaffolds.tsv - Identifier mapping for contigs
[id_scaffold, organism_type, id_genome, sample_of_origin, id_genome_cluster]
- global/mags_to_slcs.tsv - Identifier mapping between genomes and genome clusters
- global/protein_clusters.tsv - Machine-readable table for protein clusters
[id_protein_cluster, number_of_components, number_of_samples_of_origin, components, samples_of_origin]
- global/proteins_to_orthogroups.tsv - Identifier mapping between proteins and protein clusters
[id_protein, id_protein-cluster]
- global/representative_sequences.faa - Protein sequences for cluster representatives. Headers follow the format: id_protein-cluster id_original_protein
- global/scaffolds_to_mags.tsv - Identifier mapping between contigs and genomes
[id_contig, id_genome]
- global/scaffolds_to_slcs.tsv - Identifier mapping between contigs and genome clusters
[id_contig, id_genome-cluster]
- global/pangenome_tables/*.tsv.gz - Pangenome tables for each SLC with prevalence values
- global/serialization/*.dict.pkl - Python dictionaries for clusters
- global/serialization/*.networkx_graph.pkl - NetworkX graphs for clusters
- local/* - If --no_local_clustering is not selected, then all of the above files are also generated for local clustering
Viral classification is performed using geNomad. Classification can be performed using the intermediate binning results, which is much quicker. Alternatively, if you have viruses identified elsewhere, you can still classify them using the --genomes argument instead.
N=classify-viral
# Specify the binning directory
BINNING_DIRECTORY=veba_output/binning/viral
# Clusters are optional but if you have already clustered then use it
CLUSTERS=veba_output/cluster/viral/output/clusters.tsv
rm -f logs/${N}.*
# Set up the command
CMD="source activate VEBA && veba --module classify-viral --params \"-i ${BINNING_DIRECTORY} -c ${CLUSTERS} -o veba_output/classify/viral\""
# Either run this command directly or submit it via SunGridEngine/SLURM
The following output files will be produced:
- viral_taxonomy.tsv - Viral genome classification based on geNomad classifications
- viral_taxonomy.clusters.tsv - Viral cluster classification (if --clusters is provided)
Now it's time to map the reads so we can analyze the data using compositional data analysis (CoDA). Please refer to the read mapping and counts tables workflow.