# Recovering viruses from metatranscriptomics

This walkthrough covers assembling metatranscriptomic reads, viral binning, clustering, and classification.

By the end, you will have genomes, proteins, gene models, clusters, and more.

Please refer to the end-to-end metagenomics workflow for details on annotating viral genes.


Steps:

  1. Preprocess reads and get directory set up
  2. Assemble reads, map reads to assembly, calculate assembly statistics, and index the assembly
  3. Recover viruses from metatranscriptomic assemblies
  4. Cluster genomes and proteins
  5. Classify viral genomes

Conda Environment: `conda activate VEBA`. Use this for intermediate scripts.

## 1. Preprocess reads and get directory set up

Refer to the downloading and preprocessing reads workflow. At this point, it's assumed you have the following:

  • A file with one identifier per line (e.g., identifiers.list, but you can name it whatever you want)
  • A directory to keep all your logs called logs/
  • A directory of preprocessed reads: veba_output/preprocess/${ID}/output/cleaned_1.fastq.gz and veba_output/preprocess/${ID}/output/cleaned_2.fastq.gz where ${ID} represents the identifiers in identifiers.list.
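Before launching any assemblies, it can save time to confirm the preprocessed reads are actually in place. The following is a minimal sketch, assuming the directory layout described above; the `check_reads` function name is just for illustration.

```shell
# Sanity check (a sketch, assuming the layout above): confirm both cleaned
# read files exist and are non-empty for every identifier.
check_reads() {  # usage: check_reads identifiers.list
	local missing=0
	while read -r ID; do
		for R in 1 2; do
			F="veba_output/preprocess/${ID}/output/cleaned_${R}.fastq.gz"
			[ -s "$F" ] || { echo "[missing] $F"; missing=$((missing+1)); }
		done
	done < "$1"
	echo "${missing} missing file(s)"
}
```

If any files are reported missing, revisit the preprocessing workflow for those samples before continuing.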

## 2. Assemble reads, map reads to assembly, calculate assembly statistics, and index the assembly

Here we are going to assemble all of the reads using rnaSPAdes.

```bash
# Set the number of threads to use for each sample. Let's use 4
N_JOBS=4

# This is not the default output directory and is customized for metatranscriptomics
OUT_DIR=veba_output/transcript_assembly

# Just making sure your log directory is available
mkdir -p logs

# Iterate through the identifiers and run the assembly command
for ID in $(cat identifiers.list); do

	# Set a useful name for log files
	N="assembly__${ID}"
	rm -f logs/${N}.*

	# Get forward and reverse reads
	R1=veba_output/preprocess/${ID}/output/cleaned_1.fastq.gz
	R2=veba_output/preprocess/${ID}/output/cleaned_2.fastq.gz

	# Set up command
	CMD="source activate VEBA && veba --module assembly --params \"-1 ${R1} -2 ${R2} -n ${ID} -o ${OUT_DIR} -p ${N_JOBS} -P rnaspades.py\""

	# Either run this command directly or submit it via SunGridEngine/SLURM

done
```
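One way to dispatch each `${CMD}` on a SLURM cluster is with `sbatch --wrap`. This is a hedged sketch, not a VEBA default: the resource values (`-c`, `--mem`) are illustrative placeholders, and the sample variables below stand in for the ones built in the loop above. On SunGridEngine, `qsub` works similarly.

```shell
# Sketch: build a SLURM submission line for a single sample.
# All resource values here are placeholders; adjust for your data.
N="assembly__example"
N_JOBS=4
CMD="echo hello"  # stands in for the assembly command built in the loop above
SUBMIT_LINE="sbatch -J ${N} -c ${N_JOBS} --mem=32G -o logs/${N}.o -e logs/${N}.e --wrap=\"${CMD}\""
printf '%s\n' "${SUBMIT_LINE}"  # inspect first, then run it on the head node
```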

The following output files will be produced for each sample:

  • featurecounts.tsv.gz - featureCounts output for contig-level counts
  • mapped.sorted.bam - Sorted BAM
  • mapped.sorted.bam.bai - Sorted BAM index
  • genes_to_transcripts.tsv - Gene identifier to transcript identifier
  • transcripts.fasta - Transcript assemblies
  • transcripts.fasta.*.bt2 - Bowtie2 index of transcripts
  • transcripts.fasta.saf - SAF formatted file for transcript-level counts with featureCounts
  • seqkit_stats.tsv.gz - Assembly statistics

The main one we need is `transcripts.fasta`, which we will use for binning.

rnaSPAdes transcript identifiers have the following format:

```
NODE_1_length_117535_cov_30.642667_g0_i0
```

Where `g0` refers to the predicted gene and `i0` refers to the isoform transcript of `g0`. Node, length, and coverage refer to the backend node in the assembly graph, the transcript length, and the transcript coverage, respectively. This can be useful for tools like TransDecoder that take a --gene_trans_map argument (TransDecoder.LongOrfs) and a --single_best_only flag (TransDecoder.Predict), as we implemented when investigating coral holobiomes. Although VEBA has a wrapper script for TransDecoder and automated homology searches, this tutorial will use Prodigal-GV instead.
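The gene identifier can be pulled out of the header with a small amount of text processing. A sketch, using the example header above; for a full assembly you would loop over `grep '^>' transcripts.fasta` instead of a single string.

```shell
# Sketch: extract the gene identifier from an rnaSPAdes transcript header,
# e.g., to build a TransDecoder-style gene-to-transcript map.
HEADER="NODE_1_length_117535_cov_30.642667_g0_i0"
GENE=$(printf '%s\n' "${HEADER}" | sed -E 's/.*_(g[0-9]+)_i[0-9]+$/\1/')
printf '%s\t%s\n' "${GENE}" "${HEADER}"  # gene<tab>transcript
```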

## 3. Recover viruses from metatranscriptomic assemblies

We use a similar approach to the metagenomics workflow with geNomad and CheckV, but using the assembled transcripts instead. Again, the criteria for high-quality viral genomes are described by the CheckV author in Bitbucket Issue #38.

```bash
N_JOBS=4

for ID in $(cat identifiers.list); do
	N="binning-viral__${ID}"
	rm -f logs/${N}.*
	FASTA=veba_output/transcript_assembly/${ID}/output/transcripts.fasta
	BAM=veba_output/transcript_assembly/${ID}/output/mapped.sorted.bam
	CMD="source activate VEBA && veba --module binning-viral --params \"-f ${FASTA} -b ${BAM} -n ${ID} -p ${N_JOBS} -m 1500 -o veba_output/binning/viral -a genomad\""

	# Either run this command directly or submit it via SunGridEngine/SLURM

done
```

The following output files will be produced for each sample:

  • binned.list - List of binned contigs
  • bins.list - List of MAG identifiers
  • checkv_results.filtered.tsv - Filtered CheckV output
  • featurecounts.orfs.tsv.gz - ORF-level counts table (If --bam file is provided)
  • genome_statistics.tsv - Genome assembly statistics
  • gene_statistics.cds.tsv - Gene sequence statistics (CDS)
  • genomes/ - MAG subdirectory
  • genomes/[id_genome].fa - MAG assembly fasta
  • genomes/[id_genome].faa - MAG protein fasta
  • genomes/[id_genome].ffn - MAG CDS fasta
  • genomes/[id_genome].gff - MAG gene models
  • genomes/identifier_mapping.tsv - Identifier mapping between [id_orf, id_contig, id_mag]
  • scaffolds_to_bins.tsv - Identifier mapping between [id_contig, id_mag]
  • unbinned.fasta - Fasta of unbinned contigs that have passed length thresholding
  • unbinned.list - List of unbinned contigs
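After binning, a quick tally of binned versus unbinned contigs per sample can help flag samples with little or no viral signal. A sketch, assuming the per-sample output layout above; the `count_bins` function name is just for illustration.

```shell
# Sketch: tally how many contigs were binned as viral versus left unbinned
# for one sample, using the list files produced above.
count_bins() {  # usage: count_bins veba_output/binning/viral/${ID}/output
	local b u
	b=$(awk 'END{print NR}' "$1/binned.list")
	u=$(awk 'END{print NR}' "$1/unbinned.list")
	printf 'binned=%s unbinned=%s\n' "${b}" "${u}"
}
```

Run it in a loop over `identifiers.list` to compare samples side by side.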

## 4. Cluster genomes and proteins

To analyze these data, we are going to generate some counts tables and we want a single set of features to compare across all samples. To achieve this, we are going to cluster the genomes into species-level clusters (SLC) and the proteins into SLC-specific protein clusters (SSPC). Further, this clustering is dual purpose as it alleviates some of the bias from the curse(s) of dimensionality with dimensionality reduction via feature compression - a type of feature engineering.

```bash
# We need to generate a table with the following fields:
# [organism_type]<tab>[id_sample]<tab>[id_mag]<tab>[genome]<tab>[proteins]
# The table can include additional columns to the right (e.g., [cds]<tab>[gene_models]),
# but the first 5 columns are mandatory.

compile_genomes_table.py -i veba_output/binning/ > veba_output/misc/genomes_table.tsv

# Let's set an intermediate amount of threads for these
N_JOBS=12

# Set up command
CMD="source activate VEBA && veba --module cluster --params \"-i veba_output/misc/genomes_table.tsv -o veba_output/cluster -p ${N_JOBS}\""

# Either run this command directly or submit it via SunGridEngine/SLURM
```
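Because the cluster module depends on those 5 mandatory columns, it can be worth validating the compiled table before submitting. A sketch; the `validate_table` function name is just for illustration.

```shell
# Sketch: verify every row of the compiled genomes table carries at least
# the 5 mandatory columns before handing it to the cluster module.
validate_table() {  # usage: validate_table veba_output/misc/genomes_table.tsv
	awk -F'\t' 'NF < 5 { bad++ } END { print (bad ? bad : 0), "malformed row(s)" }' "$1"
}
```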

The following output files will be produced for each domain:

Default genome cluster prefix: `VSLC-1` where `V` refers to the capitalized first letter of the domain (e.g., Viral)

Default protein cluster prefix: `VSLC-1_SSPC-1`

Output directories: `veba_output/cluster/global` and `veba_output/cluster/local`

  • global/feature_compression_ratios.tsv - Feature compression ratios for each domain
  • global/genome_clusters.tsv - Machine-readable table for genome clusters [id_genome_cluster, number_of_components, number_of_samples_of_origin, components, samples_of_origin]
  • global/identifier_mapping.genomes.tsv - Identifier mapping for genomes [id_genome, organism_type, sample_of_origin, id_genome_cluster, number_of_proteins, number_of_singleton_protein_clusters, ratio_of_protein_cluster_are_singletons]
  • global/identifier_mapping.proteins.tsv - Identifier mapping for proteins [id_protein, organism_type, id_genome, sample_of_origin, id_genome_cluster, id_protein_cluster]
  • global/identifier_mapping.scaffolds.tsv - Identifier mapping for contigs [id_scaffold, organism_type, id_genome, sample_of_origin, id_genome_cluster]
  • global/mags_to_slcs.tsv
  • global/protein_clusters.tsv - Machine-readable table for protein clusters [id_protein_cluster, number_of_components, number_of_samples_of_origin, components, samples_of_origin]
  • global/proteins_to_orthogroups.tsv - Identifier mapping between proteins and protein clusters [id_protein, id_protein-cluster]
  • global/representative_sequences.faa - Protein sequences for cluster representatives. Header follows the following format: id_protein-cluster id_original_protein
  • global/scaffolds_to_mags.tsv - Identifier mapping between contigs and genomes [id_contig, id_genome]
  • global/scaffolds_to_slcs.tsv - Identifier mapping between contigs and genome clusters [id_contig, id_genome-cluster]
  • global/pangenome_tables/*.tsv.gz - Pangenome tables for each SLC with prevalence values
  • global/serialization/*.dict.pkl - Python dictionaries for clusters
  • global/serialization/*.networkx_graph.pkl - NetworkX graphs for clusters
  • local/* - If --no_local_clustering is not selected then all of the files are generated for local clustering

## 5. Classify viral genomes

Viral classification is performed using geNomad. Classification can be performed using the intermediate binning results, which is much quicker. Alternatively, if you have viruses identified elsewhere, you can still classify them using the --genomes argument instead.

```bash
N=classify-viral

# Specify the binning directory
BINNING_DIRECTORY=veba_output/binning/viral

# Clusters are optional, but if you have already clustered then use them
CLUSTERS=veba_output/cluster/viral/output/clusters.tsv

rm -f logs/${N}.*

# Set up the command
CMD="source activate VEBA && veba --module classify-viral --params \"-i ${BINNING_DIRECTORY} -c ${CLUSTERS} -o veba_output/classify/viral\""

# Either run this command directly or submit it via SunGridEngine/SLURM
```

The following output files will be produced:

  • viral_taxonomy.tsv - Viral genome classification based on geNomad classifications
  • viral_taxonomy.clusters.tsv - Viral cluster classification (If --clusters are provided)
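Once the taxonomy table is in hand, a quick summary of genomes per lineage rank is often useful. This is a sketch under assumptions: the column layout (a genome identifier followed by a semicolon-delimited lineage in column 2) is assumed here, so check the header of your own `viral_taxonomy.tsv` first; the `lineage_counts` function name is just for illustration.

```shell
# Sketch: count genomes per lineage rank (default: 2nd rank) from a
# taxonomy table whose 2nd column is a semicolon-delimited lineage.
lineage_counts() {  # usage: lineage_counts viral_taxonomy.tsv [rank]
	local rank="${2:-2}"
	awk -F'\t' -v r="${rank}" 'NR > 1 { n = split($2, a, ";"); if (r <= n) c[a[r]]++ }
		END { for (k in c) print c[k] "\t" k }' "$1" | sort -rn
}
```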

## Next steps

Now it's time to map the reads so we can analyze the data using compositional data analysis (CoDA). Please refer to the read mapping and counts tables workflow.