
How to run metaGOflow


To demonstrate how one may run the metaGOflow workflow, we will use a marine Illumina MiSeq shotgun metagenome with the ENA run accession ERR855786.

Using local raw data files

In case you have your raw data locally, you can run metaGOflow by providing the forward and the reverse read files:

./run_wf.sh -f SAMPLE_R1.fastq.gz -r SAMPLE_R2.fastq.gz -n PREFIX_FOR_OUTPUT_FILES -d OUTPUT_FOLDER_NAME
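For instance, for the ERR855786 run mentioned above, and assuming its paired-end files follow the usual ENA naming convention (the file, prefix and folder names below are only illustrative), the call would look like this:

./run_wf.sh -f ERR855786_1.fastq.gz -r ERR855786_2.fastq.gz -n ERR855786 -d ERR855786_results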

Using an ENA run accession id

If your raw data are stored in ENA, either as public or as private data, then:

  • if public, you need to provide metaGOflow with their corresponding run accession number.
./run_wf.sh -e ENA_RUN_ACCESSION_NUMBER -d OUTPUT_FOLDER_NAME -n PREFIX_FOR_OUTPUT_FILES
  • if private, you need to provide both the run accession number and your credentials
./run_wf.sh -e ENA_RUN_ACCESSION_NUMBER -u ENA_USERNAME -k ENA_PASSWORD -d OUTPUT_FOLDER_NAME -n PREFIX_FOR_OUTPUT_FILES
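For example, to process the public run ERR855786 used throughout this page (the output folder and prefix names below are only illustrative), you would run:

./run_wf.sh -e ERR855786 -d ERR855786_results -n ERR855786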

The config.yml file

The config.yml file includes all the metaGOflow parameters that the user can set:

  • The steps of the workflow to be performed. Remember! You can run certain steps of the workflow (e.g. the functional annotation of the reads) at a later point, but you always need to:
    • have already run the previous steps
    • keep track of the required files (see last bullet)
  • The number of threads to be used.
  • Sequence filtering-related parameters. You may check the fastp documentation for these.
  • Assembly- and functional annotation-related parameters, which to a great extent determine the computing time of their corresponding steps.
  • In case later steps of metaGOflow are to be performed based on metaGOflow output files from previous runs of a given sample, you need to provide the paths of certain files, depending on the step you want to perform. For example, assume you are about to perform the assembly step and have previously performed the taxonomic inventory step. In this case, your config.yml file should look something like this:
qc_and_merge_step: false
taxonomic_inventory: false
cgc_step: false
reads_functional_annotation: false
assemble: true

processed_read_files: 
  - class: File
    path:  /PATH_TO_/SAMPLE_R1_clean.fastq.trimmed.fasta
  - class: File
    path:  /PATH_TO_/SAMPLE_R2_clean.fastq.trimmed.fasta

  • In case you want to run metaGOflow only for the taxonomic inventory step, your config.yml file should look something like this:
qc_and_merge_step: true
taxonomic_inventory: true
cgc_step: false
reads_functional_annotation: false
assemble: false

processed_read_files: 
  - class: File
    path:  /PATH_TO_/SAMPLE_R1.fastq.gz
  - class: File
    path:  /PATH_TO_/SAMPLE_R2.fastq.gz

Have a thorough look at the notes in the config.yml file regarding which files are required to perform the various steps.

Attention! Paths used in config.yml are relative to the output folder of your run (the -d argument). So, for instance, if you need to access files from previous runs located in folders at the same level, you should use a path of the form ../path/to/previous/run/files/.
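As an illustration, a processed_read_files entry pointing to the output of a previous run that sits next to the current output folder could look as follows (the folder name PREVIOUS_RUN_FOLDER is a placeholder, not an actual metaGOflow output name):

processed_read_files: 
  - class: File
    path:  ../PREVIOUS_RUN_FOLDER/SAMPLE_R1_clean.fastq.trimmed.fasta
  - class: File
    path:  ../PREVIOUS_RUN_FOLDER/SAMPLE_R2_clean.fastq.trimmed.fasta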

Run in HPC

To run in an HPC environment, you need to build a batch script for the workload manager used on the HPC you will be working on.

For example, in the case of SLURM, you would make a file (e.g., metagoflow-job.sh) like the following:

#!/bin/bash

#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --nodelist=
#SBATCH --ntasks-per-node=NUM_OF_CORES
#SBATCH --mem=
#SBATCH --requeue
#SBATCH --job-name=metagoflow
#SBATCH --output=metagoflow.output

./run_wf.sh -e ENA_RUN_ACCESSION_NUMBER -d OUTPUT_FOLDER_NAME -n PREFIX_FOR_OUTPUT_FILES
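Once the batch script is ready, you can submit it with SLURM's standard submission command:

sbatch metagoflow-job.sh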

Of course, you need to make sure that all the metaGOflow dependencies are fulfilled on the node where it will run. We advise you to contact your HPC system administrators for guidance on how to submit jobs, as each HPC may use a different workload manager.

Attention! SLURM users may come across the following error: RuntimeError: slurm currently does not support shared caching, because it does not support cleaning up a worker after the last job finishes. Set the --disableCaching flag if you want to use this batch system.

Docker or Singularity

metaGOflow supports both the Docker and the Singularity container technologies. By default, metaGOflow uses Docker; in this case, Docker is therefore a dependency for the workflow to run.
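If you are not sure which container engine is available on your machine, you can check with the standard version commands of the two tools (these are generic Docker/Singularity commands, not part of metaGOflow):

docker --version
singularity --version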

To enable Singularity, you need to add the -s argument when calling metaGOflow:

./run_wf.sh -e ENA_RUN_ACCESSION_NUMBER -d OUTPUT_FOLDER_NAME -n PREFIX_FOR_OUTPUT_FILES -s

In case Singularity runs fail with an error message mentioning that a .sif file is missing, you need to force-pull the images used by metaGOflow. To make things easier, we have built the get_singularity_images.sh script to do so:

cd Installation
bash get_singularity_images.sh

In case you are using Docker, it is strongly recommended to avoid installing it through snap.
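If you are not sure whether your Docker installation comes from snap, you can check with snap's own listing command (again, a generic system command, not part of metaGOflow):

snap list docker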

Now, you are ready to run metaGOflow with Singularity!