# How to run metaGOflow
To demonstrate how one may run the metaGOflow workflow, we will use a marine Illumina MiSeq shotgun metagenome with the ENA run accession ERR855786.

In case you have your raw data locally, you can run metaGOflow by providing the forward and the reverse reads accordingly:

```
./run_wf.sh -f SAMPLE_R1.fastq.gz -r SAMPLE_R2.fastq.gz -n PREFIX_FOR_OUTPUT_FILES -d OUTPUT_FOLDER_NAME
```
If your raw data are stored in ENA, either as public or private data, then:

- if public, you need to provide metaGOflow with the corresponding run accession number:

```
./run_wf.sh -e ENA_RUN_ACCESSION_NUMBER -d OUTPUT_FOLDER_NAME -n PREFIX_FOR_OUTPUT_FILES
```

- if private, you need to provide both the run accession number and your credentials:

```
./run_wf.sh -e ENA_RUN_ACCESSION_NUMBER -u ENA_USERNAME -k ENA_PASSWORD -d OUTPUT_FOLDER_NAME -n PREFIX_FOR_OUTPUT_FILES
```
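Putting the public-data invocation together with the demo accession above, a small loop like the following can prepare the commands for several ENA runs at once. This is only a sketch: the naming scheme (accession reused as output folder and prefix) is our own choice, and the loop only prints each command so you can inspect it before executing:

```shell
#!/usr/bin/env bash
# Sketch: print run_wf.sh invocations for a list of public ENA run accessions.
# ERR855786 is the demo accession used in this guide; extend the list as needed.
set -euo pipefail

for acc in ERR855786; do
    # Reuse the accession as both the output folder name and the file prefix.
    echo "./run_wf.sh -e ${acc} -d ${acc}_results -n ${acc}"
done
```

Dropping the `echo` turns this dry run into actual metaGOflow invocations.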
The `config.yml` file includes all the metaGOflow parameters the user can set:

- The steps of the workflow to be performed. Remember! You can run certain steps of the workflow (e.g. the functional annotation of the reads) at a later point, but you always need to:
  - have already run the previous steps
  - keep track of the required files (see last bullet)
- The number of threads to be used.
- Sequence-filtering-related parameters. You may check the fastp documentation for these.
- Assembly- and functional-annotation-related parameters, which to a great extent define the computing time of their corresponding steps.
- In case later steps of metaGOflow are to be performed based on metaGOflow output files of previous runs for a given sample, you need to provide the paths of certain files, depending on the step you want to perform.
For example, assume you are about to perform the assembly step while you have previously performed the taxonomic inventory one. In this case, your `config.yml` file has to look something like this:

```yaml
qc_and_merge_step: false
taxonomic_inventory: false
cgc_step: false
reads_functional_annotation: false
assemble: true
processed_read_files:
  - class: File
    path: /PATH_TO_/SAMPLE_R1_clean.fastq.trimmed.fasta
  - class: File
    path: /PATH_TO_/SAMPLE_R2_clean.fastq.trimmed.fasta
```
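Before launching such a resumed run, it is worth checking that the files the step depends on actually exist. A minimal sketch (the helper name `check_inputs` is ours; the file names match the example above and should be adjusted to your own paths):

```shell
#!/usr/bin/env bash
# Sketch: verify that the inputs a resumed metaGOflow step depends on exist.
check_inputs () {
    local missing=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing required input: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Example: the processed reads the assembly step expects (adjust the paths).
check_inputs SAMPLE_R1_clean.fastq.trimmed.fasta SAMPLE_R2_clean.fastq.trimmed.fasta \
    || echo "fix the config.yml paths before running ./run_wf.sh"
```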
In case you want to run metaGOflow only for the taxonomic inventory step, your `config.yml` file has to look something like this:

```yaml
qc_and_merge_step: true
taxonomic_inventory: true
cgc_step: false
reads_functional_annotation: false
assemble: false
processed_read_files:
  - class: File
    path: /PATH_TO_/SAMPLE_R1.fastq.gz
  - class: File
    path: /PATH_TO_/SAMPLE_R2.fastq.gz
```
Have a thorough look at the notes in the `config.yml` file regarding which files are required to perform the various steps.

Attention! Note that paths used in `config.yml` are relative to the output folder of your run (the `-d` argument). So, for instance, if you need to access files in the folders of previous runs at the same level, you should use `../path/to/previous/run/files/`.
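To illustrate the relative-path rule, the sketch below builds an assumed layout in which an earlier run (`run_A`, a made-up name) produced the processed reads and the current run's output folder (`-d`) is `run_B`; from inside `run_B`, the earlier run's files are reached one level up:

```shell
#!/usr/bin/env bash
# Assumed layout (names are illustrative only):
#   runs/run_A/SAMPLE_R1_clean.fastq.trimmed.fasta   <- a previous run's output
#   runs/run_B/                                      <- current run's -d folder
mkdir -p runs/run_A runs/run_B
touch runs/run_A/SAMPLE_R1_clean.fastq.trimmed.fasta

cd runs/run_B
# This is the relative path you would write in config.yml:
[ -f ../run_A/SAMPLE_R1_clean.fastq.trimmed.fasta ] && echo "relative path resolves"
```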
To run in an HPC environment, you need to build a batch script for the workload manager used on your HPC system. For example, in the case of SLURM you will have to make a file (e.g., `metagoflow-job.sh`) like the following:

```shell
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --nodelist=
#SBATCH --ntasks-per-node=NUM_OF_CORES
#SBATCH --mem=
#SBATCH --requeue
#SBATCH --job-name=metagoflow
#SBATCH --output=metagoflow.output

./run_wf.sh -e ENA_RUN_ACCESSION_NUMBER -d OUTPUT_FOLDER_NAME -n PREFIX_FOR_OUTPUT_FILES
```
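Such a batch script can also be generated programmatically. The sketch below writes a minimal version of it for one accession; the core count, output folder, and prefix are example choices of ours, and the `--nodelist`/`--mem` directives are left out here since their values are cluster-specific:

```shell
#!/usr/bin/env bash
# Sketch: generate a minimal metaGOflow SLURM script for one accession.
acc=ERR855786   # the demo accession from this guide
cores=16        # example value; match your node

cat > metagoflow-job.sh <<EOF
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=${cores}
#SBATCH --job-name=metagoflow
#SBATCH --output=metagoflow.output
./run_wf.sh -e ${acc} -d ${acc}_results -n ${acc}
EOF

echo "wrote metagoflow-job.sh"   # then submit with: sbatch metagoflow-job.sh
```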
You also need to make sure that all metaGOflow dependencies are fulfilled on the node where it will run. We advise you to contact your HPC system administrators for guidance on how to submit jobs, as each HPC may use a different workload manager.
Attention! SLURM users may come across the following error:

```
RuntimeError: slurm currently does not support shared caching, because it does not support cleaning up a worker after the last job finishes. Set the --disableCaching flag if you want to use this batch system.
```
metaGOflow supports both Docker (default) and Singularity container technologies. When running with the default, Docker is a dependency for the workflow to run. To enable Singularity instead, you need to add the `-s` argument when calling metaGOflow:

```
./run_wf.sh -e ENA_RUN_ACCESSION_NUMBER -d OUTPUT_FOLDER_NAME -n PREFIX_FOR_OUTPUT_FILES -s
```
In case Singularity runs fail with an error message mentioning that a `.sif` file is missing, you need to force-pull the container images used by metaGOflow. To make things easier, we have built the `get_singularity_images.sh` script to do so:

```
cd Installation
bash get_singularity_images.sh
```
In case you are using Docker, it is strongly recommended to avoid installing it through `snap`.
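A quick way to check whether your Docker came from `snap` is the best-effort sketch below (Linux only; it simply reports nothing suspicious when `snap` itself is not installed):

```shell
#!/usr/bin/env bash
# Sketch: detect a snap-installed Docker (best effort; Linux only).
if command -v snap >/dev/null 2>&1 && snap list docker >/dev/null 2>&1; then
    echo "Docker appears to be installed via snap; consider reinstalling it another way."
else
    echo "no snap-installed Docker detected"
fi
```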
Now, you are ready to run metaGOflow with Singularity!
Anything unclear or inaccurate? Please open an issue or email Dr. Haris Zafeiropoulos ([email protected]).

With respect to EMO BON protocols, samples, and analyses, you may contact the Observation, Data and Service Development Officer of EMBRC, Dr. Ioulia Santi ([email protected]).