Running pipeline

MetaRefSGB requires a set of mandatory and optional arguments in order to organise your genomes into species-, genus-, and family-level genome bins. You can inspect the full set of arguments by typing the following command in your terminal:

MetaRefSGB --help

It shows the list of available arguments with a brief explanation of each:

	MetaRefSGB -- organise genomes into species-level genome bins

	1.0 (20210929)

	MetaRefSGB [--work-dir=directory] [--label=value] [--release=value] [--mags=file] [--genomes=file]
	           [--input-dir=directory] [--extension=value] [--db=directory] [--nproc=num] [--xargs-nproc=num]
	           [--mash-threshold=num] [--checkm-completeness=num] [--checkm-contamination=num]

	MetaRefSGB is a scalable framework for organising genomes into species-level genome bins.
	Please visit the official Wiki for additional details:

	The following options are available:

		Path to the working directory in which results will be located.

		Label for the new release.

		Label of the reference release.
		Use --release=none to build a release from scratch.

		Path to the file with the list of input MAGs.

		Path to the file with the list of input Reference Genomes.

		MetaRefSGB Unique Genome Identifier.
		Must be used in conjunction with --inspect only.

		Sample ID.
		Must be used in conjunction with --inspect only.

		Dataset ID.
		Must be used in conjunction with --inspect only.

		Cluster ID (SGB, GGB, or FGB).
		Must be used in conjunction with --inspect only.

		Path to a one-column file with a list of MetaRefSGB Unique Genome Identifiers, samples, or datasets.
		Must be used in conjunction with --inspect only.

		MetaRefSGB Data Model schema (MAG, genome, or metadata).
		Must be used in conjunction with --inspect only.

		Path to the file with the output of the --inspect command.
		Must be used in conjunction with --inspect only.

		Retrieve information about genomes, samples, datasets, or clusters

		Path to the file with metadata about metagenomic samples.
		Must be used in conjunction with --validate-input only.

		Path to the folder with input genomes.

		File extension of the input genomes.

		Print the list of supported file extensions for input genomes.

		Directory with the MetaRefSGB framework and databases.

		Max nproc for parallel instructions.

		Max parallel xargs jobs for MASH disting.

		Filter threshold on the MASH distance.

		Filter threshold on CheckM completeness.

		Filter threshold on CheckM contamination.

		Automatically set --nproc=8, --xargs-nproc=1, --mash-threshold=0.001, --checkm-completeness=50.0, and --checkm-contamination=5.0.
		Remember to always use this flag before one of the above arguments, otherwise it will overwrite them with their default values.

		Used in conjunction with --mags, --genomes, and --metadata arguments. Check whether input files are properly formatted.
		Input GCAs will be tested against the RefSeq exclusion criteria to check whether they are Reference Genomes or MAGs.
		Data will be validated against the MetaRefSGB Data Model (MDM).

		Automatically correct Reference Genomes taxonomic labels and NCBI taxa IDs.
		Input file must be the same as the one passed with the --genomes argument.

		Skip the filtering process and insert all the input genomes into the clustering configuration.

		Skip the filtering process and use a precompiled list of genomes as the result of the filter.

		Skip the quality-control process with CheckM.

		Skip CheckM and use a precompiled CheckM output log file.

		Automatically check for external software dependencies and install required python modules.

	MetaRefSGB exits with one of the following values:

	0	The pipeline has been correctly applied.
	>0	An error occurred.

	Try running MetaRefSGB by typing:

		$ MetaRefSGB --work-dir=~/myrelease --label=Test --release=Jan21 --mags=~/MAGs.txt --genomes=~/genomes.txt
		             --input-dir=~/mygenomes --extension=fna --db=~/db --default

	To expand the --default flag and explicitly set --nproc, --xargs-nproc, --mash-threshold, --checkm-completeness, and --checkm-contamination:

		$ MetaRefSGB --work-dir=~/myrelease --label=Test --release=Jan21 --mags=~/MAGs.txt --genomes=~/genomes.txt
		             --input-dir=~/mygenomes --extension=fna --db=~/db --nproc=8 --xargs-nproc=1 --mash-threshold=0.001
		             --checkm-completeness=50.0 --checkm-contamination=5.0

	To explicitly change the value of just one of the above arguments, remember to always put the --default flag before specifying any of them.
	Otherwise, it will overwrite the explicitly assigned arguments with their default values:

		$ MetaRefSGB --work-dir=~/myrelease --label=Test --release=Jan21 --mags=~/MAGs.txt --genomes=~/genomes.txt
		             --input-dir=~/mygenomes --extension=fna --db=~/db --default --nproc=16

	In order to validate input data, try running:

		$ MetaRefSGB --mags=~/MAGs.txt --genomes=~/genomes.txt --metadata=~/metadata.txt --validate-input

	To validate just one input data:

		$ MetaRefSGB --mags=~/MAGs.txt --validate-input

	To correct or synchronise taxonomic labels and NCBI taxa IDs of the input Reference Genomes:

		$ MetaRefSGB --correct-taxa=~/genomes.txt

	To automatically check for external software dependencies and resolve required python modules:

		$ MetaRefSGB --resolve-dependencies

	To retrieve information about genomes, samples, datasets, or clusters in the Jan21 release:

		$ MetaRefSGB --inspect --genome=M1663737656 --db=~/db --release=Jan21

		$ MetaRefSGB --inspect --sample=833 --db=~/db --release=Jan21

		$ MetaRefSGB --inspect --dataset=AsnicarF_2020 --db=~/db --release=Jan21

		$ MetaRefSGB --inspect --cluster=SGB5075 --db=~/db --release=Jan21

	To search for multiple genomes, samples, datasets, or clusters with a single run:

		$ MetaRefSGB --inspect --file=~/mygenomes.txt --db=~/db --release=Jan21

	In order to redirect the output of the --inspect command to a file:

		$ MetaRefSGB --inspect --genome=M1663737656 --db=~/db --release=Jan21 --output=~/M1663737656.json

	To inspect the MetaRefSGB Data Model schemas (MAG, genome, or metadata):

		$ MetaRefSGB --inspect --schema=MAG

You can also use the special character ? in place of an argument value to show the extended help for that specific argument, for example:

MetaRefSGB --work-dir=?

This will output the following message:

MetaRefSGB helper: --work-dir=directory

    The --work-dir is a folder in which MetaRefSGB will put all the pipeline intermediate outputs
    up to the generation of the new clustering configuration.

    It must be empty at the beginning, otherwise the pipeline will try to resume a potentially interrupted
    run if some required intermediate results exist.

    Both relative and absolute paths are allowed.

Please remember to escape the question mark if your shell would otherwise interpret it as a wildcard:

MetaRefSGB --work-dir=\?
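
As a small illustration of why the escaping matters, the sketch below uses a hypothetical show_args helper (not part of MetaRefSGB) that prints its arguments one per line. In a throwaway folder containing a file whose name matches the pattern, the shell replaces the unquoted ? before the program sees it, while the quoted and escaped forms are passed through unchanged. Behaviour for patterns that match nothing differs between shells, so quoting or escaping is the safer habit.

```shell
#!/bin/sh
# Hypothetical helper: print each received argument on its own line.
show_args() {
    printf '%s\n' "$@"
}

# Work in a throwaway folder holding a file that matches the pattern.
cd "$(mktemp -d)" || exit 1
touch ./--work-dir=x

show_args --work-dir=?      # unquoted: the shell expands ? against file names
show_args '--work-dir=?'    # quoted: the literal ? reaches the program
show_args --work-dir=\?     # escaped: same effect as quoting
```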

Organise your genomes

Before running MetaRefSGB, you should organise your genomes. The files of both MAGs and Reference Genomes must be located in the same folder and must have the same file extension. You can make the extension of your genome files uniform by typing the following command in your terminal:

# Rename every file whose extension matches CURRENT_EXTENSION (case-insensitively)
find -L "${INPUTS_DIR}" -type f -iname "*.${CURRENT_EXTENSION}" \
        -exec sh -c 'mv -- "$1" "${1%.*}.$2"' _ {} "${NEW_EXTENSION}" \;

Before running this code, assign the path to the folder with your input genomes to the INPUTS_DIR variable, and the current and new file extensions to the CURRENT_EXTENSION and NEW_EXTENSION variables respectively.

Making the genome file extensions uniform is mandatory for the CheckM step of the pipeline, which estimates the quality of your input genomes.
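
To check whether a folder already uses a single extension, something like the following sketch may help. The list_extensions function is a hypothetical helper, not part of MetaRefSGB; it counts the files found for each extension. The demo folder and its file names are made up and stand in for your own INPUTS_DIR.

```shell
#!/bin/sh
# Hypothetical helper: count how many files carry each extension under a folder.
list_extensions() {
    # $1: folder with the input genomes (symbolic links are followed)
    find -L "$1" -type f -name '*.*' \
        | sed 's/.*\.//' \
        | sort \
        | uniq -c
}

# Demo on a throwaway folder; replace it with your own INPUTS_DIR.
INPUTS_DIR=$(mktemp -d)
touch "$INPUTS_DIR/a.fna" "$INPUTS_DIR/b.fna" "$INPUTS_DIR/c.fa"
list_extensions "$INPUTS_DIR"
```

If more than one extension is listed, the folder still needs the renaming step described above.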

Format your MAGs and Reference Genomes definition files

Arguments --mags and --genomes are both mandatory and must point to the MAGs and Reference Genomes definition files. They must be properly structured before running the pipeline.

Both of them must contain a column with the genome names (without their file extension). The Reference Genomes definition file must also contain two additional columns, one with the taxonomy labels and one with the NCBI taxa IDs.

The first line of both files must start with the # character and represents the header.

Here is an example of a MAGs definition file that must be passed with the --mags argument:

# mag_id

And an example of a Reference Genomes definition file that must be passed with the --genomes argument:

# genome_id	taxonomy	taxonomy_id
GCA_000003135	k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Bifidobacterium|s__Bifidobacterium_longum|t__Bifidobacterium_longum_subsp_longum_ATCC_55813	2|201174|1760|85004|31953|1678|216816|548480
GCA_000003645	k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Bacillaceae|g__Bacillus|s__Bacillus_cereus|t__Bacillus_cereus_m1293	2|1239|91061|1385|186817|1386|1396|526973
GCA_000003925	k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Bacillaceae|g__Bacillus|s__Bacillus_mycoides|t__Bacillus_mycoides_DSM_2048	2|1239|91061|1385|186817|1386|1405|526997

Any columns other than the mandatory ones are ignored.
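
As a quick local sanity check of the layout described above, a sketch along these lines could be used. The check_definition function is a hypothetical helper, not a MetaRefSGB command: it only verifies that the first line starts with # and that every other non-empty line has at least the expected number of tab-separated columns. The demo row uses a deliberately shortened taxonomy string.

```shell
#!/bin/sh
# check_definition FILE NCOLS
# Hypothetical helper: succeeds if the header starts with '#' and every other
# non-empty line has at least NCOLS tab-separated fields.
check_definition() {
    awk -F '\t' -v n="$2" '
        NR == 1 && $0 !~ /^#/ { print "line 1: header must start with #"; bad = 1 }
        NR > 1 && NF > 0 && NF < n {
            print "line " NR ": expected " n " columns, found " NF; bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Demo on a throwaway Reference Genomes definition file (shortened taxonomy).
DEMO=$(mktemp)
printf '# genome_id\ttaxonomy\ttaxonomy_id\n' > "$DEMO"
printf 'GCA_000003135\tk__Bacteria|p__Actinobacteria\t2|201174\n' >> "$DEMO"
check_definition "$DEMO" 3 && echo "definition file looks well formed"
```

For a MAGs definition file the same helper would be called with 1 as the number of columns.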

To be sure that the taxonomy labels and NCBI IDs associated with your Reference Genomes are correct, you can run the following command in your terminal:

MetaRefSGB --correct-taxa=~/genomes.txt

It accepts either a flat, uncompressed file or a BZ2-compressed file as input, but it always produces a BZ2-compressed output file with the prefix corrected_.

Before running MetaRefSGB, you may want to finally check if both the MAGs and Reference Genomes definition files are properly formatted by typing:

MetaRefSGB --mags=~/MAGs.txt --genomes=~/genomes.txt --validate-input

Please note that you may need to change the paths specified with the --correct-taxa, --mags, and --genomes arguments so that they point to your files on your file system.

This will also validate your input data against the MetaRefSGB Data Model (MDM). Have a look at the models area on the GitHub repository or the dedicated wiki page for additional information about MDM.

This may result in a long list of errors if your input does not respect the MDM specifications. If you are building a private release, you can ignore them, but make sure that your inputs contain the minimum required columns before running the pipeline (mag_id for the MAGs definition file and genome_id, taxonomy, and taxonomy_id for the Reference Genomes definition file, as shown in the examples above).
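
It can also be worth confirming that every ID listed in a definition file has a matching genome file. The sketch below is a hypothetical helper, not part of MetaRefSGB, and it assumes that each genome is stored as ID.EXTENSION directly inside the input folder; adjust it if your layout differs. The IDs in the demo are made up.

```shell
#!/bin/sh
# missing_genomes DEFINITION_FILE INPUT_DIR EXTENSION
# Hypothetical helper: print the IDs (first column; header and blank lines
# skipped) that have no matching INPUT_DIR/ID.EXTENSION file.
missing_genomes() {
    awk -F '\t' '!/^#/ && NF > 0 { print $1 }' "$1" | while IFS= read -r id; do
        [ -f "$2/$id.$3" ] || printf '%s\n' "$id"
    done
}

# Demo on a throwaway folder with two listed MAGs and one genome file.
WORK=$(mktemp -d)
printf '# mag_id\nmag_one\nmag_two\n' > "$WORK/MAGs.txt"
touch "$WORK/mag_one.fna"
missing_genomes "$WORK/MAGs.txt" "$WORK" fna
```

An empty output means that every listed ID was found under the assumed naming scheme.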

Choose a reference release

In MetaRefSGB, new releases are always incremental: each one starts from the clustering configuration of a previously built reference release, which is updated by adding a new set of MAGs and/or Reference Genomes. You can choose the release that best fits your needs by looking at the releases area of the repository.

We strongly recommend using the latest publicly available MetaRefSGB release.

Running MetaRefSGB

Now that you have organised your input genomes and correctly formatted both the MAGs and Reference Genomes definition files, you can run the MetaRefSGB pipeline by typing the following command in your terminal:

MetaRefSGB --work-dir=~/myrelease \
           --label=Test \
           --release=Jan21 \
           --mags=~/MAGs.txt \
           --genomes=~/genomes.txt \
           --input-dir=~/mygenomes \
           --extension=fna \
           --db=~/db \
           --default

In this specific example, we selected Jan21 as the reference release. Input genomes are all located under the ~/mygenomes folder and they all have the same fna file extension. The database directory specified with the --db argument can initially be empty and will be populated with data related to the version of the MetaRefSGB release specified with the --release argument. The working directory specified with the --work-dir argument can also be empty and will be populated while processing the new release.

Note that the --default flag is required in order to set the optional arguments to their default values. However, you can also expand it by explicitly setting the optional arguments, as in the following example:

MetaRefSGB --work-dir=~/myrelease \
           --label=Test \
           --release=Jan21 \
           --mags=~/MAGs.txt \
           --genomes=~/genomes.txt \
           --input-dir=~/mygenomes \
           --extension=fna \
           --db=~/db \
           --nproc=8 \
           --xargs-nproc=1 \
           --mash-threshold=0.001 \
           --checkm-completeness=50.0 \
           --checkm-contamination=5.0

If you want to explicitly change the value of just one of the optional arguments, you can write something like the following line:

MetaRefSGB --work-dir=~/myrelease \
           --label=Test \
           --release=Jan21 \
           --mags=~/MAGs.txt \
           --genomes=~/genomes.txt \
           --input-dir=~/mygenomes \
           --extension=fna \
           --db=~/db \
           --default \
           --nproc=16

Remember to always use the --default flag if you do not want to set every optional argument explicitly. Also remember to always put the --default flag before the optional arguments, otherwise it will overwrite the explicitly assigned optional arguments with their default values.


Be careful when explicitly setting the --xargs-nproc argument. It is used together with the --nproc argument to heavily parallelise the mash dist operations: --nproc parallelises each single MASH instance, while --xargs-nproc determines how many MASH processes run in parallel. Thus, the total number of spawned processes is equal to --xargs-nproc * --nproc.
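
The arithmetic can be checked before launching a run with a sketch like the one below. The total_processes function and the values 8 and 4 are illustrative assumptions, not MetaRefSGB defaults or commands; the core count is read with getconf, falling back to 1 if that query is unavailable.

```shell
#!/bin/sh
# total_processes NPROC XARGS_NPROC
# Hypothetical helper: product of the two values, as described above.
total_processes() {
    echo $(( $1 * $2 ))
}

NPROC=8          # value you plan to pass with --nproc
XARGS_NPROC=4    # value you plan to pass with --xargs-nproc

TOTAL=$(total_processes "$NPROC" "$XARGS_NPROC")
CORES=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

echo "requested processes: $TOTAL, available cores: $CORES"
if [ "$TOTAL" -gt "$CORES" ]; then
    echo "warning: --nproc * --xargs-nproc exceeds the available cores" >&2
fi
```

With these example values the product is 32, so the warning is printed on any machine with fewer than 32 cores.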