cabana-online/metagenomics-communicable-diseases

Metagenomics in Communicable Diseases: Tutorial #2

Metagenomics for the Study of Human Gut Microbiome in Health and Disease: Applications in Acute Diarrheal Diseases (ADD)

Authors

  • Angela Peña-González, PhD
  • Alejandro Reyes Muñoz, PhD

School of Biological Sciences, Universidad de los Andes

Description

This repository helps students complete Metagenomics in Communicable Diseases: Tutorial #2 by providing a preconfigured environment, so that they can focus on the tools covered in the tutorial.

Requisites

  • Linux or OSX is the recommended environment for this tutorial. If you don't have access to either, consider one of the following alternatives:
    • On Windows, a virtual machine (VirtualBox or VMware should work) with Linux installed.
    • An AWS instance (depending on the setup, it could be free or incur a monthly fee).
    • A Google Compute Engine instance (depending on the setup, it could be free or incur a monthly fee).
    • An Azure Virtual Machine instance (depending on the setup, it could be free or incur a monthly fee).
  • Docker must be installed on the environment.
  • Docker Compose must be installed on the environment.
  • Make must be installed on the environment.
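
The requisites above can be checked before starting. The following is a minimal sketch, not part of the repository: the helper name `missing_tools` is hypothetical, and it only tests whether the commands are on the PATH, not whether the Docker daemon is running.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every tool from the argument list
# that cannot be found on the PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools docker make)
if [ -n "$missing" ]; then
  # Left unquoted on purpose so the names are printed on one line.
  echo "Missing tools:" $missing
else
  echo "All required tools found."
fi

# Docker Compose is distributed either as a Docker plugin
# ("docker compose") or as a standalone binary ("docker-compose");
# this sketch accepts either one.
if docker compose version >/dev/null 2>&1 || command -v docker-compose >/dev/null 2>&1; then
  echo "Docker Compose found."
else
  echo "Docker Compose not found."
fi
```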

Commands

These commands must be run from the project folder and require make to be available.

General commands

  • make start - Will download (if not available) the docker images and start the docker containers necessary to run the tutorial.
  • make stop - If the containers are running, it will shut them down.
  • make tutorial - If the containers are running, it will replicate the steps described in the tutorial file and store the results in the CABANA folder as described.
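
A complete session with the general commands could look like the sketch below. The wrapper function `run_tutorial` and the `MAKE_CMD` variable are hypothetical conveniences, not part of the repository; only the `start`, `tutorial` and `stop` targets come from the list above.

```shell
#!/usr/bin/env bash
# MAKE_CMD is a hypothetical override; set MAKE_CMD=echo to preview
# the sequence of targets without running anything.
MAKE_CMD="${MAKE_CMD:-make}"

# Start the containers, run the whole tutorial, and always shut the
# containers down afterwards, even if the tutorial step fails.
run_tutorial() {
  local status
  "$MAKE_CMD" start || return 1
  "$MAKE_CMD" tutorial
  status=$?
  "$MAKE_CMD" stop
  return "$status"
}
```

Calling `run_tutorial` from the project folder runs the three targets in order and returns the exit status of the tutorial step.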

Specific commands

  • make prepare-tutorial - Will perform data preparation by downloading datasets used for control and testing.
    • make download-data - Downloads data that will be used on the tutorial.
    • make uncompress-data - Uncompresses the compressed datasets downloaded with the above command.
    • make seqtk-data - Takes the uncompressed data and passes it through seqtk.
  • make run-quality-control - Runs the quality control steps from the tutorial.
    • make qc-prepare - Prepares the seqtk data for the quality control process.
    • make qc-run-step-1 - Quality control step 1 from the tutorial.
    • make qc-run-step-2 - Quality control step 2 from the tutorial.
    • make qc-run-step-3 - Quality control step 3 from the tutorial.
    • make qc-run-step-4 - Quality control step 4 from the tutorial.
    • make qc-run-step-5 - Quality control step 5 from the tutorial.
  • make run-coverage - Runs the coverage steps from the tutorial.
    • make cv-step-1 - Coverage step 1 from the tutorial.
    • make cv-step-2 - Coverage step 2 from the tutorial.
  • make run-distance-estimation - Runs the distance estimation steps from the tutorial.
    • make distance-estimation-step-1 - Distance estimation step 1.
    • make distance-estimation-step-2 - Distance estimation step 2.
  • make run-metagenomic-assembly - Runs the metagenomics assembly steps from the tutorial.
    • make metagenomic-assembly-step-1 - Metagenomic assembly script.
  • make run-binning-clustering - Runs the binning and clustering steps from the tutorial.
    • make binning-step-1 - Binning and clustering step 1.
    • make binning-step-2 - Binning and clustering step 2.
  • make run-taxonomy - Runs the taxonomy steps from the tutorial.
    • make taxonomy-step-1 - Taxonomy step 1.
  • make run-functional - Runs the functional steps from the tutorial.
    • make functional-step-1 - Functional step 1.
    • make functional-step-2 - Functional step 2.
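
The stage-level targets make it possible to resume the tutorial from a given stage instead of repeating everything. The sketch below assumes that the stages are meant to run in the order in which they are listed above, which this README does not state explicitly; the `run_from` function and the `MAKE_CMD` variable are hypothetical.

```shell
#!/usr/bin/env bash
# MAKE_CMD is a hypothetical override; MAKE_CMD=echo previews the targets.
MAKE_CMD="${MAKE_CMD:-make}"

# Stage targets in the order they appear in this README
# (assumed to be the execution order).
STAGES="prepare-tutorial run-quality-control run-coverage run-distance-estimation run-metagenomic-assembly run-binning-clustering run-taxonomy run-functional"

# Run the given stage and every stage listed after it,
# stopping at the first failure.
run_from() {
  local first="$1" started=0 stage
  for stage in $STAGES; do
    [ "$stage" = "$first" ] && started=1
    [ "$started" -eq 1 ] || continue
    "$MAKE_CMD" "$stage" || { echo "Stage $stage failed" >&2; return 1; }
  done
  [ "$started" -eq 1 ] || { echo "Unknown stage: $first" >&2; return 2; }
}
```

For example, `run_from run-taxonomy` would run the taxonomy and functional stages only. The containers must already be running (`make start`).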

Container shell access

The following commands provide shell access to each container.

  • make shell-blast - Shell access to the blast container.
  • make shell-bmtagger - Shell access to the bmtagger container.
  • make shell-checkm - Shell access to the checkm container.
  • make shell-enveomics - Shell access to the enveomics container.
  • make shell-mash - Shell access to the mash container.
  • make shell-maxbin2 - Shell access to the maxbin2 container.
  • make shell-megahit - Shell access to the megahit container.
  • make shell-metaphlan2 - Shell access to the metaphlan2 container.
  • make shell-multiqc - Shell access to the multiqc container.
  • make shell-seqtk - Shell access to the seqtk container.
  • make shell-r - Shell access to the r container with nonpareil.

Container tools access

  • make r - Start an R session.

Repository structure

  • The CABANA folder will contain the resulting files from the different steps.
  • The DATA folder will contain datasets downloaded by some of the tools.
  • The scripts folder will contain bash scripts that will be executed by the different docker containers in order to implement the same steps defined in the tutorial.

Notes

  • At least 110GB of available hard drive space is required (docker images, data sets).
  • At least 8GB of RAM is required in the environment where the tutorial will run.
  • The tasks download data from several sources, so the time to complete some steps depends on network connection speed.
  • The datasets are quite large, so the time to complete some tasks depends on the CPU and the amount of available RAM.
  • On a 100/100 MB optic fiber network, with a Lubuntu VMware instance running on Windows with 16GB of RAM and 3 cores of an Intel i7-9750 CPU assigned, a clean run (data downloaded for the first time) took 3 hours on average.
  • On the same setup, a subsequent run (no data download) took 1 hour on average.
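
The disk and memory minimums above can be verified with a short script. This is a sketch under stated assumptions: the function names are hypothetical, memory is read from `/proc/meminfo` on Linux and from `sysctl hw.memsize` on OSX, and the disk check looks at the filesystem that holds the current folder, which may differ from the one where Docker stores its images.

```shell
#!/usr/bin/env bash
REQUIRED_DISK_GB=110   # minimum free disk space stated in the notes
REQUIRED_RAM_GB=8      # minimum RAM stated in the notes

# Free space in GB (rounded down) on the filesystem holding the given path.
free_disk_gb() {
  df -Pk "${1:-.}" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Total physical memory in GB, rounded to the nearest integer because
# the operating system reports slightly less than the installed amount.
total_ram_gb() {
  if [ -r /proc/meminfo ]; then
    awk '/^MemTotal:/ { print int($2 / 1024 / 1024 + 0.5) }' /proc/meminfo
  else
    sysctl -n hw.memsize | awk '{ print int($1 / 1024 / 1024 / 1024 + 0.5) }'
  fi
}

# Compare an actual value against a required minimum and report the result.
check_minimum() {
  if [ "$2" -ge "$3" ]; then
    echo "OK: $1 ${2}GB (need ${3}GB)"
  else
    echo "LOW: $1 ${2}GB (need ${3}GB)"
  fi
}

check_minimum "free disk" "$(free_disk_gb .)" "$REQUIRED_DISK_GB"
check_minimum "RAM" "$(total_ram_gb)" "$REQUIRED_RAM_GB"
```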

Download size

Data download over a 100/100 MB network.

Data file        Download time   Data size
SRR8555090 - 1   0:03:11         544MB
SRR8555090 - 2   0:03:33         570MB
SRR8555091 - 1   0:04:24         660MB
SRR8555091 - 2   0:04:49         683MB
SRR9988196 - 1   0:03:46         610MB
SRR9988196 - 2   0:03:23         625MB
SRR9988190 - 1   0:01:14         90.9MB
SRR9988190 - 2   0:01:16         94.1MB
SRR9988205 - 1   0:08:12         615MB
SRR9988205 - 2   0:04:09         630MB
SRR8555113 - 1   0:14:53         730MB
SRR8555113 - 2   0:03:28         751MB
hg38.fa.gz       0:05:12         984MB
chocophlan       0:23:38         5370MB
uniref           0:25:12         5870MB

Approximate download size excluding docker containers: 18827MB
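
The approximate total is the sum of the data sizes listed in the table. A small sketch that reproduces it (the helper name `total_mb` is hypothetical; the values are copied from the table in MB):

```shell
#!/usr/bin/env bash
# Sum one number per input line and print the total, rounded to an integer.
total_mb() {
  awk '{ sum += $1 } END { printf "%.0f\n", sum }'
}

# Data sizes from the table above, in MB, in table order.
printf '%s\n' 544 570 660 683 610 625 90.9 94.1 615 630 730 751 984 5370 5870 | total_mb
# prints 18827
```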

Contact

Feel free to open an issue if something is not working as intended.
