Metagenomics for the Study of Human Gut Microbiome in Health and Disease: Applications in Acute Diarrheal Diseases (ADD)
- Angela Peña-González, PhD
- Alejandro Reyes Muñoz, PhD
School of Biological Sciences, Universidad de los Andes
This repository helps students complete Metagenomics in Communicable Diseases: Tutorial #2 by providing a preconfigured environment, so they can focus on the tools used throughout the tutorial.
- Linux or macOS is the recommended environment for running this tutorial. If you don't have access to either, you could use one of the following alternatives:
  - On Windows, a virtual machine (VirtualBox or VMware should work) with a Linux install.
  - An AWS instance (depending on the setup, it could be free or incur a monthly fee).
  - A Google Compute Engine instance (depending on the setup, it could be free or incur a monthly fee).
  - An Azure Virtual Machine instance (depending on the setup, it could be free or incur a monthly fee).
- Docker must be installed on the environment.
- Docker Compose must be installed on the environment.
- Make must be installed on the environment.
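The prerequisites above can be verified before starting. The following is a minimal sketch, not part of the repository: the function names `check_tool`, `check_compose`, and `check_prerequisites` are hypothetical, and because it is an assumption which Docker Compose flavour the project expects, the sketch accepts either the standalone `docker-compose` binary or the `docker compose` plugin.

```shell
#!/bin/sh
# Report whether a single command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Docker Compose may be installed as a standalone binary or as a Docker plugin.
# Which one this project needs is an assumption, so either is accepted here.
check_compose() {
    if command -v docker-compose >/dev/null 2>&1; then
        echo "ok: docker-compose"
    elif docker compose version >/dev/null 2>&1; then
        echo "ok: docker compose"
    else
        echo "missing: docker compose"
        return 1
    fi
}

# Check all three prerequisites; return non-zero if any is missing.
check_prerequisites() {
    status=0
    check_tool docker || status=1
    check_compose || status=1
    check_tool make || status=1
    return "$status"
}

check_prerequisites || echo "Install the missing tools before running the tutorial."
```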
The following commands must be run from the project folder. They depend on make being available.
- `make start`: Downloads the Docker images (if not already available) and starts the Docker containers necessary to run the tutorial.
- `make stop`: Shuts down the containers if they are running.
- `make tutorial`: If the containers are running, replicates the steps described in the tutorial file and stores the results in the `CABANA` folder as described.
- `make prepare-tutorial`: Prepares the data by downloading the datasets used for control and testing.
  - `make download-data`: Downloads the data that will be used in the tutorial.
  - `make uncompress-data`: Uncompresses the datasets downloaded with the above command.
  - `make seqtk-data`: Takes the uncompressed data and passes it through seqtk.
- `make run-quality-control`: Runs the quality control steps from the tutorial.
  - `make qc-prepare`: Prepares the seqtk data for the quality control process.
  - `make qc-run-step-1`: Quality control step 1 from the tutorial.
  - `make qc-run-step-2`: Quality control step 2 from the tutorial.
  - `make qc-run-step-3`: Quality control step 3 from the tutorial.
  - `make qc-run-step-4`: Quality control step 4 from the tutorial.
  - `make qc-run-step-5`: Quality control step 5 from the tutorial.
- `make run-coverage`: Runs the coverage steps from the tutorial.
  - `make cv-step-1`: Coverage step 1 from the tutorial.
  - `make cv-step-2`: Coverage step 2 from the tutorial.
- `make run-distance-estimation`: Runs the distance estimation steps from the tutorial.
  - `make distance-estimation-step-1`: Distance estimation step 1.
  - `make distance-estimation-step-2`: Distance estimation step 2.
- `make run-metagenomic-assembly`: Runs the metagenomic assembly steps from the tutorial.
  - `make metagenomic-assembly-step-1`: Metagenomic assembly script.
- `make run-binning-clustering`: Runs the binning and clustering steps from the tutorial.
  - `make binning-step-1`: Binning and clustering step 1.
  - `make binning-step-2`: Binning and clustering step 2.
- `make run-taxonomy`: Runs the taxonomy steps from the tutorial.
  - `make taxonomy-step-1`: Taxonomy step 1.
- `make run-functional`: Runs the functional steps from the tutorial.
  - `make functional-step-1`: Functional step 1.
  - `make functional-step-2`: Functional step 2.
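To run the tutorial one stage at a time instead of through `make tutorial`, the stage-level targets listed above can be chained in a small script. This is only a sketch: it assumes the stages are meant to run in the order they are listed here and that the containers were already started with `make start`. The `run_stages` function and the `DRY_RUN` variable are hypothetical helpers, not part of the repository.

```shell
#!/bin/sh
# Stage targets in the order they appear in this README
# (assumed, not confirmed, to be the intended execution order).
STAGES="prepare-tutorial run-quality-control run-coverage run-distance-estimation run-metagenomic-assembly run-binning-clustering run-taxonomy run-functional"

# Run each stage in turn and stop at the first failure.
# With DRY_RUN=1 the commands are printed instead of executed.
run_stages() {
    for stage in $STAGES; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "make $stage"
        else
            make "$stage" || { echo "stage failed: $stage" >&2; return 1; }
        fi
    done
}
```

Calling `DRY_RUN=1 run_stages` from the project folder prints the eight `make` commands without running them, which is a cheap way to review the sequence before committing to a multi-hour run.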
The following commands provide easy shell access to the containers.
- `make shell-blast`: Shell access to the blast container.
- `make shell-bmtagger`: Shell access to the bmtagger container.
- `make shell-checkm`: Shell access to the checkm container.
- `make shell-enveomics`: Shell access to the enveomics container.
- `make shell-mash`: Shell access to the mash container.
- `make shell-maxbin2`: Shell access to the maxbin2 container.
- `make shell-megahit`: Shell access to the megahit container.
- `make shell-metaphlan2`: Shell access to the metaphlan2 container.
- `make shell-multiqc`: Shell access to the multiqc container.
- `make shell-seqtk`: Shell access to the seqtk container.
- `make shell-r`: Shell access to the r container with nonpareil.
- `make r`: Starts an R session.
- The `CABANA` folder will contain the resulting files from the different steps.
- The `DATA` folder will contain some datasets downloaded from some tools.
- The `scripts` folder will contain bash scripts that will be executed by the different docker containers in order to implement the same steps defined in the tutorial.
- At least 110GB of available hard drive space (docker images, data sets).
- At least 8GB of RAM on the environment where the tutorial will run.
- The tasks must download data from different sources. Because of this, the time to complete some steps will depend on network connection speed.
- The datasets are quite large. Because of this, the time to complete some tasks depends on the CPU and the available amount of RAM.
- The average time to run the tutorial on a 100/100 MB optic fiber network, using a Lubuntu VMware instance running on Windows with 16GB of RAM and 3 cores of an Intel i7-9750 CPU assigned, was 3 hours for a clean run (data downloaded for the first time).
- The average time to run the tutorial on the same network and VMware instance was 1 hour for a subsequent run (no data download).
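The 110GB disk-space requirement above can be checked before any download starts. The sketch below uses only POSIX `df` and `awk`. The helper name `has_free_gb` is hypothetical, and the check assumes the Docker images and the datasets end up on the same filesystem as the project folder, which may not hold for every Docker setup.

```shell
#!/bin/sh
# Succeed when the filesystem holding directory $1 has at least $2 GB free.
has_free_gb() {
    dir=$1
    min_gb=$2
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # the fourth column of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

if has_free_gb . 110; then
    echo "enough disk space"
else
    echo "less than 110GB free in $(pwd)"
fi
```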
Data download over a 100/100 MB network.
Data file | Download time | Data size |
---|---|---|
SRR8555090 - 1 | 0:03:11 | 544MB |
SRR8555090 - 2 | 0:03:33 | 570MB |
SRR8555091 - 1 | 0:04:24 | 660MB |
SRR8555091 - 2 | 0:04:49 | 683MB |
SRR9988196 - 1 | 0:03:46 | 610MB |
SRR9988196 - 2 | 0:03:23 | 625MB |
SRR9988190 - 1 | 0:01:14 | 90.9MB |
SRR9988190 - 2 | 0:01:16 | 94.1MB |
SRR9988205 - 1 | 0:08:12 | 615MB |
SRR9988205 - 2 | 0:04:09 | 630MB |
SRR8555113 - 1 | 0:14:53 | 730MB |
SRR8555113 - 2 | 0:03:28 | 751MB |
hg38.fa.gz | 0:05:12 | 984MB |
chocophlan | 0:23:38 | 5370MB |
uniref | 0:25:12 | 5870MB |
Approximate download size excluding docker containers: 18827MB
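The total above can be reproduced by summing the "Data size" column of the table. The following sketch does that with `awk`. The `total_mb` function is a hypothetical helper, and the rows are copied from the table rather than read from any file in the repository.

```shell
#!/bin/sh
# Sum the third pipe-separated column (sizes such as "544MB") of rows on stdin.
total_mb() {
    awk -F'|' '{ gsub(/[^0-9.]/, "", $3); total += $3 } END { printf "%.0f\n", total }'
}

total_mb <<'EOF'
SRR8555090 - 1 | 0:03:11 | 544MB |
SRR8555090 - 2 | 0:03:33 | 570MB |
SRR8555091 - 1 | 0:04:24 | 660MB |
SRR8555091 - 2 | 0:04:49 | 683MB |
SRR9988196 - 1 | 0:03:46 | 610MB |
SRR9988196 - 2 | 0:03:23 | 625MB |
SRR9988190 - 1 | 0:01:14 | 90.9MB |
SRR9988190 - 2 | 0:01:16 | 94.1MB |
SRR9988205 - 1 | 0:08:12 | 615MB |
SRR9988205 - 2 | 0:04:09 | 630MB |
SRR8555113 - 1 | 0:14:53 | 730MB |
SRR8555113 - 2 | 0:03:28 | 751MB |
hg38.fa.gz | 0:05:12 | 984MB |
chocophlan | 0:23:38 | 5370MB |
uniref | 0:25:12 | 5870MB |
EOF
# prints 18827
```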
Feel free to open an issue if something is not working as intended.