All the contents are ready for V3.0 release (#37)

* Autodock (#14)
* Added Autodock workload
* removed unnecessary contents
* removed unnecessary contents
* Removed old files and updated README and data download script
* updated readme and data download script
* updated readme and data download script
* updated README
* updated README
* updated README
* updated README
* updated README
* updated README
* updated README
* updated README
* updated README
* updated data download script for the test cases
---------
Co-authored-by: arunvelliyangiri18 <[email protected]>

* Autodock vina (#15)
* Added Autodock workload
* Added Autodock-Vina workload
* removed unnecessary contents
* removed unnecessary contents
* removed unnecessary contents
* removed unnecessary contents
* removed Autodock
* added data download script and removed 5wlo
* Added updated README
* updated README and data download script
* updated README
* updated README
* updated README
* updated README
* updated README
* updated README
* updated README
* updated data download script for the test cases
* updated README
* updated README
---------
Co-authored-by: arunvelliyangiri18 <[email protected]>

* Added ESM (#13)
* add esm docker files and patch files
* add esm docker files and patch files
* add esm docker files and patch files
* add esm docker files and patch files
---------

* added STAR aligner (v2.7.11b) as a submodule in applications (#17)
* Add ProtGPT2, RFdiffusion, and ProteinMPNN projects (#23)
* adding protgpt2
* adding ProteinMPNN
* adding RFdiffusion
* update protgpt2.py file and README
* update ProteinMPNN Dockerfile
* In this commit, I have update the Dockerfile, RFdiffusion patch, symmetry file, and run_inference file.
* Updated the Open Omics block diagram (#20)
* Updated the Open Omics block diagram
* Updated the block diagram
* Update README.md to reflect 3.0 release
* Uploaded the new Open Omics diagram for v3.0
* Update README.md
* Updated the Open Omics diagram with better quality
* Update path to Open Omics diagram
* removed rfd, proteinmpnn, protgpt, will send the pr again
* cleanup the files for Rfdiffusion and ProteinMPNN and Protgpt2 (#24)
* adding protgpt2
* adding ProteinMPNN
* adding RFdiffusion
* update protgpt2.py file and README
* update ProteinMPNN Dockerfile
* In this commit, I have update the Dockerfile, RFdiffusion patch, symmetry file, and run_inference file.
* removed RFdiffusion optimize code
* removed ProteinMPNN optimize code
* Update README.md
* Update README.md
* Update README.md
* Update README.md
* cleanup hidden files
* cleanedup some stray hidden files
* Added privacy notice (#16) (#19)
Co-authored-by: sanchit-misra <[email protected]>

* Autodock (#21)
* Added Autodock workload
* removed unnecessary contents
* removed unnecessary contents
* Removed old files and updated README and data download script
* updated readme and data download script
* updated readme and data download script
* updated README
* updated README
* updated README
* updated README
* updated README
* updated README
* updated README
* updated README
* updated README
* updated data download script for the test cases
* updated README for proxy build command for docker
---------
Co-authored-by: arunvelliyangiri18 <[email protected]>
Co-authored-by: manasi-t24 <[email protected]>

* Autodock vina (#22)
* Added Autodock workload
* Added Autodock-Vina workload
* removed unnecessary contents
* removed unnecessary contents
* removed unnecessary contents
* removed unnecessary contents
* removed Autodock
* added data download script and removed 5wlo
* Added updated README
* updated README and data download script
* updated README
* updated README
* updated README
* updated README
* updated README
* updated README
* updated README
* updated data download script for the test cases
* updated README
* updated README
* updated README for proxy build command for docker
---------
Co-authored-by: arunvelliyangiri18 <[email protected]>
Co-authored-by: manasi-t24 <[email protected]>

* updated ProteinMPNN and ProtGPT2 and RFdiffusion Dockerfiles (#28)
* ESM: Dockerfiles updated to work independently (#26)
* Add ESM changes
* Add ESM changes
* MoFlow workload added (#18)
* Added privacy notice (#16)
* Add MoFlow updates
* Add MoFlow updates
* Add MoFlow updates
---------
Co-authored-by: sanchit-misra <[email protected]>

* Update README.md (#29)
* Updated ESM README changes (#31)
  Corrected LM design command line
* Adding multimer support to v3.0-release branch (#33)
* Adding support for AlphaFold2 Multimer
* single docker files for handling monomer and multimer cases
* changing to the main branch of open-omics-alphafold
* Added privacy notice (#16) (#25)
* removed the commented code
---------
Co-authored-by: sanchit-misra <[email protected]>

* git clone replaced with wget for release code (#34)
* Update README.md to reflect v3.0 additions (#35)
* fixed wget OO downloads urls in fq2bam and dv1 dockers for V3.0 release (#36)
* updated docker files (fq2bam, dv1) with git download
* Update README.md
* Update README.md
* clean up
* clean up dockers
---------
Co-authored-by: vasimuddin.md <[email protected]>
Co-authored-by: vasimuddin.md <[email protected]>

---------
Co-authored-by: sri480673 <[email protected]>
Co-authored-by: arunvelliyangiri18 <[email protected]>
Co-authored-by: Vasimuddin Md <[email protected]>
Co-authored-by: Rahamathullah365 <[email protected]>
Co-authored-by: sanchit-misra <[email protected]>
Co-authored-by: Chirayu Haryan <[email protected]>
Co-authored-by: manasi-t24 <[email protected]>
Co-authored-by: Narendra Chaudhary <[email protected]>
Co-authored-by: vasimuddin.md <[email protected]>
Co-authored-by: vasimuddin.md <[email protected]>
IntelLabs · Dec 13, 2024 · 3cfa489
1 parent 9b6d869
commit 3cfa489
Showing 50 changed files with 7,526 additions and 85 deletions.
diff --git a/.gitmodules b/.gitmodules
@@ -22,3 +22,6 @@
 [submodule "applications/bcftools"]
 	path = applications/bcftools
 	url = https://github.com/samtools/bcftools.git
+[submodule "applications/STAR"]
+	path = applications/STAR
+	url = https://github.com/alexdobin/STAR.git
diff --git a/README.md b/README.md
@@ -5,17 +5,17 @@ Intel lab's open sourced data science framework for accelerating digital biology
 # Introduction
 We are in the epoch of digital biology, that is fueled by the convergence of three revolutions: 1) Measurement of biological systems at high resolution resulting in massive multi-modal, multi-scale, unstructured, distributed data, 2) Novel data science (AI and data management) techniques on this data, and 3) Wide-spread cloud use enabling massive compute and public data repositories, large collaborative projects and consortia. It will require computing and data management at unprecedented scale and speed. However, performance alone would not suffice if it significantly compromised the productivity of biologists and data scientists who are at the forefront of this transformation. 
 
-With a goal to build a performant, cost effective and productive platform, we are building **Open Omics acceleration framework**: a one-click, containerized, customizable, open-sourced framework for accelerating digital biology research. The framework is being built with a modular design that keeps in mind the different ways the users would want to interact with it. As shown in the following block diagram, it consists of three layers:
-* **Pipeline layer**: for users who are looking for one click solution to run standard pipelines. Currently, we support the following pipelines:
+With a goal to build a performant, cost effective and productive platform, we are building **Open Omics acceleration framework**: a one-click, containerized, customizable, open-sourced framework for accelerating digital biology research. It provides tools and pipelines in the fields of genomics, transcriptomics, proteomics, drug molecule search and de novo drug design. The framework is being built with a modular design that keeps in mind the different ways the users would want to interact with it. As shown in the following block diagram, it consists of three layers:
+* **Pipeline layer**: for users who are looking for one click solution to run standard pipelines. The pipelines can be accessed in the 'pipelines' subfolder, which provides instructions to build and run the Docker images. Currently, we support the following pipelines:
   * [**fq2sortedbam**](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/fq2sortedbam): Given gzipped fastq files of an individual, this workflow performs sequence mapping ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2)) and sorting ([SAMtools](https://github.com/samtools/samtools) sort) to output the sorted BAM file.
   * [**DeepVariant based germline pipeline for variant calling (fq2vcf)**](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/deepvariant-based-germline-variant-calling-fq2vcf): Given paired end gzipped fastq files of an individual, this workflow performs sequence mapping ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2)), sorting ([SAMtools](https://github.com/samtools/samtools) sort) and variant calling ([Open Omics DeepVariant](https://github.com/IntelLabs/open-omics-deepvariant)) to call the variants in the genome of the individual.
-  * [**AlphaFold2-based protein folding**](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/alphafold2-based-protein-folding): Given one or more protein sequences, this workflow performs preprocessing (database search and multiple sequence alignment using Open Omics [HMMER](https://github.com/IntelLabs/hmmer) and [HH-suite](https://github.com/IntelLabs/hh-suite)) and structure prediction ([Open Omics AlphaFold2](https://github.com/IntelLabs/open-omics-alphafold)) to output the structure(s) of the protein sequences.
+  * [**AlphaFold2-based protein folding**](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/alphafold2-based-protein-folding): Given one or more protein sequences, this workflow performs preprocessing (database search and multiple sequence alignment using Open Omics [HMMER](https://github.com/IntelLabs/hmmer) and [HH-suite](https://github.com/IntelLabs/hh-suite)) and structure prediction ([Open Omics AlphaFold2](https://github.com/IntelLabs/open-omics-alphafold)) to output the structure(s) of the protein sequences. It has support for both AlphaFold2 monomer and AlphaFold2 multimer.
   * [**Single cell RNASeq analysis**](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/single-cell-RNA-seq-analysis): Given a cell by gene matrix, this [scanpy](https://github.com/scverse/scanpy) based workflow performs data preprocessing (filter, linear regression and normalization), dimensionality reduction (PCA), clustering (Louvain/Leiden/kmeans) to cluster the cells into different cell types and visualize those clusters (UMAP/t-SNE).
-* **Toolkit (applications) layer**: for users who want to use individual tools or to create their own custom pipelines by combining various tools.
-* **Building blocks (lib) layer**: for tool developers, this layer consists of key building blocks -- biology specific and generic AI algorithms and data structures -- that can replace ones used in existing tools to accelerate them or can be used as ingredients to build new efficient tools.
+* **Toolkit layer**: for users who want to use individual tools or to create their own custom pipelines by combining various tools. The toolkit layer can be accessed in the 'applications' subfolder. For each tool, we provide instructions to build and run it. Currently, the supported tools cover genomics (BWA-MEM, minimap2, bcftools, SAMtools, DeepVariant), transcriptomics (STAR aligner), protein folding (AlphaFold2, ESMFold), protein structure and sequence design (RFDiffusion, ProteinMPNN, LM-design, ESM2-inv, ProtGPT2, ESM2 embeddings), molecular docking (AutoDock, AutoDock-Vina), and de novo molecule generation (MoFlow).
+* **Building blocks layer**: for tool developers, this layer consists of key building blocks -- biology specific and generic AI algorithms and data structures -- that can replace ones used in existing tools to accelerate them or can be used as ingredients to build new efficient tools. This layer can be accessed in the 'lib' subfolder.
 
 <p align="center">
-<img src="https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/blob/main/images/Open-Omics-Acceleration-Framework-v2.0.JPG" height="300"/a></br>
+<img src="https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/blob/main/images/Open-Omics-Acceleration-Framework-v3.0.jpg" height="300"/a></br>
 </p> 
 
 With a goal of providing a one-stop platform, this framework brings our following repositories for digital biology under one umbrella:
@@ -37,7 +37,7 @@ In addition, we also use several existing AI libraries: oneDNN, oneDAL, oneCCL,
 # Getting Started
 ```sh
 # Download release
-wget https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/releases/download/2.1/Source_code_with_submodules.tar.gz 
+wget https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/releases/download/3.0/Source_code_with_submodules.tar.gz 
 tar -xzf Source_code_with_submodules.tar.gz
 
 # Clone master

diff --git a/applications/AutoDock-Vina/Dockerfile b/applications/AutoDock-Vina/Dockerfile
@@ -0,0 +1,31 @@
+FROM condaforge/miniforge3:4.10.2-0
+ENV DEBIAN_FRONTEND=noninteractive
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    build-essential \
+    libboost-all-dev \
+    swig \
+    vim \
+    gcc-8 \
+    g++-8 \
+    numactl \
+    time && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+ENV CC=gcc-8
+ENV CXX=g++-8
+WORKDIR /opt
+RUN git clone https://github.com/ccsb-scripps/AutoDock-Vina.git
+WORKDIR /opt/AutoDock-Vina
+RUN git checkout v1.2.2
+WORKDIR /opt/AutoDock-Vina/build/linux/release
+RUN make -j$(nproc)
+ENV SERVICE_NAME="autodock-vina-service"
+RUN groupadd --gid 1001 $SERVICE_NAME && \
+    useradd -m -g $SERVICE_NAME --shell /bin/false --uid 1001 $SERVICE_NAME
+RUN chown -R $SERVICE_NAME:$SERVICE_NAME /opt
+USER $SERVICE_NAME
+ENV PATH="/opt/AutoDock-Vina/build/linux/release:$PATH"
+WORKDIR /input
+HEALTHCHECK NONE
+CMD ["vina","--help"]
+
diff --git a/applications/AutoDock-Vina/README.md b/applications/AutoDock-Vina/README.md
@@ -0,0 +1,79 @@
+## Open-Omics-Autodock-Vina
+Open-Omics-Autodock-Vina is fast, efficient molecular docking software that predicts ligand-protein binding poses and affinities. It features a refined scoring function, parallel execution on multicore CPUs, and user-friendly configuration.
+
+## Docker Setup Instructions
+
+
+### 1. Build the Docker Image 
+To build the Docker image with the tag `docker_vina`, use one of the following commands, depending on whether your machine sits behind a proxy:
+* For a machine without a proxy:
+```bash
+docker build -t docker_vina .
+```
+* For a machine with a proxy:
+```bash
+docker build --build-arg http_proxy=<http_proxy> --build-arg https_proxy=<https_proxy> --build-arg no_proxy=<no_proxy_ip> -t docker_vina .
+```
+
+
+### 2. Choose and Download Protein Complex Data
+Select any protein complex from the [dataset of **140** protein-ligand complexes](https://zenodo.org/records/4031961), which can be downloaded as a [single archive](https://zenodo.org/records/4031961/files/data.zip?download=1). This guide uses the **5wlo** protein as an example.
+
+1) Run the commands below to make the data download script executable, download the complete dataset, and extract the data for `5wlo`:
+
+```bash
+chmod +x data_download_script.sh
+bash data_download_script.sh 5wlo
+```
+**Note: You can replace 5wlo with any other complex name from the complete dataset, which is available in the `data_original/data` directory.**
+
+2) Create an output directory to store results specific to `5wlo`:
+```bash
+mkdir -p 5wlo_output                                                                                                               
+```
+
+3) Set the environment variables for the `5wlo` protein as follows:
+```bash                                                                                                                         
+export INPUT_VINA=$PWD/5wlo
+export OUTPUT_VINA=$PWD/5wlo_output
+```
+
+4) Make the output folder writable so that the container's non-root user can write results to it:
+```bash
+sudo chmod -R a+w $OUTPUT_VINA
+```
+
+### 3. Run the Docker Container
+Verify that the Docker image was built successfully by listing Docker images:
+```bash
+docker images | grep docker_vina                                                                                                
+```
+If the image is listed, run AutoDock Vina with the following command:
+```bash                                                                                                                         
+docker run -it -v $INPUT_VINA:/input -v $OUTPUT_VINA:/output docker_vina:latest vina --receptor protein.pdbqt --ligand rand-1.pdbqt --out /output/rand-1_out.pdbqt --center_x 16.459 --center_y -19.946 --center_z -5.850 --size_x 18 --size_y 18 --size_z 18 --seed 1234 --exhaustiveness 64
+```
+This command processes the receptor and ligand files and writes the results to the mounted output directory.
+### 4. Expected Output
+After running the above command, you should find the output file (`rand-1_out.pdbqt`) in the output directory (`5wlo_output` in this example).
+
+---
+The original README content of AutoDock-Vina follows:
+
+## AutoDock Vina: Docking and virtual screening program
+
+**AutoDock Vina** is one of the **fastest** and **most widely used** **open-source** docking engines. It is a turnkey computational docking program that is based on a simple scoring function and rapid gradient-optimization conformational search. It was originally designed and implemented by Dr. Oleg Trott in the Molecular Graphics Lab, and it is now being maintained and developed by the Forli Lab at The Scripps Research Institute.
+
+* AutoDock4.2 and Vina scoring functions
+* Support of simultaneous docking of multiple ligands and batch mode for virtual screening
+* Support of macrocycle molecules
+* Hydrated docking protocol
+* Can write and load external AutoDock maps
+* Python bindings for Python 3
+
+## Documentation
+
+The installation instructions, documentation and tutorials can be found on [readthedocs.org](https://autodock-vina.readthedocs.io/en/latest/).
+
+## Citations
+* [J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli. (2021). AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. Journal of Chemical Information and Modeling.](https://pubs.acs.org/doi/10.1021/acs.jcim.1c00203)
+* [O. Trott and A. J. Olson. (2010). AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of computational chemistry, 31(2), 455-461.](https://onlinelibrary.wiley.com/doi/10.1002/jcc.21334)
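Reviewer sketch (not part of the commit): the README runs a single ligand through `docker run`. When several ligands need to be docked against the same receptor, it may help to generate the commands first and review them before executing anything. The helper below is hypothetical; it assumes `INPUT_VINA` and `OUTPUT_VINA` are exported as in step 2 and that every ligand shares the 5wlo search box and settings shown in the README. It drops `-it`, since a batch run presumably needs no interactive terminal.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the `docker run` command for one ligand file.
# The receptor name, box centre/size, seed and exhaustiveness are copied from
# the README example for 5wlo; reusing them for other ligands is an assumption.
vina_cmd() {
    local ligand="$1"
    local name="${ligand%.pdbqt}"   # strip the extension to name the output file
    printf 'docker run -v %s:/input -v %s:/output docker_vina:latest vina' \
        "$INPUT_VINA" "$OUTPUT_VINA"
    printf ' --receptor protein.pdbqt --ligand %s --out /output/%s_out.pdbqt' \
        "$ligand" "$name"
    printf ' --center_x 16.459 --center_y -19.946 --center_z -5.850'
    printf ' --size_x 18 --size_y 18 --size_z 18 --seed 1234 --exhaustiveness 64\n'
}
```

For example, `vina_cmd rand-1.pdbqt` prints the README command for `rand-1.pdbqt` (minus `-it`); looping over the ligand files in `$INPUT_VINA` and piping the output to a file gives a reviewable batch script.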
diff --git a/applications/AutoDock-Vina/data_download_script.sh b/applications/AutoDock-Vina/data_download_script.sh
@@ -0,0 +1,27 @@
+url="https://zenodo.org/records/4031961/files/data.zip?download=1"
+download_dir="./data_original"
+target_folder="${1:?Usage: bash data_download_script.sh <complex_name>}"
+if [ ! -d "$download_dir/data" ]; then
+    echo "Downloading data.zip..."
+    mkdir -p "$download_dir"
+    wget -O "$download_dir/data.zip" "$url"
+
+    echo "Unzipping data.zip..."
+    unzip "$download_dir/data.zip" -d "$download_dir"
+    rm -f "$download_dir/data.zip"
+
+    echo "Data downloaded and extracted to $download_dir/data"
+else
+    echo "Data already exists in $download_dir/data. Skipping download and extraction."
+fi
+if [ -d "$target_folder" ]; then
+    echo "The folder '$target_folder' already exists in the current directory. Skipping copy."
+else
+    if [ -d "$download_dir/data/$target_folder" ]; then
+        cp -r "$download_dir/data/$target_folder" ./
+        echo "$target_folder folder successfully copied to the current directory."
+    else
+        echo "$target_folder folder not found inside '$download_dir/data'." >&2
+        exit 1
+    fi
+fi
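Reviewer sketch (not part of the commit): the copy step in this script has three outcomes (already present, copied, not found), and callers may want to tell them apart by exit status. The function below is a hypothetical restatement of that step with explicit return codes; the name `stage_complex` and its argument order are assumptions for illustration, not something the repository provides.

```bash
#!/usr/bin/env bash
# stage_complex SRC_ROOT NAME [DEST]
# Copies SRC_ROOT/NAME into DEST (default: current directory) unless it is
# already there. Returns 0 on copy or skip, 1 if NAME is missing in SRC_ROOT,
# 2 if no NAME was given.
stage_complex() {
    local src_root="$1" name="$2" dest="${3:-.}"
    if [ -z "$name" ]; then
        echo "complex name required" >&2
        return 2
    fi
    if [ -d "$dest/$name" ]; then
        echo "skip: $dest/$name already exists"
        return 0
    fi
    if [ ! -d "$src_root/$name" ]; then
        echo "missing: $name not found in $src_root" >&2
        return 1
    fi
    cp -r "$src_root/$name" "$dest/" && echo "copied: $name"
}
```

With the layout used by the script, a call would look like `stage_complex ./data_original/data 5wlo`.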
diff --git a/applications/Autodock/Dockerfile b/applications/Autodock/Dockerfile
@@ -0,0 +1,36 @@
+FROM condaforge/miniforge3:4.10.2-0
+ENV DEBIAN_FRONTEND=noninteractive
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    vim \
+    git \
+    build-essential \
+    ocl-icd-opencl-dev \
+    clinfo && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+RUN conda install -c conda-forge \
+    python=3.10 \
+    requests=2.28.2 \
+    mkl=2023.1 \
+    dpcpp_linux-64=2023.1 \
+    dpcpp-cpp-rt=2023.1 \
+    mkl-devel=2023.1 && \
+    conda clean --all -f -y
+ENV LD_LIBRARY_PATH="/opt/conda/lib:${LD_LIBRARY_PATH}"
+WORKDIR /opt
+ENV SERVICE_NAME="autodock-service"
+RUN groupadd --gid 1001 $SERVICE_NAME && \
+    useradd -m -g $SERVICE_NAME --shell /bin/false --uid 1001 $SERVICE_NAME && \
+    mkdir -p /opt/AutoDock && \
+    chown -R $SERVICE_NAME:$SERVICE_NAME /opt/AutoDock
+USER $SERVICE_NAME
+WORKDIR /opt/AutoDock
+RUN git clone https://github.com/emascarenhas/AutoDock-GPU.git . && \
+    git checkout v1.4
+RUN make DEVICE=CPU NUMWI=64 && \
+    rm -rf .git build_temp
+ENV PATH="/opt/AutoDock/bin:${PATH}"
+HEALTHCHECK NONE
+WORKDIR /input
+CMD ["autodock_cpu_64wi","--help"]
+
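Reviewer sketch (not part of the commit): the AutoDock-Vina README gives two build commands, one plain and one with `--build-arg` proxy settings. A small wrapper could pick between them based on the environment. The helper below is hypothetical; it assumes the proxy settings live in the conventional lowercase `http_proxy`, `https_proxy` and `no_proxy` variables and only prints the command, so it can be inspected before running. Values containing spaces would need extra quoting.

```bash
#!/usr/bin/env bash
# docker_build_cmd TAG [CONTEXT]
# Prints a `docker build` command, adding one --build-arg per proxy variable
# that is set and non-empty in the current environment.
docker_build_cmd() {
    local tag="$1" context="${2:-.}"
    local cmd="docker build"
    local var val
    for var in http_proxy https_proxy no_proxy; do
        eval "val=\${$var:-}"          # read the variable whose name is in $var
        if [ -n "$val" ]; then
            cmd="$cmd --build-arg $var=$val"
        fi
    done
    echo "$cmd -t $tag $context"
}
```

Running `docker_build_cmd docker_vina` from `applications/AutoDock-Vina` should reproduce whichever of the two README commands matches the current environment.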