
Commit

All the contents are ready for V3.0 release (#37)
* Autodock (#14)

* Added Autodock workload

* removed unnecessary contents

* removed unnecessary contents

* Removed old files and updated README and data download script

* updated readme and data download script

* updated readme and data download script

* updated README

* updated README

* updated README

* updated README

* updated README

* updated README

* updated README

* updated README

* updated README

* updated data download script for the test cases

---------

Co-authored-by: arunvelliyangiri18 <[email protected]>

* Autodock vina (#15)

* Added Autodock workload

* Added Autodock-Vina workload

* removed unnecessary contents

* removed unnecessary contents

* removed unnecessary contents

* removed unnecessary contents

* removed Autodock

* added data download script and removed 5wlo

* Added updated README

* updated README and data download script

* updated README

* updated README

* updated README

* updated README

* updated README

* updated README

* updated README

* updated data download script for the test cases

* updated README

* updated README

---------

Co-authored-by: arunvelliyangiri18 <[email protected]>

* Added ESM (#13)

* add esm docker files and patch files

* add esm docker files and patch files

* add esm docker files and patch files

* add esm docker files and patch files

---------

* added STAR aligner (v2.7.11b) as a submodule in applications (#17)

* Add ProtGPT2, RFdiffusion, and ProteinMPNN projects (#23)

* adding protgpt2

* adding ProteinMPNN

* adding RFdiffusion

* update protgpt2.py file and README

* update ProteinMPNN Dockerfile

* In this commit, I have update the Dockerfile, RFdiffusion patch, symmetry file, and run_inference file.

* Updated the Open Omics block diagram (#20)

* Updated the Open Omics block diagram

* Updated the block diagram

* Update README.md to reflect 3.0 release

* Uploaded the new Open Omics diagram for v3.0

* Update README.md

* Updated the Open Omics diagram with better quality

* Update path to Open Omics diagram

* removed rfd, proteinmpnn, protgpt, will send the pr again

* cleanup the files for Rfdiffusion and ProteinMPNN and Protgpt2 (#24)

* adding protgpt2

* adding ProteinMPNN

* adding RFdiffusion

* update protgpt2.py file and README

* update ProteinMPNN Dockerfile

* In this commit, I have update the Dockerfile, RFdiffusion patch, symmetry file, and run_inference file.

* removed RFdiffusion optimize code

* removed ProteinMPNN optimize code

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* cleanup hidden files

* cleanedup some stray hidden files

* Added privacy notice (#16) (#19)

Co-authored-by: sanchit-misra <[email protected]>

* Autodock (#21)

* Added Autodock workload

* removed unnecessary contents

* removed unnecessary contents

* Removed old files and updated README and data download script

* updated readme and data download script

* updated readme and data download script

* updated README

* updated README

* updated README

* updated README

* updated README

* updated README

* updated README

* updated README

* updated README

* updated data download script for the test cases

* updated README for proxy build command for docker

---------

Co-authored-by: arunvelliyangiri18 <[email protected]>
Co-authored-by: manasi-t24 <[email protected]>

* Autodock vina (#22)

* Added Autodock workload

* Added Autodock-Vina workload

* removed unnecessary contents

* removed unnecessary contents

* removed unnecessary contents

* removed unnecessary contents

* removed Autodock

* added data download script and removed 5wlo

* Added updated README

* updated README and data download script

* updated README

* updated README

* updated README

* updated README

* updated README

* updated README

* updated README

* updated data download script for the test cases

* updated README

* updated README

* updated README for proxy build command for docker

---------

Co-authored-by: arunvelliyangiri18 <[email protected]>
Co-authored-by: manasi-t24 <[email protected]>

* updated ProteinMPNN and ProtGPT2 and RFdiffusion Dockerfiles (#28)

* ESM: Dockerfiles updated to work independently (#26)

* Add ESM changes

* Add ESM changes

* MoFlow workload added (#18)

* Added privacy notice (#16)

* Add MoFlow updates

* Add MoFlow updates

* Add MoFlow updates

---------

Co-authored-by: sanchit-misra <[email protected]>

* Update README.md (#29)

* Updated ESM README changes (#31)

Corrected LM design command line

* Adding multimer support to v3.0-release branch (#33)

* Adding support for AlphaFold2 Multimer

* single docker files for handling monomer and multimer cases

* changing to the main branch of open-omics-alphafold

* Added privacy notice (#16) (#25)

* removed the commented code

---------

Co-authored-by: sanchit-misra <[email protected]>

* git clone replaced with wget for release code (#34)

* Update README.md to reflect v3.0 additions (#35)

* fixed wget OO downloads urls in fq2bam and dv1 dockers for V3.0 release (#36)

* updated docker files (fq2bam, dv1) with git download

* Update README.md

* Update README.md

* clean up

* clean up dockers

---------

Co-authored-by: vasimuddin.md <[email protected]>
Co-authored-by: vasimuddin.md <[email protected]>

---------

Co-authored-by: sri480673 <[email protected]>
Co-authored-by: arunvelliyangiri18 <[email protected]>
Co-authored-by: Vasimuddin Md <[email protected]>
Co-authored-by: Rahamathullah365 <[email protected]>
Co-authored-by: sanchit-misra <[email protected]>
Co-authored-by: Chirayu Haryan <[email protected]>
Co-authored-by: manasi-t24 <[email protected]>
Co-authored-by: Narendra Chaudhary <[email protected]>
Co-authored-by: vasimuddin.md <[email protected]>
Co-authored-by: vasimuddin.md <[email protected]>
11 people authored Dec 13, 2024
1 parent 9b6d869 commit 3cfa489
Showing 50 changed files with 7,526 additions and 85 deletions.
3 changes: 3 additions & 0 deletions .gitmodules
@@ -22,3 +22,6 @@
[submodule "applications/bcftools"]
path = applications/bcftools
url = https://github.com/samtools/bcftools.git
[submodule "applications/STAR"]
path = applications/STAR
url = https://github.com/alexdobin/STAR.git
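
In an existing clone, the new STAR entry only takes effect once the submodule is initialized. A minimal sketch, assuming a working tree that already contains this `.gitmodules` change; the function name `init_star_submodule` is hypothetical and not part of the repository:

```bash
# Hypothetical helper (not part of the repository): fetch only the STAR
# submodule registered in .gitmodules, leaving other submodules untouched.
# The optional argument is the path to the clone (defaults to the current directory).
init_star_submodule() {
    local repo_dir="${1:-.}"
    git -C "$repo_dir" submodule update --init --recursive -- applications/STAR
}
```

Calling `init_star_submodule /path/to/clone` would populate `applications/STAR` from the URL recorded above.
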
14 changes: 7 additions & 7 deletions README.md
@@ -5,17 +5,17 @@ Intel lab's open sourced data science framework for accelerating digital biology
# Introduction
We are in the epoch of digital biology, which is fueled by the convergence of three revolutions: 1) measurement of biological systems at high resolution, resulting in massive multi-modal, multi-scale, unstructured, distributed data, 2) novel data science (AI and data management) techniques applied to this data, and 3) widespread cloud use enabling massive compute and public data repositories, large collaborative projects and consortia. Digital biology will require computing and data management at unprecedented scale and speed. However, performance alone would not suffice if it significantly compromised the productivity of biologists and data scientists who are at the forefront of this transformation.

With a goal to build a performant, cost effective and productive platform, we are building **Open Omics acceleration framework**: a one-click, containerized, customizable, open-sourced framework for accelerating digital biology research. The framework is being built with a modular design that keeps in mind the different ways the users would want to interact with it. As shown in the following block diagram, it consists of three layers:
* **Pipeline layer**: for users who are looking for one click solution to run standard pipelines. Currently, we support the following pipelines:
With a goal to build a performant, cost-effective and productive platform, we are building **Open Omics acceleration framework**: a one-click, containerized, customizable, open-sourced framework for accelerating digital biology research. It provides tools and pipelines in the fields of genomics, transcriptomics, proteomics, drug molecule search and de novo drug design. The framework is being built with a modular design that keeps in mind the different ways the users would want to interact with it. As shown in the following block diagram, it consists of three layers:
* **Pipeline layer**: for users who are looking for a one-click solution to run standard pipelines. The pipelines can be accessed in the 'pipelines' subfolder, which provides instructions to build and run the Docker images. Currently, we support the following pipelines:
* [**fq2sortedbam**](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/fq2sortedbam): Given gzipped fastq files of an individual, this workflow performs sequence mapping ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2)) and sorting ([SAMtools](https://github.com/samtools/samtools) sort) to output the sorted BAM file.
* [**DeepVariant based germline pipeline for variant calling (fq2vcf)**](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/deepvariant-based-germline-variant-calling-fq2vcf): Given paired end gzipped fastq files of an individual, this workflow performs sequence mapping ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2)), sorting ([SAMtools](https://github.com/samtools/samtools) sort) and variant calling ([Open Omics DeepVariant](https://github.com/IntelLabs/open-omics-deepvariant)) to call the variants in the genome of the individual.
* [**AlphaFold2-based protein folding**](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/alphafold2-based-protein-folding): Given one or more protein sequences, this workflow performs preprocessing (database search and multiple sequence alignment using Open Omics [HMMER](https://github.com/IntelLabs/hmmer) and [HH-suite](https://github.com/IntelLabs/hh-suite)) and structure prediction ([Open Omics AlphaFold2](https://github.com/IntelLabs/open-omics-alphafold)) to output the structure(s) of the protein sequences.
* [**AlphaFold2-based protein folding**](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/alphafold2-based-protein-folding): Given one or more protein sequences, this workflow performs preprocessing (database search and multiple sequence alignment using Open Omics [HMMER](https://github.com/IntelLabs/hmmer) and [HH-suite](https://github.com/IntelLabs/hh-suite)) and structure prediction ([Open Omics AlphaFold2](https://github.com/IntelLabs/open-omics-alphafold)) to output the structure(s) of the protein sequences. It has support for both AlphaFold2 monomer and AlphaFold2 multimer.
* [**Single cell RNASeq analysis**](https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/tree/main/pipelines/single-cell-RNA-seq-analysis): Given a cell by gene matrix, this [scanpy](https://github.com/scverse/scanpy) based workflow performs data preprocessing (filter, linear regression and normalization), dimensionality reduction (PCA), clustering (Louvain/Leiden/kmeans) to cluster the cells into different cell types and visualize those clusters (UMAP/t-SNE).
* **Toolkit (applications) layer**: for users who want to use individual tools or to create their own custom pipelines by combining various tools.
* **Building blocks (lib) layer**: for tool developers, this layer consists of key building blocks -- biology specific and generic AI algorithms and data structures -- that can replace ones used in existing tools to accelerate them or can be used as ingredients to build new efficient tools.
* **Toolkit layer**: for users who want to use individual tools or to create their own custom pipelines by combining various tools. The toolkit layer can be accessed in the 'applications' subfolder. For each tool, we provide instructions to build and run it. Currently, the supported tools cover: genomics (BWA-MEM, minimap2, bcftools, SAMtools, DeepVariant), transcriptomics (STAR aligner), protein folding (AlphaFold2, ESMFold), protein structure and sequence design (RFDiffusion, ProteinMPNN, LM-design, ESM2-inv, ProtGPT2, ESM2 embeddings), molecular docking (AutoDock, AutoDock-Vina), and de novo molecule generation (MoFlow).
* **Building blocks layer**: for tool developers, this layer consists of key building blocks -- biology specific and generic AI algorithms and data structures -- that can replace ones used in existing tools to accelerate them or can be used as ingredients to build new efficient tools. This layer can be accessed in the 'lib' subfolder.

<p align="center">
<img src="https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/blob/main/images/Open-Omics-Acceleration-Framework-v2.0.JPG" height="300"/a></br>
<img src="https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/blob/main/images/Open-Omics-Acceleration-Framework-v3.0.jpg" height="300"/a></br>
</p>

With a goal of providing a one-stop platform, this framework brings our following repositories for digital biology under one umbrella:
@@ -37,7 +37,7 @@ In addition, we also use several existing AI libraries: oneDNN, oneDAL, oneCCL,
# Getting Started
```sh
# Download release
wget https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/releases/download/2.1/Source_code_with_submodules.tar.gz
wget https://github.com/IntelLabs/Open-Omics-Acceleration-Framework/releases/download/3.0/Source_code_with_submodules.tar.gz
tar -xzf Source_code_with_submodules.tar.gz

# Clone master
31 changes: 31 additions & 0 deletions applications/AutoDock-Vina/Dockerfile
@@ -0,0 +1,31 @@
FROM condaforge/miniforge3:4.10.2-0
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    libboost-all-dev \
    swig \
    vim \
    gcc-8 \
    g++-8 \
    numactl \
    time && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
ENV CC=gcc-8
ENV CXX=g++-8
WORKDIR /opt
RUN git clone https://github.com/ccsb-scripps/AutoDock-Vina.git
WORKDIR /opt/AutoDock-Vina
RUN git checkout v1.2.2
WORKDIR /opt/AutoDock-Vina/build/linux/release
RUN make -j$(nproc)
ENV SERVICE_NAME="autodock-vina-service"
RUN groupadd --gid 1001 $SERVICE_NAME && \
    useradd -m -g $SERVICE_NAME --shell /bin/false --uid 1001 $SERVICE_NAME
RUN chown -R $SERVICE_NAME:$SERVICE_NAME /opt
USER $SERVICE_NAME
ENV PATH="/opt/AutoDock-Vina/build/linux/release:$PATH"
WORKDIR /input
HEALTHCHECK NONE
CMD ["vina","--help"]

79 changes: 79 additions & 0 deletions applications/AutoDock-Vina/README.md
@@ -0,0 +1,79 @@
## Open-Omics-Autodock-Vina
Open-Omics-Autodock-Vina is a fast, efficient molecular docking software used to predict ligand-protein binding poses and affinities. It features a refined scoring function, parallel execution on multicore CPUs and user-friendly configuration.

## Docker Setup Instructions


### 1. Build the Docker Image
To build the Docker image with the tag `docker_vina`, use the following commands based on your machine's proxy requirements:
* For a machine without a proxy:
```bash
docker build -t docker_vina .
```
* For a machine with a proxy:
```bash
docker build --build-arg http_proxy=<http_proxy> --build-arg https_proxy=<https_proxy> --build-arg no_proxy=<no_proxy_ip> -t docker_vina .
```
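
The two build variants above differ only in their `--build-arg` flags. A minimal sketch of a wrapper that emits those flags only for proxy variables that are actually set, so the same command can be used on both kinds of machine. `proxy_build_args` is a hypothetical helper, not part of the repository, and it assumes the proxy values contain no whitespace:

```bash
# Hypothetical helper: print a --build-arg flag for each proxy variable that
# is set and non-empty in the current environment. Prints nothing otherwise.
proxy_build_args() {
    local name value
    for name in http_proxy https_proxy no_proxy; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            printf '%s %s=%s ' --build-arg "$name" "$value"
        fi
    done
}
```

With this in place, `docker build $(proxy_build_args) -t docker_vina .` reduces to the plain build command when no proxy variables are set. The command substitution is left unquoted on purpose so that each flag becomes a separate argument.
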


### 2. Choose and Download Protein Complex Data
Select any protein complex from the [dataset of **140** protein-ligand complexes](https://zenodo.org/records/4031961), which can be downloaded as a [single archive](https://zenodo.org/records/4031961/files/data.zip?download=1). This guide uses the **5wlo** protein as an example.

1) Run the commands below to make the data download script executable, download the complete dataset, and extract the data for `5wlo`:

```bash
chmod +x data_download_script.sh
bash data_download_script.sh 5wlo
```
**Note: You can replace 5wlo with any other complex name from the complete dataset, which is available in the `data_original/data` directory.**

2) Create an output directory to store results specific to `5wlo`:
```bash
mkdir -p 5wlo_output
```

3) Set the environment variables for the `5wlo` protein as follows:
```bash
export INPUT_VINA=$PWD/5wlo
export OUTPUT_VINA=$PWD/5wlo_output
```

4) Grant write permission on the output folder so that the Docker container can write to it:
```bash
sudo chmod -R a+w $OUTPUT_VINA
```
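
Before starting the container, it can help to confirm that the input directory really contains the files the docking command expects. A minimal sketch, assuming the file names `protein.pdbqt` and `rand-1.pdbqt` used by the run command in the next section; `check_vina_inputs` is a hypothetical helper, not part of the repository:

```bash
# Hypothetical pre-flight check: confirm that the input directory holds the
# receptor and ligand files that the docking command reads from /input.
# Returns 0 when both files exist, 1 otherwise.
check_vina_inputs() {
    local dir="$1"
    local missing=0
    local f
    for f in protein.pdbqt rand-1.pdbqt; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_vina_inputs "$INPUT_VINA" && echo "inputs look complete"` reports any missing file on standard error before Docker is involved.
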

### 3. Run the Docker Container
Verify that the Docker image was built successfully by listing Docker images:
```bash
docker images | grep docker_vina
```
If the image is listed, run AutoDock Vina with the following command:
```bash
docker run -it -v $INPUT_VINA:/input -v $OUTPUT_VINA:/output docker_vina:latest vina --receptor protein.pdbqt --ligand rand-1.pdbqt --out /output/rand-1_out.pdbqt --center_x 16.459 --center_y -19.946 --center_z -5.850 --size_x 18 --size_y 18 --size_z 18 --seed 1234 --exhaustiveness 64
```
This command will process your receptor and ligand files and place the results in the specified output directory.

### 4. Expected Output
After running the above command, you should find the output file (`rand-1_out.pdbqt`) in the output directory, such as `5wlo_output` for this example.
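
The `--center_*` and `--size_*` options in the run command above define the docking search box: a box with the given edge lengths (in Angstroms) centered on the given point. As a quick sanity check that the box covers the binding site, the bounds along each axis can be computed as center ± size/2. A minimal sketch; `box_bounds` is a hypothetical helper name, and the only assumption is that `awk` is available:

```bash
# Print the lower and upper bound of the search box along one axis,
# given its center and its edge length (both in Angstroms).
box_bounds() {
    awk -v c="$1" -v s="$2" 'BEGIN { printf "%.3f %.3f\n", c - s / 2, c + s / 2 }'
}

box_bounds 16.459 18    # x axis: 7.459 25.459
box_bounds -19.946 18   # y axis: -28.946 -10.946
box_bounds -5.850 18    # z axis: -14.850 3.150
```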

---
The original README content of AutoDock-Vina follows:

## AutoDock Vina: Docking and virtual screening program

**AutoDock Vina** is one of the **fastest** and **most widely used** **open-source** docking engines. It is a turnkey computational docking program that is based on a simple scoring function and rapid gradient-optimization conformational search. It was originally designed and implemented by Dr. Oleg Trott in the Molecular Graphics Lab, and it is now being maintained and developed by the Forli Lab at The Scripps Research Institute.

* AutoDock4.2 and Vina scoring functions
* Support of simultaneous docking of multiple ligands and batch mode for virtual screening
* Support of macrocycle molecules
* Hydrated docking protocol
* Can write and load external AutoDock maps
* Python bindings for Python 3

## Documentation

The installation instructions, documentation and tutorials can be found on [readthedocs.org](https://autodock-vina.readthedocs.io/en/latest/).

## Citations
* [J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli. (2021). AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings. Journal of Chemical Information and Modeling.](https://pubs.acs.org/doi/10.1021/acs.jcim.1c00203)
* [O. Trott and A. J. Olson. (2010). AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of computational chemistry, 31(2), 455-461.](https://onlinelibrary.wiley.com/doi/10.1002/jcc.21334)
27 changes: 27 additions & 0 deletions applications/AutoDock-Vina/data_download_script.sh
@@ -0,0 +1,27 @@
url="https://zenodo.org/records/4031961/files/data.zip?download=1"
download_dir="./data_original"
target_folder="$1"
if [ ! -d "$download_dir/data" ]; then
    echo "Downloading data.zip..."
    mkdir -p "$download_dir"
    wget -O "$download_dir/data.zip" "$url"

    echo "Unzipping data.zip..."
    unzip "$download_dir/data.zip" -d "$download_dir"
    rm -f "$download_dir/data.zip"

    echo "Data downloaded and extracted to $download_dir/data"
else
    echo "Data already exists in $download_dir/data. Skipping download and extraction."
fi
if [ -d "$target_folder" ]; then
    echo "The folder '$target_folder' already exists in the current directory. Skipping copy."
else
    if [ -d "$download_dir/data/$target_folder" ]; then
        cp -r "$download_dir/data/$target_folder" ./
        echo "$target_folder folder successfully copied to the current directory."
    else
        echo "$target_folder folder not found inside '$download_dir/data'." >&2; exit 1  # stop before the final success message
    fi
fi
echo "'$target_folder' folder is now available in the current directory."
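
The script reads the complex name from `$1` without checking that one was supplied. With no argument, `$target_folder` is empty, the test on `$download_dir/data/$target_folder` matches the dataset directory itself, and `cp -r` copies the whole dataset into the current directory. A minimal sketch of a guard that could run before the download step; `require_complex_name` is a hypothetical function, not part of the script:

```bash
# Hypothetical guard: fail early, with a usage hint, when no complex name
# is supplied on the command line.
require_complex_name() {
    if [ -z "${1:-}" ]; then
        echo "Usage: bash data_download_script.sh <complex_name>" >&2
        return 1
    fi
}
```

Placed at the top of the script, it would be called as `require_complex_name "$1" || exit 1`.
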
36 changes: 36 additions & 0 deletions applications/Autodock/Dockerfile
@@ -0,0 +1,36 @@
FROM condaforge/miniforge3:4.10.2-0
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
    vim \
    git \
    build-essential \
    ocl-icd-opencl-dev \
    clinfo && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
RUN conda install -c conda-forge \
    python=3.10 \
    requests=2.28.2 \
    mkl=2023.1 \
    dpcpp_linux-64=2023.1 \
    dpcpp-cpp-rt=2023.1 \
    mkl-devel=2023.1 && \
    conda clean --all -f -y
ENV LD_LIBRARY_PATH="/opt/conda/lib:${LD_LIBRARY_PATH}"
WORKDIR /opt
ENV SERVICE_NAME="autodock-service"
RUN groupadd --gid 1001 $SERVICE_NAME && \
    useradd -m -g $SERVICE_NAME --shell /bin/false --uid 1001 $SERVICE_NAME && \
    mkdir -p /opt/AutoDock && \
    chown -R $SERVICE_NAME:$SERVICE_NAME /opt/AutoDock
USER $SERVICE_NAME
WORKDIR /opt/AutoDock
RUN git clone https://github.com/emascarenhas/AutoDock-GPU.git . && \
    git checkout v1.4
RUN make DEVICE=CPU NUMWI=64 && \
    rm -rf .git build_temp
ENV PATH="/opt/AutoDock/bin:${PATH}"
HEALTHCHECK NONE
WORKDIR /input
CMD ["autodock_cpu_64wi","--help"]
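
The executable named in `CMD` follows from the build settings: `make DEVICE=CPU NUMWI=64` produces `autodock_cpu_64wi` under `/opt/AutoDock/bin`. A minimal sketch of that mapping, useful if the image is rebuilt with a different work-group size. The naming pattern is inferred from this Dockerfile alone, so other `DEVICE` or `NUMWI` values are an assumption, and `autodock_binary_name` is a hypothetical helper:

```bash
# Derive the expected executable name from the make settings.
# Pattern inferred from this Dockerfile: autodock_<device, lowercase>_<numwi>wi
autodock_binary_name() {
    local device numwi
    device=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    numwi="$2"
    printf 'autodock_%s_%swi\n' "$device" "$numwi"
}

autodock_binary_name CPU 64   # prints autodock_cpu_64wi
```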
