Releases: IntelLabs/Open-Omics-Acceleration-Framework
Open-Omics-Acceleration-Framework-3.0
This v3.0 release expands the footprint of Open Omics to several drug discovery tasks, including protein design, molecular docking, and de novo drug molecule design, while also adding more tools for transcriptomics and protein folding. In total, it adds nine GenAI-based methods for various drug discovery tasks, one aligner for RNA-seq reads, and two molecular docking tools. To ensure smooth building and deployment, we provide Dockerfiles for all the workloads that rely on multiple packages. More specifically, this release adds the following new tools & pipelines:
- Transcriptomics:
- STAR aligner v2.7.11b: STAR is a popular RNA-seq aligner. It takes paired fastq(.gz) file(s) with RNA reads as input, aligns those reads to the reference genome, and outputs the alignments in a SAM file.
- Protein folding – containerized versions of:
- AlphaFold2 multimer v2.3.2: takes as input the sequences of one or more protein complexes in fasta files and outputs their predicted structures in pdb files and auxiliary outputs in pkl files.
- ESMFold v1.0.3: takes one or more proteins in fasta files as input and outputs their predicted structures in pdb files.
- Protein Design – containerized versions of:
- RFDiffusion v1.1.0: this diffusion-based computational tool generates de novo protein structures. It takes as input protein structure specifications in a pdb file and outputs the generated structure in a pdb file.
- ESM-2 embedding v1.0.3: takes one or more protein sequences in fasta files and outputs their generated embeddings in pt files for downstream analysis.
- LM-design v1.0.3: Language models trained on sequences of natural proteins to generate de novo proteins. This includes two tasks: (i) Fixed-backbone design: Generates protein sequences for a given structure provided in a pdb file, and (ii) Free-generation: Takes the sequence length as input and generates a sequence of that length.
- ESM2-inv v1.0.3: This inverse folding model is designed to predict protein sequences from a protein backbone structure. This includes two tasks: (i) Sequence Design: Generates protein sequences for a given structure. The input can be either a pdb file or a cif file, and the output is saved as a fasta file. (ii) Sequence Scoring: Evaluates and scores sequences for compatibility with a given structure. The input is a protein structure in a pdb file and a sequence in a fasta file, and the output is saved as a csv file containing the scores.
- ProtGPT2: ProtGPT2 is a popular deep language model that generates de novo protein sequences. The code and the model are hosted on HuggingFace. We used commit #4425556 as a base for our optimizations. It generates the user-provided number of sequences with the specified sequence length as the output in a fasta file.
- ProteinMPNN v1.0.1: ProteinMPNN offers multiple functionalities, e.g. (i) generating amino acid sequences for a given protein backbone structure, and (ii) designing de novo proteins and optimizing existing ones. It takes the protein structure in pdb file format and generates the corresponding amino acid sequence(s) in fasta file format.
- Molecular Docking – containerized versions of:
- AutoDock v1.4: AutoDock is a tool used for predicting how ligand molecules bind to a protein receptor of known 3D structure. It takes a protein map in an fld file and a ligand in a pdbqt file as input and generates a dlg output file that contains the final docked pose and its energy value.
- AutoDock-Vina v1.2.2: AutoDock-Vina is an improved version of AutoDock that doesn’t require protein map fld files as input. It takes a protein pdbqt file, a ligand pdbqt file, and the dimensions of the box where the docking is to be performed as input and generates a docking result pdbqt file that contains multiple ranked docked poses.
- De novo drug molecule search – containerized version of:
- MoFlow v.1.0: MoFlow is a flow-based graph generative model designed to generate chemically valid molecular graphs efficiently and accurately. It supports the following tasks: molecular graph generation and reconstruction, visualization of the continuous latent space, property optimization, and constrained property optimization. It takes as input a set of parameters and gives the molecular graph as output.
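The AutoDock-Vina entry above mentions that the docking box dimensions are part of the input. As a rough illustration only, the sketch below derives a box from the atom coordinates of a reference ligand in a pdbqt file and formats it with Vina's `center_*` / `size_*` configuration keys. Deriving the box from a ligand, the `padding` default, and the function names `docking_box` and `vina_box_config` are assumptions made for this example; they are not part of the release.

```python
from typing import Iterable, Tuple

Vec3 = Tuple[float, float, float]


def docking_box(pdbqt_lines: Iterable[str], padding: float = 4.0) -> Tuple[Vec3, Vec3]:
    """Return (center, size) of an axis-aligned box around all atoms, in angstroms.

    Hypothetical helper: `padding` is added on every side of the bounding box.
    """
    coords = []
    for line in pdbqt_lines:
        if line.startswith(("ATOM", "HETATM")):
            # PDB/PDBQT fixed columns (1-based): x = 31-38, y = 39-46, z = 47-54
            coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    if not coords:
        raise ValueError("no ATOM/HETATM records found")
    mins = [min(c[i] for c in coords) for i in range(3)]
    maxs = [max(c[i] for c in coords) for i in range(3)]
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    size = tuple((hi - lo) + 2 * padding for lo, hi in zip(mins, maxs))
    return center, size


def vina_box_config(center: Vec3, size: Vec3) -> str:
    """Format the box as lines for a Vina configuration file."""
    axes = ("x", "y", "z")
    lines = [f"center_{a} = {c:.3f}" for a, c in zip(axes, center)]
    lines += [f"size_{a} = {s:.3f}" for a, s in zip(axes, size)]
    return "\n".join(lines)
```

A typical use would be `center, size = docking_box(open("ligand.pdbqt"))` followed by writing `vina_box_config(center, size)` to a config file, where `ligand.pdbqt` is a placeholder file name.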
Open-Omics-Acceleration-Framework-2.2
The v2.2 release adds and updates docker files for the major pipelines:
- Adds a docker file for the fq2sortedbam pipeline
- Adds support for minimap2 and extends fq2sortedbam pipeline for long reads analysis
- Provides standalone dockers for the pre-processing and inference stages of the AlphaFold2 pipeline
- Provides standalone dockers for fq2sortedbam and Deepvariant-based inference stages of the DeepVariant-based germline fq2vcf pipeline
Open-Omics-Acceleration-Framework-2.1
The v2.1 release updates and fixes the AlphaFold2 pipeline and the DeepVariant-based germline fq2vcf pipeline.
- AlphaFold2-based protein folding pipeline:
- Enabled inference using different models
- Bug fixes for running Model 3, 4, & 5.
- Removed unnecessary paths from run script.
- Enabled use of contiguous tensor inside TPP PyTorch extension
- DeepVariant-based germline variant calling (fq2vcf) pipeline:
- Enabled support for gzipped reference sequence file as input
- Enabled support for reads and reference sequence data files to be in different folders
- Enabled cleanup of all intermediate data generated during the pipeline's run
- Provided an option to keep the intermediate SAM files out of bwa-mem2
- Fixed the messaging to the user in case of a failed run
- Updated README with precise instructions to run on various types of compute environments
- In AWS parallel cluster environment: enabled index creation on a compute node instead of the master node
- Added details in README about memory and disk requirements for a run using a Human WGS dataset
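The gzipped-reference support listed above can be pictured with a small sketch that opens a FASTA file transparently, whether or not it is gzip-compressed, by checking the gzip magic bytes. This is only an illustration of the idea, not the pipeline's actual code; `open_text` and `fasta_names` are hypothetical helper names.

```python
import gzip
from typing import IO, List


def open_text(path: str) -> IO[str]:
    """Open a plain or gzip-compressed file in text mode.

    Detection uses the two-byte gzip magic number rather than the file
    extension, so a gzipped file without a .gz suffix is still handled.
    """
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def fasta_names(path: str) -> List[str]:
    """Return the sequence names (first word of each header) in a FASTA file."""
    names = []
    with open_text(path) as fh:
        for line in fh:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:  # skip empty headers
                    names.append(fields[0])
    return names
```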
Open-Omics-Acceleration-Framework-2.0
This v2.0 release adds accelerated versions of the following new pipelines and their corresponding tools.
- Containerized AlphaFold2-based pipeline for protein folding that takes protein sequences as input and outputs predicted protein structures. It consists of:
- Open-Omics-AlphaFold: a PyTorch implementation of AlphaFold2 (v.2.2.2) monomer accelerated using 4th generation Intel® Xeon® CPU.
- Hmmer and HH-suite accelerated through 256- and 512-bit SIMD instructions (AVX2, AVX512) available on modern x86 CPUs.
- Efficient load balanced folding of multiple proteins in parallel on a dual-socket CPU.
- A docker file for seamless installation and execution.
- Can perform folding on proteins of length up to ~9k residues on a 1 TB memory machine.
- To the best of our knowledge, faster than any prior CPU/GPU implementation for folding a set of proteins.
- Containerized DeepVariant-based variant calling (fq2vcf) pipeline that takes paired fastq.gz files as input and outputs a vcf file. It achieves efficient performance across multiple CPU nodes and consists of:
- BWA-MEM2: an architecture-efficient version of BWA-MEM that is 1.8-3.0 times faster.
- SAMtools sort utility for sorting SAM/BAM files.
- Open-Omics-DeepVariant: A new version of DeepVariant v1.5.0 accelerated using 4th generation Intel® Xeon® CPU.
- A distributed memory framework that achieves excellent scaling for this pipeline across several CPU nodes.
- To the best of our knowledge, faster than any prior CPU/GPU implementation for 30x WGS dataset.
- A fq2sortedbam pipeline accelerated using modern CPUs that takes read fastq files as input and outputs a sorted BAM file. It consists of:
- BWA-MEM2: an architecture-efficient version of BWA-MEM that is 1.8-3.0 times faster.
- SAMtools sort utility for sorting SAM/BAM files.
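The two stages of the fq2sortedbam pipeline above can be sketched on a single node as a pipe from `bwa-mem2 mem` into `samtools sort`. This is a minimal sketch under stated assumptions: it is not the framework's own driver (which also supports multiple nodes), the helper names `build_commands` and `run_fq2sortedbam` and the default thread count are hypothetical, and the reference is assumed to have been indexed beforehand with `bwa-mem2 index`.

```python
import subprocess
from typing import List, Sequence, Tuple


def build_commands(
    reference: str, reads: Sequence[str], out_bam: str, threads: int = 8
) -> Tuple[List[str], List[str]]:
    """Build the aligner and sorter command lines (one or two fastq files)."""
    if not 1 <= len(reads) <= 2:
        raise ValueError("expected one (single-end) or two (paired-end) fastq files")
    align = ["bwa-mem2", "mem", "-t", str(threads), reference, *reads]
    # "-" tells samtools sort to read the alignments from standard input.
    sort = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    return align, sort


def run_fq2sortedbam(
    reference: str, reads: Sequence[str], out_bam: str, threads: int = 8
) -> None:
    """Stream SAM output from the aligner directly into the sorter."""
    align, sort = build_commands(reference, reads, out_bam, threads)
    aligner = subprocess.Popen(align, stdout=subprocess.PIPE)
    sorter = subprocess.Popen(sort, stdin=aligner.stdout)
    aligner.stdout.close()  # let the aligner see a broken pipe if the sorter exits
    sort_rc = sorter.wait()
    align_rc = aligner.wait()
    if align_rc != 0 or sort_rc != 0:
        raise RuntimeError(f"pipeline failed (bwa-mem2={align_rc}, samtools={sort_rc})")
```

Piping the aligner into the sorter avoids writing the intermediate SAM file to disk, which is why the two commands are started together rather than run one after the other.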
v1.0
First version of Open Omics Acceleration Framework containing the accelerated versions of the following tools and pipelines of digital biology.
Tools:
- BWA-MEM2: accelerated version of BWA-MEM
- MM2-Fast: accelerated version of minimap2
- UMAP_fast: accelerated version of UMAP algorithm used for visualization
- Accelerated version of AtacWorks that performs denoising of ATAC-Seq data
Pipelines:
- A single-cell RNA-seq analysis pipeline for clustering cells starting with a cell-by-gene matrix.