Skip to content

Highly performant data storage in C++ for importing, querying and transforming variant data with C/C++/Java/Spark bindings. Used in gatk4.

License

Notifications You must be signed in to change notification settings

eneskuluk/GenomicsDB

 
 

Repository files navigation

License: MIT Maven Central

Master Develop
actions actions
codecov codecov

GenomicsDB, originally from Intel Health and Lifesciences, is built on top of a fork of htslib and a tile-based array storage system for importing, querying and transforming variant data. Variant data is sparse by nature (sparse relative to the whole genome) and using sparse array data stores is a perfect fit for storing such data. GenomicsDB is a highly performant scalable data storage written in C++ for importing, querying and transforming genomic variant data.

  • Supported platforms : Linux and MacOS.
  • Supported filesystems : POSIX, HDFS, EMRFS(S3), GCS and Azure Blob.

Included are

  • JVM/Spark wrappers that allow for streaming VariantContext buffers to/from the C++ layer among other functions. GenomicsDB jars with native libraries and only zlib dependencies are regularly published on Maven Central.
  • Native tools for incremental ingestion of variants in the form of VCF/BCF/CSV into GenomicsDB for performance.
  • MPI and Spark support for parallel querying of GenomicsDB.

GenomicsDB is packaged into gatk4 and benefits qualitatively from a large user base.

The GenomicsDB documentation for users is hosted as a Github wiki: https://github.com/GenomicsDB/GenomicsDB/wiki

External Contributions

GenomicsDB is open source and all participation is welcome. GenomicsDB is released under the MIT License and all external contributors are expected to grant an MIT License for their contributions.

Checklist before creating Pull Request

Please ensure that the code is well documented in Javadoc style for Java/Scala. For C/C++ code, roughly adhere to Google C++ Style for consistency/readabilty.

Use spaces instead of tabs.
Use 2 spaces for indenting.
Add brackets even for one line blocks e.g. 
        if (x>0)
           do_foo();
 should ideally be 
       if (x>0) {
         do_foo();
       }
Pad header e.g.
        if(x>0) should be if (x>0)
        while(x>0) should be while (x>0)
One half indent for class modifiers.

About

Highly performant data storage in C++ for importing, querying and transforming variant data with C/C++/Java/Spark bindings. Used in gatk4.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • C++ 68.6%
  • Java 20.2%
  • Python 5.0%
  • CMake 2.6%
  • Shell 2.6%
  • Scala 0.6%
  • Other 0.4%