Modelling the WormBase ACeDB database in datomic.

WormBase/pseudoace

Latest commit

author
Matt Russell
Jun 14, 2016
98bc611 · Jun 14, 2016
pseudoace

Provides a Clojure library for use by the WormBase project.

Features include:

  • Model-driven import of ACeDB data into a Datomic database.

    • (Dynamic generation of an isomorphic Datomic schema from an annotated ACeDB models file)
  • Conversion of ACeDB database dump files into a Datomic database.

  • Routines for parsing and dumping ACeDB "dump files".

  • Utility functions and macros for querying WormBase data.

  • A command line interface for utilities described above (via lein run)

Installation

  • Java 1.8 (the official Oracle version is preferred)

  • leiningen

    • You will also need to specify which flavour and version of Datomic you want to use in your lein peer project configuration.

      Example:

      (defproject myproject "0.1-SNAPSHOT"
         :dependencies [[com.datomic/datomic-free "0.9.5359"
                         :exclusions [joda-time]]
                        [wormbase/pseudoace "0.4.4"]])
  • Install Datomic

Development

Follow the GitFlow mechanism for branching and committing changes:

  • Feature branches should be derived from the develop branch, e.g. git checkout -b feature-x develop

Coding style

This project attempts to adhere to the Clojure coding-style conventions.

Testing & code QA

Run all tests regularly, but in particular:

  • before issuing a new pull request

  • after checking out a feature-branch

alias run-tests="lein with-profile dev,test do eastwood, test"
run-tests

Other useful leiningen plugins for development include:

kibit

Recommends idiomatic source code changes.

There is editor support in Emacs, e.g. M-x kibit-current-file

Command line examples:

# whole project
lein with-profile dev kibit
# single file
lein with-profile dev kibit src/pseudoace/core.clj

bikeshed

Reports on subjectively bad code. This tool checks for:

  1. "files ending in blank lines"

  2. "redefined var roots in source directories"

  3. "whether you keep up with your docstrings"

  4. arguments colliding with clojure.core functions

Of the above, only 1, 2 and 3 are generally useful to fix, since fixing 4 requires creative (short) naming that may not be intuitive for the reader. Use your discretion when choosing to "fix" any "violations" reported in category 4.

Releases

Initial setup

Configure leiningen credentials for clojars.

Test your setup by running:

# Ensure you are using `gpg2` and that the `gpg-agent` is running.
# Here, gpg is a symbolic link to gpg2
gpg --quiet --batch --decrypt ~/.lein/credentials.clj.gpg

The output should look like (credentials elided):

;; my.datomic.com and clojars credentials
{#"my\.datomic\.com" {:username ...
                      :password ...}
 #"clojars" {:username ...
             :password ...}}

Releases

This process re-uses the leiningen deployment tools:

  • Checkout the develop branch if not already checked-out.

    • Update changes entries in the CHANGES.md file

    • Replace "un-released" in the latest version entry with the current date.

    • Change the version from MAJOR.MINOR.PATCH-SNAPSHOT to MAJOR.MINOR.PATCH in project.clj.

    • Commit and push all changes.

  • Checkout the master branch.

    • Merge the develop branch into master (via a GitHub pull request or directly using git)

    • Run:

      lein deploy

  • Checkout the develop branch.

    • Merge the master branch back into develop.

    • Change the version from MAJOR.MINOR.PATCH to MAJOR.MINOR.PATCH-SNAPSHOT in project.clj.

    • Update CHANGES.md with the next version number and a "back to development" stanza, e.g:

    ## 0.3.2 - (unreleased)
      - Nothing changed yet.

    Commit and push these changes, typically with the message:

    "Back to development"
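
The version changes in the steps above can be sketched as two small shell helpers. Both function names are hypothetical and not part of pseudoace, and bumping the PATCH component for the next development version is an assumption; adjust it if the next release is a MINOR or MAJOR one.

```shell
# Hypothetical helpers (not part of pseudoace) for the version edits
# made to project.clj during a release.

# Strip a trailing -SNAPSHOT suffix: 0.3.2-SNAPSHOT -> 0.3.2
release_version() {
    printf '%s\n' "${1%-SNAPSHOT}"
}

# Assumption: the next development version bumps PATCH by one.
# 0.3.1 -> 0.3.2-SNAPSHOT
next_snapshot_version() {
    major_minor="${1%.*}"   # everything before the last dot
    patch="${1##*.}"        # everything after the last dot
    printf '%s.%s-SNAPSHOT\n' "$major_minor" "$((patch + 1))"
}
```

These only compute the version strings; editing project.clj and CHANGES.md is still done by hand as described.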
    

As a standalone jar file for running the import peer on a server

# GIT_RELEASE_TAG should be the annotated git release tag, e.g:
#   GIT_RELEASE_TAG="0.3.2"
#
# If you want to use a local git tag, ensure it matches the version in
# project.clj, e.g:
#  GIT_RELEASE_TAG="0.3.2-SNAPSHOT"
#
# LEIN_PROFILE can be any named lein profile (or multiple, delimited by commas),
# examples:
#   LEIN_PROFILE="aws"
#   LEIN_PROFILE="mysql"
#   LEIN_PROFILE="postgresql"
#   LEIN_PROFILE="dev"
git checkout "${GIT_RELEASE_TAG}"
./scripts/bundle-release.sh $GIT_RELEASE_TAG $LEIN_PROFILE

An archive named pseudoace-$GIT_RELEASE_TAG.tar.gz will be created in the release-archives directory.
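
Before bundling, it may help to confirm that the tag matches the version in project.clj, as the comments above require. The following is a minimal sketch: the function names are hypothetical, and it assumes that the project name and version string sit on the same line as defproject.

```shell
# Hypothetical pre-flight check (not part of pseudoace).

# Print the version string from the defproject line of a project.clj.
# Assumption: name and version are on the same line as "(defproject".
project_version() {
    sed -n 's/^(defproject[[:space:]][^"]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Succeed only if the given tag equals the version in the given file.
tag_matches_project() {
    [ "$(project_version "$2")" = "$1" ]
}

# Example use before running the bundle script:
#   tag_matches_project "$GIT_RELEASE_TAG" project.clj \
#       || echo "tag and project.clj version differ" >&2
```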

The archive contains two artefacts:

tar tvf pseudoace-$GIT_RELEASE_TAG.tar.gz
./pseudoace-$GIT_RELEASE_TAG.jar
./sort-edn-log.sh

To comply with the Datomic license, ensure that this tar file, and specifically the jar file it contains, is never distributed to a public server for download, as this would violate the terms of any proprietary Cognitect Datomic license.

Usage

Development

A command line utility has been developed for ease of use:

URL_OF_TRANSACTOR="datomic:dev://localhost:4334/*"

lein run --url $URL_OF_TRANSACTOR <command>

--url is a required option for most sub-commands. It should be of the form:

datomic:<storage-backend-alias>://<hostname>:<port>/<db-name>
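
As a minimal sketch, a URL of this form can be assembled from its four parts in the shell. The helper name datomic_url is hypothetical and not part of pseudoace; the example values are the ones used for URL_OF_TRANSACTOR above.

```shell
# Hypothetical helper (not part of pseudoace): build a Datomic URL from
# storage backend alias, hostname, port and database name.
datomic_url() {
    printf 'datomic:%s://%s:%s/%s\n' "$1" "$2" "$3" "$4"
}

# Quote the '*' so the shell does not expand it as a glob.
URL_OF_TRANSACTOR="$(datomic_url dev localhost 4334 '*')"
echo "$URL_OF_TRANSACTOR"   # prints datomic:dev://localhost:4334/*
```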

Alternatively, for extra speed, one can use the Clojure routines directly from a repl session:

# start the repl (Read Eval Print Loop)
lein repl

Example of invoking a sub-command:

(list-databases {:url (System/getenv "URL_OF_TRANSACTOR")})

Staging/Production

Run pseudoace with the same arguments as you would when using lein run:

java -jar pseudoace-$GIT_RELEASE_TAG.jar -v

Import process

Prepare import

Create the database and parse .ace dump-files into EDN.

Example:

java -jar pseudoace-$GIT_RELEASE_TAG.jar \
     --url $DATOMIC_URL \
     --acedump-dir ACEDUMP_DIR \
     --log-dir LOG_DIR \
     -v prepare-import

The prepare-import sub-command:

  • Creates a new database at the specified --url
  • Converts .ace dump-files located in --acedump-dir into pseudo EDN files located in --log-dir.
  • Creates the database schema from the annotated ACeDB models file specified by --model.
  • Optionally dumps the newly created database schema to the file specified by --schema-filename.

Sort the generated log files

The format of the generated files is:

<ace-db-style_timestamp>

The EDN data must be sorted by timestamp in order to preserve the time invariant of Datomic:

find $LOG_DIR \
    -type f \
    -name "*.edn.gz" \
    -exec ./sort-edn-log.sh {} +
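
After sorting, a quick check that each log really is in timestamp order can catch problems before the long import step. This is only a sketch and not the contents of sort-edn-log.sh: the function name is hypothetical, and it assumes that every line starts with the timestamp as its first whitespace-delimited field and that these timestamps order correctly as plain byte strings.

```shell
# Hypothetical check (not part of pseudoace): succeed if the lines of a
# gzipped log are ordered by their first whitespace-delimited field.
# Assumption: that field is the timestamp and sorts correctly bytewise.
is_sorted_log() {
    # -c checks order without sorting; -s compares on the key only,
    # so lines sharing a timestamp may appear in any order.
    gzip -dc "$1" | LC_ALL=C sort -s -c -k1,1 2>/dev/null
}

# Example use:
#   for f in "$LOG_DIR"/*.edn.gz; do
#       is_sorted_log "$f" || echo "not sorted: $f" >&2
#   done
```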

Import the sorted logs into the database

Transacts the EDN sorted by timestamp in --log-dir to the database specified with --url:

java -jar pseudoace-$GIT_RELEASE_TAG.jar \
     --url $DATOMIC_URL \
     --log-dir LOG_DIR \
     -v import-logs

Using a full dump of a recent WormBase release, you can expect the import process to take roughly 8-12 hours, depending on the platform you run it on.