This repository contains data, code, and other supplementary documents demonstrating the application of the Provenance Backbone concept and its underlying provenance model to a use case. The repository consists of two parts:
- A textual description of the use case and application of the proposed provenance model to document the use case (“Manuscript_supplementary” folder),
- Implementation of provenance generation for computational parts of the use case, which is a machine learning (ML) workflow used for cancer detection research. The repository contains a portion of the ML workflow necessary to run the example. The implementation is described in this readme.
The ML workflow is implemented as a set of Python scripts and consists of units called Experiments. An Experiment defines the logic of a job to be run using a configuration file. A configuration file is a nested JSON file describing:
- Definitions - defining which components (Data, Generator, Model, Callbacks, etc.) are to be used in the experiment, and
- Configurations - defining parameters of the components.
Sample configuration files can be found in the rationai/config/ directory. The workflow can be run using the provided Makefiles.
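As a rough illustration of the two-part layout described above, the sketch below builds a nested configuration in Python and serializes it to JSON. All key names, component names, and parameter values are hypothetical placeholders, not the schema used by the files in rationai/config/.

```python
import json

# Hypothetical configuration: every key and value below is an
# illustrative placeholder, not the schema used by the repository.
config = {
    "definitions": {          # which components the experiment uses
        "data": "ExampleDataSource",
        "generator": "ExampleGenerator",
        "model": "ExampleModel",
        "callbacks": ["ExampleCheckpoint"],
    },
    "configurations": {       # parameters of those components
        "generator": {"batch_size": 32},
        "model": {"learning_rate": 0.001},
    },
}


def component_params(cfg, name):
    """Return the parameters configured for a defined component."""
    if name not in cfg["definitions"]:
        raise KeyError(f"component {name!r} is not defined")
    # A component without explicit parameters gets an empty dict.
    return cfg["configurations"].get(name, {})


config_json = json.dumps(config, indent=2)
print(config_json)
print(component_params(config, "generator"))
```

Keeping the definitions separate from the parameters means the same component set can be rerun with different parameters by editing only the second section.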
The preprocessing script prepares the whole-slide images (WSIs) to be processed by the ML workflow: it splits the WSIs into two datasets and partitions each WSI into smaller regions, called patches, which are filtered and labeled. The script can process several directories of WSIs using the openslide-python package. Each slide is then processed in the following manner:
- A binary background mask is generated using Otsu's Thresholding method.
- If an XML annotation file is provided, a binary label mask is generated by drawing polygons on a canvas.
- A sliding window technique is then applied to the background mask to generate patches. If a patch contains less tissue than a predefined threshold, the patch is discarded.
- If a patch is not discarded by the background filter, it is assigned a label according to the binary label mask.
- Information about the patch (coordinates, label) is then added to a pandas table.
- After all patches of a slide are processed, slide metadata (slide filepath, annotation filepath, etc.) are added to the pandas table and the entire table is inserted into an index file (pandas HDFStore file).
make -f Makefile.convert run \
CONFIG_FILE=rationai/config/prov_converter_config.json
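The sliding-window filtering and labeling steps listed above can be sketched on small in-memory masks. This is a simplified illustration, not the repository's code: the function name, the patch size, the tissue threshold, and the rule "label 1 if any pixel in the patch is annotated" are all assumptions, and plain nested lists stand in for the masks derived from a slide.

```python
def generate_patches(background, labels, patch=2, stride=2, min_tissue=0.5):
    """Slide a window over a binary background mask (1 = tissue).

    Returns one dict per kept patch with its coordinates and a label
    taken from the binary label mask (assumed rule: 1 if any pixel in
    the patch is annotated).
    """
    rows = []
    height, width = len(background), len(background[0])
    for y in range(0, height - patch + 1, stride):
        for x in range(0, width - patch + 1, stride):
            window = [background[y + dy][x + dx]
                      for dy in range(patch) for dx in range(patch)]
            tissue_ratio = sum(window) / len(window)
            if tissue_ratio < min_tissue:
                continue  # background filter: too little tissue
            annotated = any(labels[y + dy][x + dx]
                            for dy in range(patch) for dx in range(patch))
            rows.append({"coord_x": x, "coord_y": y, "label": int(annotated)})
    return rows


# Toy 4x4 masks standing in for a slide's background and label masks.
background = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 1],
]
labels = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
patches = generate_patches(background, labels)
print(patches)
```

In this toy input, only the top-left and bottom-right windows contain enough tissue to be kept, and only the bottom-right one overlaps the annotation.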
The training script implements the ML model training. It first splits the training set, represented as an index file, into two disjoint sets: a training set and a validation set. A Generator is constructed for each of the two sets. The Generator behaves as follows:
- A sampling structure (the Sampler) is built from the contents of an index file.
- During the training, the Generator samples a patch entry from the Sampler and passes it to an Extractor.
- The Extractor accesses the appropriate slide and extracts an RGBA image from the coordinates within the sampled entry.
- The extracted image is then augmented (if necessary) and normalized before being passed back to the Generator.
- The Generator repeats this process for each sampled entry in a batch before passing the batch to the Model.
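The interplay of Sampler, Extractor, and Generator described above might look roughly like the following. The class names mirror the terms used in the text, but the bodies are stand-ins written for illustration: the extractor returns placeholder values instead of reading a slide, augmentation is omitted, and the method names are assumptions.

```python
import random


class Sampler:
    """Holds patch entries built from an index and samples them at random."""

    def __init__(self, entries, seed=0):
        self.entries = list(entries)
        self.rng = random.Random(seed)

    def sample(self):
        return self.rng.choice(self.entries)


class Extractor:
    """Stand-in extractor: returns a fake 'image' instead of reading a slide."""

    def extract(self, entry):
        image = [entry["coord_x"], entry["coord_y"]]  # placeholder pixels
        normalized = [value / 255 for value in image]
        return normalized, entry["label"]


class Generator:
    """Builds one batch by sampling entries and passing them to the extractor."""

    def __init__(self, sampler, extractor, batch_size):
        self.sampler = sampler
        self.extractor = extractor
        self.batch_size = batch_size

    def next_batch(self):
        pairs = [self.extractor.extract(self.sampler.sample())
                 for _ in range(self.batch_size)]
        images = [image for image, _ in pairs]
        labels = [label for _, label in pairs]
        return images, labels


entries = [{"coord_x": 0, "coord_y": 0, "label": 0},
           {"coord_x": 255, "coord_y": 510, "label": 1}]
generator = Generator(Sampler(entries, seed=42), Extractor(), batch_size=4)
images, labels = generator.next_batch()
print(len(images), labels)
```

Because the sampled entries depend on the random state, they are exactly the kind of information that has to be recorded separately if a run is to be validated later.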
During training, the model repeatedly alternates between two modes:
- Training mode - the model updates its own parameters (weights) based on how well it manages to predict a correct label for the patches.
- Validation mode - the model tracks its performance on the validation dataset, which has not been provided to the ML model before. It uses this information to create periodic checkpoints on every improvement or to stop the training process prematurely.
make -f Makefile.experiment setup train \
TRAIN_CONFIG=rationai/config/prov_train_config.json \
EID_PREFIX=PROV-TRAIN
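The checkpoint-on-improvement and early-stopping behavior of the validation mode can be sketched as below. This is a minimal simulation over a precomputed list of validation losses; the monitored quantity (loss), the `patience` parameter, and the function name are assumptions made for the example and are not taken from the repository.

```python
def run_training(validation_losses, patience=2):
    """Simulate the train/validate alternation over precomputed losses.

    A checkpoint is recorded whenever the validation loss improves;
    training stops once `patience` consecutive epochs bring no improvement.
    Returns the checkpointed epochs and the index of the last epoch run.
    """
    best_loss = float("inf")
    checkpoints = []          # epochs at which a checkpoint was saved
    epochs_without_gain = 0
    for epoch, val_loss in enumerate(validation_losses):
        # Training mode would update the model weights here.
        # Validation mode: compare against the best loss seen so far.
        if val_loss < best_loss:
            best_loss = val_loss
            checkpoints.append(epoch)
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                return checkpoints, epoch  # stopped early
    return checkpoints, len(validation_losses) - 1


checkpoints, last_epoch = run_training([0.9, 0.7, 0.8, 0.6, 0.65, 0.7, 0.5])
print(checkpoints, last_epoch)
```

In this run the loss improves at epochs 0, 1, and 3, then fails to improve twice in a row, so training stops at epoch 5 and the final value in the list is never reached.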
The prediction script loads a previously trained model and executes it to create predictions for test slides (slides used neither for training nor for validation of the model). The predictions for each slide are appended to its corresponding table as a new column and saved to disk as a new predictions HDFStore file.
make -f Makefile.experiment setup test \
TEST_CONFIG=rationai/config/prov_test_config.json \
EID_PREFIX=PROV-PREDICT
During evaluation, Evaluator objects are used to calculate metrics of interest (Accuracy, Precision, Recall, etc.). The Generator uses a different Extractor during evaluation: instead of accessing slides and retrieving images, this Extractor retrieves only those columns from the HDFStore tables that are required by the Evaluators.
make -f Makefile.experiment setup eval \
EVAL_CONFIG=rationai/config/prov_eval_config.json \
EID_PREFIX=PROV-EVAL
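For reference, the metrics named above can be computed from a label column and a prediction column as follows. This is a generic sketch, not the repository's Evaluator classes; the function name, the 0.5 decision threshold, and the sample values are assumptions.

```python
def binary_metrics(y_true, y_score, threshold=0.5):
    """Compute accuracy, precision, and recall for binary labels.

    `y_score` holds predicted probabilities; a value at or above the
    (assumed) threshold counts as a positive prediction.
    """
    y_pred = [1 if score >= threshold else 0 for score in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Guard against division by zero when a class is never predicted
        # or never present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Illustrative values standing in for the label and prediction columns.
labels = [1, 0, 1, 1, 0, 0]
predictions = [0.9, 0.2, 0.4, 0.8, 0.7, 0.1]
metrics = binary_metrics(labels, predictions)
print(metrics)
```

Only the two columns are needed here, which is why no slide access is required at this stage.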
To run all steps (training, prediction, and evaluation), run the following command:
make -f Makefile.experiment run \
TRAIN_CONFIG=rationai/config/prov_train_config.json \
TEST_CONFIG=rationai/config/prov_test_config.json \
EVAL_CONFIG=rationai/config/prov_eval_config.json \
EID_PREFIX=PROV
Each Makefile call creates a new experiment directory <EID_PREFIX>-<EID_HASH>, where EID_PREFIX can be set during the Makefile call for easier experiment identification, and EID_HASH is generated randomly to minimize the risk of overwriting an existing experiment.
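One way such a directory name could be produced is sketched below. How the Makefiles actually generate EID_HASH is not described here, so the use of `secrets.token_hex`, the 8-character length, and the function name are assumptions for illustration only.

```python
import secrets
import tempfile
from pathlib import Path


def make_experiment_dir(root, eid_prefix):
    """Create and return a directory named <EID_PREFIX>-<EID_HASH>.

    The hash is 8 random hexadecimal characters (an assumed length).
    """
    eid_hash = secrets.token_hex(4)
    path = Path(root) / f"{eid_prefix}-{eid_hash}"
    # exist_ok=False makes a (very unlikely) name collision fail loudly
    # instead of silently reusing another experiment's directory.
    path.mkdir(parents=True, exist_ok=False)
    return path


root = tempfile.mkdtemp()  # scratch location for the example
experiment_dir = make_experiment_dir(root, "PROV-TRAIN")
print(experiment_dir.name)
```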
Because the workflow is heavily configuration-driven, a significant portion of an experiment's execution can be documented by preserving the inputs (configuration file, WSIs), the function (source code), and the outputs (output files). The configuration file details the modules and parameters used, whilst the source code defines the logic of individual modules and executed functions. In our example, the generated provenance documents the source code as a git commit hash, which is enough for deterministic parts of the computation. The configuration filepath, the output filepath (if any), and the commit hash are included in the final log file.
For modules that involve randomness (e.g. data splitting, data sampling), we need to retrieve and store the results of these random operations. For this purpose, we have decided to use a simple logging approach. During workflow execution, we export key-value pairs of interest into a structured JSON log, which is subsequently processed by a provenance generation script to generate provenance in accordance with the proposed provenance model. The following list summarizes the required logging functionality for each part of the ML workflow:
- Preprocessing - no special logging is needed, as the entire process is fully deterministic. Only the configuration file, the GitHub repository URL, and the output file are needed for provenance generation. The output file is represented in the resulting provenance graph as the set of hashed tables it contains.
- Training - since inputs for the model are generated randomly, we log the states of the following objects in order to validate the experiment: Datasource (hashed content of the data split sets), Generator (hashed sampled entries for each epoch), and Model (training and validation metrics at the end of each epoch; checkpoints).
- Predictions - fully deterministic process. We log the inputs (model checkpoint and dataset), logic (git commit hash) and outputs (HDF5 file with predictions).
- Evaluations - fully deterministic process. We log the inputs (model checkpoint and dataset), logic (git commit hash) and outputs (results of Evaluators).
The corresponding log files and configuration files of an exemplary run of the ML workflow can be found in outputs/experiment_logs.
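To illustrate how the outcome of a random operation can be reduced to a comparable value, the sketch below hashes the entries sampled in each epoch. It uses only the standard library; the helper name, the canonical JSON encoding, and the log layout are assumptions and do not reproduce the functions in rationai.utils.provenance.

```python
import hashlib
import json


def hash_entries(entries):
    """Return a SHA256 hex digest of a list of sampled entries.

    Entries are serialized to canonical JSON (sorted keys, no extra
    whitespace) so the same samples always yield the same digest.
    """
    canonical = json.dumps(entries, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical sampled entries for two epochs.
epoch_samples = {
    "epoch_0": [{"coord_x": 0, "coord_y": 0}, {"coord_x": 2, "coord_y": 2}],
    "epoch_1": [{"coord_x": 2, "coord_y": 2}, {"coord_x": 0, "coord_y": 0}],
}
log = {"generator": {epoch: hash_entries(samples)
                     for epoch, samples in epoch_samples.items()}}
print(json.dumps(log, indent=2))
```

The digest is sensitive to sampling order, so two epochs that draw the same patches in a different order are logged as different states.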
During an experiment run, a structured JSON log is constructed using a custom SummaryWriter object. Only a single instance with a given name can exist at any given time. Retrieving a SummaryWriter object with the same name from multiple locations returns the same object, similarly to the standard logging.Logger.
Any key-value pair that we wish to keep track of must be set using the SummaryWriter .set() or .add() functions. The utility package rationai.utils.provenance contains additional helpful functions, for example for generating SHA256 hashes of pandas tables, pandas HDFStore files, filepaths, and directories.
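# Retrieve the writer registered under the name 'provenance'.
log = SummaryWriter.getLogger('provenance')
# Positional arguments form the path of nested keys; `value` is stored at the leaf.
log.set('level1', 'level2', 'level3', value='value')
log.set('level1', 'key', value=5)
# Write the collected key-value pairs to a JSON file at `filepath`.
log.to_json(filepath)
# {
#   "level1": {
#     "level2": {
#       "level3": "value"
#     },
#     "key": 5
#   }
# }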
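The two behaviors described above (one shared object per name, and nested keys set through positional arguments) can be reproduced with a small stand-in class. `NestedLog` below is a hypothetical illustration written for this example; it is not the repository's SummaryWriter and omits .add() and file output.

```python
import json


class NestedLog:
    """Hypothetical stand-in for a named, shared, nested key-value log."""

    _instances = {}  # registry: one object per name

    @classmethod
    def get_logger(cls, name):
        # Return the existing object for this name, creating it on first use.
        if name not in cls._instances:
            cls._instances[name] = cls()
        return cls._instances[name]

    def __init__(self):
        self.data = {}

    def set(self, *keys, value):
        # Walk (and create) intermediate dicts, then store at the last key.
        node = self.data
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = value

    def to_json(self):
        return json.dumps(self.data, indent=2)


log = NestedLog.get_logger("provenance")
log.set("level1", "level2", "level3", value="value")
log.set("level1", "key", value=5)

# Requesting the same name elsewhere yields the same object.
same_log = NestedLog.get_logger("provenance")
print(same_log is log)
print(same_log.to_json())
```

A shared registry of this kind lets separate modules contribute to a single log without passing the object around explicitly.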
To parse the logs and generate the resulting provenance according to the proposed model, we can call the Makefile.provenance file.
Provenance Graph Generation
make -f Makefile.provenance run \
PREPROC_LOG=outputs/experiment_logs/prov_preprocess.log \
TRAIN_LOG=outputs/experiment_logs/prov_train.log \
TEST_LOG=outputs/experiment_logs/prov_test.log \
EVAL_LOG=outputs/experiment_logs/prov_eval.log
This command produces three provenance bundles, depicted as PNG images: prov-preprocessing, prov-training, and prov-evaluation.
The resulting provenance graphs, serialized into a graphical format, can be found in outputs/provenance_graphs. The underlying provenance library would also allow the provenance to be serialized into the PROV-O (RDF), PROV-XML, and PROV-JSON formats.