The Data Observation Toolkit (DOT)

In 2019, the United Nations Statistical Commission highlighted the critical role of accurate health data, stating, “Every misclassified or unrecorded death is a lost opportunity to ensure other mothers and babies do not die in the same way. When it comes to health, better data can be a matter of life and death.” In response, DataKind developed DOT to increase trust in public health data, which is essential for equitable, data-driven health service delivery and optimized policy responses. DOT was created in collaboration with our global network of frontline health partners, including Ministries of Health, frontline health workers, and funders, all working together to strengthen health systems worldwide. You can read more about this initiative in the articles below:

The Data Observation Toolkit (DOT) is designed to monitor data and flag potential data integrity issues. It can identify problems such as missing or duplicate data, inconsistencies, outliers, and even domain-specific issues like missed follow-up medical treatments after diagnosis. DOT features a user-friendly interface for configuring powerful tools like the dbt and Great Expectations libraries, along with a built-in database for storing and classifying monitoring results. The primary goal of DOT is to make data monitoring more accessible, allowing users to ensure high-quality data without requiring extensive technical expertise. Below is a high-level overview of the tool and its architecture:

DOT high-level overview

DOT Architecture
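
To make the kinds of issues described above concrete, here is a minimal plain-Python sketch of three generic checks: missing values, duplicate rows, and outliers. It is an illustration only and is not how DOT implements its tests (DOT configures dbt and Great Expectations for that). The record fields, the function names, and the median-absolute-deviation rule with a factor of 5 are all assumptions chosen for the example.

```python
from statistics import median

# Hypothetical records; the field names are illustrative, not DOT's schema.
records = [
    {"patient_id": "p1", "temperature": 36.8},
    {"patient_id": "p2", "temperature": None},
    {"patient_id": "p2", "temperature": None},
    {"patient_id": "p3", "temperature": 37.1},
    {"patient_id": "p4", "temperature": 36.5},
    {"patient_id": "p5", "temperature": 45.0},
]


def find_missing(rows, field):
    """Return indices of rows where the field is absent or None."""
    return [i for i, row in enumerate(rows) if row.get(field) is None]


def find_duplicates(rows):
    """Return indices of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for i, row in enumerate(rows):
        # Sort by field name only, so None and numbers are never compared.
        key = tuple(sorted(row.items(), key=lambda item: item[0]))
        if key in seen:
            duplicates.append(i)
        seen.add(key)
    return duplicates


def find_outliers(rows, field, factor=5.0):
    """Return indices whose value is far from the median (assumed MAD rule)."""
    values = [(i, row[field]) for i, row in enumerate(rows)
              if row.get(field) is not None]
    if not values:
        return []
    center = median(v for _, v in values)
    mad = median(abs(v - center) for _, v in values)
    if mad == 0:
        return []
    return [i for i, v in values if abs(v - center) > factor * mad]


missing = find_missing(records, "temperature")
duplicates = find_duplicates(records)
outliers = find_outliers(records, "temperature")
```

With the sample records above, the two rows without a temperature are reported as missing, the second of them is also reported as a duplicate, and the 45.0 reading is reported as an outlier.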

General Configuration Pre-requisites:

To run DOT you will need to:

Install Python 3.8.9
Install the necessary Python packages by running the following commands in your terminal (additional information for the Mac/Linux terminal, additional information for the Windows terminal):
- pip install gdown
- pip install python-on-whales
Install Docker Desktop. First make sure you have checked the Docker prerequisites. We recommend allocating at least 4GB of memory, which can be set in the Docker preferences, but this can vary depending on the volume of data being tested
If running on a Mac with an M1/M2 chip, install Rosetta and set export DOCKER_DEFAULT_PLATFORM=linux/amd64 in the terminal where you will run the instructions below
(Windows users only) Install WSL (Windows Subsystem for Linux)

Alternatively, you can use the provided environment.yml if you have miniconda installed.
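
The prerequisites above can be verified before starting with a short script. The following is a sketch, not part of the DOT repository: the function name `check_prerequisites` is hypothetical, and matching only the major and minor Python version (3.8) is an assumption about how strictly the 3.8.9 requirement should be enforced. It uses only the standard library and reports problems instead of raising errors.

```python
import importlib.util
import os
import platform
import shutil
import sys


def check_prerequisites():
    """Return a list of human-readable problems; an empty list means all passed."""
    problems = []

    # The prerequisites ask for Python 3.8.9; only major.minor is compared here.
    if sys.version_info[:2] != (3, 8):
        found = ".".join(str(part) for part in sys.version_info[:3])
        problems.append(f"Python 3.8.9 is expected, found {found}")

    # Import names of the packages installed with pip install gdown / python-on-whales.
    for module in ("gdown", "python_on_whales"):
        if importlib.util.find_spec(module) is None:
            problems.append(f"Python module '{module}' is not installed")

    # Docker Desktop puts the docker executable on the PATH.
    if shutil.which("docker") is None:
        problems.append("docker executable not found on PATH")

    # Apple Silicon Macs need the platform override described above.
    if sys.platform == "darwin" and platform.machine() == "arm64":
        if os.environ.get("DOCKER_DEFAULT_PLATFORM") != "linux/amd64":
            problems.append("DOCKER_DEFAULT_PLATFORM is not set to linux/amd64")

    return problems


problems = check_prerequisites()
for problem in problems:
    print(f"WARNING: {problem}")
```

The script does not check Docker's memory allocation or whether WSL is installed; those still need to be confirmed manually.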

After completing the software prerequisites for your operating system, download or clone the DOT repository to your computer. You will need this repository for all the setups listed below.

Configuration

The following sections provide step-by-step instructions for configuring various components of DOT:

Sample data

Explore these comprehensive datasets, including global COVID-19 data, U.S. childhood obesity records, and datasets ranging from 1,000 to over a million patient entries, along with a synthetic dataset demonstrating DOT's capabilities with frontline health data.

Guidelines for adding new tests

Existing tests are in the self-tests folder
All tests extend the test base class, which:
- facilitates the import of modules under test
- recreates a directory in the file system for the test outputs
- provides a number of functions for supporting tests that access the database: mocking the config files to point to the test dot_config.yml, (re)creating a schema for DOT configuration and loading it with test data, etc.
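
The pattern described above can be sketched with the standard `unittest` module. This is a simplified illustration, not the base class in the self-tests folder: the class names `BaseSelfTestSketch` and `ExampleOutputTest` and the output directory name are hypothetical, and the database and config-mocking helpers are omitted entirely.

```python
import shutil
import tempfile
import unittest
from pathlib import Path


class BaseSelfTestSketch(unittest.TestCase):
    """Hypothetical stand-in for the real base class; names are illustrative."""

    # Assumed location for test outputs; the real class may use another path.
    output_dir = Path(tempfile.gettempdir()) / "dot_self_test_outputs"

    def setUp(self):
        # Recreate the output directory so each test starts from a clean state.
        shutil.rmtree(self.output_dir, ignore_errors=True)
        self.output_dir.mkdir(parents=True)

    def tearDown(self):
        # Remove the outputs once the test has finished.
        shutil.rmtree(self.output_dir, ignore_errors=True)


class ExampleOutputTest(BaseSelfTestSketch):
    """A new test only needs to extend the base class and use its directory."""

    def test_writes_result_file(self):
        result_file = self.output_dir / "result.txt"
        result_file.write_text("ok")
        self.assertEqual(result_file.read_text(), "ok")
```

A test written this way can be run with `python -m unittest`. When adding a real test, extend the existing base class in the self-tests folder rather than this sketch, so the database and configuration helpers are available.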

Code quality

We have instituted a pair of tools to ensure the code base will remain at an acceptable quality as it is shared and developed in the community.

The Python code formatter “black”. As described by its authors, it is deterministic and fast, and its settings can be modified. We use the default settings, most notably formatting to a limit of 88 characters per line.
The code linter pylint. This follows the PEP8 style standard. PEP8 formatting standards are taken care of in black, with the exception that the default pylint line length is 80. Pylint is also modifiable, and the standard set of exclusions to the PEP8 standard we have chosen is found here. We chose a score of 7 as the minimum pylint score for code to be shared. The combination of black and pylint can be incorporated into the git process using a pre-commit hook by running setup_hooks.sh
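
As a rough illustration of how the two tools combine, the sketch below runs black in check mode and pylint with a minimum score of 7. It is not the repository's lint.py or the hook installed by setup_hooks.sh; the function names are hypothetical, and it assumes black and a pylint version that supports `--fail-under` are installed in the current environment.

```python
import subprocess
import sys

MIN_PYLINT_SCORE = 7  # minimum score stated above


def build_commands(paths):
    """Build the black and pylint command lines for the given paths."""
    black_cmd = [sys.executable, "-m", "black", "--check", *paths]
    pylint_cmd = [
        sys.executable, "-m", "pylint",
        f"--fail-under={MIN_PYLINT_SCORE}", *paths,
    ]
    return [black_cmd, pylint_cmd]


def run_checks(paths):
    """Run both tools and return True only if every command exits with 0."""
    all_passed = True
    for command in build_commands(paths):
        completed = subprocess.run(command, check=False)
        all_passed = all_passed and completed.returncode == 0
    return all_passed
```

Calling `run_checks(["dot"])` from the repository root would check the dot folder, and a `False` result could be used to block a commit. Any project-specific pylint exclusions would still need to be supplied through a pylint configuration file.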

For detailed information on advanced configuration options and guidelines for contributing to the project, please refer to the CONTRIBUTING.md document.
