In 2019, the United Nations Statistical Commission highlighted the critical role of accurate health data, stating, “Every misclassified or unrecorded death is a lost opportunity to ensure other mothers and babies do not die in the same way. When it comes to health, better data can be a matter of life and death.” In response, DataKind developed DOT to increase trust in public health data, which is essential for equitable, data-driven health service delivery and optimized policy responses. DOT was created in collaboration with our global network of frontline health partners, including Ministries of Health, frontline health workers, and funders, all working together to strengthen health systems worldwide. You can read more about this initiative in the articles below:
- Pathways to Increasing Trust in Public Health Data
- Empowering Health Worker and Community Health Systems: Data Integrity and the Future of Intelligent Community Health Systems in Uganda
- Harnessing the power of data science in healthcare
- How Data Empowers Health Workers—and Powers Health Systems
The Data Observation Toolkit (DOT) is designed to monitor data and flag potential data integrity issues. It can identify problems such as missing or duplicate data, inconsistencies, outliers, and even domain-specific issues like missed follow-up medical treatments after diagnosis. DOT features a user-friendly interface for easily configuring powerful tools like the DBT and Great Expectations libraries, along with a built-in database for storing and classifying monitoring results. The primary goal of DOT is to make data monitoring more accessible, allowing users to ensure high-quality data without requiring extensive technical expertise. Below is a high-level overview of the tool and how it is architected:
[Figure: DOT high-level overview]
[Figure: DOT architecture]

To run DOT you will need to:
- Install Python 3.8.9
- Install the necessary Python packages by running the following commands in your terminal (additional information is available for the Mac/Linux terminal and for the Windows terminal):
pip install gdown
pip install python-on-whales
- Install Docker Desktop. First make sure you have checked the Docker prerequisites. We recommend allocating at least 4 GB of memory, which can be set in the Docker preferences, though the amount needed varies with the volume of data being tested
- If running on a Mac with an M1/M2 chip, install Rosetta and set the following in the terminal where you will run the instructions below:
export DOCKER_DEFAULT_PLATFORM=linux/amd64
- (Windows users only) Install WSL (Windows Subsystem for Linux) on Windows PCs
Alternatively, you can use the provided environment.yml if you have miniconda installed.
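The prerequisites above can be sanity-checked before moving on. The following is a minimal sketch, not part of DOT itself: the function name `check_prerequisites` and its messages are hypothetical, and it only checks for Python 3.8 or later rather than the exact 3.8.9 release named above. It verifies that the two pip packages are importable, that a `docker` executable is on the PATH, and, on Apple silicon, that `DOCKER_DEFAULT_PLATFORM` is set as described.

```python
import importlib.util
import os
import platform
import shutil
import sys


def check_prerequisites(version_info, system, machine, env, which, find_spec):
    """Return a list of problems found; an empty list means every check passed.

    All inputs are passed in explicitly so the checks can be exercised
    without depending on the machine they run on.
    """
    problems = []

    # Assumption: any Python >= 3.8 is acceptable for this sketch.
    if tuple(version_info[:2]) < (3, 8):
        problems.append("Python 3.8 or later is required")

    # Import names for the pip packages gdown and python-on-whales.
    for module_name in ("gdown", "python_on_whales"):
        if find_spec(module_name) is None:
            problems.append(f"Python package providing '{module_name}' is not installed")

    # Docker Desktop puts a 'docker' executable on the PATH.
    if which("docker") is None:
        problems.append("The 'docker' command was not found on the PATH")

    # Apple silicon reports Darwin/arm64 (a Python running under Rosetta
    # reports x86_64 instead, so this check would be skipped there).
    if system == "Darwin" and machine == "arm64":
        if env.get("DOCKER_DEFAULT_PLATFORM") != "linux/amd64":
            problems.append("DOCKER_DEFAULT_PLATFORM should be set to linux/amd64")

    return problems


if __name__ == "__main__":
    found = check_prerequisites(
        sys.version_info,
        platform.system(),
        platform.machine(),
        os.environ,
        shutil.which,
        importlib.util.find_spec,
    )
    for problem in found:
        print(f"- {problem}")
    if not found:
        print("All prerequisite checks passed")
```

The checks do not cover Docker's memory setting or the WSL installation on Windows, which still need to be confirmed by hand.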
After completing the software prerequisites for your operating system, download or clone the DOT repository to your computer. You will need this repository for all the setups listed below.
The following sections provide step-by-step instructions for configuring various components of DOT:
- Getting Started with DOT
- Setting Up the Docker Environment and Running DOT
- Deploying DOT to Airflow
- Configuring the DOT Database
- DBT for DOT
- Configuring DOT
- Developing the Appsmith UI
- Advanced Topics
Explore these comprehensive datasets, including global COVID-19 data, U.S. childhood obesity records, and datasets ranging from 1,000 to over a million patient entries, along with a synthetic dataset demonstrating DOT's capabilities with frontline health data.
- Existing tests are in the self-tests folder
- All tests extend the test base class, which
- facilitates the import of modules under test
- recreates a directory in the file system for the test outputs
- provides a number of functions to support tests that access the database: mocking the config files to point to the test dot_config.yml, (re)creating a schema for the DOT configuration and loading it with test data, etc.
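The responsibilities of the test base class listed above can be illustrated with a small sketch. This is not DOT's actual base class: the class names, the `EXAMPLE_DOT_CONFIG` environment variable, the `example_settings` table, and the use of an in-memory SQLite database as a stand-in for the DOT database are all assumptions made for illustration.

```python
import os
import shutil
import sqlite3
import tempfile
import unittest
from pathlib import Path
from unittest import mock


class BaseSelfTestSketch(unittest.TestCase):
    """Hypothetical base class showing the kind of setup described above."""

    def setUp(self):
        # Recreate a clean directory in the file system for test outputs.
        self.output_dir = Path(tempfile.gettempdir()) / "example_self_test_outputs"
        shutil.rmtree(self.output_dir, ignore_errors=True)
        self.output_dir.mkdir(parents=True)

        # Point a (hypothetical) config lookup at a test dot_config.yml.
        test_config = self.output_dir / "dot_config.yml"
        test_config.write_text("# placeholder test configuration\n")
        patcher = mock.patch.dict(os.environ, {"EXAMPLE_DOT_CONFIG": str(test_config)})
        patcher.start()
        self.addCleanup(patcher.stop)

        # (Re)create a schema and load it with test data. SQLite is only a
        # stand-in here so that the sketch runs without a database server.
        self.conn = sqlite3.connect(":memory:")
        self.addCleanup(self.conn.close)
        self.conn.execute(
            "CREATE TABLE example_settings (name TEXT PRIMARY KEY, value TEXT)"
        )
        self.conn.execute("INSERT INTO example_settings VALUES (?, ?)", ("project", "demo"))
        self.conn.commit()


class ExampleTest(BaseSelfTestSketch):
    """A test that relies on the fixtures prepared by the base class."""

    def test_fixtures_are_ready(self):
        self.assertTrue(self.output_dir.is_dir())
        self.assertTrue(os.environ["EXAMPLE_DOT_CONFIG"].endswith("dot_config.yml"))
        rows = self.conn.execute("SELECT value FROM example_settings").fetchall()
        self.assertEqual(rows, [("demo",)])
```

Saved to a file, a sketch like this can be run with `python -m unittest <file>`. Registering teardown through `addCleanup` keeps the environment patch and the database connection from leaking between tests even when `setUp` fails partway through.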
We have instituted a pair of tools to ensure the code base will remain at an acceptable quality as it is shared and developed in the community.
- The formulaic Python formatter “black”. As described by its authors, it is deterministic and fast, and its settings can be modified. We use the default settings, most notably a line-length limit of 88 characters.
- The code linter pylint, which follows the PEP8 style standard. PEP8 formatting is taken care of by black, with the exception of line length, where pylint's default limit differs from black's 88 characters. Pylint is also configurable, and the standard set of exclusions to the PEP8 standard that we have chosen can be found here. We chose 7 as the default minimum pylint score for code to be shared. The combination of black and pylint can be incorporated into the git process using a pre-commit hook by running setup_hooks.sh
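To make the two checks concrete, here is a minimal sketch of running them from Python. It is not the content of setup_hooks.sh, which is not shown here; the helper names `build_commands` and `run_checks` are hypothetical. It relies only on black's `--check` option (report files that would be reformatted without changing them) and pylint's `--fail-under` option (fail when the score is below the threshold), and assumes both tools are installed in the current environment.

```python
import subprocess
import sys

# Minimum pylint score stated above.
MIN_PYLINT_SCORE = 7


def build_commands(paths):
    """Build the black and pylint command lines for the given paths."""
    return [
        # black --check exits non-zero if any file would be reformatted.
        [sys.executable, "-m", "black", "--check", *paths],
        # pylint --fail-under exits non-zero if the score is below the threshold.
        [sys.executable, "-m", "pylint", f"--fail-under={MIN_PYLINT_SCORE}", *paths],
    ]


def run_checks(paths, runner=subprocess.run):
    """Run every command and return True only if all of them exit with 0."""
    codes = [runner(cmd, check=False).returncode for cmd in build_commands(paths)]
    return all(code == 0 for code in codes)


if __name__ == "__main__":
    # Paths to check are taken from the command line, defaulting to the
    # current directory.
    targets = sys.argv[1:] or ["."]
    sys.exit(0 if run_checks(targets) else 1)
```

Running both tools before reporting lets a contributor see formatting and lint problems in one pass instead of fixing one and then discovering the other.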
For detailed information on advanced configuration options and guidelines for contributing to the project, please refer to the CONTRIBUTING.md document.