tguyet/medtrajectory_datagen

SNDS synthetic data generator

The SNDS, formerly SNIIRAM, is a very large database (several TB of data and about 700 tables) containing healthcare reimbursement information for about 60 million French insured patients.

This database is used to carry out epidemiological and medical-economic studies. Due to its sensitive medical content, identifying information (names, social security numbers) is removed or replaced by spurious information.

This repository contains a tool that generates a synthetic version of the database. The generator has three properties:

  1. it generates relational data compliant with the original database schema,
  2. it generates data with realistic distributions,
  3. it guarantees privacy preservation because it relies on open data only.

This work is currently under review at the AIME conference. The results of additional experiments can be found in a notebook, which can be viewed directly in the online GitLab interface.

How to generate your own database

  • Download all the raw data available online
    • In the data directory, run the script load_opendata.py to download and unzip the open data into that directory
    • Some open data links may require user interaction to start the download; in that case, download the datasets manually (use the links provided by the load_opendata.py script)
  • Run the notebooks to generate the intermediate files
    • Start by running the script prepare_data.py (make sure all the flags at the beginning of the script are set to True so that every required file is generated)
    • Each step is detailed in the corresponding notebooks in the Data_Analysis directory
  • Run the script create_nomenclature.py to create the nomenclature part of the SNDS. This script creates a large database made of the SNDS tables that can be provided without access restriction. This part of the database is provided by the Health Data Hub.
  • Move all the generated files into a directory corresponding to a simulation
  • Run the simulation35.py script to run the simulator based on open data. This script can be configured (population size, number of physicians, administrative regions to mimic, etc.)
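
The steps above can be sketched as a small driver script. This is only an illustrative sketch: the script names come from the list above, but the working directory of each script (here, load_opendata.py in a `data` folder and the other scripts at the repository root) and the helper names `build_commands` and `run_pipeline` are assumptions, not part of the repository. The manual step of moving the generated files into a simulation directory is not automated here.

```python
import subprocess
import sys
from pathlib import Path

# Script names are taken from the steps above.
# The working directories are ASSUMPTIONS about the repository layout.
PIPELINE = [
    ("data", "load_opendata.py"),      # download and unzip the open data
    (".", "prepare_data.py"),          # generate the intermediate files
    (".", "create_nomenclature.py"),   # build the nomenclature tables
    # Manual step: move the generated files into the simulation directory.
    (".", "simulation35.py"),          # run the simulator
]


def build_commands(root, python=sys.executable):
    """Return (command, working directory) pairs in execution order."""
    root = Path(root)
    return [([python, script], root / workdir) for workdir, script in PIPELINE]


def run_pipeline(root, dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    for cmd, cwd in build_commands(root):
        print(f"[{cwd}] {' '.join(cmd)}")
        if not dry_run:
            # check=True stops the pipeline at the first failing script.
            subprocess.run(cmd, cwd=cwd, check=True)
```

Calling `run_pipeline(".")` only prints the planned commands; pass `dry_run=False` to execute them once the paths match your checkout.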

Open data resources

This section gives the exhaustive list of the open datasets currently used to populate the database realistically.

We invite the reader to have a look at the data preparation notebooks to have a flavour of the content of these datasets.

Requirements

The software is developed with Python tools (scripts and notebooks). It uses the following libraries:

  • sqlalchemy
  • sqlite3
  • tableschema
  • tableschema-sql
  • wget
  • unzip
  • gzip
  • pandas
  • jupyterlab
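
A quick way to see which of these requirements are missing from an environment is sketched below. It rests on assumptions: the import names are guessed from the package names (in particular, `tableschema_sql` for tableschema-sql), and `unzip` is treated as a command-line tool, not a Python module. The helper names `missing_modules` and `missing_tools` are hypothetical.

```python
import importlib.util
import shutil

# Import names ASSUMED from the package names listed above.
PYTHON_MODULES = [
    "sqlalchemy", "sqlite3", "tableschema", "tableschema_sql",
    "wget", "gzip", "pandas", "jupyterlab",
]
# ASSUMPTION: unzip refers to the system command, not a Python package.
SYSTEM_TOOLS = ["unzip"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def missing_tools(names):
    """Return the executables that are not on the PATH."""
    return [n for n in names if shutil.which(n) is None]


if __name__ == "__main__":
    print("Missing Python modules:", missing_modules(PYTHON_MODULES))
    print("Missing system tools:", missing_tools(SYSTEM_TOOLS))
```

The check only looks modules up without importing them, so it stays fast even when heavy packages such as pandas are installed.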
