Rename MetaSynth to metasyn
The reason for the rename is that there is already a software package called MetaSynth. Although the two packages do very different things, ours will be easier to find under a different name.

- Imports are now `metasyn` instead of `metasynth`.
- All documentation should be updated.
- Plugins should now use `metasyncontrib` instead of `metasynthcontrib`.
- The name is now written in lower case, unless it starts a sentence.
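
For downstream code that still imports the old names, the first and third points above can be sketched as a small text rewrite. This is only an illustration and not part of the commit: the helper name `rename_imports` and the `RENAMES` table are hypothetical, and the pattern deliberately leaves the capitalized spelling and identifiers such as `metasynth_in_detail` untouched, since those need manual review.

```python
import re

# Old-to-new package names, taken from the commit message.
RENAMES = {
    "metasynthcontrib": "metasyncontrib",
    "metasynth": "metasyn",
}

# Word boundaries keep the match to whole names; the longer name is listed
# first so that it is tried before its prefix.
_PATTERN = re.compile(r"\b(metasynthcontrib|metasynth)\b")


def rename_imports(source: str) -> str:
    """Return `source` with the old lower-case package names replaced."""
    return _PATTERN.sub(lambda match: RENAMES[match.group(1)], source)


if __name__ == "__main__":
    print(rename_imports("from metasynth import MetaFrame, demo_file"))
```

Because the match is case-sensitive, prose that mentions `MetaSynth` is not rewritten automatically; the capitalization rule from the last point still has to be applied by hand.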
qubixes authored Sep 25, 2023
1 parent d32fd5b commit ede2337
Showing 93 changed files with 571 additions and 482 deletions.
8 changes: 4 additions & 4 deletions .github/workflows/python-package.yml
@@ -37,16 +37,16 @@ jobs:
python -m pip install ".[test]"
- name: Check pep8 with flake8
run: |
flake8 metasynth --max-line-length 100
flake8 metasyn --max-line-length 100
- name: Lint with pylint
run: |
pylint metasynth
pylint metasyn
- name: Check docstrings with pydocstyle
run: |
pydocstyle metasynth --convention=numpy --add-select=D417 --add-ignore="D102,D105"
pydocstyle metasyn --convention=numpy --add-select=D417 --add-ignore="D102,D105"
- name: Check types with MyPy
run: |
mypy metasynth
mypy metasyn
- name: Check if documentation builds.
run: |
cd docs; make html SPHINXOPTS="-W --keep-going"
6 changes: 3 additions & 3 deletions CITATION.cff
@@ -1,5 +1,5 @@
cff-version: 1.2.0
title: MetaSynth
title: metasyn
message: >-
If you use this software, please cite it using the
metadata from this file.
@@ -25,8 +25,8 @@ identifiers:
- type: doi
value: 10.5281/zenodo.7696031
description: Latest archived version of Metasynth
repository-code: 'https://github.com/sodascience/metasynth'
url: 'https://metasynth.readthedocs.io/'
repository-code: 'https://github.com/sodascience/metasyn'
url: 'https://metasyn.readthedocs.io/'
abstract: >-
A Python package for generating synthetic data from
tabular datasets, with a focus on privacy and
12 changes: 6 additions & 6 deletions Dockerfile
@@ -5,17 +5,17 @@ FROM python:3.11-slim
# git is used by versioneer to define the project version
RUN apt update && apt install -y git

# Install metasynth
COPY . metasynth/
RUN pip install metasynth/
# Install metasyn
COPY . metasyn/
RUN pip install metasyn/

# Remove metasynth folder
RUN rm -r metasynth/
# Remove metasyn folder
RUN rm -r metasyn/

# For excel output use optional XlsxWriter package
RUN pip install XlsxWriter

# Remove system dependencies
RUN apt remove -y git && apt autoremove -y

ENTRYPOINT [ "metasynth" ]
ENTRYPOINT [ "metasyn" ]
68 changes: 33 additions & 35 deletions README.md
@@ -1,26 +1,26 @@
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/metasynth)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/sodascience/metasynth/HEAD?labpath=examples%2Fgetting_started.ipynb)
[![Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sodascience/metasynth/blob/main/examples/getting_started.ipynb)
[![docs](https://readthedocs.org/projects/metasynth/badge/?version=latest)](https://metasynth.readthedocs.io/en/latest/index.html)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/metasyn)
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/sodascience/metasyn/HEAD?labpath=examples%2Fgetting_started.ipynb)
[![Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sodascience/metasyn/blob/main/examples/getting_started.ipynb)
[![docs](https://readthedocs.org/projects/metasyn/badge/?version=latest)](https://metasyn.readthedocs.io/en/latest/index.html)

![MetaSynth Logo](docs/source/images/logos/blue.svg)
![Metasyn Logo](docs/source/images/logos/blue.svg)

# MetaSynth
MetaSynth is a Python package designed to generate tabular synthetic data for rigorous code testing and reproducibility.
Researchers and data owners can use MetaSynth to generate and share synthetic versions of their sensitive datasets, mitigating privacy concerns. Additionally, MetaSynth facilitates transparency and reproducibility, by allowing the underlying MetaFrames to be exported and shared. Other researchers can use these to regenerate consistent synthetic datasets, validating published work without requiring sensitive data.
# Metasyn
Metasyn is a Python package designed to generate tabular synthetic data for rigorous code testing and reproducibility.
Researchers and data owners can use metasyn to generate and share synthetic versions of their sensitive datasets, mitigating privacy concerns. Additionally, metasyn facilitates transparency and reproducibility, by allowing the underlying MetaFrames to be exported and shared. Other researchers can use these to regenerate consistent synthetic datasets, validating published work without requiring sensitive data.

The package has three main functionalities:

1. **Estimation**: MetaSynth can create a MetaFrame, from a dataset. A MetaFrame is essentially a fitted model that characterizes the structure of the original dataset without storing actual values. It captures individual distributions and features, enabling generation of synthetic data based on these MetaFrames and can be seen as (statistical) metadata.
2. **Serialization**: MetaSynth can export a MetaFrame into an easy to read JSON file, allowing users to audit, understand, and modify their data generation model.
3. **Generation**: MetaSynth can generate synthetic data based on a MetaFrame. The synthetic data produced solely depends on the MetaFrame, thereby maintaining a critical separation between the original sensitive data and the synthetic data generated. The generated synthetic data, emulates the original data's format and plausibility at the individual record level and attempts to reproduce marginal (univariate) distributions where possible. Generated values are based on the observed distributions while adding a degree of variance and smoothing. The generated data does **not** aim to preserve the relationships between variables. The frequency of missing values and their codes are maintained in the synthetically-augmented dataset.
1. **Estimation**: Metasyn can create a MetaFrame, from a dataset. A MetaFrame is essentially a fitted model that characterizes the structure of the original dataset without storing actual values. It captures individual distributions and features, enabling generation of synthetic data based on these MetaFrames and can be seen as (statistical) metadata.
2. **Serialization**: Metasyn can export a MetaFrame into an easy to read JSON file, allowing users to audit, understand, and modify their data generation model.
3. **Generation**: Metasyn can generate synthetic data based on a MetaFrame. The synthetic data produced solely depends on the MetaFrame, thereby maintaining a critical separation between the original sensitive data and the synthetic data generated. The generated synthetic data, emulates the original data's format and plausibility at the individual record level and attempts to reproduce marginal (univariate) distributions where possible. Generated values are based on the observed distributions while adding a degree of variance and smoothing. The generated data does **not** aim to preserve the relationships between variables. The frequency of missing values and their codes are maintained in the synthetically-augmented dataset.

![MetaSynth Pipeline](docs/source/images/pipeline_basic.png)
![Metasyn Pipeline](docs/source/images/pipeline_basic.png)

### Key features
- **MetaFrame Generation**: MetaSynth allows the creation of a MetaFrame from a dataset provided as a Polars or Pandas DataFrame.
- **MetaFrame Generation**: Metasyn allows the creation of a MetaFrame from a dataset provided as a Polars or Pandas DataFrame.
MetaFrames includes key characteristics such as *variable names*, *data types*, *the percentage of missing values*, and *distribution parameters*.
- **Exporting MetaFrames**: MetaSynth can export and import MetaFrames to GMF files. These are JSON files that follow the easy to read and understand [Generative Metadata Format (GMF)](https://github.com/sodascience/generative_metadata_format).
- **Exporting MetaFrames**: Metasyn can export and import MetaFrames to GMF files. These are JSON files that follow the easy to read and understand [Generative Metadata Format (GMF)](https://github.com/sodascience/generative_metadata_format).

<details>
<summary> A simple example of an exported MetaFrame (following the GMF standard): </summary>
@@ -31,7 +31,7 @@ The package has three main functionalities:
"n_columns": 5,
"provenance": {
"created by": {
"name": "MetaSynth",
"name": "Metasyn",
"version": "0.4.0"
},
"creation time": "2023-08-07T12:04:40.669740"
@@ -130,41 +130,41 @@ The package has three main functionalities:
A more advanced example GMF, based on the [Titanic](https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv) dataset, can be found [here](examples/titanic_example.json)
</details>

- **Synthetic Data Generation**: MetaSynth allows for the generation of a polars DataFrame with synthetic data that resembles the original data.
- **Distribution Fitting**: MetaSynth allows for manual and automatic distribution fitting.
- **Data Type Support**: MetaSynth supports generating synthetic data for a variety of common data types including `categorical`, `string`, `integer`, `float`, `date`, `time`, and `datetime`.
- **Integration with Faker**: MetaSynth integrates with the [faker](https://github.com/joke2k/faker) package, a Python library for generating fake data such as names and emails. Allowing for synthetic data that is formatted realistically, while retaining privacy.
- **Structured String Detection**: MetaSynth identifies structured strings within your dataset, which can include formatted text,
- **Synthetic Data Generation**: Metasyn allows for the generation of a polars DataFrame with synthetic data that resembles the original data.
- **Distribution Fitting**: Metasyn allows for manual and automatic distribution fitting.
- **Data Type Support**: Metasyn supports generating synthetic data for a variety of common data types including `categorical`, `string`, `integer`, `float`, `date`, `time`, and `datetime`.
- **Integration with Faker**: Metasyn integrates with the [faker](https://github.com/joke2k/faker) package, a Python library for generating fake data such as names and emails. Allowing for synthetic data that is formatted realistically, while retaining privacy.
- **Structured String Detection**: Metasyn identifies structured strings within your dataset, which can include formatted text,
codes, identifiers, or any string that follows a specific pattern.
- **Handling Unique Values**: MetaSynth can identify and process variables with unique values or keys in the data, preserving their uniqueness in the synthetic dataset, which is crucial for generating synthetic data that maintains the characteristics of the original dataset.
- **Handling Unique Values**: Metasyn can identify and process variables with unique values or keys in the data, preserving their uniqueness in the synthetic dataset, which is crucial for generating synthetic data that maintains the characteristics of the original dataset.

Curious and want to learn more? Check out our[documentation](https://metasynth.readthedocs.io/en/latest/index.html)!
Curious and want to learn more? Check out our[documentation](https://metasyn.readthedocs.io/en/latest/index.html)!

## Getting Started
### Try it out online
If you're new to Python or simply want to quickly explore the basic features of MetaSynth, you can try it out using the online Google Colab tutorial. [Click here](https://colab.research.google.com/github/sodascience/metasynth/blob/main/examples/getting_started.ipynb) to access the tutorial. It provides a step-by-step walkthrough and example dataset to help you get started. However, please exercise caution when using sensitive data, as it will be handled through Google servers.
If you're new to Python or simply want to quickly explore the basic features of metasyn, you can try it out using the online Google Colab tutorial. [Click here](https://colab.research.google.com/github/sodascience/metasyn/blob/main/examples/getting_started.ipynb) to access the tutorial. It provides a step-by-step walkthrough and example dataset to help you get started. However, please exercise caution when using sensitive data, as it will be handled through Google servers.

### Local Installation
For more advanced users and researchers who prefer working on their local machines, you can install MetaSynth directly from PyPI using the following command in the terminal (not Python):
For more advanced users and researchers who prefer working on their local machines, you can install metasyn directly from PyPI using the following command in the terminal (not Python):

```sh
pip install metasynth
pip install metasyn
```

## Usage
To learn how to use MetaSynth effectively, refer to the comprehensive [documentation](https://metasynth.readthedocs.io/en/latest/index.html). The documentation covers all the necessary information and provides detailed explanations, examples, and usage guidelines.
To learn how to use metasyn effectively, refer to the comprehensive [documentation](https://metasyn.readthedocs.io/en/latest/index.html). The documentation covers all the necessary information and provides detailed explanations, examples, and usage guidelines.

Additionally, the documentation offers a series of [tutorials](https://metasynth.readthedocs.io/en/latest/usage/interactive_tutorials.html) that delve into specific features and use cases. These tutorials can further assist you in understanding and leveraging the capabilities of MetaSynth.
Additionally, the documentation offers a series of [tutorials](https://metasyn.readthedocs.io/en/latest/usage/interactive_tutorials.html) that delve into specific features and use cases. These tutorials can further assist you in understanding and leveraging the capabilities of metasyn.

### Quick start
Get started quickly with MetaSynth using the following example. In this concise demonstration, you'll learn the basic functionality of MetaSynth by generating synthetic data from [titanic](https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv) dataset.
Get started quickly with metasyn using the following example. In this concise demonstration, you'll learn the basic functionality of metasyn by generating synthetic data from [titanic](https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv) dataset.

It is important to start by importing the appropriate libraries:

```python
# import libraries
import polars as pl
from metasynth import MetaFrame, demo_file
from metasyn import MetaFrame, demo_file
```

#### Estimation: Generating a MetaFrame
@@ -191,7 +191,7 @@ df = pl.read_csv(dataset_csv, dtypes=data_types)
Note on using Pandas
</summary>
Internally, MetaSynth uses Polars (instead of Pandas) mainly because typing and the handling of non-existing data is more
Internally, metasyn uses Polars (instead of Pandas) mainly because typing and the handling of non-existing data is more
consistent. It is possible to supply a Pandas DataFrame instead of a polars DataFrame to `MetaFrame.fit_dataframe`.
However, this uses the automatic polars conversion functionality, which for some edge cases result in problems. Therefore,
we advise users to create Polars DataFrames. The resulting synthetic dataset is always a polars dataframe, but this can
@@ -205,7 +205,7 @@ be easily converted back to a Pandas DataFrame by using `df_pandas = df_polars.t
mf = MetaFrame.fit_dataframe(df)
```

> Note: At this point you will encounter a warning about `PassengerId` not being set as unique, you can safely ignore it and proceed. This warning occurs because `PassengerId` appears to contain unique values, but is not explicitly marked as a unique column. To remove the warning, you can set `PassengerId` to be a unique column. Our documentation explains how to do this when generating Metaframes: [Set Columns as Unique](https://metasynth.readthedocs.io/en/latest/usage/generating_metaframes.html#optional-parameters).
> Note: At this point you will encounter a warning about `PassengerId` not being set as unique, you can safely ignore it and proceed. This warning occurs because `PassengerId` appears to contain unique values, but is not explicitly marked as a unique column. To remove the warning, you can set `PassengerId` to be a unique column. Our documentation explains how to do this when generating Metaframes: [Set Columns as Unique](https://metasyn.readthedocs.io/en/latest/usage/generating_metaframes.html#optional-parameters).
#### (De)serialization: Exporting and importing a MetaFrame
_Note that exporting and importing is optional. You can generate synthetic data from **any** loaded MetaFrame, whether that be through importing a GMF file or generating a MetaFrame from an original DataFrame._
@@ -247,9 +247,7 @@ To contribute:

<!-- CONTACT -->
## Contact
**MetaSynth** is a project by the [ODISSEI Social Data Science (SoDa)](https://odissei-data.nl/nl/soda/) team.
Do you have questions, suggestions, or remarks on the technical implementation? File an issue in the
issue tracker or feel free to contact [Erik-Jan van Kesteren](https://github.com/vankesteren)
or [Raoul Schram](https://github.com/qubixes).
**Metasyn** is a project by the [ODISSEI Social Data Science (SoDa)](https://odissei-data.nl/nl/soda/) team.
Do you have questions, suggestions, or remarks on the technical implementation? File an issue in the issue tracker or feel free to contact [Erik-Jan van Kesteren](https://github.com/vankesteren) or [Raoul Schram](https://github.com/qubixes).

<img src="docs/source/images/logos/soda.png" alt="SoDa logo" width="250px"/>
4 changes: 2 additions & 2 deletions docs/source/about/about.rst
@@ -1,12 +1,12 @@
About
=====

Welcome to the MetaSynth About Section. This section is meant to provide more information on the background and context of MetaSynth.
Welcome to the metasyn About section. This section is meant to provide more information on the background and context of metasyn.

.. toctree::
:maxdepth: 1
:caption: Sections:

metasynth_in_detail
metasyn_in_detail
contact
license