From 40b1918a58a38a1673f7c216325451267bc6123f Mon Sep 17 00:00:00 2001
From: gAldeia
Date: Mon, 9 Sep 2024 22:08:23 -0300
Subject: [PATCH] Replacing `pandas-profiling` (deprecated) with `ydata-profiling`

The package previously known as pandas-profiling is now part of a larger
project and is moving away from the idea that it is intended to be used
only with dataframes. The package name has changed, and the last version
of `pandas-profiling` was released more than a year ago.

The GitHub workflow for profiling new datasets is not working as it
should, due to deprecated dependencies.
---
 docs_sources/index.Rmd | 4 ++--
 paper/paper.md         | 6 +++---
 pmlb/profiling.py      | 2 +-
 setup.py               | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/docs_sources/index.Rmd b/docs_sources/index.Rmd
index 9b79b88ee..18335dda6 100644
--- a/docs_sources/index.Rmd
+++ b/docs_sources/index.Rmd
@@ -12,7 +12,7 @@ These datasets cover a broad range of applications including binary/multi-class
 In the interactive [plotly](https://plotly.com/) chart below, each dot represents a dataset colored based on its associated task (classification vs. regression). In log scale, the *x* and *y* axis shows the number of observations and features respectively. Please click on the legend to hide/show the groups of datasets.
-Click on each dot to access the dataset's [pandas-profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/) report.
+Click on each dot to access the dataset's [ydata-profiling](https://docs.profiling.ydata.ai/latest/) report.
 *Note*: If a dataset has more than 20 features, we randomly chose 20 to be displayed in its profiling report. Therefore, please disregard the `Number of variables` in the corresponding report and, instead, use the correct `n_features` in the chart and table below.
@@ -84,7 +84,7 @@ ply
 Browse, sort, filter and search the complete table of summary statistics below.
-* Click on the dataset's name to access its [pandas-profiling](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/) report.
+* Click on the dataset's name to access its [ydata-profiling](https://docs.profiling.ydata.ai/latest/) report.
 * Click on the GitHub Octocat to access its metadata.
diff --git a/paper/paper.md b/paper/paper.md
index e17f4c63f..b35599a0b 100644
--- a/paper/paper.md
+++ b/paper/paper.md
@@ -122,10 +122,10 @@ API reference guides that detail all user-facing functions and variables in PMLB
 ## Pandas profiling reports
-For each dataset, we use [`pandas-profiling`](https://pandas-profiling.github.io/pandas-profiling/) to generate summary statistic reports.
-In addition to the descriptive statistics provided by the commonly-used `pandas.describe` (Python) [@McKinney2010] or `skimr::skim` (R) functions, `pandas-profiling` gives a more extensive exploration of the dataset, including correlation structure within the dataset and flagging of duplicate samples.
+For each dataset, we use [`ydata-profiling`](https://docs.profiling.ydata.ai/latest/) to generate summary statistic reports.
+In addition to the descriptive statistics provided by the commonly-used `pandas.describe` (Python) [@McKinney2010] or `skimr::skim` (R) functions, `ydata-profiling` gives a more extensive exploration of the dataset, including correlation structure within the dataset and flagging of duplicate samples.
 Browsing a report allows users and contributors to easily assess dataset quality and make any necessary changes.
-For example, if a feature is flagged by `pandas-profiling` as having a single value replicated in all samples, it is likely that this feature is uninformative for ML analysis and should be removed from the dataset.
+For example, if a feature is flagged by `ydata-profiling` as having a single value replicated in all samples, it is likely that this feature is uninformative for ML analysis and should be removed from the dataset.
 The profiling reports can be accessed by clicking on the dataset name in the interactive data table or the data point in the interactive chart on the PMLB website.
 Alternatively, all reports can be viewed on the repository's [gh-pages](https://github.com/EpistasisLab/pmlb/tree/gh-pages/profile) branch, or generated manually by users on their local computing resources.
diff --git a/pmlb/profiling.py b/pmlb/profiling.py
index 9d80a7c11..d80bfe162 100644
--- a/pmlb/profiling.py
+++ b/pmlb/profiling.py
@@ -3,7 +3,7 @@ import subprocess
 import pandas as pd
-from pandas_profiling import ProfileReport
+from ydata_profiling import ProfileReport
 from .pmlb import (
     fetch_data, get_updated_datasets, last_commit_message
diff --git a/setup.py b/setup.py
index c3553dcb6..031086afb 100644
--- a/setup.py
+++ b/setup.py
@@ -41,7 +41,7 @@ def calculate_version():
     ],
     extras_require={
         'dev': ['nose', 'numpy', 'scipy', 'tabulate', 'parameterized',
-                'matplotlib', 'seaborn', 'pandas-profiling'],
+                'matplotlib', 'seaborn', 'ydata-profiling'],
     },
     classifiers=[
         'Intended Audience :: Developers',