-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Summary tool : a lot more calculations needed #211
Labels
enhancement
New feature or request
Comments
@tgourdel I think this would require a mix of solutions, like pandas.DataFrame.dtypes and probably duckdb. Something more like that (well, first and last missing but you see the point) import duckdb
import pandas as pd
import numpy as np
# Example DataFrame
data = {
"id": [1, 2, 3, 4, 5],
"label": ["A", "BB", "CCC", None, "D"],
"amount": [10.5, 20.1, None, 40.8, 50.0],
}
df = pd.DataFrame(data)
# Create a connection to DuckDB and load the DataFrame
con = duckdb.connect()
con.register("df", df)
# Initialize an output list
results = []
# Analyze each column
for column in df.columns:
# Determine type
col_type = str(df[column].dtype)
# Numerical summaries using DuckDB
query = f"""
SELECT
COUNT(*) AS count,
COUNT(DISTINCT "{column}") AS count_distinct,
SUM(CASE WHEN "{column}" IS NULL THEN 1 ELSE 0 END) AS null_count,
MIN("{column}") AS min_value,
MAX("{column}") AS max_value,
AVG("{column}") AS average,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "{column}") AS median
FROM df
"""
summary = con.execute(query).fetchdf().iloc[0].to_dict()
# Additional string-specific analysis for object types
if df[column].dtype == "object" or col_type == "string":
lengths = df[column].dropna().str.len()
shortest_string = df[column].dropna().min()
longest_string = df[column].dropna().max()
summary.update(
{
"min_length": lengths.min(),
"max_length": lengths.max(),
"average_length": lengths.mean(),
"shortest_string": shortest_string,
"longest_string": longest_string,
}
)
else:
summary.update(
{
"min_length": None,
"max_length": None,
"average_length": None,
"shortest_string": None,
"longest_string": None,
}
)
# Percent of nulls
summary["percent_null"] = (summary["null_count"] / summary["count"]) * 100
# Add the type
summary["type"] = col_type
# Add the column name
summary["column"] = column
# Append to results
results.append(summary)
# Convert results to a DataFrame
output_df = pd.DataFrame(results)
# Reorganize columns for clarity
output_df = output_df[
[
"column",
"type",
"min_value",
"max_value",
"average",
"median",
"count",
"count_distinct",
"null_count",
"percent_null",
"min_length",
"max_length",
"average_length",
"shortest_string",
"longest_string",
]
]
print(output_df) |
@tgourdel looks like exactly what I have in mind import duckdb
import pandas as pd
import numpy as np
# Example DataFrame
data = {
"id": [1, 2, 3, 4, 5],
"label": ["A", "BB", "CCC", None, "D"],
"amount": [10.5, 20.1, None, 40.8, 50.0],
}
df = pd.DataFrame(data)
# Create a connection to DuckDB and load the DataFrame
con = duckdb.connect()
con.register("df", df)
# Initialize an output list
results = []
# Analyze each column
for column in df.columns:
# Determine type
col_type = str(df[column].dtype)
# Numerical summaries using DuckDB
query = f"""
SELECT
COUNT(*) AS count,
COUNT(DISTINCT "{column}") AS count_distinct,
SUM(CASE WHEN "{column}" IS NULL THEN 1 ELSE 0 END) AS null_count,
MIN("{column}") AS min_value,
MAX("{column}") AS max_value,
AVG("{column}") AS average,
STDDEV_POP("{column}") AS std_dev,
PERCENTILE_CONT(0.0) WITHIN GROUP (ORDER BY "{column}") AS p0,
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY "{column}") AS p25,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "{column}") AS p50,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY "{column}") AS p75,
PERCENTILE_CONT(1.0) WITHIN GROUP (ORDER BY "{column}") AS p100
FROM df
"""
summary = con.execute(query).fetchdf().iloc[0].to_dict()
# First and last value using DuckDB
first_last_query = f"""
SELECT
FIRST("{column}") AS first_value,
LAST("{column}") AS last_value
FROM df
"""
first_last = con.execute(first_last_query).fetchdf().iloc[0].to_dict()
# Merge the summaries
summary.update(first_last)
# Additional string-specific analysis for object types
if df[column].dtype == "object" or col_type == "string":
lengths = df[column].dropna().str.len()
shortest_string = df[column].dropna().min()
longest_string = df[column].dropna().max()
summary.update(
{
"min_length": lengths.min(),
"max_length": lengths.max(),
"average_length": lengths.mean(),
"shortest_string": shortest_string,
"longest_string": longest_string,
}
)
else:
summary.update(
{
"min_length": None,
"max_length": None,
"average_length": None,
"shortest_string": None,
"longest_string": None,
}
)
# Percent of nulls
summary["percent_null"] = (summary["null_count"] / summary["count"]) * 100
# Add the type
summary["type"] = col_type
# Add the column name
summary["column"] = column
# Append to results
results.append(summary)
# Convert results to a DataFrame
output_df = pd.DataFrame(results)
# Reorganize columns for clarity
output_df = output_df[
[
"column",
"type",
"first_value",
"last_value",
"min_value",
"p0",
"p25",
"p50",
"p75",
"p100",
"max_value",
"average",
"std_dev",
"count",
"count_distinct",
"null_count",
"percent_null",
"min_length",
"max_length",
"average_length",
"shortest_string",
"longest_string",
]
]
print(output_df) what do you think of that? |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Hello,
As of today, we only have a few metrics about the dataframe
We can also note that min and max aren't calculated at all for strings.
Here a few things I would add :
-min max working whatever is the type
use case for string : One very simple example that come in mind : let's say you have a customer dataset. It appears that despite having more than hundred, the max alphabetic is MIGHTYCUSTOMER, Inc. Well, it seems suspicious because you expect to have some customers coming after and clearly your dataset is not as complete as you think.
( a few use cases :
a/financial account (I will take french accounting, don't know foreign). They must have the same length. So min and max length have to be the same.
b/french department number can be either 2 or 3 characters. I want to be sure there is not at 1 or more than 3
c/also, if i have a string field with 10 values with min 0 and max 9, I can suppose it worth a look to see if I can transform it as integer
d/Also, about names, in my previous example : there are studies about name length distribution like https://www.researchgate.net/figure/First-names-and-last-names-lengths-distributions_fig1_328894441 and my average length is really different (like 4 or 10, I may have some issues).
-%of null,
-field type
Best regards,
Simon
The text was updated successfully, but these errors were encountered: