stats: std population or sample calculation on 1 value #231

Open
slabasan opened this issue Nov 21, 2024 · 0 comments
Labels
area-stats Issues and PRs related to Thicket's stats subpackage

Comments

@slabasan
Collaborator

Calling th.stats.std(ttk, cols) may aggregate over a single row. In thicket, we are calling .agg(np.std), which calculates the standard deviation with a delta degrees of freedom (ddof) of 1 (pandas applies its own std here; a direct np.std call defaults to ddof=0). In other words, it divides the sum of squared deviations by n-1, where n is the number of elements. This is statistically appropriate for estimating the population standard deviation from a sample. However, with only one element, both the sum of squared deviations and n-1 are 0, so the calculation becomes 0/0 and the result is NaN.
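The behavior can be reproduced outside of thicket. The following is a minimal sketch using plain pandas and NumPy on a one-element Series; the variable names are illustrative and not taken from thicket's code.

```python
import numpy as np
import pandas as pd

# A single observation, as produced when only one row is aggregated.
single = pd.Series([1.0])

# pandas defaults to ddof=1: 0 / (n - 1) = 0 / 0, which gives NaN.
sample_std = single.std()

# With ddof=0 the divisor is n = 1, so the result is 0.0.
population_std = single.std(ddof=0)

# NumPy's own default on a raw array is ddof=0, so it also gives 0.0.
numpy_default = np.std(single.to_numpy())

print(sample_std, population_std, numpy_default)
```

The difference between the first and last value shows that the NaN comes from the ddof=1 default, not from the data itself.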

One alternative is to set ddof=0, which divides by n instead of n-1 and yields a standard deviation of 0 for a single element:

   import pandas as pd
   import numpy as np

   df = pd.DataFrame({'A': [1]})

   # Calculate standard deviation with ddof=0
   result = df.agg(lambda x: np.std(x, ddof=0))
   print(result)

For standard deviation, it may be appropriate to have an option to toggle between population and sample calculation.
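One way such a toggle might look is sketched below. This is not thicket's API: the helper name `std_with_ddof`, its parameters, and the example column names are all hypothetical, and the sketch uses a plain pandas groupby to stand in for the aggregation thicket performs.

```python
import pandas as pd


def std_with_ddof(df, columns, group_key, ddof=1):
    """Hypothetical helper: grouped standard deviation with a selectable ddof.

    ddof=1 gives the sample calculation (divide by n-1),
    ddof=0 gives the population calculation (divide by n).
    """
    if ddof not in (0, 1):
        raise ValueError("ddof must be 0 (population) or 1 (sample)")
    return df.groupby(group_key)[columns].std(ddof=ddof)


# Group "b" has a single row, which is the case described in this issue.
df = pd.DataFrame({"node": ["a", "a", "b"], "time": [1.0, 3.0, 5.0]})

sample = std_with_ddof(df, ["time"], "node", ddof=1)      # "b" is NaN
population = std_with_ddof(df, ["time"], "node", ddof=0)  # "b" is 0.0

print(sample)
print(population)
```

Keeping ddof=1 as the default in this sketch would preserve the current results for existing callers, while ddof=0 avoids NaN for single-row groups.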

@slabasan slabasan added the area-stats Issues and PRs related to Thicket's stats subpackage label Nov 21, 2024