Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scipy.stats.chisquare throws error in compare_constraints.py due to unequal sum of frequencies #75

Open
doorleyr opened this issue Aug 9, 2021 · 7 comments

Comments

@doorleyr
Copy link

doorleyr commented Aug 9, 2021

Hi, thanks for the great library. I'm trying to use this to generate some populations and I'm consistently getting the same error, even when I run the examples in this repository.

"ValueError: For each axis slice, the sum of the observed frequencies must agree with the sum of the expected frequencies to a relative tolerance of 1e-08, but the percent differences are:" ...

This error is coming from scipy.stats.chisquare which is called by the compare_to_constraints function. It appears that the sum of constraints.values is not always equal to the sum of counts.values (within the required tolerance) and this causes the error.

@timothywhiteley
Copy link

Hi doorleyr,
I came across the same issue. I believe that there has been a change to scipy (no error given on scipy 1.4.1).
I bodged my code by changing synthpop/draw.py Line 167 to:
return chisquare(counts.values, constraints.values*np.mean(counts.values)/np.mean(constraints.values))
so that the frequencies is the same but I think there must be a better solution.

@PyMap
Copy link
Collaborator

PyMap commented Nov 10, 2021

Hi all,
As both are pointing out, the scipy.stats.chisquare used here to compare synthetic population against target constraints has been updated.
A new test was included in the chisquare class to check if observed vs expected frecuencies have the same sum or not. And if they don't, then an error is raised.
Following the discussions in the hiperlinked PR above, it seems that chisquare test is valid only in the case that both frequency vectors sum to the same total.

The workaround proporsed by @timothywhiteley seems to be a good solution to achieve that requirement. I also tried by normalizing both vectors to still have distributions that do not match but that sum up to 1 (the same total).

Below the results:

/home/federico/federico/urbansim/spop/synthpop/synthpop/ipu/ipu.py:192: RuntimeWarning: divide by zero encountered in double_scalars
  adj = constraint / float((column * weights).sum())
Drawing 765 households
> /home/federico/federico/urbansim/spop/synthpop/synthpop/draw.py(171)compare_to_constraints()
-> return chisquare(counts.values, constraints.values)
(Pdb) chi # original chisquare test ran with previous scipy version (brefore the new test)
Power_divergenceResult(statistic=116.96711825518415, pvalue=1.5273460185773555e-06)
(Pdb) norm_chi # chisquare test with normalized vectors
Power_divergenceResult(statistic=0.04170304232710738, pvalue=1.0)
(Pdb) chisquare(counts.values, constraints.values*np.mean(counts.values)/np.mean(constraints.values)) # timothy workaround
Power_divergenceResult(statistic=118.01960978571391, pvalue=1.1317109362713438e-06)

As expected, normalizing by the constraints mean is closer to the result obtained before the new scipy test. I think we can still compare results by checking the resultant amount of persons by category trowed by the synthesizer.
(cc @msoltadeo @alephcero @janowicz )

@bhaveshneekhra
Copy link

Is it resolved yet?
@PyMap and @timothywhiteley the workaround does not seem to work as on Sat, 14 Jan, 2023. (I remember it worked for me on 18th Oct, 2022). Did something change in between?

@werdnabae
Copy link

Is it resolved yet? @PyMap and @timothywhiteley the workaround does not seem to work as on Sat, 14 Jan, 2023. (I remember it worked for me on 18th Oct, 2022). Did something change in between?

The workaround given by @timothywhiteley is still working for me

@szalaimd
Copy link

I do not understand why is it required to have the expected and observed counts to be the same. The test statistic can be calculated without this restriction ( sum{(observed_i-expected_i)^2/expected_i} ), the distribution table is given, I can not see any reason for this.

PeterDavidson added a commit to TransPositionAust/synthpop that referenced this issue Jun 5, 2023
@bbarrios6
Copy link

Work around not working Dec 21 2023, any solutions?

@bhaveshneekhra
Copy link

Hi doorleyr, I came across the same issue. I believe that there has been a change to scipy (no error given on scipy 1.4.1). I bodged my code by changing synthpop/draw.py Line 167 to: return chisquare(counts.values, constraints.values*np.mean(counts.values)/np.mean(constraints.values)) so that the frequencies is the same but I think there must be a better solution.

Worked for me on macos 14.3.1 with python 3.11

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants