Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sum() and cum_sum() do not cast to UInt64 when data overflows UInt32 #17340

Open
2 tasks done
beazerj opened this issue Jul 1, 2024 · 4 comments
Open
2 tasks done

sum() and cum_sum() do not cast to UInt64 when data overflows UInt32 #17340

beazerj opened this issue Jul 1, 2024 · 4 comments
Labels
invalid A bug report that is not actually a bug python Related to Python Polars

Comments

@beazerj
Copy link

beazerj commented Jul 1, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

data32 = pl.from_dict({"counts":[100000000]*100}, schema = {"counts": pl.UInt32})
data64 = pl.from_dict({"counts":[100000000]*100}, schema = {"counts": pl.UInt64})

assert data32.select(pl.col("counts").sum()).item() == data64.select(pl.col("counts").sum()).item()

assert data32.select(pl.col("counts").cum_sum().max()).item() == data64.select(pl.col("counts").cum_sum().max()).item()

Log output

No response

Issue description

When a column has dtype pl.UInt32 and either sum or cum_sum is called, polars ignores any cases of overflow instead of casting to pl.UInt64 causing the incorrect result to be silently returned.

Expected behavior

Polars should either cast the data to UInt64 and return the correct value or raise an error telling the user to first cast to UInt64. In my case, after several data transformations, i ended up with a large dataframe with a column of small values that was correctly inferred by polars to be UInt32. Because of the large number of rows, the sum overflowed UInt32 but i could not have known this would have happened beforehand.

Installed versions

--------Version info---------
Polars:               1.0.0
Index type:           UInt32
Platform:             Linux-6.1.0-22-cloud-amd64-x86_64-with-glibc2.36
Python:               3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.5.0
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           3.9.0
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             2.6.3
pyiceberg:            <not installed>
sqlalchemy:           2.0.28
torch:                2.3.0+cu121
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@beazerj beazerj added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jul 1, 2024
@s-banach
Copy link
Contributor

s-banach commented Jul 1, 2024

No opinion on your issue, just chiming in with what may be a clarification.

  • The output dtypes must be a function of the input dtypes. In this case, I guess you're proposing casting u32 to u64 before summing. I'm sure the developers must have considered it, since they chose to upcast all the 8 and 16 bit ints to i64 before summing, but not to do the same for 32 bit ints.
  • We don't always have larger types to cast up to, so operations like sum will have to overflow sometimes. In the general case, you're asking why not raise an error instead of overflowing. I'm too ignorant to answer that, I do find the question interesting.

@Julian-J-S
Copy link
Contributor

polars is a data transformation/manipulation library with a string focus on performance.
Every check will cost a lot of performance.
Also polars cannot know every use case or workflow and therefore cannot make assumptions when something should or should not be cast or warned.
I personally dislike it if a library like polars changes the data types I indentionally set. I want all changes to be explicit by the programmer.
And sometimes overflow might actually be the desired behaviour.

imo to a certain point this lies within the responsibility of us as data people to know our data, data types and ranges of our expected outputs 🤔 😃

Not sure if this is the case but if polars is inconsistent and sometimes cast and sometimes not, this inconsistency would be unfortunate.

@ritchie46
Copy link
Member

This is intended behavior. Polars, like other compute libraries like numpy does not check for overflow in arithmetic and other performance sensitive operations.

Separately schemas are determined by the operations and not by the data. For very summing very small integers, we cast them to a large dtype, but of uint32 we keep the same type as overflow isn't likely.

It means that Polars is not allowed to cast on overflow. Data types may not change by data, but are strictly dependent on the operation itself. They can be resolved statically. The only exceptions are pivot, and inference operations like csv, json parsing.

@ritchie46 ritchie46 added invalid A bug report that is not actually a bug and removed bug Something isn't working needs triage Awaiting prioritization by a maintainer labels Jul 3, 2024
@Kikkon
Copy link

Kikkon commented Nov 1, 2024

I think this issue: #1941 is related to our current discussion. Could we consider converting UInt32 to UInt64? Alternatively, we could offer this as an optional configuration, as some users might be open to a bit of runtime checking overhead to avoid more hidden overflow issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
invalid A bug report that is not actually a bug python Related to Python Polars
Projects
None yet
Development

No branches or pull requests

5 participants