sum() and cum_sum() do not cast to UInt64 when data overflows UInt32 #17340

beazerj · 2024-07-01T20:37:15Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

data32 = pl.from_dict({"counts":[100000000]*100}, schema = {"counts": pl.UInt32})
data64 = pl.from_dict({"counts":[100000000]*100}, schema = {"counts": pl.UInt64})

assert data32.select(pl.col("counts").sum()).item() == data64.select(pl.col("counts").sum()).item()

assert data32.select(pl.col("counts").cum_sum().max()).item() == data64.select(pl.col("counts").cum_sum().max()).item()

Log output

No response

Issue description

When a column has dtype pl.UInt32 and either sum or cum_sum is called, polars ignores any cases of overflow instead of casting to pl.UInt64 causing the incorrect result to be silently returned.

Expected behavior

Polars should either cast the data to UInt64 and return the correct value or raise an error telling the user to first cast to UInt64. In my case, after several data transformations, i ended up with a large dataframe with a column of small values that was correctly inferred by polars to be UInt32. Because of the large number of rows, the sum overflowed UInt32 but i could not have known this would have happened beforehand.

Installed versions

--------Version info---------
Polars:               1.0.0
Index type:           UInt32
Platform:             Linux-6.1.0-22-cloud-amd64-x86_64-with-glibc2.36
Python:               3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.5.0
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           3.9.0
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             2.6.3
pyiceberg:            <not installed>
sqlalchemy:           2.0.28
torch:                2.3.0+cu121
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

The text was updated successfully, but these errors were encountered:

s-banach · 2024-07-01T23:48:44Z

No opinion on your issue, just chiming in with what may be a clarification.

The output dtypes must be a function of the input dtypes. In this case, I guess you're proposing casting u32 to u64 before summing. I'm sure the developers must have considered it, since they chose to upcast all the 8 and 16 bit ints to i64 before summing, but not to do the same for 32 bit ints.
We don't always have larger types to cast up to, so operations like sum will have to overflow sometimes. In the general case, you're asking why not raise an error instead of overflowing. I'm too ignorant to answer that, I do find the question interesting.

Julian-J-S · 2024-07-02T08:09:47Z

polars is a data transformation/manipulation library with a string focus on performance.
Every check will cost a lot of performance.
Also polars cannot know every use case or workflow and therefore cannot make assumptions when something should or should not be cast or warned.
I personally dislike it if a library like polars changes the data types I indentionally set. I want all changes to be explicit by the programmer.
And sometimes overflow might actually be the desired behaviour.

imo to a certain point this lies within the responsibility of us as data people to know our data, data types and ranges of our expected outputs 🤔 😃

Not sure if this is the case but if polars is inconsistent and sometimes cast and sometimes not, this inconsistency would be unfortunate.

ritchie46 · 2024-07-03T12:08:37Z

This is intended behavior. Polars, like other compute libraries like numpy does not check for overflow in arithmetic and other performance sensitive operations.

Separately schemas are determined by the operations and not by the data. For very summing very small integers, we cast them to a large dtype, but of uint32 we keep the same type as overflow isn't likely.

It means that Polars is not allowed to cast on overflow. Data types may not change by data, but are strictly dependent on the operation itself. They can be resolved statically. The only exceptions are pivot, and inference operations like csv, json parsing.

Kikkon · 2024-11-01T07:09:31Z

I think this issue: #1941 is related to our current discussion. Could we consider converting UInt32 to UInt64? Alternatively, we could offer this as an optional configuration, as some users might be open to a bit of runtime checking overhead to avoid more hidden overflow issues.

beazerj added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jul 1, 2024

ritchie46 added invalid A bug report that is not actually a bug and removed bug Something isn't working needs triage Awaiting prioritization by a maintainer labels Jul 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sum() and cum_sum() do not cast to UInt64 when data overflows UInt32 #17340

sum() and cum_sum() do not cast to UInt64 when data overflows UInt32 #17340

beazerj commented Jul 1, 2024

s-banach commented Jul 1, 2024

Julian-J-S commented Jul 2, 2024

ritchie46 commented Jul 3, 2024

Kikkon commented Nov 1, 2024

sum() and cum_sum() do not cast to UInt64 when data overflows UInt32 #17340

sum() and cum_sum() do not cast to UInt64 when data overflows UInt32 #17340

Comments

beazerj commented Jul 1, 2024

Checks

Reproducible example

Log output

Issue description

Expected behavior

Installed versions

s-banach commented Jul 1, 2024

Julian-J-S commented Jul 2, 2024

ritchie46 commented Jul 3, 2024

Kikkon commented Nov 1, 2024