Rolling mean operation is ignored after diff #21038

Open
2 tasks done
danalizieors opened this issue Feb 1, 2025 · 6 comments
Labels
closing-candidate (PR's/issue candidate for closing), python (Related to Python Polars)

Comments

@danalizieors

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

# %%
import polars as pl

df = pl.DataFrame({"x": range(2, 21)}).with_row_index()

lag = 3

log_diff = pl.col("x").log().diff(lag)
def rolling_mean(column):
    return column.mean().rolling("index", period=f"{lag}i", closed="both")

r = df.with_columns(
    log_diff.alias("ld"),
    log_diff.pipe(rolling_mean).alias("r1"),
).with_columns(pl.col("ld").pipe(rolling_mean).alias("r2"))
r

Log output

┌───────┬─────┬──────────┬──────────┬──────────┐
│ index ┆ x   ┆ ld       ┆ r1       ┆ r2       │
│ ---   ┆ --- ┆ ---      ┆ ---      ┆ ---      │
│ u32   ┆ i64 ┆ f64      ┆ f64      ┆ f64      │
╞═══════╪═════╪══════════╪══════════╪══════════╡
│ 0     ┆ 2   ┆ null     ┆ null     ┆ null     │
│ 1     ┆ 3   ┆ null     ┆ null     ┆ null     │
│ 2     ┆ 4   ┆ null     ┆ null     ┆ null     │
│ 3     ┆ 5   ┆ 0.916291 ┆ 0.916291 ┆ 0.916291 │
│ 4     ┆ 6   ┆ 0.693147 ┆ 0.693147 ┆ 0.804719 │
│ …     ┆ …   ┆ …        ┆ …        ┆ …        │
│ 14    ┆ 16  ┆ 0.207639 ┆ 0.207639 ┆ 0.233577 │
│ 15    ┆ 17  ┆ 0.194156 ┆ 0.194156 ┆ 0.216525 │
│ 16    ┆ 18  ┆ 0.182322 ┆ 0.182322 ┆ 0.201815 │
│ 17    ┆ 19  ┆ 0.17185  ┆ 0.17185  ┆ 0.188992 │
│ 18    ┆ 20  ┆ 0.162519 ┆ 0.162519 ┆ 0.177712 │
└───────┴─────┴──────────┴──────────┴──────────┘

Issue description

In the provided example the columns r1 and r2 should be equal, since the underlying calculations are the same, but expressed in two different ways:

  • r1 is calculated directly
  • r2 is calculated through an intermediate result aliased as ld

I suppose there is an error in how Polars optimizes the operations internally.

After removing the diff operation, the bug can no longer be reproduced: the two columns are equal.

Expected behavior

There should not be any difference between the two columns, since the underlying calculations are the same.

Installed versions

--------Version info---------
Polars:              1.21.0
Index type:          UInt32
Platform:            Linux-6.8.7-arch1-2-x86_64-with-glibc2.39
Python:              3.13.1 (main, Dec 19 2024, 14:32:25) [Clang 18.1.8 ]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          3.1.1
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.12.0
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           3.10.0
numpy                2.2.2
openpyxl             <not installed>
pandas               2.2.3
pyarrow              19.0.0
pydantic             <not installed>
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                2.6.0+cu124
xlsx2csv             <not installed>
xlsxwriter           <not installed>
danalizieors added the bug, needs triage, and python labels on Feb 1, 2025
@MarcoGorelli
Collaborator

MarcoGorelli commented Feb 1, 2025

Hi @danalizieors

Confusingly enough, this is actually expected.

When you write pl.col('x').log().diff().mean().rolling('index', period='3i'), this is equivalent to:

df.rolling('index', period='3i').agg(pl.col('x').log().diff().mean())

That is to say - the whole expression pl.col('x').log().diff().mean() is computed in the rolling context, not just the .mean() part!

Let's break that down:

In [3]: df.rolling('index', period='3i').agg(pl.col('x'))
Out[3]:
shape: (19, 2)
┌───────┬──────────────┐
│ index ┆ x            │
│ ---   ┆ ---          │
│ u32   ┆ list[i64]    │
╞═══════╪══════════════╡
│ 0     ┆ [2]          │
│ 1     ┆ [2, 3]       │
│ 2     ┆ [2, 3, 4]    │
│ 3     ┆ [3, 4, 5]    │
│ 4     ┆ [4, 5, 6]    │
│ …     ┆ …            │
│ 14    ┆ [14, 15, 16] │
│ 15    ┆ [15, 16, 17] │
│ 16    ┆ [16, 17, 18] │
│ 17    ┆ [17, 18, 19] │
│ 18    ┆ [18, 19, 20] │
└───────┴──────────────┘

In [4]: df.rolling('index', period='3i').agg(pl.col('x').log())
Out[4]:
shape: (19, 2)
┌───────┬────────────────────────────────┐
│ index ┆ x                              │
│ ---   ┆ ---                            │
│ u32   ┆ list[f64]                      │
╞═══════╪════════════════════════════════╡
│ 0     ┆ [0.693147]                     │
│ 1     ┆ [0.693147, 1.098612]           │
│ 2     ┆ [0.693147, 1.098612, 1.386294] │
│ 3     ┆ [1.098612, 1.386294, 1.609438] │
│ 4     ┆ [1.386294, 1.609438, 1.791759] │
│ …     ┆ …                              │
│ 14    ┆ [2.639057, 2.70805, 2.772589]  │
│ 15    ┆ [2.70805, 2.772589, 2.833213]  │
│ 16    ┆ [2.772589, 2.833213, 2.890372] │
│ 17    ┆ [2.833213, 2.890372, 2.944439] │
│ 18    ┆ [2.890372, 2.944439, 2.995732] │
└───────┴────────────────────────────────┘

In [5]: df.rolling('index', period='3i').agg(pl.col('x').log().diff())
Out[5]:
shape: (19, 2)
┌───────┬────────────────────────────┐
│ index ┆ x                          │
│ ---   ┆ ---                        │
│ u32   ┆ list[f64]                  │
╞═══════╪════════════════════════════╡
│ 0     ┆ [null]                     │
│ 1     ┆ [null, 0.405465]           │
│ 2     ┆ [null, 0.405465, 0.287682] │
│ 3     ┆ [null, 0.287682, 0.223144] │
│ 4     ┆ [null, 0.223144, 0.182322] │
│ …     ┆ …                          │
│ 14    ┆ [null, 0.068993, 0.064539] │
│ 15    ┆ [null, 0.064539, 0.060625] │
│ 16    ┆ [null, 0.060625, 0.057158] │
│ 17    ┆ [null, 0.057158, 0.054067] │
│ 18    ┆ [null, 0.054067, 0.051293] │
└───────┴────────────────────────────┘

In [6]: df.rolling('index', period='3i').agg(pl.col('x').log().diff().mean())
Out[6]:
shape: (19, 2)
┌───────┬──────────┐
│ index ┆ x        │
│ ---   ┆ ---      │
│ u32   ┆ f64      │
╞═══════╪══════════╡
│ 0     ┆ null     │
│ 1     ┆ 0.405465 │
│ 2     ┆ 0.346574 │
│ 3     ┆ 0.255413 │
│ 4     ┆ 0.202733 │
│ …     ┆ …        │
│ 14    ┆ 0.066766 │
│ 15    ┆ 0.062582 │
│ 16    ┆ 0.058892 │
│ 17    ┆ 0.055613 │
│ 18    ┆ 0.05268  │
└───────┴──────────┘

I appreciate that this is confusing, but I'd say that it is expected behaviour
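
To make the two evaluation orders concrete, here is a minimal plain-Python sketch that does not use Polars at all. The helpers `diff`, `mean`, and `windows` are hypothetical stand-ins written for this illustration, and the window is assumed to hold the current row plus the two rows before it, as in the `period='3i'` outputs above.

```python
import math

def diff(values, n=1):
    # Element-wise difference with lag n; positions without a partner become None.
    return [
        None if i < n or values[i] is None or values[i - n] is None
        else values[i] - values[i - n]
        for i in range(len(values))
    ]

def mean(values):
    # Mean that skips None, mirroring how the tables above ignore nulls.
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

def windows(values, size):
    # Trailing windows: each row sees itself and up to size - 1 earlier rows.
    return [values[max(0, i - size + 1): i + 1] for i in range(len(values))]

logs = [math.log(v) for v in range(2, 21)]

# Whole expression evaluated per window (the behaviour described above).
inside = [mean(diff(w)) for w in windows(logs, 3)]

# diff evaluated once on the full column, then only the mean is windowed.
outside = [mean(w) for w in windows(diff(logs), 3)]

print(inside[:5])
print(outside[:5])
```

In this model, `inside` follows the last table above (null, 0.405465, 0.346574, ...), while `outside` averages three consecutive differences once enough rows are available, so the two lists diverge from index 3 onwards.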

MarcoGorelli added the closing-candidate label and removed the bug and needs triage labels on Feb 1, 2025
@danalizieors
Author

By changing the code a bit, I found a more interpretable reproduction. I removed the mean operation to see which values end up in the rolling window.

lag = 2
def rolling_mean(column):
    return column.rolling("index", period=f"{lag}i", closed="both")
┌───────┬─────┬──────────┬────────────────────────┬────────────────────────────────┐
│ index ┆ x   ┆ ld       ┆ r1                     ┆ r2                             │
│ ---   ┆ --- ┆ ---      ┆ ---                    ┆ ---                            │
│ u32   ┆ i64 ┆ f64      ┆ list[f64]              ┆ list[f64]                      │
╞═══════╪═════╪══════════╪════════════════════════╪════════════════════════════════╡
│ 0     ┆ 2   ┆ null     ┆ [null]                 ┆ [null]                         │
│ 1     ┆ 3   ┆ null     ┆ [null, null]           ┆ [null, null]                   │
│ 2     ┆ 4   ┆ 0.693147 ┆ [null, null, 0.693147] ┆ [null, null, 0.693147]         │
│ 3     ┆ 5   ┆ 0.510826 ┆ [null, null, 0.510826] ┆ [null, 0.693147, 0.510826]     │
│ 4     ┆ 6   ┆ 0.405465 ┆ [null, null, 0.405465] ┆ [0.693147, 0.510826, 0.405465] │
│ …     ┆ …   ┆ …        ┆ …                      ┆ …                              │
│ 14    ┆ 16  ┆ 0.133531 ┆ [null, null, 0.133531] ┆ [0.154151, 0.143101, 0.133531] │
│ 15    ┆ 17  ┆ 0.125163 ┆ [null, null, 0.125163] ┆ [0.143101, 0.133531, 0.125163] │
│ 16    ┆ 18  ┆ 0.117783 ┆ [null, null, 0.117783] ┆ [0.133531, 0.125163, 0.117783] │
│ 17    ┆ 19  ┆ 0.111226 ┆ [null, null, 0.111226] ┆ [0.125163, 0.117783, 0.111226] │
│ 18    ┆ 20  ┆ 0.105361 ┆ [null, null, 0.105361] ┆ [0.117783, 0.111226, 0.105361] │
└───────┴─────┴──────────┴────────────────────────┴────────────────────────────────┘

For some reason, in the r1 column the first two elements in the list end up being null.

@MarcoGorelli
Collaborator

Because, within each window, they don't have any previous elements to take a diff with.
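
As a rough check of that explanation, the plain-Python sketch below rebuilds the two lists shown for index 4 in the table above. It is not Polars code: `diff` and `window` are hypothetical helpers, and the window size of three rows is an assumption based on `period="2i"` with `closed="both"` producing three-element lists in that table.

```python
import math

LAG = 2
SIZE = LAG + 1  # assumed: period "2i" with closed="both" spans three rows

logs = [math.log(v) for v in range(2, 21)]

def diff(values, n):
    # Lagged difference; the first n positions have no partner and become None.
    return [
        None if i < n or values[i] is None or values[i - n] is None
        else values[i] - values[i - n]
        for i in range(len(values))
    ]

def window(values, i, size):
    # Rows visible to row i: itself and up to size - 1 earlier rows.
    return values[max(0, i - size + 1): i + 1]

# r1: the window is cut first, then diff runs on just those three values.
r1_at_4 = diff(window(logs, 4, SIZE), LAG)

# r2: diff runs on the full column first, then the window is cut.
r2_at_4 = window(diff(logs, LAG), 4, SIZE)

print(r1_at_4)
print(r2_at_4)
```

With only three values in the window and a lag of two, just the last position has a partner, which is why every r1 list has the form [null, null, value].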

@danalizieors
Author

Oh, okay, I get it now! Thank you for the answer!

I have no other input on this issue. Maybe some part of the documentation should mention this quirk.

@danalizieors
Author

Actually, I may have a follow-up question.

Can this rolling function behave the way I intended in the example at the column level, or should I write it as two separate dataframe-level operations? It would be nice if this were possible without creating intermediate "variables".

@danalizieors
Author

danalizieors commented Feb 1, 2025

I kind of found a hack, but better solutions are welcome!

I multiplied the lag in the rolling window period by 2, so that each window has enough rows for the lagged diff to be executed.

def rolling_mean_hack(column):
    return column.mean().rolling("index", period=f"{2*lag}i", closed="both")

r = df.with_columns(
    log_diff.alias("ld"),
    log_diff.pipe(rolling_mean).alias("r1"),
    log_diff.pipe(rolling_mean_hack).alias("r3"),
).with_columns(
    rolling_mean(pl.col("ld")).alias("r2"),
)
r
┌───────┬─────┬──────────┬──────────┬──────────┬──────────┐
│ index ┆ x   ┆ ld       ┆ r1       ┆ r3       ┆ r2       │
│ ---   ┆ --- ┆ ---      ┆ ---      ┆ ---      ┆ ---      │
│ u32   ┆ i64 ┆ f64      ┆ f64      ┆ f64      ┆ f64      │
╞═══════╪═════╪══════════╪══════════╪══════════╪══════════╡
│ 0     ┆ 2   ┆ null     ┆ null     ┆ null     ┆ null     │
│ 1     ┆ 3   ┆ null     ┆ null     ┆ null     ┆ null     │
│ 2     ┆ 4   ┆ null     ┆ null     ┆ null     ┆ null     │
│ 3     ┆ 5   ┆ null     ┆ null     ┆ null     ┆ null     │
│ 4     ┆ 6   ┆ 1.098612 ┆ 1.098612 ┆ 1.098612 ┆ 1.098612 │
│ …     ┆ …   ┆ …        ┆ …        ┆ …        ┆ …        │
│ 14    ┆ 16  ┆ 0.287682 ┆ 0.287682 ┆ 0.3415   ┆ 0.3415   │
│ 15    ┆ 17  ┆ 0.268264 ┆ 0.268264 ┆ 0.31406  ┆ 0.31406  │
│ 16    ┆ 18  ┆ 0.251314 ┆ 0.251314 ┆ 0.290778 ┆ 0.290778 │
│ 17    ┆ 19  ┆ 0.236389 ┆ 0.236389 ┆ 0.270761 ┆ 0.270761 │
│ 18    ┆ 20  ┆ 0.223144 ┆ 0.223144 ┆ 0.253359 ┆ 0.253359 │
└───────┴─────┴──────────┴──────────┴──────────┴──────────┘
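
One way to see why doubling the period lines r3 up with r2 is to count how many values survive the lagged diff inside a full window. The sketch below is plain arithmetic rather than Polars code: `rows_in_window` and `non_null_after_diff` are hypothetical helpers, and it assumes an integer period of `p` with `closed="both"` covers `p + 1` rows, as the three-element lists for `period="2i"` earlier in the thread suggest.

```python
def rows_in_window(period, closed_both=True):
    # Assumed: an integer period p with closed="both" covers p + 1 rows.
    return period + 1 if closed_both else period

def non_null_after_diff(window_rows, lag):
    # diff(lag) leaves the first `lag` rows of a window without a partner.
    return max(0, window_rows - lag)

for lag in (2, 3, 4):
    plain = non_null_after_diff(rows_in_window(lag), lag)
    doubled = non_null_after_diff(rows_in_window(2 * lag), lag)
    print(lag, plain, doubled)
```

Under these assumptions the plain window always leaves a single value, so its mean is just the current ld value (r1 equals ld), while the doubled window leaves `lag + 1` values, the same count that r2 averages over. This only describes full windows away from the start of the frame.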
