Multiple Condition Mapping like numpy.select
#5823
Comments
If you're not aware - you can chain
|
@JulianCologne you don't have to use
import polars as pl
from polars import col, when

df = pl.DataFrame({
    "base_price": [100, 200, 300, 400, 500],
    "discount_level": [0, -1, -5, 2, 1],
})
# polars (when, then, otherwise)
discount_level, base_price = col("discount_level"), col("base_price")
df.with_columns(
    when(discount_level == 0).then(base_price)
    .when(discount_level == 1).then(base_price * 0.9)
    .when(discount_level > 1).then(base_price * 0.8)
    .when(discount_level == -1).then(2)
    .otherwise(-3)
    .alias("discounted_price")
) |
Thanks a lot for your feedback and your ideas!
@cmdlineluser: nice solution and very similar to the numpy/pandas way. What I don't like about it is this strange False/None initialization of the
@mcrumiller: very nice! I did not know you could chain the
I am not so much concerned about performance, but just playing around with this and comparing it to pandas/numpy I found that the
Example:
import numpy as np
import pandas as pd
import polars as pl
from polars import col, when

data = {
    "r": np.random.rand(100_000_000),
}
df_pl = pl.DataFrame(data)
df_pd = pd.DataFrame(data)
%%timeit
df_pl.with_columns(
when(col("r") < 0.5).then(0)
.when(col("r") < 0.75).then(col("r") * 2)
.when(col("r") < 0.9).then(col("r") * 3)
.when(col("r") < 0.95).then(col("r") * 4)
.otherwise(1)
.alias("calc")
);
# >>> 5.5s
%%timeit
condlist = [
df_pd["r"] < 0.5,
df_pd["r"] < 0.75,
df_pd["r"] < 0.9,
df_pd["r"] < 0.95,
]
choicelist = [
0,
df_pd["r"] * 2,
df_pd["r"] * 3,
df_pd["r"] * 4,
]
#df_pd.assign(calc=np.select(condlist, choicelist, default=1));
np.select(condlist, choicelist, default=1)
# >>> 3.3s (numpy+pandas assign)
# >>> 2.5s (numpy only) |
This is because (at least for now) Polars evaluates each then expression over the whole "r" column, not just over the rows where its when condition holds. |
If I am not mistaken, Polars evaluates all the branches of a when/then/otherwise construct in parallel and then throws away the values for rows where the condition is false. This might be the reason why multiple nested conditionals are slower. Curious to see if a single when/then/otherwise is slower than numpy? |
@ghuls can someone explain the logic of doing 1) compute all, 2) filter? Wouldn't simply reversing those two operations significantly speed up the entire process? |
@mcrumiller As all those operations run in parallel, computing everything and filtering afterwards can be faster in wall time, and the calculations can be vectorized. But there are also downsides:
@ritchie46 Couldn't you limit the parallelisation to the when condition:
Then make the masks for each branch in parallel and fill the other values with None? |
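To make the two strategies discussed above concrete, here is a minimal NumPy sketch. It is only an illustration of "compute all branches, then pick" versus "build masks first, then compute each branch on its own rows"; it does not claim to mirror how Polars is implemented, and the function names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.random(1_000)

def compute_all_then_select(r):
    """Evaluate every branch over the full column, then pick per row."""
    conds = [r < 0.5, r < 0.75, r < 0.9, r < 0.95]
    # Each branch is a full-length array, even though most values are discarded.
    branches = [r * 0.0, r * 2.0, r * 3.0, r * 4.0]
    return np.select(conds, branches, default=1.0)

def mask_then_compute(r):
    """Build mutually exclusive masks first, then compute each branch only on its rows."""
    thresholds = [0.5, 0.75, 0.9, 0.95]
    factors = [0.0, 2.0, 3.0, 4.0]
    out = np.full_like(r, 1.0)                 # start with the default value
    unmatched = np.ones(r.shape, dtype=bool)   # rows not claimed by an earlier branch
    for threshold, factor in zip(thresholds, factors):
        mask = unmatched & (r < threshold)
        out[mask] = r[mask] * factor           # work is done on the subset only
        unmatched &= ~mask
    return out

# Both strategies produce the same column; they differ only in how much work they do.
same = np.allclose(compute_all_then_select(r), mask_then_compute(r))
```

The second variant does less arithmetic but needs boolean indexing and a scatter per branch, which is the kind of trade-off raised in the comments above.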
just throwing in some performance comparisons
polars: when/then/otherwise chain VS when/then/otherwise nested
%%timeit
df_pl.with_columns(
when(col("r") < 0.5).then(0)
.when(col("r") < 0.75).then(col("r") * 2)
.when(col("r") < 0.9).then(col("r") * 3)
.when(col("r") < 0.95).then(col("r") * 4)
.otherwise(1)
);
# >>> 5.5s
%%timeit
df_pl.with_columns(
    when(col("r") < 0.5).then(0)
    .otherwise(
        when(col("r") < 0.75).then(col("r") * 2)
        .otherwise(
            when(col("r") < 0.9).then(col("r") * 3)
            .otherwise(
                when(col("r") < 0.95).then(col("r") * 4)
                .otherwise(1)
            )
        )
    )
);
# >>> 5.5s
(single condition) polars when/then/otherwise VS numpy select VS numpy where
%%timeit
df_pl.with_columns(
when(col("r") < 0.5).then(0)
.otherwise(1)
);
# >>> 2.16s
%%timeit
df_pd.assign(
calc=np.select(
[df_pd["r"] < 0.5],
[0],
default=1
)
)
# >>> 1.75s
%%timeit
df_pd.assign(
calc=np.where(df_pd["r"] < 0.5, 0, 1)
);
# >>> 1.34s
|
It should make zero difference: one uses |
Problem description
#5822 would be really great for mapping column values with a discrete/fixed set of possible values.
But additionally there are many cases where it is impossible to know the set of possible values in advance so you need to use a condition.
In the pandas/numpy world this is done with the select function, where you specify a list of conditions and a list of values to return for each condition.
In polars there is a when, then, otherwise chain, which is really great and ergonomic for a single condition but becomes a bit messy when you have multiple conditions.
Examples:
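A minimal sketch of the numpy side, reusing the discount data from the comments (the array names are chosen here for illustration): numpy.select returns, for each row, the choice that belongs to the first condition that is true, and the default where no condition matches.

```python
import numpy as np

base_price = np.array([100, 200, 300, 400, 500])
discount_level = np.array([0, -1, -5, 2, 1])

# Conditions are checked in order; the first true one wins for each row.
condlist = [
    discount_level == 0,
    discount_level == 1,
    discount_level > 1,
    discount_level == -1,
]
# One choice per condition; scalars are broadcast against the arrays.
choicelist = [
    base_price,
    base_price * 0.9,
    base_price * 0.8,
    2,
]

discounted_price = np.select(condlist, choicelist, default=-3)
# values: 100, 2, -3, 320, 450 (as floats)
```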
Do you think it would be possible or a worthy addition to add something like this to polars? =)