feat: Add CumCountWhere() and RollingCountWhere() features to UpdateBy #6566

Open
wants to merge 8 commits into
base: main

Conversation

lbooker42
Contributor

@lbooker42 lbooker42 commented Jan 15, 2025

Groovy Examples

table = emptyTable(1000).update("key=randomInt(0,10)", "intCol=randomInt(0,1000)")

// zero-key
t_summary = table.updateBy([
    CumCountWhere("running_gt_500", "intCol > 500"),
    RollingCountWhere(50, "windowed_gt_500", "intCol > 500"),
    ])

// bucketed
t_summary = table.updateBy([
    CumCountWhere("running_gt_500", "intCol > 500"),
    RollingCountWhere(50, "windowed_gt_500", "intCol > 500"),
    ], "key")

Python Examples

from deephaven import empty_table
from deephaven.updateby import cum_count_where, rolling_count_where_tick

table = empty_table(1000).update(["key=randomInt(0,10)", "intCol=randomInt(0,1000)"])

# zero-key
t_summary = table.update_by([
    cum_count_where(col="running_gt_500", filters="intCol > 500"),
    rolling_count_where_tick(rev_ticks=50, col="windowed_gt_500", filters="intCol > 500"),
    ])

# bucketed
t_summary_bucketed = table.update_by([
    cum_count_where(col="running_gt_500", filters="intCol > 500"),
    rolling_count_where_tick(rev_ticks=50, col="windowed_gt_500", filters="intCol > 500"),
    ], by="key")
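
For readers without a running server, the two operations can be approximated in plain Python. This is a minimal sketch of the intended semantics and not the engine's implementation; the function names `cum_count_where_sketch` and `rolling_count_where_sketch` are hypothetical. It also assumes that `rev_ticks` counts the current row as part of the window, as in the existing rolling UpdateBy operations.

```python
from collections import deque
from typing import Callable, Iterable, List


def cum_count_where_sketch(values: Iterable[int],
                           predicate: Callable[[int], bool]) -> List[int]:
    """Running count of the rows seen so far that pass the predicate."""
    out: List[int] = []
    running = 0
    for v in values:
        if predicate(v):
            running += 1
        out.append(running)
    return out


def rolling_count_where_sketch(values: Iterable[int],
                               predicate: Callable[[int], bool],
                               rev_ticks: int) -> List[int]:
    """Count of passing rows in a look-behind window of rev_ticks rows.

    Assumption: the window includes the current row, so rev_ticks=2
    covers the current row and the one before it.
    """
    out: List[int] = []
    window: deque = deque()  # 1/0 flags for the rows currently in the window
    passing = 0
    for v in values:
        flag = int(predicate(v))
        window.append(flag)
        passing += flag
        if len(window) > rev_ticks:
            passing -= window.popleft()  # oldest row leaves the window
        out.append(passing)
    return out


sample = [600, 100, 700, 800, 200]
running_gt_500 = cum_count_where_sketch(sample, lambda v: v > 500)
windowed_gt_500 = rolling_count_where_sketch(sample, lambda v: v > 500, rev_ticks=2)
```

The bucketed variants apply the same logic independently to the rows of each `key` value.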

Performance Notes

TL;DR: Performance compares very well.

RollingCountWhere() has near-identical performance to the comparison benchmarks (and can be faster, depending on the complexity of the filter). CumCountWhere() also compares well to Ema() but can't catch up to zero-key CumSum(), which is remarkably fast.

Comparing CumCountWhere to CumSum and Ema:

120000000
avg of 2

ZeroKey
CumSum	137.36250
Ema	449.5528125
CumCountWhereConstant	475.9980005
CumCountWhereMatch	649.9689995
CumCountWhereRange	654.322250
CumCountWhereMultiple	695.4477915
CumCountWhereMultipleOr	704.900583

Bucketed - 250 buckets
CumSum	2979.1730005
Ema	3024.152458
CumCountWhereConstant	2569.7280835
CumCountWhereMatch	3031.6534795
CumCountWhereRange	3030.5433335
CumCountWhereMultiple	3052.597625
CumCountWhereMultipleOr	3059.911729

Bucketed - 640 buckets
CumSum	3827.299833
Ema	3880.2538125
CumCountWhereConstant	3416.4387715
CumCountWhereMatch	3906.691333
CumCountWhereRange	3902.3064375
CumCountWhereMultiple	3967.1584795
CumCountWhereMultipleOr	3925.0775205

Comparing RollingCountWhere to RollingCount and RollingSum:

120000000
avg of 2

ZeroKey
RollingCount	1511.7957295
RollingSum	1513.6013545
RollingCountWhereConstant	1403.2817915
RollingCountWhereMatch	1453.9323125
RollingCountWhereRange	1764.2137915
RollingCountWhereMultiple	1576.4896255
RollingCountWhereMultipleOr	1541.5631455

Bucketed - 250 buckets
RollingCount	3468.7696665
RollingSum	3326.047792
RollingCountWhereConstant	2858.677771
RollingCountWhereMatch	3327.958604
RollingCountWhereRange	3347.961083
RollingCountWhereMultiple	3429.413562
RollingCountWhereMultipleOr	3364.244104

Bucketed - 640 buckets
RollingCount	4310.4265835
RollingSum	4286.427479
RollingCountWhereConstant	3869.1892705
RollingCountWhereMatch	4333.8479375
RollingCountWhereRange	4269.3454375
RollingCountWhereMultiple	4290.0618545
RollingCountWhereMultipleOr	4346.8478535
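
To make the comparisons easier to read, the zero-key timings can be expressed relative to a baseline. The sketch below copies the zero-key numbers from the tables above; the helper `relative_to` is hypothetical. The tables do not state units, and the sketch assumes lower values mean faster runs, which the summary implies but does not say outright.

```python
from typing import Dict

# Zero-key timings copied verbatim from the tables above (units as reported there).
zero_key_cumulative: Dict[str, float] = {
    "CumSum": 137.36250,
    "Ema": 449.5528125,
    "CumCountWhereConstant": 475.9980005,
    "CumCountWhereMatch": 649.9689995,
    "CumCountWhereRange": 654.322250,
    "CumCountWhereMultiple": 695.4477915,
    "CumCountWhereMultipleOr": 704.900583,
}

zero_key_rolling: Dict[str, float] = {
    "RollingCount": 1511.7957295,
    "RollingSum": 1513.6013545,
    "RollingCountWhereConstant": 1403.2817915,
    "RollingCountWhereMatch": 1453.9323125,
    "RollingCountWhereRange": 1764.2137915,
    "RollingCountWhereMultiple": 1576.4896255,
    "RollingCountWhereMultipleOr": 1541.5631455,
}


def relative_to(timings: Dict[str, float], baseline: str) -> Dict[str, float]:
    """Return each timing divided by the baseline timing, rounded to 2 places."""
    base = timings[baseline]
    return {name: round(t / base, 2) for name, t in timings.items()}


vs_ema = relative_to(zero_key_cumulative, "Ema")
vs_cum_sum = relative_to(zero_key_cumulative, "CumSum")
vs_rolling_count = relative_to(zero_key_rolling, "RollingCount")

for name, ratio in vs_rolling_count.items():
    print(f"{name}: {ratio}x RollingCount")
```

A ratio below 1.0 means the operation recorded a smaller number than the baseline in that table.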

@lbooker42 lbooker42 self-assigned this Jan 15, 2025
@lbooker42 lbooker42 added this to the 0.38.0 milestone Jan 15, 2025
Contributor Author

@lbooker42 lbooker42 left a comment

Self-review

@@ -341,6 +341,21 @@ public void testAggCountWhere() {
assertEquals(6L, counts.get(0));
counts = ColumnVectors.ofLong(doubleCounted, "filter15");
assertEquals(6L, counts.get(0));

// Get a static set table for use in dynamic where filters (contains 0-3)
final QueryTable setTable = (QueryTable) TableTools.newTable(col("sym", 1, 2, 3));

Contributor Author

Noticed that AggCountWhere didn't test DynamicWhereFilter in CI, corrected here.

py/server/deephaven/updateby.py (outdated, resolved)
@@ -230,6 +230,7 @@ static Count AggCount(String resultColumn) {
* values that pass the supplied {@code filters}.
*
* @param resultColumn The {@link Count#column() output column} name
* @param filters The filters to apply to the input columns

Contributor Author

Corrected missing param

@lbooker42 lbooker42 marked this pull request as ready for review January 15, 2025 21:05
Contributor

@cpwright cpwright left a comment

What does code coverage look like?

Chunk<? extends Values>[] influencerValueChunkArr,
LongChunk<OrderedRowKeys> affectedPosChunk,
LongChunk<OrderedRowKeys> influencerPosChunk,
IntChunk<? extends Values> pushChunk,
IntChunk<? extends Values> popChunk,
int len);
int affectedCount,
int influencerCount);

Contributor

The method parameters should have javadoc.

message UpdateByRollingCountWhere {
UpdateByWindowScale reverse_window_scale = 1;
UpdateByWindowScale forward_window_scale = 2;
string column_name = 3;

Contributor

What column name is this for?

}

message UpdateByCumulativeCountWhere {
string column_name = 3;

Contributor

Using field numbers 1 and 2 instead of 3 and 4 for the first fields would be consistent.

Please add comments to each of the messages and fields you are adding; it makes the protodoc better going forward.

@@ -665,7 +665,8 @@ def agg_all_by(self, agg: Aggregation, by: Union[str, List[str]]) -> Table:
"""
return super(Table, self).agg_all_by(agg, by)

def update_by(self, ops: Union[UpdateByOperation, List[UpdateByOperation]], by: Union[str, List[str]]) -> Table:
def update_by(self, ops: Union[UpdateByOperation, List[UpdateByOperation]],

Contributor

Do we have coverage of the None case?
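
One way to picture the case being asked about: assuming the new signature makes `by` optional with a default of `None` (the diff line above is truncated, so this is an assumption, although the zero-key Python example in the description calls `update_by` without `by`), the argument has to be normalized before it reaches the engine. The helper `normalize_by` below is hypothetical and only illustrates the three input shapes a test would need to cover.

```python
from typing import List, Optional, Union


def normalize_by(by: Optional[Union[str, List[str]]] = None) -> List[str]:
    """Hypothetical helper: map None, a single name, or a list to a list of names."""
    if by is None:
        return []          # zero-key: no grouping columns
    if isinstance(by, str):
        return [by]        # single key column
    return list(by)        # already a sequence of key columns


# The three shapes a coverage test would exercise.
cases = {
    "none": normalize_by(),
    "single": normalize_by("key"),
    "multiple": normalize_by(["key", "sym"]),
}
```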

Comment on lines +198 to +200
Returns:
an UpdateByOperation


Contributor

Duplicated



def rolling_count_where_time(ts_col: str, col: str, filters: Union[str, List[str]],
rev_time: Union[int, str] = 0, fwd_time: Union[int, str] = 0) -> UpdateByOperation:

Contributor

The default of zero does not match formula.

filters (Union[str, Filter, List[str], List[Filter]], optional): the filter condition
expression(s) or Filter object(s)
rev_ticks (int): the look-behind window size (in rows/ticks)

Contributor

The default matches formula, but is not listed as a default in the doc.

What is the intention for = 0; = 0 to do?
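
For context on the question, the sketch below works out which rows a tick window covers. It assumes `rolling_count_where_tick` follows the same convention as the existing tick-based rolling operations, where `rev_ticks` includes the current row and `fwd_ticks` counts rows strictly after it; the PR itself does not confirm this. The function `tick_window` is hypothetical. Under that assumption, zero for both yields an empty window.

```python
def tick_window(row: int, rev_ticks: int, fwd_ticks: int, n_rows: int) -> range:
    """Row positions in the window for `row`, clipped to the table bounds.

    Assumed convention: rev_ticks=1, fwd_ticks=0 covers only the current row.
    """
    start = max(0, row - rev_ticks + 1)
    end = min(n_rows - 1, row + fwd_ticks)  # inclusive upper bound
    return range(start, end + 1)            # empty when start > end


both_zero = list(tick_window(5, rev_ticks=0, fwd_ticks=0, n_rows=10))
current_only = list(tick_window(5, rev_ticks=1, fwd_ticks=0, n_rows=10))
look_ahead = list(tick_window(5, rev_ticks=0, fwd_ticks=2, n_rows=10))
```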

resultsChunk);
}
continue;
}

Contributor

We might just want an else instead of the continue logic.

TestHelper.assertWhereInt(actualIt, expectedIt, val -> val > 10 && val <= 50);
}

// Test on String column (representing all Object)

Contributor

Simple tests for Instant and Boolean are worthwhile in my experience; there is stuff that can go wrong with reinterpretation. This may not be necessary here, though, because we might be covered by the where testing. I go back and forth on this, because we do need to create fake tables in some circumstances, or we could have problems with the ChunkFilter not matching what the aggs read.
