feat(storage): support non_pk_prefix_watermark state cleaning #19889

Open

Li0k wants to merge 27 commits into base: main

Conversation

@Li0k (Contributor) commented on Dec 23, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Related to #18802.

This PR supports non_pk_prefix_watermark state cleaning for Hummock.

Since non_pk_prefix_watermark relies on catalogs, filtering on the read path would introduce additional overhead. Therefore, this PR does not provide read filtering for non_pk_prefix_watermark and only cleans up expired data during compaction.

The changes are as follows:

  1. Watermarks of type non_pk_prefix_watermark are not added to ReadWatermarkIndex.

  2. The state table supports writing and serializing non_pk_prefix_watermark.

  3. The compaction catalog agent supports retrieving the watermark serde.

  4. The skip-watermark iterator supports filtering by non_pk_prefix_watermark: a new NonPkPrefixSkipWatermarkIterator filters table data against non-pk-prefix watermarks (a simplified sketch follows this list).
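Below is a minimal, self-contained sketch of the filtering rule behind a non-pk-prefix skip-watermark iterator, assuming an ascending watermark means values below it are expired. All types here (Datum, NonPkPrefixWatermark, WatermarkDirection) are simplified stand-ins for illustration only, not the actual Hummock/state-table types.

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Datum(i64);

/// Direction of the watermark: values strictly below (ascending) or above (descending)
/// the watermark are considered expired.
enum WatermarkDirection {
    Ascending,
    Descending,
}

struct NonPkPrefixWatermark {
    direction: WatermarkDirection,
    watermark: Datum,
    /// Index of the watermark column inside the (simplified) key.
    watermark_col_idx_in_pk: usize,
}

impl NonPkPrefixWatermark {
    /// Returns true if the row should be dropped by compaction. Unlike the pk-prefix
    /// case, this needs to locate and decode the watermark column out of the key (here
    /// modeled as a plain slice of datums), which is why it is only applied during
    /// compaction and not on the read path.
    fn should_delete(&self, key_columns: &[Datum]) -> bool {
        let v = key_columns[self.watermark_col_idx_in_pk];
        match self.direction {
            WatermarkDirection::Ascending => v < self.watermark,
            WatermarkDirection::Descending => v > self.watermark,
        }
    }
}

fn main() {
    let wm = NonPkPrefixWatermark {
        direction: WatermarkDirection::Ascending,
        watermark: Datum(100),
        watermark_col_idx_in_pk: 1,
    };
    // key = (user_id, event_ts): event_ts 42 is below the watermark 100, so it expires.
    assert!(wm.should_delete(&[Datum(7), Datum(42)]));
    assert!(!wm.should_delete(&[Datum(7), Datum(120)]));
}
```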

Checklist

  • I have written necessary rustdoc comments.
  • I have added necessary unit tests and integration tests.
  • I have added test labels as necessary.
  • I have added fuzzing tests or opened an issue to track them.
  • My PR contains breaking changes.
  • My PR changes performance-critical code, so I will run (micro) benchmarks and present the results.
  • My PR contains critical fixes that are necessary to be merged into the latest release.

Documentation

  • My PR needs documentation updates.
Release note

@Li0k changed the title from "feat(storage): non_pk_watermark state clean" to "WIP: feat(storage): non_pk_watermark state clean" on Dec 23, 2024
@Li0k marked this pull request as ready for review on December 23, 2024 07:28
@graphite-app bot requested a review from a team on December 23, 2024 08:18
@Li0k requested a review from a team as a code owner on December 25, 2024 12:20
@Li0k requested a review from xxchan on December 25, 2024 12:20
@Li0k changed the title from "WIP: feat(storage): non_pk_watermark state clean" to "feat(storage): support non_pk_prefix_watermark state cleaning" on Dec 30, 2024
@Li0k requested reviews from hzxa21, st1page and chenzl25 on December 30, 2024 07:32
src/storage/src/hummock/store/version.rs (resolved)
src/meta/src/hummock/manager/compaction/mod.rs (outdated, resolved)
.table_watermarks
.iter()
.filter_map(|(table_id, table_watermarks)| {
if table_id_with_pk_prefix_watermark.contains(table_id) {
Collaborator:

We already have a WaterMarkType defined in the version, so why don't we just use that to filter out tables with a non-pk-prefix watermark?

Collaborator:

Also, if we filter out the non-pk-prefix watermark here, how can the compactor retrieve it? Based on the logic here, it seems that we rely on the non-pk-prefix watermark being present in the compact task.

Contributor (author):

Good catch, we should filter the watermarks by WaterMarkType directly.

Also, the filtered results are only passed to the picker, while all relevant watermarks (pk-prefix and non-pk-prefix) are passed to the compactor.

src/storage/hummock_sdk/src/compact_task.rs (outdated, resolved)
src/storage/hummock_sdk/src/compact_task.rs (outdated, resolved)
src/storage/hummock_sdk/src/table_watermark.rs (outdated, resolved)
src/storage/hummock_sdk/src/table_watermark.rs (outdated, resolved)
src/storage/hummock_sdk/Cargo.toml (outdated, resolved)
@@ -42,10 +47,14 @@ pub struct SkipWatermarkIterator<I> {
}

impl<I: HummockIterator<Direction = Forward>> SkipWatermarkIterator<I> {
Collaborator:

nit: since SkipWatermarkIterator is only used by the compactor, how about moving skip_watermark.rs into src/hummock/compactor?

Contributor (author), Jan 8, 2025:

Of course, I will propose a separate PR for it.

});
let watermark_col_in_pk =
row.datum_at(*watermark_col_idx_in_pk);
cmp_datum(
Collaborator:

IIUC, if cmp_datum returns Equal | Greater, based on the logic in L360, the watermark will be advanced. I think this is incorrect for a non-pk-prefix watermark because the non-pk-prefix watermark and the pk don't have the same ordering.

Comment on lines 732 to 746
let table_watermarks = version
.latest_version()
.table_watermarks
.iter()
.filter_map(|(table_id, table_watermarks)| {
if matches!(
table_watermarks.watermark_type,
WatermarkSerdeType::PkPrefix,
) {
Some((*table_id, table_watermarks.clone()))
} else {
None
}
})
.collect();
Collaborator:

Actually, why don't we do the filtering inside the picker instead, like here, if the watermark type is part of TableWatermarks?

We can avoid cloning the table watermarks, which can be large given that they store bytes from user data, with no harm.

Collaborator:

Any ideas on this comment? @Li0k
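As a rough illustration of the suggestion above, a picker-side check could consult the watermark type by reference and avoid cloning TableWatermarks entirely. The types below are simplified stand-ins, not the real hummock_sdk definitions:

```rust
use std::collections::BTreeMap;

#[derive(PartialEq)]
enum WatermarkSerdeType {
    PkPrefix,
    NonPkPrefix,
}

struct TableWatermarks {
    watermark_type: WatermarkSerdeType,
    // watermark payload (potentially large user-data bytes) elided
}

type TableId = u32;

// Picker-side check: iterate the version's watermark map by reference and only consider
// tables with a pk-prefix watermark, so no TableWatermarks clone is needed.
fn tables_with_pk_prefix_watermark(
    table_watermarks: &BTreeMap<TableId, TableWatermarks>,
) -> Vec<TableId> {
    table_watermarks
        .iter()
        .filter(|(_, tw)| tw.watermark_type == WatermarkSerdeType::PkPrefix)
        .map(|(table_id, _)| *table_id)
        .collect()
}

fn main() {
    let mut map = BTreeMap::new();
    map.insert(1, TableWatermarks { watermark_type: WatermarkSerdeType::PkPrefix });
    map.insert(2, TableWatermarks { watermark_type: WatermarkSerdeType::NonPkPrefix });
    assert_eq!(tables_with_pk_prefix_watermark(&map), vec![1u32]);
}
```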

}
WatermarkSerdeType::Serde(_watermark) => {
// do not skip the non-pk prefix watermark when vnode is the same
return false;
Collaborator:

I am afraid this is still incorrect based on the semantics of advance_watermark:

/// Return a flag indicating whether the current key will be filtered by the current watermark.

If we always return false when the table and vnode are the same here, that means none of the keys can be filtered by the watermark. Please carefully walk through the logic of advance_watermark, should_delete and advance_key_and_watermark. I am still concerned that the implementation of SkipWatermarkState and SkipWatermarkIterator relies on the assumption that the key ordering and the watermark ordering are the same, and we may still miss some changes.
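A tiny self-contained example of the ordering concern: keys are sorted by the full pk, so a non-pk-prefix watermark column does not advance monotonically with the key order, and a "stop checking once the key passes the watermark" shortcut would wrongly keep expired rows. The data below is made up for illustration:

```rust
fn main() {
    // (pk = user_id, watermark column = event_ts); rows already sorted by pk.
    let rows = [(1u32, 300i64), (2, 50), (3, 400)];
    let watermark = 100i64; // ascending watermark: rows with event_ts < 100 are expired

    // Every key must be checked individually; skipping checks after the first surviving
    // key (as a pk-prefix watermark iterator may do) would keep (2, 50) even though it
    // is expired.
    let survivors: Vec<_> = rows.iter().filter(|(_, ts)| *ts >= watermark).collect();
    assert_eq!(survivors, vec![&(1u32, 300i64), &(3u32, 400i64)]);
}
```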

@xxchan removed their request for review on January 15, 2025 08:01
@@ -52,6 +52,18 @@ impl OrderedRowSerde {
}
}

#[must_use]
pub fn index(&self, idx: usize) -> Cow<'_, Self> {
if 1 == self.order_types.len() {
Collaborator:

should we assert idx == 1 here?

Contributor (author):

No, the row can be of any length and index can be a generic function.
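For illustration, a hedged guess at what such a generic index helper could look like on a simplified OrderedRowSerde that only models order types (the real type also carries data types); this is not the actual implementation:

```rust
use std::borrow::Cow;

#[derive(Clone, Copy, PartialEq, Debug)]
enum OrderType {
    Ascending,
    Descending,
}

// Simplified stand-in for OrderedRowSerde: only the order types are modeled here.
#[derive(Clone)]
struct OrderedRowSerde {
    order_types: Vec<OrderType>,
}

impl OrderedRowSerde {
    // Project the serde down to the single column at `idx`. When the serde already has
    // exactly one column, borrowing `self` avoids an allocation.
    #[must_use]
    fn index(&self, idx: usize) -> Cow<'_, Self> {
        if self.order_types.len() == 1 {
            debug_assert_eq!(idx, 0, "single-column serde can only be indexed at 0");
            Cow::Borrowed(self)
        } else {
            Cow::Owned(OrderedRowSerde {
                order_types: vec![self.order_types[idx]],
            })
        }
    }
}

fn main() {
    let serde = OrderedRowSerde {
        order_types: vec![OrderType::Ascending, OrderType::Descending],
    };
    // Projecting column 1 yields a one-column serde with that column's order type.
    assert_eq!(serde.index(1).order_types, vec![OrderType::Descending]);
}
```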

src/storage/hummock_sdk/src/compact_task.rs (outdated, resolved)
),
&self.compact_task.table_watermarks,
))
let combine_iter = {
Collaborator:

Let's add a comment and a reference to the tracking issue for state cleaning based on the NonPkPrefix watermark iterator as well.

src/storage/src/hummock/compactor/compaction_utils.rs (outdated, resolved)
return direction.datum_filter_by_watermark(
watermark_col_in_pk,
watermark,
watermark_col_serde.get_order_types()[0],
Collaborator:

Given that we use [0] here, does it mean that either the watermark column is always a single column or the order type for all watermark columns must be the same? Is this assumption checked and guaranteed somewhere?

Contributor (author):

This assumes that the watermark is a single column. I've found no such guarantee on the state table side either.

From my point of view, this is something the optimizer needs to guarantee. What is the reasonable way to do it? @hzxa21 @st1page

Contributor (author):

As said above, I think this guarantee should come from clean_watermark_index_in_pk. If the assumption is broken then using [0] should crash, which is fine with me.

Collaborator:

If there are multiple watermark columns with different order types, I think this code won't crash but will generate incorrect results.
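A minimal sketch of the kind of guard discussed in this thread: assert up front that the non-pk-prefix watermark covers exactly one column, so that indexing the order types with [0] is safe. single_watermark_order_type is a hypothetical helper, not actual RisingWave code:

```rust
fn single_watermark_order_type<T: Copy>(watermark_order_types: &[T]) -> T {
    assert_eq!(
        watermark_order_types.len(),
        1,
        "a non-pk-prefix watermark is assumed to cover exactly one column"
    );
    watermark_order_types[0]
}

fn main() {
    // With a single watermark column this returns its order type ...
    assert_eq!(single_watermark_order_type(&["ASC"]), "ASC");
    // ... and with multiple columns it panics instead of silently using the wrong order.
    let result = std::panic::catch_unwind(|| single_watermark_order_type(&["ASC", "DESC"]));
    assert!(result.is_err());
}
```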

src/storage/src/hummock/iterator/skip_watermark.rs (outdated, resolved)

gru-agent bot commented Jan 20, 2025

This pull request has been modified. If you want me to regenerate unit tests for any of the related files, please find the file in the "Files Changed" tab and add a comment @gru-agent. (The GitHub "Comment on this file" feature is in the upper right corner of each file in the "Files Changed" tab.)

fn should_delete(&mut self, key: &FullKey<&[u8]>) -> bool;
fn reset_watermark(&mut self);

// fn new(watermarks: BTreeMap<TableId, ReadTableWatermark>) -> Self;
Collaborator:

Remove?

fn reset_watermark(&mut self);

// fn new(watermarks: BTreeMap<TableId, ReadTableWatermark>) -> Self;
// fn from_safe_epoch_watermarks(safe_epoch_watermarks: BTreeMap<u32, TableWatermarks>) -> Self;
Collaborator:

Remove?

@@ -172,32 +165,30 @@ impl<I: HummockIterator<Direction = Forward>> HummockIterator for SkipWatermarkI
self.inner.value_meta()
}
}
pub struct SkipWatermarkState {

pub trait SkipWatermarkState: Send {
Collaborator:

Please move this trait definition to the top of the file to make the code more readable. Also, can you add documentation for each method?
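For reference, a possible documented form of the trait, following the request above. The method names and the should_delete semantics come from this PR's diff and discussion; FullKey is replaced by a plain byte slice so the sketch stays self-contained:

```rust
pub trait SkipWatermarkState: Send {
    /// Returns a flag indicating whether the current `key` will be filtered by the
    /// current watermark and should therefore be skipped (deleted) by the iterator.
    fn should_delete(&mut self, key: &[u8]) -> bool;

    /// Resets the internal watermark cursor so that the state can be reused for another
    /// iteration pass over the data.
    fn reset_watermark(&mut self);
}
```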

continue;
}
Ordering::Equal => {
self.last_serde = self
Collaborator:

I don't get why we need last_serde. It seems that we always retrieve it from compaction_catalog_agent_ref.watermark_serde.

Collaborator:

I think you probably want to cache it per table id, and update it only when the table id switches.
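A hedged sketch of the suggested per-table caching: remember the serde of the table the iterator is currently in and refresh it only on a table-id switch, instead of asking the catalog for every key. The types below are simplified stand-ins, and `catalog` stands in for compaction_catalog_agent_ref.watermark_serde:

```rust
use std::collections::HashMap;

type TableId = u32;

#[derive(Clone)]
struct WatermarkSerde; // stand-in for the real watermark serde

struct WatermarkSerdeCache {
    catalog: HashMap<TableId, WatermarkSerde>,
    last_table_id: Option<TableId>,
    last_serde: Option<WatermarkSerde>,
}

impl WatermarkSerdeCache {
    /// Returns the serde for `table_id`, hitting the catalog only on a table switch.
    fn serde_for(&mut self, table_id: TableId) -> Option<&WatermarkSerde> {
        if self.last_table_id != Some(table_id) {
            // Table switch: look the serde up once and remember it for subsequent keys.
            self.last_table_id = Some(table_id);
            self.last_serde = self.catalog.get(&table_id).cloned();
        }
        self.last_serde.as_ref()
    }
}

fn main() {
    let mut cache = WatermarkSerdeCache {
        catalog: HashMap::from([(1, WatermarkSerde)]),
        last_table_id: None,
        last_serde: None,
    };
    assert!(cache.serde_for(1).is_some()); // first key of table 1: catalog lookup
    assert!(cache.serde_for(1).is_some()); // subsequent keys of table 1: cached
    assert!(cache.serde_for(2).is_none()); // table switch: lookup again, table 2 has none
}
```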

@hzxa21 (Collaborator) left a comment:

Rest LGTM
