-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data] Update Dataset.count()
to avoid unnecessarily keeping BlockRef
s in-memory
#46369
Changes from 2 commits
c313a52
09b905d
5ef8041
172e423
374898b
b0ec894
f0b49f1
aa368d4
ab8d6ed
128614d
65c3f4e
7956f35
958d306
d6a1d79
87151ee
3ea9759
b6df226
70f82de
f520c62
629d6bb
4720bf6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -12,6 +12,7 @@ | |
Dict, | ||
Generic, | ||
Iterable, | ||
Iterator, | ||
List, | ||
Literal, | ||
Mapping, | ||
|
@@ -2486,11 +2487,12 @@ def count(self) -> int: | |
|
||
get_num_rows = cached_remote_fn(_get_num_rows) | ||
|
||
return sum( | ||
ray.get( | ||
[get_num_rows.remote(block) for block in self.get_internal_block_refs()] | ||
) | ||
) | ||
# Directly loop over the iterator of `BlockRef`s instead of first | ||
# retrieving a list of `BlockRef`s. | ||
total_rows = 0 | ||
for block_ref in self.iter_internal_block_refs(): | ||
total_rows += ray.get(get_num_rows.remote(block_ref)) | ||
return total_rows | ||
|
||
@ConsumptionAPI( | ||
if_more_than_read=True, | ||
|
@@ -4563,7 +4565,29 @@ def stats(self) -> str: | |
def _get_stats_summary(self) -> DatasetStatsSummary: | ||
return self._plan.stats_summary() | ||
|
||
@ConsumptionAPI(pattern="Time complexity:") | ||
@ConsumptionAPI(pattern="") | ||
@DeveloperAPI | ||
def iter_internal_block_refs(self) -> Iterator[ObjectRef[Block]]: | ||
"""Get an iterator over references to the underlying blocks of this Dataset. | ||
|
||
This function can be used for zero-copy access to the data. It does not | ||
keep the data materialized in-memory. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What does zero-copy access mean here? You might copy the data when you get the block reference, right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i had thought that when we get the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh. Yeah, if you don't call There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. good point, let me just remove the line. i think saying "It does not keep the data materialized in-memory." is the more important main point to get across. |
||
|
||
Examples: | ||
>>> import ray | ||
>>> ds = ray.data.range(1) | ||
>>> for block_ref in ds.get_internal_block_refs(): | ||
... block = ray.get(block_ref) | ||
|
||
Returns: | ||
An iterator over references to this Dataset's blocks. | ||
""" | ||
iter_block_refs_md, _, _ = self._plan.execute_to_iterator() | ||
iter_block_refs = (block_ref for block_ref, _ in iter_block_refs_md) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. just realized that we already have block metadata here. So no need to submit additional tasks to count rows. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 |
||
self._synchronize_progress_bar() | ||
return iter_block_refs | ||
|
||
@ConsumptionAPI(pattern="") | ||
@DeveloperAPI | ||
def get_internal_block_refs(self) -> List[ObjectRef[Block]]: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (can do this later) there are only a few use cases of |
||
"""Get a list of references to the underlying blocks of this dataset. | ||
|
@@ -4577,8 +4601,6 @@ def get_internal_block_refs(self) -> List[ObjectRef[Block]]: | |
>>> ds.get_internal_block_refs() | ||
[ObjectRef(...)] | ||
|
||
Time complexity: O(1) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. removing this because it's no longer accurate. |
||
|
||
Returns: | ||
A list of references to this dataset's blocks. | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://developers.google.com/style/contractions