-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make add_filename
str/bool
#465
Make add_filename
str/bool
#465
Conversation
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
@@ -273,6 +273,18 @@ def _set_torch_to_use_rmm(): | |||
torch.cuda.memory.change_current_allocator(rmm_torch_allocator) | |||
|
|||
|
|||
def _resolve_filename_col(filename: Union[bool, str]) -> Union[str, bool]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this a lot, thanks! Was wondering if other files from #449 (like doc_dataset.py) need to be updated as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops looks like I forgot all the APIs where those changes needed to be made. Added those too now thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think you could be missing some?
- https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/download/doc_builder.py
- https://github.com/NVIDIA/NeMo-Curator/blob/main/docs/user-guide/documentdataset.rst
- https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/utils/file_utils.py
And writing logic such as in https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/utils/distributed_utils.py.
Signed-off-by: Praateek <[email protected]>
…n/NeMo-Curator into praateek/filename-col-str
@@ -273,6 +273,18 @@ def _set_torch_to_use_rmm(): | |||
torch.cuda.memory.change_current_allocator(rmm_torch_allocator) | |||
|
|||
|
|||
def _resolve_filename_col(filename: Union[bool, str]) -> Union[str, bool]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think you could be missing some?
- https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/download/doc_builder.py
- https://github.com/NVIDIA/NeMo-Curator/blob/main/docs/user-guide/documentdataset.rst
- https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/utils/file_utils.py
And writing logic such as in https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/utils/distributed_utils.py.
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
@@ -824,24 +837,26 @@ def _merge_tmp_simple_bitext_partitions(tmp_output_dir: str, output_dir: str): | |||
def write_to_disk( | |||
df, | |||
output_path: str, | |||
write_to_filename: bool = False, | |||
write_to_filename: Union[bool, str] = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice, thanks.
Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Praateek Mahajan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
Description
Inspired from
dask.read_json
whereinclude_path_column : bool or str, optional
:We also had a bug in our code where we were doing
include_path_column=True
and then renaming the column frompath -> file_name
. This meant if there was already apath
column we would have an error.Usage
# Add snippet demonstrating usage
Checklist