[BUG] Semdedup Embedding Restart not working cleanly #211

Closed
VibhuJawa opened this issue Aug 19, 2024 · 5 comments

@VibhuJawa
Collaborator

Describe the bug

Currently our semdedup restart mechanism for embedding is not working cleanly.

This is because the input is read with add_filename=False:

ddf = read_data(
    input_files=input_files, file_type=args.input_file_type, add_filename=False
)

The output is also written with write_to_filename set to False.

In addition, get_remaining_files cannot, by default, compare files with different extensions:

def get_remaining_files(
    input_file_path, output_file_path, input_file_type, num_files=-1
):
    """
    This function returns a list of the files that still remain to be read.

    Args:
        input_file_path: The path of the input files.
        output_file_path: The path of the output files.
        input_file_type: The type of the input files.
        num_files: The max number of files to be returned. If -1, all files are returned.
    Returns:
        A list of files that still remain to be read.
    """
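
For illustration only, one way to compare inputs and outputs that have different extensions is to match them by file stem. This is a minimal sketch, not the fix that was merged: the helper name remaining_files_by_stem and its input_ext/output_ext parameters are hypothetical, and it assumes each output file keeps the stem of the input it came from, which only holds when filenames are preserved on read and write.

```python
from pathlib import Path


def remaining_files_by_stem(input_dir, output_dir, input_ext, output_ext, num_files=-1):
    """Hypothetical helper: list input files that have no same-stem output yet."""
    # Stems of outputs already written, e.g. "shard_0" for "shard_0.parquet".
    done = {p.stem for p in Path(output_dir).glob(f"*.{output_ext}")}
    # Keep inputs whose stem has no counterpart in the output directory.
    remaining = sorted(
        str(p) for p in Path(input_dir).glob(f"*.{input_ext}") if p.stem not in done
    )
    # Same num_files convention as the docstring above: -1 means "all".
    return remaining if num_files == -1 else remaining[:num_files]
```

An empty result would mean every input already has a matching output, so a restart would have nothing left to process.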

@VibhuJawa VibhuJawa added the bug Something isn't working label Aug 19, 2024
@VibhuJawa VibhuJawa self-assigned this Oct 14, 2024
@sarahyurick
Collaborator

Happy to pair on this at some point; in general there are a couple of things I have been thinking should be refactored with DocumentDataset's read and write functions.

See: #50, #180, #293...

@VibhuJawa
Collaborator Author

@sarahyurick, I think given your PRs, you should probably just take this on. Happy to provide input as needed. Let me know what you think.

@sarahyurick sarahyurick self-assigned this Oct 22, 2024
@VibhuJawa VibhuJawa removed their assignment Oct 22, 2024
@sarahyurick
Collaborator

I'm not sure I can reproduce this. I ran:

python compute_embeddings.py \
    --input-data-dir "my_data" \
    --input-file-type "jsonl" \
    --input-file-extension "jsonl" \
    --config-file "semdedup_config.yaml"

where my_data is a directory with 2 JSONL files. In semdedup_config.yaml, I specified a different directory as the cache_dir where the 2 resulting Parquet files were written. When I rerun without changing anything, there are no errors.

LMK if there is anything else I should be setting or changing, otherwise we can close this issue.

@sarahyurick
Collaborator

NVM, the issue is that it should not rerun if the embeddings are already present.

@sarahyurick
Collaborator

Closed by #327.
