[BUG] Semdedup Embedding Restart not working cleanly #211

VibhuJawa · 2024-08-19T17:57:54Z

Describe the bug

Currently, our semdedup restart mechanism for the embedding step is not working cleanly.

This is because the input is read with add_filename=False:

NeMo-Curator/nemo_curator/scripts/semdedup/compute_embeddings.py

Lines 62 to 64 in 3a31ab1

    
    ddf = read_data(
        input_files=input_files, file_type=args.input_file_type, add_filename=False
    )

And the output is written with write_to_filename=False:

NeMo-Curator/nemo_curator/scripts/semdedup/compute_embeddings.py

Line 78 in 3a31ab1

write_to_filename=False,

And get_remaining_files by default can't handle comparing files with different extensions:

NeMo-Curator/nemo_curator/utils/file_utils.py

Lines 66 to 80 in 3a31ab1

    
    def get_remaining_files(
        input_file_path, output_file_path, input_file_type, num_files=-1
    ):
        """
        This function returns a list of the files that still remain to be read.

        Args:
            input_file_path: The path of the input files.
            output_file_path: The path of the output files.
            input_file_type: The type of the input files.
            num_files: The max number of files to be returned. If -1, all files are returned.
        Returns:
            A list of files that still remain to be read.

        """
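To illustrate the extension mismatch described above, here is a minimal sketch of an extension-agnostic comparison. It is not the NeMo-Curator implementation: the helper name remaining_files_by_stem and the example paths are hypothetical, and it assumes the output files keep the input base names (which would only hold if the embeddings were written per input file, e.g. with write_to_filename=True).

```python
import os


def remaining_files_by_stem(input_files, output_files):
    """Return input files whose base name (extension ignored) has no match among the outputs.

    Hypothetical helper for illustration only; not part of NeMo-Curator.
    """
    # Base names of everything already written, with the extension stripped.
    done = {os.path.splitext(os.path.basename(p))[0] for p in output_files}
    return [
        p
        for p in input_files
        if os.path.splitext(os.path.basename(p))[0] not in done
    ]


# Hypothetical paths: JSONL inputs, Parquet embeddings already written for "a".
inputs = ["my_data/a.jsonl", "my_data/b.jsonl"]
outputs = ["cache/embeddings/a.parquet"]
remaining = remaining_files_by_stem(inputs, outputs)
print(remaining)
```

With a comparison along these lines, a restarted run would only need to process the inputs that have no corresponding output yet, even when the input is JSONL and the output is Parquet.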


sarahyurick · 2024-10-14T21:20:25Z

Happy to pair on this at some point; in general there are a couple of things I have been thinking should be refactored with DocumentDataset's read and write functions.

See: #50, #180, #293...

VibhuJawa · 2024-10-22T23:00:46Z

@sarahyurick, I think, given your PRs, you should probably just take this on. Happy to provide input as needed. Let me know what you think.

sarahyurick · 2024-10-25T20:00:54Z

I'm not sure I can reproduce this. I ran:

python compute_embeddings.py \
    --input-data-dir "my_data" \
    --input-file-type "jsonl" \
    --input-file-extension "jsonl" \
    --config-file "semdedup_config.yaml"

where my_data is a directory with 2 JSONL files. In semdedup_config.yaml, I specified a different directory as the cache_dir where the 2 resulting Parquet files were written. When I rerun without changing anything, there are no errors.

LMK if there is anything else I should be setting or changing, otherwise we can close this issue.

sarahyurick · 2024-10-25T21:04:19Z

NVM, the issue is that it should not rerun if the embeddings are already present.

sarahyurick · 2024-11-06T21:20:10Z

Closed by #327.

VibhuJawa added the bug label Aug 19, 2024

VibhuJawa self-assigned this Oct 14, 2024

sarahyurick self-assigned this Oct 22, 2024

VibhuJawa removed their assignment Oct 22, 2024

sarahyurick mentioned this issue Oct 25, 2024

Semantic deduplication improvements #327

Merged

2 tasks

sarahyurick closed this as completed Nov 6, 2024
