Commit

bugfix to also read WET files with WarcReader
guipenedo committed Dec 12, 2023
1 parent 476de37 commit 6014a6f
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/datatrove/pipeline/readers/warc.py
@@ -41,7 +41,7 @@ def read_file(self, datafile: BaseInputDataFile):
 
 def process_record(record: ArcWarcRecord) -> dict | None:
     # record type
-    if record.rec_type != "response":
+    if record.rec_type != "response" and record.rec_type != "conversion":  # wet files have "conversion" type
         return
 
     # content type filtering
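
The effect of the new condition can be illustrated with a minimal, self-contained sketch. This is not the datatrove implementation: `FakeRecord`, `ACCEPTED_REC_TYPES`, and `is_accepted` are hypothetical names introduced here, and the set membership test is assumed to be equivalent to the two chained comparisons in the diff. The only attribute taken from the commit is `rec_type`.

```python
from dataclasses import dataclass

# Record types that pass the check after this commit:
# "response" (regular WARC files) and "conversion" (WET files).
ACCEPTED_REC_TYPES = {"response", "conversion"}


@dataclass
class FakeRecord:
    """Hypothetical stand-in for a WARC record; only rec_type is modeled."""

    rec_type: str


def is_accepted(record) -> bool:
    """Return True when the record would not be skipped by the type check."""
    return record.rec_type in ACCEPTED_REC_TYPES


records = [
    FakeRecord("warcinfo"),
    FakeRecord("request"),
    FakeRecord("response"),
    FakeRecord("conversion"),
]
kept = [r.rec_type for r in records if is_accepted(r)]
print(kept)
```

Before the change, only the "response" record would have been kept, so a WET file (whose payload records are of type "conversion") would have produced no documents.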