Commit

correctly track time
guipenedo committed Dec 13, 2023
1 parent cb81215 commit 0ebc066
Showing 1 changed file with 4 additions and 2 deletions.
6 changes: 4 additions & 2 deletions src/datatrove/pipeline/readers/parquet.py
@@ -31,10 +31,12 @@ def read_file(self, datafile: BaseInputDataFile):
 li = 0
 columns = [self.content_key, self.id_key] if not self.read_metadata else None
 for batch in pqf.iter_batches(batch_size=self.batch_size, columns=columns):
-    with self.track_time():
+    documents = []
+    with self.track_time("batch"):
         for line in batch.to_pylist():
             document = self.get_document_from_dict(line, datafile, li)
             if not document:
                 continue
+            documents.append(document)
             li += 1
-            yield document
+    yield from documents
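
The change moves the `yield` out of the timed block. A likely reason is that a generator is suspended at `yield`, so any work the consumer does between items is counted while the timer is still running. The sketch below illustrates that effect under stated assumptions: `Timer`, `yield_inside`, `yield_outside`, and `consume` are hypothetical names made up for this example, and `Timer.track_time` is a simple stand-in, not datatrove's implementation.

```python
import time
from contextlib import contextmanager


class Timer:
    """Hypothetical stand-in for a stats tracker (not datatrove's class)."""

    def __init__(self):
        self.total = 0.0

    @contextmanager
    def track_time(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate wall-clock time spent inside the `with` block.
            self.total += time.perf_counter() - start


def yield_inside(timer, batches):
    # Pattern before the change: the generator pauses at `yield` while
    # the timer is running, so consumer time is included.
    for batch in batches:
        with timer.track_time():
            for item in batch:
                yield item


def yield_outside(timer, batches):
    # Pattern after the change: collect the batch inside the timed block,
    # then yield once the timer has stopped.
    for batch in batches:
        items = []
        with timer.track_time():
            for item in batch:
                items.append(item)
        yield from items


def consume(gen, delay=0.01):
    """Simulate a slow downstream step between items."""
    out = []
    for item in gen:
        time.sleep(delay)
        out.append(item)
    return out


batches = [[1, 2, 3], [4, 5, 6]]
inside, outside = Timer(), Timer()
result_inside = consume(yield_inside(inside, batches))
result_outside = consume(yield_outside(outside, batches))
print(f"yield inside timer:  {inside.total:.4f}s")
print(f"yield outside timer: {outside.total:.4f}s")
```

Both generators produce the same items, but only the first one charges the consumer's sleep time to the reader. The trade-off is that the second pattern holds one full batch in memory before yielding anything from it.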
