Improve reading from tar archives #178

Merged
merged 5 commits into from
Jan 3, 2025


schlegelp
Collaborator

@schlegelp schlegelp commented Jan 2, 2025

Addresses #173 by reading tar archives as a sequential file stream instead of using random access.

Due to the way this is implemented now, we won't be using parallel processes (i.e. the parallel parameter is ignored). We could create chunks of files that are adjacent in the archive and split the chunks across multiple processes. However, that in turn would cause issues with the progress bar.
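For reference, the chunking idea mentioned above could look roughly like the sketch below. This is not part of this PR; `contiguous_chunks` is a hypothetical helper, and it assumes the member names are already listed in archive order.

```python
def contiguous_chunks(names, n_chunks):
    """Split `names` (assumed to be in archive order) into contiguous runs.

    Hypothetical helper: each run only contains files that sit next to each
    other in the archive, so a worker could read its run as one stream.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be >= 1")
    size, rem = divmod(len(names), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `rem` chunks get one extra item so sizes differ by at most 1
        stop = start + size + (1 if i < rem else 0)
        if stop > start:  # drop empty chunks when there are fewer files than chunks
            chunks.append(names[start:stop])
        start = stop
    return chunks
```

Each worker would still need to report progress back to a single bar, which is the complication noted above.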

Ultimately, the file streaming seems to be very performant (possibly because we don't have to open/close individual files?), so I'm not too worried about performance. On my machine I can read the tar archive with 97k hemibrain skeletons in around 3 minutes, which isn't too shabby.
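To illustrate the general idea of streaming versus random access, here is a minimal sketch using only the standard library `tarfile` module. It is not the code in this PR; `iter_tar_members` and the `suffix` parameter are made up for the example.

```python
import tarfile


def iter_tar_members(path, suffix=".swc"):
    """Yield (name, raw_bytes) for matching files, in archive order.

    Hypothetical helper. Mode "r|*" opens the archive as a forward-only
    stream (with transparent compression), so members are read in the order
    they are stored instead of seeking to each one individually.
    """
    with tarfile.open(path, mode="r|*") as tf:
        for member in tf:
            if not member.isfile() or not member.name.endswith(suffix):
                continue
            # In stream mode the data must be read before moving on
            fh = tf.extractfile(member)
            yield member.name, fh.read()
```

Because the stream only moves forward, each file has to be consumed as it comes by, which is also why this approach does not combine easily with per-file parallelism.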

In addition to the above this PR contains:

  • making read_swc more robust against an unexpected number of columns
  • following URL changes in two of the tutorials (download.brainlib.org:8811 -> download.brainimagelibrary.org)
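As a rough illustration of what tolerating an unexpected number of columns can mean for SWC files, here is a minimal sketch. It is not the actual read_swc implementation; `parse_swc_text` and `SWC_COLUMNS` are hypothetical names, and the choice to drop extra columns and skip short rows is an assumption made for the example.

```python
# The seven standard SWC columns (names chosen for this example)
SWC_COLUMNS = ("node_id", "label", "x", "y", "z", "radius", "parent_id")


def parse_swc_text(text):
    """Parse SWC text into a list of 7-tuples.

    Hypothetical sketch: extra trailing columns are ignored and rows with
    fewer than seven fields are skipped.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or header comment
        fields = line.split()
        if len(fields) < len(SWC_COLUMNS):
            continue  # too short to interpret
        f = fields[: len(SWC_COLUMNS)]  # ignore anything past column seven
        rows.append(
            (int(f[0]), int(f[1]), float(f[2]), float(f[3]),
             float(f[4]), float(f[5]), int(f[6]))
        )
    return rows
```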

@schlegelp schlegelp changed the title Fix reading from tar archives Improve reading from tar archives Jan 3, 2025
@schlegelp schlegelp merged commit ab4de9e into master Jan 3, 2025
20 of 21 checks passed
@schlegelp schlegelp deleted the read_tar_fix branch January 3, 2025 10:58