Use local scr for initial downloads / heavy writes #148
@jacobwhall at some point we can update completed pipelines to adhere to this practice, but most of the recently completed ones should be pretty lightweight (and have already been fully run, so near-future IO isn't an issue).
Many of the dataset scripts currently follow a model like this:

graph LR;
id1(data source)-- download tasks -->raw_dir;
raw_dir-- process_tasks -->output_dir;
output_dir-- ingest system -->GeoQuery;
Setting the raw_dir to be in a local scratch folder (e.g. |
From our conversation today, it sounds like raw_dir and output_dir both need to be permanent archives, so we can't have either of them point to a local scratch directory. Instead, we should specify a tmp_dir for tasks to write files into, and then at the end of each task move the file to its final destination. Perhaps a Dataset class function could manage the tmp_dir for us.
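To make the tmp_dir idea concrete, here is a minimal sketch of what such a helper might look like. Everything in it is an assumption for illustration: the `Dataset` class is a bare stand-in rather than the project's actual class, and the `tmp_to_dst_file` method name and `tmp_root` parameter are hypothetical. It relies only on the standard library.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


class Dataset:
    """Hypothetical stand-in for a pipeline's Dataset class."""

    def __init__(self, tmp_root=None):
        # tmp_root: node-local scratch location (an assumption). None falls
        # back to the system default temporary directory.
        self.tmp_root = tmp_root

    @contextmanager
    def tmp_to_dst_file(self, final_dst):
        """Yield a temporary path to write to; move it to final_dst on success."""
        final_dst = Path(final_dst)
        with tempfile.TemporaryDirectory(dir=self.tmp_root) as tmp_dir:
            tmp_path = Path(tmp_dir) / final_dst.name
            yield tmp_path
            # Only reached if the task body did not raise, so a failed task
            # never leaves a partial file in raw_dir or output_dir.
            final_dst.parent.mkdir(parents=True, exist_ok=True)
            # shutil.move falls back to copy-then-delete across file systems.
            shutil.move(str(tmp_path), str(final_dst))
```

A task would then wrap its write in `with dataset.tmp_to_dst_file(raw_dir / "file.tif") as tmp_path:` and write to `tmp_path`. The heavy IO stays on local disk, and the main file system sees a single move per file.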
As a best practice given the potential scale of many of our pipelines (i.e., potentially running dozens of tasks or more across nodes), I'd like to minimize the continuous IO on our main file system. This should be relatively simple: use the node's local disk for downloads and processing, then copy the results to the main file system. Even in scenarios where operations are quick and not IO-heavy, this shouldn't slow jobs down much and is worth the extra peace of mind.
(Currently, heavy IO is seemingly causing extra issues on our aging file system, but we are in the process of moving everything to a brand new file system.)