pipeline to make samples #1

Use or create a pipeline in ocf-data-sampler to make batches. This could be the site pipeline we have already. The samples could then be used to train an ML model.
Hi @peterdudfield, I've been exploring the GFS dataset and the existing pipelines. I understand the goal is to process the GFS data into Zarr format and create samples using the existing site pipeline. I have a few clarifications to ensure alignment: Is the GFS data already converted to Zarr, or do we need to handle the conversion as part of this issue? And should we include all of the available variables, or only a subset? Thank you!
GFS is already Zarr; it's in S3, see #34. Yes, by default I would use all of the variables, and then we can always remove some as needed.
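For anyone following along, a minimal sketch of what opening that store could look like, assuming the data can be read with xarray plus s3fs; the bucket path below is a placeholder, the real location is in #34:

```python
# Minimal sketch (placeholder path, see #34 for the real S3 location):
# open the GFS Zarr store lazily with xarray, then inspect its variables.
import xarray as xr  # reading "s3://" paths also needs s3fs installed

gfs = xr.open_zarr(
    "s3://some-bucket/gfs.zarr",        # placeholder, not the real path
    storage_options={"anon": True},     # assumes anonymous/public access
)
print(list(gfs.data_vars))  # start with all variables; drop any later if needed
```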
Hi @peterdudfield, I've been working on the GFS pipeline and ran into a few challenges. I've opened a draft PR (#38) with the current state of the code. Could you take a look and let me know if I'm heading in the right direction, or if there's anything major I should adjust? Thanks for your help!
Thanks for doing all this
I am mostly having trouble understanding the configs, so I am unable to define them correctly and test them. I tried some initial code in #38, but I am not sure whether I am going in the right direction. Thanks for your guidance!
OK, I spent some more time debugging, and after going through it again I have found a few things:
1. Challenges with select_time_slice_nwp
The select_time_slice_nwp function is tightly coupled to datasets that have a channel dimension. Since the GFS dataset lacks this dimension, the function fails to process the dataset correctly.
Here’s the traceback we encountered during processing:
Traceback (most recent call last):
...
KeyError: "No variable named 'channel'. Variables on the dataset include ['t', 'u10', 'v10', 'dlwrf', 'dswrf', ..., 'vis', 'init_time_utc', 'latitude', 'longitude', 'step']"
Things I have tried:
- Modifying the dataset to include a pseudo channel dimension by restructuring it into an xarray.Dataset with a channel coordinate. This added significant overhead for larger datasets.
- Monkey-patching the function to bypass the channel dimension entirely, treating each variable as its own "channel". This worked partially, but introduced inconsistencies in slicing.
Core Issue:
The function assumes that the dataset can be indexed directly with channel_dim_name. In GFS, each variable is separate, so this assumption breaks.
Suggestion:
Refactor select_time_slice_nwp to allow datasets without a channel dimension to work seamlessly, and to fall back to processing variables independently if channel_dim_name is not found (a rough sketch of the fallback idea follows below).
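To make the suggestion concrete, here is a rough sketch of the fallback idea; ensure_channel_dim is a hypothetical helper, not something that exists in ocf-data-sampler, and select_time_slice_nwp could call it before indexing on channel_dim_name:

```python
# Hypothetical helper (not in ocf-data-sampler): build a channel dimension
# when the incoming data doesn't have one, treating each variable as a channel.
import xarray as xr

def ensure_channel_dim(data, channel_dim_name: str = "channel") -> xr.DataArray:
    """Return a DataArray that always has a channel dimension."""
    if isinstance(data, xr.Dataset):
        # Stack the data variables along a new channel dimension; this stays
        # lazy when the dataset is dask-backed, so no data is loaded here.
        return data.to_array(dim=channel_dim_name)
    if channel_dim_name not in data.dims:
        # Single-variable DataArray: add a length-1 channel dimension.
        return data.expand_dims({channel_dim_name: [data.name or "unknown"]})
    return data
```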
2. Redundant Configuration Parameters
Parameters like image_size_pixels_width and image_size_pixels_height are required in the YAML config under nwp, but they seem unrelated to slicing or data sampling for GFS. They are more relevant for image-based datasets and add unnecessary complexity to the configuration file. Can we make these parameters optional for NWP datasets, or remove them where they are not applicable? (A hedged sketch of what optional fields could look like follows below.)
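As an illustration of the ask, a pydantic-style config model could simply make those fields optional; the class and field names below mirror the YAML keys but are not the actual ocf-data-sampler config classes:

```python
# Illustrative only: an NWP config where the image-size fields are optional,
# so GFS-style configs can omit them. Not the real ocf-data-sampler models.
from typing import Optional
from pydantic import BaseModel

class NWPConfigSketch(BaseModel):
    zarr_path: str
    image_size_pixels_width: Optional[int] = None
    image_size_pixels_height: Optional[int] = None

config = NWPConfigSketch(zarr_path="s3://some-bucket/gfs.zarr")  # no image sizes needed
```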
3. Non-adoptability of site.py for NWP
The site.py functions are simpler and more general, but they don't work directly for NWP because of the additional handling needed for init_time, step, and accumulated variables (illustrated in the sketch below).
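To show what that extra handling looks like in practice, here is a hedged sketch (not the project's code); the coordinate names init_time_utc and step come from the traceback above, while the de-accumulation step and the accum_vars argument are assumptions for illustration:

```python
# Illustrative NWP slicing that site.py-style code doesn't need to do:
# pick a forecast run, select forecast steps, and de-accumulate variables.
import pandas as pd
import xarray as xr

def select_nwp_slice(ds: xr.Dataset, t0: pd.Timestamp, forecast_hours: int,
                     accum_vars: tuple = ()) -> xr.Dataset:
    # Most recent initialisation at or before t0.
    ds = ds.sel(init_time_utc=t0, method="ffill")
    # Keep only the forecast steps we want.
    ds = ds.sel(step=slice(pd.Timedelta(0), pd.Timedelta(hours=forecast_hours)))
    if accum_vars:
        # diff() drops the first step, so trim the dataset to match.
        deaccumulated = {name: ds[name].diff("step") for name in accum_vars}
        ds = ds.isel(step=slice(1, None)).assign(deaccumulated)
    return ds
```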
4. Inefficiencies with large datasets
Restructuring the dataset to include a channel dimension is impractical for GFS datasets (terabytes of data). This approach introduces overhead and slows down processing.
I think something like this might help you: `nwp: xr.DataArray = gfs.to_array()`. It's from ocf_datapipes, our old data pipeline. P.S. I'm on holiday for a few days, but I'll get back to you when I can.
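A quick demo of what that suggestion does, assuming `gfs` is the Zarr dataset opened earlier in the thread; the rename to "channel" is an assumption about what the sampler expects, not something confirmed from ocf_datapipes:

```python
# Dataset.to_array() stacks the data variables along a new dimension
# (called "variable" by default), which can then serve as the channel dim.
import xarray as xr

gfs = xr.open_zarr("s3://some-bucket/gfs.zarr")   # placeholder path, as above
nwp: xr.DataArray = gfs.to_array()                # new leading "variable" dim
nwp = nwp.rename({"variable": "channel"})         # assumed channel_dim_name
print(nwp.coords["channel"].values)               # the original variable names
```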
Thanks for telling me about to_array(). I think most of the things are done; I am just verifying that all the functions are working well, and then we can move towards comparison.
I was trying to write a test for checking normalization, but it is not working well, so I resorted to checking these values manually. Some are working well, but some are not giving good plots. Could it be an issue on the code side, or are the data attributes themselves like that?
That's very interesting.
Ah, it's because we haven't moved the GFS constants over to ocf-data-sampler. The old ones are here. Do you want to try with these, and see if the normalization works?
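If it helps while testing, here is a hedged sketch of per-channel normalization with constants shaped like the old ocf_datapipes ones; the numbers below are placeholders, not the real GFS means and standard deviations:

```python
import xarray as xr

# Placeholder statistics (NOT the real GFS constants from ocf_datapipes).
GFS_MEAN = {"t": 285.0, "u10": 0.0, "v10": 0.0}
GFS_STD = {"t": 15.0, "u10": 5.0, "v10": 5.0}

def normalize(ds: xr.Dataset) -> xr.Dataset:
    """Apply (x - mean) / std to every variable we have constants for."""
    out = ds.copy()
    for name in ds.data_vars:
        if name in GFS_MEAN and name in GFS_STD:
            out[name] = (ds[name] - GFS_MEAN[name]) / GFS_STD[name]
    return out
```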
Also, there was one more thing which I am not sure of: for some reason I need to define
Hey @siddharth7113, did the recent changes in ocf-data-sampler help you?
Hey @peterdudfield, unfortunately a bunch of PRs got merged in ocf-data-sampler, and one of those changes partly broke my code. I didn't get time to correct it; I'll work on it and update here in a few hours.
No problem, thanks @siddharth7113. Any help is useful, so no pressure on this.
@peterdudfield Good news! Everything works as intended. I'm going to clean up the code, document things, and open a PR ASAP.