pipeline to make samples #1

Open
Tracked by #5
peterdudfield opened this issue Dec 4, 2024 · 20 comments

Comments

@peterdudfield
Contributor

peterdudfield commented Dec 4, 2024

pipeline to make samples.
Use or create a pipeline in ocf-data-sampler to make batches. This could be the site pipeline we have already.

The samples could then be used to train an ML model.

@peterdudfield changed the title from "pipeline to make batches. use or create a pipeline in ocf-datasample to make batches. This could be the site pipeline we have already, but I can imagine it needs adapting for this problem (but maybe not)" to "pipeline to make samples" on Dec 4, 2024
@peterdudfield peterdudfield transferred this issue from openclimatefix/PVNet Dec 4, 2024
@peterdudfield
Contributor Author

There is GFS data in the s3 bucket - s3://ocf-open-data-pvnet/data/gfs/
Need Target data #2

The pipeline to make samples is here

@siddharth7113
Contributor

Hi @peterdudfield,

I’ve been exploring the GFS dataset and the existing pipelines. I understand the goal is to process GFS data into Zarr format and create samples using the ocf-data-sampler pipeline.

I have a few clarifications to ensure alignment:

  1. Is the GFS data already converted to Zarr, or do we need to handle the conversion as part of this issue?
  2. Are there specific variables (e.g., temperature, radiation flux) that the pipeline should focus on, or should we process all variables in the dataset?
  3. Should the samples target specific regions or time periods, or should we process the entire dataset for all regions and times?

Thank you!

@peterdudfield
Contributor Author

peterdudfield commented Jan 13, 2025

GFS is already zarr; it's in s3, see #34.

Yeah, by default I would use all of them, and then we can always remove variables as needed.
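
If it helps, a minimal sketch of opening that data with xarray (a sketch only: it assumes s3fs is installed and the bucket allows anonymous reads, and the exact Zarr store path inside the bucket may differ):

    import xarray as xr

    # Open the GFS Zarr data from the S3 bucket mentioned above. Assumes s3fs
    # is installed and the bucket allows anonymous reads; the exact path to
    # the Zarr store inside the bucket may differ.
    gfs = xr.open_zarr(
        "s3://ocf-open-data-pvnet/data/gfs/",
        storage_options={"anon": True},
    )
    print(gfs.data_vars)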

@siddharth7113
Contributor

Hi @peterdudfield,

I've been working on the GFS pipeline and ran into a few challenges:

  1. Provider Issue:

    • GFS isn't listed as a provider in ocf-data-sampler. For now, I used "ecmwf" as a placeholder. Should we add GFS to the supported providers or is there another way to handle this?
  2. Channel Dimension:

    • The slice_datasets_by_time function expects a channel dimension, but GFS doesn’t have one. I worked around it by bypassing channel-based slicing and using all variables. Is this approach okay?
  3. Slicing Errors:

    • Ran into TimedeltaIndex errors during slicing. Fixed it temporarily by ensuring proper time conversions, but not sure if it's the best solution.
  4. Simplified Slicing:

    • The select_time_slice_nwp logic doesn’t align perfectly with GFS data. I tried a simplified slicing method instead. Would this be acceptable, or is there a preferred way?

I've opened a draft PR (#38) with the current state of the code. Could you take a look and let me know if I’m heading in the right direction or if there’s anything major I should adjust?

Thanks for your help!

@peterdudfield
Contributor Author

  1. Yes, please add gfs.
  2. When loading gfs, we could rename variables to channels.
  3. Hmm, can you share more?
  4. That's OK. I'm interested to hear why it doesn't work.

Thanks for doing all this

@siddharth7113
Contributor

  1. Provider Issue:
    To use GFS data, I need to add "gfs" to NWP_PROVIDERS in ocf-data-sampler/constants.py (https://github.com/openclimatefix/ocf-data-sampler/blob/main/ocf_data_sampler/constants.py). This seems straightforward, but I also noticed that NWPStatDict relies on normalization stats for providers. Should I add placeholder stats for GFS or leave them undefined for now?

  2. Channel Dimension:
    GFS data doesn't have a channel dimension, which caused issues with functions like slice_datasets_by_time. Initially, I removed the channel requirement but faced errors with missing dimensions. Since ocf-data-sampler doesn't have documentation yet, I am having trouble understanding how to correctly write the config file for GFS data. I have defined whatever I found appropriate in the config; could you please take a look at the draft PR?

  3. Slicing Errors:
    GFS uses init_time_utc and step, so I had to calculate time_utc as init_time + step (see the sketch after this comment). This resolved the slicing issues but feels like a hack. Is there a preferred way to handle this in the pipeline?

  4. Simplified Slicing:
    Since select_time_slice_nwp tightly depends on channel, I replaced it with .sel for direct slicing, bypassing the channel logic. This works but deviates from the original design. Should I continue with this simplified approach or adapt select_time_slice_nwp (again, the same channel config issue)?

I am mostly having trouble understanding the configs, so I am unable to define them correctly and test them. I tried some initial code in #38, but I am not sure if I am going in the right direction.

Thanks for your guidance!
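
To illustrate point 3, a minimal sketch of deriving time_utc from init_time_utc and step with xarray broadcasting (toy data, not the actual pipeline code):

    import numpy as np
    import pandas as pd
    import xarray as xr

    # Toy GFS-style array indexed by forecast init time and lead-time step.
    init_times = pd.to_datetime(["2024-12-01T00:00", "2024-12-01T06:00"])
    steps = pd.to_timedelta(np.arange(0, 12, 3), unit="h")
    da = xr.DataArray(
        np.random.rand(len(init_times), len(steps)),
        coords={"init_time_utc": init_times, "step": steps},
        dims=["init_time_utc", "step"],
    )

    # The valid time of each value is init_time + step; broadcasting the two
    # 1-D coordinates yields a 2-D (init_time_utc, step) time_utc coordinate.
    da = da.assign_coords(time_utc=da.init_time_utc + da.step)
    print(da.time_utc.values)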

@peterdudfield
Contributor Author

  1. You should be able to copy some from ocf-datapipes.
  2. Oh, you can change the list of data_vars to a dimension instead; I think that's the way to do it.
  3. Great.
  4. I would try 2.

@siddharth7113
Contributor

OK, I spent some more time debugging, and after going through it again I found a few things:


1. Challenges with select_time_slice_nwp

The select_time_slice_nwp function is tightly coupled to datasets that have a channel dimension. Since the GFS dataset lacks this dimension, the function fails to process the dataset correctly.

Here’s the traceback we encountered during processing:

Traceback (most recent call last):
  ...
KeyError: "No variable named 'channel'. Variables on the dataset include ['t', 'u10', 'v10', 'dlwrf', 'dswrf', ..., 'vis', 'init_time_utc', 'latitude', 'longitude', 'step']"
I tried two workarounds:

  1. Modifying the dataset to include a pseudo channel dimension by restructuring it into an xarray.Dataset with a channel coordinate. This added significant overhead for larger datasets.
  2. Monkey-patching the function to bypass the channel dimension entirely, treating each variable as its own "channel". While this worked partially, it introduced inconsistencies in slicing.

Core Issue:
The function assumes that the dataset can be indexed directly with channel_dim_name. In GFS, each variable is separate, so this assumption breaks.

Suggestion:
Refactor select_time_slice_nwp to:

  • Allow datasets without a channel dimension to work seamlessly.
  • Add a fallback for processing variables independently if channel_dim_name is not found (a sketch of this follows below).
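
A hypothetical shape for that fallback (a sketch only, not actual ocf-data-sampler code; the helper name is invented):

    import xarray as xr

    def ensure_channel_dim(data, channel_dim_name: str = "channel"):
        """Hypothetical helper (sketch only, not ocf-data-sampler code).

        Returns a DataArray that always has `channel_dim_name`: a Dataset's
        data variables are stacked into that dimension; a DataArray missing
        it gets a length-1 channel dimension added.
        """
        if isinstance(data, xr.Dataset):
            return data.to_array(dim=channel_dim_name)
        if channel_dim_name not in data.dims:
            return data.expand_dims(channel_dim_name)
        return data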

2. Redundant Configuration Parameters

Parameters like image_size_pixels_width and image_size_pixels_height are required in the YAML config under nwp. These seem unrelated to slicing or data sampling tasks for GFS.

  • These parameters are more relevant for image-based datasets but seem redundant here.
  • They add unnecessary complexity to the configuration file.

Can we make these parameters optional for NWP datasets or remove them where not applicable?


3. Non-adaptability of site.py for NWP

The site.py functions are simpler and more general, but they don't work directly for NWP, which needs additional handling for init_time, step, and accumulated variables.


4. Inefficiencies with Large Datasets

Restructuring the dataset to include a channel dimension is impractical for GFS datasets (terabytes of data). This approach introduces overhead and slows down processing.

@peterdudfield
Contributor Author

peterdudfield commented Jan 16, 2025 via email

I think something like this might help you: nwp: xr.DataArray = gfs.to_array(). It's from ocf_datapipes, our old data pipeline. P.S. I'm on holiday for a few days, but I'll get back to you when I can.

@siddharth7113
Contributor

siddharth7113 commented Jan 21, 2025


Thanks for telling me about the to_array() function; this potentially saved me a lot of time.

I think most things are done. I am just verifying that all the functions work well; then we can move on to comparison.
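
For reference, a minimal sketch of what that to_array() call does, using a toy Dataset in place of the real GFS data (the variable names are illustrative):

    import numpy as np
    import xarray as xr

    # Toy stand-in for the GFS Zarr dataset: one data variable per field.
    gfs = xr.Dataset(
        {
            "t": (("init_time_utc", "step"), np.random.rand(2, 4)),
            "dswrf": (("init_time_utc", "step"), np.random.rand(2, 4)),
        }
    )

    # to_array() stacks the data variables into a new leading dimension,
    # which can then act as the "channel" dimension the sampler expects.
    nwp: xr.DataArray = gfs.to_array(dim="channel")
    print(nwp.dims)            # ('channel', 'init_time_utc', 'step')
    print(nwp.channel.values)  # ['t' 'dswrf']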

@siddharth7113
Contributor

@peterdudfield

I was trying to write a test for checking normalization, but it was not working well, so I resorted to checking the values manually. While some are working well, some are not giving good plots. Could it be an issue on the code side, or are the data attributes themselves like that?

[attached: four plots of the normalized values]

@peterdudfield
Contributor Author

That's very interesting:

  1. which normalization constant did you use?
  2. how did you do the normalization?

@siddharth7113
Contributor

  1. Normalization Constants:
    I used the ECMWF values provided in ocf-data-sampler under NWP_MEANS and NWP_STDS.

  2. Normalization Function:
    I applied the normalization using the function below. I also inspected the dataset: apart from one attribute, which had about 50% NaN values, all variables had less than 5% NaN data:

    def _normalize_sample(self, dataset: xr.Dataset) -> xr.Dataset:
        """
        Normalize the dataset using precomputed means and standard deviations.
    
        Args:
            dataset (xr.Dataset): The dataset to normalize.
    
        Returns:
            xr.Dataset: The normalized dataset.
        """
        logging.info("Starting normalization...")
        provider = self.config.input_data.nwp.gfs.provider
        dataset_channels = dataset.channel.values
        mean_channels = NWP_MEANS[provider].channel.values
        std_channels = NWP_STDS[provider].channel.values
    
        valid_channels = set(dataset_channels) & set(mean_channels) & set(std_channels)
        missing_in_dataset = set(mean_channels) - set(dataset_channels)
        missing_in_means = set(dataset_channels) - set(mean_channels)
    
        if missing_in_dataset:
            logging.warning(f"Channels missing in dataset: {missing_in_dataset}")
        if missing_in_means:
            logging.warning(f"Channels missing in normalization stats: {missing_in_means}")
    
        valid_channels = list(valid_channels)
        dataset = dataset.sel(channel=valid_channels)
        means = NWP_MEANS[provider].sel(channel=valid_channels)
        stds = NWP_STDS[provider].sel(channel=valid_channels)
    
        logging.info(f"Selected Channels: {valid_channels}")
        logging.info(f"Mean Values: {means.values}")
        logging.info(f"Std Values: {stds.values}")
    
        try:
            normalized_dataset = (dataset - means) / stds
            logging.info("Normalization completed.")
            return normalized_dataset
        except Exception as e:
            logging.error(f"Error during normalization: {e}")
            raise e

    In practice:

    • The function computes the difference between each raw value and the channel mean.
    • This difference is then divided by the channel's standard deviation to scale it.
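
    A tiny self-contained example of the same arithmetic, with made-up numbers, showing how the xarray broadcasting behaves:

    import numpy as np
    import xarray as xr

    # Made-up numbers: two channels, two time steps.
    data = xr.DataArray(
        np.array([[10.0, 12.0], [300.0, 310.0]]),
        coords={"channel": ["u10", "t"]},
        dims=["channel", "time"],
    )
    means = xr.DataArray([11.0, 305.0], coords={"channel": ["u10", "t"]}, dims=["channel"])
    stds = xr.DataArray([1.0, 5.0], coords={"channel": ["u10", "t"]}, dims=["channel"])

    # Broadcasts along channel: (x - mean) / std per channel.
    normalized = (data - means) / stds
    print(normalized.values)  # [[-1.  1.]
                              #  [-1.  1.]]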

@peterdudfield
Contributor Author

Ah, it's because we haven't moved the GFS constants over to ocf-data-sampler. The old ones are here.

Do you want to try with these and see if the normalization works?

@siddharth7113
Contributor

  1. Would it be okay if I open a PR in ocf-data-sampler and import the values there?

  2. Also, there is one more thing I am not sure of: for some reason I need to define image_size_pixels_height and image_size_pixels_width, otherwise it throws an error. I don't think they are relevant for NWP data, but I checked the code and it seems to have something to do with the way this file is defined here.

@peterdudfield
Contributor Author

  1. Of course.
  2. Yeah, I see; it's because we take a region. Hmm, one option is we make them optional; another way round it is for you to just set them to 1, with a comment saying they are not used (a hypothetical config fragment follows below).
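
A hypothetical config fragment along those lines (the key names appear earlier in this thread, but the exact nesting is an assumption):

    nwp:
      gfs:
        provider: gfs
        # Required by the current config schema but not used for GFS
        # sampling; set to 1 as a no-op, per the suggestion above.
        image_size_pixels_height: 1
        image_size_pixels_width: 1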

@peterdudfield
Contributor Author

Hey @siddharth7113, did the recent changes in ocf-data-sampler help you?

@siddharth7113
Contributor

Hey @peterdudfield, unfortunately a bunch of PRs got merged in ocf-data-sampler, and one of those changes partly broke my code. I didn't get time to correct it; I'll work on it and update here in a few hours.

@peterdudfield
Contributor Author

No problem, thanks @siddharth7113. Any help is useful, so no pressure on this.

@siddharth7113
Contributor

[attached: three plots of the normalized values]

@peterdudfield Good news! Everything works as intended. Going to clean up the code, document things, and open a PR ASAP.

siddharth7113 mentioned this issue Jan 29, 2025