pipeline to make samples #1

Open
Tracked by #5
peterdudfield opened this issue Dec 4, 2024 · 20 comments

Comments

@peterdudfield
Contributor

peterdudfield commented Dec 4, 2024

pipeline to make samples.
Use or create a pipeline in ocf-data-sampler to make batches. This could be the site pipeline we have already.

The samples could then be used to train an ML model.

@peterdudfield changed the title from "pipeline to make batches. use or create a pipeline in ocf-datasample to make batches. This could be the site pipeline we have already, but I can imagine it needs adapting for this problem (but maybe not)" to "pipeline to make samples" on Dec 4, 2024
@peterdudfield peterdudfield transferred this issue from openclimatefix/PVNet Dec 4, 2024
@peterdudfield
Contributor Author

There is GFS data in the s3 bucket - s3://ocf-open-data-pvnet/data/gfs/
Need Target data #2

The pipeline to make samples is here

@siddharth7113
Contributor

Hi @peterdudfield,

I’ve been exploring the GFS dataset and the existing pipelines. I understand the goal is to process GFS data into Zarr format and create samples using the ocf-data-sampler pipeline.

I have a few clarifications to ensure alignment:

  1. Is the GFS data already converted to Zarr, or do we need to handle the conversion as part of this issue?
  2. Are there specific variables (e.g., temperature, radiation flux) that the pipeline should focus on, or should we process all variables in the dataset?
  3. Should the samples target specific regions or time periods, or should we process the entire dataset for all regions and times?

Thank you!

@peterdudfield
Contributor Author

peterdudfield commented Jan 13, 2025

GFS is already zarr; it's in s3, see #34.

Yeah, by default I would use all of them, and then we can always remove variables as needed.
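
If it helps, a minimal sketch of opening that data with xarray (a sketch only: it assumes s3fs is installed and the bucket allows anonymous reads, and the exact Zarr store path inside the bucket may differ):

    import xarray as xr

    # Open the GFS Zarr data from the S3 bucket mentioned above. Assumes s3fs
    # is installed and the bucket allows anonymous reads; the exact path to
    # the Zarr store inside the bucket may differ.
    gfs = xr.open_zarr(
        "s3://ocf-open-data-pvnet/data/gfs/",
        storage_options={"anon": True},
    )
    print(gfs.data_vars)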

@siddharth7113
Contributor

Hi @peterdudfield,

I've been working on the GFS pipeline and ran into a few challenges:

  1. Provider Issue:

    • GFS isn't listed as a provider in ocf-data-sampler. For now, I used "ecmwf" as a placeholder. Should we add GFS to the supported providers or is there another way to handle this?
  2. Channel Dimension:

    • The slice_datasets_by_time function expects a channel dimension, but GFS doesn’t have one. I worked around it by bypassing channel-based slicing and using all variables. Is this approach okay?
  3. Slicing Errors:

    • Ran into TimedeltaIndex errors during slicing. Fixed it temporarily by ensuring proper time conversions, but not sure if it's the best solution.
  4. Simplified Slicing:

    • The select_time_slice_nwp logic doesn’t align perfectly with GFS data. I tried a simplified slicing method instead. Would this be acceptable, or is there a preferred way?

I've opened a draft PR (#38) with the current state of the code. Could you take a look and let me know if I’m heading in the right direction or if there’s anything major I should adjust?

Thanks for your help!

@peterdudfield
Contributor Author

  1. Yes, please add gfs.
  2. When loading gfs, we could rename variables to channels.
  3. Hmm, can you share more?
  4. That's OK. I'm interested to hear why it doesn't work.

Thanks for doing all this

@siddharth7113
Contributor

  1. Provider Issue:
    To use GFS data, I need to add "gfs" to NWP_PROVIDERS in ocf-data-sampler/constants.py (https://github.com/openclimatefix/ocf-data-sampler/blob/main/ocf_data_sampler/constants.py). This seems straightforward, but I also noticed that NWPStatDict relies on normalization stats for providers. Should I add placeholder stats for GFS or leave them undefined for now?

  2. Channel Dimension:
    GFS data doesn't have a channel dimension, which caused issues with functions like slice_datasets_by_time. Initially, I removed the channel requirement but faced errors with missing dimensions. Since ocf-data-sampler doesn't have documentation yet, I am having trouble understanding how to correctly write the config file for GFS data. I have defined whatever I found appropriate in the config; could you please take a look at the draft PR?

  3. Slicing Errors:
    GFS uses init_time_utc and step, so I had to calculate time_utc as init_time + step (see the sketch after this comment). This resolved the slicing issues but feels like a hack. Is there a preferred way to handle this in the pipeline?

  4. Simplified Slicing:
    Since select_time_slice_nwp tightly depends on channel, I replaced it with .sel for direct slicing, bypassing the channel logic. This works but deviates from the original design. Should I continue with this simplified approach or adapt select_time_slice_nwp (again, the same channel config issue)?

I am mostly having trouble understanding the configs, so I am unable to define them correctly and test them. I tried some initial code in #38, but I am not sure if I am going in the right direction.

Thanks for your guidance!
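
To illustrate point 3, a minimal sketch of deriving time_utc from init_time_utc and step with xarray broadcasting (toy data, not the actual pipeline code):

    import numpy as np
    import pandas as pd
    import xarray as xr

    # Toy GFS-style array indexed by forecast init time and lead-time step.
    init_times = pd.to_datetime(["2024-12-01T00:00", "2024-12-01T06:00"])
    steps = pd.to_timedelta(np.arange(0, 12, 3), unit="h")
    da = xr.DataArray(
        np.random.rand(len(init_times), len(steps)),
        coords={"init_time_utc": init_times, "step": steps},
        dims=["init_time_utc", "step"],
    )

    # The valid time of each value is init_time + step; broadcasting the two
    # 1-D coordinates yields a 2-D (init_time_utc, step) time_utc coordinate.
    da = da.assign_coords(time_utc=da.init_time_utc + da.step)
    print(da.time_utc.values)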

@peterdudfield
Contributor Author

  1. You should be able to copy some from ocf-datapipes.
  2. Oh, you can change the list of data_vars to a dimension instead; I think that's the way to do it.
  3. Great.
  4. I would try 2.

@siddharth7113
Contributor

OK, I spent some more time debugging, and after going through it again I found a few things:


1. Challenges with select_time_slice_nwp

The select_time_slice_nwp function is tightly coupled to datasets that have a channel dimension. Since the GFS dataset lacks this dimension, the function fails to process the dataset correctly.

Here’s the traceback we encountered during processing:

Traceback (most recent call last):
  ...
KeyError: "No variable named 'channel'. Variables on the dataset include ['t', 'u10', 'v10', 'dlwrf', 'dswrf', ..., 'vis', 'init_time_utc', 'latitude', 'longitude', 'step']"
I tried two workarounds:

  1. Modifying the dataset to include a pseudo channel dimension by restructuring it into an xarray.Dataset with a channel coordinate. This added significant overhead for larger datasets.
  2. Monkey-patching the function to bypass the channel dimension entirely, treating each variable as its own "channel". While this worked partially, it introduced inconsistencies in slicing.

Core Issue:
The function assumes that the dataset can be indexed directly with channel_dim_name. In GFS, each variable is separate, so this assumption breaks.

Suggestion:
Refactor select_time_slice_nwp to:

  • Allow datasets without a channel dimension to work seamlessly.
  • Add a fallback for processing variables independently if channel_dim_name is not found (a sketch of this follows below).
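
A hypothetical shape for that fallback (a sketch only, not actual ocf-data-sampler code; the helper name is invented):

    import xarray as xr

    def ensure_channel_dim(data, channel_dim_name: str = "channel"):
        """Hypothetical helper (sketch only, not ocf-data-sampler code).

        Returns a DataArray that always has `channel_dim_name`: a Dataset's
        data variables are stacked into that dimension; a DataArray missing
        it gets a length-1 channel dimension added.
        """
        if isinstance(data, xr.Dataset):
            return data.to_array(dim=channel_dim_name)
        if channel_dim_name not in data.dims:
            return data.expand_dims(channel_dim_name)
        return data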

2. Redundant Configuration Parameters

Parameters like image_size_pixels_width and image_size_pixels_height are required in the YAML config under nwp. These seem unrelated to slicing or data sampling tasks for GFS.

  • These parameters are more relevant for image-based datasets but seem redundant here.
  • They add unnecessary complexity to the configuration file.

Can we make these parameters optional for NWP datasets or remove them where not applicable?


3. Non-adaptability of site.py for NWP

The site.py functions are simpler and more general, but they don't work directly for NWP, which needs additional handling for init_time, step, and accumulated variables.


4. Inefficiencies with Large Datasets

Restructuring the dataset to include a channel dimension is impractical for GFS datasets (terabytes of data). This approach introduces overhead and slows down processing.

@peterdudfield
Contributor Author

peterdudfield commented Jan 16, 2025 via email

I think something like this might help you: nwp: xr.DataArray = gfs.to_array(). It's from ocf_datapipes, our old data pipeline. P.S. I'm on holiday for a few days, but I'll get back to you when I can.

@siddharth7113
Contributor

siddharth7113 commented Jan 21, 2025


Thanks for telling me about the to_array() function; this potentially saved me a lot of time.

I think most things are done. I am just verifying that all the functions work well; then we can move on to comparison.
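
For reference, a minimal sketch of what that to_array() call does, using a toy Dataset in place of the real GFS data (the variable names are illustrative):

    import numpy as np
    import xarray as xr

    # Toy stand-in for the GFS Zarr dataset: one data variable per field.
    gfs = xr.Dataset(
        {
            "t": (("init_time_utc", "step"), np.random.rand(2, 4)),
            "dswrf": (("init_time_utc", "step"), np.random.rand(2, 4)),
        }
    )

    # to_array() stacks the data variables into a new leading dimension,
    # which can then act as the "channel" dimension the sampler expects.
    nwp: xr.DataArray = gfs.to_array(dim="channel")
    print(nwp.dims)            # ('channel', 'init_time_utc', 'step')
    print(nwp.channel.values)  # ['t' 'dswrf']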

@siddharth7113
Contributor

@peterdudfield

I was trying to write a test for checking normalization, but it was not working well, so I resorted to checking the values manually. While some are working well, some are not giving good plots. Could it be an issue on the code side, or are the data attributes themselves like that?

[attached: four plots of the normalized values]

@peterdudfield
Contributor Author

That's very interesting:

  1. which normalization constant did you use?
  2. how did you do the normalization?

@siddharth7113
Contributor

  1. Normalization Constants:
    I used the ECMWF values provided in ocf-data-sampler under NWP_MEANS and NWP_STDS.

  2. Normalization Function:
    I applied the normalization using the function below. I also inspected the dataset: apart from one attribute, which had about 50% NaN values, all variables had less than 5% NaN data:

    def _normalize_sample(self, dataset: xr.Dataset) -> xr.Dataset:
        """
        Normalize the dataset using precomputed means and standard deviations.
    
        Args:
            dataset (xr.Dataset): The dataset to normalize.
    
        Returns:
            xr.Dataset: The normalized dataset.
        """
        logging.info("Starting normalization...")
        provider = self.config.input_data.nwp.gfs.provider
        dataset_channels = dataset.channel.values
        mean_channels = NWP_MEANS[provider].channel.values
        std_channels = NWP_STDS[provider].channel.values
    
        valid_channels = set(dataset_channels) & set(mean_channels) & set(std_channels)
        missing_in_dataset = set(mean_channels) - set(dataset_channels)
        missing_in_means = set(dataset_channels) - set(mean_channels)
    
        if missing_in_dataset:
            logging.warning(f"Channels missing in dataset: {missing_in_dataset}")
        if missing_in_means:
            logging.warning(f"Channels missing in normalization stats: {missing_in_means}")
    
        valid_channels = list(valid_channels)
        dataset = dataset.sel(channel=valid_channels)
        means = NWP_MEANS[provider].sel(channel=valid_channels)
        stds = NWP_STDS[provider].sel(channel=valid_channels)
    
        logging.info(f"Selected Channels: {valid_channels}")
        logging.info(f"Mean Values: {means.values}")
        logging.info(f"Std Values: {stds.values}")
    
        try:
            normalized_dataset = (dataset - means) / stds
            logging.info("Normalization completed.")
            return normalized_dataset
        except Exception as e:
            logging.error(f"Error during normalization: {e}")
            raise e

    In practice:

    • The function computes the difference between each raw value and the channel mean.
    • This difference is then divided by the channel's standard deviation to scale it.
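
    A tiny self-contained example of the same arithmetic, with made-up numbers, showing how the xarray broadcasting behaves:

    import numpy as np
    import xarray as xr

    # Made-up numbers: two channels, two time steps.
    data = xr.DataArray(
        np.array([[10.0, 12.0], [300.0, 310.0]]),
        coords={"channel": ["u10", "t"]},
        dims=["channel", "time"],
    )
    means = xr.DataArray([11.0, 305.0], coords={"channel": ["u10", "t"]}, dims=["channel"])
    stds = xr.DataArray([1.0, 5.0], coords={"channel": ["u10", "t"]}, dims=["channel"])

    # Broadcasts along channel: (x - mean) / std per channel.
    normalized = (data - means) / stds
    print(normalized.values)  # [[-1.  1.]
                              #  [-1.  1.]]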

@peterdudfield
Contributor Author

Ah, it's because we haven't moved the GFS constants over to ocf-data-sampler. The old ones are here.

Do you want to try with these and see if the normalization works?

@siddharth7113
Contributor

  1. Would it be okay if I open a PR in ocf-data-sampler and import the values there?

  2. Also, there is one more thing I am not sure of: for some reason I need to define image_size_pixels_height and image_size_pixels_width, otherwise it throws an error. I don't think they are relevant for NWP data, but I checked the code and it seems to have something to do with the way this file is defined here.

@peterdudfield
Contributor Author

  1. Of course.
  2. Yeah, I see; it's because we take a region. Hmm, one option is we make them optional; another way round it is for you to just set them to 1, with a comment saying they are not used (a hypothetical config fragment follows below).
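
A hypothetical config fragment along those lines (the key names appear earlier in this thread, but the exact nesting is an assumption):

    nwp:
      gfs:
        provider: gfs
        # Required by the current config schema but not used for GFS
        # sampling; set to 1 as a no-op, per the suggestion above.
        image_size_pixels_height: 1
        image_size_pixels_width: 1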

@peterdudfield
Contributor Author

Hey @siddharth7113, did the recent changes in ocf-data-sampler help you?

@siddharth7113
Contributor

Hey @peterdudfield, unfortunately a bunch of PRs got merged in ocf-data-sampler, and one of those changes partly broke my code. I didn't get time to correct it; I'll work on it and update here in a few hours.

@peterdudfield
Contributor Author

No problem, thanks @siddharth7113. Any help is useful, so no pressure on this.

@siddharth7113
Contributor

[attached: three plots of the normalized values]

@peterdudfield Good news! Everything works as intended. Going to clean up the code, document things, and open a PR ASAP.

siddharth7113 mentioned this issue Jan 29, 2025