
Initial implementation of EUMETSAT IASI-NG reader #2879

Open
wants to merge 55 commits into base: main

Conversation

@roche-emmanuel (Contributor) commented Aug 9, 2024

This PR introduces the initial implementation of the IASI-NG reader for Satpy.

The changes provided here are self-contained in three files:

  • satpy/readers/iasi_ng_l2_nc.py: The implementation file for the reader.
  • satpy/etc/readers/iasi_ng_l2_nc.yaml: The YAML configuration file for the reader.
  • satpy/tests/reader_tests/test_iasi_ng_l2_nc.py: The unit tests for the reader.

There are still a few points I need to address (this list may be extended progressively):

  • Check support for products with two-letter names (O3_ and CO_), as I don't think they will match the current file pattern (see the illustrative sketch after this list).
  • Confirm with the EUMETSAT team whether the processing logic needs to be adjusted (awaiting clarification on some points concerning the product file specs).
  • Extend the unit tests to perform more comprehensive checks on different products (currently only testing with TWV).
  • Clean the code and remove debug outputs (e.g., print(...) statements).
  • Clarify what needs to be done/added concerning the documentation of this reader.
  • Add my name to the AUTHORS.md file.
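
For illustration, here is a minimal, self-contained sketch of a product-name token that would accept both three-letter names and underscore-padded two-letter names. The token and the test names below are hypothetical; the real file pattern lives in iasi_ng_l2_nc.yaml:

import re

# Hypothetical token: three alphanumerics, or two alphanumerics plus a
# padding underscore ("TWV", "O3_", "CO_"). Not the actual IASI-NG pattern.
product_token = re.compile(r"[A-Z0-9]{3}|[A-Z0-9]{2}_")

for name in ("TWV", "O3_", "CO_"):
    assert product_token.fullmatch(name) is not None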

The idea is to open this PR early in the development phase to get feedback and review notes from you, ensuring we are on the right track for eventual PR acceptance.

If you have any feedback on this PR already, please let me know 😊! In the meantime, I will continue working on the points mentioned above and will provide additional details on the changes as they are made on the feature branch.

Thanks for your help 🙏!

@roche-emmanuel (Contributor Author)

Alright, so I see from the workflows above that there are some failures due to the formatting of the file; I should probably fix this before moving forward with the remaining code changes 😉.


codecov bot commented Aug 30, 2024

Codecov Report

Attention: Patch coverage is 98.76797% with 6 lines in your changes missing coverage. Please review.

Project coverage is 96.13%. Comparing base (01237e2) to head (b1c5e4c).
Report is 62 commits behind head on main.

Files with missing lines                         Patch %   Lines
satpy/tests/reader_tests/test_iasi_ng_l2_nc.py   98.20%    5 Missing ⚠️
satpy/readers/iasi_ng_l2_nc.py                   99.52%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2879      +/-   ##
==========================================
+ Coverage   96.11%   96.13%   +0.02%     
==========================================
  Files         383      385       +2     
  Lines       55673    56160     +487     
==========================================
+ Hits        53511    53992     +481     
- Misses       2162     2168       +6     
Flag             Coverage Δ
behaviourtests   3.85% <0.00%> (-0.04%) ⬇️
unittests        96.23% <98.76%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.


@coveralls

coveralls commented Aug 30, 2024

Pull Request Test Coverage Report for Build 13135139872

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 483 of 489 (98.77%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.02%) to 96.244%

Changes Missing Coverage                         Covered Lines   Changed/Added Lines   %
satpy/readers/iasi_ng_l2_nc.py                   209             210                   99.52%
satpy/tests/reader_tests/test_iasi_ng_l2_nc.py   274             279                   98.21%

Totals Coverage Status:
Change from base Build 12905542942: 0.02%
Covered Lines: 54241
Relevant Lines: 56358

💛 - Coveralls

@roche-emmanuel marked this pull request as ready for review October 9, 2024 06:25
@roche-emmanuel (Contributor Author)

Hi everyone,

We now think that this PR is ready for a review by the Satpy team, so please let me know if you have any feedback on it. Thanks 😉!

@roche-emmanuel (Contributor Author)

Bump :-)!

Hi @djhoese , @mraspaud ,

I haven't heard back from you concerning this PR, so I was just wondering if there is anything else I could do on my side to help get this one into the "processing pipeline"? Please let me know if you have anything to suggest ;-)

Thanks! 🙏

@mraspaud (Member) left a comment


OK, I started reviewing this and it looks sensible, but I got distracted by all the comments, so I think we need to address that first.
So let's start by reading this: https://santim0ren0.medium.com/clean-code-comments-f9eac4ada16d

In this PR I can see different types of comments:

  • comments on what the next line is: if the name of the variable or function is good enough, they are not needed
  • comments that explain what the following lines do: these lines should then be refactored into a function with a clear name
  • comments that complement the docstrings: these should be integrated into said docstrings

I’m happy to answer any questions if you need help with this :)

(Five earlier review threads on satpy/readers/iasi_ng_l2_nc.py, now outdated and resolved.)
Comment on lines 160 to 173
# Prepare a description for this variable:
prefix, var_name = key.rsplit("/", 1)
dims = self.file_content[f"{key}/dimensions"]
dtype = self.file_content[f"{key}/dtype"]

desc = {
    "location": key,
    "prefix": prefix,
    "var_name": var_name,
    "shape": shape,
    "dtype": f"{dtype}",
    "dims": dims,
    "attribs": {},
}
Member

That comment surely means this code bit should be in its own function, called e.g. prepare_description_for_variable?

Contributor Author

Hi @mraspaud ,

Thank you very much for this initial feedback 😊

I've read the article you mentioned, and even if I don't fully agree with this perspective on code comments, I think I see your point, and I will start working on the requested changes ASAP. I'll let you know when all the fixes are integrated so you can have another look at this PR.

Thanks 🙏!

Contributor Author

Hello @mraspaud ,

I have now updated the content of the iasi_ng_l2_nc.py file, removing those excessive comments (just keeping a couple of "dev notes" at the end).

Now, just to be sure before continuing: you would also expect me to do the same for the test_iasi_ng_l2_nc.py file, right :-)?

Also, I'm not sure if you would expect me to "resolve" each of the points reported above myself, or if you prefer to review the changes first and then resolve those points yourself when you judge the result acceptable (please just let me know if you think I should do it myself).

Thanks!

Member

Thanks a lot, looks much better now!
Yes, it would be great if you could do the same in the test file.
Regarding the resolution of comments, I like it when the original commenter (in this case me) clicks the resolve button.

Contributor Author

Okay 👌copy that: so I'm updating the test file now.

Contributor Author

Hi @mraspaud ,

I have now also updated the test_iasi_ng_l2_nc.py file as discussed above. So when you find some time to give this PR another look, please let me know if you notice anything else that we should discuss/change.

Thanks in advance for your time on this 🙏!

Contributor Author

Hi @mraspaud ,

I haven't heard back from you concerning this PR for quite some time now, so this is just a "ping message" in case it slipped your mind and you can now find some time to allocate to it ;-). Thanks 🙏!

Member

Yeah, sorry, I had a long holiday. Reviewed now.

@mraspaud (Member) left a comment

Thanks a lot for the hard work on the reader. I have mostly questions inline, but the one I'm a bit concerned about is the ability to work with actual 3D data…

Comment on lines +80 to +86
def _collect_groups_info(self, base_name, obj):
    for group_name, group_obj in obj.groups.items():
        full_group_name = base_name + group_name
        self.file_content[full_group_name] = group_obj
        self._collect_attrs(full_group_name, group_obj)
        self.collect_metadata(full_group_name, group_obj)
        self.collect_dimensions(full_group_name, group_obj)
Member

Should this be covered by tests?

Contributor Author

Hi @mraspaud ,

Thank you very much for your time reviewing this PR! Let's see what we can do about those additional points...

Hmmm, well... the first thing I notice here, after pulling all the latest changes from the satpy main branch, is that my Python environment now seems to be broken (on Windows), since I get this error when trying to run the unit tests for this reader:

ImportError while loading conftest 'D:\Projects\satpy\satpy\tests\reader_tests\conftest.py'.
reader_tests\conftest.py:29: in <module>
    from xarray import DataTree
E   ImportError: cannot import name 'DataTree' from 'xarray' (D:\Projects\NervProj\.pyenvs\eum_resp2_t2_satpy_env\lib\site-packages\xarray\__init__.py)

(even after rebuilding this Python env from scratch, so I should have the latest version of xarray, etc.)

=> Strange 🤔... (I'm not completely sure about that, but I think the "DataTree" class is a pretty new addition to xarray, right?) Anyway, I'm investigating this further now.

Contributor Author

(=> OK, the xarray DataTree error is clarified: I was still installing another package called "xarray-datatree" in my env requirements, and because of this the xarray module was stuck on an old version (v2024.7.0). If I remove the "xarray-datatree" package, then I get a recent version of xarray and no error when running the tests.)

Contributor Author

Alright, so I started with the last point reported below (i.e. using actual temporary NetCDF files instead of the mock framework), and with that change the code discussed in this thread is now reported as covered by the unit tests, as far as I can tell from checking the iasi_ng_l2_nc.py file at https://app.codecov.io/gh/pytroll/satpy/pull/2879

Comment on lines +234 to +244
def is_attribute_path(self, var_path):
    """Check if a given path is a root attribute path."""
    return var_path.startswith("/attr")

def is_property_path(self, var_path):
    """Check if a given path is a sub-property path."""
    return var_path.endswith(("/dtype", "/shape", "/dimensions"))

def is_netcdf_group(self, obj):
    """Check if a given object is a netCDF group."""
    return isinstance(obj, netCDF4.Group)
Member

Looks like this is trying to revert something that the base netcdf filehandler is building. Does it show that maybe the base netcdf filehandler class is not adapted to this case, and that it would make more sense to go with e.g. an xarray.open_dataset(…) call?
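
A minimal sketch of that alternative, for illustration only (the file name, group path, and variable name below are hypothetical, not taken from the IASI-NG product spec):

import xarray as xr

# Open one NetCDF group directly as an xarray Dataset:
ds = xr.open_dataset("iasi_ng_l2_sample.nc", group="data/geolocation_information")
latitudes = ds["latitude"]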

Contributor Author

Hmmmm, well, from my perspective, the implementation we have in this reader rather makes appropriate use of the base NetCDF4FileHandler class: this reader depends heavily on the "self.file_content" member to analyze the available variables/dimensions/attributes, so I think it really makes sense to use the NetCDF4FileHandler class to access this instead of rebuilding it completely from scratch (if we were to open the file using only xarray directly, for instance).

And at the same time, we need a few small changes/extensions compared to the features provided by the base netcdf class, so to me it sounds appropriate to also override the base implementation of "_collect_groups_info", for instance, and provide a few additional helper methods (i.e. those listed above).
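
To illustrate that dependency, here is the kind of access pattern this reader's handler relies on (assuming fh is an instance of the handler; the variable path below is made up, and the "/dimensions" and "/dtype" sub-keys are the ones this PR's handler builds):

# Keys exposed through the file handler's file_content mapping:
key = "data/statistical_retrieval/air_temperature"  # hypothetical path
var = fh.file_content[key]                          # the variable itself
dims = fh.file_content[f"{key}/dimensions"]         # its dimension names
dtype = fh.file_content[f"{key}/dtype"]             # its dtype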

Comment on lines +260 to +324
def convert_data_type(self, data_array, dtype="auto"):
    """Convert the data type if applicable."""
    attribs = data_array.attrs

    cur_dtype = np.dtype(data_array.dtype).name

    if dtype == "auto" and cur_dtype in ["float32", "float64"]:
        dtype = cur_dtype

    to_float = "scale_factor" in attribs or "add_offset" in attribs

    if dtype == "auto":
        dtype = "float64" if to_float else cur_dtype

    if cur_dtype != dtype:
        data_array = data_array.astype(dtype)

    return data_array

def apply_fill_value(self, data_array):
    """Apply the fill-value masking on a given array."""
    dtype = np.dtype(data_array.dtype).name
    if dtype not in ["float32", "float64"]:
        return data_array

    nan_val = np.nan if dtype == "float64" else np.float32(np.nan)
    attribs = data_array.attrs

    if "valid_min" in attribs:
        vmin = attribs["valid_min"]
        data_array = data_array.where(data_array >= vmin, other=nan_val)

    if "valid_max" in attribs:
        vmax = attribs["valid_max"]
        data_array = data_array.where(data_array <= vmax, other=nan_val)

    if "valid_range" in attribs:
        vrange = attribs["valid_range"]
        data_array = data_array.where(data_array >= vrange[0], other=nan_val)
        data_array = data_array.where(data_array <= vrange[1], other=nan_val)

    missing_val = attribs.get("missing_value", None)
    missing_val = attribs.get("_FillValue", missing_val)

    if missing_val is None:
        return data_array

    return data_array.where(data_array != missing_val, other=nan_val)

def apply_rescaling(self, data_array):
    """Apply the rescaling transform on a given array."""
    attribs = data_array.attrs
    if "scale_factor" in attribs or "add_offset" in attribs:
        scale_factor = attribs.setdefault("scale_factor", 1)
        add_offset = attribs.setdefault("add_offset", 0)

        data_array = (data_array * scale_factor) + add_offset

        for key in ["valid_range", "valid_min", "valid_max"]:
            if key in attribs:
                attribs[key] = attribs[key] * scale_factor + add_offset

        data_array.attrs.update(attribs)

    return data_array
Member

The base netcdf file handler has a maskandscale parameter; could it be used instead of this?

Contributor Author

I have just tried to enable this when creating the reader:

def __init__(self, filename, filename_info, filetype_info, **kwargs):
    """Initialize object."""
    super().__init__(
        filename, filename_info, filetype_info, auto_maskandscale=True, **kwargs
    )

and then disabled the calls to the methods listed above. This seems to work fine in most cases indeed, but then the following unit test will fail:

def test_nbr_iterations_dataset(self, twv_scene):
    """Test loading the nbr_iterations dataset."""
    twv_scene.load(["nbr_iterations"])
    dset = twv_scene["nbr_iterations"]

    assert len(dset.dims) == 2
    assert dset.dtype == np.int32

E   AssertionError: assert dtype('float64') == <class 'numpy.int32'>

=> The original datatype for this array is "int32" in the netcdf file, but there is also a "missing_value" attribute for it, and it seems that in this case the auto_maskandscale flag will automatically convert the array to "float64", which we don't want. So I'm not quite sure how we could make proper use of this? (=> unless you have a suggestion on this point, I think we will need to keep the code above to deal with the int32 arrays as expected)
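
For reference, a minimal standalone reproduction of that dtype promotion with plain xarray (not code from the reader; the variable name and file name are arbitrary):

import numpy as np
import xarray as xr

# An int32 variable written with a _FillValue in its encoding...
ds = xr.Dataset({"nbr_iterations": ("x", np.array([1, 2, 3], dtype=np.int32))})
ds.to_netcdf("tmp.nc", encoding={"nbr_iterations": {"_FillValue": -999}})

# ...comes back as float64 when mask-and-scale decoding is on (the default),
# so that fill values can become NaN, and stays int32 when it is off:
assert xr.open_dataset("tmp.nc")["nbr_iterations"].dtype == np.float64
assert xr.open_dataset("tmp.nc", mask_and_scale=False)["nbr_iterations"].dtype == np.int32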

Comment on lines 345 to 357
def convert_to_datetime(self, data_array, ds_info):
"""Convert the data to datetime values."""
epoch = ds_info["seconds_since_epoch"]

# Note: converting the time values to ns precision to avoid warnings
# from panda+numpy:
data_array = xr.DataArray(
data=pd.to_datetime(epoch) + (data_array * 1e9).astype("timedelta64[ns]"),
dims=data_array.dims,
attrs=data_array.attrs,
)

return data_array
Member

Isn't there an xarray utility function to do this? I think it's supposed to support cf time units

Contributor Author

I've tried to investigate this a bit further, and there is indeed an xr.decode_cf(dset) function, but this only works on a Dataset object, not a DataArray.

I tried a few things, but I don't seem to be able to get it to work as desired in this case. I also tried to perform this decoding directly when opening the xarray dataset, using the following:

class IASINGL2NCFileHandler(NetCDF4FsspecFileHandler):
    """Reader for IASI-NG L2 products in NetCDF format."""

    def __init__(self, filename, filename_info, filetype_info, **kwargs):
        """Initialize object."""
        super().__init__(
            filename,
            filename_info,
            filetype_info,
            xarray_kwargs={"decode_cf": True},
            **kwargs,
        )

But the datetime-checking unit tests are still failing with this 😅. Any suggestions on this point from your side?
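
One possible workaround to sketch here (untested against the real product; the units string below is made up): wrap the DataArray in a throwaway Dataset so that xr.decode_cf can be applied, assuming the variable carries CF-style time units:

import numpy as np
import xarray as xr

# Hypothetical DataArray with a CF time "units" attribute:
data_array = xr.DataArray(
    np.array([0.0, 1.5, 3.0]),
    dims=("x",),
    attrs={"units": "seconds since 2020-01-01 00:00:00"},
)

# decode_cf only accepts a Dataset, so wrap the array temporarily:
decoded = xr.decode_cf(xr.Dataset({"onboard_utc": data_array}))["onboard_utc"]
print(decoded.values)  # datetime64[ns] timestamps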

Comment on lines +362 to +364
if dim_name not in self.dimensions_desc:
    raise KeyError(f"Invalid dimension name {dim_name}")
rep_count = self.dimensions_desc[dim_name]
Member

Suggested change

- if dim_name not in self.dimensions_desc:
-     raise KeyError(f"Invalid dimension name {dim_name}")
- rep_count = self.dimensions_desc[dim_name]
+ try:
+     rep_count = self.dimensions_desc[dim_name]
+ except KeyError:
+     raise KeyError(f"Invalid dimension name {dim_name}")

Contributor Author

Hi @mraspaud, I'm not quite sure I understand the reason behind this suggested code change. Could you please elaborate a bit on why you think it is an improvement over the existing code?

I mean, from my current perspective, the end effect is exactly the same in both cases, but the try/except version would make the intent a little less easy to understand, add a very small (probably negligible, sure) performance cost with the try/except block, and break the overall design consistency of the reader implementation (i.e. this reader file doesn't have any other try/except block for now: it deals with all the known failing conditions by raising exceptions directly, and lets the caller deal with other, unexpected exceptions).

Of course, I can make this change if you really think it's a good move, but if you have a minute I would be curious to have more details on your point of view on this topic first.

    raise KeyError(f"Invalid dimension name {dim_name}")
rep_count = self.dimensions_desc[dim_name]

data_array = xr.concat([data_array] * rep_count, dim=data_array.dims[-1])
Member

I'm not familiar with the format; can you tell me why replicating the data is necessary?

Contributor Author

Well, it's not that replicating the data is strictly "necessary". The rationale is rather as follows:

  • The reader should make it easy for end users to access the data provided by the instrument in the most convenient/standard/expected format, and it should reduce the need for technical understanding of the file content structure if possible.
  • On this instrument, the generated data will be 3D for most datasets, since the dimensions used will be (n_lines, n_for, n_fov).
  • But the "onboard_utc" variable will only be 2D with shape (n_lines, n_for), because the same timestamp applies to all the elements along the n_fov dim for a given (n_lines, n_for) pair.
  • So to make this dataset match the number of elements in the other datasets, we replicate the source elements "n_fov" times, while still keeping the data array 2D for user convenience (more on this below, in fact).

Note: we in fact have a "configuration entry" in the reader YAML file to control whether this transformation should happen or not, and as you can see, this configuration entry is set to True, simply because this seems the most appropriate setting for the typical usage cases where we have tested this reader at EUMETSAT. And for now, I don't see any reason myself why it could be beneficial for the end user to disable this timestamp broadcasting.
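
As a toy illustration of that replication (shapes only; the dimension names follow the list above, and the values are arbitrary):

import numpy as np
import xarray as xr

# A (n_lines, n_for) timestamp array repeated n_fov times along its last
# dimension, so it lines up element-wise with the other datasets:
n_lines, n_for, n_fov = 2, 3, 4
onboard_utc = xr.DataArray(
    np.arange(n_lines * n_for, dtype="float64").reshape(n_lines, n_for),
    dims=("n_lines", "n_for"),
)
replicated = xr.concat([onboard_utc] * n_fov, dim="n_for")
assert replicated.shape == (n_lines, n_for * n_fov)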

Comment on lines +326 to +343
def apply_reshaping(self, data_array):
    """Apply the reshaping transform on a given IASI-NG data array.

    Those arrays may come as a 3D array, in which case we collapse the
    last 2 dimensions on a single axis (ie. the number of columns or "y").
    In the process, we also rename the first axis to "x".
    """
    if len(data_array.dims) > 2:
        data_array = data_array.stack(y=(data_array.dims[1:]))

    if data_array.dims[0] != "x":
        data_array = data_array.rename({data_array.dims[0]: "x"})

    if data_array.dims[1] != "y":
        data_array = data_array.rename({data_array.dims[1]: "y"})

    return data_array
Member

Is 3D data not working as you expect in satpy? I would rather we keep 3D arrays as such and not flatten them, if possible.

@pytest.fixture(autouse=True, scope="class")
def fake_handler(self):
    """Wrap NetCDF4 FileHandler with our own fake handler."""
    patch_ctx = mock.patch.object(
Member

We're trying in satpy to move away from mocking as much as possible, and for readers that means creating stub files in tmp_path. It looks like you have all the necessary elements to create a test NetCDF file here, right?
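
A minimal sketch of that approach with pytest's tmp_path (the fixture name, file name, and file structure below are illustrative only, not the real IASI-NG product layout):

import numpy as np
import pytest
import xarray as xr

@pytest.fixture
def twv_file(tmp_path):
    """Write a minimal stub NetCDF file into a temporary directory."""
    fname = tmp_path / "iasi_ng_twv_stub.nc"
    ds = xr.Dataset({"onboard_utc": (("n_lines", "n_for"), np.zeros((2, 3)))})
    ds.to_netcdf(fname)
    return fname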
