Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return type Matrix{Union{Missing, Float32}} #227

Closed
milankl opened this issue Sep 1, 2023 · 8 comments
Closed

Return type Matrix{Union{Missing, Float32}} #227

milankl opened this issue Sep 1, 2023 · 8 comments

Comments

@milankl
Copy link

milankl commented Sep 1, 2023

NCDatasets.jl returns a Array{Union{Missing,Float32}} (or Float64) when reading nc files with _fillValue specified, even if there aren't any missing values in the array. This can cause a bunch of problems downstream, for example with plotting

julia> A
3×4 Matrix{Union{Missing, Float32}}:
 0.278497  0.693684  0.841956  0.913807
 0.207002  0.450956  0.735774  0.541568
 0.072721  0.984475  0.659679  0.688397

julia> using PythonPlot

julia> pcolormesh(A)
ERROR: Python: TypeError: Image data of dtype object cannot be converted to float

(doing a conversion via Matrix{Float32}(A) would solve that though).

Maybe to start a discussion whether it's always sensible for NCDatasets to return Array{Union{Missing,Float32}}?

@Alexander-Barth
Copy link
Member

The problem is only after reading the whole array we can be sure that there a no missing values. Sometimes only a slice of the array is loaded. Python's netCDF4 library uses indeed a numpy array or masked arrays for the same variables depending whether a fillvalue is loaded or not (https://stackoverflow.com/questions/32046317/always-yield-a-masked-array-with-netcdf4) which can be quite confusing and it is not type-stable.

I typically use often the function nomissing, e.g. if a is Array{Union{Missing,Float64}}:

b =  nomissing(a,NaN); # returns a Matrix{Float64} and replace all missing by NaN
b =  nomissing(a); # returns a Matrix{Float64} but raises an error if there is an missing value

With the function cfvariable one can also ignore the _FillValue attribute:

ds = NCDataset(download("https://www.unidata.ucar.edu/software/netcdf/examples/tos_O1_2001-2002.nc"))
ncvar = cfvariable(ds,"tos",fillvalue=nothing,missing_value=[]); # similar to ds["tos"]
eltype(ncvar)
# Float32

Note of in this example file, _FillValue and missing_value are both defined in the NetCDF file and need to be overridden. Since the CF Standard allows to define multiple values as missing_value, it is set to the empty vector.

(I use PyPlot a lot myself and I wish that it could handle missing value, e.g. by automatically converting them to python masked arrays. See JuliaPy/PyCall.jl#616 )

I am open to any suggestion :-)

@milankl
Copy link
Author

milankl commented Sep 8, 2023

No I think I agree with you that NCDatasets.jl should return Array{Union{MIssing,Float32}} or (Float64) but that this either should be converted manually, or handled in other libraries such as for plotting.

@Alexander-Barth
Copy link
Member

OK, thank you for your feedback. Can this issue be closed?

@ctroupin
Copy link
Collaborator

Quick comment to @milankl :

Alex suggests to use b = nomissing(a,NaN);, another possibility is the function coalesce:

b = coalesce.(a, NaN)

@milankl milankl closed this as completed Sep 11, 2023
@milankl
Copy link
Author

milankl commented Aug 29, 2024

@Alexander-Barth I remember you told mere there's also some ignoremissing functionality, meaning reading out the raw float values, what's the recommended way to do that again? 🙈

@Alexander-Barth
Copy link
Member

Hi Milan,
Does this answer your question?
https://alexander-barth.github.io/NCDatasets.jl/dev/other/#Fill-values-and-missing-values

@milankl
Copy link
Author

milankl commented Aug 30, 2024

Yes, Alexander it does! I'm just confused about the crazy time and allocation differences.
Just reading in the raw data is fast (1200MB/s, there is some zlib decompression!) and allocates only what's needed (the array is ~92KB)

julia> ds["vor"]
vor (96 × 48 × 1 × 5)
  Datatype:    Union{Missing, Float32} (Float32)
  Dimensions:  lon × lat × layer × time
  Attributes:
   units                = s^-1
   long_name            = relative vorticity
   _FillValue           = NaN


julia> f(ds) = ds["vor"].var[:, :, :, :]

julia> @btime f($ds);
  75.415 μs (129 allocations: 95.45 KiB)

When using the default interface with Array{Union{Missing, Float32}} returned this is up to 10x slower and requires 10x more memory

julia> h(ds) = ds["vor"][:, :, :, :]

julia> @btime h($ds);
  757.944 μs (45704 allocations: 920.18 KiB)

julia> h(ds) = ds["vor"][:]   # faster, but returns a flat vector although variable is 4D

julia> @btime h($ds);
  101.149 μs (123 allocations: 207.57 KiB)

julia> h(ds) = Array(ds["vor"])
h (generic function with 1 method)

julia> @btime h($ds);
  795.377 μs (45705 allocations: 920.26 KiB)

I don't quite understand why the reshaping from vector <-> array{T, 4} matters here? And then nomissing is killer

julia> g(ds) = nomissing(ds["vor"])
g (generic function with 1 method)

julia> @btime g($ds);
  1.524 s (2949196 allocations: 133.69 MiB)

being 20,000x slower and using 1400x the memory? 🐌 Sorry maybe I should have asked that in another issue?

@Alexander-Barth
Copy link
Member

Can you re-make your benchmark where you cache the results of ds["vor"] like ncvar = ds["vor"]; f(ncvar) = Array(ncvar). For nomissing one should use nomissing(ncvar[:,:,:,:]) rather than nomissing(ncvar). In the later case you are loading every element individually. But yes, let's make a new issue with a reproducible example :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants