Return type `Matrix{Union{Missing, Float32}}` #227

milankl · 2023-09-01T20:49:04Z

NCDatasets.jl returns a Array{Union{Missing,Float32}} (or Float64) when reading nc files with _fillValue specified, even if there aren't any missing values in the array. This can cause a bunch of problems downstream, for example with plotting

julia> A
3×4 Matrix{Union{Missing, Float32}}:
 0.278497  0.693684  0.841956  0.913807
 0.207002  0.450956  0.735774  0.541568
 0.072721  0.984475  0.659679  0.688397

julia> using PythonPlot

julia> pcolormesh(A)
ERROR: Python: TypeError: Image data of dtype object cannot be converted to float

(doing a conversion via Matrix{Float32}(A) would solve that though).

Maybe to start a discussion whether it's always sensible for NCDatasets to return Array{Union{Missing,Float32}}?

The text was updated successfully, but these errors were encountered:

Alexander-Barth · 2023-09-02T20:31:32Z

The problem is only after reading the whole array we can be sure that there a no missing values. Sometimes only a slice of the array is loaded. Python's netCDF4 library uses indeed a numpy array or masked arrays for the same variables depending whether a fillvalue is loaded or not (https://stackoverflow.com/questions/32046317/always-yield-a-masked-array-with-netcdf4) which can be quite confusing and it is not type-stable.

I typically use often the function nomissing, e.g. if a is Array{Union{Missing,Float64}}:

b =  nomissing(a,NaN); # returns a Matrix{Float64} and replace all missing by NaN
b =  nomissing(a); # returns a Matrix{Float64} but raises an error if there is an missing value

With the function cfvariable one can also ignore the _FillValue attribute:

ds = NCDataset(download("https://www.unidata.ucar.edu/software/netcdf/examples/tos_O1_2001-2002.nc"))
ncvar = cfvariable(ds,"tos",fillvalue=nothing,missing_value=[]); # similar to ds["tos"]
eltype(ncvar)
# Float32

Note of in this example file, _FillValue and missing_value are both defined in the NetCDF file and need to be overridden. Since the CF Standard allows to define multiple values as missing_value, it is set to the empty vector.

(I use PyPlot a lot myself and I wish that it could handle missing value, e.g. by automatically converting them to python masked arrays. See JuliaPy/PyCall.jl#616 )

I am open to any suggestion :-)

milankl · 2023-09-08T20:10:31Z

No I think I agree with you that NCDatasets.jl should return Array{Union{MIssing,Float32}} or (Float64) but that this either should be converted manually, or handled in other libraries such as for plotting.

Alexander-Barth · 2023-09-11T13:53:35Z

OK, thank you for your feedback. Can this issue be closed?

ctroupin · 2023-09-11T13:58:15Z

Quick comment to @milankl :

Alex suggests to use b = nomissing(a,NaN);, another possibility is the function coalesce:

b = coalesce.(a, NaN)

milankl · 2024-08-29T16:58:20Z

@Alexander-Barth I remember you told mere there's also some ignoremissing functionality, meaning reading out the raw float values, what's the recommended way to do that again? 🙈

Alexander-Barth · 2024-08-30T06:55:14Z

Hi Milan,
Does this answer your question?
https://alexander-barth.github.io/NCDatasets.jl/dev/other/#Fill-values-and-missing-values

milankl · 2024-08-30T14:48:42Z

Yes, Alexander it does! I'm just confused about the crazy time and allocation differences.
Just reading in the raw data is fast (1200MB/s, there is some zlib decompression!) and allocates only what's needed (the array is ~92KB)

julia> ds["vor"]
vor (96 × 48 × 1 × 5)
  Datatype:    Union{Missing, Float32} (Float32)
  Dimensions:  lon × lat × layer × time
  Attributes:
   units                = s^-1
   long_name            = relative vorticity
   _FillValue           = NaN


julia> f(ds) = ds["vor"].var[:, :, :, :]

julia> @btime f($ds);
  75.415 μs (129 allocations: 95.45 KiB)

When using the default interface with Array{Union{Missing, Float32}} returned this is up to 10x slower and requires 10x more memory

julia> h(ds) = ds["vor"][:, :, :, :]

julia> @btime h($ds);
  757.944 μs (45704 allocations: 920.18 KiB)

julia> h(ds) = ds["vor"][:]   # faster, but returns a flat vector although variable is 4D

julia> @btime h($ds);
  101.149 μs (123 allocations: 207.57 KiB)

julia> h(ds) = Array(ds["vor"])
h (generic function with 1 method)

julia> @btime h($ds);
  795.377 μs (45705 allocations: 920.26 KiB)

I don't quite understand why the reshaping from vector <-> array{T, 4} matters here? And then nomissing is killer

julia> g(ds) = nomissing(ds["vor"])
g (generic function with 1 method)

julia> @btime g($ds);
  1.524 s (2949196 allocations: 133.69 MiB)

being 20,000x slower and using 1400x the memory? 🐌 Sorry maybe I should have asked that in another issue?

Alexander-Barth · 2024-09-02T06:30:10Z

Can you re-make your benchmark where you cache the results of ds["vor"] like ncvar = ds["vor"]; f(ncvar) = Array(ncvar). For nomissing one should use nomissing(ncvar[:,:,:,:]) rather than nomissing(ncvar). In the later case you are loading every element individually. But yes, let's make a new issue with a reproducible example :-)

milankl closed this as completed Sep 11, 2023

milankl mentioned this issue Aug 30, 2024

RingGrids: Fix RingGrids Array conversion SpeedyWeather/SpeedyWeather.jl#572

Merged

milankl mentioned this issue Sep 2, 2024

Read performance with/without missing #264

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Return type `Matrix{Union{Missing, Float32}}` #227

Return type `Matrix{Union{Missing, Float32}}` #227

milankl commented Sep 1, 2023

Alexander-Barth commented Sep 2, 2023

milankl commented Sep 8, 2023

Alexander-Barth commented Sep 11, 2023

ctroupin commented Sep 11, 2023

milankl commented Aug 29, 2024

Alexander-Barth commented Aug 30, 2024

milankl commented Aug 30, 2024 •

edited

Loading

Alexander-Barth commented Sep 2, 2024

Return type Matrix{Union{Missing, Float32}} #227

Return type Matrix{Union{Missing, Float32}} #227

Comments

milankl commented Sep 1, 2023

Alexander-Barth commented Sep 2, 2023

milankl commented Sep 8, 2023

Alexander-Barth commented Sep 11, 2023

ctroupin commented Sep 11, 2023

milankl commented Aug 29, 2024

Alexander-Barth commented Aug 30, 2024

milankl commented Aug 30, 2024 • edited Loading

Alexander-Barth commented Sep 2, 2024

Return type `Matrix{Union{Missing, Float32}}` #227

Return type `Matrix{Union{Missing, Float32}}` #227

milankl commented Aug 30, 2024 •

edited

Loading