Return type Matrix{Union{Missing, Float32}}
#227
Comments
The problem is that only after reading the whole array can we be sure that there are no missing values. Sometimes only a slice of the array is loaded. Python's netCDF4 library does indeed return either a numpy array or a masked array for the same variable, depending on whether a fill value is loaded or not (https://stackoverflow.com/questions/32046317/always-yield-a-masked-array-with-netcdf4), which can be quite confusing and is not type-stable. I often use the function nomissing:
b = nomissing(a,NaN); # returns a Matrix{Float64} and replaces all missing values by NaN
b = nomissing(a); # returns a Matrix{Float64} but raises an error if there is a missing value
With the function cfvariable:
ds = NCDataset(download("https://www.unidata.ucar.edu/software/netcdf/examples/tos_O1_2001-2002.nc"))
ncvar = cfvariable(ds,"tos",fillvalue=nothing,missing_value=[]); # similar to ds["tos"]
eltype(ncvar)
# Float32
Note of in this example file, (I use PyPlot a lot myself and I wish that it could handle missing values, e.g. by automatically converting them to Python masked arrays. See JuliaPy/PyCall.jl#616.) I am open to any suggestions :-)
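The type-stability point above can be illustrated with a minimal sketch in plain Julia. The two reader functions and the fill value -999 are hypothetical and are not part of NCDatasets.jl; they only mimic the two possible designs (a return type that depends on the data versus one that is always the same).

```julia
# Hypothetical reader: the element type of the result depends on the data,
# similar to the numpy array / masked array behaviour described above.
function read_unstable(raw::Vector{Float32}, fill::Float32)
    if any(==(fill), raw)
        # at least one fill value: elements become Union{Missing,Float32}
        return [x == fill ? missing : x for x in raw]
    else
        # no fill value found: plain Float32 elements
        return raw
    end
end

# Hypothetical reader: always the same element type, whatever the data holds.
read_stable(raw::Vector{Float32}, fill::Float32) =
    Union{Missing,Float32}[x == fill ? missing : x for x in raw]

with_fill    = Float32[1, -999, 3]
without_fill = Float32[1, 2, 3]

# The unstable reader yields two different element types ...
T1 = eltype(read_unstable(with_fill, -999f0))
T2 = eltype(read_unstable(without_fill, -999f0))

# ... while the stable reader yields the same one in both cases.
T3 = eltype(read_stable(with_fill, -999f0))
T4 = eltype(read_stable(without_fill, -999f0))
```

With the stable variant, downstream code can rely on one element type even when only a slice has been read, at the cost of carrying the Union type when no value is actually missing.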
No, I think I agree with you that NCDatasets.jl should return
OK, thank you for your feedback. Can this issue be closed?
@Alexander-Barth I remember you told me there's also some
Hi Milan,
Yes, Alexander, it does! I'm just confused about the crazy time and allocation differences.
julia> ds["vor"]
vor (96 × 48 × 1 × 5)
Datatype: Union{Missing, Float32} (Float32)
Dimensions: lon × lat × layer × time
Attributes:
units = s^-1
long_name = relative vorticity
_FillValue = NaN
julia> f(ds) = ds["vor"].var[:, :, :, :]
julia> @btime f($ds);
75.415 μs (129 allocations: 95.45 KiB)
When using the default interface with
julia> h(ds) = ds["vor"][:, :, :, :]
julia> @btime h($ds);
757.944 μs (45704 allocations: 920.18 KiB)
julia> h(ds) = ds["vor"][:] # faster, but returns a flat vector although variable is 4D
julia> @btime h($ds);
101.149 μs (123 allocations: 207.57 KiB)
julia> h(ds) = Array(ds["vor"])
h (generic function with 1 method)
julia> @btime h($ds);
795.377 μs (45705 allocations: 920.26 KiB)
I don't quite understand why the reshaping from Vector <-> Array{T, 4} matters here. And then
julia> g(ds) = nomissing(ds["vor"])
g (generic function with 1 method)
julia> @btime g($ds);
1.524 s (2949196 allocations: 133.69 MiB)
being 20,000x slower and using 1400x the memory? 🐌 Sorry, maybe I should have asked that in another issue?
Can you re-make your benchmark where you cache the results of
NCDatasets.jl returns an Array{Union{Missing,Float32}} (or Float64) when reading nc files with _FillValue specified, even if there aren't any missing values in the array. This can cause a bunch of problems downstream, for example with plotting (doing a conversion via Matrix{Float32}(A) would solve that though).
Maybe this can start a discussion on whether it's always sensible for NCDatasets to return Array{Union{Missing,Float32}}?
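As a rough sketch of the conversion mentioned above, the following uses only Base Julia. The small matrices are made-up stand-ins for data read from a file, not output of NCDatasets.jl, and replacing missing by NaN is one assumed choice of placeholder among several.

```julia
# Stand-in for an array read with _FillValue set, but without any missing element
A = Union{Missing,Float32}[1.0 2.0; 3.0 4.0]

# The conversion works because every element is a Float32
B = Matrix{Float32}(A)

# Stand-in for an array that really contains a missing element
C = Union{Missing,Float32}[1.0 missing; 3.0 4.0]

# Here the same conversion throws, since missing cannot be converted to Float32
conversion_failed = try
    Matrix{Float32}(C)
    false
catch err
    err isa MethodError
end

# One way around it: replace missing by NaN before handing the data to a plotting library
D = coalesce.(C, NaN32)
```

The direct conversion is therefore only safe once the whole array is known to be free of missing values; otherwise a replacement step such as the coalesce call is needed first.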