Proposal: read with the same class that the object had on save #82

Open
wlandau opened this issue Aug 1, 2024 · 3 comments
Labels: feature (a feature request or enhancement)

Comments

@wlandau

wlandau commented Aug 1, 2024

Currently, parquet_options()$class controls the class that read_parquet() assigns to the returned object. As a default, I feel it would be more natural to restore the original class from the time the object was saved (which arrow does). This would help integrate nanoparquet with targets, as @hadley proposed in ropensci/targets#1311.

@wlandau
Author

wlandau commented Aug 1, 2024

For targets, I think I can actually work around this by storing the class in the metadata, but I think it may still be convenient for other end users.
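A minimal sketch of that kind of workaround in plain base R, with hypothetical helper names (none of these exist in nanoparquet): the class vector is kept in a sidecar file next to the parquet file and reapplied after reading.

```r
# Hypothetical workaround: store the class vector out of band and
# reapply it after reading. The actual parquet write/read is elided.
save_with_class <- function(df, file) {
  writeLines(class(df), paste0(file, ".class"))  # one class per line
  # nanoparquet::write_parquet(df, file)         # real write would go here
  invisible(df)
}

read_with_class <- function(df, file) {
  # `df` stands in for the plain data frame a parquet read returns
  cls <- readLines(paste0(file, ".class"))       # recover the stored classes
  class(df) <- cls                               # reapply them
  df
}
```

In targets, the same idea could live in the format's read/write hooks rather than in user code.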

@gaborcsardi
Member

The existing option is mainly for letting people opt out of tbl objects.

I am not sure if we should restore the class, because we cannot be sure that we also restore type invariants.

E.g. if somebody defines a metadataframe class that stores additional metadata in attributes, nanoparquet would of course not restore those attributes, so you could get a broken object. Another example would be a class that requires a column type or class that is not preserved as-is across a parquet write + read.
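That failure mode can be shown in a few lines of plain R, simulating the round trip without nanoparquet (a parquet file only stores column data, so only the columns survive):

```r
# Hypothetical "metadataframe" class: a data frame plus an attribute
# that a parquet file has no place for.
mdf <- data.frame(x = 1:3)
attr(mdf, "meta") <- "units: kg"
class(mdf) <- c("metadataframe", "data.frame")

# A parquet write + read rebuilds the object from column data only
# (simulated here by rebuilding from the columns):
roundtrip <- data.frame(x = 1:3)

# Blindly restoring the class yields a structurally broken object:
class(roundtrip) <- c("metadataframe", "data.frame")
attr(roundtrip, "meta")  # NULL: the class invariant is gone
```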

nanoparquet is for reading and writing data frames. If you set the existing class option, you accept the restriction that your object will be a "vanilla" data frame. If we blindly save and restore the class, there'll be the potential for creating broken objects.

OTOH, maybe we could have an S3 generic that people could use to write out custom classes, or make write_parquet() itself a generic.
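A rough sketch of what that dispatch could look like, using made-up names (write_parquet2 and these methods are not nanoparquet API); class authors would opt in by providing a method that persists their extra state before delegating to the plain data frame writer:

```r
# Hypothetical generic; the real parquet writer call is elided.
write_parquet2 <- function(x, file, ...) UseMethod("write_parquet2")

# Default method: plain data frames are written as-is.
write_parquet2.data.frame <- function(x, file, ...) {
  message("writing a vanilla data frame")  # real writer call elided
  invisible(x)
}

# A class author opts in with a method that serializes the extra
# state (here, an attribute) before writing the plain columns.
write_parquet2.metadataframe <- function(x, file, ...) {
  extra <- attr(x, "meta")
  # ... persist `extra` where a matching read method can find it ...
  NextMethod()                             # then write the columns
}
```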

@torfason

torfason commented Dec 28, 2024

Apologies for (possibly) submitting a support request disguised as an issue report, but I am failing to get the class option to work and can't figure out what is going wrong. Here is the reprex:

# Load nanoparquet
library(nanoparquet)
packageVersion("nanoparquet")
#> [1] '0.3.1.9000'

# Write a data.frame
d <- data.frame(letters)
f <- tempfile()
write_parquet(d, f)

# Read after setting options
options(nanoparquet.class = c("tbl_df", "tbl", "data.frame"))
read_parquet(f) |> class()
#> [1] "tbl"        "data.frame"

# Read using explicit parquet_options()
read_parquet(f, options = parquet_options(class = c("tbl_df", "tbl", "data.frame"))) |> class()
#> [1] "tbl"        "data.frame"

# Try alternative specifications
read_parquet(f, options = parquet_options(class = c("data.frame")))    |> class()
#> [1] "tbl"        "data.frame"
read_parquet(f, options = parquet_options(class = character()))        |> class()
#> [1] "tbl"        "data.frame"
read_parquet(f, options = parquet_options(class = "tbl_df"))           |> class()
#> [1] "tbl"        "data.frame"
read_parquet(f, options = parquet_options(class = ""))                 |> class()
#> [1] "tbl"        "data.frame"

A few other somewhat related thoughts:

  • I appreciate the concern about restoring a broken object. One way to allow storing different types of data frames would be an accept_classes (or simply classes) parameter, which would only allow reading (and writing?) objects whose classes are all included in the accept_classes list, punting the responsibility to the user. Trying to read or write objects with unlisted class entries would result in an error. A default value of c("tbl_df", "tbl", "data.frame") would let the most common cases (base data frames and tibbles) round-trip transparently. Interaction with class could follow a priority heuristic, such as:
    • On write, add class data if accept_classes is set
    • On read, if accept_classes is set and class data is found when reading a parquet file, use that
    • If no class data is found when reading a parquet file, use value from class
    • If one or the other parameter is NULL/NA, use the other
    • If both are null/NA, error out
  • Would it make sense to reserve ... as the first parameter of parquet_options() to force name-based arguments and prevent position-based usage? parquet_options() seems likely to accumulate many parameters over time, and the flexibility to group related parameters would help with documentation, but reordering them would break existing position-based calls. Putting ... first would of course break backwards compatibility now, but would make any later reordering of parameters backwards compatible. Since it is still early days for nanoparquet, I think it might be a worthwhile trade-off.
  • I wonder why the default value of class is c("tbl", "data.frame") and not c("tbl_df", "tbl", "data.frame"). tbl seems more of an "interface" than a class: every actual tbl I have encountered is also some other class, such as tbl_df or a more complex type (like those found in dbplyr). And since the returned object is structurally a tbl_df, functions receiving it could make use of the most specific class information.
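The priority heuristic above could be sketched roughly as follows (hypothetical function and parameter names, not nanoparquet API); `stored` is the class data found in the file, if any:

```r
# Hypothetical resolution of stored class data against the two options,
# following the proposed priority order.
resolve_class <- function(stored, accept_classes, class_opt) {
  if (is.null(accept_classes) && is.null(class_opt)) {
    stop("need either `accept_classes` or `class`")   # both unset: error out
  }
  # accept_classes set and the file carries class data: use the stored
  # classes, but only if every one of them is on the allow-list
  if (!is.null(accept_classes) && !is.null(stored)) {
    if (!all(stored %in% accept_classes)) {
      stop("stored class not in `accept_classes`")
    }
    return(stored)
  }
  # no stored class data (or accept_classes unset): fall back to `class`
  if (!is.null(class_opt)) return(class_opt)
  accept_classes                                      # last resort: the other parameter
}
```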

@gaborcsardi gaborcsardi added the feature a feature request or enhancement label Jan 23, 2025