Proposal: read with the same class that the object had on save #82

Open
wlandau opened this issue Aug 1, 2024 · 3 comments
Labels: feature (a feature request or enhancement)

Comments

@wlandau

wlandau commented Aug 1, 2024

Currently, parquet_options()$class controls the class that read_parquet() assigns to the returned object. As a default, I feel it would be more natural to restore the original class from the time the object was saved (which arrow does). This would help integrate nanoparquet with targets, as @hadley proposed in ropensci/targets#1311.

@wlandau
Author

wlandau commented Aug 1, 2024

For targets, I think I can actually work around this by storing the class in the metadata, but I think it may still be convenient for other end users.
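A minimal sketch of that kind of workaround in plain base R, with hypothetical helper names (none of these exist in nanoparquet): the class vector is kept in a sidecar file next to the parquet file and reapplied after reading.

```r
# Hypothetical workaround: store the class vector out of band and
# reapply it after reading. The actual parquet write/read is elided.
save_with_class <- function(df, file) {
  writeLines(class(df), paste0(file, ".class"))  # one class per line
  # nanoparquet::write_parquet(df, file)         # real write would go here
  invisible(df)
}

read_with_class <- function(df, file) {
  # `df` stands in for the plain data frame a parquet read returns
  cls <- readLines(paste0(file, ".class"))       # recover the stored classes
  class(df) <- cls                               # reapply them
  df
}
```

In targets, the same idea could live in the format's read/write hooks rather than in user code.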

@gaborcsardi
Member

The existing option is mainly for letting people opt out of tbl objects.

I am not sure if we should restore the class, because we cannot be sure that we also restore type invariants.

E.g. if somebody defines a metadataframe class that stores additional metadata in attributes, nanoparquet would of course not restore those attributes, so you could get a broken object. Another example would be a class that requires a column type or class that is not preserved as-is across a parquet write + read.
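That failure mode can be shown in a few lines of plain R, simulating the round trip without nanoparquet (a parquet file only stores column data, so only the columns survive):

```r
# Hypothetical "metadataframe" class: a data frame plus an attribute
# that a parquet file has no place for.
mdf <- data.frame(x = 1:3)
attr(mdf, "meta") <- "units: kg"
class(mdf) <- c("metadataframe", "data.frame")

# A parquet write + read rebuilds the object from column data only
# (simulated here by rebuilding from the columns):
roundtrip <- data.frame(x = 1:3)

# Blindly restoring the class yields a structurally broken object:
class(roundtrip) <- c("metadataframe", "data.frame")
attr(roundtrip, "meta")  # NULL: the class invariant is gone
```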

nanoparquet is for reading and writing data frames. If you set the existing class option, you accept the restriction that your object will be a "vanilla" data frame. If we blindly save and restore the class, there'll be the potential for creating broken objects.

OTOH, maybe we could have an S3 generic that people could use to write out custom classes, or make write_parquet() itself a generic.
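A rough sketch of what that dispatch could look like, using made-up names (write_parquet2 and these methods are not nanoparquet API); class authors would opt in by providing a method that persists their extra state before delegating to the plain data frame writer:

```r
# Hypothetical generic; the real parquet writer call is elided.
write_parquet2 <- function(x, file, ...) UseMethod("write_parquet2")

# Default method: plain data frames are written as-is.
write_parquet2.data.frame <- function(x, file, ...) {
  message("writing a vanilla data frame")  # real writer call elided
  invisible(x)
}

# A class author opts in with a method that serializes the extra
# state (here, an attribute) before writing the plain columns.
write_parquet2.metadataframe <- function(x, file, ...) {
  extra <- attr(x, "meta")
  # ... persist `extra` where a matching read method can find it ...
  NextMethod()                             # then write the columns
}
```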

@torfason

torfason commented Dec 28, 2024

Apologies for (possibly) submitting a support request disguised as an issue report, but I am failing to get the class option to work and can't figure out what is going wrong. Here is the reprex:

# Load nanoparquet
library(nanoparquet)
packageVersion("nanoparquet")
#> [1] '0.3.1.9000'

# Write a data.frame
d <- data.frame(letters)
f <- tempfile()
write_parquet(d, f)

# Read after setting options
options(nanoparquet.class = c("tbl_df", "tbl", "data.frame"))
read_parquet(f) |> class()
#> [1] "tbl"        "data.frame"

# Read using explicit parquet_options()
read_parquet(f, options = parquet_options(class = c("tbl_df", "tbl", "data.frame"))) |> class()
#> [1] "tbl"        "data.frame"

# Try alternative specifications
read_parquet(f, options = parquet_options(class = c("data.frame")))    |> class()
#> [1] "tbl"        "data.frame"
read_parquet(f, options = parquet_options(class = character()))        |> class()
#> [1] "tbl"        "data.frame"
read_parquet(f, options = parquet_options(class = "tbl_df"))           |> class()
#> [1] "tbl"        "data.frame"
read_parquet(f, options = parquet_options(class = ""))                 |> class()
#> [1] "tbl"        "data.frame"

A few other somewhat related thoughts:

  • I appreciate the concern about restoring a broken object. One way to allow storing different types of data frames would be an accept_classes (or simply classes) parameter, which would only allow reading (and writing?) objects whose classes are all included in the accept_classes list, punting the responsibility to the user. Trying to read or write objects with unlisted class entries would result in an error. A default value of c("tbl_df", "tbl", "data.frame") would let the most common cases (base data frames and tibbles) round-trip transparently. Interaction with class could follow a priority heuristic, such as:
    • On write, add class data if accept_classes is set
    • On read, if accept_classes is set and class data is found when reading a parquet file, use that
    • If no class data is found when reading a parquet file, use value from class
    • If one or the other parameter is NULL/NA, use the other
    • If both are null/NA, error out
  • Would it make sense to reserve ... as the first parameter of parquet_options() to force name-based arguments and prevent position-based usage? parquet_options() seems likely to accumulate many parameters over time, and the flexibility to group related parameters would help with documentation, but reordering them would break existing position-based calls. Putting ... first would of course break backwards compatibility now, but would make any later reordering of parameters backwards compatible. Since it is still early days for nanoparquet, I think it might be a worthwhile trade-off.
  • I wonder why the default value of class is c("tbl", "data.frame") and not c("tbl_df", "tbl", "data.frame"). tbl seems more of an "interface" than a class: every actual tbl I have encountered is also some other class, such as tbl_df or a more complex type (like those found in dbplyr). And since the returned object is structurally a tbl_df, functions receiving it could make use of the most specific class information.
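The priority heuristic above could be sketched roughly as follows (hypothetical function and parameter names, not nanoparquet API); `stored` is the class data found in the file, if any:

```r
# Hypothetical resolution of stored class data against the two options,
# following the proposed priority order.
resolve_class <- function(stored, accept_classes, class_opt) {
  if (is.null(accept_classes) && is.null(class_opt)) {
    stop("need either `accept_classes` or `class`")   # both unset: error out
  }
  # accept_classes set and the file carries class data: use the stored
  # classes, but only if every one of them is on the allow-list
  if (!is.null(accept_classes) && !is.null(stored)) {
    if (!all(stored %in% accept_classes)) {
      stop("stored class not in `accept_classes`")
    }
    return(stored)
  }
  # no stored class data (or accept_classes unset): fall back to `class`
  if (!is.null(class_opt)) return(class_opt)
  accept_classes                                      # last resort: the other parameter
}
```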

@gaborcsardi gaborcsardi added the feature a feature request or enhancement label Jan 23, 2025