Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use mutate(.keep = "none") in transmute() #483

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

markfairbanks
Copy link
Collaborator

@markfairbanks markfairbanks commented Dec 10, 2024

Closes #470

Note: I removed most of the dedicated transmute() tests as they're now covered by mutate() tests

@markfairbanks markfairbanks requested a review from eutwt December 10, 2024 14:42
@eutwt
Copy link
Collaborator

eutwt commented Dec 25, 2024

The code change looks good. There's a tradeoff to the change, because if you're doing an immediate transmute after converting to a lazy frame, the data table is copied but currently in main it would have only used .()/list() in j. This makes the keep = "none" translation take more memory (example below). I'm not sure the relative frequency of using transmute ~immediately vs transmute later in a pipeline where your data would have already been copied.

library(bench)
library(data.table)
library(magrittr)
n <- 1e6

dt <- replicate(10, sample(n), simplify = FALSE) %>% 
  setNames(., tail(letters, length(.))) %>% 
  as.data.table()

mark(
  add_then_remove = copy(dt)[, `:=`(double_x = x * 2)][, `:=`(names(dt), NULL)],
  list_in_j = dt[, .(double_x = x * 2)],
)
#> # A tibble: 2 × 6
#>   expression           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 add_then_remove    6.8ms   8.96ms      91.7    47.6MB     504.
#> 2 list_in_j         2.19ms   2.86ms     315.     15.3MB     137.

Created on 2024-12-25 with reprex v2.0.2

One thought I had is that transmute could be a reframe() which uses vec_recycle_common(..., .size = .N) instead of list(...) in j, but that is a bad option because it would prevent GForce optimizations on the ... expressions

R/step-subset-transmute.R Outdated Show resolved Hide resolved
@markfairbanks
Copy link
Collaborator Author

I think we need a re-order to match dplyr output for tibble(x = 1, y = 2) %>% transmute(y, x)

Ah yep I'll make the reorder change.

One thought I had is that transmute could be a reframe() which uses vec_recycle_common(..., .size = .N) instead of list(...) in j, but that is a bad option because it would prevent GForce optimizations on the ... expressions

We could also do something like this to preserve GForce but I think it makes the translation look odd.

library(dtplyr)
library(dplyr)

df <- tibble(x = 1:2, y = 3:4) %>%
  lazy_dt()

df %>%
  reframe(y, mean_x = mean(x), .row_preserve = row_number()) %>%
  select(-.row_preserve)
#> Source: local data table [2 x 2]
#> Call:   `_DT1`[, .(y = y, mean_x = mean(x), .row_preserve = seq_len(.N))][, 
#>     `:=`(".row_preserve", NULL)]
#> 
#>       y mean_x
#>   <int>  <dbl>
#> 1     3    1.5
#> 2     4    1.5
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

@markfairbanks markfairbanks requested a review from eutwt December 26, 2024 18:43
@markfairbanks
Copy link
Collaborator Author

Actually sorry I completely did the column order thing wrong 😅

Let me take another pass at it.

@markfairbanks
Copy link
Collaborator Author

Ok good to review now

R/step-subset-transmute.R Outdated Show resolved Hide resolved
@markfairbanks markfairbanks requested a review from eutwt January 21, 2025 01:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Convert transmute() to use mutate(.keep = "none")
2 participants