Caching datasets in context [Question] #1396
I think you can do something like this:

```rust
use std::sync::Arc;
use datafusion::datasource::MemTable;
use datafusion::prelude::*;

let mut ctx = ExecutionContext::new();

// Read a file and run a query against it.
ctx.register_csv("c", "path_to_csv", CsvReadOptions::new()).await?;
let df = ctx.sql("select * from c").await?;
let partitions = df.collect().await?;

// Convert the collected batches into a memory table and register it
// with the context.
let provider = MemTable::try_new(Arc::new(df.schema().into()), vec![partitions])?;
ctx.register_table("t", Arc::new(provider))?;

// Subsequent queries against "t" are served from memory, not the CSV.
let df = ctx.sql("select * from t").await?;
df.show().await?;
```
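One trade-off to keep in mind: `MemTable::try_new` holds all of the collected partitions in memory, so this approach only makes sense when the materialized table fits in RAM.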
I think we could add a caching option in context to automatically cache full table scans between runs.
I suspect it is also possible to use datafusion-cli. Something like:

```sh
$ echo "1" > /tmp/foo.csv
$ datafusion-cli
> create external table foo(c1 int) stored as CSV location '/tmp/foo.csv';
> create table bar as select * from foo;
```
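The `create table bar as select * from foo` step should materialize the query result in memory for the rest of the session, so subsequent queries against `bar` would not re-read the CSV.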
Thanks @alamb
At this moment it will be cached, as it's using …
We could take some inspiration from Spark, where you can …
The relevant code is here: https://github.com/apache/arrow-datafusion/blob/414c826bf06fd22e0bb52edbb497791b5fe558e0/datafusion/src/sql/planner.rs#L139-L171 (note how …
I had a question along these lines. Is there a way to load a CSV or Parquet file directly from memory into a context?
This is great if the file already exists on disk, but if I'm pulling the data from, say, Redis, then I have to write it to disk first and then read it back. Thanks!
@justinrmiller I don't know how to do this today -- it may be possible to register a new memory ObjectStore source and then pass the URL into …
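Short of the ObjectStore route, one way to avoid the disk round trip is to build Arrow `RecordBatch`es directly from the in-memory data and register them as a `MemTable`. A minimal sketch, assuming the rows have already been pulled out of Redis into Rust values (the table name `redis_data` and the two-column schema are made up for illustration):

```rust
use std::sync::Arc;
use datafusion::arrow::array::{Int32Array, StringArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::error::Result;
use datafusion::prelude::*;

async fn register_from_memory(ctx: &mut ExecutionContext) -> Result<()> {
    // Hypothetical data, standing in for rows already fetched from Redis.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int32Array::from(vec![1, 2, 3])),
            Arc::new(StringArray::from(vec!["a", "b", "c"])),
        ],
    )?;

    // Register the batches as an in-memory table; no file on disk involved.
    let provider = MemTable::try_new(schema, vec![vec![batch]])?;
    ctx.register_table("redis_data", Arc::new(provider))?;

    // Query it like any other table, entirely from memory.
    let df = ctx.sql("select * from redis_data").await?;
    df.show().await?;
    Ok(())
}
```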
Thanks, I'll check the source code and try to figure out a way to do so!
Hi,
Is there a way to cache a large CSV or Parquet file that was loaded into the context, so it doesn't have to be read again each time a new process tries to access it?
What is the proper way to manage this if you want to write something like a server that needs to answer multiple requests about the same file?
Thanks
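For the single-server case, a minimal sketch of one approach, assuming the table fits in memory and that `ExecutionContext` clones share the underlying catalog (its state is behind an `Arc` in the versions around the time of this issue; the table names `raw` and `cached` are made up): read the file once at startup, materialize it as a `MemTable`, and reuse the context for every request. This won't help across separate OS processes, though, since each process has its own context.

```rust
use std::sync::Arc;
use datafusion::datasource::MemTable;
use datafusion::error::Result;
use datafusion::prelude::*;

// Run once at server startup: read the file and cache it in memory.
async fn build_context(path: &str) -> Result<ExecutionContext> {
    let mut ctx = ExecutionContext::new();
    ctx.register_csv("raw", path, CsvReadOptions::new()).await?;

    let df = ctx.sql("select * from raw").await?;
    let partitions = df.collect().await?;
    let provider = MemTable::try_new(Arc::new(df.schema().into()), vec![partitions])?;
    ctx.register_table("cached", Arc::new(provider))?;
    Ok(ctx)
}

// Run per request: query the cached table; the file is not re-read.
async fn handle_request(ctx: &ExecutionContext, query: &str) -> Result<()> {
    // Clone the context to get a mutable handle; the clone sees the same
    // registered tables through the context's shared internal state.
    let mut ctx = ctx.clone();
    let df = ctx.sql(query).await?;
    df.show().await?;
    Ok(())
}
```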