Caching datasets in context [Question] #1396
I think you can do something like this:

```rust
use std::sync::Arc;
use datafusion::datasource::MemTable;
use datafusion::prelude::*;

let mut ctx = ExecutionContext::new();

// Read a file and run a query against it.
ctx.register_csv("c", "path_to_csv", CsvReadOptions::new()).await?;
let df = ctx.sql("select * from c").await?;
let partitions = df.collect().await?;

// Convert the collected batches into a memory table and register it
// with the context.
let provider = MemTable::try_new(Arc::new(df.schema().into()), vec![partitions])?;
ctx.register_table("t", Arc::new(provider))?;

// Subsequent queries against "t" are served from memory, not the CSV.
let df = ctx.sql("select * from t").await?;
df.show().await?;
```
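One trade-off to keep in mind: `MemTable::try_new` holds all of the collected partitions in memory, so this approach only makes sense when the materialized table fits in RAM.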
I think we could add a caching option in context to automatically cache full table scans between runs.
I suspect it is also possible to use datafusion-cli. Something like:

```sh
$ echo "1" > /tmp/foo.csv
$ datafusion-cli
> create external table foo(c1 int) stored as CSV location '/tmp/foo.csv';
> create table bar as select * from foo;
```
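The `create table bar as select * from foo` step should materialize the query result in memory for the rest of the session, so subsequent queries against `bar` would not re-read the CSV.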
Thanks @alamb
At this moment it will be cached, as it's using …
We could take some inspiration from Spark, where you can …
The relevant code is here: https://github.com/apache/arrow-datafusion/blob/414c826bf06fd22e0bb52edbb497791b5fe558e0/datafusion/src/sql/planner.rs#L139-L171 (note how …
I had a question along these lines. Is there a way to load a CSV or Parquet file directly from memory into a context?
This is great if the file already exists on disk, but if I'm pulling the data from, say, Redis, then I have to write it to disk first and then read it back. Thanks!
@justinrmiller I don't know how to do this today -- it may be possible to register a new memory ObjectStore source and then pass the URL into …
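Short of the ObjectStore route, one way to avoid the disk round trip is to build Arrow `RecordBatch`es directly from the in-memory data and register them as a `MemTable`. A minimal sketch, assuming the rows have already been pulled out of Redis into Rust values (the table name `redis_data` and the two-column schema are made up for illustration):

```rust
use std::sync::Arc;
use datafusion::arrow::array::{Int32Array, StringArray};
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::datasource::MemTable;
use datafusion::error::Result;
use datafusion::prelude::*;

async fn register_from_memory(ctx: &mut ExecutionContext) -> Result<()> {
    // Hypothetical data, standing in for rows already fetched from Redis.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int32Array::from(vec![1, 2, 3])),
            Arc::new(StringArray::from(vec!["a", "b", "c"])),
        ],
    )?;

    // Register the batches as an in-memory table; no file on disk involved.
    let provider = MemTable::try_new(schema, vec![vec![batch]])?;
    ctx.register_table("redis_data", Arc::new(provider))?;

    // Query it like any other table, entirely from memory.
    let df = ctx.sql("select * from redis_data").await?;
    df.show().await?;
    Ok(())
}
```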
Thanks, I'll check the source code and try to figure out a way to do so!
Hi,
Is there a way to cache a large CSV or Parquet file that was loaded into the context, so it doesn't have to be read again each time a new process tries to access it?
What is the proper way to manage this if you want to write something like a server that needs to answer multiple requests about the same file?
Thanks
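For the single-server case, a minimal sketch of one approach, assuming the table fits in memory and that `ExecutionContext` clones share the underlying catalog (its state is behind an `Arc` in the versions around the time of this issue; the table names `raw` and `cached` are made up): read the file once at startup, materialize it as a `MemTable`, and reuse the context for every request. This won't help across separate OS processes, though, since each process has its own context.

```rust
use std::sync::Arc;
use datafusion::datasource::MemTable;
use datafusion::error::Result;
use datafusion::prelude::*;

// Run once at server startup: read the file and cache it in memory.
async fn build_context(path: &str) -> Result<ExecutionContext> {
    let mut ctx = ExecutionContext::new();
    ctx.register_csv("raw", path, CsvReadOptions::new()).await?;

    let df = ctx.sql("select * from raw").await?;
    let partitions = df.collect().await?;
    let provider = MemTable::try_new(Arc::new(df.schema().into()), vec![partitions])?;
    ctx.register_table("cached", Arc::new(provider))?;
    Ok(ctx)
}

// Run per request: query the cached table; the file is not re-read.
async fn handle_request(ctx: &ExecutionContext, query: &str) -> Result<()> {
    // Clone the context to get a mutable handle; the clone sees the same
    // registered tables through the context's shared internal state.
    let mut ctx = ctx.clone();
    let df = ctx.sql(query).await?;
    df.show().await?;
    Ok(())
}
```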