Improve datafusion-cli memory usage and considering reserve memory for the result batches #14751

Closed
zhuqi-lucas opened this issue Feb 18, 2025 · 0 comments
Labels
enhancement New feature or request

@zhuqi-lucas
Contributor

Is your feature request related to a problem or challenge?

This is a follow-up to the discussion in #14644 (comment).

Problem
I tried one query and this PR is not working as expected. I configured the query to run under 5 GB of memory (select * without order requires 7 GB), but it still consumes around 10 GB. Could you double-check? I suspect we missed some details.

This is a problem in datafusion-cli. If datafusion-cli decides to hold all the result batches in memory, it should create a memory consumer for itself and reserve memory for the result batches.
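
To make the idea of "reserve memory for the result batches" concrete, here is a minimal sketch using only the Rust standard library. `ToyPool`, `ToyReservation`, and `buffer_results` are hypothetical stand-ins invented for illustration; they are not DataFusion's memory pool API and only mimic the general pattern, assuming a buffering component must ask a shared, bounded pool before it holds each batch.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical, simplified stand-in for a bounded memory pool.
struct ToyPool {
    limit: usize,
    used: Cell<usize>,
}

/// Hypothetical reservation held by one named consumer (here, the CLI).
struct ToyReservation {
    pool: Rc<ToyPool>,
    name: &'static str,
    size: usize,
}

impl ToyReservation {
    fn new(pool: Rc<ToyPool>, name: &'static str) -> Self {
        Self { pool, name, size: 0 }
    }

    /// Account for `bytes` more, failing instead of exceeding the pool limit.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        let used = self.pool.used.get();
        if used + bytes > self.pool.limit {
            return Err(format!(
                "{}: cannot reserve {} bytes ({} of {} already used)",
                self.name, bytes, used, self.pool.limit
            ));
        }
        self.pool.used.set(used + bytes);
        self.size += bytes;
        Ok(())
    }
}

impl Drop for ToyReservation {
    /// Return everything this consumer reserved once it goes away.
    fn drop(&mut self) {
        self.pool.used.set(self.pool.used.get() - self.size);
    }
}

/// Buffer batches of the given sizes, reserving memory before holding each one.
fn buffer_results(pool: Rc<ToyPool>, batch_sizes: &[usize]) -> Result<usize, String> {
    let mut reservation = ToyReservation::new(pool, "cli-results");
    let mut held = Vec::new();
    for &size in batch_sizes {
        reservation.try_grow(size)?; // fail early rather than silently overshoot
        held.push(vec![0u8; size]);
    }
    Ok(held.len())
}

fn main() {
    let pool = Rc::new(ToyPool { limit: 100, used: Cell::new(0) });
    println!("{:?}", buffer_results(pool.clone(), &[40, 40]));
    println!("{:?}", buffer_results(pool.clone(), &[40, 40, 40]));
}
```

In this toy model the second call fails at the third batch instead of exceeding the 100-byte limit, which is the behavior the reply asks for: buffered results count against the same limit as the rest of the query.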

datafusion-cli calls collect to hold the entire result set in memory before displaying it; this is unnecessary when maxrows is not unlimited. I've tried replacing the collect call with the following code, and the maximum resident set size dropped to 4.8 GB:

Describe the solution you'd like

  1. Add a memory consumer for datafusion-cli itself and reserve memory for the result batches.
  2. Stream the result set and display it incrementally, instead of holding it all in memory, when maxrows is not unlimited.
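
The second item can be illustrated with a small, self-contained sketch. It is not the code from the linked discussion and does not use DataFusion types: `MaxRows` and `stream_and_truncate` are hypothetical names, and a plain iterator of `Vec<String>` stands in for a stream of record batches. The assumption is that only the rows that will be displayed need to stay in memory, while the remaining batches are counted and dropped.

```rust
/// Hypothetical row limit, mirroring the idea of a maxrows setting.
#[derive(Clone, Copy)]
enum MaxRows {
    Unlimited,
    Limited(usize),
}

/// Pull batches one at a time, keep only the rows that will be displayed,
/// and count the rest. Returns (rows kept for display, total rows seen).
fn stream_and_truncate<I>(batches: I, max_rows: MaxRows) -> (Vec<String>, usize)
where
    I: Iterator<Item = Vec<String>>,
{
    let mut shown = Vec::new();
    let mut total = 0;
    for batch in batches {
        total += batch.len();
        // How many more rows may still be kept for display.
        let room = match max_rows {
            MaxRows::Unlimited => batch.len(),
            MaxRows::Limited(n) => n.saturating_sub(shown.len()),
        };
        // Rows beyond `room` are dropped together with the batch.
        shown.extend(batch.into_iter().take(room));
    }
    (shown, total)
}

fn main() {
    // Four batches of three rows each, produced lazily.
    let batches = (0..4).map(|b| {
        (0..3).map(|r| format!("row {}", b * 3 + r)).collect::<Vec<_>>()
    });
    let (shown, total) = stream_and_truncate(batches, MaxRows::Limited(5));
    println!("showing {} of {} rows", shown.len(), total);
}
```

With a limit of 5 the sketch keeps 5 rows and reports 12 in total, so peak memory is bounded by the displayed rows plus one in-flight batch rather than by the full result set.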

Describe alternatives you've considered

No response

Additional context

No response
