
Use the correct hive partition type information (hacky solution) #1

Open
wants to merge 1 commit into partition_escape

Conversation

nils-braun

Hi @gallamine !

Thanks for your very good bug fix in dask-contrib#180!
I tried to also add a very hacky solution for dask-contrib#181 to your PR.

In my small tests this has worked, but I think it will break on many occasions. I am wondering if we should go down this path and make it more robust, or if we should invest in a better solution where we do not need to translate between the string representation of the values from DESCRIBE FORMATTED and the Python types all the time. I took this idea from the blazing folks, but I am open to other ideas.
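
To make that a bit more concrete, the kind of translation I mean looks roughly like this (purely illustrative, not the code in this commit):

```python
import numpy as np

# Illustrative and certainly incomplete: map the type strings reported by
# DESCRIBE FORMATTED to python/numpy types and use them to cast the
# partition value strings.
HIVE_TYPE_TO_PYTHON = {
    "tinyint": np.int8,
    "smallint": np.int16,
    "int": np.int32,
    "bigint": np.int64,
    "float": np.float32,
    "double": np.float64,
    "string": str,
}


def convert_partition_value(value: str, hive_type: str):
    # Fall back to keeping the raw string for types we do not know about.
    return HIVE_TYPE_TO_PYTHON.get(hive_type, str)(value)
```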

What do you think?

partition_table_information["Partition Value"]
)
partition_values = partition_table_information["Partition Value"]
partition_values = partition_values[1 : len(partition_values) - 1]
Owner

bug (blocking): Unless len(partition_values) is 1 (will it always be?), this will return a list object and the subsequent .split(',') call will fail.
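
For illustration, a hypothetical reproduction of the failure mode (assuming the value comes back as a list of strings rather than a single string):

```python
# Made-up partition values, just to show the failure mode described above.
partition_values = ["key1=value1", "key2=value2"]

# Slicing a list returns a list again ...
trimmed = partition_values[1 : len(partition_values) - 1]

# ... so the later .split(",") blows up:
trimmed.split(",")  # AttributeError: 'list' object has no attribute 'split'
```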

@gallamine
Owner

A bigger issue here, which I've realized, is that for tables with many partitions (I'm trying to read one with 800+ partitions) this function will read in all the data first and then you need to filter, rather than allowing the table to be filtered before reading. This seems like a design problem. Additionally, I know that my table's Parquet schema got adjusted at some point and might not be compatible for concatenating, so I have to filter before reading.

@nils-braun
Author

Right, unfortunately, that is true. We are facing multiple issues here that all in all lead to the problem you are describing: Dask not understanding Hive's partitioning schema, and a mixture of dask-contrib#182 and dask-contrib#183.

How could we solve that? The easiest way would be to implement dask-contrib#183, turn the Hive input into a specific form of table scan (if you know relational algebra: it is basically the step in the overall execution plan to be optimized which represents the "read input" part) and make it combinable with a predicate pushdown on the partitioning key. Unfortunately, there is still quite some work to do before dask-contrib#183 is solved...

Another possibility would be to extract the hive part into a convenience function that you could call standalone to get the file paths. You would then be able to filter the file paths yourself and only load the ones you care about.
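
A rough sketch of how that could look from the user side (the partition-to-path mapping and paths below are made up for illustration):

```python
import dask.dataframe as dd

# Hypothetical: suppose a standalone helper returned the partition -> path
# mapping that the hive reading code currently resolves internally.
partition_paths = {
    "day=2022-01-01": "hdfs:///warehouse/my_table/day=2022-01-01",
    "day=2022-01-02": "hdfs:///warehouse/my_table/day=2022-01-02",
}

# Filter the partitions yourself before any data is read ...
wanted_paths = [
    path
    for partition, path in partition_paths.items()
    if partition == "day=2022-01-01"
]

# ... and only hand the interesting paths to dask.
df = dd.read_parquet(wanted_paths)
```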

Still another possibility would be to use Hive directly: fire a Hive query to only get the data of the partitions you care about and load it into Dask (e.g. with read_sql_query from Dask and some SQLAlchemy).
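
A minimal sketch of that route, assuming PyHive's SQLAlchemy dialect is installed and a Dask version that already ships read_sql_query (connection, table and column names below are made up):

```python
import dask.dataframe as dd
import sqlalchemy as sa

# All names here are hypothetical; use your own Hive connection and table.
con = "hive://localhost:10000/default"
my_table = sa.table(
    "my_table", sa.column("id"), sa.column("value"), sa.column("day")
)

# Let Hive filter down to the partitions you actually care about
# (SQLAlchemy 1.4+ style select).
query = sa.select(my_table).where(my_table.c.day == "2022-01-01")

# index_col defines how Dask partitions the result, so pick an indexed,
# preferably numeric column.
df = dd.read_sql_query(query, con, index_col="id")
```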

I can help you with all three possibilities, if you like, even though the first solution will be the most complex one.

jdye64 added a commit that referenced this pull request Sep 1, 2022
…ontrib#245)

* Add GPU tests, pytest `gpu` marker (#1)

* Add dask-cudf tests with GPU marker

* Change --gpu option to --rungpu

* xfail GPU table creation from memory

* Do not persist data to memory by default when creating tables

* Revert "Add GPU tests, pytest `gpu` marker (#1)"

This reverts commit efa492f.

* update persist doc

Co-authored-by: Charles Blackmon-Luca <[email protected]>