
Use the correct hive partition type information (hacky solution) #1

Open
wants to merge 1 commit into partition_escape

Conversation

nils-braun

Hi @gallamine !

Thanks for your very good bug fix in dask-contrib#180!
I tried to also add a very hacky solution for dask-contrib#181 to your PR.

In my small tests this has worked, but I think it will break on many occasions. I am wondering if we should go down this path and make it more robust, or if we should invest in a better solution where we do not need to translate between the string representation of the values from DESCRIBE FORMATTED and the Python types all the time. I took this idea from the blazing folks, but I am open to other ideas.
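
To make that a bit more concrete, the kind of translation I mean looks roughly like this (purely illustrative, not the code in this commit):

```python
import numpy as np

# Illustrative and certainly incomplete: map the type strings reported by
# DESCRIBE FORMATTED to python/numpy types and use them to cast the
# partition value strings.
HIVE_TYPE_TO_PYTHON = {
    "tinyint": np.int8,
    "smallint": np.int16,
    "int": np.int32,
    "bigint": np.int64,
    "float": np.float32,
    "double": np.float64,
    "string": str,
}


def convert_partition_value(value: str, hive_type: str):
    # Fall back to keeping the raw string for types we do not know about.
    return HIVE_TYPE_TO_PYTHON.get(hive_type, str)(value)
```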

What do you think?

partition_table_information["Partition Value"]
)
partition_values = partition_table_information["Partition Value"]
partition_values = partition_values[1 : len(partition_values) - 1]
Owner

bug (blocking): Unless len(partition_values) is 1 (will it always be?), this will return a list object and the subsequent .split(',') call will fail.
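
For illustration, a hypothetical reproduction of the failure mode (assuming the value comes back as a list of strings rather than a single string):

```python
# Made-up partition values, just to show the failure mode described above.
partition_values = ["key1=value1", "key2=value2"]

# Slicing a list returns a list again ...
trimmed = partition_values[1 : len(partition_values) - 1]

# ... so the later .split(",") blows up:
trimmed.split(",")  # AttributeError: 'list' object has no attribute 'split'
```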

@gallamine
Owner

A bigger issue here, which I've realized, is that for tables with many partitions (I'm trying to read one with 800+ partitions) this function will read in all the data first and then you need to filter, rather than allowing the table to be filtered before reading. This seems like a design problem. Additionally, I know that my table's Parquet schema got adjusted at some point and might not be compatible for concatenating, so I have to filter before reading.

@nils-braun
Author

Right, unfortunately, that is true. We are facing multiple issues here that all in all lead to the problem you are describing: Dask not understanding Hive's partitioning schema, and a mixture of dask-contrib#182 and dask-contrib#183.

How could we solve that? The easiest way would be to implement dask-contrib#183, turn the Hive input into a specific form of table scan (if you know relational algebra: it is basically the step in the overall execution plan to be optimized which represents the "read input" part) and make it combinable with a predicate pushdown on the partitioning key. Unfortunately, there is still quite some work to do before dask-contrib#183 is solved...

Another possibility would be to extract the hive part into a convenience function that you could call standalone to get the file paths. You would then be able to filter the file paths yourself and only load the ones you care about.
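
A rough sketch of how that could look from the user side (the partition-to-path mapping and paths below are made up for illustration):

```python
import dask.dataframe as dd

# Hypothetical: suppose a standalone helper returned the partition -> path
# mapping that the hive reading code currently resolves internally.
partition_paths = {
    "day=2022-01-01": "hdfs:///warehouse/my_table/day=2022-01-01",
    "day=2022-01-02": "hdfs:///warehouse/my_table/day=2022-01-02",
}

# Filter the partitions yourself before any data is read ...
wanted_paths = [
    path
    for partition, path in partition_paths.items()
    if partition == "day=2022-01-01"
]

# ... and only hand the interesting paths to dask.
df = dd.read_parquet(wanted_paths)
```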

Still another possibility would be to use Hive directly: fire a Hive query to only get the data of the partitions you care about and load it into Dask (e.g. with read_sql_query from Dask and some SQLAlchemy).
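
A minimal sketch of that route, assuming PyHive's SQLAlchemy dialect is installed and a Dask version that already ships read_sql_query (connection, table and column names below are made up):

```python
import dask.dataframe as dd
import sqlalchemy as sa

# All names here are hypothetical; use your own Hive connection and table.
con = "hive://localhost:10000/default"
my_table = sa.table(
    "my_table", sa.column("id"), sa.column("value"), sa.column("day")
)

# Let Hive filter down to the partitions you actually care about
# (SQLAlchemy 1.4+ style select).
query = sa.select(my_table).where(my_table.c.day == "2022-01-01")

# index_col defines how Dask partitions the result, so pick an indexed,
# preferably numeric column.
df = dd.read_sql_query(query, con, index_col="id")
```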

I can help you with all three possibilities, if you like, even though the first solution will be the most complex one.

jdye64 added a commit that referenced this pull request Sep 1, 2022
…ontrib#245)

* Add GPU tests, pytest `gpu` marker (#1)

* Add dask-cudf tests with GPU marker

* Change --gpu option to --rungpu

* xfail GPU table creation from memory

* Do not persist data to memory by default when creating tables

* Revert "Add GPU tests, pytest `gpu` marker (#1)"

This reverts commit efa492f.

* update persist doc

Co-authored-by: Charles Blackmon-Luca <[email protected]>