Use the correct hive partition type information (hacky solution) #1
base: partition_escape
Conversation
    partition_table_information["Partition Value"]
)
partition_values = partition_table_information["Partition Value"]
partition_values = partition_values[1 : len(partition_values) - 1]
bug (blocking): Unless len(partition_values) is 1 (will it always be?), this will return a list object, and the subsequent .split(',') call will fail.
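To make the failure mode concrete, here is a minimal sketch of a defensive version; extract_partition_values is a hypothetical helper (not part of this PR) and it assumes the raw value is either Hive's bracketed string form or an already-parsed list:

```python
# Hypothetical sketch, not the dask-sql implementation: normalize the
# "Partition Value" entry before splitting, so both a raw string like
# "[2021, us]" and an already-parsed list survive.
def extract_partition_values(raw):
    # Already a list/tuple (e.g. a pandas Series turned into a list):
    # nothing to split, just stringify the entries.
    if isinstance(raw, (list, tuple)):
        return [str(v).strip() for v in raw]
    # Otherwise assume Hive's string form, e.g. "[2021, us]".
    text = str(raw).strip()
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]
    return [part.strip() for part in text.split(",") if part.strip()]

print(extract_partition_values("[2021, us]"))    # ['2021', 'us']
print(extract_partition_values(["2021", "us"]))  # ['2021', 'us']
```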
A bigger issue I've realized here is that for tables with many partitions (I'm trying to read one with 800+ partitions), this function reads in all the data first and only then lets you filter, rather than allowing table filtering before reading. This seems like a design problem. Additionally, I know that my table's Parquet schema got adjusted at some point and might not be compatible for concatenating, so I have to filter before reading.
Right, unfortunately, that is true. We are facing multiple issues here that all in all lead to the problem you are describing. How could we solve that?

The easiest way would be to implement dask-contrib#183, turn the

Another possibility would be to extract the hive part into a convenience function that you could call standalone to get the file paths. You would then be able to filter the file paths yourself and only load the ones you care about.

Still another possibility would be to use Hive directly: fire a Hive query to get only the data of the partitions you care about and load it into dask (e.g. with a

I can help you with all three possibilities, if you like, even though the first solution will be the most complex one.
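As a rough illustration of the second option (getting the file paths first and filtering them before anything is read), assuming a local hive-style directory layout; the paths and partition columns below are made up:

```python
# Hypothetical sketch of the "convenience function" idea: list the hive
# partition directories yourself, filter on the partition values you care
# about, and only read those paths.
import glob
import dask.dataframe as dd

# Hive layout: .../my_table/year=2021/country=us/part-*.parquet
all_partitions = glob.glob("/data/warehouse/my_table/year=*/country=*")

# Keep only the partitions you actually need, before any data is read.
wanted = [path for path in all_partitions if "year=2021" in path]

# Read just those directories; incompatible schemas in the skipped
# partitions never become a problem.
df = dd.read_parquet(wanted)
```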
Hi @gallamine!
Thanks for your very good bug fix in dask-contrib#180!
I have also tried to add a very hacky solution for dask-contrib#181 to your PR.
In my small tests that has worked, but I think it will break on many occasions. I am wondering if we should go down this path and make it more robust, or if we should invest in a better solution where we do not need to translate between the string representation of the values from DESCRIBE FORMATTED and the Python types all the time. I took this idea from the blazing folks, but I am open to other ideas. What do you think?
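For context, a minimal sketch of the kind of string-to-Python-type translation being discussed; the type names and mapping below are illustrative only and not the code in this PR:

```python
# Hypothetical sketch: map the hive type names reported by
# DESCRIBE FORMATTED onto Python casts and apply them to the string
# partition values. The mapping is illustrative, not the dask-sql code.
_HIVE_TYPE_CASTS = {
    "int": int,
    "bigint": int,
    "smallint": int,
    "double": float,
    "float": float,
    "boolean": lambda s: s.lower() == "true",
    "string": str,
}

def cast_partition_value(value: str, hive_type: str):
    # Fall back to keeping the value as a string for unknown types.
    cast = _HIVE_TYPE_CASTS.get(hive_type.lower(), str)
    return cast(value)

print(cast_partition_value("2021", "int"))      # 2021
print(cast_partition_value("true", "boolean"))  # True
```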