While working on issue #36, I quickly came across the question of how flexible the API should be in querying nested fields. If a user only wants information from a nested column, should my API be able to query just that nested column and return it like a separate dataset, or can users always just query the surface columns and unfold the nested columns themselves? This leads to another question: is Parquet even designed to be queried in this fashion, and can a package leverage such a design? If not, then there is no point in providing a flexible API. Even if it is, at our current stage I tend to keep our design simple unless we absolutely need otherwise. Still, I want to understand Parquet better as I think it is important for later development, so I conducted a study anyway, and this issue serves as a record of my findings. It also serves as a notebook on how to query nested fields.
Parquet can be queried by just its nested columns:
A quick search revealed that Parquet is indeed designed for this: a Parquet file only stores the leaf columns of a schema.
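Consider a schema along these lines, sketched here with pyarrow (the exact schema is not given in this issue, so the concrete types are assumptions reconstructed from the leaf column names that follow):

import pyarrow as pa

# Hypothetical reconstruction of the example schema; types are assumptions.
schema = pa.schema([
    ("id", pa.string()),
    ("struct", pa.struct([
        ("nested_arr", pa.list_(pa.struct([
            ("field_a", pa.string()),
            ("field_b", pa.string()),
        ]))),
    ])),
])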
For such a schema there are just 3 columns in the Parquet file: id, struct.nested_arr.field_a and struct.nested_arr.field_b. Two additional pieces of information stored for each value of a nested column, the repetition level and the definition level, are used to reconstruct the hierarchy. There are plenty of explanations of this on the internet; here is one of them: https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper
Knowing this, it should in theory be possible to query just field_b. Imagine a user who only needs that field and is querying over cloud storage; this would be beneficial, as a lot of data could be skipped. The next question is whether existing packages are implemented to leverage this design.
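One way to see these leaf columns, and how much space each of them takes on disk (and hence how much a selective reader could skip), is to inspect the Parquet metadata, for example with pyarrow (a sketch, using the fake file generated later in this issue):

import pyarrow.parquet as pq

pf = pq.ParquetFile("fake-large.parquet")
print(pf.schema)  # the physical schema: one entry per leaf column

# Per-column-chunk sizes show how much data a selective reader could skip.
meta = pf.metadata
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        print(chunk.path_in_schema, chunk.total_compressed_size)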
Summary
In short, yes for pyarrow and polars, but only when the schema is simple. Surprisingly, it seems to be a perfect yes for duckdb. A reminder, though: their internal implementations are very different, so this interpretation could be incorrect.
Due to pyspark's separated computing server design, its behaviour is unstable and debugging is difficult, so I excluded it after a few tries.
Benchmark
Fake data generation
A fake, large Parquet file with the following schema is generated.
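Roughly, in pyarrow terms (a sketch inferred from the generating SQL below; the concrete types are assumptions):

import pyarrow as pa

# Sketch of the fake file's schema, inferred from the generating SQL below.
schema = pa.schema([
    ("id", pa.string()),
    ("struct", pa.struct([
        ("struct_arr", pa.list_(pa.struct([
            ("arr_a", pa.list_(pa.string())),
            ("arr_b", pa.list_(pa.string())),  # ~1,000,000 elements per row
        ]))),
        ("str", pa.string()),
    ])),
])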
The following SQL is used to generate the Parquet using duckdb.
COPY (
    SELECT
        CAST(gen_random_uuid() AS STRING) AS id,
        {
            struct_arr: [
                {
                    arr_a: [CAST(gen_random_uuid() AS STRING)],
                    arr_b: [CAST(gen_random_uuid() AS STRING) FOR i IN range(1000000)]
                }
            ],
            str: CAST(gen_random_uuid() AS STRING)
        } AS struct
    FROM range(10)
)
TO 'fake-large.parquet'
The nested field arr_b under the struct array struct_arr is very long, so scanning this field should significantly slow down the process. If a Parquet query engine is perfectly implemented, it should be able to skip it unless we ask for it. The script produces a file of around 350MB. I tried a larger file at first but quickly found that it puts more stress on memory, and the frequent memory management renders the tests not comparable; a smaller but repeated test does a better job.
Tests
A test against a package includes the following queries; to minimise overhead, streaming is also not used here:
full: scanning the whole file as is
id: the easiest one, just a surface column; should be lightning fast
struct: the surface column that contains the long nested array, providing a ground reference
struct.str: the easier kind of nested-field query, a field directly under a struct
struct.struct_arr: the nested field that contains the long nested array, another reference
struct.struct_arr.arr_a: the hardest case, a short array under a struct under an array; in theory this query should be fast
struct.struct_arr.arr_b: the long array sitting at the same level as arr_a, another reference
timeit was used to time each query; every query is repeated 3 times and the minimum is taken.
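The scripts below import a shared run helper from a base module that is not shown in this issue; here is a minimal sketch of what it might look like (names and details are assumptions):

import timeit


def run(funcs, repeat=3):
    # Time each query function and report the minimum of `repeat` runs.
    for func in funcs:
        best = min(timeit.repeat(func, repeat=repeat, number=1))
        print(f"{func.__name__}: {best:.3f}s")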
polars
Querying id is lightning fast compared to the full query, showing that polars can query surface columns. However, there is no significant difference between any of the nested-field queries, so it seems polars cannot query individual nested columns and instead pulls the whole record at once.
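For reference, these queries can be expressed in polars roughly as follows (a sketch, not necessarily the exact benchmark code):

import polars as pl

lf = pl.scan_parquet("fake-large.parquet")

lf.collect()                                                       # full
lf.select("id").collect()                                          # id
lf.select("struct").collect()                                      # struct
lf.select(pl.col("struct").struct.field("str")).collect()          # struct.str
lf.select(pl.col("struct").struct.field("struct_arr")).collect()   # struct.struct_arr
lf.select(                                                         # struct.struct_arr.arr_a
    pl.col("struct").struct.field("struct_arr")
    .list.eval(pl.element().struct.field("arr_a"))
).collect()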
pyarrow
Strangely, querying everything and querying just the id are equally fast; I don't know how to explain this. pyarrow also seems unable to query individual nested columns. I cannot be sure that type casting is the right way to query nested columns under an array, but it is the only method I could find.
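For reference, the simpler queries look roughly like this with pyarrow.dataset (a sketch, not necessarily the exact benchmark code; the cast-based trick for fields under struct_arr is not shown here):

import pyarrow.dataset as ds

dataset = ds.dataset("fake-large.parquet", format="parquet")

dataset.to_table()                    # full
dataset.to_table(columns=["id"])      # id
dataset.to_table(columns=["struct"])  # struct
# Nested field references can project fields directly under a struct column.
dataset.to_table(columns={"str": ds.field(("struct", "str"))})
dataset.to_table(columns={"struct_arr": ds.field(("struct", "struct_arr"))})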
duckdb
import duckdb
from base import run

def query(sql):
    duckdb.sql(sql).fetchall()

def full():
    sql = "SELECT * FROM read_parquet('fake-large.parquet')"
    query(sql)

def id():
    sql = "SELECT id FROM read_parquet('fake-large.parquet')"
    query(sql)

def struct():
    sql = "SELECT struct FROM read_parquet('fake-large.parquet')"
    query(sql)

def struct_str():
    sql = "SELECT struct.str FROM read_parquet('fake-large.parquet')"
    query(sql)

def struct_arr():
    sql = "SELECT struct.struct_arr FROM read_parquet('fake-large.parquet')"
    query(sql)

def struct_arr_arr_a():
    sql = "SELECT arr_a FROM (SELECT unnest(struct.struct_arr, max_depth := 2) FROM read_parquet('fake-large.parquet'))"
    query(sql)

def struct_arr_arr_b():
    sql = "SELECT arr_b FROM (SELECT unnest(struct.struct_arr, max_depth := 2) FROM read_parquet('fake-large.parquet'))"
    query(sql)
run([full, id, struct, struct_str, struct_arr, struct_arr_arr_a, struct_arr_arr_b])
This is a surprising one: while overall duckdb is the slowest, the timings of the different nested-column queries do suggest that duckdb is capable of querying individual nested columns. Since struct_arr_arr_a always has only one element, the column lengths of struct_str and struct_arr_arr_a in the Parquet file should be roughly the same, and the query times agree with that. full, struct, struct_arr and struct_arr_arr_b all have similar query times because all of them involve scanning the longest column. All in all the result is sensible and suggests that, surprisingly, duckdb has the best implementation in terms of leveraging the Parquet design, even though it is designed to be a generic query engine for many formats.
Remarks
Even if a package were perfect, providing a fully flexible API that reflects its capability is perhaps over-optimisation, at least for now. In the future this may become beneficial if querying over cloud storage becomes common. This study nevertheless gave me a better understanding of the Parquet format and of existing data science packages. Since polars still struggles to stream, it remains out of consideration. It took me a while to figure out how to deal with pyarrow's high memory usage: it has an aggressive caching policy by default, which can be turned off by supplying fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=False). duckdb is slower, but its attention to detail continues to impress me. pyarrow is more of a low-level Parquet file inspector than a query engine; its API is harder to use, but it is the most feature-complete in terms of Parquet support. pyspark often crashes with no debugging information available due to its separated computing design: when the computing server gives up, the Python client gets a connection-closed exception rather than the underlying cause, which is unhelpful. Its high maintenance cost keeps it out of consideration. pyarrow and duckdb are my top candidates right now.
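For example, that option can be supplied per scan roughly like this (a sketch; it can presumably also be set as a default via ParquetFileFormat):

import pyarrow.dataset as ds

dataset = ds.dataset("fake-large.parquet", format="parquet")

# Disable pre-buffering (the aggressive read-ahead mentioned above) for this scan
# to keep memory usage down.
scanner = dataset.scanner(
    columns=["id"],
    fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=False),
)
table = scanner.to_table()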
This issue is a subtask of the new API design #36 and will be closed once that is finished.
@slobentanzer Sorry for confusing you. I mean that pyarrow is more focused on low-level, comprehensive inspection and manipulation of a Parquet file than on treating it as a high-level data container, but it still ships a lot of compute functionality, which makes it useful for data processing. pyarrow can inspect the file structure of a Parquet file, including all metadata and even row groups. You can also construct a new Parquet file from the ground up. polars users often need to use pyarrow for these functionalities.
That being said, I just found that duckdb actually offers a similarly comprehensive API for Parquet files, amazing.