(I don't really know whether this should be a bug report or a feature request.)
Description
GeoParquet files store metadata (such as the projection system) either at the file level or at the field level of the geometry column. The geopandas library uses the latter option in its `to_parquet` method on `GeoDataFrame`.
When I use the `pl.read_parquet` function to read a GeoParquet file created by geopandas, I get a `PanicException` with the reason `Arrow datatype Extension(ExtensionType { name: "geoarrow.wkb", inner: BinaryView, metadata: [...]}) not supported by polars`. This is not really surprising so far, and the error message is clear (although it would help if it mentioned the column name).
My issue arises when running `pl.read_parquet(filename, columns=["id"])` or `pl.scan_parquet(filename).select("id").collect()`. These two commands lead to the same error, while I would expect them to work without issue because the `geometry` column can be ignored.
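For reference, here is a minimal reproduction sketch, assuming a geopandas version that writes the `geoarrow.wkb` extension metadata on the geometry field; the file name and the toy data are placeholders:

```python
import geopandas as gpd
import polars as pl
from shapely.geometry import Point

# Build a small GeoDataFrame and write it with geopandas' to_parquet,
# which stores GeoArrow extension metadata on the geometry field.
gdf = gpd.GeoDataFrame(
    {"id": [1, 2, 3]},
    geometry=[Point(0, 0), Point(1, 1), Point(2, 2)],
    crs="EPSG:4326",
)
gdf.to_parquet("data.parquet")

# Reading the whole file panics, which is expected given the extension dtype:
# pl.read_parquet("data.parquet")

# ...but selecting only the non-geometry column panics as well:
pl.read_parquet("data.parquet", columns=["id"])
pl.scan_parquet("data.parquet").select("id").collect()
```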
Suggestion
After some investigation, I identified that the error occurs because polars reads the full schema of the parquet file (here) and converts it from arrow dtypes to "polars dtypes", which it cannot do for the Extension dtype (here).
I thought of 2 ways to solve the issue:
1. Read the schema only for the columns that are requested.
2. Read Extension dtypes as their inner type.
Solution 2 would be the best for my use case because I would then even be able to read the `geometry` column as a `Binary` column. However, the field metadata would then be lost, which might have unexpected consequences in other situations?
I think I can easily open a PR for solution 2, but I might need some help for solution 1 because it involves changing the (obscure to me) `polars-plan` crate.
Should we implement one of those solutions? Which one?
Workaround
Note that what I want to do can already be achieved with `pl.scan_parquet(filename, schema={"id": pl.Int64}).select("id").collect()`, because polars does not read the full schema when one is provided as an argument to `pl.scan_parquet` or `pl.read_parquet`. This is not convenient, though.
It also works when using pyarrow to read the parquet file, but that option is not available for `pl.scan_parquet`.
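Both workarounds, sketched on the placeholder file from the repro above:

```python
import polars as pl

# Workaround 1: provide the schema explicitly so polars skips reading the
# full parquet schema and never hits the geoarrow extension dtype.
df = (
    pl.scan_parquet("data.parquet", schema={"id": pl.Int64})
    .select("id")
    .collect()
)

# Workaround 2: let pyarrow handle the schema via the pyarrow reader
# (only available for the eager read_parquet, not for scan_parquet).
df = pl.read_parquet("data.parquet", columns=["id"], use_pyarrow=True)
```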