(I don't really know whether this should be a bug report or a feature request.)
Description
GeoParquet files store metadata (such as the projection system) either at the file level or at the field level of the geometry column. The geopandas library uses the latter option in its `to_parquet` method on `GeoDataFrame`.
When I use the `pl.read_parquet` function to read a GeoParquet file created by geopandas, I get a `PanicException` with the reason `Arrow datatype Extension(ExtensionType { name: "geoarrow.wkb", inner: BinaryView, metadata: [...]}) not supported by polars`. This is not really surprising so far, and the error message is clear (although it would help if it mentioned the column name).
My issue arises when running `pl.read_parquet(filename, columns=["id"])` or `pl.scan_parquet(filename).select("id").collect()`. These two commands lead to the same error, while I would expect them to work without issue because the `geometry` column can be ignored.
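For reference, here is a minimal reproduction sketch, assuming a geopandas version that writes the `geoarrow.wkb` extension metadata on the geometry field; the file name and the toy data are placeholders:

```python
import geopandas as gpd
import polars as pl
from shapely.geometry import Point

# Build a small GeoDataFrame and write it with geopandas' to_parquet,
# which stores GeoArrow extension metadata on the geometry field.
gdf = gpd.GeoDataFrame(
    {"id": [1, 2, 3]},
    geometry=[Point(0, 0), Point(1, 1), Point(2, 2)],
    crs="EPSG:4326",
)
gdf.to_parquet("data.parquet")

# Reading the whole file panics, which is expected given the extension dtype:
# pl.read_parquet("data.parquet")

# ...but selecting only the non-geometry column panics as well:
pl.read_parquet("data.parquet", columns=["id"])
pl.scan_parquet("data.parquet").select("id").collect()
```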
Suggestion
After some investigation, I identified that the error occurs because polars reads the full schema of the parquet file (here) and converts it from arrow dtypes to "polars dtypes", which it cannot do for the Extension dtype (here).
I thought of 2 ways to solve the issue:
1. Read the schema only for the columns that are requested.
2. Read Extension dtypes as their inner type.
Solution 2 would be the best for my use case because I would then even be able to read the `geometry` column as a `Binary` column. However, the field metadata would then be lost, which might have unexpected consequences in other situations?
I think I can easily open a PR for solution 2, but I might need some help for solution 1 because it involves changing the (obscure to me) `polars-plan` crate.
Should we implement one of those solutions? Which one?
Workaround
Note that what I want to do can already be achieved with `pl.scan_parquet(filename, schema={"id": pl.Int64}).select("id").collect()`, because polars does not read the full schema when one is provided as an argument to `pl.scan_parquet` or `pl.read_parquet`. This is not convenient, though.
It also works when using pyarrow to read the parquet file, but that option is not available for `pl.scan_parquet`.
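Both workarounds, sketched on the placeholder file from the repro above:

```python
import polars as pl

# Workaround 1: provide the schema explicitly so polars skips reading the
# full parquet schema and never hits the geoarrow extension dtype.
df = (
    pl.scan_parquet("data.parquet", schema={"id": pl.Int64})
    .select("id")
    .collect()
)

# Workaround 2: let pyarrow handle the schema via the pyarrow reader
# (only available for the eager read_parquet, not for scan_parquet).
df = pl.read_parquet("data.parquet", columns=["id"], use_pyarrow=True)
```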