Reading geoparquet files from geopandas #20978

Open
LucasJavaudin opened this issue Jan 29, 2025 · 0 comments
Labels
enhancement New feature or an improvement of an existing feature

Description

Hello.

(I don't really know whether this should be a bug report or a feature request.)

Geoparquet files store metadata (such as the projection system) either at the file level or at the field level of the geometry column. The geopandas library uses the latter option in the to_parquet method of GeoDataFrame.

When I use the pl.read_parquet function to read a geoparquet file created by geopandas, I get a PanicException with reason Arrow datatype Extension(ExtensionType { name: "geoarrow.wkb", inner: BinaryView, metadata: [...]}) not supported by polars. This is not really surprising so far and the error message is clear (although the column name could be specified).

My issue arises when running pl.read_parquet(filename, columns=["id"]) or pl.scan_parquet(filename).select("id").collect(). Both commands raise the same error, while I would expect them to work because the geometry column is not requested and could be ignored.

Suggestion

After some investigation, I identified that the error occurs because polars tries to read the full schema of the parquet file (here) and tries to convert it from arrow dtypes to "polars dtypes", which it cannot do for the Extension dtype (here).

I thought of 2 ways to solve the issue:

  1. Read only the schema for the columns which are "requested".
  2. Read the Extension dtypes as their inner type.

Solution 2 would be the best for my use case because I would then even be able to read the geometry column as a Binary column. However, the field metadata would then be lost, which might have unexpected consequences in other situations.
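To make the difference between the two options concrete, here is a toy model in plain Python. It does not use polars internals: ArrowField, to_polars_dtype, convert_schema and the SUPPORTED table are all hypothetical names invented for this illustration, and the real conversion lives in Rust. It only sketches where each option would change the behaviour.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class ArrowField:
    """Toy stand-in for an Arrow field (hypothetical, not a polars type)."""
    name: str
    dtype: str                            # inner/storage type, e.g. "BinaryView"
    extension_name: Optional[str] = None  # e.g. "geoarrow.wkb"


# Hypothetical mapping from Arrow storage types to polars dtype names.
SUPPORTED = {"Int64": "Int64", "BinaryView": "Binary"}


def to_polars_dtype(f: ArrowField, unwrap_extension: bool = False) -> str:
    # Option 2: when unwrap_extension is True, ignore the extension wrapper
    # and convert the inner type. The extension name (metadata) is dropped.
    if f.extension_name is not None and not unwrap_extension:
        raise TypeError(
            f"column {f.name!r}: Arrow Extension type "
            f"{f.extension_name!r} not supported"
        )
    return SUPPORTED[f.dtype]


def convert_schema(
    fields: Iterable[ArrowField],
    columns: Optional[List[str]] = None,
    unwrap_extension: bool = False,
) -> Dict[str, str]:
    # Option 1: project the schema to the requested columns *before*
    # converting, so unsupported fields that are not requested are never seen.
    if columns is not None:
        fields = [f for f in fields if f.name in columns]
    return {f.name: to_polars_dtype(f, unwrap_extension) for f in fields}


file_schema = [
    ArrowField("id", "Int64"),
    ArrowField("geometry", "BinaryView", extension_name="geoarrow.wkb"),
]

only_id = convert_schema(file_schema, columns=["id"])           # option 1
with_binary = convert_schema(file_schema, unwrap_extension=True)  # option 2
```

In this model, option 1 never touches the geometry field, while option 2 converts it but returns no trace of the "geoarrow.wkb" name, which is the metadata loss mentioned above.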

I think I can easily do a PR for solution 2. but I might need some help for solution 1. because it involves changing the (obscure to me) polars-plan crate.

Should we implement one of those solutions? Which one?

Workaround

Note that what I want to do can already be achieved with pl.scan_parquet(filename, schema={"id": pl.Int64}).select("id").collect() because polars does not read the full schema when it is provided as an argument to pl.scan_parquet or pl.read_parquet. This is not convenient, though, since the schema must be known in advance.

It also works when using pyarrow to read the parquet file but this is not available for pl.scan_parquet.
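For reference, a minimal sketch of the pyarrow route, assuming both pyarrow and polars are installed. The helper name read_columns_via_pyarrow is hypothetical. It lets pyarrow read only the requested columns and then hands the resulting table to polars, so polars never has to convert the geometry field. Being eager, it is not a replacement for pl.scan_parquet.

```python
from typing import List


def read_columns_via_pyarrow(path: str, columns: List[str]):
    """Read selected columns of a parquet file with pyarrow, return a polars DataFrame.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so this module can be loaded even when the optional
    # dependencies are missing.
    import polars as pl
    import pyarrow.parquet as pq

    # pyarrow only materialises the requested columns.
    table = pq.read_table(path, columns=columns)
    # Convert the Arrow table to a polars DataFrame.
    return pl.from_arrow(table)
```

Usage would look like read_columns_via_pyarrow(filename, ["id"]), where filename is the geoparquet file written by geopandas.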

LucasJavaudin added the enhancement label on Jan 29, 2025