ENH: improve support for datetime columns #486

Status: Open. Wants to merge 28 commits into base: main.

Changes from 15 commits

Commits (28)
- aaf8818: ENH: deal properly with naive datetimes with arrow (theroggy, Oct 17, 2024)
- 3e463a1: Add more testcases, also for tz datetimes (theroggy, Oct 18, 2024)
- afdd0c1: Merge remote-tracking branch 'upstream/main' into ENH-deal-properly-w… (theroggy, Jan 16, 2025)
- c18ab22: Use datetime_as_string for reading with arrow (theroggy, Jan 17, 2025)
- 597855f: Update _io.pyx (theroggy, Jan 17, 2025)
- fa4b86e: Skip tests where appropriate (theroggy, Jan 17, 2025)
- 0e41ae4: Improve support for mixed and naive datetimes (theroggy, Jan 17, 2025)
- 1378ace: Skip use_arrow tests with old gdal versions (theroggy, Jan 17, 2025)
- 0f1ab27: Take in account pandas version (theroggy, Jan 17, 2025)
- 6f78c68: Update test_geopandas_io.py (theroggy, Jan 17, 2025)
- 336d0d8: Also support columns with datetime objects (theroggy, Jan 18, 2025)
- 3035a11: Rename some test functions for consistency (theroggy, Jan 18, 2025)
- 9efdc09: Avoid warning in test (theroggy, Jan 18, 2025)
- eb80e08: Improve inline comment (theroggy, Jan 18, 2025)
- d50b2d0: Update CHANGES.md (theroggy, Jan 18, 2025)
- 47aa298: Merge remote-tracking branch 'upstream/main' into ENH-deal-properly-w… (theroggy, Jan 19, 2025)
- 1efa5bf: Symplify code (theroggy, Jan 20, 2025)
- 0032839: Don't cast UTC data to string when writing (theroggy, Jan 20, 2025)
- 9d2bfce: Various improvements to tests (theroggy, Jan 20, 2025)
- ca9a8ae: Smal fixes to tests (theroggy, Jan 20, 2025)
- deb862c: Xfail some tests where needed (theroggy, Jan 20, 2025)
- e35c356: Make UTC assert more specific (theroggy, Jan 22, 2025)
- 593b282: Revert "Make UTC assert more specific" (theroggy, Jan 22, 2025)
- 35d8d87: Update test_geopandas_io.py (theroggy, Jan 22, 2025)
- 41c9da6: Use astype("string") instead of apply (theroggy, Jan 23, 2025)
- f53af87: Improve tests (theroggy, Jan 23, 2025)
- a8c85b7: Fix tests for older versions (theroggy, Jan 23, 2025)
- 40ca1a5: Update test_geopandas_io.py (theroggy, Jan 23, 2025)
1 change: 1 addition & 0 deletions CHANGES.md
@@ -5,6 +5,7 @@
### Improvements

- Capture all errors logged by gdal when opening a file fails (#495).
- Improve support for datetime columns with mixed or naive times (#486).

### Bug fixes

8 changes: 8 additions & 0 deletions pyogrio/_io.pyx
@@ -1430,6 +1430,7 @@ def ogr_open_arrow(
int return_fids=False,
int batch_size=0,
use_pyarrow=False,
datetime_as_string=False,
):

cdef int err = 0
@@ -1624,6 +1625,12 @@ def ogr_open_arrow(
"GEOARROW".encode('UTF-8')
)

# Read DateTime fields as strings, as the Arrow DateTime column type is
# quite limited regarding support for mixed timezones,...
IF CTE_GDAL_VERSION >= (3, 11, 0):
if datetime_as_string:
options = CSLSetNameValue(options, "DATETIME_AS_STRING", "YES")

# make sure layer is read from beginning
OGR_L_ResetReading(ogr_layer)

@@ -1649,6 +1656,7 @@
'crs': crs,
'encoding': encoding,
'fields': fields[:,2], # return only names
"dtypes": fields[:,3],
'geometry_type': geometry_type,
'geometry_name': geometry_name,
'fid_column': fid_column,
120 changes: 108 additions & 12 deletions pyogrio/geopandas.py
@@ -2,6 +2,7 @@

import os
import warnings
from datetime import datetime

import numpy as np

@@ -39,6 +40,7 @@ def _try_parse_datetime(ser):
datetime_kwargs = {"format": "ISO8601", "errors": "ignore"}
else:
datetime_kwargs = {"yearfirst": True}

with warnings.catch_warnings():
warnings.filterwarnings(
"ignore",
@@ -51,12 +53,6 @@
res = pd.to_datetime(ser, **datetime_kwargs)
except Exception:
res = ser
# if object dtype, try parse as utc instead
if res.dtype == "object":
try:
res = pd.to_datetime(ser, utc=True, **datetime_kwargs)
except Exception:
pass
Comment on lines -54 to -59

Member:

From your top post explanation:

> Add support to read mixed timezone datetimes. These are returned in an object column with Timestamps.

First, I don't think this will work with the upcoming pandas 3.x (we are suppressing the warning above that mixed timezones will raise unless utc=True is passed, and that you then have to use apply with datetime.datetime.strptime to get mixed-offset objects). But the tests are passing, so maybe I am missing something.

Second, a column of mixed-offset objects is in general not particularly useful, so changing this behaviour feels like a regression to me. I understand that we might want to give the user the option to get this, but I am not sure about doing it by default.

Member Author (@theroggy), Jan 19, 2025:

> First, I don't think this will work with the upcoming pandas 3.x ... But the tests are passing, so maybe I am missing something.

Yes, I saw. Do you know what the rationale is for forcing people in pandas 3 to use a less efficient way (apply) to get to their data?

> Second, a column of mixed-offset objects is in general not particularly useful, so changing this behaviour feels like a regression to me. I understand that we might want to give the user the option to get this, but I am not sure about doing it by default.

For starters, to be clear, this is only relevant for mixed timezone data. Data saved as naive or UTC timestamps should just stay in "regular" pandas datetime columns.

For mixed timezone data, it depends on what you want to do with it. If you just want to look at it, show it, or keep it as part of the table data, the Timestamps look fine to me. If you really want to do "fancy stuff" with the datetimes, it will indeed be more convenient in pandas to transform them into e.g. UTC datetimes so you get a datetime column instead of an object column.

Regarding the default behaviour, it feels quite odd to me to transform the data by default into a form where information (the original time zone) is lost. Also, when you save the data again it will then be saved as UTC, so there too the timezone information is lost.

To me, the other way around is more logical: by default you don't lose data. If you want to do "fancy stuff" with a datetime column that contains mixed timezone data, you convert it to e.g. UTC, typically in an extra column, because most likely you will want to keep the timezone information when saving.
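For illustration, a minimal sketch of the two read behaviours discussed above (assuming pandas 2.x; the values are made up):

```python
import pandas as pd

ser = pd.Series(["2023-01-15T10:00:00+01:00", "2023-07-15T10:00:00+02:00"])

# Without utc=True, mixed offsets come back as an object column of Timestamps
# (pandas 2.x warns that this will raise in pandas 3.x unless utc=True is passed).
mixed = pd.to_datetime(ser, format="ISO8601")
print(mixed.dtype)  # object

# With utc=True the values are normalised to UTC and the original offsets are lost.
utc = pd.to_datetime(ser, format="ISO8601", utc=True)
print(utc.dtype)  # datetime64[ns, UTC]
```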


if res.dtype != "object":
# GDAL only supports ms precision, convert outputs to match.
@@ -66,6 +62,7 @@ def _try_parse_datetime(ser):
res = res.dt.as_unit("ms")
else:
res = res.dt.round(freq="ms")

return res
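A quick sketch of the two millisecond-precision paths above (illustrative value; the `as_unit` path needs pandas >= 2.0):

```python
import pandas as pd

ser = pd.to_datetime(pd.Series(["2023-01-15 10:00:00.123456789"]))

# pandas >= 2.0: change the resolution of the underlying values to milliseconds.
print(ser.dt.as_unit("ms").dtype)   # datetime64[ms]

# Older pandas: keep nanosecond resolution but round the values to the millisecond.
print(ser.dt.round(freq="ms")[0])   # 2023-01-15 10:00:00.123000
```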


@@ -257,11 +254,10 @@ def read_dataframe(

read_func = read_arrow if use_arrow else read
gdal_force_2d = False if use_arrow else force_2d
if not use_arrow:
# For arrow, datetimes are read as is.
# For numpy IO, datetimes are read as string values to preserve timezone info
# as numpy does not directly support timezones.
kwargs["datetime_as_string"] = True

# Always read datetimes as string values to preserve (mixed) timezone info
# as numpy does not directly support timezones and arrow datetime columns
# don't support mixed timezones.
result = read_func(
path_or_buffer,
layer=layer,
@@ -278,6 +274,7 @@
sql=sql,
sql_dialect=sql_dialect,
return_fids=fid_as_index,
datetime_as_string=True,
**kwargs,
)

@@ -292,6 +289,11 @@
df = table.to_pandas(**kwargs)
del table

# convert datetime columns that were read as string to datetime
for dtype, column in zip(meta["dtypes"], meta["fields"]):
if dtype is not None and dtype.startswith("datetime"):
df[column] = _try_parse_datetime(df[column])

if fid_as_index:
df = df.set_index(meta["fid_column"])
df.index.names = ["fid"]
@@ -482,6 +484,8 @@ def write_dataframe(
gdal_tz_offsets = {}
for name in fields:
col = df[name]
values = None

if isinstance(col.dtype, pd.DatetimeTZDtype):
# Deal with datetimes with timezones by passing down timezone separately
# pass down naive datetime
@@ -496,8 +500,28 @@
# Convert each row offset to a signed multiple of 15m and add to GMT value
gdal_offset_representation = tz_offset // pd.Timedelta("15m") + 100
gdal_tz_offsets[name] = gdal_offset_representation.values
else:

elif col.dtype == "object":
# Column of Timestamp objects, also split in naive datetime and tz offset
col_na = df[col.notna()][name]
if len(col_na) and all(isinstance(x, pd.Timestamp) for x in col_na):
Member:

I am a bit hesitant to add custom support for this, exactly because it is not really supported by pandas itself; do we then need to add special support for it?

Right now, if you have an object dtype column with Timestamp values, they already get written as strings, which in the end preserves the offset information (in the string representation). It might read back as strings (depending on the file format), but at that point the user can handle the column as they see fit.

Member Author (@theroggy):

If they are written as strings to the output file without the proper metadata, it depends on the format whether they will be recognized as datetimes when read. For text-based files the data types are "guessed" when the file is read (e.g. GeoJSON), so they will typically be recognized as datetime; for files like .fgb and .gpkg they won't be, as the file metadata will be wrong.

That's not very clean, and as it is very easy to solve I don't quite see the point of not supporting it properly.

tz_offset = col.apply(lambda x: None if pd.isna(x) else x.utcoffset())
gdal_offset_repr = tz_offset // pd.Timedelta("15m") + 100
gdal_tz_offsets[name] = gdal_offset_repr.values
naive = col.apply(lambda x: None if pd.isna(x) else x.tz_localize(None))
values = naive.values
elif len(col_na) and all(isinstance(x, datetime) for x in col_na):
tz_offset = col.apply(lambda x: None if pd.isna(x) else x.utcoffset())
gdal_offset_repr = tz_offset // pd.Timedelta("15m") + 100
gdal_tz_offsets[name] = gdal_offset_repr.values
naive = col.apply(
lambda x: None if pd.isna(x) else x.replace(tzinfo=None)
)
values = naive.values

if values is None:
values = col.values

if isinstance(values, pd.api.extensions.ExtensionArray):
from pandas.arrays import BooleanArray, FloatingArray, IntegerArray
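As an aside, a small sketch of the GDAL timezone encoding used for the offsets above (a signed number of 15-minute blocks added to 100, where 100 represents UTC); the offsets are illustrative:

```python
import pandas as pd

offsets = pd.Series([
    pd.Timedelta(hours=2),   # e.g. +02:00 (CEST)
    pd.Timedelta(hours=-5),  # e.g. -05:00 (EST)
    pd.Timedelta(0),         # UTC
])
gdal_repr = offsets // pd.Timedelta("15m") + 100
print(gdal_repr.tolist())  # [108, 80, 100]
```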

@@ -620,8 +644,35 @@
df = pd.DataFrame(df, copy=False)
df[geometry_column] = geometry

# Convert all datetime columns to isoformat strings, to avoid mixed timezone
# information getting lost.
datetime_cols = []
for name, dtype in df.dtypes.items():
col = df[name]
if dtype == "object":
# When all non-NA values are Timestamps, treat as datetime column
col_na = df[col.notna()][name]
if len(col_na) and all(
isinstance(x, (pd.Timestamp, datetime)) for x in col_na
):
df[name] = col.apply(
lambda x: None if pd.isna(x) else x.isoformat()
)
datetime_cols.append(name)
elif isinstance(dtype, pd.DatetimeTZDtype):
# Also for regular datetime columns with a timezone, mixed offsets are
# possible when there is a difference between summer and winter time.
df[name] = col.apply(lambda x: None if pd.isna(x) else x.isoformat())
datetime_cols.append(name)
Member:

Is this needed for properly typed datetime64 columns? What does GDAL do with those values? Write them as the UTC value? And with this change it will still write them as datetime (because of the added metadata?), but with the offset?

FWIW, if we need to do this, you can use df[name].astype(str) to avoid the apply.

Member Author (@theroggy), Jan 20, 2025:

> Is this needed for properly typed datetime64 columns? What does GDAL do with those values? Write them as the UTC value?

It is actually a bit weird. Based on what I tested, when the column is in the UTC time zone, the data is written correctly to the file. If the column has another time zone, it is simply dropped and the naive times are written. This is the case for both GPKG and e.g. .geojson.

Hence, naive and UTC times can be written via a native arrow datetime column, but in the other cases they need to be written via a sidestep to a string column. There is a TIMEZONE option that can be specified in the GDAL arrow code path... but it only applies at the layer level, not per column, so that is not super useful either. Based on a quick test it also didn't seem to work for timezones like "CET".

> And with this change it will still write them as datetime (because of the added metadata?), but with the offset?

With this change, offsets will be correctly written as timestamps, indeed because of the custom arrow metadata being added and interpreted by GDAL from GDAL 3.11 onwards (OSGeo/gdal#11213).

> FWIW, if we need to do this, you can use df[name].astype(str) to avoid the apply.

Interesting! Note however that df[name].astype(str) outputs "... ..." instead of "...T...", so not strictly valid ISO strings. But astype gives significantly better performance (2 sec instead of 12 sec for the nz buildings dataset), and that is probably not that important... so I changed it. If we prefer "...T...", we can add .str.replace(" ", "T") for GDAL < 3.11, which only adds 0.5 sec. I used df[name].astype("string"), otherwise None/NaT is also cast to a string.


table = pa.Table.from_pandas(df, preserve_index=False)

# Add metadata to datetime columns so GDAL knows they are datetimes.
for datetime_col in datetime_cols:
table = _add_column_metadata(
table, column_metadata={datetime_col: {"GDAL:OGR:type": "DateTime"}}
)

if geometry_column is not None:
# ensure that the geometry column is binary (for all-null geometries,
# this could be a wrong type)
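A small illustration of the separator difference mentioned in the last reply (assuming pandas 2.x; the value is made up):

```python
import pandas as pd

ser = pd.Series([pd.Timestamp("2023-01-15 10:00:00+01:00"), pd.NaT])

# astype("string") uses a space separator and keeps missing values as <NA> ...
print(ser.astype("string")[0])  # 2023-01-15 10:00:00+01:00

# ... while isoformat() produces the "T"-separated form.
print(ser[0].isoformat())       # 2023-01-15T10:00:00+01:00
```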
@@ -681,3 +732,48 @@ def write_dataframe(
gdal_tz_offsets=gdal_tz_offsets,
**kwargs,
)


def _add_column_metadata(table, column_metadata: dict = {}):
"""Add or update column-level metadata to an arrow table.

Parameters
----------
table : pyarrow.Table
The table to add the column metadata to.
column_metadata : dict
A dictionary with column metadata in the form
{
"column_1": {"some": "data"},
"column_2": {"more": "stuff"},
}

Returns
-------
pyarrow.Table: table with the updated column metadata.
"""
import pyarrow as pa

if not column_metadata:
return table

# Create updated column fields with new metadata
fields = []
for col in table.schema.names:
if col in column_metadata:
# Add/update column metadata
metadata = table.field(col).metadata or {}
for key, value in column_metadata[col].items():
metadata[key] = value
# Update field with updated metadata
fields.append(table.field(col).with_metadata(metadata))
else:
fields.append(table.field(col))

# Create new schema with the updated field metadata
schema = pa.schema(fields, metadata=table.schema.metadata)

# Build new table with updated schema (shouldn't copy data)
table = table.cast(schema)

return table
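For reference, a rough sketch of what the helper above produces for a single string column (requires pyarrow; the column name and value are made up):

```python
import pyarrow as pa

table = pa.table({"when": ["2023-01-15T10:00:00+01:00"]})
table = _add_column_metadata(table, {"when": {"GDAL:OGR:type": "DateTime"}})

# The metadata now travels with the field, so GDAL >= 3.11 can map the string
# column back to an OGR DateTime field when writing.
print(table.field("when").metadata)  # {b'GDAL:OGR:type': b'DateTime'}
```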
8 changes: 8 additions & 0 deletions pyogrio/raw.py
@@ -233,6 +233,7 @@ def read_arrow(
sql=None,
sql_dialect=None,
return_fids=False,
datetime_as_string=False,
**kwargs,
):
"""Read OGR data source into a pyarrow Table.
@@ -303,6 +304,7 @@
skip_features=gdal_skip_features,
batch_size=batch_size,
use_pyarrow=True,
datetime_as_string=datetime_as_string,
**kwargs,
) as source:
meta, reader = source
@@ -358,6 +360,7 @@ def open_arrow(
return_fids=False,
batch_size=65_536,
use_pyarrow=False,
datetime_as_string=False,
**kwargs,
):
"""Open OGR data source as a stream of Arrow record batches.
@@ -386,6 +389,9 @@
ArrowStream object. In the default case, this stream object needs
to be passed to another library supporting the Arrow PyCapsule
Protocol to consume the stream of data.
datetime_as_string : bool, optional (default: False)
If True, fields detected as datetime by GDAL are returned as strings,
since Arrow datetime columns cannot represent e.g. mixed timezones.

Examples
--------
@@ -423,6 +429,7 @@
Meta is: {
"crs": "<crs>",
"fields": <ndarray of field names>,
"dtypes": <ndarray of numpy dtypes corresponding to fields>
"encoding": "<encoding>",
"geometry_type": "<geometry_type>",
"geometry_name": "<name of geometry column in arrow table>",
@@ -453,6 +460,7 @@
dataset_kwargs=dataset_kwargs,
batch_size=batch_size,
use_pyarrow=use_pyarrow,
datetime_as_string=datetime_as_string,
)
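A minimal usage sketch of the new keyword (the file path is hypothetical; `read_arrow` returns a metadata dict and a pyarrow Table):

```python
from pyogrio import read_arrow

# With datetime_as_string=True, datetime fields come back as Arrow string
# columns, while meta["dtypes"] still records the GDAL-detected dtype per field.
meta, table = read_arrow("example.gpkg", datetime_as_string=True)
for name, dtype in zip(meta["fields"], meta["dtypes"]):
    if dtype is not None and dtype.startswith("datetime"):
        print(name, "->", table.schema.field(name).type)  # e.g. string
```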

