ENH: improve support for datetime columns #486
@@ -2,6 +2,7 @@
 import os
 import warnings
+from datetime import datetime

 import numpy as np
@@ -39,6 +40,7 @@ def _try_parse_datetime(ser):
         datetime_kwargs = {"format": "ISO8601", "errors": "ignore"}
     else:
         datetime_kwargs = {"yearfirst": True}
+
     with warnings.catch_warnings():
         warnings.filterwarnings(
             "ignore",
@@ -51,12 +53,6 @@ def _try_parse_datetime(ser):
         res = pd.to_datetime(ser, **datetime_kwargs)
     except Exception:
         res = ser
-    # if object dtype, try parse as utc instead
-    if res.dtype == "object":
-        try:
-            res = pd.to_datetime(ser, utc=True, **datetime_kwargs)
-        except Exception:
-            pass

     if res.dtype != "object":
         # GDAL only supports ms precision, convert outputs to match.
@@ -66,6 +62,7 @@ def _try_parse_datetime(ser):
             res = res.dt.as_unit("ms")
         else:
             res = res.dt.round(freq="ms")
+
     return res
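To make the parsing behaviour above concrete: with pandas 2.x, ISO strings with a uniform offset parse to a tz-aware datetime64 column, while mixed offsets fall back to an object column of pd.Timestamp values (the case the removed utc=True fallback used to coerce away). A minimal sketch, assuming pandas 2.x semantics and the same `datetime_kwargs` as above:

```python
import warnings

import pandas as pd

mixed = pd.Series(["2022-01-01T10:00:00+01:00", "2022-06-01T10:00:00+02:00"])

with warnings.catch_warnings():
    # pandas 2.x warns that mixed timezones will raise in a future version
    warnings.simplefilter("ignore", FutureWarning)
    res = pd.to_datetime(mixed, format="ISO8601")

# Mixed offsets: object dtype holding pd.Timestamp values, offsets preserved
print(res.dtype)  # object

# Uniform offsets: a proper tz-aware datetime64 column
uniform = pd.Series(["2022-01-01T10:00:00+01:00", "2022-06-01T10:00:00+01:00"])
print(pd.to_datetime(uniform, format="ISO8601").dtype)  # datetime64[ns, UTC+01:00]
```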
@@ -257,11 +254,10 @@ def read_dataframe(
     read_func = read_arrow if use_arrow else read
     gdal_force_2d = False if use_arrow else force_2d
-    if not use_arrow:
-        # For arrow, datetimes are read as is.
-        # For numpy IO, datetimes are read as string values to preserve timezone info
-        # as numpy does not directly support timezones.
-        kwargs["datetime_as_string"] = True

+    # Always read datetimes as string values to preserve (mixed) timezone info
+    # as numpy does not directly support timezones and arrow datetime columns
+    # don't support mixed timezones.
     result = read_func(
         path_or_buffer,
         layer=layer,
@@ -278,6 +274,7 @@ def read_dataframe(
         sql=sql,
         sql_dialect=sql_dialect,
         return_fids=fid_as_index,
+        datetime_as_string=True,
         **kwargs,
     )
@@ -292,6 +289,11 @@ def read_dataframe(
         df = table.to_pandas(**kwargs)
         del table

+    # convert datetime columns that were read as string to datetime
+    for dtype, column in zip(meta["dtypes"], meta["fields"]):
+        if dtype is not None and dtype.startswith("datetime"):
+            df[column] = _try_parse_datetime(df[column])
+
     if fid_as_index:
         df = df.set_index(meta["fid_column"])
         df.index.names = ["fid"]
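From the user's perspective, the read path then behaves roughly as follows. A hypothetical sketch (the file and column names are made up):

```python
import pyogrio

df = pyogrio.read_dataframe("observations.gpkg")

# Naive or single-timezone datetime fields come back as a datetime64 column;
# a field with mixed UTC offsets is returned as an object column of
# pd.Timestamp values, so no offset information is lost.
print(df["recorded_at"].dtype)
```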
@@ -482,6 +484,8 @@ def write_dataframe(
     gdal_tz_offsets = {}
     for name in fields:
         col = df[name]
+        values = None
+
         if isinstance(col.dtype, pd.DatetimeTZDtype):
             # Deal with datetimes with timezones by passing down timezone separately
             # pass down naive datetime
@@ -496,8 +500,28 @@ def write_dataframe(
             # Convert each row offset to a signed multiple of 15m and add to GMT value
             gdal_offset_representation = tz_offset // pd.Timedelta("15m") + 100
             gdal_tz_offsets[name] = gdal_offset_representation.values
-        else:
+
+        elif col.dtype == "object":
+            # Column of Timestamp objects, also split into naive datetime and tz offset
+            col_na = df[col.notna()][name]
+            if len(col_na) and all(isinstance(x, pd.Timestamp) for x in col_na):
+                tz_offset = col.apply(lambda x: None if pd.isna(x) else x.utcoffset())
+                gdal_offset_repr = tz_offset // pd.Timedelta("15m") + 100
+                gdal_tz_offsets[name] = gdal_offset_repr.values
+                naive = col.apply(lambda x: None if pd.isna(x) else x.tz_localize(None))
+                values = naive.values
+            elif len(col_na) and all(isinstance(x, datetime) for x in col_na):
+                tz_offset = col.apply(lambda x: None if pd.isna(x) else x.utcoffset())
+                gdal_offset_repr = tz_offset // pd.Timedelta("15m") + 100
+                gdal_tz_offsets[name] = gdal_offset_repr.values
+                naive = col.apply(
+                    lambda x: None if pd.isna(x) else x.replace(tzinfo=None)
+                )
+                values = naive.values
+
+        if values is None:
+            values = col.values

         if isinstance(values, pd.api.extensions.ExtensionArray):
             from pandas.arrays import BooleanArray, FloatingArray, IntegerArray

Review comment:
I am a bit hesitant to add custom support for this, exactly given that it is not really supported by pandas itself; do we then need to add special support for it? Right now, if you have an object dtype column with Timestamp values, they already get written as strings, which in the end preserves the offset information (in the string representation).

Reply:
If they are written as strings to the output files without the proper metadata, it varies from format to format whether they will be recognized as datetimes when read. For text-based formats (e.g. GeoJSON) the data types are "guessed" when the file is read, so they will typically be recognized as datetimes; for formats like .fgb and .gpkg they won't be, as the file metadata will be wrong. That's not very clean, and as it is very easy to solve, I don't quite see the point of not supporting it properly.
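As a worked example of the tz offset encoding in this hunk: GDAL stores timezones as a count of 15-minute steps shifted so that 100 represents GMT, hence the `// pd.Timedelta("15m") + 100`:

```python
import pandas as pd

tz_offset = pd.Series([pd.Timedelta("2h"), pd.Timedelta("-5h"), pd.Timedelta(0)])

# Signed multiple of 15 minutes, shifted so that 100 == GMT:
# +02:00 -> 108, -05:00 -> 80, UTC -> 100
gdal_offset_representation = tz_offset // pd.Timedelta("15m") + 100
print(gdal_offset_representation.tolist())  # [108, 80, 100]
```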
@@ -620,8 +644,35 @@ def write_dataframe(
         df = pd.DataFrame(df, copy=False)
         df[geometry_column] = geometry

+    # Convert all datetime columns to isoformat strings, to avoid mixed timezone
+    # information getting lost.
+    datetime_cols = []
+    for name, dtype in df.dtypes.items():
+        col = df[name]
+        if dtype == "object":
+            # When all non-NA values are Timestamps, treat as datetime column
+            col_na = df[col.notna()][name]
+            if len(col_na) and all(
+                isinstance(x, (pd.Timestamp, datetime)) for x in col_na
+            ):
+                df[name] = col.apply(
+                    lambda x: None if pd.isna(x) else x.isoformat()
+                )
+                datetime_cols.append(name)
+        elif isinstance(dtype, pd.DatetimeTZDtype):
+            # Also for regular datetime columns with a timezone, mixed offsets are
+            # possible when there is a difference between summer and winter time.
+            df[name] = col.apply(lambda x: None if pd.isna(x) else x.isoformat())
+            datetime_cols.append(name)
Review comment:
Is this needed for properly typed datetime64 columns? FWIW, if we need to do this, you can do […]

Reply:
It is actually a bit weird. Based on what I tested, when the column is in the UTC time zone, the data is written correctly to the file. If the column has another time zone, it is simply dropped and the naive times are written. This is the case for both GPKG and e.g. GeoJSON. Hence, naive and UTC times can be written via a native arrow datetime column, but in the other cases the data needs to be written via a sidestep through a string column.

Reply:
With this change, offsets will be correctly written as timestamps, indeed because of the custom arrow metadata being added and interpreted by GDAL from GDAL 3.11 onwards (OSGeo/gdal#11213).

Reply:
Interesting! Note however that […]
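The string sidestep works because Timestamp.isoformat() keeps the UTC offset in the text, so nothing is lost before GDAL re-interprets the column as a datetime via the metadata added below. A small illustration:

```python
import pandas as pd

# Winter: UTC+01:00
print(pd.Timestamp("2023-01-15 10:00:00", tz="Europe/Brussels").isoformat())
# '2023-01-15T10:00:00+01:00'

# Summer: UTC+02:00, the mixed-offset case described in the comment above
print(pd.Timestamp("2023-07-15 10:00:00", tz="Europe/Brussels").isoformat())
# '2023-07-15T10:00:00+02:00'
```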
     table = pa.Table.from_pandas(df, preserve_index=False)

+    # Add metadata to datetime columns so GDAL knows they are datetimes.
+    for datetime_col in datetime_cols:
+        table = _add_column_metadata(
+            table, column_metadata={datetime_col: {"GDAL:OGR:type": "DateTime"}}
+        )
+
     if geometry_column is not None:
         # ensure that the geometry column is binary (for all-null geometries,
         # this could be a wrong type)
@@ -681,3 +732,48 @@ def write_dataframe(
         gdal_tz_offsets=gdal_tz_offsets,
         **kwargs,
     )
+
+
+def _add_column_metadata(table, column_metadata: dict = {}):
+    """Add or update column-level metadata to an arrow table.
+
+    Parameters
+    ----------
+    table : pyarrow.Table
+        The table to add the column metadata to.
+    column_metadata : dict
+        A dictionary with column metadata in the form
+        {
+            "column_1": {"some": "data"},
+            "column_2": {"more": "stuff"},
+        }
+
+    Returns
+    -------
+    pyarrow.Table: table with the updated column metadata.
+    """
+    import pyarrow as pa
+
+    if not column_metadata:
+        return table
+
+    # Create updated column fields with new metadata
+    fields = []
+    for col in table.schema.names:
+        if col in column_metadata:
+            # Add/update column metadata
+            metadata = table.field(col).metadata or {}
+            for key, value in column_metadata[col].items():
+                metadata[key] = value
+            # Update field with updated metadata
+            fields.append(table.field(col).with_metadata(metadata))
+        else:
+            fields.append(table.field(col))
+
+    # Create new schema with the updated field metadata
+    schema = pa.schema(fields, metadata=table.schema.metadata)
+
+    # Build new table with updated schema (shouldn't copy data)
+    table = table.cast(schema)
+
+    return table
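For illustration, calling this helper on a small table could look as follows (the column name is made up; the GDAL:OGR:type key matches the one used in write_dataframe above):

```python
import pyarrow as pa

table = pa.table({"when": ["2023-01-15T10:00:00+01:00"]})
table = _add_column_metadata(
    table, column_metadata={"when": {"GDAL:OGR:type": "DateTime"}}
)

# Field-level metadata that GDAL (>= 3.11) can pick up on write
print(table.field("when").metadata)  # {b'GDAL:OGR:type': b'DateTime'}
```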
Review comment:
From your top post explanation:

First, I don't think this will work with upcoming pandas 3.x: we are suppressing the warning above about mixed timezones going to raise unless utc=True is passed, and you would have to use apply and datetime.datetime.strptime instead to get mixed offset objects. (But the tests are also passing, so maybe I am missing something.)

Second, a column of mixed offset objects is in general not particularly useful, so changing this behaviour feels like a regression to me. I understand that we might want to give the user the option to get this, but I am not sure about it as the default.

Reply:
Yes, I saw. Do you know the rationale for pandas 3 forcing people to use a less efficient way (apply) to get at their data?

For starters, to be clear, this is only relevant for mixed timezone data. Data saved as naive or UTC timestamps simply stays in "regular" pandas datetime columns.

For mixed timezone data, it depends on what you want to do with it. If you just want to look at, show, or keep the datetimes as part of the table data, the Timestamp objects look just fine to me. If you really want to do "fancy stuff" with the datetimes, it will in pandas indeed be more convenient for some operations to transform them into e.g. UTC datetimes, to get a datetime column instead of an object column.

Regarding default behaviour, it feels quite odd to me to transform data by default into a form where information (the original time zone) is lost. Moreover, when the data is saved again it will then be saved as UTC, so the timezone information is lost there as well.

To me, the other way around is more logical: by default you don't lose data. If you want to do "fancy stuff" with a datetime column that contains mixed timezone data, you convert it to e.g. UTC, typically in an extra column, because most likely you will want to keep the timezone information again when saving.
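To make that trade-off concrete: a user who wants a single-dtype column can still opt in to the lossy conversion explicitly. A sketch with made-up data:

```python
import pandas as pd

# An object column of mixed-offset Timestamps, as the read path returns it
col = pd.Series([
    pd.Timestamp("2023-01-15T10:00:00+01:00"),
    pd.Timestamp("2023-07-15T10:00:00+02:00"),
])

# Explicit, opt-in conversion: a proper datetime64 column, but the original
# offsets are gone; that is why this is not done by default
utc_col = pd.to_datetime(col, utc=True)
print(utc_col.dtype)  # datetime64[ns, UTC]
```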