Support to_timestamp with 2 arguments (timestamp format) #5398
Comments
Hi @alamb, do you think we should follow PostgreSQL's formatting syntax or chrono's (https://docs.rs/chrono/latest/chrono/format/strftime/index.html#specifiers)? In your example, the input contains a timezone:
❯ select to_timestamp('2014-05-11 01:07:34.779000+01:00', 'YYYY-MM-DD HH:MI:SS.NS');
Plan("Coercion from [Utf8, Utf8] to the signature Uniform(1, [Int64, Timestamp(Nanosecond, None), Timestamp(Microsecond, None), Timestamp(Millisecond, None), Timestamp(Second, None), Utf8]) failed.") postgresql's |
🤔 that is an excellent question. I think we should follow postgresql ideally though I realize that may be infeasible given we use chrono underneath
That is also an excellent question, @waitingkuo. I don't have an answer, but there is more discussion in #686 |
We changed |
Having recently implemented this in our proprietary engine, these were my notes regarding postgres, compared to other pattern syntaxes:
We ended up implementing a pattern syntax closer to the one supported by Java SimpleDateFormat, supporting a subset of those pattern characters. Similar syntax is supported by Presto/TrinoDB/Athena and so should be familiar to our users. My personal opinion was also that the percent-based syntax from MySQL or chrono looked a bit ugly and does not make the width or padding of fields obvious. Internally, however, we translated the user-facing syntax to chrono format items and used chrono::format::parse |
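As a rough illustration of the translation step described in the comment above, the sketch below maps a small subset of SimpleDateFormat-style fields to chrono strftime specifiers. The function name `to_strftime` and the supported subset are assumptions made for this example, not the engine's actual code. It also produces a strftime string rather than chrono format items, purely to keep the sketch dependency-free.

```rust
/// Translate a small subset of SimpleDateFormat-style pattern fields into a
/// chrono-style strftime string. Hypothetical helper for illustration only.
fn to_strftime(pattern: &str) -> Result<String, String> {
    let chars: Vec<char> = pattern.chars().collect();
    let mut out = String::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        // Measure the run of identical characters starting at position i.
        let mut j = i;
        while j < chars.len() && chars[j] == c {
            j += 1;
        }
        let run = j - i;
        if c.is_ascii_alphabetic() {
            // Letters are pattern fields; only a fixed-width subset is mapped.
            let spec = match (c, run) {
                ('y', 4) => "%Y", // four-digit year
                ('M', 2) => "%m", // month, zero-padded
                ('d', 2) => "%d", // day of month, zero-padded
                ('H', 2) => "%H", // hour (00-23)
                ('m', 2) => "%M", // minute
                ('s', 2) => "%S", // second
                _ => {
                    let field: String = std::iter::repeat(c).take(run).collect();
                    return Err(format!("unsupported pattern field: {}", field));
                }
            };
            out.push_str(spec);
        } else {
            // Everything else is a literal; '%' must be escaped for strftime.
            for _ in 0..run {
                if c == '%' {
                    out.push_str("%%");
                } else {
                    out.push(c);
                }
            }
        }
        i = j;
    }
    Ok(out)
}
```

For instance, `to_strftime("yyyy-MM-dd HH:mm:ss")` yields `%Y-%m-%d %H:%M:%S`, which could then be handed to a chrono parser such as `NaiveDateTime::parse_from_str`. Unsupported fields such as `EEE` are rejected rather than guessed at.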
I would really like to avoid a massive amount of custom postgres-compatible format string logic in DataFusion unless there is a super compelling use case for it. I think it will likely be a large amount of code / tests (though @jhorstmann can correct me if I am wrong). I would instead prefer to see the chrono format strings used directly (via |
When to_date/to_timestamp/etc is used from spark the format is only passingly similar to Java's SimpleDateFormat with enough differences to be annoyingly incompatible except for simple formats ('EEE' is unsupported without additional work, w/W, Y, O, etc). I would not generally recommend the Java format. |
Question on an implementation detail while implementing this: should this throw an Err if the timestamp cannot be parsed, either with the default behaviour or with any of the provided formats? The current behaviour seems to be to throw an Err, which mirrors postgresql, whereas Spark, for example, returns null for any unparseable string. I'm asking because with date/timestamp string parsing you are often dealing not with machine-generated data but with human-generated data, and there are only so many formats that can be tried before you must just set the value to null and move on. Not having that ability could be a hindrance for some use cases. One thought I had would be to change the behaviour via a flag that would be documented as part of the API and user guide. |
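To make the two behaviours in the question above concrete, here is a minimal sketch that tries several formats in order and then either fails the whole batch (postgres-like) or yields null for unparseable values (Spark-like). All names here (`Date`, `Parser`, `parse_strict`, `parse_lenient`, and the two toy parsers) are hypothetical stand-ins written for this example; they are not DataFusion or chrono APIs, and the toy parsers only handle plain dates.

```rust
// Stand-in types: a date as (year, month, day) and a parser for one format.
type Date = (i32, u32, u32);
type Parser = fn(&str) -> Option<Date>;

/// Split into exactly three unsigned integers, or None if the shape is wrong.
fn split3(s: &str, sep: char) -> Option<(u32, u32, u32)> {
    let mut parts = s.split(sep).map(|p| p.parse::<u32>().ok());
    let a = parts.next()??;
    let b = parts.next()??;
    let c = parts.next()??;
    if parts.next().is_some() {
        return None;
    }
    Some((a, b, c))
}

/// Very loose range check; a real parser would validate days per month.
fn valid(y: u32, m: u32, d: u32) -> Option<Date> {
    if (1..=12).contains(&m) && (1..=31).contains(&d) {
        Some((y as i32, m, d))
    } else {
        None
    }
}

/// Toy parser for "2014-05-11".
fn parse_dash_ymd(s: &str) -> Option<Date> {
    let (y, m, d) = split3(s, '-')?;
    valid(y, m, d)
}

/// Toy parser for "11/05/2014".
fn parse_slash_dmy(s: &str) -> Option<Date> {
    let (d, m, y) = split3(s, '/')?;
    valid(y, m, d)
}

/// Try each format in order and return the first successful parse.
fn first_match(s: &str, parsers: &[Parser]) -> Option<Date> {
    parsers.iter().find_map(|p| p(s))
}

/// Strict mode: the first unparseable value aborts the whole batch.
fn parse_strict(inputs: &[&str], parsers: &[Parser]) -> Result<Vec<Date>, String> {
    inputs
        .iter()
        .map(|s| first_match(s, parsers).ok_or_else(|| format!("cannot parse '{}'", s)))
        .collect()
}

/// Lenient mode: unparseable values become None (SQL NULL).
fn parse_lenient(inputs: &[&str], parsers: &[Parser]) -> Vec<Option<Date>> {
    inputs.iter().map(|s| first_match(s, parsers)).collect()
}
```

The only difference between the two modes is the return type: `Result<Vec<_>, _>` short-circuits on the first failure, while `Vec<Option<_>>` keeps going. A flag or a separate function would amount to choosing between these two shapes.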
I think we should keep the existing behavior
I agree a flag (or maybe another function) could be appropriate. The high-level observation is that different SQL implementations have different semantics (Decimal handling is another thing that Spark seems to do differently). Given DataFusion's built-in functions have only one version, we can't mirror both systems. My long-term hope / plan is to pull as many functions out of the core as possible (e.g. #8045) so that people can more easily customize the behavior. For example, we could have a |
We support this in arrow-rs - https://docs.rs/arrow-cast/latest/arrow_cast/cast/struct.CastOptions.html#structfield.safe |
Thanks for your feedback @alamb! Next question: As best as I can tell it's not possible to overload a function to both accept a single string as well as accept a list of strings (no varargs). This means to me at least that the choice is either a break in the dataframe api (from |
I think you could do it with |
* Support to_timestamp with chrono formatting #5398
* Updated user guide's to_timestamp to include chrono formatting information #5398
* Minor comment update.
* Small documentation updates for to_timestamp functions.
* Cargo fmt and clippy improvements.
* Switched to assert and unwrap_err based on feedback
* Fixed assert, code compiles and runs as expected now.
* Fix fmt (again).
* Add additional to_timestamp tests covering usage with tables with and without valid formats.
* to_timestamp documentation fixes.
* - Changed internal_err! -> exec_err! for unsupported data type errors.
  - Extracted out to_timestamp_impl method to reduce code duplication as per PR feedback.
  - Extracted out validate_to_timestamp_data_types to reduce code duplication as per PR feedback.
  - Added additional tests for argument validation and invalid arguments.
  - Removed unnecessary shim function 'string_to_timestamp_nanos_with_format_shim'
* Resolved merge conflict, updated toStringXXX methods to reflect upstream change
* prettier
* Fix clippy

Co-authored-by: Andrew Lamb <[email protected]>
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
DataFusion strives to be postgres compatible, so differences between postgres and DataFusion are confusing.
Today, one argument form of to_timestamp works in datafusion:
But the 2 argument form does not:
Describe the solution you'd like
Implement the two argument form of to_timestamp as described in https://www.postgresql.org/docs/current/functions-formatting.html
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Thank you to @cannonpalms and @idclark