Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extension Types / User Defined Types #12644

Open
findepi opened this issue Sep 27, 2024 · 13 comments
Open

Extension Types / User Defined Types #12644

findepi opened this issue Sep 27, 2024 · 13 comments
Labels
enhancement New feature or request

Comments

@findepi
Copy link
Member

findepi commented Sep 27, 2024

Is your feature request related to a problem or challenge?

Currently DataFusion provides a lot of built-in types which are useful when building applications / query engines on top of DataFusion. However, even plethora of types is not enough. DataFusion doesn't have types existing in other systems, limiting DataFusion applicability as "LLVM for query engines"

For example, these types commonly found in other systems do not exist today

  • char(n)
  • varchar(n)
  • timestamp with time zone (a pair of "point in time" + "time zone" information; found in Oracle, Trino, Snowflake, etc.)
    • DataFusion currently uses Arrow DataType and the closest Arrow has is "timestamp(zone)" where each value is in same zone
  • timestamp with local time zone (point in time without zone information; found in Spark, Hive, PostgreSQL)
    • DataFusion currently uses Arrow DataType and the closest Arrow has is "timestamp(zone)" with eg UTC zone. however cast to varchar for "timestamp(UTC)" and for "timestamp with local time zone" should behave differently
  • time with time zone
  • JSON
    • DataFusion currently uses Arrow DataType and the closest Arrow has Utf8 potentially with some metadata information. Utf8 might be a perfect carrier type for JSON data, but "cast(json AS T)" and "cast(utf8 AS T)" are usually pretty different operations
  • VARIANT (Open Variant Type for semi-structured data #10987)
  • geospatial Geometry types (Spatial data support #7859)
  • HLL (hyperloglog), digests (t-digest, q-digest, other statistical digests)
  • extensions for applications building on top of DF; including user defined types (UDT) ([Proposal] Support User-Defined Types (UDT) #7923)
    • ability to provide user-defined types is even broader than ability to provide extension types ("rust-defined types")

Describe the solution you'd like

  1. Introduction of DataFusion own type system
  2. Introduction of extensions in DataFusion type system allowing applications building on DataFusion to provide more types
    • the extension types -- not unlike DataFusion built-in types -- need to use Arrow types as "carrier type" for transporting
    • the Arrow type metadata weaved into schema fields can be used to indicate use of extension types to the client, when data is returned to the user in Arrow form
    • for example, a "timestamp with time zone" type could be represented as Struct with two fields: point_in_time, time_zone
  3. Ability to dynamically find operations on types during function resolution or runtime
    • for example a CAST(array<T> AS varchar) needs to know how to do cast(T AS varchar). It cannot delegate this logic fully to Arrow, because Arrow won't have a notion of extension types.
      • eg if "timestamp with time zone" uses a Struct as a carrier type, it still needs to define its own cast(... AS varchar). It cannot use the default cast(struct AS varchar).

Describe alternatives you've considered

Everything is built-in

DataFusion could provide all types needed by applications building on top of DataFusion as built-in DataFusion types.
This would be easiest to implement, but could lead to scope-creep for the project. This could also lead to conflicts where types look the same but the desired behavior differs between applications building on top of DataFusion. For example Oracle's and Trino's "timestamp with time zone" can represent political zones while Snowflake's allows only fixed offsets.

No-op

Not providing extension types. This would limit DataFusion applicability.
DataFusion cannot be considered "LLVM for query engines" if it cannot serve as an engine, or potential engine, for existing popular query engines.

Additional context

The need to create extension types was raised in the [Proposal] Decouple logical from physical types

However introduction of DataFusion own types does not require introduction of extension types.
Extension types are complex enough (especially given their impact on functions) that they deserve their own roadmap issue.

The impact of extension types on functions, functions runtime and resolution is very clear, so this relates to Simple Functions initiative:

Having ExtensionType in arrow-rs would could the implementation simpler:

@findepi findepi added the enhancement New feature or request label Sep 27, 2024
@findepi
Copy link
Member Author

findepi commented Sep 27, 2024

@kylebarron
Copy link
Contributor

kylebarron commented Sep 27, 2024

I'm not very knowledgeable about DataFusion internals or database theory, so it's hard for me to provide feedback on the proposal, but I'm very excited about the prospect of extension types to enable spatial types (#7859). I've been collaborating on the GeoArrow spec, which defines Arrow extension types for spatial data. It's important to have additional logical types because the same physical layout can be interpreted in multiple logical ways (e.g. an array of LineString and MultiPoint), and to store coordinate reference system information (what physical locations on earth these numbers represent) as part of the type. I'm happy to provide more motivating examples if that would help!

@findepi
Copy link
Member Author

findepi commented Sep 28, 2024

FYI i touched upon the topic of types on DataFusion meetup in Belgrade yesterday.
The slides are here if anyone is interested: https://docs.google.com/presentation/d/1VW_JCGbN22lrGUOMRvUXGpAmlJopbG02hn_SDYJouiY . It was an attempt to summarize why we need both: simpler types (#11513), more types (#12644), and simple function "SDK" (#12635).
The document has comments disabled to avoid diverging the discussion from the github issue.

@alamb
Copy link
Contributor

alamb commented Sep 29, 2024

See a previous proposal from @yukkit : #7923

@paleolimbot
Copy link
Member

I'm interested in this as well; however, I'm new to DataFusion development and I am not sure I have a handle on what the barriers are here (e.g., Is support for this blocked by lack of support in arrow-rs? Is there opposition to this on principle or is it just a matter of development time?).

A few more (albeit geospatial) reasons to add support for this are (not-quite-merged!) support for spatial data in Iceberg and Parquet. As currently written, those proposals include type-level metadata (notably: the CRS), which would mean that reading Parquet files with those types would loose information. There are also other Arrow canonical extension types that have type-level metadata (which to my current reading, means that canonical extension types are not usable in DataFusion?).

@kylebarron
Copy link
Contributor

Is support for this blocked by lack of support in arrow-rs?

arrow-rs intentionally doesn't "directly support" extension types through DataType. The alternative (now archived) arrow2 implementation had an Extension variant, but this has the downside of requiring all code that needs to match the DataType enum to be aware of logical extension types. See apache/arrow-rs#4472 for more discussion.

Instead, arrow-rs supports extension types indirectly via the metadata hashmap on the Field accompanying some Array. So e.g. in geoarrow-rs, we have custom types like PointArray that internally manage both the array data and the field metadata, but when exporting to arrow objects, you'd need to carry the Array and the Field together.

I haven't closely followed the logical type work in progress in Datafusion but I assume it would be associating some Field with the physical type, and extension types could inject metadata there.

@tobixdev
Copy link
Contributor

tobixdev commented Jan 15, 2025

I second @paleolimbot here, as I am also interested in this topic.
First of all, thank you for all your efforts in this area!

To provide some context, we are currently creating a prototype that handles data that is encoded as a union with 10-ish different variants. Logically, this union represents the encoding of a single type. While we make ends meet by creating logical plans manually with specialized UDFs (e.g., for equality), we are in a very early phase. Only working with physical types will add lots of complexity to our code that can be eliminated with logical types (#12622) and its benefits (e.g., #12635) (I think™).

So what our (dream) scenario is that we could define this extension type (and its encoding) together with implementations for equality, orderings, etc, and DataFusion would automatically make use of these when joining or sorting (AFAIK, this should be supported according to #7923).

I'd like to support you in these efforts. However, while I read a few discussions and proposals, I am still a bit lost on where I can/should help out as this is (from what I see) a huge ongoing project across multiple issues and epics. Do you have pointers on where I could start helping out? For example, I found #13301, but I am unsure if the efforts on simple functions #12635 make these changes somewhat premature.

Thank you for helping me to navigate this project!

@jayzhan211
Copy link
Contributor

The current status is that we have several changes in branch logical-types for #12622. Where the Scalar is introduced and the next step is to complete the tasks left in #12622.

Remove ScalarValue::LargeUtf8/Utf8View/LargeBinary/BinaryView.

The change in logical-types is huge

@alamb
Copy link
Contributor

alamb commented Jan 22, 2025

@mbrobbel has a nice proposal to add extension types in arrow-rs (which would potentially help ExtensionTypes in DataFusion). Would apprecaite any feedback:

@alamb
Copy link
Contributor

alamb commented Jan 23, 2025

I started thinking about how the ExtensionType trait would be used in DataFusion. I think we would need to improve the APIs a bit to be in terms of Field rather than DataType. See this ticket for more info

@alamb
Copy link
Contributor

alamb commented Feb 2, 2025

Coming soon (arrow 54.2 in a month), support for Extension Types:

@alamb alamb changed the title Extension Types Extension Types / User Defined Types Feb 23, 2025
@alamb
Copy link
Contributor

alamb commented Feb 23, 2025

BTW DataFusion now has access to the arrow user defined types / is upgraded to 54.2 -- maybe now is a good time to start with this

@tobixdev has a usecase for specifying sort orders in #14828 -- maybe that would be a good thing to try to make work at first 🤔

@paleolimbot
Copy link
Member

Just a note to thank you for coordinating all of us on this effort! I am new to the code base and I don't have a great handle on exactly where to start, but I'm particularly interested in getting the extension type (or Field) propagated through to user-defined functions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants