This issue was moved to a discussion.

Support for tagged union field types #882

Closed
khusmann opened this issue Feb 19, 2024 · 0 comments

khusmann commented Feb 19, 2024

(Adapted from my comment here at @peterdesmet's request!)

(Note that I would consider this proposal to be "low-priority" at the present moment, because it depends on the acceptance of the categorical field type in #875.)

Sometimes tabular data are produced in a "long" format that combines data of multiple different types into a single field / column. I see this form of data a lot in event-driven sensor data. For example:

| measurementType | measurementValue |
|-----------------|------------------|
| cloudiness      | partly cloudy    |
| cloudiness      | cloudy           |
| temperature     | 1                |
| wind force      | 5                |
| temperature     | 10               |

Where:

  • If measurementType = cloudiness then measurementValue:
    • type = categorical
    • categories = ["clear", "mostly clear", "partly cloudy", "mostly cloudy", "cloudy", "unknown"]
  • If measurementType = temperature then measurementValue:
    • type = number
    • constraints.min = 0
    • constraints.max = 20
  • If measurementType = wind force then measurementValue:
    • type = categorical
    • categories = [0, 1, 2, 3, 4, 5]

(Example adapted from @peterdesmet's work here)

Here, measurementValue is not a single type, but actually a union of three types: either a cloudiness measurement, a temperature measurement, or a wind force measurement, each with its own type definition and constraints. More specifically, this is a tagged union (aka discriminated union) compound type, where the type of measurementValue depends on the "tag" or "discriminator" found in measurementType.
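To make the idea concrete, here is a minimal Python sketch of the same three-way tagged union using only the standard library. The class names, the `tag` attribute, and `parse_measurement` are hypothetical names chosen for illustration; they are not part of any frictionless API.

```python
from dataclasses import dataclass
from typing import Literal, Union

CLOUDINESS_LEVELS = (
    "clear", "mostly clear", "partly cloudy",
    "mostly cloudy", "cloudy", "unknown",
)


# Each variant carries its own value type plus a fixed tag (the discriminator).
@dataclass(frozen=True)
class Cloudiness:
    value: str
    tag: Literal["cloudiness"] = "cloudiness"


@dataclass(frozen=True)
class Temperature:
    value: float
    tag: Literal["temperature"] = "temperature"


@dataclass(frozen=True)
class WindForce:
    value: int
    tag: Literal["wind force"] = "wind force"


# The union itself: a measurement is exactly one of the three variants.
Measurement = Union[Cloudiness, Temperature, WindForce]


def parse_measurement(measurement_type: str, raw: str) -> Measurement:
    """Pick the variant from the tag, then parse and check the raw value."""
    if measurement_type == "cloudiness":
        if raw not in CLOUDINESS_LEVELS:
            raise ValueError(f"{raw!r} is not a cloudiness category")
        return Cloudiness(raw)
    if measurement_type == "temperature":
        value = float(raw)
        if not 0 <= value <= 20:
            raise ValueError(f"temperature {value} is outside [0, 20]")
        return Temperature(value)
    if measurement_type == "wind force":
        force = int(raw)
        if force not in range(6):
            raise ValueError(f"wind force {force} is not in 0..5")
        return WindForce(force)
    raise ValueError(f"unknown measurementType {measurement_type!r}")
```

The same raw string "5" would be rejected as a cloudiness value but accepted as a wind force or a temperature, which is the point of the tag: the value alone does not determine its type.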

Tagged union types are a well-established, well-understood abstraction already implemented in many programming languages (e.g. Python, Rust) and in semantic data parsing / validation libraries (e.g. Python's pydantic and TypeScript's zod).

Implementing this behavior as a tagged union field type would allow implementations to validate this type of field by parsing its underlying types. Implementations could also perform exhaustiveness checks on the definition (ensuring that all levels in the categorical measurementType have corresponding type definitions). It would also make it easier for implementations to pivot the data into wider table formats, because the dependent type definitions would translate into the column types of the resulting wide columns.
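As a rough illustration of the pivoting point, the sketch below widens long-format rows so that each tag becomes its own typed column. It assumes a hypothetical `eventID` column that groups rows belonging to the same observation (the example table above has no such column), and the `PARSERS` mapping and `pivot_wide` function are illustrative names, not an existing implementation.

```python
# Hypothetical per-tag parsers: each tag determines the type of its wide column.
PARSERS = {
    "cloudiness": str,
    "temperature": float,
    "wind force": int,
}


def pivot_wide(rows, key="eventID"):
    """Turn long rows into one record per key, with one typed column per tag."""
    wide = {}
    for row in rows:
        record = wide.setdefault(row[key], {key: row[key]})
        tag = row["measurementType"]
        # The tag selects the parser, so every wide column ends up with one type.
        record[tag] = PARSERS[tag](row["measurementValue"])
    return list(wide.values())


long_rows = [
    {"eventID": "e1", "measurementType": "temperature", "measurementValue": "10"},
    {"eventID": "e1", "measurementType": "wind force", "measurementValue": "5"},
    {"eventID": "e2", "measurementType": "cloudiness", "measurementValue": "cloudy"},
]
wide_rows = pivot_wide(long_rows)
```

In this sketch `wide_rows` holds one record per `eventID`, with `temperature` as a float column, `wind force` as an integer column, and `cloudiness` as a string column.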

Here's an example of what a tagged union field type might look like in frictionless (using the proposed categorical syntax in #875):

{
  "fields": [
    {
      "name": "measurementType",
      "type": "categorical",
      "categories": ["cloudiness", "temperature", "wind force"]
    },
    {
      "name": "measurementValue",
      "type": "union",
      "tag": "measurementType",
      "match": {
        "cloudiness": {
          "type": "categorical",
          "categories": ["clear", "mostly clear", "partly cloudy", "mostly cloudy", "cloudy", "unknown"]
        },
        "temperature": {
          "type": "number",
          "constraints": {
            "min": 0,
            "max": 20
          }
        },
        "wind force": {
          "type": "categorical",
          "categories": [0, 1, 2, 3, 4, 5]
        }
      }
    }
  ]
}
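For a sense of how an implementation might consume such a descriptor, here is a minimal, hedged Python sketch of row-level validation. It assumes the proposed `tag` and `match` properties and the `min` / `max` constraint keys exactly as written in the example above; `check_value` and `validate_row` are hypothetical helpers, not functions from any frictionless library, and only the two field types used in the example are handled.

```python
UNION_FIELD = {
    "name": "measurementValue",
    "type": "union",
    "tag": "measurementType",
    "match": {
        "cloudiness": {
            "type": "categorical",
            "categories": ["clear", "mostly clear", "partly cloudy",
                           "mostly cloudy", "cloudy", "unknown"],
        },
        "temperature": {"type": "number", "constraints": {"min": 0, "max": 20}},
        "wind force": {"type": "categorical", "categories": [0, 1, 2, 3, 4, 5]},
    },
}


def check_value(field, raw):
    """Return an error message, or None when raw fits the field definition."""
    if field["type"] == "categorical":
        # Compare as strings, since cell values read from CSV are text.
        allowed = [str(category) for category in field["categories"]]
        return None if str(raw) in allowed else f"{raw!r} is not one of {allowed}"
    if field["type"] == "number":
        try:
            value = float(raw)
        except (TypeError, ValueError):
            return f"{raw!r} is not a number"
        constraints = field.get("constraints", {})
        if "min" in constraints and value < constraints["min"]:
            return f"{value} is below the minimum {constraints['min']}"
        if "max" in constraints and value > constraints["max"]:
            return f"{value} is above the maximum {constraints['max']}"
        return None
    return f"unsupported type {field['type']!r} in this sketch"


def validate_row(union_field, row):
    """Look up the variant selected by the tag column, then check the value."""
    tag_value = row[union_field["tag"]]
    variant = union_field["match"].get(tag_value)
    if variant is None:
        return f"no match entry for tag {tag_value!r}"
    return check_value(variant, row[union_field["name"]])
```

Calling `validate_row(UNION_FIELD, {"measurementType": "temperature", "measurementValue": "25"})` returns an error message in this sketch, because 25 exceeds the maximum declared for the temperature variant, while the same call with "10" returns `None`.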

Note that the field-level validation on this type would ensure that all the levels of the measurementType categorical field are represented as keys of the match property in the measurementValue field. For example, if temperature weren't defined as a key in the match property, this would trigger a validation error, because temperature is one of the levels of the measurementType field. As mentioned earlier, this kind of exhaustiveness check is a common feature of tagged union types.
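The exhaustiveness check described here could be sketched roughly as follows in Python. The function name `missing_match_keys` and its return shape are assumptions made for illustration; the sketch only relies on the `tag`, `match`, and `categories` properties from the proposed descriptor.

```python
def missing_match_keys(fields):
    """Map each union field name to the tag levels that have no match entry."""
    by_name = {field["name"]: field for field in fields}
    problems = {}
    for field in fields:
        if field.get("type") != "union":
            continue
        tag_field = by_name.get(field["tag"])
        if tag_field is None or tag_field.get("type") != "categorical":
            raise ValueError(
                f"tag {field['tag']!r} must name a categorical field"
            )
        # Every level of the tag field needs a corresponding key in match.
        missing = [
            str(level)
            for level in tag_field["categories"]
            if str(level) not in field["match"]
        ]
        if missing:
            problems[field["name"]] = missing
    return problems


# A descriptor where the temperature variant was forgotten.
incomplete_fields = [
    {
        "name": "measurementType",
        "type": "categorical",
        "categories": ["cloudiness", "temperature", "wind force"],
    },
    {
        "name": "measurementValue",
        "type": "union",
        "tag": "measurementType",
        "match": {
            "cloudiness": {"type": "categorical", "categories": ["clear", "cloudy"]},
            "wind force": {"type": "categorical", "categories": [0, 1, 2, 3, 4, 5]},
        },
    },
]
problems = missing_match_keys(incomplete_fields)
```

Here `problems` reports `temperature` as missing for `measurementValue`; an implementation would presumably surface that as a descriptor-level validation error before reading any data.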

If there is interest in this type, I can put together a more formal definition of the proposed union field's type signature (and RFC language).

@frictionlessdata frictionlessdata locked and limited conversation to collaborators Oct 21, 2024
@roll roll converted this issue into discussion #1026 Oct 21, 2024
