Skip to content
This repository has been archived by the owner on Oct 28, 2024. It is now read-only.

Discourage usage of unnecessary union types in Table Schema #28

Merged
merged 4 commits into from
Mar 14, 2024

Conversation

roll
Copy link
Member

@roll roll commented Feb 5, 2024


Rationale

I think having only one way of communicating things in cases of primaryKey and foreignKeys will actually benefit both data publishers and data consumers as it reduces confusion. For example in v1 foreignKey.fields can be a single string which might be really confusing for data publishers. Of course, a stricter data model is usually better for data consumers

@nichtich
Copy link

nichtich commented Feb 5, 2024

As far as I understand this only is about union types in primary and foreign keys, not about union types in other parts of the schema. In this case I'd support the change!

@roll
Copy link
Member Author

roll commented Feb 6, 2024

@nichtich
Yes. I did the analysis, and only pk/fk were unnecessary unions. Previously, on the v1 release, we cleaned others.

@peterdesmet
Copy link
Member

peterdesmet commented Feb 7, 2024

While it is a cleaner implementation:

  1. I don't understand the need for it. It's rather convenient for data publishers to allow string or array, cf. path.
  2. It doesn't really help data consumers, since they MUST (indefinitely) support the union approach, because it would otherwise be a breaking change for existing data packages.

Currently hesitating between 👎 and 👀 for me.

@roll
Copy link
Member Author

roll commented Feb 12, 2024

@peterdesmet
My opinion on this is that having different ways to define the same thing (resource.path is union because it's for defining different things) is actually not convenient to data publishers because if you learn something and see fields: 'name' in one example and fields: ['name'] in another example, it would rather confuse you than help (you need to check the docs to understand what's correct etc). Of course, it's only my perception based on kind of python's principle There should be one-- and preferably only one --obvious way to do it

@peterdesmet
Copy link
Member

Yeah, I'm coming to this from R, where a character and a character vector (= array) are interchangeable, so it feels very natural. 😄

val <- "a"
val
#> [1] "a"
val <- c("a")
val
#> [1] "a"
val <- c("a", "b")
val
#> [1] "a" "b"

Created on 2024-02-12 with reprex v2.1.0

@nichtich
Copy link

It makes sense, given that the property is already named fields (plural).

@khusmann
Copy link
Contributor

khusmann commented Feb 19, 2024

I think it'd be good to distinguish between the different ways union types can be used in the spec. Broadly, I think there's two main categories: distinguishing between features (like @roll mentioned above), and syntactic sugar.

I think a key example of the former category (a union type distinguishing between features) is the schema field in TableResources. It's type is string | TableSchema. When it is a string, it is an indicator a schema needs to be loaded from an external source, as opposed to being defined inline.

Although I don't like the behavior in this particular example (as I mention in this comment), but I'm not against it in general. For example, in my comment on resource.name I suggest using a union types (number | string) to differentiate between references by index or by name.

I think "unnecessary" union types are cases in the latter category, that is, "union types for syntactic sugar". I think there are three main use cases I've seen:

  1. Allowing single values to be specified on array properties (i.e. string | string[]), as with foreignKeys

  2. Allowing physical values to be specified as number | string (e.g. allowing trueValues / falseValues to be any type

  3. Allowing abbreviated definitions for common cases of complex & repetitive property definitions, like my suggestion in the categorical type proposal.

For me, keeping things consistent is paramount.

So, considering (1), if we want some arrays to be specified by a single value I think it should be the case for ALL arrays. Otherwise, every time I want to take the shortcut on a single-item array, I need to look up whether it's allowed.

Similarly, for (2), I think if we allow physical values to be specified as number | string, I think this should hold for wherever physical values are specified.

I lean towards NOT allowing (1) and (2), because it is more clear, explicit, and emphasizes the actual type of the field. It also avoids potentially ambiguity down the road if we wanted to use the same type signature for a "feature-distinguishing" union type as a way of adding a new feature. An example of this might be "named collections of missing values", where field-level missing values could be declared via a string referencing a named collection, rather than an array of physical values). This would not be possible if we were implementing (1).

I think syntactic sugar union types that fall into (3) should be evaluated on a case-by-case basis to weigh their convenience & clarity vs the complexity they introduce. Because they are specific special cases, I'm less worried about them interfering with anything else in the spec.

@roll roll added the candidate label Feb 20, 2024
Co-authored-by: Peter Desmet <[email protected]>
Copy link

cloudflare-workers-and-pages bot commented Feb 20, 2024

Deploying with  Cloudflare Pages  Cloudflare Pages

Latest commit: 13c48ca
Status: ✅  Deploy successful!
Preview URL: https://32309e47.datapackage.pages.dev
Branch Preview URL: https://873-reduce-union-types.datapackage.pages.dev

View logs

@roll
Copy link
Member Author

roll commented Feb 20, 2024

@khusmann
Great write-up!

Currently field.foreignKeys.reference.fields says:

MUST have a property fields which is a string if the outer fields is a string, else an array of the same length as the outer fields, describing the field (or fields) references on the destination resource. The structure of the string or array is as per primaryKey above.

This wording is so complex and referential so by this PR I really think we simplify data publishers live rather than complicate it =)

@roll
Copy link
Member Author

roll commented Feb 20, 2024

@ezwelty
Can you please take a look?

@ezwelty
Copy link

ezwelty commented Feb 20, 2024

Of course software will have to continue supporting v1 quirks, but the quirks won't go away someday unless we start to deprecate them...

I've been nagged by this many times when writing code to process foreignKey.fields and primaryKey. My approach: pre-process them to ensure they contain lists of strings...

@roll
Copy link
Member Author

roll commented Feb 21, 2024

@peterdesmet
@khusmann
So it's up to you to decide =)

@peterdesmet
Copy link
Member

Was reminded of this issue when changing an AWS S3 permission (which are also defined in JSON). There you can mix arrays and strings too:

Two actions:

"Action": [
  "s3:GetObject",
  "s3:ListBucket"
]

One action:

"Action": [
  "s3:GetObject"
]

or

"Action": "s3:GetObject"

I quite like this approach (Data Package has it for e.g. path) and I think lower the burden for publishers if Data Package allowed this everywhere a an array of strings is allowed.

@roll
Copy link
Member Author

roll commented Mar 11, 2024

@peterdesmet
I think it's also important to think of what is datapackage.json. Initially, it was designed as both:

  • configuration format (like in your AWS example) -- people write it by hands
  • serialization format -- machines write it and use it for communication

My hope that as tooling emerges datapackage.json eventually becomes only a serialization format as all the manipulation will be done on the models level in Open Data Editor, frictionless-py, frictionless-r, etc. In this case, I believe that the stricter the serialization format, the better. On the other hand, on the models level an implementation can provide whatever natural interface for the platform it has e.g. frictionless-r still can support both array and string (because it's a pattern for R) and e.g. frictionless-js won't (as it's an antipattern for TypeScript).

So I still suggest we accept if for simplicity on the specs level but honestly speaking, this PR doesn't even change anything because of backward-compat so we probably just need to make a call on it (accept or reject) and focus on more important topics 👍

@peterdesmet
Copy link
Member

@roll

  1. Approved to make it more strict for primaryKey and foreignKeys (the topic of this PR)
  2. In general, hand-editing/simplicity is one of the features I like about datapackage.json, so would not use this PR as a way to forbid mixed types elsewhere (e.g. role)
  3. Given that it is a (always) an array, should we rename primaryKey to primaryKeys?

@roll
Copy link
Member Author

roll commented Mar 14, 2024

APPROVED by WG (6/9)

@roll roll merged commit a1aacaa into main Mar 14, 2024
1 check passed
@roll roll deleted the 873/reduce-union-types branch March 14, 2024 10:54
@roll
Copy link
Member Author

roll commented Mar 14, 2024

Thanks @peterdesmet !

Given that it is a (always) an array, should we rename primaryKey to primaryKeys?

primaryKey is after the SQL terminology meaning it's a one optionally composite key (key === tuple)

So I think we now have quite consistent terminology:

  • primaryKey - one (composite) key
  • uniqueKeys - an array of (composite) keys
  • foregnKeys - an array of (composite) keys

PS.
Hand-authoring datapackage.json is of course always an option

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Discourage usage of unnecessary union types
5 participants