Discourage usage of unnecessary union types in Table Schema #28

roll · 2024-02-05T14:49:36Z

fixes Discourage usage of unnecessary union types datapackage#873

Rationale

I think having only one way of communicating things in cases of primaryKey and foreignKeys will actually benefit both data publishers and data consumers as it reduces confusion. For example in v1 foreignKey.fields can be a single string which might be really confusing for data publishers. Of course, a stricter data model is usually better for data consumers

nichtich · 2024-02-05T20:41:46Z

As far as I understand this only is about union types in primary and foreign keys, not about union types in other parts of the schema. In this case I'd support the change!

roll · 2024-02-06T08:39:19Z

@nichtich
Yes. I did the analysis, and only pk/fk were unnecessary unions. Previously, on the v1 release, we cleaned others.

content/docs/specifications/table-schema.md

peterdesmet · 2024-02-07T10:18:41Z

While it is a cleaner implementation:

I don't understand the need for it. It's rather convenient for data publishers to allow string or array, cf. path.
It doesn't really help data consumers, since they MUST (indefinitely) support the union approach, because it would otherwise be a breaking change for existing data packages.

Currently hesitating between 👎 and 👀 for me.

roll · 2024-02-12T08:28:28Z

@peterdesmet
My opinion on this is that having different ways to define the same thing (resource.path is union because it's for defining different things) is actually not convenient to data publishers because if you learn something and see fields: 'name' in one example and fields: ['name'] in another example, it would rather confuse you than help (you need to check the docs to understand what's correct etc). Of course, it's only my perception based on kind of python's principle There should be one-- and preferably only one --obvious way to do it

peterdesmet · 2024-02-12T10:00:32Z

Yeah, I'm coming to this from R, where a character and a character vector (= array) are interchangeable, so it feels very natural. 😄

val <- "a"
val
#> [1] "a"
val <- c("a")
val
#> [1] "a"
val <- c("a", "b")
val
#> [1] "a" "b"

^{Created on 2024-02-12 with reprex v2.1.0}

nichtich · 2024-02-12T14:30:09Z

It makes sense, given that the property is already named fields (plural).

khusmann · 2024-02-19T22:17:35Z

I think it'd be good to distinguish between the different ways union types can be used in the spec. Broadly, I think there's two main categories: distinguishing between features (like @roll mentioned above), and syntactic sugar.

I think a key example of the former category (a union type distinguishing between features) is the schema field in TableResources. It's type is string | TableSchema. When it is a string, it is an indicator a schema needs to be loaded from an external source, as opposed to being defined inline.

Although I don't like the behavior in this particular example (as I mention in this comment), but I'm not against it in general. For example, in my comment on resource.name I suggest using a union types (number | string) to differentiate between references by index or by name.

I think "unnecessary" union types are cases in the latter category, that is, "union types for syntactic sugar". I think there are three main use cases I've seen:

Allowing single values to be specified on array properties (i.e. string | string[]), as with foreignKeys
Allowing physical values to be specified as number | string (e.g. allowing trueValues / falseValues to be any type
Allowing abbreviated definitions for common cases of complex & repetitive property definitions, like my suggestion in the categorical type proposal.

For me, keeping things consistent is paramount.

So, considering (1), if we want some arrays to be specified by a single value I think it should be the case for ALL arrays. Otherwise, every time I want to take the shortcut on a single-item array, I need to look up whether it's allowed.

Similarly, for (2), I think if we allow physical values to be specified as number | string, I think this should hold for wherever physical values are specified.

I lean towards NOT allowing (1) and (2), because it is more clear, explicit, and emphasizes the actual type of the field. It also avoids potentially ambiguity down the road if we wanted to use the same type signature for a "feature-distinguishing" union type as a way of adding a new feature. An example of this might be "named collections of missing values", where field-level missing values could be declared via a string referencing a named collection, rather than an array of physical values). This would not be possible if we were implementing (1).

I think syntactic sugar union types that fall into (3) should be evaluated on a case-by-case basis to weigh their convenience & clarity vs the complexity they introduce. Because they are specific special cases, I'm less worried about them interfering with anything else in the spec.

Co-authored-by: Peter Desmet <[email protected]>

cloudflare-workers-and-pages · 2024-02-20T09:39:48Z

Deploying with Cloudflare Pages

Latest commit:	`13c48ca`
Status:	✅ Deploy successful!
Preview URL:	https://32309e47.datapackage.pages.dev
Branch Preview URL:	https://873-reduce-union-types.datapackage.pages.dev

View logs

roll · 2024-02-20T09:48:13Z

@khusmann
Great write-up!

Currently field.foreignKeys.reference.fields says:

MUST have a property fields which is a string if the outer fields is a string, else an array of the same length as the outer fields, describing the field (or fields) references on the destination resource. The structure of the string or array is as per primaryKey above.

This wording is so complex and referential so by this PR I really think we simplify data publishers live rather than complicate it =)

roll · 2024-02-20T09:50:34Z

@ezwelty
Can you please take a look?

ezwelty · 2024-02-20T13:30:16Z

Of course software will have to continue supporting v1 quirks, but the quirks won't go away someday unless we start to deprecate them...

I've been nagged by this many times when writing code to process foreignKey.fields and primaryKey. My approach: pre-process them to ensure they contain lists of strings...

roll · 2024-02-21T08:52:30Z

@peterdesmet
@khusmann
So it's up to you to decide =)

peterdesmet · 2024-03-01T16:19:05Z

Was reminded of this issue when changing an AWS S3 permission (which are also defined in JSON). There you can mix arrays and strings too:

Two actions:

"Action": [
  "s3:GetObject",
  "s3:ListBucket"
]

One action:

"Action": [
  "s3:GetObject"
]

or

"Action": "s3:GetObject"

I quite like this approach (Data Package has it for e.g. path) and I think lower the burden for publishers if Data Package allowed this everywhere a an array of strings is allowed.

roll · 2024-03-11T10:22:54Z

@peterdesmet
I think it's also important to think of what is datapackage.json. Initially, it was designed as both:

configuration format (like in your AWS example) -- people write it by hands
serialization format -- machines write it and use it for communication

My hope that as tooling emerges datapackage.json eventually becomes only a serialization format as all the manipulation will be done on the models level in Open Data Editor, frictionless-py, frictionless-r, etc. In this case, I believe that the stricter the serialization format, the better. On the other hand, on the models level an implementation can provide whatever natural interface for the platform it has e.g. frictionless-r still can support both array and string (because it's a pattern for R) and e.g. frictionless-js won't (as it's an antipattern for TypeScript).

So I still suggest we accept if for simplicity on the specs level but honestly speaking, this PR doesn't even change anything because of backward-compat so we probably just need to make a call on it (accept or reject) and focus on more important topics 👍

peterdesmet · 2024-03-13T16:02:24Z

@roll

Approved to make it more strict for primaryKey and foreignKeys (the topic of this PR)
In general, hand-editing/simplicity is one of the features I like about datapackage.json, so would not use this PR as a way to forbid mixed types elsewhere (e.g. role)
Given that it is a (always) an array, should we rename primaryKey to primaryKeys?

roll · 2024-03-14T10:54:19Z

APPROVED by WG (6/9)

roll · 2024-03-14T10:57:35Z

Thanks @peterdesmet !

Given that it is a (always) an array, should we rename primaryKey to primaryKeys?

primaryKey is after the SQL terminology meaning it's a one optionally composite key (key === tuple)

So I think we now have quite consistent terminology:

primaryKey - one (composite) key
uniqueKeys - an array of (composite) keys
foregnKeys - an array of (composite) keys

PS.
Hand-authoring datapackage.json is of course always an option

roll added 3 commits February 5, 2024 14:33

Updated primary key

41c7e3c

Updated foreign keys

04b4a69

Minor fixes

b6a310c

roll mentioned this pull request Feb 5, 2024

Frictionless Standards (v2). Feb 5 - Feb 11 frictionlessdata/datapackage#877

Closed

peterdesmet suggested changes Feb 7, 2024

View reviewed changes

content/docs/specifications/table-schema.md Outdated Show resolved Hide resolved

roll mentioned this pull request Feb 12, 2024

Frictionless Standards (v2). Live Tracker frictionlessdata/datapackage#853

Closed

roll added the candidate label Feb 20, 2024

Fixed type string -> strings

13c48ca

Co-authored-by: Peter Desmet <[email protected]>

roll merged commit a1aacaa into main Mar 14, 2024
1 check passed

roll deleted the 873/reduce-union-types branch March 14, 2024 10:54

roll mentioned this pull request Mar 14, 2024

Frictionless Standards (v2). Mar 1 - Mar 31 frictionlessdata/datapackage#892

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discourage usage of unnecessary union types in Table Schema #28

Discourage usage of unnecessary union types in Table Schema #28

roll commented Feb 5, 2024

nichtich commented Feb 5, 2024

roll commented Feb 6, 2024

peterdesmet commented Feb 7, 2024 •

edited

Loading

roll commented Feb 12, 2024

peterdesmet commented Feb 12, 2024

nichtich commented Feb 12, 2024

khusmann commented Feb 19, 2024 •

edited

Loading

cloudflare-workers-and-pages bot commented Feb 20, 2024 •

edited

Loading

roll commented Feb 20, 2024

roll commented Feb 20, 2024

ezwelty commented Feb 20, 2024

roll commented Feb 21, 2024

peterdesmet commented Mar 1, 2024

roll commented Mar 11, 2024

peterdesmet commented Mar 13, 2024

roll commented Mar 14, 2024

roll commented Mar 14, 2024 •

edited

Loading

Discourage usage of unnecessary union types in Table Schema #28

Discourage usage of unnecessary union types in Table Schema #28

Conversation

roll commented Feb 5, 2024

Rationale

nichtich commented Feb 5, 2024

roll commented Feb 6, 2024

peterdesmet commented Feb 7, 2024 • edited Loading

roll commented Feb 12, 2024

peterdesmet commented Feb 12, 2024

nichtich commented Feb 12, 2024

khusmann commented Feb 19, 2024 • edited Loading

cloudflare-workers-and-pages bot commented Feb 20, 2024 • edited Loading

Deploying with Cloudflare Pages

roll commented Feb 20, 2024

roll commented Feb 20, 2024

ezwelty commented Feb 20, 2024

roll commented Feb 21, 2024

peterdesmet commented Mar 1, 2024

roll commented Mar 11, 2024

peterdesmet commented Mar 13, 2024

roll commented Mar 14, 2024

roll commented Mar 14, 2024 • edited Loading

peterdesmet commented Feb 7, 2024 •

edited

Loading

khusmann commented Feb 19, 2024 •

edited

Loading

cloudflare-workers-and-pages bot commented Feb 20, 2024 •

edited

Loading

roll commented Mar 14, 2024 •

edited

Loading