This repository has been archived by the owner on Oct 28, 2024. It is now read-only.

Add a categorical field type [field property version] #68

Merged: 9 commits, Jun 5, 2024
66 changes: 64 additions & 2 deletions content/docs/specifications/table-schema.md
@@ -127,9 +127,18 @@ A Table Schema descriptor `MAY` contain a property `fieldsMatch` that `MUST` be

Many datasets arrive with missing data values, either because a value was not collected or because it never existed. Missing values may be indicated simply by the value being empty; in other cases, a special value may have been used, e.g. `-`, `NaN`, `0`, `-9999`.

`missingValues` dictates which string values `MUST` be treated as `null` values. This conversion to `null` is done before any other attempted type-specific string conversion. The default value `[ "" ]` means that empty strings will be converted to null before any other processing takes place. Providing the empty list `[]` means that no conversion to null will be done, on any value.
`missingValues` dictates which values `SHOULD` be treated as missing values. Depending on implementation support for representing missing values, implementations `MAY` offer different ways of handling missingness when loading a field, including but not limited to: converting all missing values to null, loading missing values inline with a field's logical values, or loading the missing values for a field in a separate, additional column.

`missingValues` `MUST` be an `array` where each entry is a `string`.
`missingValues` `MUST` be an `array` where each entry is a `string`, or an `array` where each entry is an `object`.

If an `array` of `object`s is provided, each object `MUST` have a `value` property and `MAY` have a `label` property. The `value` property `MUST` be a `string` that matches the physical value of the field. The `label` property, when present, `MUST` be a `string` that provides a human-readable label for the missing value. For example:

```json
"missingValues": [
{ "value": "", "label": "OMITTED" },
{ "value": "-99", "label": "REFUSED" }
]
```
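A minimal sketch of how an implementation might accept both forms of `missingValues` and reduce them to a single lookup table. The helper name `normalize_missing_values` is hypothetical and not part of the specification, and rejecting arrays that mix strings and objects is an assumption based on the wording above.

```python
def normalize_missing_values(missing_values):
    """Map each missing-value string to its label (None when unlabeled).

    Hypothetical helper: accepts either a list of strings or a list of
    {"value": ..., "label": ...} objects, as described above.
    """
    # Form 1: plain strings, no labels available.
    if all(isinstance(entry, str) for entry in missing_values):
        return {entry: None for entry in missing_values}

    # Form 2: objects with a required string "value" and optional string "label".
    if all(isinstance(entry, dict) for entry in missing_values):
        normalized = {}
        for entry in missing_values:
            value = entry.get("value")
            label = entry.get("label")
            if not isinstance(value, str):
                raise ValueError("each object needs a string 'value'")
            if label is not None and not isinstance(label, str):
                raise ValueError("'label', when present, must be a string")
            normalized[value] = label
        return normalized

    # Assumption: a mixed array is treated as invalid.
    raise ValueError("missingValues must be all strings or all objects")
```

Reducing both forms to one mapping lets the rest of a loader test membership with `raw in normalized` regardless of which form the descriptor used.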

**Why strings**: `missingValues` are strings rather than being the data type of the particular field. This allows for comparison prior to casting and for fields to have missing values that are not of their type, for example a `number` field to have missing values indicated by `-`.

@@ -141,6 +150,8 @@ Examples:
```json
"missingValues": ["NaN", "-"]
```

When implementations choose to convert missing values to null, this conversion to `null` `MUST` be done before any other attempted type-specific string conversion. The default value `[ "" ]` means that empty strings will be converted to null before any other processing takes place. Providing the empty list `[]` means that no conversion to null will be done, on any value.
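The ordering described above (missing-value check first, type conversion second) can be sketched as follows for an implementation that chooses to convert missing values to null. The function `cast_integer_cell` is a hypothetical name used only for illustration.

```python
def cast_integer_cell(raw, missing_values=("",)):
    """Convert one physical string cell of an integer field.

    Hypothetical helper. The missing-value comparison is done on the raw
    string, before any type-specific conversion is attempted.
    """
    if raw in missing_values:
        return None  # treated as null; int() is never called on it
    return int(raw)


# With the default [""], an empty cell becomes null.
empty_cell = cast_integer_cell("")

# "-" is not a valid integer, but it can still be declared as missing.
dash_cell = cast_integer_cell("-", missing_values=["-"])

# With [] nothing is converted to null, so "-99" is cast like any other value.
plain_cell = cast_integer_cell("-99", missing_values=[])
```

Because the comparison happens on strings, a value such as `-99` is only treated as missing when it is listed; otherwise it is cast as an ordinary integer.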

#### `primaryKey`

A primary key is a field or set of fields that uniquely identifies each row in the table. Per SQL standards, the fields cannot be `null`, so their use in the primary key is equivalent to adding `required: true` to their [`constraints`](#constraints).
@@ -355,6 +366,57 @@ An example value for the field

See [Field Constraints](#field-constraints)

#### `categories` / `categoriesOrdered`

`string` and `integer` field types `MAY` include a `categories` property to indicate that the field contains categorical data, and the field `MAY` be loaded as a categorical data type if supported by the implementation. The `categories` property `MUST` be an array of values or an array of objects that define the levels of the categorical.

When the `categories` property is an array of values, the values `MUST` be unique and `MUST` match logical values of the field. For example:

> MUST match logical values of the field

This sounds like `categories` cannot contain a value that is not present in the data, but I believe we intend the reverse: the field cannot contain a value that is not in `categories`. It also seems that the uniqueness constraint should apply whether `categories` is an array of values or an array of objects.

Good points! Just made some clarifications in the latest commits. Let me know if it looks good or if you have other rephrasings I should try!


```json
{
"name": "fruit",
"type": "string",
"categories": ["apple", "orange", "banana"]
}
```

When the `categories` property is an array of objects, each object `MUST` have a `value` property and `MAY` have a `label` property. The `value` property `MUST` be a value that matches the logical value of the field when representing that level. The `label` property, when present, `MUST` be a `string` that provides a human-readable label for the level. For example, if the `integer` values `0`, `1`, `2` were used as codes to represent the levels `"apple"`, `"orange"`, and `"banana"` in the previous example, the `categories` property would be defined as follows:

```json
{
"name": "fruit",
"type": "integer",
"categories": [
{ "value": 0, "label": "apple" },
{ "value": 1, "label": "orange" },
{ "value": 2, "label": "banana" }
]
}
```
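As a rough illustration, both forms of `categories` can be reduced to a list of level values plus an optional label lookup, and field data can then be checked against the levels. The helper names below are hypothetical. Enforcing uniqueness for the object form, and treating the field as restricted to the listed levels, follow the reading suggested in the review discussion rather than a confirmed rule.

```python
def category_levels(categories):
    """Return (values, labels) for either form of `categories`.

    Hypothetical helper: `values` keeps the declared order, `labels` maps a
    level value to its label for entries that provide one.
    """
    values, labels = [], {}
    for entry in categories:
        if isinstance(entry, dict):
            value = entry["value"]
            if "label" in entry:
                labels[value] = entry["label"]
        else:
            value = entry
        # Assumption: uniqueness applies to both forms.
        if value in values:
            raise ValueError(f"duplicate category value: {value!r}")
        values.append(value)
    return values, labels


def out_of_category_values(cells, categories):
    """List logical values that are not declared levels (nulls are skipped)."""
    values, _ = category_levels(categories)
    return [cell for cell in cells if cell is not None and cell not in values]
```

For the coded `fruit` field above, `category_levels` would yield the values `0`, `1`, `2` with their labels, and a cell holding `3` would be reported by `out_of_category_values`.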

When the `categories` property is defined, it `MAY` be accompanied by a `categoriesOrdered` property in the field definition. When present, the `categoriesOrdered` property `MUST` be a boolean. When `categoriesOrdered` is true, implementations `SHOULD` interpret the order of the levels as defined in the `categories` property as the natural ordering of the levels. For example:

```json
{
"name": "agreementLevel",
"type": "integer",
"categories": [
{ "value": 1, "label": "Strongly Disagree" },
{ "value": 2 },
{ "value": 3 },
{ "value": 4 },
{ "value": 5, "label": "Strongly Agree" }
],
"categoriesOrdered": true
}
```

When the property `categoriesOrdered` is `false` or not present, implementations `SHOULD` assume that the levels of the categorical do not have a natural order.
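One way an implementation without a native ordered-categorical type might honor `categoriesOrdered` is to rank each level by its position in `categories`. This is a sketch only; `level_ranks` and the `size` field are hypothetical and do not appear in the specification.

```python
def level_ranks(field):
    """Map each level value to its position, or return None if unordered.

    Hypothetical helper: the order of `categories` is used as the natural
    ordering only when `categoriesOrdered` is true.
    """
    if not field.get("categoriesOrdered", False):
        return None
    values = [c["value"] if isinstance(c, dict) else c for c in field["categories"]]
    return {value: position for position, value in enumerate(values)}


# Hypothetical field whose declared order differs from alphabetical order.
size = {
    "name": "size",
    "type": "string",
    "categories": ["low", "medium", "high"],
    "categoriesOrdered": True,
}

ranks = level_ranks(size)
# Sort cells by declared level order rather than by string comparison.
sorted_cells = sorted(["high", "low", "medium"], key=ranks.__getitem__)
```

Returning `None` for unordered fields makes it explicit to callers that comparisons between levels are not meaningful in that case.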

Although the `categories` property restricts a field to a finite set of possible values, similar to an [`enum`](#enum) constraint, it also explicitly indicates that the field `MAY` be loaded as a categorical data type if supported by the implementation. By contrast, `enum` constraints restrict the values of a field but `SHOULD NOT` change the data type of the field when loaded.

`enum` constraints `MAY` be added to fields with the `categories` property, but when added, the values in the `enum` constraint `MUST` be a subset of the logical values defined in `categories`.
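The subset rule can be checked mechanically. The sketch below assumes the `enum` constraint lives under the field's `constraints` object, as elsewhere in Table Schema; the function name `enum_within_categories` is hypothetical.

```python
def enum_within_categories(field):
    """Return True when every `enum` value is a declared category level.

    Hypothetical validator. Fields lacking either `categories` or an `enum`
    constraint pass trivially, since the rule only applies when both exist.
    """
    enum = field.get("constraints", {}).get("enum")
    categories = field.get("categories")
    if enum is None or categories is None:
        return True
    levels = [c["value"] if isinstance(c, dict) else c for c in categories]
    return all(value in levels for value in enum)
```

A validator could run this once per field when the descriptor is loaded and report an error before any data is read.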

#### `missingValues`

A list of missing values for this field as per [Missing Values](#missingvalues) definition. If this property is defined, it takes precedence over the schema-level property and completely replaces it for the field without combining the values.
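The precedence rule amounts to a replace-not-merge lookup. A minimal sketch, assuming descriptors are plain dictionaries and using the hypothetical name `effective_missing_values`:

```python
def effective_missing_values(schema, field):
    """Pick the missingValues list that applies to one field.

    Hypothetical helper: a field-level list completely replaces the
    schema-level one; the two are never combined.
    """
    if "missingValues" in field:
        return field["missingValues"]
    # Fall back to the schema-level list, whose default is [""].
    return schema.get("missingValues", [""])
```

Note that a field declaring `"missingValues": []` opts out of the schema-level values entirely, because the empty list still counts as defined.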