This repository has been archived by the owner on Oct 28, 2024. It is now read-only.

Data representation model (simplified with reduced amount of changes) #71

Closed
wants to merge 6 commits into from
18 changes: 9 additions & 9 deletions content/docs/specifications/glossary.md
@@ -40,9 +40,9 @@ It is recommended to cache profiles using their URL as a unique key.

The Data Package Standard uses a concept of a `descriptor` to represent metadata defined according to the core specifications such as Data Package or Table Schema.

On logical level, a descriptor is represented by a data structure. The data structure `MUST` be a JSON `object` as defined in [RFC 4627](http://www.ietf.org/rfc/rfc4627.txt).
In [Logical Representation](#logical-representation), a descriptor is a data structure. The data structure `MUST` be a JSON `object` as defined in [RFC 4627](http://www.ietf.org/rfc/rfc4627.txt).

On physical level, a descriptor is represented by a file. The file `MUST` contain a valid JSON `object` as defined in [RFC 4627](http://www.ietf.org/rfc/rfc4627.txt).
In [Physical Representation](#physical-representation), a descriptor is a file. The file `MUST` contain a valid JSON `object` as defined in [RFC 4627](http://www.ietf.org/rfc/rfc4627.txt).

This specification does not define any discoverability mechanisms. Any URI can be used to directly reference a file containing a descriptor.

@@ -120,20 +120,20 @@ In JSON, a table would be:

```json
[
{ "A": value, "B": value, ... },
{ "A": value, "B": value, ... },
{ "A": "value", "B": "value", ... },
{ "A": "value", "B": "value", ... },
...
]
```

### Data Representation

In order to talk about the representation and processing of tabular data from text-based sources, it is useful to introduce the concepts of the _physical_ and the _logical_ representation of data.
In order to talk about the representation and processing of tabular data from varios data sources, it is useful to introduce the concepts of the `physical` and the `logical` representation of data.

  1. "varios" -> "various"
  2. "...and processing of tabular data..." Is it necessary to specify "tabular" here?


The _physical representation_ of data refers to the representation of data as text on disk, for example, in a CSV or JSON file. This representation can have some _type_ information (JSON, where the primitive types that JSON supports can be used) or not (CSV, where all data is represented in string form).
#### Physical Representation

The _logical representation_ of data refers to the "ideal" representation of the data in terms of primitive types, data structures, and relations, all as defined by the specification. We could say that the specification is about the logical representation of data, as well as about ways in which to handle conversion of a physical representation to a logical one.
The physical representation of data refers to format-specific representation of data, for example, in a CSV or JSON file. This representation can have some type information (JSON, where the primitive types that JSON supports can be used) or not (CSV, where all data is represented in string form).

In this document, we'll explicitly refer to either the _physical_ or _logical_ representation in places where it prevents ambiguity for those engaging with the specification, especially implementors.
#### Logical Representation

For example, `constraints` `SHOULD` be tested on the logical representation of data, whereas a property like `missingValues` applies to the physical representation of the data.
The logical representation of data refers to the "ideal" representation of the data in terms of the Data Package standard's primitive types, data structures, and relations, all as defined by the Data Package's specifications. We could say that the standard is about the logical representation of data, as well as about ways in which to handle conversion of a physical representation to a logical one.
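
As a rough illustration of the distinction, the sketch below converts the physical (string) cells of a CSV-like row into logical values. The function name `to_logical_integer` and the sample row are hypothetical, and the default of `[""]` for missing values is an assumption of this sketch. The missing-value check is done on the raw string, before any casting.

```python
from typing import Optional, Sequence


def to_logical_integer(physical: str, missing_values: Sequence[str] = ("",)) -> Optional[int]:
    """Convert a physical string cell into a logical integer (hypothetical helper)."""
    # Missing values are compared as strings, prior to casting.
    if physical in missing_values:
        return None
    return int(physical)


# Physical representation: every CSV cell arrives as a string.
physical_row = {"id": "1", "score": "-"}

# Logical representation: typed values, with "-" treated as missing for "score".
logical_row = {
    "id": to_logical_integer(physical_row["id"]),
    "score": to_logical_integer(physical_row["score"], missing_values=("", "-")),
}
```

Constraints would then be tested against `logical_row`, not against the original strings.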
107 changes: 73 additions & 34 deletions content/docs/specifications/table-schema.md
@@ -101,7 +101,7 @@ If an `array` of `object`s is provided, each object `MUST` have a unique `value`
]
```

**Why strings**: `missingValues` are strings rather than being the data type of the particular field. This allows for comparison prior to casting and for fields to have missing values which are not of their type, for example a `number` field to have missing values indicated by `-`.
Note that `missingValues` are strings rather than being the data type of the particular field. This allows for comparison prior to casting and for fields to have missing values which are not of their type, for example a `number` field to have missing values indicated by `-`.

Examples:

@@ -452,7 +452,11 @@ Supported formats:

The field contains numbers of any kind including decimals.

The lexical formatting follows that of decimal in [XMLSchema](https://www.w3.org/TR/xmlschema-2/#decimal): a non-empty finite-length sequence of decimal digits separated by a period as a decimal indicator. An optional leading sign is allowed. If the sign is omitted, "+" is assumed. Leading and trailing zeroes are optional. If the fractional part is zero, the period and following zero(es) can be omitted. For example: '-1.23', '12678967.543233', '+100000.00', '210'.
**String Representation**

As strings, values `MUST` be represented following the rules below.

Formatting follows that of decimal in [XMLSchema](https://www.w3.org/TR/xmlschema-2/#decimal): a non-empty finite-length sequence of decimal digits separated by a period as a decimal indicator. An optional leading sign is allowed. If the sign is omitted, "+" is assumed. Leading and trailing zeroes are optional. If the fractional part is zero, the period and following zero(es) can be omitted. For example: '-1.23', '12678967.543233', '+100000.00', '210'.

The following special string values are permitted (case need not be respected):

@@ -464,118 +468,153 @@ A number `MAY` also have a trailing:

- exponent: this `MUST` consist of an E followed by an optional + or - sign followed by one or more decimal digits (0-9)

This lexical formatting `MAY` be modified using these additional properties:
Formatting `MAY` be modified using these additional properties:

- **decimalChar**: A string whose value is used to represent a decimal point within the number. The default value is ".".
- **groupChar**: A string whose value is used to group digits within the number. This property does not have a default value. A common value is "," e.g. "100,000".
- **bareNumber**: a boolean field with a default of `true`. If `true` the physical contents of this field `MUST` follow the formatting constraints already set out. If `false` the contents of this field may contain leading and/or trailing non-numeric characters (which implementors `MUST` therefore strip). The purpose of `bareNumber` is to allow publishers to publish numeric data that contains trailing characters such as percentages e.g. `95%` or leading characters such as currencies e.g. `€95` or `EUR 95`. Note that it is entirely up to implementors what, if anything, they do with stripped text.
- **bareNumber**: a boolean field with a default of `true`. If `true` the contents of this field `MUST` follow the formatting constraints already set out. If `false` the contents of this field may contain leading and/or trailing non-numeric characters (which implementors `MUST` therefore strip). The purpose of `bareNumber` is to allow publishers to publish numeric data that contains trailing characters such as percentages e.g. `95%` or leading characters such as currencies e.g. `€95` or `EUR 95`. Note that it is entirely up to implementors what, if anything, they do with stripped text.
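
One possible way to apply these three properties when reading a string value is sketched below. The function `parse_number` and its keyword arguments are hypothetical names, not part of any existing library, and the choice of `Decimal` as the logical type is an assumption of this sketch. Stripped text is simply discarded here.

```python
from decimal import Decimal
from typing import Optional

DIGITS = "0123456789"


def parse_number(
    text: str,
    decimal_char: str = ".",
    group_char: Optional[str] = None,
    bare_number: bool = True,
) -> Decimal:
    """Parse a string into a Decimal using number field properties (a sketch)."""
    value = text.strip()
    if not bare_number:
        # Strip leading characters that cannot start a number.
        allowed_start = DIGITS + "+-" + decimal_char
        start = 0
        while start < len(value) and value[start] not in allowed_start:
            start += 1
        # Strip trailing characters that are not digits.
        end = len(value)
        while end > start and value[end - 1] not in DIGITS:
            end -= 1
        value = value[start:end]
    # Remove grouping first so it cannot be confused with the decimal point.
    if group_char:
        value = value.replace(group_char, "")
    if decimal_char != ".":
        value = value.replace(decimal_char, ".")
    return Decimal(value)
```

For example, `parse_number("1.234,56", decimal_char=",", group_char=".")` handles a European-style value, and `parse_number("EUR 95", bare_number=False)` drops the currency prefix.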

### `integer`

The field contains integers - that is whole numbers.

Integer values are indicated in the standard way for any valid integer.
**String Representation**

As strings, values `MUST` be represented following the rules below.

This lexical formatting `MAY` be modified using these additional properties:
Integer values are indicated in the standard way for any valid integer. Formatting `MAY` be modified using these additional properties:

- **groupChar**: A string whose value is used to group digits within the integer. This property does not have a default value. A common value is "," e.g. "100,000".
- **bareNumber**: a boolean field with a default of `true`. If `true` the physical contents of this field `MUST` follow the formatting constraints already set out. If `false` the contents of this field may contain leading and/or trailing non-numeric characters (which implementors `MUST` therefore strip). The purpose of `bareNumber` is to allow publishers to publish numeric data that contains trailing characters such as percentages e.g. `95%` or leading characters such as currencies e.g. `€95` or `EUR 95`. Note that it is entirely up to implementors what, if anything, they do with stripped text.
- **bareNumber**: a boolean field with a default of `true`. If `true` the contents of this field `MUST` follow the formatting constraints already set out. If `false` the contents of this field may contain leading and/or trailing non-numeric characters (which implementors `MUST` therefore strip). The purpose of `bareNumber` is to allow publishers to publish numeric data that contains trailing characters such as percentages e.g. `95%` or leading characters such as currencies e.g. `€95` or `EUR 95`. Note that it is entirely up to implementors what, if anything, they do with stripped text.

### `boolean`

The field contains boolean (true/false) data.
The field contains boolean data i.e. logical `true` or logical `false`.

In the physical representations of data where boolean values are represented with strings, the values set in `trueValues` and `falseValues` are to be cast to their logical representation as booleans. `trueValues` and `falseValues` are arrays which can be customised to user need. The default values for these are in the additional properties section below.
**String Representation**

The boolean field can be customised with these additional properties:
As strings, values `MUST` be represented as defined by the `trueValues` and `falseValues` properties that can be customized to user need:

- **trueValues**: `[ "true", "True", "TRUE", "1" ]`
- **falseValues**: `[ "false", "False", "FALSE", "0" ]`
- **trueValues**: An array of strings to be interpreted as logical `true`. The default is `[ "true", "True", "TRUE", "1" ]`.
- **falseValues**: An array of strings to be interpreted as logical `false`. The default is `[ "false", "False", "FALSE", "0" ]`.
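
A minimal sketch of how a reader might apply these properties follows. The function name `parse_boolean` is hypothetical, and raising `ValueError` for unlisted strings is an assumption of this sketch rather than something the text prescribes.

```python
from typing import Sequence

DEFAULT_TRUE_VALUES = ("true", "True", "TRUE", "1")
DEFAULT_FALSE_VALUES = ("false", "False", "FALSE", "0")


def parse_boolean(
    text: str,
    true_values: Sequence[str] = DEFAULT_TRUE_VALUES,
    false_values: Sequence[str] = DEFAULT_FALSE_VALUES,
) -> bool:
    """Map a string onto logical true/false using trueValues and falseValues."""
    if text in true_values:
        return True
    if text in false_values:
        return False
    # Assumption: anything outside both lists is treated as an error.
    raise ValueError(f"{text!r} is not a recognised boolean value")
```

A publisher using `yes`/`no` would pass `true_values=["yes"]` and `false_values=["no"]`; the defaults then no longer apply.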

### `object`

The field contains a valid JSON object.

**String Representation**

As strings, values `MUST` be represented by valid serialized JSON objects.

### `array`

The field contains a valid JSON array.

**String Representation**

As strings, values `MUST` be represented by valid serialized JSON arrays.

### `list`

The field contains data that is an ordered one-level depth collection of primitive values with a fixed item type. In the lexical representation, the field `MUST` contain a string with values separated by a delimiter which is `,` (comma) by default e.g. `value1,value2`. In comparison to the `array` type, the `list` type is directly modelled on the concept of SQL typed collections.
The field contains data that is an ordered one-level depth collection of primitive values with a fixed item type. In comparison to the `array` type, the `list` type is directly modelled on the concept of SQL typed collections.

The list field can be customised with this additional property:

- **itemType**: specifies the list item type in terms of existing Table Schema types. If present, it `MUST` be one of `string`, `integer`, `boolean`, `number`, `datetime`, `date`, and `time`. If not present, the default is `string`. A data consumer `MUST` process list items as if they were individual values of the corresponding data type.

`format`: no options (other than the default).
**String Representation**

The list field can be customised with these additional properties:
As strings, values `MUST` be represented by strings with list items separated by a delimiter which is `,` (comma) by default e.g. `value1,value2`. The list items `MUST` be serialized using a default format of the corresponding `itemType`. The delimiter can be customised with this additional property:

- **delimiter**: specifies the character sequence which separates lexically represented list items. If not present, the default is `,` (comma).
- **itemType**: specifies the list item type in terms of existent Table Schema types. If present, it `MUST` be one of `string`, `integer`, `boolean`, `number`, `datetme`, `date`, and `time`. If not present, the default is `string`. A data consumer `MUST` process list items as it were individual values of the corresponding data type. Note, that on lexical level only default formats are supported, for example, for a list with `itemType` set to `date`, items have to be in default form for dates i.e. `yyyy-mm-dd`.
- **delimiter**: specifies the character sequence which separates list items. If not present, the default is `,` (comma).
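
The sketch below shows one way to read a list value from its string form. `parse_list` and the `CASTERS` table are hypothetical names, only three item types are covered to keep it short, and each item is cast using the default format of its type.

```python
from datetime import date
from typing import Any, Callable, Dict, List

# Casters for a subset of item types, each using the type's default format.
CASTERS: Dict[str, Callable[[str], Any]] = {
    "string": str,
    "integer": int,
    "date": date.fromisoformat,  # default date format is yyyy-mm-dd
}


def parse_list(text: str, item_type: str = "string", delimiter: str = ",") -> List[Any]:
    """Split a string on the delimiter and cast every item to the item type."""
    cast = CASTERS[item_type]
    return [cast(item) for item in text.split(delimiter)]
```

For instance, `parse_list("2024-01-26;2024-01-27", item_type="date", delimiter=";")` yields two `date` objects.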

### `datetime`

The field contains a date with a time.
The field contains a date with a time and an optional timezone.

Supported formats:
**String Representation**

As strings, values `MUST` be represented in one of the following formats:

- **default**: The lexical representation `MUST` be in a form defined by [XML Schema](https://www.w3.org/TR/xmlschema-2/#dateTime) containing required date and time parts, followed by optional milliseconds and timezone parts, for example, `2024-01-26T15:00:00` or `2024-01-26T15:00:00.300-05:00`.
- **default**: values `MUST` be in a form defined by [XML Schema](https://www.w3.org/TR/xmlschema-2/#dateTime) containing required date and time parts, followed by optional milliseconds and timezone parts, for example, `2024-01-26T15:00:00` or `2024-01-26T15:00:00.300-05:00`.
- **\<PATTERN\>**: values in this field can be parsed according to `<PATTERN>`. `<PATTERN>` `MUST` follow the syntax of [standard Python / C strptime](https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior). Values in this field `SHOULD` be parsable by Python / C standard `strptime` using `<PATTERN>`. For example, `"format": "%d/%m/%Y %H:%M:%S"` would correspond to a date with time like: `12/11/2018 09:15:32`.
- **any**: Any parsable representation of the value. The implementing library can attempt to parse the datetime via a range of strategies. An example is `dateutil.parser.parse` from the `python-dateutils` library. It is `NOT RECOMMENDED` to use `any` format as it might cause interoperability issues.
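
As an illustration, the default and `<PATTERN>` formats could be handled with the Python standard library as sketched below. The function name `parse_datetime` is hypothetical. Using `datetime.fromisoformat` for the default format is an approximation assumed for this sketch: it accepts the two examples above but is not a complete XML Schema `dateTime` parser. The `any` format is deliberately left unimplemented.

```python
from datetime import datetime


def parse_datetime(text: str, fmt: str = "default") -> datetime:
    """Parse a datetime string according to the field's format (a sketch)."""
    if fmt == "default":
        # Approximation of the XML Schema dateTime lexical form.
        return datetime.fromisoformat(text)
    if fmt == "any":
        # Library-specific guessing is out of scope for this sketch.
        raise NotImplementedError("the 'any' format is implementation-defined")
    # Otherwise the format is treated as a strptime pattern.
    return datetime.strptime(text, fmt)


with_pattern = parse_datetime("12/11/2018 09:15:32", "%d/%m/%Y %H:%M:%S")
with_timezone = parse_datetime("2024-01-26T15:00:00.300-05:00")
```

Here `with_pattern` is 12 November 2018, since the pattern places the day first, and `with_timezone` carries a UTC offset of minus five hours.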

### `date`

The field contains a date without a time.

Supported formats:
**String Representation**

- **default**: The lexical representation `MUST` be `yyyy-mm-dd` e.g. `2024-01-26`
As strings, values `MUST` be represented in one of the following formats:

- **default**: values `MUST` be `yyyy-mm-dd` e.g. `2024-01-26`
- **\<PATTERN\>**: The same as for `datetime`
- **any**: The same as for `datetime`

### `time`

The field contains a time without a date.

Supported formats:
**String Representation**

As strings, values `MUST` be represented in one of the following formats:

- **default**: The lexical representation `MUST` be `hh:mm:ss` e.g. `15:00:00`
- **default**: values `MUST` be `hh:mm:ss` e.g. `15:00:00`
- **\<PATTERN\>**: The same as for `datetime`
- **any**: The same as for `datetime`

### `year`

A calendar year as per [XMLSchema `gYear`](https://www.w3.org/TR/xmlschema-2/#gYear). Usual lexical representation is `YYYY`. There are no format options.
The field contains a calendar year.

**String Representation**

As strings, values `MUST` be represented as per [XMLSchema `gYear`](https://www.w3.org/TR/xmlschema-2/#gYear). Usual representation as a string is `YYYY`.

### `yearmonth`

A specific month in a specific year as per [XMLSchema `gYearMonth`](https://www.w3.org/TR/xmlschema-2/#gYearMonth). Usual lexical representation is: `YYYY-MM`. There are no format options.
The field contains a specific month in a specific year.

**String Representation**

As strings, values `MUST` be represented as per [XMLSchema `gYearMonth`](https://www.w3.org/TR/xmlschema-2/#gYearMonth). Usual representation as a string is `YYYY-MM`.

### `duration`

A duration of time.
The field contains a duration of time.

**String Representation**

We follow the definition of [XML Schema duration datatype](http://www.w3.org/TR/xmlschema-2/#duration) directly and that definition is implicitly inlined here.
As strings, values `MUST` be represented as per [XML Schema `duration`](http://www.w3.org/TR/xmlschema-2/#duration).

To summarize: the lexical representation for duration is the [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601#Durations) extended format PnYnMnDTnHnMnS, where nY represents the number of years, nM the number of months, nD the number of days, 'T' is the date/time separator, nH the number of hours, nM the number of minutes and nS the number of seconds. The number of seconds can include decimal digits to arbitrary precision. Date and time elements including their designator `MAY` be omitted if their value is zero, and lower order elements `MAY` also be omitted for reduced precision.
The duration `MUST` be in the [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601#Durations) extended format `PnYnMnDTnHnMnS`, where `nY` represents the number of years, `nM` the number of months, `nD` the number of days, `T` is the date/time separator, `nH` the number of hours, `nM` the number of minutes and `nS` the number of seconds. The number of seconds can include decimal digits to arbitrary precision. Date and time elements including their designator `MAY` be omitted if their value is zero, and lower order elements `MAY` also be omitted for reduced precision.
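
The `PnYnMnDTnHnMnS` form can be recognised with a regular expression, as in the sketch below. The name `parse_duration` and the choice to return a plain dictionary of components are assumptions made for illustration. Omitted elements default to zero, and seconds are kept as `Decimal` to preserve arbitrary precision. Negative durations are not handled.

```python
import re
from decimal import Decimal
from typing import Dict, Union

DURATION_PATTERN = re.compile(
    r"^P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def parse_duration(text: str) -> Dict[str, Union[int, Decimal]]:
    """Split an ISO 8601 extended duration into its components (a sketch)."""
    match = DURATION_PATTERN.match(text)
    if match is None:
        raise ValueError(f"{text!r} is not a valid duration")
    groups = match.groupdict()
    # Reject "P" on its own and a dangling "T" with no time elements.
    if all(value is None for value in groups.values()) or text.endswith("T"):
        raise ValueError(f"{text!r} has no duration elements")
    return {
        name: Decimal(value or "0") if name == "seconds" else int(value or 0)
        for name, value in groups.items()
    }
```

Note how the `T` separator disambiguates `M`: `P1M` is one month, while `PT1M` is one minute.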

### `geopoint`

The field contains data describing a geographic point.
The field contains data describing a geographic point i.e. `lon` and `lat` values that are floating point numbers.

Supported formats:
**String Representation**

As strings, values `MUST` be represented in one of the following formats:

- **default**: A string of the pattern "lon, lat", where each value is a number, and `lon` is the longitude and `lat` is the latitude (note the space is optional after the `,`). E.g. `"90.50, 45.50"`.
- **array**: A JSON array, or a string parsable as a JSON array, of exactly two items, where each item is a number, and the first item is `lon` and the second
item is `lat` e.g. `[90.50, 45.50]`
- **object**: A JSON object with exactly two keys, `lat` and `lon` and each value is a number e.g. `{"lon": 90.50, "lat": 45.50}`
- **object**: A JSON object with exactly two keys, `lon` and `lat` and each value is a number e.g. `{"lon": 90.50, "lat": 45.50}`
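
A sketch of reading all three formats into a `(lon, lat)` pair is given below. The function name `parse_geopoint` and the tuple return type are hypothetical choices for this example. For `array` and `object`, the input may be either an already-decoded JSON value or its serialized string.

```python
import json
from typing import Any, Tuple


def parse_geopoint(value: Any, fmt: str = "default") -> Tuple[float, float]:
    """Return (lon, lat) from a geopoint in one of the three formats."""
    if fmt == "default":
        # "lon, lat" where the space after the comma is optional.
        lon, lat = value.split(",")
    elif fmt == "array":
        items = json.loads(value) if isinstance(value, str) else value
        lon, lat = items  # raises ValueError unless exactly two items
    elif fmt == "object":
        obj = json.loads(value) if isinstance(value, str) else value
        if set(obj) != {"lon", "lat"}:
            raise ValueError("object must have exactly the keys 'lon' and 'lat'")
        lon, lat = obj["lon"], obj["lat"]
    else:
        raise ValueError(f"unknown geopoint format: {fmt!r}")
    return float(lon), float(lat)
```

All three of `"90.50, 45.50"`, `[90.50, 45.50]`, and `{"lon": 90.50, "lat": 45.50}` map to the same pair under their respective formats.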

### `geojson`

The field contains a JSON object according to GeoJSON or TopoJSON spec.
The field contains a JSON object according to GeoJSON or TopoJSON specifications.

Supported formats:

- **default**: A geojson object as per the [GeoJSON spec](http://geojson.org/).
- **topojson**: A topojson object as per the [TopoJSON spec](https://github.com/topojson/topojson-specification/blob/master/README.md)
- **topojson**: A topojson object as per the [TopoJSON spec](https://github.com/topojson/topojson-specification/blob/master/README.md).

**String Representation**

As strings, values `MUST` be represented by valid serialized JSON objects.

### `any`
