diff --git a/content/docs/specifications/glossary.md b/content/docs/specifications/glossary.md
index 160ae192..25ce8d23 100644
--- a/content/docs/specifications/glossary.md
+++ b/content/docs/specifications/glossary.md
@@ -95,3 +95,45 @@ Example of a relative path that this will work both as a relative path on disk a
 :::caution[Security]
 `/` (absolute path) and `../` (relative parent path) are forbidden to avoid security vulnerabilities when implementing data package software. These limitations on resource `path` ensure that resource paths only point to files within the data package directory and its subdirectories. This prevents data package software being exploited by a malicious user to gain unintended access to sensitive information. For example, suppose a data package hosting service stores packages on disk and allows access via an API. A malicious user uploads a data package with a resource path like `/etc/passwd`. The user then requests the data for that resource and the server naively opens `/etc/passwd` and returns that data to the caller.
 :::
+
+### Tabular Data
+
+Tabular data consists of a set of rows. Each row has a set of fields (columns). We usually expect that each row has the same set of fields and thus we can talk about _the_ fields for the table as a whole.
+
+In case of tables in spreadsheets or CSV files we often interpret the first row as a header row, giving the names of the fields. By contrast, in other situations, e.g. tables in SQL databases, the field names are explicitly designated.
+
+To illustrate, here's a classic spreadsheet table:
+
+```text
+field     field
+  |         |
+  |         |
+  V         V
+
+ A     |    B    |    C    |    D      <--- Row (Header)
+ ------------------------------------
+ valA  |   valB  |  valC   |   valD    <--- Row
+ ...
+```
+
+In JSON, a table would be:
+
+```json
+[
+  { "A": value, "B": value, ... },
+  { "A": value, "B": value, ... },
+  ...
+]
+```
+
+### Data Representation
+
+In order to talk about the representation and processing of tabular data from text-based sources, it is useful to introduce the concepts of the _physical_ and the _logical_ representation of data.
+
+The _physical representation_ of data refers to the representation of data as text on disk, for example, in a CSV or JSON file. This representation can have some _type_ information (JSON, where the primitive types that JSON supports can be used) or not (CSV, where all data is represented in string form).
+
+The _logical representation_ of data refers to the "ideal" representation of the data in terms of primitive types, data structures, and relations, all as defined by the specification. We could say that the specification is about the logical representation of data, as well as about ways in which to handle conversion of a physical representation to a logical one.
+
+In this document, we'll explicitly refer to either the _physical_ or _logical_ representation in places where it prevents ambiguity for those engaging with the specification, especially implementors.
+
+For example, `constraints` `SHOULD` be tested on the logical representation of data, whereas a property like `missingValues` applies to the physical representation of the data.
diff --git a/content/docs/specifications/table-dialect.md b/content/docs/specifications/table-dialect.md
index e77abd3c..3c3d2b24 100644
--- a/content/docs/specifications/table-dialect.md
+++ b/content/docs/specifications/table-dialect.md
@@ -23,7 +23,7 @@ The key words `MUST`, `MUST NOT`, `REQUIRED`, `SHALL`, `SHALL NOT`, `SHOULD`, `S
 
 ## Introduction
 
-Table Dialect defines set of properties that can be used by data producers and data consumers to ensure data interoperability in various tabular data formats such as CSV, JSON, or Excel. The main goal of this specification is to define a common language for defining tabular data dialects. It is not expected that all the properties are supported by all the Data Package implementations. An implementation `MUST` choose the most suitable strategy for communicating to the users if some relevant feature is not supported.
+Table Dialect defines set of properties that can be used by data producers and data consumers to ensure data interoperability in various [Tabular Data](../glossary/#tabular-data) formats such as CSV, JSON, or Excel. The main goal of this specification is to define a common language for defining tabular data dialects. It is not expected that all the properties are supported by all the Data Package implementations. An implementation `MUST` choose the most suitable strategy for communicating to the users if some relevant feature is not supported.
 
 Table Dialect is useful for programmes which might have to deal with multiple dialects of tabular files, but which can rely on being told out-of-band which dialect will be used in a given input stream. This reduces the need for heuristic inference of dialects, and simplifies the implementation of readers, which must juggle dialect inference, schema inference, unseekable input streams, character encoding issues, and the lazy reading of very large input streams.
 
diff --git a/content/docs/specifications/table-schema.md b/content/docs/specifications/table-schema.md
index c5bfc555..c671a5e6 100644
--- a/content/docs/specifications/table-schema.md
+++ b/content/docs/specifications/table-schema.md
@@ -25,49 +25,10 @@ The key words `MUST`, `MUST NOT`, `REQUIRED`, `SHALL`, `SHALL NOT`, `SHOULD`, `S
 
 Table Schema is a simple language- and implementation-agnostic way to declare a schema for tabular data. Table Schema is well suited for use cases around handling and validating tabular data in text formats such as CSV, but its utility extends well beyond this core usage, towards a range of applications where data benefits from a portable schema format.
 
-## Concepts
+This specification heavily relies on the following concepts:
 
-### Tabular Data
-
-Tabular data consists of a set of rows. Each row has a set of fields (columns). We usually expect that each row has the same set of fields and thus we can talk about _the_ fields for the table as a whole.
-
-In case of tables in spreadsheets or CSV files we often interpret the first row as a header row, giving the names of the fields. By contrast, in other situations, e.g. tables in SQL databases, the field names are explicitly designated.
-
-To illustrate, here's a classic spreadsheet table:
-
-```text
-field     field
-  |         |
-  |         |
-  V         V
-
- A     |    B    |    C    |    D      <--- Row (Header)
- ------------------------------------
- valA  |   valB  |  valC   |   valD    <--- Row
- ...
-```
-
-In JSON, a table would be:
-
-```json
-[
-  { "A": value, "B": value, ... },
-  { "A": value, "B": value, ... },
-  ...
-]
-```
-
-### Data Representation
-
-In order to talk about the representation and processing of tabular data from text-based sources, it is useful to introduce the concepts of the _physical_ and the _logical_ representation of data.
-
-The _physical representation_ of data refers to the representation of data as text on disk, for example, in a CSV or JSON file. This representation can have some _type_ information (JSON, where the primitive types that JSON supports can be used) or not (CSV, where all data is represented in string form).
-
-The _logical representation_ of data refers to the "ideal" representation of the data in terms of primitive types, data structures, and relations, all as defined by the specification. We could say that the specification is about the logical representation of data, as well as about ways in which to handle conversion of a physical representation to a logical one.
-
-In this document, we'll explicitly refer to either the _physical_ or _logical_ representation in places where it prevents ambiguity for those engaging with the specification, especially implementors.
-
-For example, `constraints` `SHOULD` be tested on the logical representation of data, whereas a property like `missingValues` applies to the physical representation of the data.
+- [Tabular Data](../glossary/#tabular-data)
+- [Data Representation](../glossary/#data-representation)
 
 ## Descriptor
 