
Clarify physical/logical representation in Table Schema #864

Closed · roll opened this issue Jan 3, 2024 · 28 comments

@roll (Member) commented Jan 3, 2024

Overview

This paragraph - https://datapackage.org/specifications/table-schema/#physical-and-logical-representation

I think the term "physical" might be confusing (see #621), as it really seems to mean "lexical" or "textual". "Logical", on the other hand, sounds easy to understand in my opinion, although alternatives might still be worth brainstorming.



@nichtich (Contributor) commented Jan 4, 2024

The distinction between physical representation and logical representation is known under many names, e.g. lexical space vs. value space in XSD. Any name may be confusing without explanation. The current form is ok but it might be better to switch to other names. In this case I'd also change "representation" because "representation of data" is confusing as well. My current best suggestion is to use lexical value and logical value instead of physical representation and logical representation.

The current spec also uses "physical contents", this should be changed as well.

@roll (Member, Author) commented Jan 4, 2024

Thanks @nichtich, I agree

I think confusion might currently occur because "physical" is defined as being textual:

The physical representation of data refers to the representation of data as text on disk

Although, in general, I guess for the majority of people "physical" in the context of data storage means something different.

@roll (Member, Author) commented Jan 5, 2024

BTW, lexical is actually already used in the spec - https://datapackage.org/specifications/table-schema/#number

The lexical formatting follows that of decimal in XMLSchema

This sentence, I think, is very easy to understand, so I guess lexical is a good choice.

@khusmann (Contributor) commented Jan 10, 2024

Hmm, I think a danger of replacing physical with lexical or textual here is that a given logical value can have many different lexical / textual representations... The textual representation of a date is an easy example. What we want to refer to here is specifically the particular lexical/textual form stored in the actual source data file.
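As a minimal illustration (plain Python, not frictionless-py code), several lexical forms can decode to the same logical date:

from datetime import date, datetime

# Each lexical/textual form below is a different way to write the same
# logical value, date(2024, 1, 10); the format strings are illustrative.
lexical_forms = {
    "2024-01-10": "%Y-%m-%d",
    "10/01/2024": "%d/%m/%Y",
    "Jan 10, 2024": "%b %d, %Y",
    "20240110": "%Y%m%d",
}

logical_values = {
    text: datetime.strptime(text, fmt).date()
    for text, fmt in lexical_forms.items()
}

assert set(logical_values.values()) == {date(2024, 1, 10)}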

So I actually prefer the current term physical here for that reason, provided we repeatedly emphasize that physical here implies textual as @roll noted.

Although reading through the standards again I'm also now realizing that's not quite the case because we're allowing type info to be associated with JSON source data... so it's actually not purely textual/lexical in a strict sense, which complicates things. Does this mean we throw an error or warn if a numeric field finds numeric values as strings (e.g. "0", "1", "2") in JSON source data? What if a string field schema gets numeric values? etc.

It'd simplify these cases if all "raw" data was just guaranteed to be parsed by the field schema as pure lexical/textual/string, and field props referencing physical values always used strings. If we're including / allowing type info other than string to come from the underlying source data representation, I may reconsider my position on #621, because it makes a case for allowing props that reference physical values to be any JSON type.

In the spirit of brainstorming to get more ideas flowing:

Other possible terms for physical, lexical, textual value: raw value, source value, underlying value...

Other possible terms for logical value: typed value, parsed value, conceptual value, ... (I actually like the term conceptual value quite a bit; logical has always sounded like a boolean to me...)

@nichtich (Contributor) commented:

Ok, the issue needs more than renaming. The whole section on Concepts needs to be rewritten to better clarify what is meant by "tabular data", because we also have two levels of description:

  • the physical table, where every cell is an untyped string
  • the logical table with typed cell values

There are "raw" tabular data formats (TSV/CSV) and there are tabular data formats with typed values (Excel, SQL, JSON, Parquet... limited to non-nested cells...). I'd say a Table Schema only refers to the former. A SQL Table can be converted to a raw table (just export as CSV) plus a Table Schema (inferred from the SQL Table definition) but SQL Tables are not directly described by Table Schema, nor is any JSON data as wrongly exeplified in the current specification.

@khusmann (Contributor) commented:

There are "raw" tabular data formats (TSV/CSV) and there are tabular data formats with typed values (Excel, SQL, JSON, Parquet... limited to non-nested cells...). I'd say a Table Schema only refers to the former.

Agreed!

Perhaps it would clear some of the confusion if we renamed "Table Schema" to "Textual Table Schema" or "Delimited Table Schema" to reflect that the schema definition is specifically designed for textual data.

It would also pave the way for future frictionless table schema standards for other types of physical data, e.g. "JSON Table schema", "Excel Table Schema", "SQL Table Schema", which would be designed around the particularities of the types found in those formats.

In that case, we'd have:

  • The physical values of Textual Table Schemas are all strings
  • The physical values of JSON Table Schemas are all JSON data types
  • The physical values of Excel Table Schemas are all Excel data types
  • etc.

As you say, it's much easier to think about conversions between formats than about type coercions when we try to use a textual table schema to parse an Excel file, for example. The latter has a lot of potential complexity / ambiguity.

@roll (Member, Author) commented Jan 25, 2024

Although reading through the standards again I'm also now realizing that's not quite the case because we're allowing type info to be associated with JSON source data... so it's actually not purely textual/lexical in a strict sense, which complicates things. Does this mean we throw an error or warn if a numeric field finds numeric values as strings (e.g. "0", "1", "2") in JSON source data? What if a string field schema gets numeric values? etc.

In frictionless-py:

  • strings in numeric fields will be parsed (no error)
  • numbers in string fields won't be coerced (error)
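A rough sketch of that behaviour (illustrative Python only, not the actual frictionless-py code):

# Strings arriving in a numeric field are parsed; non-string values arriving
# in a string field are rejected rather than coerced.
def read_integer_cell(value):
    if isinstance(value, str):
        return int(value)        # lexical value: parse it
    if isinstance(value, int):
        return value             # already logical: pass it through
    raise TypeError(f"cannot read {value!r} as an integer")

def read_string_cell(value):
    if isinstance(value, str):
        return value
    raise TypeError(f"{value!r} is not a string")  # numbers are not coerced

print(read_integer_cell("1"))  # -> 1
print(read_string_cell("a"))   # -> 'a'
# read_string_cell(1) raises TypeError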

@roll (Member, Author) commented Jan 25, 2024

The conversation is happening here so I'm adding @pwalsh's comment:

@nichtich @roll the original terminology seems pretty standard, eg

https://aws.amazon.com/compare/the-difference-between-logical-and-physical-data-model/

https://www.gooddata.com/blog/physical-vs-logical-data-model/

Whereas I have never come across using "lexical" to represent what is called "physical" in the current terminology.

I read #864 but honestly physical vs logical seems the most common terminology for describing this and I am not sure I see a good reason to change it.

@roll (Member, Author) commented Jan 25, 2024

First of all, I probably did not understand it correctly, but I never thought about physical and logical in the terms described here - https://www.gooddata.com/blog/physical-vs-logical-data-model/. I was thinking that in the case of Table Schema we're basically talking about a data source (like 1010101 on disk, or so-called text in CSV) and a data target (native programming types, as in Python and SQL).

So my understanding is that every tabular data resource has a physical data representation (in my understanding of this term). On current computers, it's always just binary data that can be decoded to text in the CSV case, or read "somehow" in the case of a non-textual format, e.g. Parquet. For every format there is a corresponding reader that converts that physical representation to a logical representation (e.g. a pandas dataframe from a CSV or Parquet file).

I think here it's important to note that Table Schema implementors never deal with any physical data representation (again, based on my understanding of this term). Table Schema doesn't declare rules for CSV parsers or Parquet readers. In my opinion, Table Schema actually declares only post-processing rules for data that is already in its logical form (read by native readers).

Physical Data -> [ native reader ] -> Source Logical Data -> [ table schema processor ] -> Target Logical Data

For example, for this JSON cell 2000-01-01:

  • physical data -- binary
  • source logical data -- string
  • target logical data -- date (the point where Table Schema adds its value)

Another note: from an implementor's perspective, as said, we only have access to Source Logical Data. It means that the only differentiable parameter for a data value is its source logical data type. For example, a Table Schema implementation can parse the 2000-01-01 string for a date field because it knows the input logical type and the desired logical type. There is no access to the underlying physical representation to get more information about this value. We only see that the input is a string. For example, frictionless-py differentiates all input values into two groups: strings, which are processed, and non-strings, which are forwarded as-is.

So to me it feels that Table Schema's level of abstraction is to provide rules for processing "untyped" string values (lexical representation), and that's basically the only thing this spec really can define, while low-level reading can't really be covered. So my point is not that physical is a wrong term, but that we really need to describe parsing lexical values, e.g. for dates or missing values, rather than talk about physical representation.

cc @peterdesmet

@akariv (Member) commented Jan 25, 2024 via email

@roll (Member, Author) commented Feb 21, 2024

Hi @nichtich,

Are you interested in working on the updated version of #17 that incorporates comments from this issue?

After working closely with the specs last month and refreshing my memory of implementation details from frictionless-py, I came to the conclusion that we actually don't have a very complex problem here.

For example, for a JSON data file like this:

[
  ["id", "date"],
  [1, "2012-01-01"]
]

We have:

  • [physical] UTF-8 encoded bytes representing the text above. We send these bytes to a JSON parser
  • [native: logical + lexical] As an output, we get a "data stream" of cells with types native to the format, BUT for Table Schema, some of them are still represented lexically and require additional processing. So for the example above, 1 is already a logical value, and 2012-01-01 is still a lexical value
  • [logical] After full Table Schema processing, we get a "data stream" of cells in fully logical form in Table Schema terms i.e. 1 and Date('2012-01-01')

I think this tiering is applicable to basically any input data source, from CSV to Parquet or SQL.

I guess we need to rename the section to something like Data Processing and mention this workflow. Although we have 3 tiers, I would personally focus the explanation on lexically represented cells, because basically all Table Schema data type descriptions are about how to parse lexically represented data, e.g. date/times, objects, arrays, numbers (basically all the types).
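As a minimal sketch (the helper below is hypothetical, not spec text or frictionless-py API), the three tiers for the JSON file above could look like this:

import json
from datetime import date

# [physical] UTF-8 encoded bytes; they go to the JSON parser.
raw_bytes = b'[["id", "date"], [1, "2012-01-01"]]'

# [native: logical + lexical] the parser yields cells with the format's
# native types: 1 is already logical, "2012-01-01" is still lexical.
header, cells = json.loads(raw_bytes)

def apply_table_schema(cell, field_type):
    # [logical] Table Schema processing: parse lexical values, keep values
    # that are already in logical form.
    if field_type == "date" and isinstance(cell, str):
        return date.fromisoformat(cell)
    return cell

logical_row = [apply_table_schema(c, t) for c, t in zip(cells, ["integer", "date"])]
print(logical_row)  # [1, datetime.date(2012, 1, 1)]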

@nichtich (Contributor) commented Feb 21, 2024

I guess we need to rename the section to something like Data Processing and mention this workflow.

Yes. I'd like to provide an update but I don't know when so it's also ok for me if you come up with an update. To quickly rephrase your words:

We have three levels of data processing:

  1. The native format of tabular data, e.g. JSON, CSV with some specific CSV Dialect, Open Document Spreadsheet, SQL...
  2. An abstract table of cells, each given as an abstract value with a data type from the underlying data format (e.g. plain strings for CSV, SQL types for SQL, JSON scalar types for JSON...)
  3. A logical table of cells, each having a typed value.

Table Schema specification defines how to map from level 2 to level 3.

@roll (Member, Author) commented Feb 21, 2024

Table Schema specification defines how to map from level 2 to level 3.

I think it's a good wording!

Yes. I'd like to provide an update but I don't know when so it's also ok for me if you come up with an update.

Of course, no hurry at all. Let's just self-assign ourselves to this issue when one of us decides to start working on it (currently, I also have other issues to deal with first).

@akariv (Member) commented Feb 21, 2024 via email

@roll (Member, Author) commented Feb 21, 2024

It might also be converted to None, e.g. in case missingValues=[1].

Currently, it cannot, because missingValues items in v1 have to be strings. So basically, I think we found the root cause and the real decision to make (related to #621 as well): what is our data model?

  • (1) physical/logical+lexical/logical -- Table Schema processes only strings
  • (2) physical/native/logical -- Table Schema processes all the native values

I guess (2) might be cleaner and easier to explain. In this case it will be something like this e.g. for datetime:

datetime: if on the native-data level a value is represented lexically, then it MUST be in a form defined by XML Schema containing required date and time parts, followed by optional milliseconds and timezone parts
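To illustrate how an implementation might apply such a rule (a hypothetical helper, not spec text):

from datetime import datetime

def read_datetime_cell(value):
    # Only lexically represented values (strings) are parsed; values that are
    # already native datetimes pass through unchanged.
    if isinstance(value, str):
        # fromisoformat covers the common XML Schema dateTime forms; note that
        # a trailing "Z" is only accepted from Python 3.11 onwards.
        return datetime.fromisoformat(value)
    return value

print(read_datetime_cell("2012-01-01T12:00:00+01:00"))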

@roll roll removed the proposal label Feb 21, 2024
@roll (Member, Author) commented Feb 22, 2024

The spec is currently written / defined as (a) a 2-layer scheme. This is why missingValues is string[], and why trueValues/falseValues are string[]: everything should hit the TableSchema as a string physical value type, no matter its native origin type. I think this fits with TableSchema being billed as a description of textual data. (I realize the way the implementation currently handles JSON data is inconsistent with this; I'm referring to the broad intent of the spec in my reading here).

Note that it's not only about JSON; frictionless-py supports a dozen formats and in-memory data. It never worked like this, at least in Python and JavaScript: the parsers get an input cell and forward it as-is if it's not a string, and process it if it is a string. So, currently, these implementations are based on the (1) model from above.

@roll (Member, Author) commented Feb 22, 2024

I think it will be simple and correct to say that, regarding the data model, Table Schema is no more than an extension of a native data format (all of them). This concept is quite simple; for example, we have JSON and there is SUPERJSON, which adds support for date/time, regexp, etc. It's achieved via an additional layer of serialization and deserialization for lexical values. If we think about Table Schema that way, then it's still the (1) data model and missing/false/true values need to stay strings only. But this model doesn't imply that all the input data needs to be strings, or that it's only for textual data sources, not at all; it just means that Table Schema comes into play only when additional serialization/deserialization is needed.

PS.
I thought a little bit more about it, and I would say that on the Table Schema level there are basically only two relevant concepts (while Data Resource and Table Dialect deal with physical representation):

  • data-format data model
  • Table Schema data model

@akariv (Member) commented Feb 22, 2024 via email

@roll (Member, Author) commented Feb 22, 2024

But what if 'n/a' is the empty value? Would we say missingValues="n/a" or ""n/a"" (as is the physical representation of the value)?

I'm starting to think that we actually need to isolate Table Schema from any physical data representation and let it operate only on the logical level. On the logical level it's n/a, no matter how it's stored.

@nichtich (Contributor) commented Feb 22, 2024

It's 3 layers but we only have to think about two levels:

  1. cells with native-typed values (in the case of CSV, all values have native type string), aka the data-format data model
  2. cells with logical values of types in the type system of the Table Schema specification, aka the Table Schema data model

Furthermore, with 3 layers we also need a way to losslessly represent native values in the Table Schema.

We should aim to be able to represent common data types in the type system of Table Schema but we don't have to ensure lossless mappings of native type systems. We define a set of data types such as string, number types, boolean, n/a... and either types of native format X directly map to one of these Table Schema types or implementations must downgrade their values, e.g. by serialization to string type values.

P.S: Maybe this table of common native scalar data types helps to find out what is needed (also for #867).
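For illustration, a minimal sketch of such a downgrade (the mapping below is made up, not normative):

import json
from datetime import date, datetime

# Native types that map directly to a Table Schema type keep their value;
# anything else is downgraded by serializing it to a string.
DIRECT_MAPPING = {bool: "boolean", int: "integer", float: "number", str: "string"}

def downgrade(native_value):
    field_type = DIRECT_MAPPING.get(type(native_value))
    if field_type is not None:
        return field_type, native_value
    if isinstance(native_value, (date, datetime)):
        return "string", native_value.isoformat()
    return "string", json.dumps(native_value)

print(downgrade(3.14))               # ('number', 3.14)
print(downgrade(date(2024, 2, 22)))  # ('string', '2024-02-22')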

@roll (Member, Author) commented Feb 22, 2024

I bootstrapped a new specification called "Terminology" - https://datapackage.org/specifications/glossary/ - I think it will be great to define everything we need there and then refer to it across the specs. Lately I found that e.g. physical and logical data are also needed to define descriptor serialization. And e.g. Tabular Data, defined in the Table Schema spec, is really needed in other places as well. Also, we often mention implementations, data publishers/producers, consumers, etc., so it will be good to define those as well.

@khusmann (Contributor) commented:

It's 3 layers but we only have to think about two levels:

I agree. It's always technically (at least) 3 layers, in that the source format needs to be parsed to get at the value cells. What I'm trying to get at is how we define the type signature of our field parsers.

Right now the spec defines field / schema parsers as mappings from string -> FieldType.

If we promote this to NativeType -> FieldType, then we introduce a lot of validation ambiguity in the form of type coercion rules in field definitions.
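Sketched in Python type hints (FieldType and NativeType below are placeholder names, not terms from the spec), the two signatures under discussion look like this:

from datetime import date
from typing import Union

NativeType = Union[str, int, float, bool, None]        # e.g. JSON scalar types
FieldType = Union[str, int, float, bool, date, None]   # logical/typed values

# Current spec: fields map serialized strings to typed values.
def parse_field_from_string(cell: str) -> FieldType: ...

# Promoted version: fields would accept any native value, which forces the
# spec to define coercion rules between native types and field types.
def parse_field_from_native(cell: NativeType) -> FieldType: ...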

We define a set of data types such as string, number types, boolean, n/a... and either types of native format X directly map to one of these Table Schema types or implementations must downgrade their values, e.g. by serialization to string type values.

I think I agree. As a textual format, the TableSchema should be defined (as it currently is) in terms of always parsing serialized string values, no matter the source. In the special case where the native format maps directly, we can take a shortcut and directly import the data.

This way we keep missingValues: string[] (match missing value cells on serialized value strings)

and can avoid missingValues: (string | NativeType)[], (match missing value cells on serialized strings or their NativeType, and have to think about type precedence / coercion rules)

I'm starting to think that we actually need to isolate Table Schema from any physical data representation and let it operate only on the logical level. On the logical level it's n/a, no matter how it's stored.

This is another good approach worth exploring. The challenge will be to keep it backwards compatible...

@roll roll modified the milestones: v2-draft, v2-final Mar 11, 2024
@roll roll self-assigned this Apr 2, 2024
@roll roll added proposal and removed discussion labels Apr 3, 2024
@roll (Member, Author) commented Apr 3, 2024

Dear all,

Here is a pull request based on @akariv's data model - #49

I think this simple 3-layered model greatly improves the quality of the building blocks on which Data Package stands and simplifies field types a lot conceptually. Initially, I was more in favour of thinking about Table Schema as a string processor (serialize/deserialize), but having a native data representation makes things way easier and more consistent internally.

An interesting fact is that after separating out the native representation sections for field types, we can see that field types basically don't have any description on a logical level - something to improve in the future, I guess, as currently we mostly define only serialization/deserialization rules.

Please take a look!

@akariv (Member) commented Apr 3, 2024

Great work @roll! I reviewed the PR and left a few minor comments.

@roll roll modified the milestones: v2.0-final, v2.1 Jun 24, 2024
@roll roll removed their assignment Jun 26, 2024
@frictionlessdata frictionlessdata locked and limited conversation to collaborators Oct 21, 2024
@roll roll converted this issue into discussion #1043 Oct 21, 2024
@roll roll added feat and removed Table Schema labels Oct 22, 2024
@roll roll removed this from the v2.1 milestone Oct 22, 2024

This issue was moved to a discussion.

You can continue the conversation in discussion #1043.