Clarify physical/logical representation in Table Schema #864
Comments
The distinction between physical representation and logical representation is known under many names, e.g. lexical space vs. value space in XSD. Any name may be confusing without explanation. The current form is OK, but it might be better to switch to other names. In this case I'd also change "representation", because "representation of data" is confusing as well. My current best suggestion is to use lexical value and logical value instead of physical representation and logical representation. The current spec also uses "physical contents", which should be changed as well. |
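As a rough illustration of the two terms (a toy sketch in Python, not tied to any particular implementation): the lexical value is the string as it appears in the source, while the logical value is the typed value that string denotes.

```python
from datetime import date

lexical_value = "2000-01-01"      # the cell as it appears in a CSV file
logical_value = date(2000, 1, 1)  # the value a `date` field says it denotes

# A date field maps the lexical space (ISO 8601 strings)
# onto the value space (calendar dates).
assert date.fromisoformat(lexical_value) == logical_value
```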
Thanks @nichtich, I agree. I think, currently, confusion might occur because
Although, in general, I guess for the majority of people |
BTW
This sentence, I think, is very easy to understand, so I guess |
Hmm, I think a danger of replacing

So I actually prefer the current term

Although reading through the standards again I'm also now realizing that's not quite the case, because we're allowing type info to be associated with JSON source data... so it's actually not purely textual/lexical in a strict sense, which complicates things. Does this mean we throw an error or warn if a numeric field finds numeric values as strings (e.g. "0", "1", "2") in JSON source data? What if a string field schema gets numeric values? etc.

It'd simplify these cases if all "raw" data was just guaranteed to be parsed by the field schema as pure lexical/textual/string, and field props referencing

In the spirit of brainstorming to get more ideas flowing:

Other possible terms for

Other possible terms for |
Ok, the issue needs more. The whole section on Concepts needs to be rewritten to better clarify what is meant by "tabular data", because we also have two levels of description:
There are "raw" tabular data formats (TSV/CSV) and there are tabular data formats with typed values (Excel, SQL, JSON, Parquet... limited to non-nested cells...). I'd say a Table Schema only refers to the former. A SQL Table can be converted to a raw table (just export it as CSV) plus a Table Schema (inferred from the SQL Table definition), but SQL Tables are not directly described by Table Schema, nor is any JSON data, as wrongly exemplified in the current specification. |
Agreed! Perhaps it would clear some of the confusion if we renamed "Table Schema" to "Textual Table Schema" or "Delimited Table Schema" to reflect that the schema definition is specifically designed for textual data. It would also pave the way for future frictionless table schema standards for other types of physical data, e.g. "JSON Table Schema", "Excel Table Schema", "SQL Table Schema", which would be designed around the particularities of the types found in those formats. In that case, we'd have: The physical values of Textual Table Schema are all strings. As you say, it's much easier to think about conversions between formats than about type coercions if we try to use a textual table schema to parse an Excel file, for example. The latter has a lot of potential complexity / ambiguity. |
In
|
The conversation is happening here so I'm adding @pwalsh's comment:
|
First of all, probably I did not understand it correctly, but I never thought about physical and logical in the terms described here - https://www.gooddata.com/blog/physical-vs-logical-data-model/. I was thinking that in the case of Table Schema we're talking about basically a data source (like 1010101 on the disc or so-called text in CSV) and a data target (native programming types like in Python and SQL). So my understanding is that every tabular data resource has a physical data representation (in my understanding of this term). On current computers, it's always just a binary that can be decoded to text in the CSV case or just read "somehow" in the case of a non-textual format, e.g. Parquet. For every format there is a corresponding reader that converts that physical representation to a logical representation (e.g. a pandas dataframe from a CSV or Parquet file). I think here it's important to note that the Table Schema implementors never deal with any physical data representation (again, based on my understanding of this term). Table Schema doesn't declare rules for CSV parsers or Parquet readers. In my opinion, Table Schema actually declares only post-processing rules for data that is already in its logical form (read by native readers):

Physical Data -> [ native reader ] -> Source Logical Data -> [ table schema processor ] -> Target Logical Data

For example, for this JSON cell `2000-01-01`:
- physical data -- binary
- source logical data -- string
- target logical data -- date (the point where Table Schema adds its value)

Another note: from an implementor perspective, as said, we only have access to Source Logical Data. It means that the only differentiable parameter for a data value is its source logical data type. For example, a Table Schema implementation can parse the `2000-01-01` string for a date field because it knows the input logical type and the desired logical type. There is no access to the underlying physical representation to get more information about this value; we only see that the input is a string. For example, frictionless-py differentiates all the input values into two groups:
- string -> process
- others -> don't process

So for me it feels that Table Schema's level of abstraction is to provide rules for processing "not typed" string values (lexical representation), and that's basically the only thing this spec really can define, while low-level reading can't really be covered. So my point is that

cc @peterdesmet |
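A minimal sketch of that pipeline for a CSV source (illustrative only; the casting dictionary is a simplified stand-in, not the frictionless-py API):

```python
import csv
import io
from datetime import date

raw_bytes = b"id,start\n1,2000-01-01\n"        # physical data (bytes on disk)

# Native reader: the CSV parser yields untyped strings.
rows = list(csv.reader(io.StringIO(raw_bytes.decode("utf-8"))))
header, cells = rows[0], rows[1]               # source logical data: ['1', '2000-01-01']

# Table Schema processor: cast each string according to its field type.
casts = {"id": int, "start": date.fromisoformat}
record = {name: casts[name](value) for name, value in zip(header, cells)}
print(record)                                  # {'id': 1, 'start': datetime.date(2000, 1, 1)}
```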
I tend to agree that we actually have 3 states of data in the spec, as you write. A few notes, though:

1 - You write "Table Schema doesn't declare rules for csv parsers". However, the Data Package spec does have a CSV dialect section and a character encoding setting, which are precisely rules for CSV parsers that interact with the physical layer.

2 - 'source logical data' and 'target logical data' are not great names imo, as they impose some sort of order between the layers (source and target) which does not apply in many cases (e.g. when writing a data package).

So, I would suggest following your lead and using:
- "physical layer" for the lower-level binary data,
- "native format layer" for the data that the native, file-format-specific drivers work with,
- and "logical layer" for the Table Schema typed data
|
Hi @nichtich,

Are you interested in working on the updated version of #17 that incorporates comments from this issue? After working closely with the specs last month and refreshing in my memory implementation details from

For example, for a JSON data file like this:

```json
[
  ["id", "date"],
  [1, "2012-01-01"]
]
```

We have:
I think this tiering is applicable to basically any input data source from

I guess we need to rename the section to something like |
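For the JSON example above, a small sketch of the three tiers might look like this (the tier names and the casting step are illustrative assumptions, not spec wording):

```python
import json
from datetime import date

physical = b'[["id", "date"], [1, "2012-01-01"]]'  # tier 1: bytes on disk

header, row = json.loads(physical)                  # tier 2: native JSON types
print(row)                                          # [1, '2012-01-01'] -> already an int and a str

# Tier 3: logical values per a Table Schema with an integer and a date field;
# the integer is already typed by the JSON reader, only the date string
# still needs to be parsed.
logical = {"id": row[0], "date": date.fromisoformat(row[1])}
print(logical)                                      # {'id': 1, 'date': datetime.date(2012, 1, 1)}
```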
Yes. I'd like to provide an update but I don't know when, so it's also ok for me if you come up with an update. To quickly rephrase your words, we have three levels of data processing:

1. physical data (the raw bytes of the source)
2. source logical data (what the native reader of the format produces)
3. target logical data (values typed according to Table Schema)

The Table Schema specification defines how to map from level 2 to level 3. |
I think it's a good wording!
Of course, no hurry at all. Let's just self-assign ourselves to this issue if one of us decides to start working (currently, I also have another issue to deal with first) |
I agree, but I have an observation here - in @roll's example, it's mentioned that '1' is already a logical value. I would claim that it's still a native value - a JSON number with the value of 1. It might represent a Table Schema value of type integer, number, year, or even boolean (with trueValues=[1]). It might also be converted to None, e.g. in case missingValues=[1].

Therefore I would say that the distinction between native and logical is correct, and that all values start out as native values and get processed, casted and validated into logical values - even if they come from a more developed file format such as JSON. Then, in each case where we require a value to be present in the descriptor (e.g. in a max constraint, booleans' trueValues or missingValues), we need to specify whether a native value or a logical value is expected there.
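A sketch of the ambiguity being described, using made-up field descriptors and a hypothetical cast helper (not the actual Table Schema processing rules):

```python
# The native JSON value 1 can end up as different logical values depending on
# how the field is declared.
def cast(value, field):
    if value in field.get("missingValues", []):
        return None
    if field["type"] == "integer":
        return int(value)
    if field["type"] == "boolean":
        return value in field.get("trueValues", [True])
    raise NotImplementedError(field["type"])

native_value = 1
print(cast(native_value, {"type": "integer"}))                        # 1
print(cast(native_value, {"type": "boolean", "trueValues": [1]}))     # True
print(cast(native_value, {"type": "integer", "missingValues": [1]}))  # None
```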
|
Currently, it cannot because
I guess (2) might be cleaner and easier to explain. In this case it will be something like this e.g. for
|
Note that it's not only about JSON; |
I think it will be simple and correct to say that regarding the data model, Table Schema is no more than an extension of a native data format (all of them). This concept is quite simple: for example, we have JSON and there is SUPERJSON that adds support for date/time, regexp, etc. It's achieved via an additional layer of serialization and deserialization for lexical values. If we think about Table Schema that way, then it's still the (1) data model and missing/false/true values need to stay strings only.

PS.
|
That might be confusing though. E.g. for a JSON file with -1 denoting an empty value, we would say missingValues="-1". That's reasonable. But what if 'n/a' is the empty value? Would we say missingValues="n/a" or "\"n/a\"" (as is the physical representation of the value)? What if there is no natural string representation of the value (if the file format is not text based)?
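To make the ambiguity concrete, here is a tiny sketch (the `is_missing` helper and its matching rule are hypothetical) of what matching missingValues against a lexical, SUPERJSON-style serialization of native values could look like, and where it gets awkward for strings:

```python
import json

def is_missing(native_value, missing_values):
    # Hypothetical rule: compare the lexical (serialized) form of the native
    # value against the declared missingValues strings.
    lexical = native_value if isinstance(native_value, str) else json.dumps(native_value)
    return lexical in missing_values

print(is_missing(-1, ["-1"]))      # True: the JSON number -1 serializes to "-1"
print(is_missing("n/a", ["n/a"]))  # True, but only because strings are treated as
                                   # already lexical; json.dumps("n/a") would give '"n/a"'
```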
|
I'm starting to think that we actually need to isolate Table Schema from any physical data representation and let it operate only on the logical level. On the logical level it's |
It's 3 layers but we only have to think about two levels:
We should aim to be able to represent common data types in the type system of Table Schema, but we don't have to ensure lossless mappings of native type systems. We define a set of data types such as string, number types, boolean, n/a..., and either the types of a native format X map directly to one of these Table Schema types, or implementations must downgrade their values, e.g. by serializing them to string-type values. P.S.: Maybe this table of common native scalar data types helps to find out what is needed (also for #867). |
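A sketch of that rule under an assumed (hypothetical) mapping table from native Python/JSON types to Table Schema types, with the serialize-to-string downgrade as the fallback:

```python
import json

# Hypothetical mapping; the actual list of Table Schema types is what this
# issue / #867 still needs to settle.
DIRECT_MAPPING = {bool: "boolean", int: "integer", float: "number", str: "string"}

def to_table_schema_value(native_value):
    """Map a native value directly if its type is supported, otherwise
    downgrade it to a string-typed value by serializing it."""
    ts_type = DIRECT_MAPPING.get(type(native_value))
    if ts_type is not None:
        return ts_type, native_value
    return "string", json.dumps(native_value, default=str)  # lossy downgrade

print(to_table_schema_value(42))                # ('integer', 42)
print(to_table_schema_value({"nested": True}))  # ('string', '{"nested": true}')
```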
I bootstrapped a new specification called "Terminology" - https://datapackage.org/specifications/glossary/ - I think it will be great to define everything we need there and then refer to it across the specs. Lately I encountered that e.g. |
I agree. It's always technically (at least) 3 layers, in that the source format needs to be parsed to get at the value cells. What I'm trying to get at is how we define the type signature of our field parsers. Right now the spec defines field / schema parsers as mappings from

If we promote this to
I think I agree. As a textual format, the TableSchema should be defined (as it currently is) in terms of always parsing serialized

This way we keep and can avoid
This is another good approach worth exploring. The challenge will be to keep it backwards compatible... |
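To make the two readings concrete, here is a sketch of the two candidate parser signatures (the function names and types are illustrative, not spec wording):

```python
from datetime import date
from typing import Any

# Reading A (current): a field parser maps serialized strings to logical
# values, so non-string native values are simply out of scope.
def parse_date_from_str(cell: str) -> date:
    return date.fromisoformat(cell)

# Reading B (promoted): a field parser accepts any native value and must
# define behaviour for already-typed inputs as well.
def parse_date_from_any(cell: Any) -> date:
    if isinstance(cell, date):
        return cell                      # already logical; pass through
    if isinstance(cell, str):
        return date.fromisoformat(cell)
    raise TypeError(f"cannot cast {cell!r} to a date")
```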
Dear all,

Here is a pull request based on @akariv's data model - #49

I think this simple 3-layered model greatly improves the quality of the building blocks on which Data Package stands and simplifies field types a lot conceptually. Initially, I was more in favour of thinking about Table Schema as a string processor (serialize/deserialize) but having

An interesting fact is that after the separation of the native representation sections for field types, we can see that field types basically don't have any description on a logical level—something to improve in the future, I guess, as currently we mostly define only serialization/deserialization rules.

Please take a look! |
Great work @roll! I reviewed the PR and left a few minor comments. |
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Overview
This paragraph - https://datapackage.org/specifications/table-schema/#physical-and-logical-representation
I think the `physical` term might be confusing (see #621) as it seems to really mean `lexical` or `textual`, while `logical` sounds easy to understand in my opinion, though it might still need to be brainstormed.

Subissues:
- true/falseValues? #621