`format`: no options (other than the default).
The `list` field can be customised with this additional property:
- Why couldn't the native representation here also be a native JSON array?
- The strings we get after separating the delimited string representation are then to be treated according to the `itemType` property. However, if they are integers/numbers/bools/dates/etc., there are other properties (e.g. `format`, `decimalChar`, `groupChar`, `trueValues`, etc.) and constraints that might be required to validate the values (e.g. a list of country codes). I suggest we add an `itemOptions` property here to contain these item-specific properties (sketched below).
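A hypothetical sketch of what such an `itemOptions` property could look like (`itemOptions` is only a suggestion made in this comment, not part of the spec, and the property names inside it are illustrative):

```json
{
  "name": "prices",
  "type": "list",
  "itemType": "number",
  "delimiter": "|",
  "itemOptions": {
    "decimalChar": ",",
    "groupChar": "."
  }
}
```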
It was a long discussion between extending `array` or adding `list`, so we ended up with the current state.
I see... but do you agree that having `itemType` (e.g. `date`) without being able to specify the format makes it a bit pointless (and very hard for implementors to handle)?
@akariv I consider `list` more of a serialization feature than a full data description field like the others. For example, we have `number` with a lot of options, so if there is a CSV having numbers with a group char, a data publisher can notate it and then an implementation can read it. For `list`, my understanding is that it is only one-directional: the data needs to be created on the logical level, e.g. by reading from SQL or writing Python, and then an implementation will be able to serialize it and the receiver will be able to deserialize it. In this case, you don't need any formats or options other than the default.
I'm not sure I understand (and again, maybe this discussion is out of scope for this PR...). If I have a `list` field in a data package with `itemType: date` and `delimiter: "|"` (see the sketch below), and in my CSV file I see the value `"11/11/12|11/12/12"`, what is the implementation supposed to do with it? According to the spec, it's supposed to provide a list of logical `date`s, but there isn't enough information here to allow it to do that.
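For concreteness, the field definition described above would look roughly like this (a sketch assuming the `list`, `itemType`, and `delimiter` properties proposed in this PR):

```json
{
  "name": "dates",
  "type": "list",
  "itemType": "date",
  "delimiter": "|"
}
```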
Yea, sure, it's not related to this one, but I think it's very useful, so I hope we don't create too much noise for others =) My understanding is that the value `"11/11/12|11/12/12"` does not exist anywhere and won't ever be created. `list` values need to be created exclusively by Data Package-aware implementations. So, as it's serialized according to the Table Schema rules by the data publisher, it will be `2012-11-11|2012-12-11` and the data consumer will be able to read it.
BTW, I started from this PR - frictionlessdata/datapackage#31 - which provides a full field definition for array items, as in frictionless-py, so of course I understand what you mean. I think we're just not sure yet about the use cases and the balance between complexity and demand.
Yeah, I now see we had the same discussion 4 years ago :D frictionlessdata/datapackage#409 (comment). I didn't follow the decision made in February, but I still believe the `arrayItem` approach was probably the best... (even if it meant we made `name` mandatory only for field definitions in an object schema and not in an array schema). Anyway, let's leave it here, as that decision was already made.
Hopefully, now that we have open governance in place, we can adjust much more quickly to user needs, so nothing prevents us from extending `itemType` later.
@akariv I also still think the `arrayItem` approach was the best -- my use case is multiple-select options using the categorical type: frictionlessdata/datapackage#31 (comment). Also, we could potentially get typed-tuple support if we allow it to be defined as an array of field descriptors. (And both of these cases have parallels in nested types with implementations like Polars & Parquet.) But as @roll says, we can extend `itemType` later, hopefully :)
FWIW, I initially shared your skepticism, but on further reflection I came to the opposite conclusion. Here's why.

I spend a lot of time generating data packages for distribution. Those packages, being Frictionless Tabular Data Packages, contain a Table Schema, and for various reasons the actual resources are written in CSV format. I provide a guarantee that these packages can be validated with

For the most part, the pipelines that generate these packages are written in Python (using Pandas), and validation (against the schema) is one of the final steps. In that case, it is the Pandas data frame that is being validated, as opposed to the ultimate CSV file, which is what the user will often validate before reading the data into their analytic package of choice.

Initially, I (and folks working with me) were getting tripped up by the fact that the same schema that is valid while running the pipeline was not necessarily valid when validating the CSV file (e.g., #1599). However, in retrospect, I realized that part of this was likely due to ignoring the difference between the physical (in the case of the CSV files) and native (in the case of the Pandas data frame) representations.
Actually, I think the additional complexity in this case is explicit; specifically, we are trading misleadingly simpler implicit behavior for apparently more complex explicit behavior. And per the Zen of Python, the latter is generally preferable.

What I do think is that we should do everything we can to make the resulting solution as simple as possible for the greatest number of users and common use cases. I think @roll's work on this PR is excellent, but I think that we can do a bit more to simplify things for large groups of users. For example, we could add a section near the top named something like "Start Here: Simple Cases" that says things like: "If you are working exclusively with CSV files, then the physical and native formats will always be equivalent and will be equal to the string value represented in the (decoded) version of the file." Or something like that (it's been a long day, so this suggestion could almost certainly benefit from some wordsmithing).
Really appreciate your perspective here, @pschumm!
Exactly. TableSchema was designed with definitions of types derived from the features found in delimited textual data. So when a TableSchema is used to validate a given CSV, there are no surprises, because TableSchema field types are defined by way of physical textual features.

The surprises start popping up when we start doing lossy conversions on the data to other formats. When we convert CSV+TableSchema to a Pandas data frame (binary data + PandasTypes), we lose information in the conversion (for example, distinct

Then when we use a TableSchema to validate that Pandas dataframe, we're doing yet another lossy conversion: we have features and values in the binary data + PandasTypes that may not round-trip in the same way back to TableSchema field definitions. Our original TableSchema is not guaranteed to validate.

Using a TableSchema to validate SQL databases (binary data + SqlTypes) requires similarly lossy type conversions. If we use TableSchema as the universal way to validate SQL databases & other typed formats, I think it will lead to situations similar to what you experienced with validating the same data as a CSV vs a Pandas dataframe, where validation & other behaviors diverge in unexpected ways across different formats. By having different schema types for different typed data formats (e.g. SQLTableSchema, XLSXTableSchema, etc.), we avoid these implicit type conversions in validation:
In other words, when we convert a CSV+TableSchema to an SQL db, I think it should result in an SQL db along with a SQLTableSchema that can be immediately validated against… not a bare SQL db that may or may not validate with the original TableSchema.

That said, I recognize I'm the minority (only?) voice taking this position… I fully expect/accept being overruled on this! I don't mean to kill the momentum here, just want to summarize my concerns moving forward.
I really like this idea! I think it also illustrates how we're thinking along very similar lines here – we get exactly this when we have explicit schemas for different formats with logical field types built on the format's native types.

Somebody using SQL can go straight to the SQLTableSchema page to describe their database with frictionless field type definitions that map to their familiar native SQL types, in the same way someone can go to the TableSchema page to describe their delimited text file with type definitions derived from the physical features of delimited text. If I'm creating an XLSXTableSchema, I can focus on defining my logical types from the native types that exist in XLSX, and I don't need to think about the peculiarities of SQL or delimited textual formats.

Then conversions between formats become well-defined and easy to explicitly represent as standard conversion rules between these schema types… taking us one step closer to a "pandoc" for data & metadata, which I know is a dream we both share :)
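Purely as an illustration of this idea (neither SQLTableSchema nor the `nativeType` property shown here exist in the spec; both are hypothetical), such a format-specific field definition might look like:

```json
{
  "name": "created_at",
  "type": "date",
  "nativeType": "TIMESTAMP"
}
```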
I do feel that with
The problem I see with this idea is that the intention of the spec becomes unclear wherever the logical type cannot be explicitly stated (the proposed categorical field, missingValues, trueValues, and falseValues come to mind), as then either the native type or its string representation could be considered valid. For example, for categorical (#48), this could mean
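The kind of ambiguity presumably meant here can be sketched as follows (an illustration, not taken from the PR): under "native values" it is unclear whether these two definitions are equivalent, i.e. whether categories listed as native integers also match their string representations and vice versa:

```json
{
  "type": "categorical",
  "categories": [1, 2, 3]
}
```

```json
{
  "type": "categorical",
  "categories": ["1", "2", "3"]
}
```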
@khusmann The thing here is that the schema is completely abstracted from the underlying file protocols and formats, dealing only with the input tabular data stream. So basically the schema doesn't know whether the data comes from SQL or CSV, as it's typed based on the computational environment anyway. In the case of frictionless-py these are Python types, but they could be Polars types, etc. It's probably not a perfectly puristic approach, but it enables pluggable implementations that can support multiple file protocols and formats.
I agree it does to some extent, but I think the real-world scenario we got in 7 years was frictionlessdata/datapackage#621, so I would say that's OK.
I haven't had my first coffee of the morning yet, so this might not make any sense, but while this exercise may feel a bit like peeling an onion, I do feel it's worthwhile.

I feel like we're trying to do two things simultaneously: (1) moving the spec forward to v2 in a timely way by incorporating meaningful additions and improvements while (hopefully) not doing any harm; and (2) rethinking some of our fundamental assumptions and objectives. The key is to balance these two (easy to say, harder to do).

It may sound hokey, but I think our North Star should be the very word "Frictionless" itself. We may (and should) worry about the distinctions between physical, native and logical values, as well as about edge cases, but for the average user our measure of success will be whether Frictionless makes their life better, i.e., whether it makes data easier to use and share (including making data FAIR), whether it improves data quality (e.g., through validation and avoiding error-prone manipulations), and whether it helps avoid common pitfalls. At least that's what Frictionless means to me.

@ezwelty raises a good point above; one could argue that we just need to add a few more
Being pragmatic has always been one of the main principles of Frictionless Data. For me, this PR is really just writing down what is de facto implemented in basically any existing Data Package implementation. My guess is that, even though we're changing some definitions here, it won't require any change for any existing implementation or user.
Agreed! I think this represents a big clarification of direction and scope for Frictionless, analogous to something like choosing duck-typing vs static-typing for a language. At the end of the day, it's a value judgment – a tradeoff between expressiveness and predictability.

For validation and large-scale interoperability, I lean towards predictability, because it allows you to make certain guarantees & invariants that users and implementations can count on (e.g. this data package will validate in the same way no matter which implementation we validate it with). But it sounds like the group wants to prioritize expressiveness because it'll reduce barriers to adoption, and I can appreciate that too.

I think the apparent simplicity of duck-typing is partly why Python got so popular… even if the simplicity came at a cost of predictability / robustness, which is why static typing was eventually retrofitted into the language much later. (So maybe it's something we can revisit down the road, as well.)
Right. Up until now, I've treated the handling of formats besides delimited text as somewhat undefined behavior, and this PR makes it more explicit, which is good. So if the group is in agreement that duck-typing is what we want for the spec, then I do think this PR is the way to go! I've upgraded my vote from 👎 to 👀.

That said, as @ezwelty points out (and I have also pointed out in previous threads), there's the remaining issue of representing native types when they are not representable in JSON. To summarize our options, we have:
@ezwelty – by adding more

```json
"missingValues": [
  -99,
  "NA",
  {
    "value": "1900-01-01",
    "type": "date"
  }
]
```

But as @roll points out, this is vanishingly rare in practice -- if we have a native date type then it's most likely going to be nullable anyway, so the above definition is not something you'd see. Similarly, when there is no native type for

So I think maybe the key for me in the present PR is to make this a guarantee -- as long as we're ensuring that all places where we're wanting to specify native type values are ALWAYS variations of string or number, and we're not concerned about edge cases (like loss of precision when working with native floating point, etc.), we're golden, right? As @roll says, it gives us a good pragmatic representation of how native values are being used by the spec in present implementations.
I think this one is something to discuss on the CC. TBH, I don't see a big (or even any) difference for users/implementations if we just switch from:
to
The bottom line, in my opinion, is that we rely on data formats for data reading anyway, and Table Schema post-processes data types that are not natively supported and are stored as strings. Using
BTW @khusmann, what do you think about this one: frictionlessdata/datapackage#709
Sounds good!
For me, the key difference is clarified by whether we say the following two definitions are identical or not:
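The two definitions being compared presumably differ only in whether -99 is written as a number or as a string, roughly like this (a reconstruction based on the description below, not the exact snippets):

```json
{"missingValues": [-99]}
```

```json
{"missingValues": ["-99"]}
```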
If we say "native values", I would expect the two definitions to exhibit different behavior on an integer SQL column: In the first definition, numeric -99 will be used as missing value, in the second there will be no missing values defined because an integer column does not have string values. If we say "lexical values", I would expect these definitions to exhibit identical behavior on an integer SQL column: They both will define string -99 as a missing value, because everything is converted to a string/lexical representation before missingValue comparisons take place.
I 100% agree with what you said in frictionlessdata/datapackage#33 (comment):
(My instinct would be to add a "decimal" type with properties for leftDigits, rightDigits, etc. and use the existing "number" for float types)
Yes,
In this case, i.e. v1, we just don't allow non-strings in missing values. So, basically, it's just a subset of the first example.
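For illustration, the v1-style string-only form would be something like (a sketch, not taken from the thread):

```json
{"missingValues": ["", "NA", "-99"]}
```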
@khusmann I guess it makes me a wee bit uneasy that it's so easy to generate a Table Schema that behaves differently depending on the data format, since my impression was that Table Schema aimed to be somewhat portable and format-agnostic (the format-specific stuff being the concern of the parent Resource). Say we want an

```json
{
  "type": "integer",
  "constraints": {"enum": [1, 2, 3]},
  "missingValues": [""]
}
```

```json
{
  "type": "categorical",
  "categories": [1, 2, 3],
  "missingValues": [""]
}
```

If as proposed the

Similar issues arise if we allow numeric missing values (e.g.

@roll This feels weird to me, but maybe diverging behavior is already the reality in common use cases of v1 of Table Schema?
I would say that -99 is needed for the integer field and "-99" for the string field. It's not about formats.
The problem here is that
What about boolean fields in SQLite vs CSV? SQLite doesn't have a native bool, and by convention uses integers instead:

```json
{
  "name": "field1",
  "type": "boolean",
  "trueValues": [1],
  "falseValues": [0],
  "missingValues": [-99]
}
```

Compared to the equivalent bool field coming from CSV:

```json
{
  "name": "field1",
  "type": "boolean",
  "trueValues": ["1"],
  "falseValues": ["0"],
  "missingValues": ["-99"]
}
```

Side note -- by default

```json
{
  "name": "field1",
  "type": "boolean"
}
```

If we want this def to work with native values, default

...so I agree with @ezwelty's unease.
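For reference, the defaults in the current spec are string-based, roughly as below (a sketch; consult the Table Schema spec for the authoritative defaults), so a native integer column's 1/0 would not match them without also listing numeric variants:

```json
{
  "trueValues": ["true", "True", "TRUE", "1"],
  "falseValues": ["false", "False", "FALSE", "0"]
}
```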
There is a reason why several scripting languages automatically convert between number and string depending on context... |
Indeed. I believe there are three primary cases:
The only constraints that are relevant here, I think, are

@khusmann, this really isn't the place for this (apologies), but I don't want to forget and I know you'll read this. Your current proposed

Once this issue is resolved, we can add (or not) references to native specification to your addition.
Well put @nichtich, but I just want to make sure I'm following what you're implying here. Are you suggesting that we formally adopt this as part of the Frictionless spec in order to address the
Where I think we want to avoid confusion here is in the distinction between the following two cases:
At the moment, (1) is potentially dependent on the source of the data (i.e., the native type). And since

I would propose that (1) is more a statement about a specific dataset and/or source, while (2) is, at least potentially, a more general statement. Think something like

Where I'm going with this is that, at least for me, having to worry about the distinction between
I don't think the confusion here is related to the nature of

So for the example of a boolean field in SQLite vs CSV: 1) both formats do not have a native boolean type, and 2) logical booleans are equally valid to derive from a native string or a native numeric. Therefore, whether we need to write

To sum up our possible ways forward, I think there are three options that we've covered (to help frame our discussion in the upcoming community call):
As @ezwelty said earlier, I still think we're getting a bit ahead of ourselves with native values. But if we're really wanting to relax our physical value types in v2, I prefer (3) over (2) because, like (1), it decouples the source native format from the way source values are expressed in the schema. (@ezwelty @pschumm I'll address your comments re: categorical in the categorical PR in a bit!)
CLOSED in favour of frictionlessdata/datapackage#71 (probably will just go back to the issues for more discussion)
Related issues: physical/logical representation in Table Schema (datapackage#864); true/falseValues? (datapackage#621)