Add pattern - Table Schema: Relationship between Fields #859
Conversation
Thanks @loco-philippe! I'm going to request a review from the Working Group on Monday.
Thank you @roll! Is there enough information, or should I add something?
First, let me say that I recognize how much work and thought has gone into this proposal. I read through it carefully, and have several thoughts:
Personally, I don't believe that these warrant an addition to the official table schema at the moment (I might be persuaded by different examples or use cases). That said, I do see the value in facilitating validation of these types of relationships, and would suggest as an alternative adding such things to the Python framework as built-in validation checks.
Thank you also for this relevant and constructive feedback!
I agree with this analysis.
Indeed, the examples chosen are too basic and unrealistic (I used another example in the
The French IRVE open data (used for the simplified example shown above) is interesting to look at:
Regarding the subject as a whole, in my opinion there are several points to consider:
If you can, I would focus on v5 since it's been out for a while now. My suggestion would be to write a plugin to provide a new "relationship" check that takes three arguments: name of field 1, name of field 2, and type of relationship (i.e., derived, coupled, or crossed). It might be helpful to look at the row-constraint check as an example. Note that you will most likely want to specify all of validate_start() (to verify that both fields are present in the resource), validate_row() (to build up your two lists of values as the data are streamed), and validate_end() (to do the appropriate check based on the specified relationship). Once your new check is available, you may then specify it in your data package via an inquiry. This makes it easy to validate the data, regardless of whether you're working with the CLI or the Python API. And as I said above, I think it would then be more appropriate to add this check to the list of built-in checks, rather than to add it to the official table schema. If you need help doing this, let me know (but I'm afraid I'm tied up with another project at the moment). And I defer to @roll if he wants to suggest an alternative strategy.
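The core logic that `validate_end()` would run can be sketched in plain Python, independent of any framework. Note that the definitions of the three relationship types below are my reading of the proposal and may not match the pattern's exact semantics; `check_relationship` is an invented name for illustration:

```python
# Sketch of the end-of-stream check for the three relationship types
# proposed in the pattern (derived / coupled / crossed). The two lists
# are the parallel field values accumulated during validate_row().

def check_relationship(values1, values2, link):
    """Return True if the two parallel value lists satisfy the relationship."""
    pairs = set(zip(values1, values2))
    if link == "derived":
        # each value of field 1 maps to exactly one value of field 2
        return len(pairs) == len(set(values1))
    if link == "coupled":
        # one-to-one correspondence between the values of the two fields
        return len(pairs) == len(set(values1)) == len(set(values2))
    if link == "crossed":
        # every combination of distinct values appears at least once
        return len(pairs) == len(set(values1)) * len(set(values2))
    raise ValueError(f"unknown relationship: {link}")

quarters = ["T1", "T2", "T3", "T4"]
months = ["jan", "apr", "jul", "oct"]
print(check_relationship(quarters, months, "coupled"))  # True
```

A real check plugin would wrap this in a `Check` subclass and yield check errors instead of returning a boolean, but the set arithmetic is the interesting part.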
Thank you @pschumm for these detailed answers! I also won't have much time to work on this implementation in January, but I think I can work on it in February. On the other hand, @roll, in terms of methodology: how do we create a data schema from a data model if the notion of relationship does not exist, and then validate that a dataset respects the constraints expressed in the data model? (It would be difficult to explain that Table Schema processes the data included in the Fields but does not concern itself with the structure of the Dataset.) Of course, from my point of view, the answer is to include the notion of relationship in a data schema and use a corresponding validation tool!
Here is a first summary of the discussions:
@roll, next steps?
This is an interesting proposal. @loco-philippe, I am wondering if it can solve the use case for conditional requirements, brought up in #169 and tdwg/camtrap-dp#32. E.g. for data:
Express that
Co-authored-by: Peter Desmet <[email protected]>
Thank you @peterdesmet!
This topic is interesting and I have several ideas for dealing with it. I'm going to work on it for a while longer, keeping in mind:
and, at the same time, the way pandas treats this type of subject and the links with the 'relationships' proposal. I welcome any ideas or comments on the subject!
FYI, the
As noted above, I was able to spend some time in February understanding Table Schema and implementing an example solution for the
I also suggest, as indicated in the exchange, modifying the pattern to keep only the third option (new Table descriptor) and deleting the two other options proposed for this descriptor (new Field descriptor and new constraint property). I will then also add a link to this example implementation (after taking your comments into account). @roll, how do you see the continuation of this pattern?
I took the time to look at how Table Schema works, and I think I can implement a solution for the conditional requirements question (before the end of February).
Finally, the implementation is simpler than expected! This second notebook presents the solution (which works). Do you have more complete examples that I can test to validate the proposed solution? Note: how should this proposal be taken into account (addition to the pattern document, comments added to open issues)?
Yes, here's a more complex use case. It has two fields:
The (slightly adapted) rules are:
Not sure all of this should be expressible in Table Schema, but it is a real use case. 😄
Yes, it works (exactly as you expressed the constraints)! I added this example in the Jupyter notebook (example 2).
The title of this PR includes "Add pattern...", however @roll linked this to #862 as closing that issue, which involves promoting uniqueness constraints to the spec. So are we talking about a pattern or a change to the spec here? Just want to make sure we're clear on this when discussing. While these various constraints are all important, I think we should try to articulate what makes a specific constraint appropriate for inclusion in the spec versus something that should be implemented using a checklist or inquiry. Otherwise, any potential constraint would be eligible for consideration to be included in the spec, and I don't think we want that if we want to avoid unnecessary complexity in the spec. I'm afraid I can't offer a specific proposal at the moment, but I'm keen to hear others' thoughts on this.
I admit that I did not understand the dependency between
Example: in this example
The relationship is validated for the previous table but not for the next one:
How to use
I appreciate all the thought that's gone into this proposal: representing relationships between fields is something that, in general, has a lot of interesting potential directions. That said, I want to echo @pschumm's concern about this approach being considered for inclusion in the spec. In general, I think there's quite a bit of complexity here that might be better expressed in other ways (e.g. checklists or inquiries). I would prefer to let these ideas simmer a bit longer before promotion to spec level. In addition to checklists and inquiries, here are other routes we might want to consider:
Tagged unions have the benefit of being a well-established, well-understood abstraction already implemented across a myriad of data parsing and validation libraries: for example, Python's pydantic and TypeScript's zod. Implementing this behavior as a tagged union field type would also have the advantage of making it a separate, self-contained proposal, decoupled from the more complex idea of "field relationships". Here's an example of how a tagged union field type might look for @peterdesmet's example (using the proposed categorical syntax in #875):

```json
{
  "fields": [
    {
      "name": "measurementType",
      "type": "categorical",
      "categories": ["cloudiness", "temperature", "wind force"]
    },
    {
      "name": "measurementValue",
      "type": "union",
      "tag": "measurementType",
      "match": {
        "cloudiness": {
          "type": "categorical",
          "categories": ["clear", "mostly clear", "partly cloudy", "mostly cloudy", "cloudy", "unknown"]
        },
        "temperature": {
          "type": "number",
          "constraints": {
            "min": 0,
            "max": 20
          }
        },
        "wind force": {
          "type": "categorical",
          "categories": [0, 1, 2, 3, 4, 5]
        }
      }
    }
  ]
}
```

Note that the field-level validation on this type would ensure that all the levels of the
For example, the aforementioned apples table would be better captured by a datapackage with two resources instead of one: one resource to capture the product information, and the other to capture the order line items.
Resource "product-info":
Resource "orders":
Now, we can use the following:

```json
{
  "name": "order-database",
  "resources": [
    {
      "name": "product-info",
      "schema": {
        "fields": [
          {
            "name": "product",
            "type": "string"
          },
          {
            "name": "plants",
            "type": "categorical",
            "categories": ["fruit", "vegetable", "flower", "herb"]
          }
        ],
        "primaryKey": "product"
      }
    },
    {
      "name": "orders",
      "schema": {
        "fields": [
          {
            "name": "product",
            "type": "string"
          },
          {
            "name": "quantity",
            "type": "number",
            "bareNumber": false
          },
          {
            "name": "price",
            "type": "number"
          }
        ],
        "foreignKeys": [
          {
            "fields": "product",
            "reference": {
              "resource": "product-info",
              "fields": "product"
            }
          }
        ]
      }
    }
  ]
}
```

With this datapackage, if the data consumer wishes to obtain a table with the columns product, plants, quantity, and price, it's simply a
I realize this requires the data producer to normalize their resource tables, and the producer may not want to for whatever reason. And that's fine! In those cases, I think checklists or inquiries should be used, as mentioned by @pschumm, because they offer full flexibility. But I think adding validation support for relationships within unnormalized tables directly into the spec goes a bit too far: it introduces quite a bit of complexity into the spec for something that is already solved with good database design (or ameliorated with checklists / inquiries).
Thank you @khusmann for these comments. Here are my remarks:
@khusmann Oh, I like the tagged union approach to conditional validation! Very easy to read. Would you mind submitting this as a separate issue?
As indicated in a comment above, I updated the pattern document (only the third option, with a link to the custom check).
I didn't get any feedback on this point (I had in mind a process similar to the one I used for the pandas specifications). Other questions:
@loco-philippe As per the Working Group discussion, from Data Package (v2), patterns (recipes) become more autonomous, so we merge it as is, based on your vision, and you control the subsequent evolution of the approach based on user feedback and implementation experience.
Also published here - https://datapackage.org/recipes/relationship-between-fields/
Thanks @roll, I think the proposed solution (recipes) is good. It allows a proposal to be made available while waiting for user feedback. Furthermore, the discussion in this PR also focused on conditional constraints (see the request from @peterdesmet).
Thanks @loco-philippe! It will be great! Also, it's a good point about a recipe namespace for properties. I think we need to discuss it at the next Working Group meeting (cc @sapetti9).
This pull request adds a pattern to express relationships between Fields and to define an associated validation tool.
This PR follows the issue #803 and is proposed by @roll.
Several options are proposed for the associated descriptor.
Thank you very much for your review and comments.
Note: I also added a table of contents.