Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce a codeList property to the field descriptor #888

Closed
djvanderlaan opened this issue Mar 2, 2024 · 10 comments
Closed

Introduce a codeList property to the field descriptor #888

djvanderlaan opened this issue Mar 2, 2024 · 10 comments

Comments

@djvanderlaan
Copy link

djvanderlaan commented Mar 2, 2024

We work a lot with surveydata and administrative data. In both cases files often contain fields where the values in the field should come from a limited list of possible values. These values also have a specific meaning. Some examples:

  • Questions from a questionary coded as 1-5 and 9, where 1="Strongly disagree", 2="Disagree", ... 5="Strongly agree" and 9 = "Did not answer"
  • Administrative data where one of the columns is the economic classification of a company (e.g. using NACE).

Properties of these codes:

  • In the examples the codes are either integer of string values.
  • These values should be from a limited list of valid codes.
  • Some values indicate missing values.
  • The values usually have labels and/or descriptions.
  • Sometimes the codes contain a hierarchy. This is for example often the case with the NACE codes.
  • The lists can be small (2-3 codes) or very large (70,000 in the case of ICD10).

We are aware of the suggestion in issue #875 for supporting categories which is the same issue/problem. However, there are a few 'wishes' that are not covered by the suggestion in that issue and we believe the suggestion below is also easier to implement.

What we would like/need:

  • Possibility to indicate that a given field should use values/codes from a given list.
  • It should be possible to store this list in the datapackage meta data (datapackage.json) itself or have the codes in a file as large lists of codes make the meta data too bloated and this makes maintenance also more difficult. This file could be part of the datapackage or could be hosted externally.
  • It should preferably be possible to define hierarchies in the codes.

What we suggest:

  • Add a property codeList to the FieldDescriptor. This MUST be string with the name of a DataResource in the DataPackage (if there is a syntax for referencing to a DataResource in an external DataPackage, this would also be valid).

This has a number of advantages:

  1. DataResources allow for inline data in the data property; for files in the DataPackage using the path property with a relative path and for external data using the path property with a URL.
  2. Furthermore, we can reuse properties like name, descriptor source and license and schema to describe the code list.
  3. We can use all of the properties and tooling that are available already for DataResources. This makes implementation less work. Indicating a codelist for a field can already be useful without specific tooling for codelists. The user can see that the field uses a codelist and can manually read the corresponding data resource using currently existing tooling.

Code List Resource
We don't yet have a concrete suggestion as to what should be in the dataset containing the code list and what format this dataset should have. We currently have an implementation that assumes that the first column in the dataset contains the codes and the second the labels of the codes. This is, however, minimal functionality. Some thoughts:

  • We could add fields to the DataResource that describe what fields are the code, labels etc.
  • It would be nice if it was not required that the dataresource containing the codelist is a TabularDataResource. That would also allow the data resource to point to, for example, a SDMX codelist (common for many official statistical datasets) or ClaML (used for some medical classifications). The type can be indicated using the format and/or mediatype properties of the data resource. However, the default (and only supported format) would be regular Tabular Data Resources.
  • We can take inspiration from how codelists are handled in SDMX. For example, SDMX2 allows for codes, labels, descriptions, parents (for simple hierarchies) and multi lingual labels etc. The functionality in SDMX3 is much more extensive (codes can have periods of validity; codelists can be part of multiple hierarchies; codelists can be selections or extensions of existing codelists). If we follow the basic SDMX codelist, then a basic code list would have the columns code (or id) and name ; with optional columns description, locale and parent (indicating missing values seems to be missing; I can check with SDMX experts how this is handled; probably using custom annotations).

Example
Possible example with both a codelist in a file and inline data:

{ 
  "name": "highest_education",
  "resources": [
    { 
      "name": "edulevel",
      "format": "csv",
      "mediatype": "text/csv",
      "path": "edulevel.csv",
      "encoding": "utf-8",
      "schema": { 
        "fields": [
          { 
            "name": "id",
            "type": "integer"
          }, {
            "name": "place_of_residence",
            "type": "string",
            "codeList": "codelist-regions"
          }, { 
            "name": "edu_level",
            "type": "integer",
            "codeList": "codelist-edu_level"
          }
        ]
      }
    }, {
      "name": "codelist-regions",
      "schema": { 
        "fields": [
          { 
            "name": "code",
            "type": "string"
          }, { 
            "name": "name",
            "type": "string"
          }, {
            "name": "parent",
            "type": "string"
          } 
        ]
      },
      "path": "codelist-regions.csv"
    }, {
      "name": "codelist-edu_level",
      "schema": { 
        "fields": [
          { 
            "name": "code",
            "type": "integer"
          }, { 
            "name": "name",
            "type": "string"
          }
        ]
      },
      "data": [
        {"code": 1, "name": "Low education"},
        {"code": 2, "name": "Medium education"},
        {"code": 3, "name": "High education"}
      ]
    }
  ]
}

@fomcl

@fomcl
Copy link

fomcl commented Mar 2, 2024

I like this proposal. I have, however, two suggestions. Both are inspired by Swagger's $ref notation

External codeLists

A reference to an external codeList may be made using JSON Reference syntax (RFC3986) and JSON Pointer syntax (RFC6901)

Example:

...
"path": "http://path/to/your/frictionless.json#/codelist-edu_level"
...

Non-basic codeLists

Jan's example assumes a "code" and a "label" in the first and second column in the file, respectively. This will work in many cases and it is easy and clean to specify this. However, we become more flexible by allowing an optional codeListSpec attribute that has codeField and labelField keys.

Tabular data

The codeField and labelField attributes take either a column name (string) or a column number (int).

Example:

...
          }, { 
            "name": "edu_level",
            "type": "integer",
            "codeList": "codelist-edu_level",
            "codeListSpec":  {"codeField":  "MyCode", 
                                           "labelField": "TheLabel"}
          }
...

Non-tabular data (json, xml)

The codeField and labelField keys also use JSON Pointer notation to specify the codes and labels in a hierarchical json or xml data source. This source may typically be the response of an HTTP request to a REST API.

Example:

"path": "http://some-server.com/api/v1/data/#common/prefix",
...
...
          }, { 
            "name": "edu_level",
            "type": "integer",
            "codeList": "codelist-edu_level",
            "codeListSpec":  {"codeField":  "/bar/code", 
                                          "labelField": "/foo/label"}
          }

...

Complex data

What about other data sources, like the more complicated ICD-10? In cases like this, a (custom) mimetype/mediatype (e.g. application/vnd.sdmx.generic+xml;version=2.1)
might be the only thing we can do in terms of the specification. This indicates that it's up to the client tool how to interpret that data source.

@khusmann
Copy link
Contributor

khusmann commented Mar 5, 2024

Thanks for jumping into the conversation here! Coded categorical data is something I'm very interested in for frictionless as well.

I think the functionality you're after here might be more generally modeled / solved by indicating primaryKey / foreignKey relationships between tables. This would have the added benefit of having these relationships be exportable to formats like SQL.

For example, the example you include could be modeled like this:

{ 
  "name": "highest_education",
  "resources": [
    { 
      "name": "edulevel",
      "format": "csv",
      "mediatype": "text/csv",
      "path": "edulevel.csv",
      "encoding": "utf-8",
      "schema": { 
        "fields": [
          { 
            "name": "id",
            "type": "integer"
          }, {
            "name": "place_of_residence",
            "type": "string",
          }, { 
            "name": "edu_level",
            "type": "integer",
          }
        ],
        "primaryKey": ["id"],
        "foreignKeys": [
          {
            "fields": "place_of_residence",
            "reference": {
              "resource": "codelist-regions",
              "fields": "code"
            }
          }, {
            "fields": "edu_level",
            "reference": {
              "resource": "codelist-edu_level",
              "fields": "code"
            }
          }
        ]
      }
    }, {
      "name": "codelist-regions",
      "schema": { 
        "fields": [
          { 
            "name": "code",
            "type": "string"
          }, { 
            "name": "name",
            "type": "string"
          }, {
            "name": "parent",
            "type": "string"
          } 
        ],
        "primaryKey": ["code"],
        "foreignKeys": [
            "fields": "parent",
            "reference": {
              "resource": "",
              "fields": "code"
            }
        ]
      },
      "path": "codelist-regions.csv",
    }, {
      "name": "codelist-edu_level",
      "schema": { 
        "fields": [
          { 
            "name": "code",
            "type": "integer"
          }, { 
            "name": "name",
            "type": "string"
          }
        ],
       "primaryKey": ["code"]
      },
      "data": [
        {"code": 1, "name": "Low education"},
        {"code": 2, "name": "Medium education"},
        {"code": 3, "name": "High education"}
      ]
    }
  ]
}

I think this solves almost all the requirements you listed above:

✔️ Possibility to indicate that a given field should use values/codes from a given list.
❌ It should be possible to store this list in the datapackage meta data (datapackage.json) itself
✔️ Have the codes in a file as large lists of codes make the metadata too bloated and this makes maintenance also more difficult. This file could be part of the datapackage or could be hosted externally.
✔️ It should preferably be possible to define hierarchies in the codes. (Notice the self reference in the region codes!)

It additionally has the benefits of allowing for more flexible naming of fields, being SQL compatible, as mentioned above, and leveraging all of the existing validation built around primary & foreign keys.

For the item that it doesn't address (The simple case of using values / codes inline when they are not long & do not have other complex relationships), I think the current categorical proposal #875 is still a good solution:

{ 
  "name": "highest_education",
  "resources": [
    { 
      "name": "edulevel",
      "format": "csv",
      "mediatype": "text/csv",
      "path": "edulevel.csv",
      "encoding": "utf-8",
      "schema": { 
        "fields": [
          { 
            "name": "id",
            "type": "integer"
          }, {
            "name": "place_of_residence",
            "type": "string",
          }, { 
            "name": "edu_level",
            "type": "categorical",
            "categories": [
              {"value": 1, "label": "Low education"},
              {"value": 2, "label": "Medium education"},
              {"value": 3, "label": "High education"}
            ]
          }
        ],
      }
    }
  ]
}

The advantage here of using a specific categorical type, is for its usability in a wide range of statistical software, e.g. R, SAS, SPSS, Stata, etc. (Whereas more complex & interrelated code lists are better modeled in the context of a database, with the primaryKey / foreignKey references)

Are there any situations I'm missing here that a combination of categorical field types and resource references would not solve?

@khusmann
Copy link
Contributor

khusmann commented Mar 5, 2024

(Tagging @pschumm and @peterdesmet for their thoughts as well)

@djvanderlaan
Copy link
Author

Thanks for the comments. We have also thought about using the foreignKeys property. The main issue I have with using the foreignKeys property, is that it is the property of the field and not the data set that a field should have codes from a code list/the field is a categorical variable. Therefore, I feel that this should be a property of the field descriptor.

Also in the current implementation/description it not really described what the foreign key should be used for. As far as I understand it just indicates another dataset that shares a key with the current data set. It could be a code list, but also something else completely. When working with code lists from R or python you will probably want to convert to factor, but then the code needs to now that it is a code list. This could be solved by adding something like a relation property indicating the relation between the data sets.

Using both foreignKey and categories for basically the same thing also makes thing more complicated for tools and users. The sized of code lists are on a continuum: we have lists with a few codes to lists with thousands of codes. And it is not really possible to determine when and if a user wants to use a variable as a categorical variable or as is. I have run plenty of regressions with e.g. municipality as categorical variable (approx 350 categories); but also with educational level as numeric variable (we use numeric codes; that are ordered; which is fine when using tree-methods). The choice of putting the codes in foreignKeys or in categories then seems a bit arbitrary.

@khusmann
Copy link
Contributor

khusmann commented Mar 6, 2024

Thanks for your clarifications. I think my biggest concern regarding the codeList approach is that it effectively duplicates a bunch of the functionality of primaryKey and foreignKeys, rather than extending them.

In the current implementation/description it not really described what the foreign key should be used for.
It could be a code list, but also something else completely.

Right – and I think that's a good thing in this case, because when you're referencing codes in this way, there's a lot of different ways an implementation might want to follow these relationships. For example, for a dropdown selection widget for codes, you might want to list code abbreviations, but when you're populating a larger table, you might want to grab their full descriptions, for example. Which field an implementation will want to use to represent the level depends on the application, the current usage context, and the available properties of the levels…

Using both foreignKey and categories for basically the same thing also makes thing more complicated for tools and users.

I disagree on this point – I think there's a strong precedent for having both flat categorical value types and richer categorical level entities via table relationships. For example, DuckDB has categorical types as well as the ability to define primaryKey / foreignKey relationships. Categorical types are used when your levels are simple values, and relationships are used when you're wanting to represent your levels as more complex & general entities (that is, objects with multiple properties).

Perhaps we could get the desired functionality here by extending the foreignKeys definitions so they could receive a default human-readable label when importing into a flat value type (e.g. an R factor)…? maybe we could add a defaultLabelField or something to that effect? Actually, what about this instead:

{
  "name": "highest_education",
  "resources": [
    {
      "name": "edulevel",
      "format": "csv",
      "mediatype": "text/csv",
      "path": "edulevel.csv",
      "encoding": "utf-8",
      "schema": {
        "fields": [
          {
            "name": "id",
            "type": "integer"
          },
          {
            "name": "place_of_residence",
            "type": "string"
          },
          {
            "name": "edu_level",
            "type": "categorical",
            "categories": [
              { "value": 1, "label": "LOW_EDU" },
              { "value": 2, "label": "MED_EDU" },
              { "value": 3, "label": "HIGH_EDU" }
            ]
          }
        ],
        "foreignKeys": {
          "fields": "edu_level",
          "reference": {
            "resource": "codelist-edu_level",
            "fields": "code"
          }
        }
      }
    },
    {
      "name": "codelist-edu_level",
      "schema": {
        "fields": [
          {
            "name": "code",
            "type": "integer"
          },
          {
            "name": "name",
            "type": "string"
          },
          {
            "name": "description",
            "type": "string"
          },
          {
            "name": "field_color",
            "type": "categorical",
            "categories": ["red", "green", "blue"]
          }
        ],
        "primaryKey": ["code"]
      },
      "data": [
        {
          "code": 1,
          "name": "Low education",
          "description": "Primary education",
          "field_color": "red"
        },
        {
          "code": 2,
          "name": "Medium education",
          "description": "Secondary education",
          "field_color": "green"
        },
        {
          "code": 3,
          "name": "High education",
          "description": "Tertiary education",
          "field_color": "blue"
        }
      ]
    }
  ]
}

By using the categorical field type AND foreignKeys together, the levels can act as entities with multiple properties for each level (and the ability to build hierarchies by following foreign key references across level entities), but when importing the field into statistical software like R as a flat value we have labels defined for the factor levels. That's actually pretty nice.

It's still rough for fields with 100s of levels, or when many variables that share the same categorical scales -- for those cases what if we just let the categories field reference an external json with the array of value-label combos, like @peterdesmet mentioned on the last community call? Would that cover all of the use cases discussed so far?

@pschumm
Copy link
Contributor

pschumm commented Mar 9, 2024

Having spent a fair amount of time myself on #875 (and its precursor pattern) and believing that the extended discussion that led to it improved it considerably, I was initially not very keen to see this proposed as an alternative. However, I read this proposal and thought about it carefully, and I must admit that it has grown on me. I too use large code lists, as well as those with hierarchical structure (e.g., ICD9/10 codes, Multum drug codes, etc.), so I can appreciate those use cases. And I agree with @djvanderlaan that these are conceptually field-level properties as opposed to properties of the data resource, so using the foreignKeys construct to represent them seems inappropriate.

That said, IIUC, this proposal addresses just two features that #875 does not provide:

  1. Handling very large code lists (e.g., hundreds of values or more) that would be extremely awkward to handle using the syntax proposed in #875; and
  2. Representing additional information beyond the set of codes itself, such as a hierarchical structure among them (or, for that matter, any arbitrary information about the codes).

Item (1) can be addressed using a JSON reference (as noted above by @fomcl and previously by @peterdesmet). Item (2) is a different matter, and I rather like the syntax proposed by @djvanderlaan above. I can see how that could be used to accomplish everything that #875 does.

But the rationale for #875 is a very specific one; namely, to facilitate (if not encourage) the use of categorical variables in the analytic software packages that support them. Item (2) strikes me as quite distinct from this. Thus, while I can appreciate the elegance in a more general solution that can accomplish everything that #875 does as well as Item (2) above, I do not believe that it is justified in this case. Specifically, IMO the syntax proposed in #875 (and especially the syntactic sugar described by @khusmann at the bottom of that discussion) would be considerably simpler and more intuitive for the most common use cases of categorical variables.

In sum, I agree with @khusmann above that addressing Item (2) with a separate strategy is warranted in this case, and I would support @djvanderlaan's proposal above for that purpose.

@djvanderlaan
Copy link
Author

djvanderlaan commented Mar 15, 2024

Thanks for the reply.

Being able to store code lists externally (in a separate file or using a url) is not just beneficial for large code lists. It also makes it easier for different datasets to use the same codelists. Many organisations have coordinated code lists for various subjects. Using a data resource allows one to link to such lists.

I can see the amount of effort that went into #875. It is perhaps also interesting to tell a bit about the history of the current proposal: We initially had an implementation that is similar to the one proposed in issue #875 (see #156 (comment)) . In order to allow for large code lists and reuse of code lists we wanted to allow that field to also refer to files. After discussion we concluded that it then made more sense to refer to a data resource instead of a file directly. And since a data resource allows for inline data anyway, we could then always refer to a data resource.

Edit: I removed a bit here as I didn't completely managed to formulate that the way I wanted. The gist is that I do not see a fundamental difference between a variable with a codelist and a categorical variable. From a user perspective the codelist proposal doesn't seem to be more complex (e.g. see demo: https://gist.github.com/djvanderlaan/f898bd8b4416dfe6157a7c45c616eecb )

I do like the 'syntactic sugar' proposed in the other proposal. This is difficult to do with the codelist proposal

{
  "name": "gender",
  "type": "string",
  "categories": ["Female", "Male", "Other"]
} 

Would it be possible to merge both proposals? The categories field can be

  1. An array of objects
  2. An array of strings (the syntactic sugar)
  3. The name of a dataresource
    (I am not completely sure if (1) is really needed when you can have inline data in (3)). Or is the categories field then a union of too many types?

@khusmann
Copy link
Contributor

But the rationale for #875 is a very specific one; namely, to facilitate (if not encourage) the use of categorical variables in the analytic software packages that support them. Item (2) strikes me as quite distinct from this.

Agreed. #875 allows frictionless to be a drop in replacement for the proprietary formats currently dominating a bunch of scientific fields. I think it should stand as it is, and we should try to figure out how to work these features in as a somewhat separate concern.

It also makes it easier for different datasets to use the same codelists.

And similarly allows the same dataset to use the same codelist many times -- that's something I could really make use of in my data, along with the ability to store more metadata about categorical levels.

Would it be possible to merge both proposals? The categories field can be

An array of objects
An array of strings (the syntactic sugar)
The name of a dataresource

I like this direction. I would prefer though if we use an object instead of just the data resource name, so we can explicitly assign which fields the labels and values should come from. Something like this:

{
  "name": "edu_level",
  "type": "categorical",
  "categories": {
    "resource": "codelist-edu_level",
    "valueField": "code",
    "labelField": "name"
  }
}

I still wish we could connect this to foreign keys somehow, because it's an existing practice in data warehousing for automatically recognizing and traversing properties of hierarchical categorical structures – see zillion for a good example. But I agree, it's nice to have it specified in the field itself rather than be a table-level prop. I suppose we can leave it up to implementations to recognize this as a foreign key situation.

@peterdesmet
Copy link
Member

Very late to this party. While I understand that it can be useful to represent a code list as a Data Resource, I consider code lists more similar to Table Schemas. A schema can:

  1. Be included verbosely in datapackage.json
  2. Be referenced with a path or URL

Similarly, categories (as proposed in #48) could be:

  1. Included verbosely as an array of objects
  2. Included verbosely as an array of strings
  3. Be referenced with a path or URL (I've just suggested this in datapackages - title element - SHOULD or MAY #48):
    • This file could be local or remote
    • This file would need to be JSON
    • This file would need to have the fields value and label
    • This file could have other properties, like hierarchy
    • This file could still be described as a Data Resource if you want to

Personally, I'm not in fan of "categories could be a Data Resource". It's quite bespoke and thus requires more software implementation work. It would also require an alternative approach (than #48) to handling missing values that are included in the codeList.

@djvanderlaan
Copy link
Author

An advantage of using a data resource instead of a separate json file, is that a data resource also has functionality for storing additional meta data for the code list. For example, in my file I could could have one variable that uses NACE to code companies. In the data resource for the categories of that field, I can then indicate using the title, description, author, license and source fields which specific version of NACE I am using, the license (which might be different from that of the data set itself) and also refer to the original author of that NACE classification. Without that, I would just have a list of codes and labels and would not even know I was looking at NACE codes (unless you put all that information in the description field of the field which then becomes quite overloaded).

Also, #48 (comment) mentions that it would be useful to be able to have one json file with a collection of definitions for categories. We already have an object to store a list of 'resources', namely a data package. If we can refer to a specific data resource in a datapackage, we can could store all categories definitions (inline) in one data package.

@frictionlessdata frictionlessdata locked and limited conversation to collaborators Oct 21, 2024
@roll roll converted this issue into discussion #1027 Oct 21, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Projects
None yet
Development

No branches or pull requests

6 participants