Updated docs with Croissant information. #330

Reikyo · 2025-01-13T19:12:34Z

No description provided.

Reikyo · 2025-01-13T19:34:08Z

Hi @amercader, just moving the conversation from #328 to here.

Many thanks for the changes to enable Croissant features as a plugin, and to provide a dedicated endpoint. I synced with these latest changes and tried them out, and all works as expected. I wouldn't have so easily figured out what's needed to enable these new features, so your work is a big help, and of course it then naturally fits in with your desired structure too.

I looked at the existing docs and thought the Croissant info might sit better in the existing files than in a new one. First, the schema info naturally sits alongside the other schema info in getting-started.md. (FYI, references to dcat_us_recommended.yaml and dcat_us_full.yaml are currently missing there, in case that should be updated.) Second, the structured data info naturally sits alongside the other structured data info in google-dataset-search.md, so I wrote a modified version of that file which discusses the structured_data plugin and the croissant plugin side by side, since they are so similar in use and effect. I added a couple of complete new examples there using the default CKAN data entry forms, and manually adjusted the example output so that the data appears to sit at http://demo.ckan.org, as in the previous example.

Finally, when producing the new examples just mentioned, I noticed that the schemaorg profile currently outputs the same schema:contactPoint info twice. Looking at schemaorg.py, this is because of two calls to self._agent_graph, once for publisher and once for creator, even though it picks up the same contact info each time. I'm not sure whether this is the desired behaviour, so I thought I'd flag it here.
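
One way to confirm the duplication in serialized output is sketched below. It is a minimal check under stated assumptions, not part of the extension: it assumes the output is JSON-LD in which contact points appear as objects whose `@type` ends in `ContactPoint`, and it treats two such objects as duplicates when everything except `@id` matches. The function names are hypothetical.

```python
import json
from collections import Counter


def iter_nodes(value):
    """Yield every JSON object (dict) nested anywhere inside value."""
    if isinstance(value, dict):
        yield value
        for child in value.values():
            yield from iter_nodes(child)
    elif isinstance(value, list):
        for child in value:
            yield from iter_nodes(child)


def is_contact_point(node):
    """Assumption: contact points carry an @type ending in 'ContactPoint'."""
    types = node.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return any(isinstance(t, str) and t.endswith("ContactPoint") for t in types)


def duplicated_contact_points(jsonld_text):
    """Return contact point bodies (ignoring @id) that occur more than once."""
    document = json.loads(jsonld_text)
    counts = Counter()
    for node in iter_nodes(document):
        if is_contact_point(node):
            # Blank node ids differ between otherwise identical nodes, so drop them.
            body = {key: val for key, val in node.items() if key != "@id"}
            counts[json.dumps(body, sort_keys=True)] += 1
    return {body: count for body, count in counts.items() if count > 1}
```

An empty result means no contact point body was repeated; a non-empty one maps each repeated body to the number of times it appeared.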

Please let me know what you think of the above when you have time.

amercader · 2025-01-17T15:08:36Z

Hi @Reikyo, thanks for this. See below:

I totally get that croissant and structured_data are really similar implementation-wise, but I still think it's valuable to separate them in the docs, as they target two different kinds of users and we want a single place to point people interested in the CKAN-Croissant integration. You added all the parts that should be in the docs, so don't worry, I'll rearrange them myself. One thing I want to add is a small example of actual usage of a Croissant-enabled CKAN dataset, perhaps adapting the tensorflow_datasets example in the croissant package README. But alas, when trying it out I got validation errors that we should address first.

Valid output: Obviously we want the generated Croissant output to be valid. Running the croissant validator on the output of one of the CKAN dataset croissant endpoints gives the following:

mlcroissant validate --jsonld ~/Downloads/croissant.jsonld
W0117 16:00:54.340307 139754582292288 rdf.py:78] WARNING: The JSON-LD `@context` is not standard. Refer to the official @context (e.g., from the example datasets in https://github.com/mlcommons/croissant/tree/main/datasets/1.0). The different keys are: {'conformsTo', 'column', 'fileObject', 'citeAs', '@language', 'dataType', 'source', 'fileProperty', 'parentField', '@vocab', 'extract', 'includes', 'isLiveDataset', 'dct', 'jsonPath', 'md5', 'transform', 'rai', 'sc', 'regex', 'path', 'examples', 'key', 'separator', 'recordSet', 'replace', 'subField', 'data', 'references', 'repeated', 'format', 'fileSet', 'field'}

E0117 16:00:54.345736 139754582292288 validate.py:55] Found the following 1 error(s) during the validation:
  -  The current JSON-LD doesn't extend https://schema.org/Dataset

This looks pretty high level so maybe there's something obvious that needs to be tweaked in the profile?
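
Before digging into the profile, it may help to look at what the serialized file declares at its top level. The sketch below is a quick, validator-independent inspection using only the standard library. It does not reproduce mlcroissant's validation logic, and the function name `summarize_top_level` is hypothetical. It assumes the file may be either a single JSON object or an array of nodes.

```python
import json


def summarize_top_level(jsonld_text):
    """Report the top-level shape of a JSON-LD document.

    This is NOT the mlcroissant check, just a quick look at the file.
    """
    document = json.loads(jsonld_text)
    if not isinstance(document, dict):
        # For example an array of nodes: no shared @context, no single @type.
        return {
            "top_level": type(document).__name__,
            "has_context": False,
            "types": [],
        }
    types = document.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return {
        "top_level": "dict",
        "has_context": "@context" in document,
        "types": list(types),
    }
```

If the summary reports a `list` at the top level, or no Dataset-like entry in `types`, that would be a plausible place to start looking, though the validator may apply further rules not covered here.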

This eventually needs to run in an automated test so we make sure the output remains valid in the future. I added the test in 7fd67fe so you can run it locally if you want but it's essentially the same as running the validator, only the test uses a dataset with all properties present.
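
For reference, a check like this can also be wrapped around the same command line shown above. The following is a minimal sketch, not the test added in 7fd67fe: the helper name `validate_croissant` is hypothetical, and it assumes the `mlcroissant validate` command exits with a non-zero status when validation fails.

```python
import subprocess


def validate_croissant(jsonld_path, command=("mlcroissant", "validate")):
    """Run the validator CLI on a file and return (passed, combined output).

    Assumption: the CLI signals validation failure via a non-zero exit status.
    """
    result = subprocess.run(
        [*command, "--jsonld", str(jsonld_path)],
        capture_output=True,
        text=True,
        check=False,  # Inspect the exit status ourselves instead of raising.
    )
    return result.returncode == 0, result.stdout + result.stderr
```

In a test, this could be used as `passed, output = validate_croissant(path)` followed by `assert passed, output`, so that the validator's messages show up in the failure report. The `command` parameter exists only so that the wrapper can be exercised without the real CLI installed.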

I'm getting errors when providing the id_given fields. I need to look into that as well, but if you have any ideas that would be great:

    def make_path(path: PathLike) -> abstract_path.Path:
      """Create a generic `pathlib.Path`-like abstraction.
    
      Depending on the input (e.g. `gs://`, `github://`, `ResourcePath`,...), the
      system (Windows, Linux,...), the function will create the right pathlib-like
      abstraction.
    
      Args:
        path: Pathlike object.
    
      Returns:
        path: The `pathlib.Path`-like abstraction.
      """
      is_windows = os.name == 'nt'
      if isinstance(path, str):
        uri_splits = path.split('://', maxsplit=1)
        if len(uri_splits) > 1:  # str is URI (e.g. `gs://`, `github://`,...)
          # On windows, `PosixGPath` is created for `gs://` paths
>         return _URI_PREFIXES_TO_CLS[uri_splits[0] + '://'](path)  # pytype: disable=bad-return-type
E         KeyError: '[\n  {\n    "@id": "my-custom-resource-id",\n    "@type": [\n      "http://'

/home/adria/.pyenv/versions/3.11.6/envs/ckan-ml/lib/python3.11/site-packages/etils/epath/register.py:100: KeyError
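
The key in the `KeyError` above looks like the start of a JSON document rather than a URI scheme, which suggests that JSON text reached `make_path` as if it were a path. The sketch below reproduces only that mechanism with a simplified stand-in: the `URI_PREFIXES` mapping and the `classify` function are hypothetical, and the `@type` IRI in the sample is a placeholder, since the real one is truncated in the traceback.

```python
import json

# Hypothetical stand-in for the prefix registry seen in the traceback.
URI_PREFIXES = {"gs://": "gs path", "github://": "github path"}


def classify(path):
    """Mimic the lookup in the traceback: anything containing '://' is a URI."""
    uri_splits = path.split("://", maxsplit=1)
    if len(uri_splits) > 1:
        # Raises KeyError when the text before '://' is not a known prefix.
        return URI_PREFIXES[uri_splits[0] + "://"]
    return "local path"


# JSON text, not a file path. The @type value is a placeholder IRI.
jsonld_text = json.dumps(
    [{"@id": "my-custom-resource-id", "@type": ["http://example.org/Type"]}],
    indent=2,
)

try:
    classify(jsonld_text)
except KeyError as error:
    # The missing key is everything up to and including the first '://'.
    print(repr(error.args[0]))
```

If this reading is right, the fix may lie in how the serialized output is handed to the library (as content versus as a path) rather than in the id values themselves, but that would need checking against the actual test.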

Updated docs with Croissant information.

7f27a88

Reikyo mentioned this pull request Jan 13, 2025

Add support for the Croissant metadata specification #328

Merged
