Updated docs with Croissant information. #330

Open: Reikyo wants to merge 1 commit into base: croissant

@Reikyo Reikyo commented Jan 13, 2025

No description provided.

@Reikyo
Author

Reikyo commented Jan 13, 2025

Hi @amercader, just moving the conversation from #328 to here.

Many thanks for the changes to enable Croissant features as a plugin, and to provide a dedicated endpoint. I synced with these latest changes and tried them out, and all works as expected. I wouldn't have so easily figured out what's needed to enable these new features, so your work is a big help, and of course it then naturally fits in with your desired structure too.

I looked at the existing docs and thought that the Croissant info might sit better in the existing files than in a new file. Firstly, the schema info naturally sits alongside the other schema info in getting-started.md. (FYI, I notice that a reference to dcat_us_recommended.yaml and dcat_us_full.yaml is currently missing there, in case it should be updated.) Secondly, the structured data info naturally sits alongside the other structured data info in google-dataset-search.md, so I wrote a modified version of that file which discusses the structured_data plugin and the croissant plugin side by side, seeing as they are so similar in use and effect. I provided a couple of complete new examples there using the default CKAN data entry forms, and manually adjusted the output for the examples so that the data appears to sit at http://demo.ckan.org, as in the previous example.

Finally, when producing the new examples just mentioned, I noticed that the schemaorg profile currently outputs the same schema:contactPoint info twice. Looking at schemaorg.py, this is because of two calls to self._agent_graph, once for publisher and once for creator, even though it picks up the same contact info each time. Not sure if this is desired behaviour, so thought I'd flag here.

Please let me know what you think of the above when you have time.

@amercader
Member

Hi @Reikyo thanks for this, see below:

  • I totally get that croissant and structured_data are really similar implementation-wise, but I still think it's valuable to separate them in the docs, as they target two different users and we want to have a single place to point people interested in the CKAN - Croissant integration. You added all the parts that should be in the docs, so don't worry, I'll rearrange them myself. One thing I want to add is a small example of actual usage of a Croissant-enabled CKAN dataset, perhaps adapting the tensorflow_datasets example in the croissant package README. But alas, when trying it out I got validation errors that we should address first.
  • Valid output: Obviously we want the generated croissant output to be valid. Running the croissant validator on one of the CKAN dataset croissant endpoints gives the following:

    mlcroissant validate --jsonld ~/Downloads/croissant.jsonld
    W0117 16:00:54.340307 139754582292288 rdf.py:78] WARNING: The JSON-LD `@context` is not standard. Refer to the official @context (e.g., from the example datasets in https://github.com/mlcommons/croissant/tree/main/datasets/1.0). The different keys are: {'conformsTo', 'column', 'fileObject', 'citeAs', '@language', 'dataType', 'source', 'fileProperty', 'parentField', '@vocab', 'extract', 'includes', 'isLiveDataset', 'dct', 'jsonPath', 'md5', 'transform', 'rai', 'sc', 'regex', 'path', 'examples', 'key', 'separator', 'recordSet', 'replace', 'subField', 'data', 'references', 'repeated', 'format', 'fileSet', 'field'}

    E0117 16:00:54.345736 139754582292288 validate.py:55] Found the following 1 error(s) during the validation:
      -  The current JSON-LD doesn't extend https://schema.org/Dataset

This looks pretty high level so maybe there's something obvious that needs to be tweaked in the profile?
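
One quick way to narrow this down before touching the profile might be to list the `@type` values that the endpoint actually emits and compare them with the IRI named in the error. The sketch below is only a diagnostic aid built on the standard library; `top_level_types` and the sample document are hypothetical, it does not expand prefixes such as `sc:` against the `@context`, and it does not reproduce how mlcroissant itself performs the check.

```python
import json

EXPECTED = "https://schema.org/Dataset"  # IRI quoted in the validator error


def top_level_types(jsonld_text):
    """Return the sorted @type values found on the top-level node(s)."""
    doc = json.loads(jsonld_text)
    # JSON-LD may be a single object, a list of nodes, or wrap nodes in "@graph".
    if isinstance(doc, dict) and "@graph" in doc:
        nodes = doc["@graph"]
    elif isinstance(doc, list):
        nodes = doc
    else:
        nodes = [doc]
    found = set()
    for node in nodes:
        if not isinstance(node, dict):
            continue
        types = node.get("@type", [])
        # "@type" can be a single string or a list of strings.
        if isinstance(types, str):
            types = [types]
        found.update(types)
    return sorted(found)


# Hypothetical sample, not real endpoint output.
sample = '{"@type": "sc:Dataset", "name": "example"}'
types = top_level_types(sample)
print(types, EXPECTED in types)
```

Running this against the downloaded croissant.jsonld would show whether the dataset node carries the expected type in full or compact form, or something else entirely.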

This eventually needs to run in an automated test so that we can make sure the output remains valid in the future. I added the test in 7fd67fe so you can run it locally if you want, but it's essentially the same as running the validator, except that the test uses a dataset with all properties present.

I'm getting errors when providing the id_given fields. I need to look into that as well, but if you have any ideas, that would be great:

    def make_path(path: PathLike) -> abstract_path.Path:
      """Create a generic `pathlib.Path`-like abstraction.
    
      Depending on the input (e.g. `gs://`, `github://`, `ResourcePath`,...), the
      system (Windows, Linux,...), the function will create the right pathlib-like
      abstraction.
    
      Args:
        path: Pathlike object.
    
      Returns:
        path: The `pathlib.Path`-like abstraction.
      """
      is_windows = os.name == 'nt'
      if isinstance(path, str):
        uri_splits = path.split('://', maxsplit=1)
        if len(uri_splits) > 1:  # str is URI (e.g. `gs://`, `github://`,...)
          # On windows, `PosixGPath` is created for `gs://` paths
>         return _URI_PREFIXES_TO_CLS[uri_splits[0] + '://'](path)  # pytype: disable=bad-return-type
E         KeyError: '[\n  {\n    "@id": "my-custom-resource-id",\n    "@type": [\n      "http://'

/home/adria/.pyenv/versions/3.11.6/envs/ckan-ml/lib/python3.11/site-packages/etils/epath/register.py:100: KeyError
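
The key reported in the KeyError suggests that the string reaching `make_path` was JSON-LD text rather than a file path or URL: the text is split at its first `://`, and everything before that point is treated as a URI scheme. The sketch below reproduces only that split-and-lookup step with the standard library; `KNOWN_PREFIXES`, `lookup_prefix` and the sample string are hypothetical stand-ins, not the etils code.

```python
# Hypothetical stand-in for the prefix registry seen in the traceback.
KNOWN_PREFIXES = {"gs://": "cloud path", "github://": "GitHub path"}


def lookup_prefix(path):
    """Mimic the split-and-lookup step shown in the traceback."""
    uri_splits = path.split("://", maxsplit=1)
    if len(uri_splits) > 1:
        # Everything before the first "://" is assumed to be a URI scheme.
        return KNOWN_PREFIXES[uri_splits[0] + "://"]
    return "local path"


# Illustrative JSON-LD text passed where a path was expected.
jsonld_text = '[{"@id": "my-custom-resource-id", "@type": ["http://example.org/T"]}]'

try:
    lookup_prefix(jsonld_text)
except KeyError as err:
    # The failed key is the JSON text up to and including its first "://".
    failed_key = err.args[0]
    print("KeyError:", repr(failed_key))
```

If that reading is right, it may be worth checking which property in the generated output carries a serialized node where a URL or path is expected.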
