Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generic LDNodes and RO-Crate parsing rework #496

Open
wants to merge 56 commits into
base: main
Choose a base branch
from
Open

Conversation

HLWeil
Copy link
Member

@HLWeil HLWeil commented Feb 27, 2025

Reworked Generic JSON-LD Objects

This is a generic JSON-LD implementation that represents those aspects of the JSON-LD specification which are relevant for handling ARC-RO-Crate:

  • Datamodel
    • LDContext (Contains mappings from compacted terms to extended IRIs)
    • LDNode (Our basic objects with @id, types and fields)
    • LDGraph (Basically a collection of nodes; Top-level object in the RO-Crate)
    • LDRef (Object containing only @id, meant for linking)
    • LDValue Expanded, self-contextualizing values
  • Algorithms
    • Flatten
    • Compact
    • WIP Unflatten
  • Generic JSON encoders and decoders

Added ARC-RO-Crate profile layer

This layer acts like a ARC-RO-Crate view on the generic JSON-LD model. All actual handled objects are either FSharp primitive types or generic JSON-LD objects. Through making use of static methods which in turn use static IRIs, a loose typing is emulated. These static IRI mostly consist of types and properties from schema.org and bioschemas.org.

  • Added a set of static classes, containing functions for

    • getting properties
    • setting properties
    • generating default @ids
    • validating given LDNode against requirements for this type
    • create functions for LDNodes matching the requiremenets for this type
  • For the following types/profiles:

    • Comment
    • DefinedTerm
    • PropertyValue
    • PostalAddress
    • Organization
    • Person
    • ScholarlyArticle
    • Sample
    • File
    • LabProtocol
    • LabProcess
    • Dataset
  • Added contexts of RO-Crate specification 1.1 and 1.2DRAFT

  • Added context for all other fields exclusive to the ARC-RO-Crate profile

ARC Scaffold <-> ARC-RO-Crate conversion layer

Added functions for Conversion between Scaffold and RO-Crate representation types.
E.g.

  • Ontology Annotation <-> DefinedTerm
  • ArcTable <-> LabProcess Sequence
  • ArcInvestigation <-> Dataset

Added top-level members to ARC Scaffold object to parse to and from an RO-Crate JSON-LD metadata string.


To-do:

  • Top level parsers
  • DOI and PubmedID as identifier
  • Bioschemas context
  • Some @id are empty (LabProtocol)
  • Top-level roundabout tests
  • Link datafiles instead of fragment selectors from datasets

To-do after this PR is merged:

  • Datamap support
  • Make use of @list for ordered process inputs and outputs
  • CI job, running the RO-Crate output for ARC(Prototype?) against RO-Crate validator

closes #326
closes #400
closes #429
closes #465
closes #494
closes #495

kMutagene and others added 30 commits February 10, 2025 15:30
- Json: parse `@type`/`additionalType` from either string (-> singleton) or array field, always write array
- add tests
@HLWeil
Copy link
Member Author

HLWeil commented Feb 27, 2025

@floWetzels could you also take a look onto whether #326 is actually fixed or not (e.g. by inspecting a written RO-Crate metadata json)

@floWetzels
Copy link
Collaborator

I tested the metadata json file generated for the ArcPrototype with the rocrate-validator. I found the following errors that need to be fixed:

  • Studies und assays have wrong ids - the path to the isa-xlsx file. It should be the path to the directory ending with a /.
  • Studies and assays are missing the hasPart property
  • The root data entity has no license
  • The root data entity has no datePublished

Furthermore, I noticed the following when parsing the file manually:

  • Paths must be relative to the root of the ro-crate
  • Ad-hoc ids (e.g. for processes, paramters, parameter values) should begin with a # symbol
  • The id of PropertyValue objects looks weird to me. It contains the propertyID, which could be a potential source for errors. Previously, it was generated only from name and value, if I am not mistaken.

@floWetzels
Copy link
Collaborator

Looks a lot better now! The validator accepts the RO-Crate. When looking through the file, I noticed a few more things:

  • Some @id values contain spaces. This appears almost throughout all types, e.g. Organization, PropertyValue, Process, Protocol, Sample etc. This is not necessarily a problem, the json-ld playground is able to handle it for example. It is also not completely avoidable, since file paths can contain spaces. I would prefer to replace them in ad-hoc ids though. The old export did so with a simple replace statement.
  • For some types, ad-hoc ids are generated although a URI or similar identifier exists. I found two examples in the ArcPrototype, DefinedTerm and Person. A DefinedTerm should use its termCode as id, if it exists. For Person, I found an instance where an email address exists, but an ad-hoc id was assigned.
  • Some termCode entries of DefinedTerm contain multiple URIs separated by an underscore. This should not be the case, right? Here's an example: "termCode": "http://purl.obolibrary.org/obo/scoro_http://purl.org/spar/scoro/principal-investigator".

@HLWeil
Copy link
Member Author

HLWeil commented Mar 4, 2025

I incorporated your suggested fixes and included the test file in d3d786e

@HLWeil HLWeil marked this pull request as ready for review March 4, 2025 12:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants