Thoughts on Solr Indexing

I have the JSON-LD documents in MongoDB, and the next step is to push them to Solr. Solr requires a schema that defines the fields the incoming data will be stored in. I will discuss my approach to indexing the data in Solr here -

Before we proceed, let me give a sample of what our data looks like -

{
   "@context":"http://schema.org",
   "@type":"DataRecord",
   "identifier":"biosamples:SAMEA103996091",
   "dateModified":"2018-01-20T13:42:43.039Z",
   "dateCreated":"2016-08-30T23:00:00Z",
   "isPartOf":{
      "@type":"Dataset",
      "@id":"https://www.ebi.ac.uk/biosamples/samples"
   },
   "datasetPartOf":{
      "@type":"Dataset",
      "@id":"https://www.ebi.ac.uk/biosamples/samples"
   },
   "mainEntity":{
      "dataset":[
         "http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-4567"
      ],
      "@context":"http://schema.org",
      "@type":[
         "BioChemEntity",
         "Sample"
      ],
      "name":"source hESC-H9-primed_3",
      "url":"https://www.ebi.ac.uk/biosamples/sample/SAMEA103996091",
      "identifiers":[
         "biosamples:SAMEA103996091"
      ],
      "additionalProperty":[
         {
            "name":"Organism",
            "value":"Homo sapiens",
            "valueReference":[
               {
                  "url":"http://purl.obolibrary.org/obo/NCBITaxon_9606",
                  "@type":"CategoryCode"
               }
            ],
            "@type":"PropertyValue"
         },
         {
            "name":"cell line",
            "value":"H9",
            "valueReference":[
               {
                  "url":"http://www.ebi.ac.uk/efo/EFO_0003045",
                  "@type":"CategoryCode"
               }
            ],
            "@type":"PropertyValue"
         },
         {
            "name":"cell type",
            "value":"embryonic stem cell",
            "valueReference":[
               {
                  "url":"http://purl.obolibrary.org/obo/CL_0002322",
                  "@type":"CategoryCode"
               }
            ],
            "@type":"PropertyValue"
         },
         {
            "name":"genetic modification",
            "value":"transfected with doxycycline inducible MCRS1, THAP11, TET1 construct",
            "@type":"PropertyValue"
         },
         {
            "name":"growth condition",
            "value":"W8 media",
            "@type":"PropertyValue"
         },
         {
            "name":"phenotype",
            "value":"primed pluripotent state",
            "@type":"PropertyValue"
         }
      ]
   }
}

Let us break this nested JSON-LD into parts. The first level of the tree has the following keys -

"@context"
"@type"
"identifier"
"dateModified"
"dateCreated"
"isPartOf"
"datasetPartOf"
"mainEntity"

The "@type" value says that this level is of type DataRecord, so it follows the schema defined here - http://bioschemas.org/types/DataRecord/specification/ . Everything matches the schema except -

The schema does not define the key "datasetPartOf". What is the reason? For now, let's ignore this key.

Yeah, this is because the spec and what people are actually marking up are not in perfect alignment. In this case, I think we are justified in ignoring the extra field for now. In other cases, we may want to process markup that is not strictly to spec, as part of dealing with the messiness of the web (justincc).
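A small check like the following can flag such extra keys automatically. It is only a sketch: the expected set below is simply the list of top-level keys that this page treats as matching the DataRecord schema, not a copy of the Bioschemas specification, and the function name is made up for illustration.

```python
# Top-level keys of the sample record shown above (values omitted).
record_keys = [
    "@context", "@type", "identifier", "dateModified",
    "dateCreated", "isPartOf", "datasetPartOf", "mainEntity",
]

# Assumption: the keys this page treats as matching the DataRecord schema.
# This set is NOT taken from the Bioschemas specification itself.
EXPECTED_KEYS = {
    "@context", "@type", "identifier", "dateModified",
    "dateCreated", "isPartOf", "mainEntity",
}


def unexpected_keys(keys, expected=EXPECTED_KEYS):
    """Return the keys that are not in the expected set, in original order."""
    return [key for key in keys if key not in expected]


print(unexpected_keys(record_keys))  # ['datasetPartOf']
```

Reporting the extra keys instead of rejecting the record keeps the option open to process markup that is not strictly to spec.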

Next, the second level in this nested tree is created by "isPartOf" and "mainEntity".

"isPartOf" has the "@type" specified as "Dataset", which takes us to this schema - http://bioschemas.org/specifications/Dataset/specification/ . In our example, "isPartOf" has only an identifier ("@id") and no other properties from "Dataset".
"mainEntity" has the "@type" specified as "BioChemEntity" and "Sample". These two types point us to the schemas at http://bioschemas.org/types/BioChemEntity/specification/ and http://bioschemas.org/specifications/Sample/specification/ respectively. The remaining keys in "mainEntity" ("dataset", "name", "url", "identifiers", "additionalProperty") come from these two schemas. More precisely, "name", "url" and "additionalProperty" come from the "BioChemEntity" schema, and "dataset", "name", "url" and "additionalProperty" come from the "Sample" schema. Here we notice that the "BioChemEntity" and "Sample" schemas are similar (except for "dataset", which is in "Sample" but not in "BioChemEntity"). The next question is -

"identifiers" is not present in either of the schemas ("BioChemEntity" or "Sample"). Both of them specify "identifier". Should this be considered a spelling mistake, or is there a particular reason for it?

I believe this is a mistake. I will contact biosamples about this and cc you (ankit) in (justincc)

The next level is "mainEntity" > "additionalProperty", which has the "@type" "PropertyValue". Schema.org defines the schema for "PropertyValue" here - https://schema.org/PropertyValue.
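The level-by-level walk above can also be done mechanically. The sketch below recursively visits a record and reports the "@type" found at each nested position. The record is a trimmed copy of the sample (one property only), and the function name and the path notation are my own choices for illustration.

```python
# Trimmed copy of the sample record: one additionalProperty entry only.
sample = {
    "@context": "http://schema.org",
    "@type": "DataRecord",
    "identifier": "biosamples:SAMEA103996091",
    "isPartOf": {
        "@type": "Dataset",
        "@id": "https://www.ebi.ac.uk/biosamples/samples",
    },
    "mainEntity": {
        "@type": ["BioChemEntity", "Sample"],
        "name": "source hESC-H9-primed_3",
        "additionalProperty": [
            {
                "name": "Organism",
                "value": "Homo sapiens",
                "valueReference": [
                    {
                        "url": "http://purl.obolibrary.org/obo/NCBITaxon_9606",
                        "@type": "CategoryCode",
                    }
                ],
                "@type": "PropertyValue",
            }
        ],
    },
}


def collect_types(node, path="$"):
    """Return a list of (path, [types]) for every typed node in the tree."""
    found = []
    if isinstance(node, dict):
        declared = node.get("@type")
        if declared is not None:
            # "@type" may be a single string or a list of strings.
            types = declared if isinstance(declared, list) else [declared]
            found.append((path, types))
        for key, value in node.items():
            if key.startswith("@"):
                continue  # JSON-LD keywords carry no nested nodes here
            found.extend(collect_types(value, f"{path}.{key}"))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_types(item, f"{path}[]"))
    return found


for node_path, node_types in collect_types(sample):
    print(node_path, node_types)
```

Running this on the full record gives the same set of types discussed on this page, together with where each one occurs.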

Coming back to the indexing approach

We have seen that the JSON-LD we downloaded encompasses schemas from "DataRecord", "BioChemEntity", "Sample", "PropertyValue", "Dataset" and "CategoryCode". Every type has a different schema defined, and some keys hold multiple values (e.g. "additionalProperty").

Approach 1: We can make a separate Solr core for each "@type", segregate the JSON-LD by type, and put the data into the corresponding cores. The cores can be linked to each other through a common identifier. This would, however, mean that a single query has to be run against multiple cores.

Approach 2: We can make separate cores only for those "@type"s that hold multiple values (e.g. "additionalProperty").
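As a rough illustration of Approach 2, the sketch below splits one record into a parent document and one child document per "additionalProperty" entry, linked by the record identifier. It only builds plain dictionaries and does not talk to Solr. All field names are hypothetical; the `_s` / `_ss` / `_dt` suffixes assume the dynamic-field naming found in Solr's default configset, which may not match the schema we end up with.

```python
# Trimmed copy of the sample record: two additionalProperty entries only.
sample = {
    "@type": "DataRecord",
    "identifier": "biosamples:SAMEA103996091",
    "dateModified": "2018-01-20T13:42:43.039Z",
    "mainEntity": {
        "@type": ["BioChemEntity", "Sample"],
        "name": "source hESC-H9-primed_3",
        "url": "https://www.ebi.ac.uk/biosamples/sample/SAMEA103996091",
        "additionalProperty": [
            {
                "name": "Organism",
                "value": "Homo sapiens",
                "valueReference": [
                    {
                        "url": "http://purl.obolibrary.org/obo/NCBITaxon_9606",
                        "@type": "CategoryCode",
                    }
                ],
                "@type": "PropertyValue",
            },
            {
                "name": "growth condition",
                "value": "W8 media",
                "@type": "PropertyValue",
            },
        ],
    },
}


def split_record(record):
    """Split a DataRecord into (parent_doc, [property_docs])."""
    record_id = record["identifier"]
    entity = record.get("mainEntity", {})

    # Parent document: single-valued fields of the record and its entity.
    parent = {
        "id": record_id,
        "dateModified_dt": record.get("dateModified"),
        "entityType_ss": entity.get("@type", []),
        "name_s": entity.get("name"),
        "url_s": entity.get("url"),
    }

    # One child document per property, pointing back at the parent id.
    children = []
    for index, prop in enumerate(entity.get("additionalProperty", [])):
        children.append({
            "id": f"{record_id}#property-{index}",
            "record_id_s": record_id,
            "name_s": prop.get("name"),
            "value_s": prop.get("value"),
            "valueReference_ss": [
                ref["url"]
                for ref in prop.get("valueReference", [])
                if "url" in ref
            ],
        })
    return parent, children


parent_doc, property_docs = split_record(sample)
print(parent_doc["id"], len(property_docs))
```

The parent documents would go to one core and the property documents to another; a query on a property (e.g. Organism) would then need a second lookup by `record_id_s` to fetch the sample itself, which is the multi-core search cost mentioned above.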

PS: I guess we can modify this approach to make a general schema for our Solr core.
