Backend ‐ Categories

Jeremy Asuncion edited this page Sep 12, 2023 · 1 revision

Categories are an information architecture that lets us associate unique tags with plugins on the hub. This makes it easier for hub users to find and navigate to plugins in a particular category.

Hub to EDAM Mappings

Category data is sourced from portions of the alpha06 version of the EDAM Bioimaging ontology, with some terms remapped based on the hub mappings defined in hub-mapping-alpha06.json.

Data storage

Generating category data

The category data is generated using the backend/category/edam.py script. To run it, create a new virtual environment, install the dependencies from requirements.txt, and run the script:

cd backend
python3 -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt
cd category
python edam.py

This will generate a new file backend/category/data/EDAM-BIOIMAGING/alpha06.json that contains every category mapping on the hub.

S3

Category mappings are uploaded manually to S3 at the path category/<version>.json. There is currently no automation for generating and uploading this file, so you need to generate the .json file yourself and upload it by hand.

DynamoDB

Category data is loaded into DynamoDB from the category mapping file uploaded to S3 in the previous step. Eventually, as we migrate away from S3, we'll move this process to a data workflow that populates DynamoDB directly. The schema for each row in the table is defined as:

interface CategoryRow {
  name: string
  version_hash: string
  version: string
  formatted_name: string
  dimension: string
  hierarchy: string[]
  label: string
  last_updated_timestamp: number
}

The name column is the hash key and the version_hash column is the range key. The name column is a slugified version of the category key, while formatted_name is the unmodified key with its original spacing and punctuation. For example, Scanning electron cryomicroscopy becomes scanning-electron-cryomicroscopy.
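
The slug transformation described above can be sketched as follows. This is a minimal illustration, not the hub's actual implementation: the function name slugify and the exact rules (lowercasing, then collapsing each run of non-alphanumeric characters into a single hyphen) are assumptions chosen to reproduce the example in the text.

```python
import re


def slugify(category_key: str) -> str:
    """Hypothetical helper: turn a category key into a URL-friendly slug."""
    # Lowercase, replace each run of non-alphanumeric characters with one
    # hyphen, then trim any hyphen left at either end.
    return re.sub(r"[^a-z0-9]+", "-", category_key.lower()).strip("-")


formatted_name = "Scanning electron cryomicroscopy"
name = slugify(formatted_name)
print(name)  # scanning-electron-cryomicroscopy
```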

The version_hash column is a combination of the category version and an MD5 hash of all the contents of a category entry. This is required to store categories that have multiple entries, such as Scanning electron cryomicroscopy, because the name alone would not uniquely identify a row. Looking at the response for https://api.napari-hub.org/categories/Scanning%20electron%20cryomicroscopy, we see that this category has two entries that differ by only one item in the hierarchy list:

[
  {
    "dimension": "Image modality",
    "hierarchy": [
      "Electron microscopy",
      "Cryo electron microscopy",
      "Scanning electron cryomicroscopy"
    ],
    "label": "Electron microscopy"
  },
  {
    "dimension": "Image modality",
    "hierarchy": [
      "Electron microscopy",
      "Scanning electron microscopy",
      "Scanning electron cryomicroscopy"
    ],
    "label": "Electron microscopy"
  }
]
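
One way to build such a range key is sketched below. The source only states that version_hash combines the category version with an MD5 hash of the entry's contents, so the helper name make_version_hash, the JSON serialization, the ":" separator, and the "alpha06" version string are all assumptions for illustration.

```python
import hashlib
import json


def make_version_hash(version: str, entry: dict) -> str:
    """Hypothetical helper: combine a version with an MD5 of the entry."""
    # Serialize deterministically so that equal entries always hash equally.
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    digest = hashlib.md5(payload.encode("utf-8")).hexdigest()
    # The "<version>:<md5>" layout is an assumption, not the hub's format.
    return f"{version}:{digest}"


# The two entries from the API response above.
entry_a = {
    "dimension": "Image modality",
    "hierarchy": [
        "Electron microscopy",
        "Cryo electron microscopy",
        "Scanning electron cryomicroscopy",
    ],
    "label": "Electron microscopy",
}
entry_b = {
    "dimension": "Image modality",
    "hierarchy": [
        "Electron microscopy",
        "Scanning electron microscopy",
        "Scanning electron cryomicroscopy",
    ],
    "label": "Electron microscopy",
}

hash_a = make_version_hash("alpha06", entry_a)
hash_b = make_version_hash("alpha06", entry_b)
print(hash_a != hash_b)  # True
```

Because the two entries differ in one hierarchy item, they produce different range keys and can be stored as separate rows under the same name.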

API

Categories can be accessed using the following APIs:

The APIs are relatively simple: they mostly work by fetching the mappings from DynamoDB and returning the list of category entries for the given <name>.
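
As a rough sketch of that lookup, the following uses an in-memory list in place of the DynamoDB table. The function name get_category_entries, the sample rows with their placeholder version_hash values, and the choice of returned fields (taken from the response shown earlier) are assumptions. A real handler would query DynamoDB on the name hash key instead of scanning a list.

```python
import re

# In-memory stand-in for the DynamoDB table (assumption, for illustration).
# The version_hash values are placeholders, not real hashes.
ROWS = [
    {
        "name": "scanning-electron-cryomicroscopy",
        "version_hash": "alpha06:placeholder-a",
        "formatted_name": "Scanning electron cryomicroscopy",
        "dimension": "Image modality",
        "hierarchy": [
            "Electron microscopy",
            "Cryo electron microscopy",
            "Scanning electron cryomicroscopy",
        ],
        "label": "Electron microscopy",
    },
    {
        "name": "scanning-electron-cryomicroscopy",
        "version_hash": "alpha06:placeholder-b",
        "formatted_name": "Scanning electron cryomicroscopy",
        "dimension": "Image modality",
        "hierarchy": [
            "Electron microscopy",
            "Scanning electron microscopy",
            "Scanning electron cryomicroscopy",
        ],
        "label": "Electron microscopy",
    },
]


def get_category_entries(rows: list, category: str) -> list:
    """Hypothetical handler: return every entry stored for one category."""
    # Slugify the requested category so it matches the `name` hash key.
    slug = re.sub(r"[^a-z0-9]+", "-", category.lower()).strip("-")
    fields = ("dimension", "hierarchy", "label")
    return [{f: row[f] for f in fields} for row in rows if row["name"] == slug]


entries = get_category_entries(ROWS, "Scanning electron cryomicroscopy")
print(len(entries))  # 2
```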

References