This repository provides instructions, documentation, and examples regarding deployment of the Knowledge Lake Management System (KLMS) developed by the STELAR project. The STELAR KLMS supports and facilitates a holistic approach for FAIR (Findable, Accessible, Interoperable, Reusable) and AI-ready (high-quality, reliably labeled) data. It makes it possible to (semi-)automatically turn a raw data lake into a knowledge lake by: (a) enhancing the data lake with a knowledge layer; and (b) developing and integrating a set of data management tools and workflows. The knowledge layer comprises: (a) a data catalog that offers automatically enhanced metadata for the raw data assets in the lake; and (b) a knowledge graph that semantically describes and interlinks these data assets using suitable domain ontologies and vocabularies. The provided STELAR tools and workflows offer novel functionalities for: (a) data discovery and quality management; (b) data linking and alignment; and (c) data annotation and synthetic data generation.
The STELAR KLMS deployment integrates the following components:

- Keycloak is used for Identity and Access Management.
- A Data Catalog of the datasets in the KLMS, deployed as a CKAN site. Metadata about published datasets (i.e., CKAN packages and resources) is stored in a PostgreSQL database.
- A Knowledge Graph, deployed via Ontop, which employs mappings from the database to a virtual RDF graph according to the KLMS ontology.
- MinIO serves as the storage layer for the files in the data lake.
- The STELAR Operator, needed to design and implement workflows inside the STELAR KLMS using the Apache Airflow workflow engine.
- An instance of MLflow maintains metadata about all executions in the same PostgreSQL database used by the Data Catalog.
- Dashboards offer a quick overview of the datasets, workflows, and tasks managed by the KLMS.
- A RESTful Data API is used for managing and searching resources in the KLMS (see the search sketch after this list).
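Since the Data Catalog is deployed as a CKAN site, its standard action API can also be used to search the published datasets. The minimal sketch below assumes a deployment-specific base URL (the `CKAN_URL` value is a placeholder, not part of this repository) and relies on CKAN's `package_search` action:

```python
import requests

# Base URL of the CKAN-backed Data Catalog; placeholder value, adjust to your deployment.
CKAN_URL = "https://klms.example.org/catalog"

def search_datasets(query, rows=10):
    """Search published datasets (CKAN packages) by free-text query."""
    resp = requests.get(
        f"{CKAN_URL}/api/3/action/package_search",
        params={"q": query, "rows": rows},
        timeout=30,
    )
    resp.raise_for_status()
    result = resp.json()["result"]
    return result["count"], result["results"]

if __name__ == "__main__":
    count, packages = search_datasets("food safety")
    print(f"{count} matching datasets")
    for pkg in packages:
        print(f"- {pkg['name']}: {pkg.get('title', '')}")
```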
The STELAR KLMS supports two alternative workflow engines:
- In its Community Edition, it supports Apache Airflow, a very popular open-source platform for this purpose.
- In its Professional and Enterprise editions, it supports RapidMiner Studio & AI Hub, a widely used commercial platform for machine learning and data science workflows.
The following STELAR tools are provided:

- Synopses Data Engine for Extreme-Scale Analytics-as-a-Service.
- GeoTriples for publishing geospatial data as Linked Geospatial Data in RDF.
- pyJedAI for Schema Matching and Entity Linking.
- JedAI-spatial for computing topological relations between datasets with geometric entities.
- Correlation Detective (CorDet) for finding interesting multivariate correlations in vector datasets.
- Data Profiler, a library for profiling different types of data and files.
- Data Selection interface for searching, ranking, and comparing datasets available in the KLMS Data Catalog.
- GenericNER for named entity recognition (NER) on input texts.
- FoodNER, a service for detecting and extracting Named Entities from Food Science text files.
- Synthetic Data Generation for textual data in the agri-food domain.
- Hazard classification from incidents reported in the agri-food domain.
- Orchestration of several KLMS components for entity extraction and linking over unstructured food safety data, employing the Airflow workflow engine and the Data API for publishing and searching in the Data Catalog (a minimal sketch follows below).
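As a rough illustration of such an orchestration in the Community Edition, the sketch below defines a minimal Airflow DAG with two placeholder tasks. The DAG name and the task bodies are hypothetical; in an actual workflow they would invoke the corresponding KLMS tools (e.g., an NER service) and the Data API for publishing results to the Data Catalog.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_entities(**context):
    # Placeholder: invoke an entity extraction tool on the input files.
    print("running entity extraction ...")


def publish_to_catalog(**context):
    # Placeholder: register the produced resources in the Data Catalog.
    print("publishing results to the Data Catalog ...")


with DAG(
    dag_id="food_safety_entity_linking",  # hypothetical workflow name
    start_date=datetime(2024, 1, 1),
    schedule=None,  # triggered manually
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_entities", python_callable=extract_entities)
    publish = PythonOperator(task_id="publish_to_catalog", python_callable=publish_to_catalog)

    # Publish to the catalog only after extraction has completed.
    extract >> publish
```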
The contents of this project are licensed under the GPL-2.0 license.