OpenCitations Data Sources Converter

This repository contains scripts for converting scholarly bibliographic metadata from various data sources into the format accepted by OpenCitations Meta. The data sources currently supported are:

Table of Contents

  1. About The Project
  2. Software Components
  3. ID Validation Process
  4. How to Run the Software
  5. How to Extend the Software
  6. License
  7. Contacts
  8. Acknowledgements
  9. References

About The Project

The main function of this software is to perform a metadata crosswalk between the data sources providing bibliographic and citation data and the OpenCitations Data Model (OCDM). At the same time, the software also handles the normalization and validation of the identifiers provided by the data source in all cases where it is not also the registration agency for those identifiers. The software generates two main output datasets, based on the data provided by a specific source:

  • Bibliographic entities CSV tables, which are meant to be used as input for the META software, an OpenCitations tool and database for managing bibliographic entities' metadata. Example (one row, shown field by field):
      id: doi:10.9799/ksfan.2012.25.1.069
      title: Nonthermal Sterilization and Shelf-life Extension of Seafood Products by Intense Pulsed Light Treatment
      author: Cheigh, Chan-Ick [orcid:0000-0002-6227-4053]; Mun, Ji-Hye [orcid:0000-0002-6227-4053]; Chung, Myong-Soo
      pub_date: 2012-3-31
      venue: The Korean Journal of Food And Nutrition [issn:1225-4339]
      volume: 25
      issue: 1
      page: 69-76
      type: journal article
      publisher: The Korean Society of Food and Nutrition [crossref:4768]
      editor: (empty)
  • AnyID-to-anyID citations CSV tables, which will be used as input for the INDEX software, an OpenCitations tool and database for producing and managing citations between bibliographic entities identified by OMIDs (internal and unique identifiers assigned by OpenCitations). Example (one row, shown field by field):
      citing: doi:10.11426/nagare1970.2.4_1
      cited: doi:10.1295/kobunshi.16.921
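As a rough illustration of the two output tables, the following sketch serialises one metadata row and one citation row with Python's standard csv module. The column names come from the examples above; the helper name rows_to_csv, the comma delimiter, and the choice to leave missing fields empty are assumptions made for this sketch, not a description of how oc_ds_converter writes its files.

```python
import csv
import io

# Column order taken from the metadata and citation examples above.
META_FIELDS = ["id", "title", "author", "pub_date", "venue", "volume",
               "issue", "page", "type", "publisher", "editor"]
CITATION_FIELDS = ["citing", "cited"]


def rows_to_csv(rows: list[dict], fieldnames: list[str]) -> str:
    """Hypothetical helper: serialise row dictionaries to CSV text.

    Fields missing from a row are written as empty cells (an assumption).
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


meta_csv = rows_to_csv(
    [{"id": "doi:10.9799/ksfan.2012.25.1.069",
      "pub_date": "2012-3-31",
      "volume": "25", "issue": "1", "page": "69-76",
      "type": "journal article"}],
    META_FIELDS,
)
citations_csv = rows_to_csv(
    [{"citing": "doi:10.11426/nagare1970.2.4_1",
      "cited": "doi:10.1295/kobunshi.16.921"}],
    CITATION_FIELDS,
)
print(meta_csv)
print(citations_csv)
```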

In practice, the outputs generated by oc_ds_converter are used in subsequent steps of the data ingestion process within the OpenCitations infrastructure. Specifically, the metadata tables are used as input for META, which assigns an OMID to each new entity and propagates the existing OMID to entities already present in the OpenCitations databases, thereby deduplicating identical entities ingested from different data sources.

Subsequently, the INDEX software, responsible for producing citation data compliant with the OpenCitations data model, takes as input the anyID-to-anyID citation tables produced by the oc_ds_converter software. It queries the META database to retrieve the OMID associated with each entity's identifier and produces OMID-to-OMID citations in various formats (RDF, SCHOLIX, CSV) as output.
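The identifier resolution step described above can be pictured with a small sketch. It is not the INDEX implementation: the dictionary stands in for a query to the META database, the OMID strings are invented placeholders rather than real identifiers, and skipping citations with an unresolved endpoint is an assumption made only for this example.

```python
# Hypothetical lookup table standing in for a query to the META database.
# The OMID strings are invented placeholders, not real OMIDs.
ID_TO_OMID = {
    "doi:10.11426/nagare1970.2.4_1": "omid:placeholder-1",
    "doi:10.1295/kobunshi.16.921": "omid:placeholder-2",
}


def to_omid_citations(citations, id_to_omid):
    """Map (citing, cited) identifier pairs to (citing, cited) OMID pairs."""
    resolved = []
    for citing, cited in citations:
        citing_omid = id_to_omid.get(citing)
        cited_omid = id_to_omid.get(cited)
        if citing_omid is None or cited_omid is None:
            # Assumption for this sketch: unresolved citations are skipped.
            continue
        resolved.append((citing_omid, cited_omid))
    return resolved


pairs = [("doi:10.11426/nagare1970.2.4_1", "doi:10.1295/kobunshi.16.921")]
print(to_omid_citations(pairs, ID_TO_OMID))
```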

The following diagram shows the OpenCitations ingestion workflow:

[Figure: OpenCitations Ingestion Workflow]

Software Components

The software is built upon three fundamental components, each addressing specific needs:

  1. Metadata Crosswalk (oc_ds_converter)
  2. Identifier Validation (oc_ds_converter/oc_idmanager)
  3. Data Storage Management (oc_ds_converter/oc_idmanager/oc_data_storage)

Metadata Crosswalk

Within this layer, there is a specific plugin for each data source, which in turn contains a Python file (usually named after the data source followed by "_processing.py", e.g., oc_ds_converter/datacite/datacite_processing.py). This file defines a class that converts the metadata provided by the specific source into metadata compliant with OCDM. For example, the file datacite_processing.py defines the class DataciteProcessing(RaProcessor), which contains the method `csv_creator(self, item: dict) -> dict`. This method produces a dictionary of metadata representing a bibliographic entity extracted from the dump provided by DataCite.
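To make the shape of a crosswalk class concrete, here is a minimal sketch under stated assumptions. SketchProcessor is a stand-in for RaProcessor (the real base class is not reproduced), ToySourceProcessing is an invented source, and the input keys (doi, creators, published, container, pages) are hypothetical and do not reflect the DataCite schema. Only the csv_creator signature and the output column names come from this document.

```python
from abc import ABC, abstractmethod


class SketchProcessor(ABC):
    """Stand-in for RaProcessor; the real base class is not reproduced here."""

    @abstractmethod
    def csv_creator(self, item: dict) -> dict:
        """Convert one source record into an OCDM-style metadata row."""


class ToySourceProcessing(SketchProcessor):
    """Crosswalk for an invented source whose records use hypothetical keys."""

    def csv_creator(self, item: dict) -> dict:
        # Authors are rendered as "Family, Given" and joined with "; ",
        # following the metadata table example in this document.
        authors = "; ".join(
            f"{a['family']}, {a['given']}" for a in item.get("creators", [])
        )
        return {
            "id": f"doi:{item['doi'].lower()}",
            "title": item.get("title", ""),
            "author": authors,
            "pub_date": item.get("published", ""),
            "venue": item.get("container", ""),
            "volume": item.get("volume", ""),
            "issue": item.get("issue", ""),
            "page": item.get("pages", ""),
            "type": item.get("type", ""),
            "publisher": item.get("publisher", ""),
            "editor": "",
        }


# The DOI below is a made-up example value.
record = {"doi": "10.1234/ABC", "title": "A title",
          "creators": [{"family": "Doe", "given": "Jane"}]}
print(ToySourceProcessing().csv_creator(record))
```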

Identifier Validation

This software validates all identifiers not provided by the identifier registration agency itself. Currently, the identifiers handled by OpenCitations are: DOI; PMID; PMC; VIAF; WIKIDATA; WIKIPEDIA; ROR; ORCID; ARXIV; JID; ISSN; ISBN; URL.

Each identifier schema has its own class (e.g., PMIDManager(IdentifierManager), defined in oc_ds_converter/oc_idmanager/pmid.py), which subclasses the abstract class IdentifierManager(metaclass=ABCMeta), defined in oc_ds_converter/oc_idmanager/base.py. Each class provides methods for:

  1. normalising the id string
  2. checking the correctness of the id syntax
  3. verifying its existence using specific API services (if available)
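The three responsibilities listed above can be sketched as follows. This is not the code in base.py or pmid.py: SketchIdentifierManager and ToyPmidManager are hypothetical names, the method signatures are assumptions, the PMID syntax rule (a positive integer without leading zeros) is a simplification, and the existence check is delegated to an injected callable instead of a real API call.

```python
import re
from abc import ABCMeta, abstractmethod
from typing import Callable, Optional


class SketchIdentifierManager(metaclass=ABCMeta):
    """Stand-in for the abstract IdentifierManager (signatures are assumed)."""

    @abstractmethod
    def normalise(self, id_string: str) -> Optional[str]: ...

    @abstractmethod
    def syntax_ok(self, id_string: str) -> bool: ...

    @abstractmethod
    def exists(self, id_string: str) -> bool: ...


class ToyPmidManager(SketchIdentifierManager):
    def __init__(self, lookup: Callable[[str], bool]):
        # `lookup` replaces the API service a real manager would query.
        self._lookup = lookup

    def normalise(self, id_string: str) -> Optional[str]:
        # Keep digits only and drop leading zeros; None if nothing is left.
        digits = re.sub(r"\D", "", id_string).lstrip("0")
        return digits or None

    def syntax_ok(self, id_string: str) -> bool:
        # Simplified rule: a positive integer without leading zeros.
        return re.fullmatch(r"[1-9]\d*", id_string) is not None

    def exists(self, id_string: str) -> bool:
        return self._lookup(id_string)


manager = ToyPmidManager(lookup=lambda pmid: pmid == "12345")
pmid = manager.normalise("PMID: 0012345")
print(pmid, manager.syntax_ok(pmid), manager.exists(pmid))
```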

Data Storage Management

oc_ds_converter currently offers three storage systems, which can be used as alternatives to one another:
  • In Memory (class InMemoryStorageManager(StorageManager), defined in oc_ds_converter/oc_idmanager/oc_data_storage/in_memory_manager.py)
  • Redis (class RedisStorageManager(StorageManager), defined in oc_ds_converter/oc_idmanager/oc_data_storage/redis_manager.py)
  • Sqlite (class SqliteStorageManager(StorageManager), defined in oc_ds_converter/oc_idmanager/oc_data_storage/sqlite_manager.py).

Each of these classes is a subclass of the abstract class StorageManager(metaclass=ABCMeta), defined in oc_ds_converter/oc_idmanager/oc_data_storage/storage_manager.py.

The user can choose which storage manager to use for a given data source process (we suggest the Redis storage manager). All the ID managers instantiated in the process share one instance of the chosen storage manager, which stores the validation data at the end of each data chunk. While a chunk is being processed, validation data are instead always kept in a temporary instance of the in-memory storage manager, which is based on a Python dictionary. The reason is that, if a run is interrupted, execution restarts from the beginning of the chunk that was being processed at the time of the interruption. Data already written to a Redis or Sqlite storage manager would therefore be duplicated, whereas data held in an in-memory storage manager are simply lost and reprocessed.
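The design choice above can be illustrated with a small simulation. Plain dictionaries stand in for both the temporary and the main storage manager, and process_chunks and flaky_validate are hypothetical names. The point of the sketch is that an interruption in the middle of a chunk leaves the main storage holding only the results of fully completed chunks.

```python
def process_chunks(chunks, main_storage: dict, validate) -> None:
    """Validate identifiers chunk by chunk, flushing only completed chunks."""
    for chunk in chunks:
        temp_storage = {}  # fresh temporary storage for this chunk only
        for identifier in chunk:
            if identifier in main_storage or identifier in temp_storage:
                continue  # already assessed, no need to validate again
            temp_storage[identifier] = validate(identifier)
        # Reached only if the whole chunk was processed without interruption.
        main_storage.update(temp_storage)


def flaky_validate(identifier: str) -> bool:
    # Simulates a run that stops while the second chunk is being processed.
    if identifier == "id:boom":
        raise RuntimeError("simulated interruption")
    return True


main_storage = {}
try:
    process_chunks([["id:a", "id:b"], ["id:c", "id:boom"]],
                   main_storage, flaky_validate)
except RuntimeError:
    pass

# "id:c" was validated before the interruption but never reached main storage.
print(main_storage)
```

On restart, the second chunk would be processed again from its beginning, and its results would be written to the main storage only once.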

ID Validation Process

To avoid redundant API checks, we rely on an ad hoc data storage system. In more detail, when the data source is also the registration agency for at least some of the identifiers provided in a data dump, we first perform a full preliminary iteration over the data to store these identifiers as valid, without any further check.

[Figure: Preliminary data dump iteration]

Subsequently, we perform another full iteration, validating all identifiers not registered by the data source itself.

[Figure: Data dump iteration for data validation]

Note that, to manage the large amount of data provided by each data source, the input dataset is generally divided into data chunks. As mentioned above, to avoid data duplication if the process is interrupted and restarted, the data for each chunk are temporarily stored in an instance of the in-memory storage manager (see InMemoryStorageManager(StorageManager) in oc_ds_converter/oc_idmanager/oc_data_storage/in_memory_manager.py). The data in the temporary storage manager are transferred to the main storage manager (which holds the ID validation data for the full input dataset) at the end of the chunk's processing, once both the bibliographic metadata and the citation CSV tables have been produced.

For each identifier to be validated, an ordered list of checks should be performed, stopping as soon as its validity can be assessed:

  1. Search for the identifier in the in-memory storage manager, containing data concerning the current data chunk;
  2. Search for the identifier in the main storage manager, containing data concerning the whole dataset;
  3. Search for the identifier in the OpenCitations databases, containing data of all the datasets ever ingested in OpenCitations;
  4. Use ID-schema specific API services to retrieve the validity information of the ID.
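The ordered checks above can be sketched as a chain of lookups that stops at the first one able to decide. Everything here is illustrative: the function and variable names are hypothetical, the dictionaries and the set stand in for the storage managers and the OpenCitations databases, the identifiers are made-up values, ask_api is a stub rather than a real service call, and returning False when no lookup decides is an assumption.

```python
from typing import Callable, Optional

# A lookup returns True or False when it can decide, None when it cannot.
Lookup = Callable[[str], Optional[bool]]


def is_valid(identifier: str, lookups: list[Lookup]) -> bool:
    """Run the lookups in order and stop at the first definite answer."""
    for lookup in lookups:
        verdict = lookup(identifier)
        if verdict is not None:
            return verdict
    return False  # assumption: undecided identifiers are treated as invalid


chunk_storage = {"doi:10.1/a": True}       # 1. current data chunk
main_storage = {"doi:10.1/b": False}       # 2. whole dataset
known_to_opencitations = {"doi:10.1/c"}    # 3. previously ingested data
api_calls = []                             # records when step 4 is reached


def in_opencitations(identifier: str) -> Optional[bool]:
    return True if identifier in known_to_opencitations else None


def ask_api(identifier: str) -> Optional[bool]:
    api_calls.append(identifier)
    return identifier.startswith("doi:10.")  # stub for an API response


lookups = [chunk_storage.get, main_storage.get, in_opencitations, ask_api]
for candidate in ["doi:10.1/a", "doi:10.1/b", "doi:10.1/c", "doi:10.1/d"]:
    print(candidate, is_valid(candidate, lookups))
print("API consulted for:", api_calls)
```

Ordering the lookups from cheapest to most expensive means the API stub is consulted only for identifiers that none of the storages can resolve.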

How to Run the Software

To produce the citations and metadata CSV output from a data source, run the corresponding script in the oc_ds_converter/run/ directory. For example, the oc_ds_converter process for the JaLC data source can be launched as follows:

python oc_ds_converter/run/jalc_process.py -ja /Volumes/my_disk/JALC_INPUT -out /Volumes/my_disk/JALC_OUTPUT -ca /Volumes/my_disk/JOCI_CACHE.json -r -m 3

This command converts the input data dump (located at /Volumes/my_disk/JALC_INPUT) into metadata CSV tables (stored at /Volumes/my_disk/JALC_OUTPUT) and citation CSV tables (stored in a directory automatically generated at /Volumes/my_disk/JALC_OUTPUT_citations), using up to 3 workers to parallelize the process (-m 3) and Redis as the storage system (-r). While the process is running, a cache file at /Volumes/my_disk/JOCI_CACHE.json is created and updated.

In more detail, each data source run script has a set of arguments that can be adapted to the peculiarities of the dataset. However, all the sources should accept a similar list of arguments:

  • '--config': The path to a configuration file, where the other arguments can be declared;
  • '--input_location': The path to the input data;
  • '--output_location': The path to the output directory where the metadata CSV files will be stored. The name of the directory for the citation CSV files is derived automatically from the name of this directory;
  • '--publishers': The path to an optional support CSV file containing additional information about publishers, their Crossref members and the DOI prefix they are associated with (id, name, prefix), used to enrich the metadata;
  • '--orcid': The path to an optional support table mapping DOIs to the ORCIDs of the publications' authors, used to enrich the metadata;
  • '--wanted': The path to an optional CSV file containing a list of DOIs to process;
  • '--cache': The path to the cache file, which is automatically deleted at the end of the process;
  • '--verbose': Whether a verbose description of the process execution is required;
  • '--storage_path': The optional path of the file in which to store information about validated IDs, used when the process runs with either an In-Memory or a Sqlite storage manager. Specify a ".db" file if a SqliteStorageManager is chosen and a ".json" file otherwise;
  • '--testing': Whether the script is to be run in testing mode;
  • '--redis_storage_manager': Whether to use Redis as the storage manager. If Redis is not used, the storage manager type is derived from the storage path type (i.e., In-Memory storage for a JSON file, Sqlite for a .db file);
  • '--max_workers': The integer number of workers used to run the process in parallel.
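As an illustration of how such arguments might be wired together, the sketch below builds a parser with Python's argparse for a subset of the options listed above and applies the two derivation rules stated in this document: the storage manager type follows from --redis_storage_manager and the --storage_path extension, and the citations directory follows from the output directory name. The function names, argument types, defaults, and the error raised when no storage can be inferred are assumptions. The short aliases used in the JaLC example (-ja, -out, -ca, -r, -m) are deliberately not reproduced, since they may differ between run scripts.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering a subset of the listed arguments."""
    parser = argparse.ArgumentParser(description="Sketch of a run script CLI")
    parser.add_argument("--config")
    parser.add_argument("--input_location")
    parser.add_argument("--output_location")
    parser.add_argument("--cache")
    parser.add_argument("--storage_path")
    parser.add_argument("--redis_storage_manager", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--testing", action="store_true")
    parser.add_argument("--max_workers", type=int, default=1)  # assumed default
    return parser


def storage_kind(args: argparse.Namespace) -> str:
    """Pick the storage type: Redis if requested, else by file extension."""
    if args.redis_storage_manager:
        return "redis"
    suffix = Path(args.storage_path or "").suffix.lower()
    if suffix == ".db":
        return "sqlite"
    if suffix == ".json":
        return "in_memory"
    # Assumption: this sketch refuses to guess when nothing can be inferred.
    raise ValueError("expected --redis_storage_manager or a .db/.json path")


def citations_dir(output_location: str) -> str:
    """Derive the citations directory name from the metadata output path."""
    return output_location.rstrip("/") + "_citations"


args = build_parser().parse_args(
    ["--output_location", "/Volumes/my_disk/JALC_OUTPUT",
     "--redis_storage_manager", "--max_workers", "3"]
)
print(storage_kind(args), citations_dir(args.output_location), args.max_workers)
```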

How to Extend the Software

Manage a new Data Source

In order to manage a new data source, two main software components need to be developed:

  1. a script for reading the data source, extracting the bibliographic entities' metadata, and producing the output tables;
  2. a script for reshaping the metadata of each bibliographic entity according to the OpenCitations Data Model.

In addition to that, if the data source uses persistent identifiers not managed by OpenCitations yet, a new identifier manager should be developed too.

Data Source Reader Script

For each new data source, a Python file should be added to the directory oc_ds_converter/run/. The file should be named after the data source and perform the following tasks:

  1. decompress and read the source dataset;
  2. manage the identifiers' validation process;
  3. extract from the source data a data structure representing each bibliographic resource;
  4. call a source-specific metadata crosswalk method to convert this data structure into an OCDM-compliant dictionary representing the bibliographic resource, to be stored as a CSV row;
  5. produce the output tables (citations and metadata)
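The first four tasks above can be sketched in a few lines. The input format is an assumption (a gzip-compressed file with one JSON record per line, each with hypothetical id and references keys), as are the function names; the crosswalk and validation steps are passed in as callables so the sketch stays independent of any real source. The function returns the metadata and citation rows, ready to be written out as the two CSV tables.

```python
import gzip
import io
import json


def read_records(compressed: bytes):
    """Decompress a gzip payload and yield one JSON record per line."""
    with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def run(compressed: bytes, crosswalk, is_valid):
    """Return (metadata rows, citation rows) for the valid records."""
    meta_rows, citation_rows = [], []
    for record in read_records(compressed):
        citing = record["id"]                    # hypothetical key
        if not is_valid(citing):
            continue
        meta_rows.append(crosswalk(record))      # source-specific crosswalk
        for cited in record.get("references", []):  # hypothetical key
            if is_valid(cited):
                citation_rows.append({"citing": citing, "cited": cited})
    return meta_rows, citation_rows


# Made-up input: two records, the second with an identifier that fails validation.
payload = gzip.compress(
    b'{"id": "doi:10.1/a", "references": ["doi:10.1/b", "bad"]}\n'
    b'{"id": "bad"}\n'
)
meta, citations = run(
    payload,
    crosswalk=lambda record: {"id": record["id"]},
    is_valid=lambda identifier: identifier.startswith("doi:"),
)
print(meta)
print(citations)
```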

Metadata Crosswalk Script

Each source represents bibliographic records according to its own data model, which has to be mapped into OCDM. To do so, we implement a source-specific child class of RaProcessor (defined in oc_ds_converter/ra_processor.py) for each new data source. The main method of every RaProcessor child class is csv_creator, which produces a row for an OpenCitations metadata table from a data structure representing a bibliographic entry according to the source data model. As an example, see the OpenaireProcessing(RaProcessor) class (in oc_ds_converter/openaire/openaire_processing.py).

Add a new ID Manager

To add a new ID Manager:

  1. create a Python file in oc_ds_converter/oc_idmanager, named after the ID schema, e.g., oc_ds_converter/oc_idmanager/viaf.py.
  2. create a new class as a subclass of the abstract class IdentifierManager (defined in oc_ds_converter/oc_idmanager/base.py), e.g., ViafManager(IdentifierManager), thus following the provided template. In particular:
  3. define all the required ID-schema-specific methods, i.e.: syntax_ok, to check whether the ID complies with the syntax of its schema; exists, to check the ID's existence using the ID-specific API; normalise, to normalise the identifier string (for example, by removing unexpected characters and turning uppercase characters into lowercase); and is_valid, to assess the overall validity of the identifier.
  4. if possible, add further ID-schema-specific methods. For example, some ID schemas (such as ORCID and ISSN) are built following a specific check-digit mechanism, which provides a further means of verifying the ID's validity: in these cases, a check_digit method can also be added.
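As a worked example of the steps above, the following sketch implements the four required methods plus check_digit for ISSN. It is not the manager shipped in oc_ds_converter: ToyIssnManager is a hypothetical name, it does not inherit from the real IdentifierManager, the method signatures are assumptions, and the existence check uses an optional injected callable (treated as passing when absent) instead of a real API. The check-digit rule is the standard ISSN one: weights 8 down to 2 on the first seven digits, modulo 11, with X standing for 10.

```python
import re
from typing import Callable, Optional


class ToyIssnManager:
    def __init__(self, lookup: Optional[Callable[[str], bool]] = None):
        # `lookup` replaces an API call; None means no service is available.
        self._lookup = lookup

    def normalise(self, id_string: str) -> Optional[str]:
        # Keep digits and X only, uppercase, then restore the NNNN-NNNC form.
        compact = re.sub(r"[^0-9Xx]", "", id_string).upper()
        if len(compact) != 8:
            return None
        return f"{compact[:4]}-{compact[4:]}"

    def syntax_ok(self, id_string: str) -> bool:
        return re.fullmatch(r"\d{4}-\d{3}[\dX]", id_string) is not None

    def check_digit(self, id_string: str) -> bool:
        """Verify the ISSN check digit; expects a syntactically valid ISSN."""
        digits = id_string.replace("-", "")
        total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
        expected = (11 - total % 11) % 11
        return digits[7] == ("X" if expected == 10 else str(expected))

    def exists(self, id_string: str) -> bool:
        # Assumption: without a lookup service, existence is not disputed.
        return True if self._lookup is None else self._lookup(id_string)

    def is_valid(self, id_string: str) -> bool:
        issn = self.normalise(id_string)
        return (issn is not None and self.syntax_ok(issn)
                and self.check_digit(issn) and self.exists(issn))


manager = ToyIssnManager()
# ISSN taken from the metadata table example earlier in this document.
print(manager.normalise("issn:1225-4339"), manager.is_valid("issn:1225-4339"))
print(manager.is_valid("1225-4338"))  # same ISSN with a wrong check digit
```

Running the cheap checks (normalisation, syntax, check digit) before the existence check means a lookup service is only consulted for identifiers that are at least well formed.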

Add a new Storage Manager

To add a new type of Storage Manager, i.e., one relying on another storage system:

  1. create a Python file in oc_ds_converter/oc_idmanager/oc_data_storage, named after the storage system, e.g., oc_ds_converter/oc_idmanager/oc_data_storage/redis_manager.py.
  2. create a new class as a subclass of the abstract class StorageManager (defined in oc_ds_converter/oc_idmanager/oc_data_storage/storage_manager.py), e.g., RedisStorageManager(StorageManager), thus following the provided template. In particular:
  3. define all the required storage-type-specific methods, i.e.: set_value, to add a single key-value pair to the storage; set_multi_value, to store a list of key-value tuples all at once; get_value, to retrieve the value associated with a specific key; del_value, to delete a key-value pair; delete_storage, to delete all the data previously saved in the storage system; and get_all_keys, to retrieve the list of all the keys in the storage.
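The interface described above can be sketched as follows. SketchStorageManager is a stand-in for the abstract StorageManager, and DictStorageManager is a hypothetical backend in which a plain dictionary plays the role of the new storage system. The method names come from the list above; their parameter names, types, and return values are assumptions.

```python
from abc import ABCMeta, abstractmethod
from typing import Optional


class SketchStorageManager(metaclass=ABCMeta):
    """Stand-in for the abstract StorageManager (signatures are assumed)."""

    @abstractmethod
    def set_value(self, key: str, value: bool) -> None: ...

    @abstractmethod
    def set_multi_value(self, pairs: list[tuple[str, bool]]) -> None: ...

    @abstractmethod
    def get_value(self, key: str) -> Optional[bool]: ...

    @abstractmethod
    def del_value(self, key: str) -> None: ...

    @abstractmethod
    def delete_storage(self) -> None: ...

    @abstractmethod
    def get_all_keys(self) -> list[str]: ...


class DictStorageManager(SketchStorageManager):
    """Hypothetical backend: a dictionary stands in for the new storage system."""

    def __init__(self) -> None:
        self._data: dict[str, bool] = {}

    def set_value(self, key: str, value: bool) -> None:
        self._data[key] = value

    def set_multi_value(self, pairs: list[tuple[str, bool]]) -> None:
        self._data.update(pairs)  # store all key-value tuples at once

    def get_value(self, key: str) -> Optional[bool]:
        return self._data.get(key)  # None when the key is unknown

    def del_value(self, key: str) -> None:
        self._data.pop(key, None)

    def delete_storage(self) -> None:
        self._data.clear()

    def get_all_keys(self) -> list[str]:
        return list(self._data)


storage = DictStorageManager()
storage.set_value("doi:10.1/a", True)            # made-up identifiers
storage.set_multi_value([("doi:10.1/b", False), ("doi:10.1/c", True)])
print(storage.get_value("doi:10.1/b"), storage.get_all_keys())
```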

Test

The repository is managed with poetry. To activate the virtual environment:

poetry shell

To add a package as a dependency to the project:

poetry add <package>

To run all tests with poetry:

poetry run test

To run specific tests, adjust the -p pattern to match the test files you want:

python -m unittest discover -s test -p "*.py" 

License

Distributed under the ISC License. See LICENSE for more information.

Contacts

Authors and Current maintainers of the repository

Project Link: https://github.com/opencitations/oc_ds_converter

Acknowledgements

This project has been developed under the supervision of Prof. Silvio Peroni.

References