This repository contains scripts for converting scholarly bibliographic metadata from various data sources into the format accepted by OpenCitations Meta. The data sources currently supported are:
- About The Project
- Software Components
- ID Validation Process
- How to Run the Software
- How to Extend the Software
- License
- Contacts
- Acknowledgements
- References
The main function of this software is to perform a metadata crosswalk between the data sources providing bibliographic and citation data and the OpenCitations Data Model (OCDM). At the same time, the software also handles the normalization and validation of the identifiers provided by the data source in all cases where it is not also the registration agency for those identifiers. The software generates two main output datasets, based on the data provided by a specific source:
- Bibliographic entity CSV tables, which are meant to be used as input for the META software, an OpenCitations tool and database for managing the metadata of bibliographic entities. Example:
id | title | author | pub_date | venue | volume | issue | page | type | publisher | editor |
---|---|---|---|---|---|---|---|---|---|---|
doi:10.9799/ksfan.2012.25.1.069 | Nonthermal Sterilization and Shelf-life Extension of Seafood Products by Intense Pulsed Light Treatment | Cheigh, Chan-Ick [orcid:0000-0002-6227-4053]; Mun, Ji-Hye [orcid:0000-0002-6227-4053]; Chung, Myong-Soo | 2012-3-31 | The Korean Journal of Food And Nutrition [issn:1225-4339] | 25 | 1 | 69-76 | journal article | The Korean Society of Food and Nutrition [crossref:4768] | |
- AnyID-to-anyID citation CSV tables, which are used as input for the INDEX software, an OpenCitations tool and database for producing and managing citations between bibliographic entities identified by OMIDs (internal, unique identifiers assigned by OpenCitations). Example:
citing | cited |
---|---|
doi:10.11426/nagare1970.2.4_1 | doi:10.1295/kobunshi.16.921 |
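
As a rough illustration of the two output tables, the sketch below serialises one metadata row and one citation row with Python's standard `csv` module. The column names are copied from the examples above; the comma delimiter, the `write_table` helper and the handling of missing fields are assumptions made for this sketch, not the converter's actual writer.

```python
import csv
import io

# Column names copied from the example tables above.
META_COLUMNS = ["id", "title", "author", "pub_date", "venue", "volume",
                "issue", "page", "type", "publisher", "editor"]
CITATION_COLUMNS = ["citing", "cited"]


def write_table(rows, columns):
    """Serialise a list of dicts as CSV text; missing fields become empty strings.

    Hypothetical helper: the comma delimiter is an assumption of this sketch.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


meta_csv = write_table(
    [{"id": "doi:10.9799/ksfan.2012.25.1.069", "pub_date": "2012-3-31",
      "volume": "25", "issue": "1", "page": "69-76",
      "type": "journal article"}],
    META_COLUMNS,
)
citations_csv = write_table(
    [{"citing": "doi:10.11426/nagare1970.2.4_1",
      "cited": "doi:10.1295/kobunshi.16.921"}],
    CITATION_COLUMNS,
)
print(citations_csv)
```

Fields that a source does not provide (here `title`, `author`, `venue`, `publisher` and `editor`) are simply left empty in the row.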
In practice, the outputs generated by `oc_ds_converter` are used in subsequent steps of the data ingestion process within the OpenCitations infrastructure. Specifically, the metadata tables are used as input for META, which assigns an OMID to new entities and propagates the existing OMID to entities already present in the OpenCitations databases, thereby deduplicating identical entities ingested from different data sources.
Subsequently, the INDEX software, responsible for producing citation data compliant with the OpenCitations Data Model, takes as input the anyID-to-anyID citation tables produced by `oc_ds_converter`. It queries the META database to retrieve the OMID associated with each entity's identifier and produces OMID-to-OMID citations in various formats (RDF, SCHOLIX, CSV) as output.
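The mapping step described above can be pictured with a toy sketch. A plain dictionary stands in for the META database, the OMID values are placeholders, and skipping unresolved pairs is an assumption of this sketch; none of this reflects how INDEX is actually implemented.

```python
# Toy stand-in for a META lookup: external ID -> OMID.
# The OMID values below are placeholders, not real identifiers.
ID_TO_OMID = {
    "doi:10.11426/nagare1970.2.4_1": "omid:placeholder-1",
    "doi:10.1295/kobunshi.16.921": "omid:placeholder-2",
}


def to_omid_citations(citations, id_to_omid):
    """Map (citing, cited) pairs of external IDs to pairs of OMIDs.

    Pairs with an unresolved endpoint are skipped (an assumption of this sketch).
    """
    resolved = []
    for citing, cited in citations:
        citing_omid = id_to_omid.get(citing)
        cited_omid = id_to_omid.get(cited)
        if citing_omid is not None and cited_omid is not None:
            resolved.append((citing_omid, cited_omid))
    return resolved


omid_citations = to_omid_citations(
    [("doi:10.11426/nagare1970.2.4_1", "doi:10.1295/kobunshi.16.921"),
     ("doi:10.11426/nagare1970.2.4_1", "doi:10.0000/unknown")],
    ID_TO_OMID,
)
print(omid_citations)
```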
[Figure: diagram of the OpenCitations ingestion workflow]
The software is built upon three fundamental components, each addressing specific needs:
- Metadata Crosswalk (oc_ds_converter)
- Identifier Validation (oc_ds_converter/oc_idmanager)
- Data Storage Management (oc_ds_converter/oc_idmanager/oc_data_storage)
Each identifier schema has its own class (e.g. `PMIDManager(IdentifierManager)`, defined in `oc_ds_converter/oc_idmanager/pmid.py`), which subclasses the abstract class `IdentifierManager(metaclass=ABCMeta)`, defined in `oc_ds_converter/oc_idmanager/base.py`.
Each class provides methods for:
- normalising the id string
- checking the correctness of the id syntax
- verifying its existence using specific API services (if available)
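A minimal sketch of this structure is shown below. The method names (`normalise`, `syntax_ok`, `exists`, `is_valid`) come from this README, but the signatures, the `SketchIdentifierManager` and `ToyDoiManager` classes, the DOI pattern and the stubbed `exists` are assumptions made for illustration; the real base class lives in `oc_ds_converter/oc_idmanager/base.py`.

```python
import re
from abc import ABCMeta, abstractmethod


class SketchIdentifierManager(metaclass=ABCMeta):
    """Hypothetical stand-in for the abstract base class (signatures assumed)."""

    @abstractmethod
    def normalise(self, id_string):
        """Return the normalised form of the identifier."""

    @abstractmethod
    def syntax_ok(self, id_string):
        """Return True if the identifier matches the schema syntax."""

    @abstractmethod
    def exists(self, id_string):
        """Return True if the identifier is known to an external service."""

    def is_valid(self, id_string):
        # Overall validity: normalise first, then require both checks to pass.
        normalised = self.normalise(id_string)
        return self.syntax_ok(normalised) and self.exists(normalised)


class ToyDoiManager(SketchIdentifierManager):
    """Toy DOI manager; the prefix list and pattern are simplified assumptions."""

    _PREFIXES = ("https://doi.org/", "doi:")
    _PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

    def normalise(self, id_string):
        value = id_string.strip().lower()
        for prefix in self._PREFIXES:
            if value.startswith(prefix):
                value = value[len(prefix):]
        return value

    def syntax_ok(self, id_string):
        return bool(self._PATTERN.match(id_string))

    def exists(self, id_string):
        # A real manager would query an API service here; this stub performs
        # no network call and always answers True.
        return True


manager = ToyDoiManager()
print(manager.normalise(" DOI:10.9799/KSFAN.2012.25.1.069 "))
```

Because `is_valid` evaluates `syntax_ok` before `exists`, a malformed identifier never triggers the (potentially expensive) existence check.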
The available storage managers are:
- In Memory (class `InMemoryStorageManager(StorageManager)`, defined in `oc_ds_converter/oc_idmanager/oc_data_storage/in_memory_manager.py`)
- Redis (class `RedisStorageManager(StorageManager)`, defined in `oc_ds_converter/oc_idmanager/oc_data_storage/redis_manager.py`)
- SQLite (class `SqliteStorageManager(StorageManager)`, defined in `oc_ds_converter/oc_idmanager/oc_data_storage/sqlite_manager.py`).
Each of these classes is a subclass of the abstract class `StorageManager(metaclass=ABCMeta)`, defined in `oc_ds_converter/oc_idmanager/oc_data_storage/storage_manager.py`.
The user can choose which storage manager to use for a given data source process (we suggest Redis). All the ID managers instantiated in the process share one instance of the chosen storage manager and write validation data to it at the end of each data chunk. While a chunk is being processed, validation data is instead always held in a temporary In-Memory storage manager (based on a Python dictionary). The reason is that, if a run is interrupted, execution restarts from the beginning of the chunk that was being processed: data already written to a Redis or SQLite storage manager would be duplicated, whereas data held in an in-memory storage manager is simply lost and reprocessed.
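The effect of this choice can be seen in a small sketch. Plain dictionaries stand in for both the temporary and the main storage manager, and `process_chunks` and `flaky_validate` are hypothetical names; the point is only that an interruption in the middle of a chunk leaves the main storage untouched by that chunk.

```python
def process_chunks(chunks, validate, main_storage):
    """Validate IDs chunk by chunk, committing results only when a chunk completes."""
    for chunk in chunks:
        temporary = {}  # in-memory storage for the current chunk only
        for identifier in chunk:
            if identifier not in temporary and identifier not in main_storage:
                temporary[identifier] = validate(identifier)
        # The chunk finished: transfer its results to the main storage.
        main_storage.update(temporary)
    return main_storage


def flaky_validate(identifier):
    """Hypothetical validator that simulates a crash on one identifier."""
    if identifier == "id:boom":
        raise RuntimeError("simulated interruption")
    return True


main = {}
try:
    process_chunks([["id:a", "id:b"], ["id:c", "id:boom"]], flaky_validate, main)
except RuntimeError:
    pass

# Only the completed first chunk reached the main storage; "id:c" was held in
# the temporary dictionary and is lost, so it will be reprocessed on restart.
print(main)
```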
To avoid redundant API checks, we rely on an ad hoc data storage system. In more detail, if the data source is also the registration agency for at least some of the identifiers provided in a data dump, we perform a full preliminary iteration over the data to store these identifiers as valid, without any further check. Subsequently, we perform another full iteration, validating all the identifiers not registered by the data source itself.
Note that, to manage the large amount of data provided by each data source, the input dataset is generally divided into data chunks. As mentioned above, to avoid data duplication if the process is interrupted and restarted, the data for each chunk is temporarily stored in an instance of the in-memory storage manager (see `InMemoryStorageManager(StorageManager)` in `oc_ds_converter/oc_idmanager/oc_data_storage/in_memory_manager.py`). The data in the temporary storage manager is transferred to the main storage manager (which contains the ID validation data for the full input dataset) once the chunk has been processed, that is, when both the bibliographic metadata and the citation CSV tables have been produced. For each identifier to be validated, an ordered list of checks is performed, stopping as soon as its validity can be determined:
- Search for the identifier in the in-memory storage manager, containing data concerning the current data chunk;
- Search for the identifier in the main storage manager, containing data concerning the whole dataset;
- Search for the identifier in the OpenCitations databases, containing data of all the datasets ever ingested in OpenCitations;
- Use ID-schema specific API services to retrieve the validity information of the ID.
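The ordered checks above can be sketched as a short cascade. Dictionaries stand in for the two storage managers, `oc_lookup` and `api_lookup` are hypothetical callables, and two details are assumptions of this sketch: identifiers found in the OpenCitations databases are treated as valid, and every newly determined result is recorded in the chunk storage.

```python
def lookup_validity(identifier, chunk_storage, main_storage, oc_lookup, api_lookup):
    """Return (validity, source), stopping at the first layer that knows the ID."""
    if identifier in chunk_storage:
        return chunk_storage[identifier], "chunk storage"
    if identifier in main_storage:
        return main_storage[identifier], "main storage"
    if oc_lookup(identifier):
        # Assumption: an ID already ingested by OpenCitations counts as valid.
        validity, source = True, "OpenCitations"
    else:
        # Last resort: ask the ID-schema-specific API service.
        validity, source = api_lookup(identifier), "API"
    chunk_storage[identifier] = validity
    return validity, source


chunk = {"doi:10.1/a": True}
main = {"doi:10.1/b": False}
already_ingested = {"doi:10.1/c"}
api_calls = []


def fake_api(identifier):
    """Hypothetical API stub that records each call it receives."""
    api_calls.append(identifier)
    return identifier.endswith("d")


for doi in ("doi:10.1/a", "doi:10.1/b", "doi:10.1/c", "doi:10.1/d", "doi:10.1/d"):
    print(doi, lookup_validity(doi, chunk, main, already_ingested.__contains__, fake_api))
```

In this run the API stub is called only once: the second lookup of `doi:10.1/d` is answered by the chunk storage.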
To produce the citation and metadata CSV output from a data source, run the corresponding script in the `oc_ds_converter/run/` directory. For example, the `oc_ds_converter` process for the JaLC data source can be launched as follows:
python oc_ds_converter/run/jalc_process.py -ja /Volumes/my_disk/JALC_INPUT -out /Volumes/my_disk/JALC_OUTPUT -ca /Volumes/my_disk/JOCI_CACHE.json -r -m 3
This command launches a data conversion process from the input data dump (located at `/Volumes/my_disk/JALC_INPUT`) into metadata CSV tables (stored at `/Volumes/my_disk/JALC_OUTPUT`) and citation CSV tables (stored in a directory automatically generated at `/Volumes/my_disk/JALC_OUTPUT_citations`), using up to 3 workers for parallelization (`-m 3`) and Redis as the storage system (`-r`). While the process is running, a cache file at `/Volumes/my_disk/JOCI_CACHE.json` is created and updated.
In more detail, each data source run script has a set of arguments that can be adapted to the specifics of the dataset. However, all the sources should accept a similar list of arguments:
- `--config`: The path to a configuration file, where the other arguments can be declared;
- `--input_location`: The path to the input data;
- `--output_location`: The path to the output directory where the metadata CSV files will be stored. The name of the directory for the citation CSV files is derived automatically from the name of this directory;
- `--publishers`: The path to an optional support CSV file containing additional information about publishers, their Crossref members and the DOI prefix they are associated with (id, name, prefix), used to enrich the metadata;
- `--orcid`: The path to an optional support table mapping DOIs to the ORCIDs of the publications' authors, used to enrich the metadata;
- `--wanted`: The path to an optional CSV file containing a list of DOIs to process;
- `--cache`: The path to the cache file, which is automatically deleted at the end of the process;
- `--verbose`: Whether to print a verbose description of the process execution;
- `--storage_path`: The optional path of the file in which to store the data on validated IDs, when the process is run with either an In-Memory or a SQLite storage manager. Make sure to specify a `.db` file if a `SqliteStorageManager` is chosen and a `.json` file otherwise;
- `--testing`: Whether the script is to be run in testing mode;
- `--redis_storage_manager`: Whether to use Redis as the storage manager. If Redis is not used, the storage manager type is derived from the storage path (i.e. In-Memory storage for a `.json` file, SQLite for a `.db` file);
- `--max_workers`: The number of workers (an integer) used to run the process in parallel.
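Two of the derivations described in this list can be sketched in a few lines: the citation directory name obtained from `--output_location`, and the storage manager type obtained from `--redis_storage_manager` and `--storage_path`. The function names, the returned labels and the error raised for unsupported paths are assumptions of this sketch; the run scripts may handle these cases differently.

```python
from pathlib import PurePosixPath


def citations_dir(output_location):
    """Derive the citation output directory by appending '_citations' to the name."""
    path = PurePosixPath(output_location)
    return path.with_name(path.name + "_citations")


def storage_kind(use_redis, storage_path=None):
    """Pick a storage manager label from the Redis flag and the storage path suffix."""
    if use_redis:
        return "redis"
    suffix = PurePosixPath(storage_path or "").suffix.lower()
    if suffix == ".db":
        return "sqlite"
    if suffix == ".json":
        return "in_memory"
    # Assumption: this sketch rejects anything else instead of guessing a default.
    raise ValueError("expected a .db or .json storage path when Redis is not used")


print(citations_dir("/Volumes/my_disk/JALC_OUTPUT"))
print(storage_kind(use_redis=False, storage_path="validated_ids.db"))
```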
To manage a new data source, two main software components need to be developed:
- a script for reading the data source, extracting the bibliographic entities' metadata, and producing the output tables;
- a script for reshaping the metadata of each bibliographic entity according to the OpenCitations Data Model.
In addition, if the data source uses persistent identifiers not yet managed by OpenCitations, a new identifier manager should be developed too.
For each new data source, a Python file should be added to the directory `oc_ds_converter/run/`. The file should be named after the data source and perform the following tasks:
- decompress and read the source dataset;
- manage the identifiers' validation process;
- extract from the source data a data structure representing each bibliographic resource;
- call a source-specific metadata crosswalk method to convert this data structure into an OCDM-compliant dictionary representing the bibliographic resource, to be stored as a CSV row;
- produce the output tables (citations and metadata).
Each source represents bibliographic records according to its own data model, which has to be mapped to OCDM. To do so, we implement a source-specific child class of `RaProcessor` (defined in `oc_ds_converter/ra_processor.py`) for each new data source. The main method of all `RaProcessor` child classes is `csv_creator`, which produces a row for an OpenCitations metadata table from a data structure representing a bibliographic entry according to the source data model. As an example, see the `OpenaireProcessing(RaProcessor)` class (in `oc_ds_converter/openaire/openaire_processing.py`).
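To give an idea of what such a crosswalk method does, the sketch below converts a made-up source record into a row with the metadata table columns. The `ToySourceProcessor` class, the shape of the input record (`doi`, `title`, `creators`, `date`, `kind`) and the `csv_creator` signature are all hypothetical; a real processor would subclass `RaProcessor` and follow its conventions.

```python
OCDM_COLUMNS = ("id", "title", "author", "pub_date", "venue", "volume",
                "issue", "page", "type", "publisher", "editor")


class ToySourceProcessor:
    """Hypothetical processor; a real one would subclass RaProcessor."""

    def csv_creator(self, item):
        """Turn one source record (a dict with invented keys) into a table row."""
        authors = "; ".join(
            f"{creator['family']}, {creator['given']}"
            for creator in item.get("creators", [])
        )
        row = {column: "" for column in OCDM_COLUMNS}
        row.update({
            "id": f"doi:{item['doi'].lower()}",
            "title": item.get("title", ""),
            "author": authors,
            "pub_date": item.get("date", ""),
            "type": item.get("kind", ""),
        })
        return row


record = {
    "doi": "10.9799/KSFAN.2012.25.1.069",
    "creators": [{"family": "Cheigh", "given": "Chan-Ick"},
                 {"family": "Chung", "given": "Myong-Soo"}],
    "date": "2012-3-31",
    "kind": "journal article",
}
row = ToySourceProcessor().csv_creator(record)
print(row["id"], "|", row["author"])
```

Starting from a row with every column set to an empty string keeps the output shape stable even when the source record lacks some fields.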
To add a new ID Manager:
- create a Python file in `oc_ds_converter/oc_idmanager`, named after the ID schema, e.g. `oc_ds_converter/oc_idmanager/viaf.py`;
- create a new class as a subclass of the abstract class `IdentifierManager` (defined in `oc_ds_converter/oc_idmanager/base.py`), e.g. `ViafManager(IdentifierManager)`, thus following the provided template. In particular:
  - define all the required ID-schema-specific methods, i.e.:
    - `syntax_ok`, to check whether the ID complies with the syntax of its schema;
    - `exists`, to check the ID's existence using the ID-specific API;
    - `normalise`, to normalise the identifier string (for example by removing unexpected characters and turning uppercase characters into lowercase);
    - `is_valid`, to assess the overall validity of the identifier;
  - if possible, add further ID-schema-specific methods. For example, some ID schemas (such as ORCID and ISSN) include a check digit, which provides a further means of verifying the ID's validity: in these cases, a `check_digit` method can also be added.
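As an illustration of the check-digit idea, the sketch below verifies an ISSN using the standard modulus 11 scheme with weights 8 to 2. The function names and the standalone form are assumptions; in an actual ID manager this logic would live in a `check_digit` method of the class.

```python
def issn_check_character(first_seven_digits):
    """Compute the ISSN check character from the first seven digits (weights 8..2)."""
    total = sum(int(digit) * weight
                for digit, weight in zip(first_seven_digits, range(8, 1, -1)))
    value = (11 - total % 11) % 11
    return "X" if value == 10 else str(value)


def check_digit(issn):
    """Return True if the last character of the ISSN matches its check character."""
    compact = issn.replace("-", "").upper()
    if len(compact) != 8 or not compact[:7].isdigit():
        return False
    return issn_check_character(compact[:7]) == compact[7]


# The ISSN from the metadata example earlier in this README.
print(check_digit("1225-4339"))
```

For `1225-4339`, the weighted sum of the first seven digits is 90, 90 mod 11 is 2, and 11 minus 2 gives the final digit 9.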
To add a new type of Storage Manager, i.e. one relying on another storage system:
- create a Python file in `oc_ds_converter/oc_idmanager/oc_data_storage`, named after the storage system, e.g. `oc_ds_converter/oc_idmanager/oc_data_storage/redis_manager.py`;
- create a new class as a subclass of the abstract class `StorageManager` (defined in `oc_ds_converter/oc_idmanager/oc_data_storage/storage_manager.py`), e.g. `RedisStorageManager(StorageManager)`, thus following the provided template. In particular, define all the required storage-specific methods, i.e.:
  - `set_value`, to add a single key-value pair to the storage;
  - `set_multi_value`, to store a list of key-value tuples all at once;
  - `get_value`, to retrieve the value associated with a specific key;
  - `del_value`, to delete a key-value pair;
  - `delete_storage`, to delete all the data previously saved in the storage system;
  - `get_all_keys`, to retrieve the list of all the keys in the storage.
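A minimal sketch of a storage manager exposing these six methods is given below, backed by a plain dictionary. The method names come from the list above; the signatures, the `SketchStorageManager` base class and the `DictStorageManager` class are assumptions made for illustration and do not reproduce the code in `storage_manager.py`.

```python
from abc import ABCMeta, abstractmethod


class SketchStorageManager(metaclass=ABCMeta):
    """Hypothetical stand-in for the abstract base class (signatures assumed)."""

    @abstractmethod
    def set_value(self, key, value): ...

    @abstractmethod
    def set_multi_value(self, pairs): ...

    @abstractmethod
    def get_value(self, key): ...

    @abstractmethod
    def del_value(self, key): ...

    @abstractmethod
    def delete_storage(self): ...

    @abstractmethod
    def get_all_keys(self): ...


class DictStorageManager(SketchStorageManager):
    """Toy storage manager keeping everything in a Python dictionary."""

    def __init__(self):
        self._data = {}

    def set_value(self, key, value):
        self._data[key] = value

    def set_multi_value(self, pairs):
        # 'pairs' is a list of (key, value) tuples stored in one call.
        self._data.update(pairs)

    def get_value(self, key):
        # Unknown keys yield None rather than raising.
        return self._data.get(key)

    def del_value(self, key):
        self._data.pop(key, None)

    def delete_storage(self):
        self._data.clear()

    def get_all_keys(self):
        return list(self._data)


storage = DictStorageManager()
storage.set_value("issn:1225-4339", True)
storage.set_multi_value([("doi:10.1/a", True), ("doi:10.1/b", False)])
storage.del_value("doi:10.1/a")
print(storage.get_all_keys())
```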
The repository is managed with Poetry. To activate the virtual environment:
poetry shell
To add a package as a dependency to the project:
poetry add <package>
To run all tests with poetry:
poetry run test
To run specific tests:
python -m unittest discover -s test -p "*.py"
Distributed under the ISC License. See `LICENSE` for more information.
- Arianna Moretti - @ariannamorettj - [email protected]
- Arcangelo Massari - @arcangelo7 - [email protected]
- Elia Rizzetto - @eliarizzetto - [email protected]
- Marta Soricetti - @martasoricetti - [email protected]
Project Link: https://github.com/opencitations/oc_ds_converter
This project has been developed under the supervision of Prof. Silvio Peroni.
- Silvio Peroni - @essepuntato - [email protected]