Dataverse Metadata Crawler

📜Description

A Python CLI tool for extracting and exporting metadata from Dataverse repositories. It supports bulk extraction of metadata for dataverses, datasets, and data files from any chosen level of a Dataverse collection (an entire Dataverse repository or a sub-Dataverse), with flexible export options to JSON and CSV formats.

✨Features

Bulk metadata extraction from Dataverse repositories at any chosen level of collection (top level or selected collection)
JSON & CSV file export options

📦Prerequisites

Git
Python 3.10+

⚙️Installation

Clone the repository

git clone https://github.com/scholarsportal/dataverse-metadata-crawler.git

Change to the project directory
```
cd ./dataverse-metadata-crawler
```

Create an environment file (.env)

touch .env  # For Unix/MacOS
nano .env   # or vim .env, or your preferred editor
# OR
New-Item .env -Type File   # For Windows (PowerShell)
notepad .env

Configure the environment (.env) file using the text editor of your choice.

# .env file
BASE_URL = "TARGET_REPO_URL"  # Base URL of the repository; e.g., "https://demo.borealisdata.ca/"
API_KEY = "YOUR_API_KEY"      # Found in your Dataverse account settings. Can also be passed on the command line with the -a flag.

Your .env file should look like this:

BASE_URL = "https://demo.borealisdata.ca/"
API_KEY = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXX"
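
For readers curious how these two values could be picked up at run time, the sketch below parses `.env`-style text using only the standard library. It is an illustration under assumptions, not the crawler's actual loading code: `parse_env_text` is a hypothetical helper, and the tool itself may rely on a dedicated library instead.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY = "value" lines (hypothetical helper, not part of dvmeta)."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, full-line comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        if value[:1] in ('"', "'"):
            # Quoted value: keep everything up to the matching closing quote
            quote = value[0]
            end = value.find(quote, 1)
            value = value[1:end] if end != -1 else value[1:]
        else:
            # Unquoted value: drop any trailing inline comment
            value = value.split("#", 1)[0].strip()
        values[key.strip()] = value
    return values


sample = '''
# .env file
BASE_URL = "https://demo.borealisdata.ca/"
API_KEY = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXX"
'''
config = parse_env_text(sample)
print(config["BASE_URL"])
```

Quoting the values, as in the example file above, keeps an inline `#` comment from being read as part of the value.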

Set up virtual environment (recommended)

python3 -m venv .venv
source .venv/bin/activate     # For Unix/MacOS
# OR
.venv\Scripts\activate       # For Windows

Install dependencies
```
pip install -r requirements.txt
```

🛠️Usage

Basic Command

python3 dvmeta/main.py [-a AUTH] [-l] [-d] [-p] [-f] [-e] [-s] -c COLLECTION_ALIAS -v VERSION

Required arguments:

| Option | Short | Type | Description | Default |
|--------|-------|------|-------------|---------|
| --collection_alias | -c | TEXT | The alias of the collection to crawl. See the guide here to learn how to find the collection alias. [required] | None (required) |
| --version | -v | TEXT | The dataset version to crawl. Options include:<br>• `draft` - The draft version, if any<br>• `latest` - Either a draft (if one exists) or the latest published version<br>• `latest-published` - The latest published version<br>• `x.y` - A specific version<br>[required] | None (required) |
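
The accepted `--version` values can be checked before launching a long crawl. The sketch below is a minimal illustration, assuming that `x.y` means two non-negative integers separated by a dot; `is_valid_version` is a hypothetical helper, and the crawler may validate its input differently.

```python
import re

# The three named options listed for --version
NAMED_VERSIONS = {"draft", "latest", "latest-published"}
# Assumption: "x.y" is two integers separated by a dot, e.g. "1.0"
NUMBERED_VERSION = re.compile(r"\d+\.\d+")


def is_valid_version(value: str) -> bool:
    """Return True if value is a named option or looks like x.y."""
    return value in NAMED_VERSIONS or NUMBERED_VERSION.fullmatch(value) is not None


for candidate in ["latest", "1.0", "2", "published"]:
    print(candidate, is_valid_version(candidate))
```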

Optional arguments:

| Option | Short | Type | Description | Default |
|--------|-------|------|-------------|---------|
| --auth | -a | TEXT | Authentication token to access the Dataverse repository. | None |
| --log --no-log | -l | | Output a log file. Use `--no-log` to disable logging. | `log` (unless `--no-log`) |
| --dvdfds_metadata | -d | | Output a JSON file containing metadata of Dataverses, Datasets, and Data Files. | |
| --permission | -p | | Output a JSON file that stores permission metadata for all Datasets in the repository. | |
| --emptydv | -e | | Output a JSON file that stores all Dataverses which do not contain Datasets (though they might have child Dataverses which have Datasets). | |
| --failed | -f | | Output a JSON file of Dataverses/Datasets that failed to be crawled. | |
| --spreadsheet | -s | | Output a CSV file of the metadata of Datasets. | |
| --help | | | Show the help message. | |
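
The notion of an "empty" Dataverse used by `--emptydv` can be made concrete with a small example. The sketch below walks a hypothetical in-memory collection tree and reports every Dataverse that holds no Datasets directly, even when a child Dataverse does. The tree layout, the aliases, and `find_empty_dataverses` are all assumptions made for illustration; the crawler's real data structures are not shown in this README.

```python
# Hypothetical collection tree: alias -> direct datasets and child dataverses
tree = {
    "root": {"datasets": ["dataset-1"], "children": ["lab_a", "lab_b"]},
    "lab_a": {"datasets": [], "children": ["lab_a_sub"]},
    "lab_a_sub": {"datasets": ["dataset-2"], "children": []},
    "lab_b": {"datasets": [], "children": []},
}


def find_empty_dataverses(tree: dict, start: str) -> list[str]:
    """Return aliases at or below `start` that contain no datasets directly."""
    empty = []
    stack = [start]
    while stack:
        alias = stack.pop()
        node = tree[alias]
        if not node["datasets"]:
            empty.append(alias)
        # Keep descending: an empty dataverse may still have populated children
        stack.extend(node["children"])
    return sorted(empty)


print(find_empty_dataverses(tree, "root"))  # ['lab_a', 'lab_b']
```

Here `lab_a` counts as empty even though its child `lab_a_sub` contains a Dataset, which matches the description of the flag.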

Examples

# Export the metadata of the latest version of all datasets under collection 'demo' to JSON
python3 dvmeta/main.py -c demo -v latest -d

# Export the metadata of version 1.0 of all datasets under collection 'demo' to JSON and CSV
python3 dvmeta/main.py -c demo -v 1.0 -d -s

# Export the metadata and permission metadata of version 1.0 of all datasets under collection 'demo' to JSON and CSV, with the API token passed on the command line
python3 dvmeta/main.py -c demo -v 1.0 -d -s -p -a xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx
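
When crawling several collections from a script, it can help to assemble the command programmatically. The sketch below builds the argument list from the flags documented under Required and Optional arguments; `build_command` and its keyword names are hypothetical and only cover a subset of the flags.

```python
import shlex


def build_command(collection_alias, version, *, metadata=False,
                  spreadsheet=False, permission=False, auth=None):
    """Assemble an argument list for dvmeta/main.py (hypothetical helper)."""
    command = ["python3", "dvmeta/main.py", "-c", collection_alias, "-v", version]
    if metadata:
        command.append("-d")      # JSON metadata export
    if spreadsheet:
        command.append("-s")      # CSV export
    if permission:
        command.append("-p")      # permission metadata export
    if auth is not None:
        command.extend(["-a", auth])  # API token passed on the command line
    return command


command = build_command("demo", "1.0", metadata=True, spreadsheet=True)
print(shlex.join(command))  # python3 dvmeta/main.py -c demo -v 1.0 -d -s
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the project directory, which avoids shell quoting issues with the token.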

📂Output Structure

| File | Description |
|------|-------------|
| ds_metadata_yyyymmdd-HHMMSS.json | Dataset representations and data file metadata in JSON format. |
| empty_dv_yyyymmdd-HHMMSS.json | The IDs of empty Dataverses, in list format. |
| failed_metadata_uris_yyyymmdd-HHMMSS.json | The URIs (URLs) of datasets that failed to download. |
| permission_dict_yyyymmdd-HHMMSS.json | The permission metadata of datasets, with their dataset IDs. |
| pid_dict_yyyymmdd-HHMMSS.json | Basic dataset information with a hierarchical information dictionary. Only exported if the -p (permission) flag is used without the -d (metadata) flag. |
| pid_dict_dd_yyyymmdd-HHMMSS.json | The hierarchical information of deaccessioned/draft datasets. |
| ds_metadata_yyyymmdd-HHMMSS.csv | Datasets and their data files' metadata in CSV format. |
| log_yyyymmdd-HHMMSS.txt | Summary of the crawl. |

exported_files/
├── json_files/
│   ├── ds_metadata_yyyymmdd-HHMMSS.json # With -d flag enabled
│   ├── empty_dv_yyyymmdd-HHMMSS.json # With -e flag enabled
│   ├── failed_metadata_uris_yyyymmdd-HHMMSS.json # With -f flag enabled
│   ├── permission_dict_yyyymmdd-HHMMSS.json # With only -p flag enabled
│   ├── pid_dict_yyyymmdd-HHMMSS.json # With only -p flag enabled
│   └── pid_dict_dd_yyyymmdd-HHMMSS.json # Hierarchical information of deaccessioned/draft datasets
├── csv_files/
│   └── ds_metadata_yyyymmdd-HHMMSS.csv # With -s flag enabled
└── logs_files/
    └── log_yyyymmdd-HHMMSS.txt # Exported by default, unless --no-log is specified
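
Because every export carries a `yyyymmdd-HHMMSS` suffix, file names sort chronologically, which makes it easy to pick the most recent export in a post-processing script. The sketch below assumes the suffix corresponds to the `strftime` pattern `%Y%m%d-%H%M%S`; `timestamped_name` and `newest_export` are hypothetical helpers, not functions provided by the crawler.

```python
import re
from datetime import datetime


def timestamped_name(prefix: str, extension: str, moment: datetime) -> str:
    """Build a name such as ds_metadata_20250131-090507.json."""
    return f"{prefix}_{moment:%Y%m%d-%H%M%S}.{extension}"


def newest_export(names, prefix):
    """Return the latest file name for an exact prefix, or None if absent."""
    # Match the prefix exactly so "pid_dict" does not also match "pid_dict_dd"
    pattern = re.compile(rf"{re.escape(prefix)}_\d{{8}}-\d{{6}}\.\w+")
    matches = [name for name in names if pattern.fullmatch(name)]
    # Zero-padded timestamps compare correctly as plain strings
    return max(matches, default=None)


names = [
    "pid_dict_20250101-000000.json",
    "pid_dict_dd_20250201-000000.json",
    "pid_dict_20250115-120000.json",
]
print(timestamped_name("ds_metadata", "json", datetime(2025, 1, 31, 9, 5, 7)))
print(newest_export(names, "pid_dict"))
```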

⚠️Disclaimer

Warning

To retrieve data about unpublished datasets or information that is not publicly available (e.g., collaborators or permissions), you will need the necessary access rights. Please note that any publication or use of non-publicly available data may require review by a Research Ethics Board.

✅Tests

No tests have been written yet. Contributions welcome!

💻Development

Dependency management: poetry - Use poetry to manage dependencies and reflect changes in the pyproject.toml file.
Linter: ruff - Follow the linting rules outlined in the pyproject.toml file.

🙌Contributing

Fork the repository
Create a feature branch
Submit a pull request

📄License

MIT

🆘Support

Create an issue in the GitHub repository

📚Citation

If you use this software in your work, please cite it using the following metadata.

APA:

Lui, L. H. (2025). Dataverse Metadata Crawler (Version 0.1.1) [Computer software]. https://github.com/scholarsportal/dataverse-metadata-crawler

BibTeX:

@software{Lui_Dataverse_Metadata_Crawler_2025,
  author = {Lui, Lok Hei},
  month = {jan},
  title = {Dataverse Metadata Crawler},
  url = {https://github.com/scholarsportal/dataverse-metadata-crawler},
  version = {0.1.1},
  year = {2025}
}

✍️Authors

Ken Lui - Data Curation Specialist, Map and Data Library, University of Toronto
