API to extract rows of 🤗 datasets
Requirements:

- Python 3.8+
- Poetry 1.1.7+
- make
- libicu-dev
Install with:

```bash
make install
```

Launch with:

```bash
make run
```
Set environment variables to configure the following aspects:

- `APP_HOSTNAME`: the hostname used by the app. Defaults to `"localhost"`.
- `APP_PORT`: the port used by the app. Defaults to `8000`.
- `ASSETS_DIRECTORY`: directory where the asset files are stored. Defaults to empty, in which case the assets are located in the `datasets_preview_backend_assets` subdirectory inside the OS default cache directory.
- `DATASETS_ENABLE_PRIVATE`: enable private datasets. Defaults to `False`.
- `DATASETS_REVISION`: git reference for the canonical datasets on https://github.com/huggingface/datasets. Defaults to `master`.
- `LOG_LEVEL`: log level, among `DEBUG`, `INFO`, `WARNING`, `ERROR` and `CRITICAL`. Defaults to `INFO`.
- `MAX_AGE_LONG_SECONDS`: number of seconds to set in the `max-age` header on data endpoints. Defaults to `120` (2 minutes).
- `MAX_AGE_SHORT_SECONDS`: number of seconds to set in the `max-age` header on technical endpoints. Defaults to `10` (10 seconds).
- `MONGO_CACHE_DATABASE`: the name of the database used for storing the cache. Defaults to `"datasets_preview_cache"`.
- `MONGO_QUEUE_DATABASE`: the name of the database used for storing the queue. Defaults to `"datasets_preview_queue"`.
- `MONGO_URL`: the URL used to connect to the MongoDB server. Defaults to `"mongodb://localhost:27018"`.
- `ROWS_MAX_BYTES`: maximum size in bytes of the `/rows` endpoint response. Defaults to `1_000_000` (1 MB).
- `ROWS_MAX_NUMBER`: maximum number of rows in the `/rows` endpoint response. Defaults to `100`.
- `WEB_CONCURRENCY`: the number of workers. For now, it is ignored and hardcoded to 1 because the cache is not shared yet. Defaults to `2`.
For example:

```bash
APP_PORT=80 WEB_CONCURRENCY=4 make run
```
To reload the application on file changes while developing, run:

```bash
make watch
```
To launch a worker, which will take jobs from the queue:

```bash
MAX_LOAD_PCT=50 MAX_MEMORY_PCT=60 WORKER_SLEEP_SECONDS=5 make worker
```
When idle, every `WORKER_SLEEP_SECONDS` (defaults to 5 seconds) the worker checks if resources are available and, if it could get a job from the queue, updates the cache entry for a dataset, then loops to start again. The resources are considered available if all of the following conditions are met (see the sketch after this list):

- the load percentage (the max of the 1m/5m/15m load averages divided by the number of CPUs, times 100) is below `MAX_LOAD_PCT` (defaults to 50%)
- the memory usage (RAM + SWAP) on the machine is below `MAX_MEMORY_PCT` (defaults to 60%)
- the number of started jobs for the same dataset is under `MAX_JOBS_PER_DATASET`
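For illustration, here is a minimal Python sketch of such a resource check. It assumes the `psutil` package for the memory figures; it is not the actual worker code.

```python
import os

import psutil  # assumption: used here for memory stats, not necessarily the worker's approach

MAX_LOAD_PCT = int(os.environ.get("MAX_LOAD_PCT", 50))
MAX_MEMORY_PCT = int(os.environ.get("MAX_MEMORY_PCT", 60))


def has_resources() -> bool:
    # Load percentage: max of the 1m/5m/15m load averages,
    # divided by the number of CPUs, times 100.
    load_pct = max(os.getloadavg()) / (os.cpu_count() or 1) * 100
    if load_pct >= MAX_LOAD_PCT:
        return False
    # Memory percentage: used (RAM + SWAP) over total (RAM + SWAP), times 100.
    ram = psutil.virtual_memory()
    swap = psutil.swap_memory()
    memory_pct = (ram.used + swap.used) / (ram.total + swap.total) * 100
    return memory_pct < MAX_MEMORY_PCT
```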
Also specify `HF_TOKEN` with a User Access Token (see https://huggingface.co/settings/token; only the `read` role is required) to allow the worker to download gated models from the Hub. Defaults to empty.
Also specify `MAX_SIZE_FALLBACK` with the maximum size in bytes of the dataset to fall back to normal (non-streaming) mode if streaming fails. Note that this requires the size to be present in the info metadata. Set to `0` to disable the fallback. Defaults to `100_000_000`.
The `WORKER_QUEUE` variable specifies which jobs queue the worker will pull jobs from. It can be equal to `datasets` (default) or `splits`. The `datasets` jobs should be a lot faster than the `splits` ones, so a lot more workers should be needed for `splits` than for `datasets`.
To warm the cache, i.e. add all the missing Hugging Face datasets to the queue:

```bash
make warm
```
To empty the databases:

```bash
make clean
```

or individually:

```bash
make clean-cache
make clean-queues # delete all the jobs
```
See also:

```bash
make cancel-started-jobs
make cancel-waiting-jobs
```
Ensure the app is running.
Example: https://datasets-preview.huggingface.tech/healthcheck
Method: `GET`
Parameters: none
Responses:
- `200`: text content `ok`
Give statistics about the content of the cache.
Example: https://datasets-preview.huggingface.tech/cache
Method: `GET`
Parameters: none
Responses:
- `200`: JSON content which gives statistics about the datasets in the cache, with the following structure:

```json
{
  "datasets": { "empty": 0, "error": 0, "stalled": 0, "valid": 1 },
  "splits": { "empty": 1, "error": 1, "stalled": 0, "valid": 0 },
  "created_at": "2022-01-20T14:40:50Z"
}
```

Beware: a "dataset" is considered valid if its configs and splits have been fetched correctly. The splits themselves can still have errors (i.e. their rows or columns might have errors).
Give detailed reports on the content of the cache.
Example: https://datasets-preview.huggingface.tech/cache-reports
Method: `GET`
Parameters: none
Responses:
- `200`: JSON content which gives the dataset cache reports, with the following structure:

```json
{
"datasets": {
"empty": [],
"error": [],
"stalled": [],
"valid": [{ "dataset": "sent_comp", "status": "VALID", "error": null }]
},
"splits": {
"empty": [
{
"dataset": "sent_comp",
"config": "default",
"split": "train",
"status": "EMPTY",
"error": null
}
],
"error": [
{
"dataset": "sent_comp",
"config": "default",
"split": "validation",
"status": "error",
"error": {
"status_code": 400,
"exception": "Status400Error",
"message": "Cannot get the first rows for the split.",
"cause_exception": "FileNotFoundError",
"cause_message": "[Errno 2] No such file or directory: 'https://github.com/google-research-datasets/sentence-compression/raw/master/data/comp-data.eval.json.gz'",
"cause_traceback": [
"Traceback (most recent call last):\n",
" File \"/home/slesage/hf/datasets-preview-backend/src/datasets_preview_backend/models/row.py\", line 61, in get_rows\n rows = extract_rows(dataset_name, config_name, split_name, num_rows, hf_token)\n",
" File \"/home/slesage/hf/datasets-preview-backend/src/datasets_preview_backend/models/row.py\", line 32, in decorator\n return func(*args, **kwargs)\n",
" File \"/home/slesage/hf/datasets-preview-backend/src/datasets_preview_backend/models/row.py\", line 55, in extract_rows\n return list(iterable_dataset.take(num_rows))\n",
" File \"/home/slesage/hf/datasets-preview-backend/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py\", line 341, in __iter__\n for key, example in self._iter():\n",
" File \"/home/slesage/hf/datasets-preview-backend/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py\", line 338, in _iter\n yield from ex_iterable\n",
" File \"/home/slesage/hf/datasets-preview-backend/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py\", line 273, in __iter__\n yield from islice(self.ex_iterable, self.n)\n",
" File \"/home/slesage/hf/datasets-preview-backend/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py\", line 78, in __iter__\n for key, example in self.generate_examples_fn(**self.kwargs):\n",
" File \"/home/slesage/.cache/huggingface/modules/datasets_modules/datasets/sent_comp/512501fef5db888ec620cb9e4943420ea7c7c244c60de9222fb50bca1232f4b5/sent_comp.py\", line 136, in _generate_examples\n with gzip.open(filepath, mode=\"rt\", encoding=\"utf-8\") as f:\n",
" File \"/home/slesage/.pyenv/versions/3.9.6/lib/python3.9/gzip.py\", line 58, in open\n binary_file = GzipFile(filename, gz_mode, compresslevel)\n",
" File \"/home/slesage/.pyenv/versions/3.9.6/lib/python3.9/gzip.py\", line 173, in __init__\n fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')\n",
"FileNotFoundError: [Errno 2] No such file or directory: 'https://github.com/google-research-datasets/sentence-compression/raw/master/data/comp-data.eval.json.gz'\n"
]
}
}
],
"stalled": [],
"valid": []
},
"created_at": "2022-01-20T14:40:27Z"
}
```
Beware: a "dataset" is considered valid if it has fetched correctly the configs and splits. The splits themselves can have errors (ie: the rows or columns might have errors)
Give the list of the valid datasets. Here, a dataset is considered valid if `/splits` returns a valid response, and if `/rows` returns a valid response for at least one split. Note that stalled cache entries are considered valid.
Example: https://datasets-preview.huggingface.tech/valid
Method: `GET`
Parameters: none
Responses:
- `200`: JSON content which gives the list of the datasets per status, with the following structure:

```json
{
  "valid": ["discovery"],
  "created_at": "2021-10-07T13:33:46Z"
}
```
Tells if a dataset is valid. A dataset is considered valid if `/splits` returns a valid response, and if `/rows` returns a valid response for at least one split. Note that stalled cache entries are considered valid.
Example: https://datasets-preview.huggingface.tech/is-valid?dataset=glue
Method: `GET`
Parameters:
- `dataset` (required): the dataset ID
Responses:
- `200`: JSON content which tells if the dataset is valid or not:

```json
{
  "valid": true
}
```
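For example, a quick validity check from Python before calling `/rows` (a minimal sketch, assuming the `requests` package is installed):

```python
import requests

# Ask the API whether the "glue" dataset has a valid cache entry.
response = requests.get(
    "https://datasets-preview.huggingface.tech/is-valid",
    params={"dataset": "glue"},
)
response.raise_for_status()
print(response.json()["valid"])  # True or False
```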
Give statistics about the datasets of the Hub.
Example: https://datasets-preview.huggingface.tech/hf-datasets-count-by-cache-status
Method: `GET`
Parameters: none
Responses:
- `200`: JSON content which gives statistics about the status of the public datasets of the Hub, split by canonical or community, with the following structure:

```json
{
"canonical": {
"valid": 0,
"error": 0,
"missing": 1
},
"community": {
"valid": 0,
"error": 0,
"missing": 1
},
"created_at": "2022-01-20T13:52:05Z"
}
```
The meaning is the following:
- "valid": the list of splits and the first 100 rows of every split are available (maybe stalled)
- "error": the list of splits could not be fetched, or the rows could not be fetched for some of the splits
- "missing": the list of splits is missing, or the rows are missing for some of the splits
Give statistics about the content of the queue.
Example: https://datasets-preview.huggingface.tech/queue
Method: `GET`
Parameters: none
Responses:
- `200`: JSON content which gives statistics about the queue, with the following structure:

```json
{
"datasets": {
"waiting": 0,
"started": 0,
"success": 1,
"error": 0,
"cancelled": 0
},
"splits": {
"waiting": 0,
"started": 0,
"success": 0,
"error": 0,
"cancelled": 34
},
"created_at": "2022-01-20T13:52:05Z"
}
```
Give the queue entries, grouped by status.
Example: https://datasets-preview.huggingface.tech/queue-dump
Method: `GET`
Parameters: none
Responses:
- `200`: JSON content which gives the queue content, by status, with the following structure:

```json
{
"datasets": {
"waiting": [],
"started": [],
"success": [
{
"dataset_name": "glue",
"status": "SUCCESS",
"created_at": "2022-01-20T13:48:06.705000",
"started_at": "2022-01-20T13:48:21.615000",
"finished_at": "2022-01-20T13:48:27.898000"
}
],
"error": [],
"cancelled": []
},
"splits": {
"waiting": [],
"started": [],
"success": [],
"error": [],
"cancelled": [
{
"dataset_name": "glue",
"config_name": "cola",
"split_name": "test",
"status": "CANCELLED",
"created_at": "2022-01-20T13:48:27.846000",
"started_at": null,
"finished_at": "2022-01-20T13:51:51.411000"
}
]
},
"created_at": "2022-01-20T13:59:03Z"
}
```
Give the queue entries, grouped by status, only for the "waiting" and "started" statuses.
Example: https://datasets-preview.huggingface.tech/queue-dump-waiting-started
Method: `GET`
Parameters: none
Responses:
- `200`: JSON content which gives the queue content, by status, with the following structure:

```json
{
"datasets": {
"waiting": [],
"started": []
},
"splits": {
"waiting": [],
"started": []
},
"created_at": "2022-01-20T13:59:03Z"
}
```
Adds, updates or removes a cache entry.
Example: https://datasets-preview.huggingface.tech/webhook
Method: `POST`
Body:

```json
{
  "add": "datasets/dataset1",
  "update": "datasets/dataset1",
  "remove": "datasets/dataset1"
}
```

The three keys are optional, and moonlanding should send only one of them. The dataset identifiers are full names, i.e. they must include the `datasets/` prefix, which means that a community dataset will have two slashes: `datasets/allenai/c4`, for example.
Responses:
- `200`: JSON content with the following structure: `{ "status": "ok" }`
- `400`: the payload is erroneous, or a 400 error was raised during the cache operation
- `500`: application error
Note: if you want to refresh multiple datasets at a time, you have to call the endpoint again and again. You can use bash, for example:

```bash
MODELS=(amazon_polarity ami arabic_billion_words)
for model in "${MODELS[@]}"; do
  curl -X POST https://datasets-preview.huggingface.tech/webhook \
    -H 'Content-Type: application/json' \
    -d '{"update": "datasets/'$model'"}'
done
```
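The same loop in Python (a minimal sketch, assuming the `requests` package; the dataset names are only examples):

```python
import requests

datasets = ["amazon_polarity", "ami", "arabic_billion_words"]

for dataset in datasets:
    # Ask the backend to refresh the cache entry for this dataset.
    response = requests.post(
        "https://datasets-preview.huggingface.tech/webhook",
        json={"update": f"datasets/{dataset}"},
    )
    response.raise_for_status()
```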
Refresh the cache of rows and columns of a split.
Example: https://datasets-preview.huggingface.tech/refresh-split
Method: `POST`
Body:

```json
{
  "dataset": "glue",
  "config": "ax",
  "split": "test"
}
```
Responses:
- `200`: JSON content with the following structure: `{ "status": "ok" }`
- `400`: the payload is erroneous, or a 400 error was raised during the cache operation
- `500`: application error
Lists the Hugging Face datasets: canonical and community.
Example: https://datasets-preview.huggingface.tech/hf_datasets
Method: `GET`
Parameters: none
Responses:
- `200`: JSON content with the following structure:

```json
{
  "datasets": [
    {
      "id": "acronym_identification",
      "tags": [
        "annotations_creators:expert-generated",
        "language_creators:found",
        "languages:en",
        "licenses:mit",
        "multilinguality:monolingual",
        "size_categories:10K<n<100K",
        "source_datasets:original",
        "task_categories:structure-prediction",
        "task_ids:structure-prediction-other-acronym-identification"
      ],
      "citation": "@inproceedings{veyseh-et-al-2020-what,\n title={{What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation}},\n author={Amir Pouran Ben Veyseh and Franck Dernoncourt and Quan Hung Tran and Thien Huu Nguyen},\n year={2020},\n booktitle={Proceedings of COLING},\n link={https://arxiv.org/pdf/2010.14678v1.pdf}\n}",
      "description": "Acronym identification training and development sets for the acronym identification task at SDU@AAAI-21.",
      "paperswithcode_id": "acronym-identification",
      "downloads": 5174
    },
    {
      "id": "aeslc",
      "tags": ["languages:en"],
      "citation": "@misc{zhang2019email,\n title={This Email Could Save Your Life: Introducing the Task of Email Subject Line Generation},\n author={Rui Zhang and Joel Tetreault},\n year={2019},\n eprint={1906.03497},\n archivePrefix={arXiv},\n primaryClass={cs.CL}\n}",
      "description": "A collection of email messages of employees in the Enron Corporation.\n\nThere are two features:\n - email_body: email body text.\n - subject_line: email subject text.",
      "paperswithcode_id": "aeslc",
      "downloads": 3053
    },
    {
      "id": "afrikaans_ner_corpus",
      "tags": [
        "annotations_creators:expert-generated",
        "language_creators:expert-generated",
        "languages:af",
        "licenses:other-Creative Commons Attribution 2.5 South Africa License",
        "multilinguality:monolingual",
        "size_categories:1K<n<10K",
        "source_datasets:original",
        "task_categories:structure-prediction",
        "task_ids:named-entity-recognition"
      ],
      "citation": "@inproceedings{afrikaans_ner_corpus,\n author = {\tGerhard van Huyssteen and\n Martin Puttkammer and\n E.B. Trollip and\n J.C. Liversage and\n Roald Eiselen},\n title = {NCHLT Afrikaans Named Entity Annotated Corpus},\n booktitle = {Eiselen, R. 2016. Government domain named entity recognition for South African languages. Proceedings of the 10th Language Resource and Evaluation Conference, Portorož, Slovenia.},\n year = {2016},\n url = {https://repo.sadilar.org/handle/20.500.12185/299},\n}",
      "description": "Named entity annotated data from the NCHLT Text Resource Development: Phase II Project, annotated with PERSON, LOCATION, ORGANISATION and MISCELLANEOUS tags.",
      "paperswithcode_id": null,
      "downloads": 229
    }
  ]
}
```

- `500`: application error
Lists the split names for a dataset.
Example: https://datasets-preview.huggingface.tech/splits?dataset=glue
Method: `GET`
Parameters:
- `dataset` (required): the dataset ID
Responses:
- `200`: JSON content with the following structure:

```json
{
  "splits": [
    {
      "dataset": "glue",
      "config": "cola",
      "split": "test",
      "num_bytes": 217556,
      "num_examples": 1821
    },
    {
      "dataset": "glue",
      "config": "cola",
      "split": "train",
      "num_bytes": 4715283,
      "num_examples": 67349
    },
    {
      "dataset": "glue",
      "config": "cola",
      "split": "validation",
      "num_bytes": 106692,
      "num_examples": 872
    }
  ]
}
```

- `400`: the dataset script is erroneous
- `404`: the dataset or config cannot be found, or it's not in the cache
- `500`: application error

Note that the values of `"num_bytes"` and `"num_examples"` are set to `null` if the data is not available.
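For example, a minimal sketch that sums the example counts per config, skipping `null` values (assuming the `requests` package):

```python
from collections import defaultdict

import requests

response = requests.get(
    "https://datasets-preview.huggingface.tech/splits",
    params={"dataset": "glue"},
)
response.raise_for_status()

# Sum the number of examples per config, skipping entries where
# "num_examples" is null (None in Python), i.e. not available.
examples_per_config = defaultdict(int)
for split in response.json()["splits"]:
    if split["num_examples"] is not None:
        examples_per_config[split["config"]] += split["num_examples"]

print(dict(examples_per_config))
```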
Extract the first rows for a split of a dataset config.
Example: https://datasets-preview.huggingface.tech/rows?dataset=glue&config=ax&split=test
Method: `GET`
Parameters:
- `dataset` (required): the dataset ID
- `config` (required): the configuration name
- `split` (required): the split name
Responses:
- `200`: JSON content that provides the types of the columns (see features at https://huggingface.co/docs/datasets/about_dataset_features.html) and the data rows, with the following structure. Note that the features are ordered and this order can be used to display the columns in a table, for example. Binary values are transmitted as UTF-8 encoded base64 strings. The number of rows depends on `ROWS_MAX_BYTES` and `ROWS_MAX_NUMBER`.

```json
{
  "columns": [
    {
      "dataset": "glue",
      "config": "ax",
      "split": "test",
      "column_idx": 0,
      "column": { "name": "premise", "type": "STRING" }
    },
    {
      "dataset": "glue",
      "config": "ax",
      "split": "test",
      "column_idx": 1,
      "column": { "name": "hypothesis", "type": "STRING" }
    },
    {
      "dataset": "glue",
      "config": "ax",
      "split": "test",
      "column_idx": 2,
      "column": {
        "name": "label",
        "type": "CLASS_LABEL",
        "labels": ["entailment", "neutral", "contradiction"]
      }
    },
    {
      "dataset": "glue",
      "config": "ax",
      "split": "test",
      "column_idx": 3,
      "column": { "name": "idx", "type": "INT" }
    }
  ],
  "rows": [
    {
      "dataset": "glue",
      "config": "ax",
      "split": "test",
      "row_idx": 0,
      "row": {
        "premise": "The cat sat on the mat.",
        "hypothesis": "The cat did not sit on the mat.",
        "label": -1,
        "idx": 0
      }
    },
    {
      "dataset": "glue",
      "config": "ax",
      "split": "test",
      "row_idx": 1,
      "row": {
        "premise": "The cat did not sit on the mat.",
        "hypothesis": "The cat sat on the mat.",
        "label": -1,
        "idx": 1
      }
    },
    {
      "dataset": "glue",
      "config": "ax",
      "split": "test",
      "row_idx": 2,
      "row": {
        "premise": "When you've got no snow, it's really hard to learn a snow sport so we looked at all the different ways I could mimic being on snow without actually being on snow.",
        "hypothesis": "When you've got snow, it's really hard to learn a snow sport so we looked at all the different ways I could mimic being on snow without actually being on snow.",
        "label": -1,
        "idx": 2
      }
    }
  ]
}
```

- `400`: the dataset script is erroneous, or the data cannot be obtained
- `404`: the dataset, config or script cannot be found, or it's not in the cache
- `500`: application error
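A minimal client sketch that fetches the first rows and prints them in column order (assuming the `requests` package):

```python
import requests

response = requests.get(
    "https://datasets-preview.huggingface.tech/rows",
    params={"dataset": "glue", "config": "ax", "split": "test"},
)
response.raise_for_status()
content = response.json()

# The columns are ordered; use that order to lay out a table.
column_names = [column["column"]["name"] for column in content["columns"]]
print("\t".join(column_names))
for item in content["rows"]:
    print("\t".join(str(item["row"][name]) for name in column_names))
```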
Return an asset.
Example: https://datasets-preview.huggingface.tech/assets/food101/--/default/train/0/image/2885220.jpg
Method: `GET`
Path parameters: `/assets/:dataset/--/:config/:split/:row_idx/:column/:filename`
- `dataset` (required): the dataset ID
- `config` (required): the configuration name. If the dataset does not contain configs, you must explicitly pass "config=default"
- `split` (required): the split name
- `row_idx` (required): the 0-based row index
- `column` (required): the column name
- `filename` (required): the asset file name
Responses:
- `200`: the asset file
- `400`: the dataset script is erroneous, or the data cannot be obtained
- `404`: the dataset, config, script, row, column, filename or data cannot be found
- `500`: application error
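For instance, a minimal sketch that downloads the example asset above to a local file (assuming the `requests` package):

```python
import requests

# Asset URL built from the path parameters:
# /assets/:dataset/--/:config/:split/:row_idx/:column/:filename
url = (
    "https://datasets-preview.huggingface.tech/assets/"
    "food101/--/default/train/0/image/2885220.jpg"
)

response = requests.get(url)
response.raise_for_status()
with open("2885220.jpg", "wb") as f:
    f.write(response.content)
```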