Worker that pre-computes and caches the response to /splits, /first-rows or /parquet-and-dataset-info.
Use environment variables to configure the worker. The prefix of each environment variable gives its scope.
Set environment variables to configure the datasets-based worker (`DATASETS_BASED_` prefix):

- `DATASETS_BASED_ENDPOINT`: the endpoint on which the worker will work (pre-compute and cache the response). The same worker is used for the different endpoints to reuse shared code and dependencies, but at runtime each worker is assigned only one endpoint. Allowed values: `/splits`, `/first-rows`, and `/parquet-and-dataset-info`. Defaults to `/splits`.
- `DATASETS_BASED_HF_DATASETS_CACHE`: directory where the `datasets` library will store the cached datasets' data. If not set, the `datasets` library will choose the default location. Defaults to None.
Also, set the modules cache configuration for the datasets-based worker. See ../../libs/libcommon/README.md. Note that this variable has no `DATASETS_BASED_` prefix:

- `HF_MODULES_CACHE`: directory where the `datasets` library will store the cached dataset scripts. If not set, the `datasets` library will choose the default location. Defaults to None.
Note that both directories will be appended to `WORKER_LOOP_STORAGE_PATHS` (see ../../libs/libcommon/README.md) so that the workers are put on hold when the disk is full.
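
For example, a worker dedicated to the /parquet-and-dataset-info endpoint could be configured like this (the cache paths are only illustrative; any writable directories work):

```bash
# Illustrative values: the endpoint must be one of the allowed values above,
# and the cache paths are placeholders for writable directories.
export DATASETS_BASED_ENDPOINT="/parquet-and-dataset-info"
export DATASETS_BASED_HF_DATASETS_CACHE="/storage/hf-datasets-cache"
export HF_MODULES_CACHE="/storage/hf-modules-cache"
```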
Numba requires setting the `NUMBA_CACHE_DIR` environment variable to a writable directory where it can cache the compiled functions. This is required on cloud infrastructure (see https://stackoverflow.com/a/63367171/7351594):

- `NUMBA_CACHE_DIR`: directory where the `numba` decorators (used by `librosa`) can write their cache.
Note that this directory will be appended to `WORKER_LOOP_STORAGE_PATHS` (see ../../libs/libcommon/README.md) so that the workers are put on hold when the disk is full.
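
For example (the path is only a placeholder for a writable directory):

```bash
# Placeholder path: any writable directory works.
export NUMBA_CACHE_DIR="/storage/numba-cache"
```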
If the Hub is not https://huggingface.co (i.e., if you set the `COMMON_HF_ENDPOINT` environment variable), you must set the `HF_ENDPOINT` environment variable to the same value. See huggingface/datasets#5196 (comment) for more details:

- `HF_ENDPOINT`: the URL of the Hub. Defaults to `https://huggingface.co`.
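
For example, when targeting a private Hub instance (the URL below is a placeholder), both variables must point to the same Hub:

```bash
# Placeholder URL: replace with your Hub instance.
export COMMON_HF_ENDPOINT="https://hub.example.com"
export HF_ENDPOINT="https://hub.example.com"
```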
Only needed when the `DATASETS_BASED_ENDPOINT` is set to `/first-rows`.
Set environment variables to configure the first rows worker (`FIRST_ROWS_` prefix):

- `FIRST_ROWS_FALLBACK_MAX_DATASET_SIZE`: the maximum size in bytes of a dataset for which the worker falls back to normal (non-streaming) mode if streaming fails. Note that this requires the dataset size to be available in the info metadata. Set to `0` to disable the fallback. Defaults to `100_000_000`.
- `FIRST_ROWS_MAX_BYTES`: the maximum size in bytes of the /first-rows endpoint response. Defaults to `1_000_000` (1 MB).
- `FIRST_ROWS_MAX_NUMBER`: the maximum number of rows fetched by the worker for the split and provided in the /first-rows endpoint response. Defaults to `100`.
- `FIRST_ROWS_MIN_CELL_BYTES`: the minimum size in bytes of a cell when truncating the content of a row (see `FIRST_ROWS_MAX_BYTES`). Below this limit, the cell content will not be truncated. Defaults to `100`.
- `FIRST_ROWS_MIN_NUMBER`: the minimum number of rows fetched by the worker for the split and provided in the /first-rows endpoint response. Defaults to `10`.
Also, set the assets-related configuration for the first-rows worker. See ../../libs/libcommon/README.md.
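
For example, to return at most 50 rows per split and cap the /first-rows response at 500 kB (illustrative values; unset variables keep the defaults listed above):

```bash
# Illustrative overrides of the defaults listed above.
export FIRST_ROWS_MAX_NUMBER=50
export FIRST_ROWS_MAX_BYTES=500000
```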
Only needed when the `DATASETS_BASED_ENDPOINT` is set to `/parquet-and-dataset-info`.
Set environment variables to configure the parquet worker (`PARQUET_AND_DATASET_INFO_` prefix):

- `PARQUET_AND_DATASET_INFO_BLOCKED_DATASETS`: comma-separated list of the blocked datasets. If empty, no dataset is blocked. Defaults to empty.
- `PARQUET_AND_DATASET_INFO_COMMIT_MESSAGE`: the git commit message used when the worker uploads the parquet files to the Hub. Defaults to `Update parquet files`.
- `PARQUET_AND_DATASET_INFO_COMMITTER_HF_TOKEN`: the user token (https://huggingface.co/settings/tokens) used to commit the parquet files to the Hub. The user must be allowed to create the `refs/convert/parquet` branch (see `PARQUET_AND_DATASET_INFO_TARGET_REVISION`); Hugging Face organization members have this right. The user must also have the right to push to the `refs/convert/parquet` branch; Datasets maintainers members have this right. The token must have write permission. If not set, the worker will fail. Defaults to None.
- `PARQUET_AND_DATASET_INFO_MAX_DATASET_SIZE`: the maximum size in bytes of the datasets for which the parquet files are pre-computed. Bigger datasets, or datasets without that information, are ignored. Defaults to `100_000_000`.
- `PARQUET_AND_DATASET_INFO_SOURCE_REVISION`: the git revision of the dataset used to prepare the parquet files. Defaults to `main`.
- `PARQUET_AND_DATASET_INFO_SUPPORTED_DATASETS`: comma-separated list of the supported datasets. The worker does not check the size of supported datasets against the maximum dataset size. Defaults to empty.
- `PARQUET_AND_DATASET_INFO_TARGET_REVISION`: the git revision of the dataset where the parquet files are stored. Make sure the committer token (`PARQUET_AND_DATASET_INFO_COMMITTER_HF_TOKEN`) has permission to write there. Defaults to `refs/convert/parquet`.
- `PARQUET_AND_DATASET_INFO_URL_TEMPLATE`: the URL template used to build the parquet file URLs. Defaults to `/datasets/%s/resolve/%s/%s`.
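
For example, a setup that converts datasets up to 1 GB with a dedicated committer token might look like this (the token value and the dataset names are placeholders):

```bash
# Placeholders: replace the token and the dataset names with real values.
export PARQUET_AND_DATASET_INFO_COMMITTER_HF_TOKEN="hf_xxx"      # token with write permission
export PARQUET_AND_DATASET_INFO_MAX_DATASET_SIZE=1000000000      # 1 GB
export PARQUET_AND_DATASET_INFO_SUPPORTED_DATASETS="user/dataset1,user/dataset2"
```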
The splits worker does not need any additional configuration.
See ../../libs/libcommon/README.md for more information about the common configuration.