diff --git a/DEVELOPER_GUIDE.md b/DEVELOPER_GUIDE.md
index 613aaf8cee..374ac64e31 100644
--- a/DEVELOPER_GUIDE.md
+++ b/DEVELOPER_GUIDE.md
@@ -53,39 +53,49 @@ If you use VSCode, it might be useful to use the ["monorepo" workspace](./.vscod
 
 The repository is structured as a monorepo, with Python libraries and applications in [jobs](./jobs), [libs](./libs) and [services](./services):
 
-- [jobs](./jobs) contains the one-time jobs run by Helm before deploying the pods. For now, the only job migrates the databases when needed.
-- [libs](./libs) contains the Python libraries used by the services and workers. For now, the only library is [libcommon](./libs/libcommon), which contains the common code for the services and workers.
-- [services](./services) contains the applications: the public API, the admin API (which is separated from the public API and might be published under its own domain at some point), the reverse proxy, and the worker that processes the queue asynchronously: it gets a "job" (caution: the jobs stored in the queue, not the Helm jobs), processes the expected response for the associated endpoint, and stores the response in the cache.
+The following diagram represents the general architecture of the project:
+![Architecture](architecture.png)
+
+- [Mongo Server](https://www.mongodb.com/), a Mongo server with databases for "cache", "queue" and "maintenance".
+- [jobs](./jobs) contains the jobs run by Helm before deploying the pods or on a scheduled basis.
+For now, there are two types of jobs:
+  - [cache maintenance](./jobs/cache_maintenance/)
+  - [mongodb migrations](./jobs/mongodb_migration/)
+- [libs](./libs) contains the Python libraries used by the services and workers.
+For now, there are two libraries:
+  - [libcommon](./libs/libcommon), which contains the common code for the services and workers.
+  - [libapi](./libs/libapi/), which contains common code for authentication, HTTP requests, exceptions and other utilities for the services.
+- [services](./services) contains the applications:
+  - [api](./services/api/), the public API: a web server that exposes the [API endpoints](https://huggingface.co/docs/datasets-server). All the responses are served from pre-computed responses stored in the Mongo server. That's the main point of this project: generating these responses takes time, and the API server provides this service to the users (see the usage sketch after this list).
+  The API service exposes the `/webhook` endpoint which is called by the Hub on every creation, update or deletion of a dataset on the Hub. On deletion, the cached responses are deleted. On creation or update, a new job is appended in the "queue" database.
+  - [rows](./services/rows/)
+  - [search](./services/search/)
+  - [admin](./services/admin/), the admin API (which is separated from the public API and might be published under its own domain at some point)
+  - [reverse proxy](./services/reverse-proxy/), the reverse proxy
+  - [worker](./services/worker/), the worker that processes the queue asynchronously: it gets a "job" from the "queue" database (caution: the jobs stored in the queue, not the Helm jobs), processes the expected response for the associated endpoint, and stores the response in the "cache" database.
+  Note also that the workers create local files when the dataset contains images or audio. A shared directory (`ASSETS_STORAGE_ROOT`) must therefore be provisioned with sufficient space for the generated files. The `/first-rows` endpoint responses contain URLs to these files, served by the API under the `/assets/` endpoint.
+  - [sse-api](./services/sse-api/)
+- Clients
+  - [Admin UI](./front/admin_ui/)
+  - [Hugging Face Hub](https://huggingface.co/)
 
 If you have access to the internal HF notion, see https://www.notion.so/huggingface2/Datasets-server-464848da2a984e999c540a4aa7f0ece5.
 
-The application is distributed in several components.
-
-[api](./services/api) is a web server that exposes the [API endpoints](https://huggingface.co/docs/datasets-server). Apart from some endpoints (`valid`, `is-valid`), all the responses are served from pre-computed responses. That's the main point of this project: generating these responses takes time, and the API server provides this service to the users.
-
-The precomputed responses are stored in a Mongo database called "cache". They are computed by [workers](./services/worker) which take their jobs from a job queue stored in a Mongo database called "queue", and store the results (error or valid response) into the "cache" (see [libcommon](./libs/libcommon)).
-
-The API service exposes the `/webhook` endpoint which is called by the Hub on every creation, update or deletion of a dataset on the Hub. On deletion, the cached responses are deleted. On creation or update, a new job is appended in the "queue" database.
-
-Note that every worker has its own job queue:
-
-- `/splits`: the job is to refresh a dataset, namely to get the list of [config](https://huggingface.co/docs/datasets/v2.1.0/en/load_hub#select-a-configuration) and [split](https://huggingface.co/docs/datasets/v2.1.0/en/load_hub#select-a-split) names, then to create a new job for every split for the workers that depend on it.
-- `/first-rows`: the job is to get the columns and the first 100 rows of the split.
-- `/parquet`: the job is to download the dataset, prepare a parquet version of every split (various sharded parquet files), and upload them to the `refs/convert/parquet` "branch" of the dataset repository on the Hub.
-
-Note also that the workers create local files when the dataset contains images or audios. A shared directory (`ASSETS_STORAGE_ROOT`) must therefore be provisioned with sufficient space for the generated files. The `/first-rows` endpoint responses contain URLs to these files, served by the API under the `/assets/` endpoint.
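To make this concrete, here is a minimal sketch of how a client consumes the pre-computed responses through the public API (the base URL is the documented public deployment; the dataset, config and split names are only examples):

```python
import requests

API_URL = "https://datasets-server.huggingface.co"  # documented public deployment
DATASET = "ibm/duorc"  # example dataset; any public Hub dataset works

# List the configs and splits; the response is read from the "cache" database.
splits = requests.get(f"{API_URL}/splits", params={"dataset": DATASET}, timeout=30).json()
print(splits["splits"][:3])

# Fetch the columns and first rows of one split, also served from the cache.
first_rows = requests.get(
    f"{API_URL}/first-rows",
    params={"dataset": DATASET, "config": "SelfRC", "split": "train"},  # example config/split
    timeout=30,
).json()
print([feature["name"] for feature in first_rows["features"]])
```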
-
-Hence, the working application has:
+Hence, the working application has the following core components:
+- a Mongo server with two main databases: "cache" and "queue"
 - one instance of the API service which exposes a port
-- N1 instances of the `splits` worker, N2 instances of the `first-rows` worker (N2 should generally be higher than N1), N3 instances of the `parquet` worker
-- a Mongo server with two databases: "cache" and "queue"
-- a shared directory for the assets
+- one instance of the rows service which exposes a port
+- one instance of the search service which exposes a port
+- N instances of the worker that process the pending "jobs" and store the results in the "cache"
 
-The application also has:
+The application also has optional components:
 
 - a reverse proxy in front of the API to serve static files and proxy the rest to the API server
 - an admin server to serve technical endpoints
+- a shared directory for the assets and cached-assets in [S3](https://aws.amazon.com/s3/) (it can be configured to point to local storage instead)
+- a shared storage for temporary files created by the workers in [EFS](https://aws.amazon.com/efs/) (it can be configured to point to local storage instead)
+
 The following environments contain all the modules: reverse proxy, API server, admin API server, workers, and the Mongo database.
 
diff --git a/architecture.png b/architecture.png
new file mode 100644
index 0000000000..0b0eb831fe
Binary files /dev/null and b/architecture.png differ
diff --git a/jobs/cache_maintenance/README.md b/jobs/cache_maintenance/README.md
index 6c646b5277..a731e2bd5d 100644
--- a/jobs/cache_maintenance/README.md
+++ b/jobs/cache_maintenance/README.md
@@ -15,15 +15,41 @@ Available actions:
 
 The script can be configured using environment variables. They are grouped by scope.
 
+- `CACHE_MAINTENANCE_ACTION`: the action to launch, among `backfill`, `collect-cache-metrics`, `collect-queue-metrics`, `clean-directory` and `post-messages`. Defaults to `skip`.
+
+### Backfill job configurations
+
+See [../../libs/libcommon/README.md](../../libs/libcommon/README.md) for the following configurations:
+- Cached Assets
+- Assets
+- S3
+- Cache
+- Queue
+
+### Collect Cache job configurations
+
+See [../../libs/libcommon/README.md](../../libs/libcommon/README.md) for the following configurations:
+- Cache
+
+### Collect Queue job configurations
+
+See [../../libs/libcommon/README.md](../../libs/libcommon/README.md) for the following configurations:
+- Queue
+
+### Clean Directory job configurations
+
+- `DIRECTORY_CLEANING_CACHE_DIRECTORY`: directory location to clean up.
+- `DIRECTORY_CLEANING_SUBFOLDER_PATTERN`: subfolder pattern inside the cache directory.
+- `DIRECTORY_CLEANING_EXPIRED_TIME_INTERVAL_SECONDS`: time in seconds after which a file is deleted, based on its last accessed time.
+
+### Post Messages job configurations
+
+Set environment variables to configure the `post-messages` job:
+
 - `DISCUSSIONS_BOT_ASSOCIATED_USER_NAME`: name of the Hub user associated with the Datasets Server bot app.
 - `DISCUSSIONS_BOT_TOKEN`: token of the Datasets Server bot used to post messages in Hub discussions.
 - `DISCUSSIONS_PARQUET_REVISION`: revision (branch) where the converted Parquet files are stored.
-### Actions
-
-Set environment variables to configure the job (`CACHE_MAINTENANCE_` prefix):
-
-- `CACHE_MAINTENANCE_ACTION`: the action to launch, among `backfill`, `metrics`, `skip`. Defaults to `skip`.
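To illustrate how the three `DIRECTORY_CLEANING_*` variables of the `clean-directory` action fit together, here is a minimal sketch of a clean-up pass (an illustration of the idea only, not the actual job code):

```python
import glob
import os
import time

cache_directory = os.environ["DIRECTORY_CLEANING_CACHE_DIRECTORY"]
subfolder_pattern = os.environ["DIRECTORY_CLEANING_SUBFOLDER_PATTERN"]
expired_seconds = float(os.environ["DIRECTORY_CLEANING_EXPIRED_TIME_INTERVAL_SECONDS"])

# Delete every file under the matching subfolders whose last access time is
# older than the configured interval.
cutoff = time.time() - expired_seconds
for path in glob.glob(os.path.join(cache_directory, subfolder_pattern), recursive=True):
    if os.path.isfile(path) and os.stat(path).st_atime < cutoff:
        os.remove(path)
```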
 ### Common
 
diff --git a/libs/libapi/README.md b/libs/libapi/README.md
index dc6290597a..54a1a0df77 100644
--- a/libs/libapi/README.md
+++ b/libs/libapi/README.md
@@ -1,6 +1,12 @@
 # libapi
 
-A Python library for the API services
+A Python library with common code for authentication, HTTP requests, exceptions and other utilities.
+Used by the following services:
+- [admin](https://github.com/huggingface/datasets-server/tree/main/services/admin)
+- [api](https://github.com/huggingface/datasets-server/tree/main/services/api)
+- [rows](https://github.com/huggingface/datasets-server/tree/main/services/rows)
+- [search](https://github.com/huggingface/datasets-server/tree/main/services/search)
+- [sse-api](https://github.com/huggingface/datasets-server/tree/main/services/sse-api)
 
 ## Configuration
 
diff --git a/libs/libcommon/README.md b/libs/libcommon/README.md
index 41b8f51b13..ae87cb5580 100644
--- a/libs/libcommon/README.md
+++ b/libs/libcommon/README.md
@@ -64,3 +64,11 @@ Set environment variables to configure the connection to S3.
 - `S3_REGION_NAME`: bucket region name when using `s3` as storage protocol for assets or cached assets. Defaults to `us-east-1`.
 - `S3_ACCESS_KEY_ID`: unique identifier associated with an AWS account. It's used to identify the AWS account that is making requests to S3. Defaults to empty.
 - `S3_SECRET_ACCESS_KEY`: secret key associated with an AWS account. Defaults to empty.
+
+## Parquet Metadata
+
+- `PARQUET_METADATA_STORAGE_DIRECTORY`: storage directory where parquet metadata files are stored. See https://arrow.apache.org/docs/python/generated/pyarrow.parquet.FileMetaData.html for more information.
+
+## Rows Index
+
+- `ROWS_INDEX_MAX_ARROW_DATA_IN_MEMORY`: the maximum number of row groups to be loaded in memory.
\ No newline at end of file
diff --git a/services/admin/README.md b/services/admin/README.md
index 9b0b88c896..bc23f92d1c 100644
--- a/services/admin/README.md
+++ b/services/admin/README.md
@@ -1,10 +1,10 @@
-# Datasets server admin machine
+# Datasets server admin service
 
 > Admin endpoints
 
 ## Configuration
 
-The worker can be configured using environment variables. They are grouped by scope.
+The service can be configured using environment variables. They are grouped by scope.
 
 ### Admin service
 
@@ -29,6 +29,13 @@ The following environment variables are used to configure the Uvicorn server (`A
 
 - `PROMETHEUS_MULTIPROC_DIR`: the directory where the uvicorn workers share their prometheus metrics. See https://github.com/prometheus/client_python#multiprocess-mode-eg-gunicorn. Defaults to empty, in which case every worker manages its own metrics, and the /metrics endpoint returns the metrics of a random worker.
 
+### Storage
+
+- `DATASETS_BASED_HF_DATASETS_CACHE`: storage directory where job runners that use the `datasets` library store cache files.
+- `DESCRIPTIVE_STATISTICS_CACHE_DIRECTORY`: storage directory where the `split-descriptive-statistics` job runner stores temporarily downloaded parquet files.
+- `DUCKDB_INDEX_CACHE_DIRECTORY`: storage directory where the `split-duckdb-index` job runner stores temporarily downloaded parquet files.
+The same directory is used by the /search and /filter endpoints to store the temporarily downloaded duckdb index files.
+
 ### Common
 
 See [../../libs/libcommon/README.md](../../libs/libcommon/README.md) for more information about the common configuration.
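As a side note on the libcommon configuration referenced above: the files under `PARQUET_METADATA_STORAGE_DIRECTORY` are parquet metadata and, assuming they are standard parquet metadata footers, can be inspected with pyarrow. A minimal sketch (the file path is hypothetical):

```python
import pyarrow.parquet as pq

# Hypothetical path to one file inside PARQUET_METADATA_STORAGE_DIRECTORY
metadata = pq.read_metadata("/parquet-metadata/some-dataset/default/train/0000.parquet")

# FileMetaData gives per-file and per-row-group information without touching the data pages.
print(metadata.num_rows, metadata.num_row_groups)
for i in range(metadata.num_row_groups):
    row_group = metadata.row_group(i)
    print(i, row_group.num_rows, row_group.total_byte_size)
```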
@@ -37,15 +44,20 @@ See [../../libs/libcommon/README.md](../../libs/libcommon/README.md) for more in
 
 The admin service provides endpoints:
 
-- `/healthcheck`
+- `/healthcheck`: ensure the app is running
 - `/metrics`: give info about the cache and the queue
-- `/cache-reports{processing_step}`: give detailed reports on the content of the cache for a processing step
-- `/cache-reports-with-content{processing_step}`: give detailed reports on the content of the cache for a processing step, including the content itself, which can be heavy
 - `/pending-jobs`: give the pending jobs, classed by queue and status (waiting or started)
+- `/dataset-status`: give the dataset status, including cache records and pending jobs.
+- `/num-dataset-infos-by-builder-name`: give a report of the number of datasets by builder name (parquet, csv, text, imagefolder, audiofolder, json, arrow and webdataset).
+- `/recreate-dataset`: deletes all the cache entries related to a specific dataset, then runs all the steps in order. It's a POST endpoint. Pass the requested parameters:
+  - `dataset`: the dataset name
+  - `priority`: `low` (default), `normal` or `high`
+
+### Endpoints by processing step
+
 - `/force-refresh{processing_step}`: force refresh cache entries for the processing step. It's a POST endpoint. Pass the requested parameters, depending on the processing step's input type:
   - `dataset`: `?dataset={dataset}`
   - `config`: `?dataset={dataset}&config={config}`
   - `split`: `?dataset={dataset}&config={config}&split={split}`
-- `/recreate-dataset`: deletes all the cache entries related to a specific dataset, then run all the steps in order. It's a POST endpoint. Pass the requested parameters:
-  - `dataset`: the dataset name
-  - `priority`: `low` (default), `normal` or `high`
+- `/cache-reports{processing_step}`: give detailed reports on the content of the cache for a processing step
+- `/cache-reports-with-content{processing_step}`: give detailed reports on the content of the cache for a processing step, including the content itself, which can be heavy
diff --git a/services/api/README.md b/services/api/README.md
index f691fc9f15..8bd70a9c32 100644
--- a/services/api/README.md
+++ b/services/api/README.md
@@ -1,6 +1,6 @@
 # Datasets server API
 
-> API on 🤗 datasets
+> API for the Hugging Face 🤗 datasets viewer
 
 ## Configuration
 
@@ -21,7 +21,10 @@ See https://huggingface.co/docs/datasets-server
 
 - /healthcheck: Ensure the app is running
 - /metrics: Return a list of metrics in the Prometheus format
 - /webhook: Add, update or remove a dataset
+- /croissant: Return the [croissant](https://github.com/mlcommons/croissant) specification for a dataset.
 - /is-valid: Tell if a dataset is [valid](https://huggingface.co/docs/datasets-server/valid)
 - /splits: List the [splits](https://huggingface.co/docs/datasets-server/splits) names for a dataset
 - /first-rows: Extract the [first rows](https://huggingface.co/docs/datasets-server/first_rows) for a dataset split
 - /parquet: List the [parquet files](https://huggingface.co/docs/datasets-server/parquet) auto-converted for a dataset
+- /opt-in-out-urls: Return the number of opted-in/out image URLs. See [Spawning AI](https://api.spawning.ai/spawning-api) for more information.
+- /statistics: Return some basic statistics for a dataset split.
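Returning to the admin endpoints listed above, here is a minimal sketch of force-refreshing a dataset-level processing step (the base URL and the step name are assumptions for illustration; a real deployment may also require an authentication header):

```python
import requests

ADMIN_URL = "http://localhost:8081"  # assumption: locally running admin service
PROCESSING_STEP = "/dataset-config-names"  # illustrative processing step name

# POST /force-refresh{processing_step}?dataset={dataset} for a dataset-level step.
response = requests.post(
    f"{ADMIN_URL}/force-refresh{PROCESSING_STEP}",
    params={"dataset": "user/my-dataset"},  # hypothetical dataset name
    timeout=30,
)
response.raise_for_status()
```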
\ No newline at end of file
diff --git a/services/reverse-proxy/README.md b/services/reverse-proxy/README.md
index e612c239ae..c02808b2f7 100644
--- a/services/reverse-proxy/README.md
+++ b/services/reverse-proxy/README.md
@@ -19,11 +19,11 @@ It takes various environment variables, all of them are mandatory:
 - `OPENAPI_FILE`: the path to the OpenAPI file, eg `docs/source/openapi.json`
 - `HOST`: domain of the reverse proxy, eg `localhost`
 - `PORT`: port of the reverse proxy, eg `80`
-- `URL_ADMIN`= URL of the admin, eg `http://admin:8081`
-- `URL_API`= URL of the API, eg `http://api:8080`
-- `URL_ROWS`= URL of the rows service, eg `http://rows:8082`
-- `URL_SEARCH`= URL of the search service, eg `http://search:8083`
-- `URL_SSE_API`= URL of the SSE API service, eg `http://sse-api:8085`
+- `URL_ADMIN`: URL of the admin, eg `http://admin:8081`
+- `URL_API`: URL of the API, eg `http://api:8080`
+- `URL_ROWS`: URL of the rows service, eg `http://rows:8082`
+- `URL_SEARCH`: URL of the search service, eg `http://search:8083`
+- `URL_SSE_API`: URL of the SSE API service, eg `http://sse-api:8085`
 
 The image requires three directories to be mounted (from volumes):
 
diff --git a/services/rows/README.md b/services/rows/README.md
index ecf8a2ef4c..f6d6e78a91 100644
--- a/services/rows/README.md
+++ b/services/rows/README.md
@@ -1,6 +1,8 @@
 # Datasets server API - rows endpoint
 
-> /rows endpoint
+> **GET** /rows
+
+See [usage](https://huggingface.co/docs/datasets-server/rows) for more details.
 
 ## Configuration
 
diff --git a/services/search/README.md b/services/search/README.md
index 6b3435cbbf..7c0722c75c 100644
--- a/services/search/README.md
+++ b/services/search/README.md
@@ -1,13 +1,17 @@
-# Datasets server API - search service
+# Datasets server API - search and filter endpoints
 
-> /search endpoint
-> /filter endpoint
+> **GET** /search
+>
+> **GET** /filter
+
+See [search](https://huggingface.co/docs/datasets-server/search) and [filter](https://huggingface.co/docs/datasets-server/filter) usage for more details.
 
 ## Configuration
 
 The service can be configured using environment variables. They are grouped by scope.
 
-### Duckdb index full text search
+### Duckdb index
+
 - `DUCKDB_INDEX_CACHE_DIRECTORY`: directory where the temporal duckdb index files are downloaded. Defaults to empty.
 - `DUCKDB_INDEX_TARGET_REVISION`: the git revision of the dataset where the index file is stored in the dataset repository.
 
diff --git a/services/sse-api/README.md b/services/sse-api/README.md
index b0177f5156..2c9170ab84 100644
--- a/services/sse-api/README.md
+++ b/services/sse-api/README.md
@@ -1,6 +1,6 @@
 # Datasets server SSE API
 
-> Server-sent events API for the Datasets server. It's used to update the Hub's backend cache.
+> Server-sent events API for the Datasets server. It's used to update the Hugging Face Hub's backend cache.
 ## Configuration
 
diff --git a/services/storage-admin/README.md b/services/storage-admin/README.md
index 1bf04f1f21..1bdfec154b 100644
--- a/services/storage-admin/README.md
+++ b/services/storage-admin/README.md
@@ -1,3 +1,9 @@
 # Datasets server - storage admin
 
 > A Ubuntu machine to log into and manage the storage manually
+
+This container has connectivity to storage for:
+- Descriptive statistics job runner
+- Duckdb job runner
+- Datasets cache
+- Parquet metadata
\ No newline at end of file
diff --git a/services/worker/README.md b/services/worker/README.md
index 541eb07f26..bf36023937 100644
--- a/services/worker/README.md
+++ b/services/worker/README.md
@@ -1,6 +1,6 @@
 # Datasets server - worker
 
-> Workers that pre-compute and cache the response to /splits, /first-rows, /parquet, /info and /size.
+> Workers that pre-compute and cache the response for each of the processing steps.
 
 ## Configuration