From ca4904364db8a4555b26b570b983b6923c1f26f1 Mon Sep 17 00:00:00 2001
From: Mirek Simek
Date: Sun, 19 Jan 2025 19:03:20 +0100
Subject: [PATCH] Documented API update from RFC 0072

* See https://github.com/inveniosoftware/rfcs/pull/91 for details

Co-authored-by: Mirek Simek
---
 docs/reference/file_storage.md | 177 ++++++++++++++++++++++++++-------
 1 file changed, 140 insertions(+), 37 deletions(-)

diff --git a/docs/reference/file_storage.md b/docs/reference/file_storage.md
index 88431f76..7a9bc8d7 100644
--- a/docs/reference/file_storage.md
+++ b/docs/reference/file_storage.md
@@ -3,28 +3,51 @@
 There are two different concepts when handling file storage in InvenioRDM. One is the backend,
 meaning the actual technology that is used to store it. For example, the local file system or
 S3. You can find more information about storage backends in the
-[customize](../customize/s3.md) section.
+[customize](../customize/s3.md) section. Most of the time the backend is transparent to
+the user, as the InvenioRDM API abstracts it away.
 
-Moreover, the origin or method used to transport the files is also important. In InvenioRDM
-there are three defined types.
+Moreover, the origin or method used to transport the files is also important.
+InvenioRDM implements an extensible mechanism for transporting files. Out of the box,
+the following 4 transport mechanisms are supported:
 
-- Local, which represents the files that are managed by the InvenioRDM instance,
+- **Local**, which represents the files that are managed by the InvenioRDM instance,
 independently of the backend.
-- Fetch, these are files that are not managed by the instance but will be transported.
+- **Fetch**, these are files that are not managed by the instance at the beginning when
+the file is attached to the record, but will be transported and stored locally.
 This means that they will eventually become _local_ files.
-- Remote, these are represented by a reference to an external storage system. Since
+- **Multipart**, these are files that are uploaded in parts. Users can upload parts
+in parallel or can retransmit each part if the upload fails, for example due to
+network errors. After upload, the parts are assembled into a single file and the
+file becomes a _local_ file.
+- **Remote**, these are represented by a reference to an external storage system. Since
 the files are not managed by the instance there is no possible way to guarantee their
-availability or integrity. At the moment this type of files are **not supported** by
-InvenioRDM.
+availability or integrity.
 
-These file types are stored in the `storage_class` attribute of the file model, and
+These file types are stored in the `transfer.type` attribute of the file model, and
 represented by a one character encoding:
 
-|  Type  | Representation |
-|:------:|:--------------:|
-| Local  |       L        |
-| Fetch  |       F        |
-| Remote |       R        |
+|    Type    | Representation |
+|:----------:|:--------------:|
+|   Local    |       L        |
+|   Fetch    |       F        |
+| Multipart  |       M        |
+|   Remote   |       R        |
+
+Example of selecting transfer type on file creation:
+
+```http
+POST /api/records/{id}/draft/files
+Content-Type: application/json
+
+[{
+  "key": "dataset.zip",
+  "transfer": {
+    "type": "F",
+    "url": "https://example.org/files/dataset.zip?token="
+  },
+  "metadata": {...}
+}]
+```
 
 ## Local files (L)
 
@@ -33,28 +56,17 @@ Local files are managed as defined in the
 
 ## Files fetching (F)
 
-_Introduced in InvenioRDM v11_
-
-!!! warning "Experimental feature"
-
-    The file fetching mechanism in InvenioRDM v11 has a few limitations. Be aware that
-    future releases of InvenioRDM might introduce breaking changes. We will document them
-    as extensively as possible.
-
-    **Use it at your own risk!**
-
-Fetched files accept two more arguments than a local files on their
-[initialization](rest_api_drafts_records.md#start-draft-file-uploads): _storage\_class_, and
-_uri_:
+During initialization, fetched files are created using [the same protocol as local files](rest_api_drafts_records.md#start-draft-file-uploads).
+Additionally, you need to provide a `transfer` object with `type` and `url` fields.
 
 **Parameters**
 
 | Name            | Type   | Location | Description                |
 | --------------- | ------ | -------- | -------------------------- |
-| `storage_class` | string | body     | "L"                        |
-| `uri`           | string | body     | URL to fetch the file from |
+| `type`          | string | body     | "F"                        |
+| `url`           | string | body     | URL to fetch the file from |
 
-The `uri` must be a URL, accessible from the server's network and resolving to a file
+The `url` must be a URL, accessible from the server's network and resolving to a file
 that can be fetched. No authentication mechanism (e.g. `Authorization` header) is supported
 for the request process, so any authentication has to be part of the URL itself (e.g. a token
 passed in a query string).
@@ -68,8 +80,10 @@ Content-Type: application/json
 [
   {
     "key": "dataset.zip",
-    "uri": "https://example.org/files/dataset.zip?token=",
-    "storage_class": "F",
+    "transfer": {
+      "type": "F",
+      "url": "https://example.org/files/dataset.zip?token="
+    }
   },
   ...
 ]
@@ -92,8 +106,9 @@ Content-Type: application/json
       "created": "2020-11-27 11:17:10.998919",
       "metadata": null,
       "status": "pending",
-      "storage_class": "F",
-      "uri": "https://example.org/files/dataset.zip?token=",
+      "transfer": {
+        "type": "F"
+      },
       "links": {
         "content": "/api/records/{id}/draft/files/dataset.zip/content",
         "self": "/api/records/{id}/draft/files/dataset.zip",
@@ -107,9 +122,13 @@ Content-Type: application/json
 }
 ```
 
+**Note**: The response does not contain the URL of the fetched file. This is intentional
+as the URL might contain sensitive information (e.g. a token) that should not be exposed
+to users.
+
 At this point an asynchronous task will be launched and the file will be transported into
 the InvenioRDM instance. Once the file transfer is completed, the status field will be
-changed to `completed`. At this point the `storage_class` of the files has also changed
+changed to `completed`. At this point the `transfer.type` of the files has also changed
 to `L`. The status can be checked using the _files_ url (`/api/records/{id}/draft/files`).
 Note, until all the files have been transferred (i.e. their status is `completed`) the
 record cannot be published.
@@ -117,6 +136,11 @@ record cannot be published.
 More over, while files are being transferred requests to the `content` and `commit`
 endpoints are not allowed (disabled).
 
+### Error handling
+
+If the file fetching fails, the status of the file will be set to `failed`
+and the error message will be stored in the `transfer.error` field.
+
 ### Security
 
 By default file fetching will be refused. Files can only be fetched from a configurable
@@ -131,6 +155,85 @@ RECORDS_RESOURCES_FILES_ALLOWED_DOMAINS = [
 
 ## Remote files (R)
 
-!!! info "Not supported"
+To link to a remote file, the `transfer` section must contain the `type=R` and `url` fields.
+
+**Request**
+
+```http
+POST /api/records/{id}/draft/files HTTP/1.1
+Content-Type: application/json
+
+[
+  {
+    "key": "dataset.zip",
+    "transfer": {
+      "type": "R",
+      "url": "https://mystoragehosting.org/files/dataset.zip"
+    }
+  },
+  ...
+]
+```
+
+There is no need to call the `commit` endpoint for remote files. The file is considered
+committed as soon as it is created.
+
+**Request**
+
+```http
+POST /api/records/{id}/draft/files/dataset.zip/commit HTTP/1.1
+```
+
+### Accessing remote files
+
+Later on, when a user tries to access the file, a 302 redirect will be returned to the
+`url` provided in the request.
+
+**Request**
+
+```http
+GET /api/records/{id}/draft/files/dataset.zip/content HTTP/1.1
+```
+
+**Response**
+
+```http
+HTTP/1.1 302 FOUND
+Location: https://mystoragehosting.org/files/dataset.zip
+```
 
-    Remote files are currently not supported.
+### Security
+
+When a `302` redirect is sent to the user, they will retrieve the file directly
+by following the returned URL. Therefore, you must ensure:
+
+1. **Network Access**: The file’s URL is reachable from the user’s network.
+2. **No Sensitive Data**: The URL does not include any sensitive information (such as tokens).
+
+By default, Invenio refuses references to external files. Files can only be referenced
+from a “trusted domains” list, which you can configure in your `invenio.cfg` file:
+
+```python
+RECORDS_RESOURCES_FILES_ALLOWED_REMOTE_DOMAINS = [
+    "mystoragehosting.org",
+]
+```
+
+Since the repository cannot guarantee a remote file’s availability or integrity,
+file uploads are also restricted to trusted users only. By default, only users with
+superuser access can upload remote files.
+
+You can change this behavior in your `invenio.cfg` file:
+
+```python
+from invenio_records_resources.services.files.generators import IfTransferType
+from invenio_records_resources.services.files.transfer import REMOTE_TRANSFER_TYPE
+from invenio_administration.generators import Administration
+
+class MyRepositoryPermissionPolicy(RDMRecordPermissionPolicy):
+    can_draft_create_files = RDMRecordPermissionPolicy.can_draft_transfer_files + [
+        IfTransferType(REMOTE_TRANSFER_TYPE, Administration())
+    ]
+
+RDM_PERMISSION_POLICY = MyRepositoryPermissionPolicy
+```
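
As a minimal client-side sketch of the request bodies documented in this patch, the Python snippet below builds one entry of the JSON array sent to `POST /api/records/{id}/draft/files` for the Fetch (`F`) and Remote (`R`) transfer types. The helper names `build_transfer_entry` and `host_is_trusted` are hypothetical and not part of InvenioRDM. The exact-host comparison is only an assumption about how a client might pre-check a URL against a trusted-domains list; the server remains the authority on which URLs it accepts.

```python
import json
from urllib.parse import urlsplit

# Transfer types that take a source URL, per the Fetch and Remote sections.
URL_TRANSFER_TYPES = {"F", "R"}


def build_transfer_entry(key, transfer_type, url):
    """Build one file entry with a `transfer` object (hypothetical helper)."""
    if transfer_type not in URL_TRANSFER_TYPES:
        raise ValueError(f"unsupported transfer type: {transfer_type!r}")
    if not url:
        raise ValueError("fetch and remote transfers require a url")
    return {"key": key, "transfer": {"type": transfer_type, "url": url}}


def host_is_trusted(url, allowed_domains):
    """Client-side pre-check only (assumed exact host match)."""
    return urlsplit(url).hostname in set(allowed_domains)


# Example: the Fetch request body from the patch, serialized as a JSON array.
entry = build_transfer_entry(
    "dataset.zip", "F", "https://example.org/files/dataset.zip?token="
)
body = json.dumps([entry], indent=2)
print(body)
```

Building the body with `json.dumps` rather than by hand avoids the missing or trailing commas that are easy to introduce when the `transfer` object is nested inside each entry.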