This service acts as a registry for the internal location and representation of files.
This service provides functionality to administer files stored in an S3-compatible object storage. All file-related metadata is stored in an internal MongoDB database, owned and controlled by this service. The service exposes no REST API endpoints and communicates with other services exclusively via events.
The service consumes the following events:

- `files_to_register`: Signals that there is a file to register in the database. The file-related metadata from this event is saved in the database, and the file is moved from the incoming staging bucket to the permanent storage.
- `files_to_stage`: Signals that there is a file that needs to be staged for download. The file is then copied from the permanent storage to the outbox for the actual download.

It publishes the following events:

- `file_registered`: Published after a file was registered in the database. It contains all the file-related metadata that was provided by the files_to_register event.
- `file_staged_for_download`: Published after a file was successfully staged to the outbox.
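The topic and event-type names for this exchange are configurable. As a sketch of the wiring, using only the example values from the configuration section below:

```yaml
# Consumed topics:
files_to_register_topic: "file-interrogations"
files_to_stage_topic: "file-downloads"

# Published events:
file_registered_event_topic: "internal-file-registry"
file_registered_event_type: "file_registered"
file_staged_event_topic: "internal-file-registry"
file_staged_event_type: "file_staged_for_download"
```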
We recommend using the provided Docker container.
A pre-built version is available on Docker Hub:

```bash
docker pull ghga/internal-file-registry-service:3.0.1
```

Or you can build the container yourself from the `./Dockerfile`:

```bash
# Execute in the repo's root dir:
docker build -t ghga/internal-file-registry-service:3.0.1 .
```
For a production-ready deployment, we recommend using Kubernetes; however, for simple use cases, you could run the service using Docker on a single server:

```bash
# The entrypoint is preconfigured:
docker run -p 8080:8080 ghga/internal-file-registry-service:3.0.1 --help
```
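If you have prepared a `.ifrs.yaml` config file (see the configuration section below), you could mount it into the container. Note that the in-container target path used here is an assumption and must match the image's working or home directory:

```bash
# The target path inside the container is an assumption; adjust it to the image:
docker run -v "$PWD/.ifrs.yaml":/service/.ifrs.yaml ghga/internal-file-registry-service:3.0.1
```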
If you prefer not to use containers, you may install the service from source:

```bash
# Execute in the repo's root dir:
pip install .

# To run the service:
ifrs --help
```
The service requires the following configuration parameters:
- `log_level` (string): The minimum log level to capture. Must be one of: `["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "TRACE"]`. Default: `"INFO"`.

- `service_name` (string): Default: `"internal_file_registry"`.

- `service_instance_id` (string, required): A string that uniquely identifies this instance across all instances of this service. A globally unique Kafka client ID will be created by concatenating the service_name and the service_instance_id. Examples: `"germany-bw-instance-001"`.

- `log_format` (string or null): If set, will replace JSON formatting with the specified string format. If not set, has no effect. In addition to the standard attributes, the following can also be specified: timestamp, service, instance, level, correlation_id, and details. Default: `null`. Examples: `"%(timestamp)s - %(service)s - %(level)s - %(message)s"`, `"%(asctime)s - Severity: %(levelno)s - %(msg)s"`.

- `log_traceback` (boolean): Whether to include exception tracebacks in log messages. Default: `true`.

- `object_storages` (object, required): Can contain additional properties; for each additional property, refer to `#/$defs/S3ObjectStorageNodeConfig`.
- `file_registered_event_topic` (string, required): Name of the topic used for events indicating that a new file has been internally registered. Examples: `"internal-file-registry"`.

- `file_registered_event_type` (string, required): The type used for events indicating that a new file has been internally registered. Examples: `"file_registered"`.

- `file_staged_event_topic` (string, required): Name of the topic used for events indicating that a file has been staged for download. Examples: `"internal-file-registry"`.

- `file_staged_event_type` (string, required): The type used for events indicating that a file has been staged for download. Examples: `"file_staged_for_download"`.

- `file_deleted_event_topic` (string, required): Name of the topic used for events indicating that a file has been deleted. Examples: `"internal-file-registry"`.

- `file_deleted_event_type` (string, required): The type used for events indicating that a file has been deleted. Examples: `"file_deleted"`.

- `files_to_delete_topic` (string, required): The name of the topic to receive events informing about files to delete. Examples: `"file-deletions"`.

- `files_to_register_topic` (string, required): The name of the topic to receive events informing about new files to register. Examples: `"file-interrogations"`.

- `files_to_stage_topic` (string, required): The name of the topic to receive events informing about files to stage. Examples: `"file-downloads"`.
- `kafka_servers` (array of strings, required): A list of connection strings to connect to Kafka bootstrap servers. Examples: `["localhost:9092"]`.

- `kafka_security_protocol` (string): Protocol used to communicate with brokers. Must be one of: `["PLAINTEXT", "SSL"]`. Default: `"PLAINTEXT"`.

- `kafka_ssl_cafile` (string): Certificate Authority file path containing certificates used to sign broker certificates. If a CA is not specified, the default system CA will be used if found by OpenSSL. Default: `""`.

- `kafka_ssl_certfile` (string): Optional filename of the client certificate, as well as any CA certificates needed to establish the certificate's authenticity. Default: `""`.

- `kafka_ssl_keyfile` (string): Optional filename containing the client private key. Default: `""`.

- `kafka_ssl_password` (string, format: password): Optional password to be used for the client private key. Default: `""`.

- `generate_correlation_id` (boolean): A flag which, if false, will result in an error when trying to publish an event without a valid correlation ID set for the context. If true, a new correlation ID will be generated and used in the event header. Default: `true`. Examples: `true`, `false`.

- `kafka_max_message_size` (integer): The largest message size that can be transmitted, in bytes. Only services that need to send or receive larger messages should set this. Exclusive minimum: `0`. Default: `1048576`. Examples: `1048576`, `16777216`.
- `db_connection_str` (string, format: password, required): MongoDB connection string. Might include credentials. For more information see: https://naiveskill.com/mongodb-connection-string/. Examples: `"mongodb://localhost:27017"`.

- `db_name` (string, required): Name of the database located on the MongoDB server. Examples: `"my-database"`.

- `db_version_collection` (string, required): The name of the collection containing DB version information for this service. Examples: `"ifrsDbVersions"`.

- `migration_wait_sec` (integer, required): The number of seconds to wait before checking the DB version again. Examples: `5`, `30`, `180`.
- `S3Config` (object): S3-specific config params. Inherit your config class from this class if you need to talk to an S3 service in the backend. Cannot contain additional properties.

  - `s3_endpoint_url` (string, required): URL to the S3 API. Examples: `"http://localhost:4566"`.

  - `s3_access_key_id` (string, required): Part of the credentials for logging into the S3 service. See: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html. Examples: `"my-access-key-id"`.

  - `s3_secret_access_key` (string, format: password, required): Part of the credentials for logging into the S3 service. See: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html. Examples: `"my-secret-access-key"`.

  - `s3_session_token` (string or null, format: password): Optional part of the credentials for logging into the S3 service. See: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html. Default: `null`. Examples: `"my-session-token"`.

  - `aws_config_ini` (string or null, format: path): Path to a config file for specifying more advanced S3 parameters. This should follow the format described here: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#using-a-configuration-file. Default: `null`. Examples: `"~/.aws/config"`.

- `S3ObjectStorageNodeConfig` (object): Configuration for one specific object storage node and one bucket in it. The bucket is the main bucket that the service is responsible for. Cannot contain additional properties.

  - `bucket` (string, required)

  - `credentials` (object): Refer to `#/$defs/S3Config`.
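To make the parameter structure concrete, here is a minimal, hypothetical config excerpt assembled from the example values above (the event topic settings shown earlier are omitted for brevity; the object storage alias `default` and the bucket name are illustrative assumptions, not prescribed by the schema):

```yaml
service_instance_id: "germany-bw-instance-001"

kafka_servers:
  - "localhost:9092"

db_connection_str: "mongodb://localhost:27017"
db_name: "my-database"
db_version_collection: "ifrsDbVersions"
migration_wait_sec: 30

object_storages:
  default:  # the node alias is an illustrative assumption
    bucket: "permanent-storage"  # illustrative bucket name
    credentials:
      s3_endpoint_url: "http://localhost:4566"
      s3_access_key_id: "my-access-key-id"
      s3_secret_access_key: "my-secret-access-key"
```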
A template YAML for configuring the service can be found at `./example-config.yaml`. Please adapt it, rename it to `.ifrs.yaml`, and place it into one of the following locations:

- in the current working directory where you execute the service (on unix: `./.ifrs.yaml`)
- in your home directory (on unix: `~/.ifrs.yaml`)

The config yaml will be automatically parsed by the service.

Important: If you are using containers, the locations refer to paths within the container.
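For example, on a unix host:

```bash
cp ./example-config.yaml ~/.ifrs.yaml
# then adapt ~/.ifrs.yaml to your setup
```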
All parameters mentioned in the `./example-config.yaml` could also be set using environment variables or file secrets.

For naming the environment variables, just prefix the parameter name with `ifrs_`, e.g. for the `log_level` parameter, set an environment variable named `ifrs_log_level` (you may use either upper or lower case; however, it is standard to define all env variables in upper case).

To use file secrets, please refer to the corresponding section of the pydantic documentation.
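For example, to set the `log_level` parameter via an environment variable:

```bash
export IFRS_LOG_LEVEL="DEBUG"
```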
This is a Python-based service following the Triple Hexagonal Architecture pattern. It uses protocol/provider pairs and dependency injection mechanisms provided by the hexkit library.
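As a rough, hypothetical illustration of the protocol/provider idea (these are not the service's actual class names):

```python
from typing import Protocol


class ObjectStoragePort(Protocol):
    """Port: declares what the core logic needs from an object storage backend."""

    async def copy_object(
        self, *, source_bucket: str, object_id: str, dest_bucket: str
    ) -> None:
        ...


class S3ObjectStorage:
    """Provider: an adapter that would satisfy the port using S3 API calls."""

    async def copy_object(
        self, *, source_bucket: str, object_id: str, dest_bucket: str
    ) -> None:
        raise NotImplementedError  # a real adapter would call the S3 API here


class FileRegistry:
    """Core class: receives any provider satisfying the port via injection."""

    def __init__(self, storage: ObjectStoragePort):
        self._storage = storage
```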
For setting up the development environment, we rely on the devcontainer feature of VS Code in combination with Docker Compose.
To use it, you have to have Docker Compose as well as VS Code with its "Remote - Containers" extension (`ms-vscode-remote.remote-containers`) installed.
Then open this repository in VS Code and run the command `Remote-Containers: Reopen in Container` from the VS Code "Command Palette".
This will give you a full-fledged, pre-configured development environment including:
- infrastructural dependencies of the service (databases, etc.)
- all relevant VS Code extensions pre-installed
- pre-configured linting and auto-formatting
- a pre-configured debugger
- automatic license-header insertion
Moreover, inside the devcontainer, a convenience command `dev_install` is available. It installs the service with all development dependencies and installs pre-commit.

The installation is performed automatically when you build the devcontainer. However, if you update dependencies in the `./pyproject.toml` or the `./requirements-dev.txt`, please run it again.
This repository is free to use and modify according to the Apache 2.0 License.
This README file is auto-generated; please see `readme_generation.md` for details.