Synthesis of research related to deployment of Kedro to modern MLOps platforms #3094
Comments
I've developed M:N functionality example for I will soon write a blog post about |
This is insightful and well framed, thank you for sharing this publicly :) I did some similar thinking about this problem space. Here is my takeaway: Let's say a project is composed of two sets of requirements, functional requirements (business logic) and non-functional requirements (app lifecycle, config, data management, runner, logging, web server, ...). I see two approaches to satisfy these requirements:
Beware, the platform approach can lock you into an execution environment with a fixed or slowly evolving set of features; this is mostly due to the coupling between the app code and the execution environment. Beware also of the multiplication of platforms in your stack, where each platform targets a specific part of the ML workloads. This can lead to a complex and costly stack/system. An ML project is a combination of many workloads at the same time: ML engineering (training, eval, tracking, ...), data engineering (data modeling, data preparation, feature store, ...), data analytics (ad hoc analysis, data viz, ...), software engineering (API, ...). That's why I double down on the framework approach by using kedro as the base framework and extending it to cover these diverse workloads. I believe that kedro users are now on a middle path between the framework and platform approaches, as kedro does not currently cover all data workloads. This indeed makes the deployment/integration harder, because users need to map kedro concepts to their target platform's concepts. Take model serving, for example: it is something kedro lacks. This pushes users to use an MLOps platform alongside kedro, just because it offers such functionality. Integrating some of the "orchestration" features into kedro will make the user workload/UX smoother. This will lower the need to map kedro concepts to the orchestrator/platform concepts. The application can just be orchestrated with a generic orchestrator (an Airflow with a kedro operator? or plain docker for the API). I'm not a dbt expert, but I think that we can draw a parallel. Going down the framework path using dbt leads to lowering the functionality needed from the orchestration platform (Airflow just runs dbt). Hope this helps. |
@datajoely Great synthesis! Grouping MemoryDataSets: when running nodes in separate environments, MemoryDataSets won't work anymore. For Grouping in general allows the user to make a logical separation using the Kedro framework, while not having to make an unnecessary trade-off for performance. If each node corresponds to a machine/pod/executor, then there is overhead for spinning them up. As a user, I need to be able to run multiple nodes on a single machine without it being one node. Controlling the grouping via tags seems a sensible choice. Requirements management in large Kedro projects: This is a blocker for us to move to a mono-repository. The result: overhead of maintaining multiple Kedro repositories. It's possible to work around this limitation of Kedro, however it would be an enormous plus if supported out-of-the-box. On our Spark cluster, most nodes will run on an image with the default dependencies for a project. Some nodes will have heavier and conflicting dependencies (e.g. The other "common pain points" are not that relevant for me at this moment. One topic that came up with MLOps topics which needs are already addressed by Kedro:
|
Crazy idea: |
I promised to publish blog post about using kedro-airflow plugin and demo grouping mechanism and here it is: |
Also, worth considering what happens when a target platform supports something that cannot be defined by Kedro DAGs, like conditionals https://www.databricks.com/blog/announcing-enhanced-control-flow-databricks-workflows |
To an extent, Airflow has had this for a long time: |
Yep. But one of the outcomes of kedro-org/kedro-devrel#94 is that platforms should probably be a priority over open source orchestrators, because OSS orchestrators are more used to, well, orchestrate ETL/ELT tools (say Airflow + Airbyte, Prefect + meltano) but for "ML pipelines" (actually MLOps) commercial platforms seem to be much more widely used. So maybe before we could afford ignoring this pesky bit, but the moment platforms start growing a more complex set of features, the gap widens. |
Turned the research synthesis into a wiki page (https://github.com/kedro-org/kedro/wiki/Synthesis-of-research-related-to-deployment-of-Kedro-to-modern-MLOps-platforms), so there's nothing else to do here. |
It would be great to see a parent ticket - I was using this to track the status of some of the recommendations |
There will be a parent ticket soon, when the next steps are a bit more clear |
Related: #3889 |
Authored with @AlpAribal
Deploying Kedro to and integrating with MLOps Platforms
This document aims to cover the current state regarding deploying Kedro on
enterprise-grade MLOps platforms:
Common pain points
High level graphic summary of the problem space identified:
Deciding on granularity when translating to orchestrator DSL
1:1 Mapping
This is where a single Kedro node is translated to a single orchestrator node.
Distributing each node also complicates the data flow between them:
MemoryDatasets.
M:1 Mapping
This is where the whole Kedro pipeline is run as a single node on the target platform.
The main benefit is simplicity: One job goes to the orchestrator, executed on a single machine.
However, there are inefficiencies:
M:N Mapping
This is where the full pipeline is divided into a set of sub-pipelines, that can be run separately. Today, there is no obvious way to do this.
This approach provides a middle ground between shortcomings of both the 1:1 and M:1 mappings:
Kedro is a fast, iterative development tool largely because the user is not required to think about execution contexts. This unmanaged complexity is why it is difficult to resolve this granularity mismatch in production contexts.
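The three granularities can be pictured as three ways of partitioning the same node list into orchestrator tasks. The following is a minimal sketch in plain Python, not Kedro or plugin API: the node names, the group labels, and the helper functions are all hypothetical and exist only for illustration.

```python
# Hypothetical toy pipeline: (node name, group label), in execution order.
NODES = [
    ("clean_orders", "preprocessing"),
    ("clean_customers", "preprocessing"),
    ("build_features", "features"),
    ("train_model", "training"),
    ("evaluate_model", "training"),
]


def one_to_one(nodes):
    """1:1 mapping: every Kedro node becomes its own orchestrator task."""
    return {name: [name] for name, _ in nodes}


def many_to_one(nodes, task_name="run_pipeline"):
    """M:1 mapping: the whole pipeline becomes a single orchestrator task."""
    return {task_name: [name for name, _ in nodes]}


def many_to_n(nodes):
    """M:N mapping: nodes that share a group label become one task."""
    tasks = {}
    for name, group in nodes:
        tasks.setdefault(group, []).append(name)
    return tasks


for strategy in (one_to_one, many_to_one, many_to_n):
    tasks = strategy(NODES)
    print(strategy.__name__, "->", len(tasks), "orchestrator task(s)")
```

The M:N variant is the only one that needs extra information (the group label), which is the piece the conventions below try to supply.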
Piecemeal localised conventions for describing M:N granularity have emerged across mature users:
Each of these has merits and drawbacks. In every case, the user is given no easy way to validate if these groups are mutually exclusive or collectively exhaustive.
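A check for mutual exclusivity and collective exhaustiveness does not need much machinery. The sketch below assumes the grouping is available as a plain dict from group name to node names (for example, collected from tags); `check_groups` and the sample data are hypothetical and not part of Kedro.

```python
from collections import Counter


def check_groups(all_nodes, groups):
    """Return (unassigned, duplicated) node names for a proposed grouping.

    all_nodes: iterable of every node name in the pipeline.
    groups: dict mapping a group name to the list of node names in it.
    """
    counts = Counter(name for members in groups.values() for name in members)
    # Nodes that belong to no group: the grouping is not exhaustive.
    unassigned = sorted(set(all_nodes) - set(counts))
    # Nodes listed more than once: the grouping is not mutually exclusive.
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    return unassigned, duplicated


# Hypothetical example data.
nodes = ["clean", "features", "train", "report"]
groups = {"prep": ["clean", "features"], "ml": ["features", "train"]}

unassigned, duplicated = check_groups(nodes, groups)
print("not in any group:", unassigned)
print("in several groups:", duplicated)
```

A grouping passes when both lists come back empty.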
Despite the namespace option being the most robust approach available (since v0.16.x), namespaces are not in wide use across our power-user base. There are several hypotheses for the low adoption rate:
• namespaces != modular pipelines != micropackaging. Overlapping features, all unrelated to deployment, confuse the value for the user.
• Today, namespaces are primarily used for visualisation and pipeline reuse, not deployment.
• Internal monorepo tooling now covers much of the micropackaging feature space.
• The error messages provided by Kedro when applying namespaces are unhelpful²
¹ May be resolved by new dataset factory feature
² e.g. Failed to map datasets and/or parameters: params:features
Potential approaches to M:N grouping
Even for a mid-sized pipeline, it is not trivial to find the "optimum" grouping of nodes.
ParallelRunner and ThreadRunner.
MemoryDataset, to be the starting node of a new group. The assumption is that users persist data after checkpointing meaningful work. In a theoretically perfect production system one would only persist the very end of the pipeline.
Validating the groups
After nodes are mapped to several groups, sanity checks and questions need to be answered.
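The persisted-data heuristic described above and one possible sanity check can be sketched together in plain Python. This is a sketch under stated assumptions, not Kedro API: the pipeline is given as a hypothetical dict from node name to its input and output dataset names, and `persisted` is the set of datasets saved to storage. Nodes linked by a non-persisted dataset are merged into one group, and the check then asks whether the dependencies between the resulting groups still form a DAG.

```python
def group_by_memory_links(nodes, persisted):
    """Map each node to a group id; nodes sharing in-memory data share a group."""
    parent = {name: name for name in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    producer = {ds: name for name, (_, outs) in nodes.items() for ds in outs}
    for name, (inputs, _) in nodes.items():
        for ds in inputs:
            if ds in producer and ds not in persisted:
                parent[find(name)] = find(producer[ds])
    return {name: find(name) for name in nodes}


def group_graph_has_cycle(nodes, group_of):
    """Return True if the dependencies between groups contain a cycle."""
    producer = {ds: name for name, (_, outs) in nodes.items() for ds in outs}
    edges = {group: set() for group in set(group_of.values())}
    for name, (inputs, _) in nodes.items():
        for ds in inputs:
            if ds in producer:
                src, dst = group_of[producer[ds]], group_of[name]
                if src != dst:
                    edges[src].add(dst)
    # Kahn's algorithm: groups left unvisited sit on a cycle.
    indegree = {group: 0 for group in edges}
    for targets in edges.values():
        for dst in targets:
            indegree[dst] += 1
    ready = [group for group, deg in indegree.items() if deg == 0]
    visited = 0
    while ready:
        group = ready.pop()
        visited += 1
        for dst in edges[group]:
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    return visited != len(edges)


# Hypothetical pipeline: node -> (inputs, outputs).
toy_nodes = {
    "a": (set(), {"x_mem", "y_saved"}),
    "b": ({"y_saved"}, {"z_saved"}),
    "c": ({"x_mem", "z_saved"}, set()),
}
toy_groups = group_by_memory_links(toy_nodes, persisted={"y_saved", "z_saved"})
print(toy_groups, group_graph_has_cycle(toy_nodes, toy_groups))
```

In this toy example, a and c share an in-memory dataset and land in one group, yet b sits between them, so the group-level graph contains a cycle and could not be scheduled as-is. This is the kind of case a validation step would need to catch.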
Expressing the groups
Requirements management in large Kedro projects
requirements.txts, but it is still up to the user to make these work neatly in independent environments.
Separating pipeline definition and execution environments
No link between distributed KedroSessions of the same pipeline
As described below, most deployment plugins run the Kedro CLI under the hood.
A KedroSession for each of these steps is created, and a separate session_id is assigned to each of them.
Passing ephemeral data between distributed runs
Kedro, by default, uses MemoryDataSets to hold intermediate data. However, this dataset type cannot be used in a distributed setting since containers do not share main memory.
Deployment plugins usually replace the MemoryDataset by:
Runner implementation with another default dataset type
In either case, ephemeral data is, at least temporarily, persisted to storage (cloud bucket, Kubernetes volume, etc.). The [de-]serialisation of data throttles the pipeline execution speed and, in many cases, leads to worse performance in the distributed setting compared to a local run.
There are some solutions, like the CNCF vineyard project, that offer in-memory data access and might improve execution speed, but only in K8s-specific situations.
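The replacement of in-memory datasets by persisted ones can be pictured as filling in a default catalog entry for every dataset the user did not declare. The following is a minimal sketch in plain Python and not the implementation of any particular plugin: the function name, the `run_id` scoping, the bucket path, and the `pickle.PickleDataset` type string are assumptions (the type spelling in particular differs between Kedro versions).

```python
def fill_ephemeral_defaults(dataset_names, catalog, base_path, run_id):
    """Return a copy of `catalog` with an entry for every undeclared dataset.

    dataset_names: all dataset names used by the pipeline.
    catalog: dict of the catalog entries the user declared explicitly.
    base_path: shared storage location reachable from every container.
    run_id: identifier shared by all steps of one pipeline run.
    """
    completed = dict(catalog)
    for name in dataset_names:
        if name == "parameters" or name.startswith("params:"):
            continue  # parameters are configuration, not data to persist
        if name not in completed:
            completed[name] = {
                "type": "pickle.PickleDataset",  # assumed type string
                "filepath": f"{base_path}/{run_id}/{name}.pkl",
            }
    return completed


# Hypothetical example: only the raw input is declared by the user.
declared = {
    "orders": {"type": "pandas.CSVDataset", "filepath": "data/01_raw/orders.csv"},
}
used = ["orders", "clean_orders", "params:features"]
full = fill_ephemeral_defaults(used, declared, "s3://my-bucket/tmp", "run-001")
for name, entry in full.items():
    print(name, "->", entry["filepath"])
```

Every dataset that previously lived in memory now costs one write and one read against shared storage, which is where the slowdown described above comes from.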
Differentiating between data, model, and reporting artifacts
Processing and Training
Steps
kinds for datasets and models.
load.
PickleDataSet can store any Python object and it is not known whether the dataset stores a model. In general, there is a strong argument that ONNX (LFAI) must be the default model serialisation mechanism within Kedro.
Lack of a standard pattern for iterative development
Currently, deployment plugins address the one-way task of converting a developed pipeline into a deployment. When deployment is viewed as an iterative process of development and deployment steps, additional gaps need to be bridged.
Linking source code to execution
There are two popular configurations of the link between source code and platform, (1) tight and (2) loose:
Keeping code and configuration separated
kedro package in strict adherence with 12factor app.
Limiting duplicated build efforts
In a setup where the pipeline is continuously deployed, repeating the same deployment workflow may lead to inefficiencies:
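One way such repeated work might be reduced, assuming the container image build is among the expensive steps, is to derive image tags from the content that actually goes into each layer and to skip the build when the tag already exists. The sketch below is plain Python with hypothetical names (`layer_key`, `plan_build`, the `deps-`/`app-` tag prefixes); it does not describe what any existing plugin does.

```python
import hashlib


def layer_key(*contents: bytes) -> str:
    """Short, stable hash over one or more byte strings."""
    digest = hashlib.sha256()
    for blob in contents:
        # Length prefix keeps (b"ab", b"c") distinct from (b"a", b"bc").
        digest.update(len(blob).to_bytes(8, "big"))
        digest.update(blob)
    return digest.hexdigest()[:12]


def plan_build(requirements: bytes, source: bytes, existing_tags: set) -> dict:
    """Decide which image layers need rebuilding for this deployment."""
    deps_tag = "deps-" + layer_key(requirements)
    app_tag = "app-" + layer_key(requirements, source)
    return {
        "deps_tag": deps_tag,
        "app_tag": app_tag,
        "rebuild_deps": deps_tag not in existing_tags,
        "rebuild_app": app_tag not in existing_tags,
    }


# Hypothetical example: the source changes, the requirements do not.
first = plan_build(b"kedro\npandas\n", b"pipeline v1", existing_tags=set())
registry = {first["deps_tag"], first["app_tag"]}
second = plan_build(b"kedro\npandas\n", b"pipeline v2", existing_tags=registry)
print(second)
```

In the second call only the application layer is flagged for a rebuild, because the dependency tag is already present in the (hypothetical) registry.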
Kedro dependency after deployment to orchestrator
There may be some situations where Kedro integrating with a target platform leaves much of the platform feature set under-utilised. From the platform's perspective, deployed Kedro pipelines may feel like "closed boxes".
For many deployment plugins, translating a Kedro pipeline means encapsulating the Kedro project within a Docker container and executing
specific nodes via the Kedro CLI.
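What "executing specific nodes via the Kedro CLI" amounts to can be sketched as building one `kedro run` command per orchestrator task. The helper below is hypothetical and only assembles the argument list; the `--tags`, `--nodes` and `--env` spellings are assumed from recent Kedro releases and have changed between versions, so `kedro run --help` in the target environment is the authority.

```python
import shlex


def kedro_run_command(tag=None, nodes=None, env=None):
    """Build the argv list for one containerised `kedro run` step."""
    cmd = ["kedro", "run"]
    if tag:
        cmd += ["--tags", tag]  # run every node carrying this tag
    if nodes:
        cmd += ["--nodes", ",".join(nodes)]  # run an explicit list of nodes
    if env:
        cmd += ["--env", env]  # select the configuration environment
    return cmd


# Hypothetical orchestrator tasks, one command per task.
by_tag = kedro_run_command(tag="training", env="prod")
by_nodes = kedro_run_command(nodes=["clean_orders", "build_features"])
print(shlex.join(by_tag))
print(shlex.join(by_nodes))
```

From the platform's side each task is then just a container running an opaque command, which is what makes the deployed pipeline feel like a "closed box".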
So, pipeline execution depends on Kedro in two ways:
/opt/models) in the way that they handle artifact management, these features are often bypassed and not automatically available to the users.
Recommended changes to Kedro core
session_id Setting: Simplify session_id management in distributed Kedro pipelines (see issue #2182).
M:N Groups in Kedro: Establish conventions for M:N groups with deployment focus. (See kedro-plugins PR#241)
Deployment plugins
Overview of plugins
Almost all plugins rely on a Docker image to wrap the Kedro project. The Docker image is usually built just before executing the pipeline, and source code is copied into the image as part of the build.
MemoryDatasets.
It is also worth noting that beyond data management and experiment tracking, deployment plugins often fail to leverage or unlock the full potential of platform-specific capabilities.
These unused capabilities include:
Comparison
* [O] maintained by the Kedro org, [G] maintained by the GetInData org