
Feat/sagemaker llms #234

Open · isobel-daley-6point6 wants to merge 1 commit into base: main
Conversation

@isobel-daley-6point6 (Contributor) commented Feb 5, 2025

Overview

This PR introduces SageMaker asynchronous inference endpoints to Data Workspace. SageMaker asynchronous endpoints can be used to deploy self-hosted ML models (including those that require GPUs, like LLMs). Users of Data Workspace tools (Theia/Jupyter/VSCode) will be able to invoke these inference endpoints. They will not have permission to deploy new inference endpoints.

Feature Flags

The overall SageMaker functionality has been introduced behind a feature flag (set by var.sagemaker_on).

A model-specific feature flag has also been added. This can be used to easily turn models 'on' and 'off'. In this PR, there is only one model (phi_2_3b). Therefore there is one model-specific feature flag, set by var.sagemaker_phi_2_3b.

High Level Summary of Functionality

SageMaker model artefacts are stored in S3 (model weights) and ECR (dependencies and inference code). A SageMaker model is created from these artefacts and deployed behind a SageMaker asynchronous inference endpoint with autoscaling.

A user can invoke the asynchronous endpoint from Data Workspace Python tools using the boto3 library. When a SageMaker inference endpoint is called, the request enters a backlog. This triggers SageMaker to provision the necessary infrastructure (an EC2 instance) to run the model. Once the model endpoint is available, the user's request is processed and the output is sent to a centralised SageMaker S3 bucket. Users of Data Workspace tools do not have access to this bucket; instead, SNS triggers a Lambda function that copies the SageMaker output file from the centralised SageMaker bucket to the user's own Data Workspace file space. When no requests remain in the backlog, the infrastructure associated with the endpoint scales down.
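
For illustration, an invocation from a Python tool might look roughly like the sketch below. The endpoint name and S3 URIs are placeholders, not values from this PR; the payload must already be uploaded to S3 before the call:

```python
def build_async_request(endpoint_name: str, input_s3_uri: str,
                        content_type: str = "application/json") -> dict:
    """Assemble the keyword arguments for InvokeEndpointAsync."""
    return {
        "EndpointName": endpoint_name,
        "InputLocation": input_s3_uri,  # payload already staged in S3
        "ContentType": content_type,
    }


def invoke(endpoint_name: str, input_s3_uri: str) -> str:
    """Queue an async invocation and return the S3 URI where the result will appear."""
    import boto3  # imported lazily so the pure helper above works offline

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint_async(
        **build_async_request(endpoint_name, input_s3_uri)
    )
    # The response is returned immediately; the actual output is written
    # asynchronously to the OutputLocation once the model has scaled up.
    return response["OutputLocation"]
```

The caller then waits for the copy Lambda to deliver the result file to their own files area rather than polling the central output bucket, which they cannot read.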

Architecture Diagram

[Architecture diagram: sagemaker_data_workspace_arch-Page-3.drawio]

Implementation Details

SageMaker VPC

A new VPC has been created with a single private subnet. This VPC is used to host:

  • All SageMaker asynchronous inference endpoints
  • VPC endpoints for ECR, S3 and SNS

This VPC is peered with:

  • The main VPC to enable access to the SageMaker API and Runtime VPC endpoints
  • The notebooks VPC to allow users of Data Workspace tools access to the SageMaker asynchronous inference endpoints

New VPC Endpoints in main VPC

Two new VPC endpoints have been added to the main VPC:

  • SageMaker Runtime: This endpoint manages requests to the deployed SageMaker models
  • SageMaker API: This endpoint enables programmatic access to SageMaker features (e.g. using the boto3 library)

These VPC endpoints have been placed in the main VPC as it is anticipated that services like data-flow will need to access them in the future.

SageMaker Asynchronous Inference Endpoints

The sagemaker_llm_resource.tf file calls a reusable module ./modules/sagemaker_deployment. This module enables setup of new SageMaker asynchronous inference endpoints. Each new asynchronous endpoint consists of the following resources:

  • Model: Sets out the location of the model artefacts (weights and inference code) and the VPC configuration.
  • Endpoint Configuration: Defines the endpoint type as asynchronous, sets up SNS success/failure topics and sets the S3 output location.
  • Endpoint: Brings together the model and endpoint configuration behind a deployed endpoint.
  • CloudWatch Alarms: Multiple alarms are implemented to support autoscaling, based on metrics including CPU utilisation and request backlog size.
  • Autoscaling: Autoscaling policies (triggered by the CloudWatch alarms) enable scaling based on workload requirements, driven by the CPU utilisation and backlog metrics.
  • Alerting via SNS Topics: SNS topics are triggered by specific alarms. SNS notifications currently trigger Lambdas set up to send Slack notifications (NB: this will be migrated to Teams in due course).
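
The PR implements the scaling in Terraform; as a rough boto3 equivalent of what the autoscaling resources set up, a target-tracking policy on the async-inference backlog metric might look like this (endpoint name, variant name and thresholds are illustrative, and the actual module may use different metrics or policy types):

```python
def backlog_target_tracking_policy(endpoint_name: str,
                                   target_backlog_per_instance: float = 5.0) -> dict:
    """Build a target-tracking config keyed on the async-inference backlog metric."""
    return {
        "TargetValue": target_backlog_per_instance,
        "CustomizedMetricSpecification": {
            # Metric published by SageMaker for asynchronous endpoints
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    }


def register_backlog_scaling(endpoint_name: str) -> None:
    """Attach the policy to the endpoint variant (placeholder names)."""
    import boto3

    autoscaling = boto3.client("application-autoscaling")
    autoscaling.put_scaling_policy(
        PolicyName=f"{endpoint_name}-backlog-tracking",
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration=backlog_target_tracking_policy(endpoint_name),
    )
```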

SageMaker is granted permissions via the inference and execution roles to do the following:

  • Access specific S3 buckets:
    • SageMaker output bucket to publish the model's response
    • Notebooks bucket to access users' inputs
    • Model artefacts hosted in a specific AWS owned account
  • ECR to access model artefacts
  • CloudWatch to publish logs for monitoring
  • Application-Autoscaling to enable autoscaling of underlying infrastructure
  • EC2 to create ENIs to associate with the underlying infrastructure on which the models are being run
  • Logs to enable logging

Lambdas

Lambdas have been implemented to cover the following:

  • Copying the model's outputs from the central SageMaker output bucket to the user's own files area. This is triggered by a "success" notification to the SageMaker success SNS topic
  • Copying logs from CloudWatch to S3 (may be removed)
  • Sending alerts to Slack when CloudWatch alarms are triggered
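
A minimal sketch of the copy Lambda's logic, assuming the standard SageMaker async-inference success notification shape (responseParameters.outputLocation holding the s3:// URI of the result) — the destination bucket and user prefix here are placeholders:

```python
import json
from urllib.parse import urlparse

# Placeholder destination; the real Lambda derives the user's files area.
DEST_BUCKET = "notebooks-bucket"
DEST_PREFIX = "user/alice/sagemaker-output"


def parse_success_notification(sns_message: str) -> tuple:
    """Extract (bucket, key) of the result file from a success notification."""
    body = json.loads(sns_message)
    output_uri = body["responseParameters"]["outputLocation"]
    parsed = urlparse(output_uri)
    return parsed.netloc, parsed.path.lstrip("/")


def handler(event, context):
    """Hypothetical Lambda entry point, triggered by the success SNS topic."""
    import boto3

    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket, key = parse_success_notification(record["Sns"]["Message"])
        # Copy the result from the central bucket into the user's own area.
        s3.copy_object(
            CopySource={"Bucket": bucket, "Key": key},
            Bucket=DEST_BUCKET,
            Key=f"{DEST_PREFIX}/{key.rsplit('/', 1)[-1]}",
        )
```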

AWS Budgets

An AWS Budget has been set up to support tracking of costs relating to SageMaker.

Data Workspace Tools: User Permissions

Permissions have been added to the notebook_task_execution policy to allow:

  • The SageMaker API and SageMaker Runtime VPC endpoints (located in the main VPC) to be used, enabling programmatic access to the SageMaker inference endpoints.
  • SageMaker inference endpoints to be described, listed and invoked.
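
Illustratively, this implies an IAM statement along the following lines (the actual action list and resource ARNs live in the Terraform; the account id and wildcard below are placeholders):

```python
import json

# Hypothetical statement resembling what notebook_task_execution gains.
SAGEMAKER_INVOKE_STATEMENT = {
    "Effect": "Allow",
    "Action": [
        "sagemaker:DescribeEndpoint",
        "sagemaker:ListEndpoints",
        "sagemaker:InvokeEndpoint",
        "sagemaker:InvokeEndpointAsync",
    ],
    # Placeholder account id and endpoint wildcard
    "Resource": "arn:aws:sagemaker:*:123456789012:endpoint/*",
}

POLICY = {"Version": "2012-10-17", "Statement": [SAGEMAKER_INVOKE_STATEMENT]}
print(json.dumps(POLICY, indent=2))
```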

@isobel-daley-6point6 isobel-daley-6point6 marked this pull request as ready for review February 13, 2025 20:09
@isobel-daley-6point6 isobel-daley-6point6 requested a review from a team as a code owner February 13, 2025 20:09
@@ -249,6 +249,11 @@ data "aws_ecr_lifecycle_policy_document" "expire_untagged_after_one_day" {
}
}

resource "aws_ecr_repository" "sagemaker" {

@peter-woodcock identified that this can be removed. @isobel-daley-6point6 we can review.

# Use the data source to get the bucket ARN from the bucket name
data "aws_s3_bucket" "sagemaker_default_bucket" {
  bucket = var.sagemaker_default_bucket_name
}

@peter-woodcock identified this bucket is not defined as a resource. @isobel-daley-6point6 we can review.

@@ -274,38 +274,94 @@ variable "s3_prefixes_for_external_role_copy" {
default = ["import-data", "export-data"]
}

variable "sagemaker_example_inference_image" { default = "" }

@peter-woodcock identified this can be removed. @isobel-daley-6point6 we can review.
