
Feat/sagemaker llms #234

Open · isobel-daley-6point6 wants to merge 1 commit into base: main
Conversation

@isobel-daley-6point6 (Contributor) commented Feb 5, 2025

Overview

This PR introduces SageMaker asynchronous inference endpoints to Data Workspace. SageMaker asynchronous endpoints can be used to deploy self-hosted ML models (including those that require GPUs, like LLMs). Users of Data Workspace tools (Theia/Jupyter/VSCode) will be able to invoke these inference endpoints. They will not have permission to deploy new inference endpoints.

Feature Flags

The overall SageMaker functionality has been introduced behind a feature flag (set by var.sagemaker_on).

A model-specific feature flag has also been added. This can be used to easily turn models 'on' and 'off'. In this PR, there is only one model (phi_2_3b). Therefore there is one model-specific feature flag, set by var.sagemaker_phi_2_3b.

High Level Summary of Functionality

SageMaker model artefacts are stored in S3 (model weights) and ECR (dependencies and inference code). A SageMaker model is created from these artefacts and deployed behind a SageMaker asynchronous inference endpoint with autoscaling.

A user can invoke the asynchronous endpoint from Data Workspace Python tools using the boto3 library. When a SageMaker inference endpoint is called, the request enters a backlog. This triggers SageMaker to provision the necessary infrastructure (an EC2 instance) to run the model. Once the model endpoint is available, the user's request is processed and the output is sent to a centralised SageMaker S3 bucket. Users of Data Workspace tools do not have access to this bucket; instead, SNS triggers a Lambda function that copies the SageMaker output file from the centralised SageMaker bucket to the user's own Data Workspace file space. When no requests remain in the backlog, the infrastructure associated with the endpoint scales down.
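
For illustration, an invocation from a Python tool might look roughly like the sketch below. The endpoint name and S3 URIs are placeholders, not values from this PR; the payload must already be uploaded to S3 before the call:

```python
def build_async_request(endpoint_name: str, input_s3_uri: str,
                        content_type: str = "application/json") -> dict:
    """Assemble the keyword arguments for InvokeEndpointAsync."""
    return {
        "EndpointName": endpoint_name,
        "InputLocation": input_s3_uri,  # payload already staged in S3
        "ContentType": content_type,
    }


def invoke(endpoint_name: str, input_s3_uri: str) -> str:
    """Queue an async invocation and return the S3 URI where the result will appear."""
    import boto3  # imported lazily so the pure helper above works offline

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint_async(
        **build_async_request(endpoint_name, input_s3_uri)
    )
    # The response is returned immediately; the actual output is written
    # asynchronously to the OutputLocation once the model has scaled up.
    return response["OutputLocation"]
```

The caller then waits for the copy Lambda to deliver the result file to their own files area rather than polling the central output bucket, which they cannot read.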

Architecture Diagram

[Architecture diagram: sagemaker_data_workspace_arch-Page-3.drawio]

Implementation Details

SageMaker VPC

A new VPC has been created with a single private subnet. This VPC is used to host:

  • All SageMaker asynchronous inference endpoints
  • VPC endpoints for ECR, S3 and SNS

This VPC is peered with:

  • The main VPC to enable access to the SageMaker API and Runtime VPC endpoints
  • The notebooks VPC to allow users of Data Workspace tools access to the SageMaker asynchronous inference endpoints

New VPC Endpoints in main VPC

Two new VPC endpoints have been added to the main VPC:

  • SageMaker Runtime: This endpoint manages requests to the deployed SageMaker models
  • SageMaker API: This endpoint enables programmatic access to SageMaker features (e.g. using the boto3 library)

These VPC endpoints have been placed in the main VPC as it is anticipated that services like data-flow will need to access them in the future.

SageMaker Asynchronous Inference Endpoints

The sagemaker_llm_resource.tf file calls a reusable module ./modules/sagemaker_deployment. This module enables setup of new SageMaker asynchronous inference endpoints. Each new asynchronous endpoint consists of the following resources:

  • Model: Sets out the location of the model artefacts (weights and inference code) and the VPC configuration.
  • Endpoint Configuration: Defines the endpoint type as asynchronous, sets up SNS success/failure topics and sets the S3 output location.
  • Endpoint: Brings together the model and endpoint configuration behind a deployed endpoint.
  • CloudWatch Alarms: Multiple alarms are implemented to support autoscaling, based on metrics including CPU utilisation and request backlog size.
  • Autoscaling: Autoscaling policies (triggered by the CloudWatch alarms) enable scaling based on workload requirements, driven by the CPU utilisation and backlog metrics.
  • Alerting via SNS Topics: SNS topics are triggered by specific alarms. SNS notifications currently trigger Lambdas set up to send Slack notifications (NB: this will be migrated to Teams in due course).
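
The PR implements the scaling in Terraform; as a rough boto3 equivalent of what the autoscaling resources set up, a target-tracking policy on the async-inference backlog metric might look like this (endpoint name, variant name and thresholds are illustrative, and the actual module may use different metrics or policy types):

```python
def backlog_target_tracking_policy(endpoint_name: str,
                                   target_backlog_per_instance: float = 5.0) -> dict:
    """Build a target-tracking config keyed on the async-inference backlog metric."""
    return {
        "TargetValue": target_backlog_per_instance,
        "CustomizedMetricSpecification": {
            # Metric published by SageMaker for asynchronous endpoints
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    }


def register_backlog_scaling(endpoint_name: str) -> None:
    """Attach the policy to the endpoint variant (placeholder names)."""
    import boto3

    autoscaling = boto3.client("application-autoscaling")
    autoscaling.put_scaling_policy(
        PolicyName=f"{endpoint_name}-backlog-tracking",
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/{endpoint_name}/variant/AllTraffic",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration=backlog_target_tracking_policy(endpoint_name),
    )
```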

SageMaker is granted permissions via the inference and execution roles to do the following:

  • Access specific S3 buckets:
    • SageMaker output bucket to publish the model's response
    • Notebooks bucket to access users' inputs
    • Model artefacts hosted in a specific AWS owned account
  • ECR to access model artefacts
  • CloudWatch to publish logs for monitoring
  • Application-Autoscaling to enable autoscaling of underlying infrastructure
  • EC2 to create ENIs to associate with the underlying infrastructure on which the models are being run
  • Logs to enable logging

Lambdas

Lambdas have been implemented to cover the following:

  • Copying the model's outputs from the central SageMaker output bucket to the user's own files area. This is triggered by a "success" notification to the SageMaker success SNS topic
  • Copying logs from CloudWatch to S3 (may be removed)
  • Sending alerts to Slack when CloudWatch alarms are triggered
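
A minimal sketch of the copy Lambda's logic, assuming the standard SageMaker async-inference success notification shape (responseParameters.outputLocation holding the s3:// URI of the result) — the destination bucket and user prefix here are placeholders:

```python
import json
from urllib.parse import urlparse

# Placeholder destination; the real Lambda derives the user's files area.
DEST_BUCKET = "notebooks-bucket"
DEST_PREFIX = "user/alice/sagemaker-output"


def parse_success_notification(sns_message: str) -> tuple:
    """Extract (bucket, key) of the result file from a success notification."""
    body = json.loads(sns_message)
    output_uri = body["responseParameters"]["outputLocation"]
    parsed = urlparse(output_uri)
    return parsed.netloc, parsed.path.lstrip("/")


def handler(event, context):
    """Hypothetical Lambda entry point, triggered by the success SNS topic."""
    import boto3

    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket, key = parse_success_notification(record["Sns"]["Message"])
        # Copy the result from the central bucket into the user's own area.
        s3.copy_object(
            CopySource={"Bucket": bucket, "Key": key},
            Bucket=DEST_BUCKET,
            Key=f"{DEST_PREFIX}/{key.rsplit('/', 1)[-1]}",
        )
```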

AWS Budgets

An AWS Budget has been set up to support tracking of costs relating to SageMaker.

Data Workspace Tools: User Permissions

Permissions have been added to the notebook_task_execution policy to allow:

  • The SageMaker API and SageMaker Runtime VPC endpoints (located in the main VPC) to be used, enabling programmatic access to the SageMaker inference endpoints.
  • SageMaker inference endpoints to be described, listed and invoked.
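
Illustratively, this implies an IAM statement along the following lines (the actual action list and resource ARNs live in the Terraform; the account id and wildcard below are placeholders):

```python
import json

# Hypothetical statement resembling what notebook_task_execution gains.
SAGEMAKER_INVOKE_STATEMENT = {
    "Effect": "Allow",
    "Action": [
        "sagemaker:DescribeEndpoint",
        "sagemaker:ListEndpoints",
        "sagemaker:InvokeEndpoint",
        "sagemaker:InvokeEndpointAsync",
    ],
    # Placeholder account id and endpoint wildcard
    "Resource": "arn:aws:sagemaker:*:123456789012:endpoint/*",
}

POLICY = {"Version": "2012-10-17", "Statement": [SAGEMAKER_INVOKE_STATEMENT]}
print(json.dumps(POLICY, indent=2))
```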

@isobel-daley-6point6 isobel-daley-6point6 marked this pull request as ready for review February 13, 2025 20:09
@isobel-daley-6point6 isobel-daley-6point6 requested a review from a team as a code owner February 13, 2025 20:09
@@ -249,6 +249,11 @@ data "aws_ecr_lifecycle_policy_document" "expire_untagged_after_one_day" {
}
}

resource "aws_ecr_repository" "sagemaker" {

@peter-woodcock identified that this can be removed. @isobel-daley-6point6 we can review.

# Use the data source to get the bucket ARN from the bucket name
data "aws_s3_bucket" "sagemaker_default_bucket" {
  bucket = var.sagemaker_default_bucket_name
}

@peter-woodcock identified this bucket is not defined as a resource. @isobel-daley-6point6 we can review.

@@ -274,38 +274,94 @@ variable "s3_prefixes_for_external_role_copy" {
default = ["import-data", "export-data"]
}

variable "sagemaker_example_inference_image" { default = "" }

@peter-woodcock identified this can be removed. @isobel-daley-6point6 we can review.
