Add OCI support #84

Open · wants to merge 15 commits into base: main
18 changes: 18 additions & 0 deletions cloud-service-providers/oci/oke/README.md
@@ -0,0 +1,18 @@
# NIM on Oracle Cloud Infrastructure (OCI) OKE
> **Collaborator:** Can you please add a link to this document on the main readme.md table that links to each CSP managed K8s deployment?


To deploy NIM on Oracle Cloud Infrastructure (OCI) successfully, it’s crucial to choose the correct GPU shapes and ensure that the appropriate NVIDIA drivers are installed.

When you select a GPU shape for a managed node pool or self-managed node in OKE, you must also select a compatible Oracle Linux GPU image that has the CUDA libraries pre-installed. The names of compatible images include 'GPU'. Because these Oracle Linux (OEL) images ship with the NVIDIA GPU drivers pre-installed, they simplify the deployment process for NIM.
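To see which compatible images are available, you can list Oracle Linux images whose names include 'GPU' with the OCI CLI. This is a sketch; `<compartment-ocid>` is a placeholder for your own compartment:

```bash
# List Oracle Linux images with 'GPU' in the name (CUDA libraries pre-installed).
oci compute image list \
  --compartment-id <compartment-ocid> \
  --operating-system "Oracle Linux" \
  --all \
  --query 'data[?contains("display-name", `GPU`)].{name: "display-name"}' \
  --output table
```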


## Prerequisites

Please follow the [prerequisite instructions](./prerequisites/README.md) to prepare for OKE cluster creation.

## Create OKE

Please follow the [Create OKE instructions](./setup/README.md) to create the OKE cluster.

## Deploy NIM

Please follow the [Deploy NIM instructions](../../../helm/README.md) to deploy NIM.
66 changes: 66 additions & 0 deletions cloud-service-providers/oci/oke/prerequisites/README.md
@@ -0,0 +1,66 @@
# OKE Prerequisites

This list summarizes the key prerequisites you need to set up before deploying an OKE cluster on OCI.

- **OCI Account and Tenancy**:
- Ensure you have an OCI account with the necessary permissions.
- Set up a compartment for your Kubernetes cluster.

- **Networking**:
- Create a Virtual Cloud Network (VCN) with appropriate subnets.
- Ensure internet gateway, NAT gateway, and service gateway are configured.
- Set up route tables and security lists for network traffic.

- **IAM Policies**:
  - Define IAM policies to allow the OKE service to manage resources in your compartment (see the example statements after this list).
- Grant required permissions to users or groups managing the Kubernetes cluster.

- **Service Limits**:
- Verify that your tenancy has sufficient service limits for compute instances, block storage, and other required resources.

- **CLI and SDK Tools**:
- Install and configure the OCI CLI for managing OKE.
- Optionally, set up OCI SDKs for automating tasks.

- **Kubernetes Version**:
- Decide on the Kubernetes version to deploy, ensuring compatibility with your applications and OCI features.

- **API Endpoint**:
- Choose between the public or private endpoint for the Kubernetes API server, based on your security requirements.
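
As a sketch, the IAM policy statements for a group administering OKE might look like the following; the group name `oke-admins` and `<compartment-name>` are placeholders to adapt to your tenancy:

```
Allow group oke-admins to manage cluster-family in compartment <compartment-name>
Allow group oke-admins to manage virtual-network-family in compartment <compartment-name>
Allow group oke-admins to manage instance-family in compartment <compartment-name>
```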

For more details, see the [OKE prerequisites documentation](https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengprerequisites.htm).


## Install OCI CLI

```bash
bash -c "$(curl -L https://raw.githubusercontent.com/oracle/oci-cli/master/scripts/install/install.sh)"
```

For more details, see the [OCI CLI installation guide](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm).
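
After installation, configure the CLI and verify that authentication works. `oci setup config` walks you through creating the config file; the region listing is just a quick sanity check:

```bash
# Interactive setup: prompts for your user/tenancy OCIDs, region, and an API key pair.
oci setup config

# Sanity check -- any authenticated call works; listing regions is a cheap one.
oci iam region list --output table
```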

## Install kubectl

```bash
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
echo "$(cat kubectl.sha256) kubectl" | sha256sum --check
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
kubectl version --client
```

For more details, see the [kubectl installation guide](https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/).

## Install Helm

```bash
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
```

For more details, see the [Helm installation guide](https://helm.sh/docs/intro/install/).

## Next step

[Continue to OKE creation](../setup/README.md)
85 changes: 85 additions & 0 deletions cloud-service-providers/oci/oke/setup/README.md
@@ -0,0 +1,85 @@
# Setup OCI Kubernetes Engine (OKE)

The key to creating an Oracle Kubernetes Engine (OKE) cluster for NIM is setting up a proper GPU node pool. The following steps will guide you through the process.

## Connect to OCI

1. Log in to your Oracle Cloud Infrastructure (OCI) Console.
2. Select the appropriate compartment where you want to create the OKE cluster.

## Identify GPU needed for NIM

- Refer to the [NIM support matrix](https://docs.nvidia.com/nim/large-language-models/latest/support-matrix.html) to identify the NVIDIA GPU you need. See also the list of available [OKE NVIDIA GPU node shapes](https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#vm-gpu).


## Confirm GPU availability in your region

Use the OCI CLI to search for GPU availability:

```bash
oci compute shape list --region <region-name> --compartment-id <your-compartment-id> --all --query 'data[*].shape' --output json | jq -r '.[]' | grep -i 'gpu'
```

Cross-reference with the [OCI Regions](https://www.oracle.com/cloud/data-regions.html) to select the best region.

## Request Quota

Ensure you have the necessary service limits (quota) for the GPU shapes. If needed, request an increase via the OCI Console:

1. Navigate to **Governance and Administration** > **Limits, Quotas, and Usage**.
2. Select **Request Service Limit Increase** for the relevant GPU shapes.
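
Before opening a request, you can also inspect current limits from the CLI. A sketch assuming the `compute` service scope; exact limit names vary by GPU shape:

```bash
# List compute service limits and grep for GPU-related entries.
oci limits value list \
  --compartment-id <tenancy-ocid> \
  --service-name compute \
  --all --output json | jq -r '.data[] | "\(.name) \(.value)"' | grep -i gpu
```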

## Create OKE

To easily create the OKE cluster and NVIDIA GPU node pool, use Quick Create to initially set up the cluster with a default node pool that contains a single, simple VM node. After the cluster is created, add a new node pool with GPU shapes, which require a larger boot volume size. This way the GPU nodes get the storage they need while you configure them manually as required.

1. In the OCI Console, navigate to **Developer Services** > **Kubernetes Clusters** > **OKE Clusters**.
2. Click **Create Cluster** and select **Quick Create**.
3. Configure the following:
- **Name**: Provide a name for your cluster.
- **Compartment**: Select the appropriate compartment.
- **Kubernetes Version**: Choose the latest stable version.
- **Kubernetes API endpoint**: Private or public.
- **Node type**: Managed.
- **Kubernetes worker nodes**: Private or public.
4. Under **Shape and image**:
- **Shape**: You can leave the default simple VM.Standard.x.
- **Node Count**: Start with 1 node (adjust as needed).
   - **Add an SSH key** (optional): provides SSH access to the nodes.
5. Click **Create Cluster** to start the provisioning process. This will provision a simple cluster, to which you can subsequently add a GPU nodepool.

## Create GPU nodepool on existing OKE cluster

1. For an existing OKE cluster, navigate to the **Node Pools** section.
2. Click **Add Node Pool** and configure:
- **Name**: Provide a name for the node pool.
- **Compartment**: Select the appropriate compartment.
   - **Version**: The Kubernetes version of the nodes; defaults to the current cluster version.
   - **Node Placement Configuration**: Select the availability domain and worker node subnet.
   - **Node Shape**: Select the desired GPU-enabled shape.
   - **Node Image**: Automatically populated with an OEL GPU image; you can change it to a different version.
   - **Node Count**: Set the number of nodes (adjust according to your needs).
   - **Boot volume**: Specify a size larger than the default 50 GB, for example 300 GB. To make use of the larger volume, also apply the custom cloud-init script in the next step.
- **Show advanced options** -> **Initialization script** -> **Paste Cloud-init Script** and paste:
   ```bash
   #!/bin/bash
   # Run the standard OKE node bootstrap from the instance metadata service,
   # then grow the root filesystem to use the full (enlarged) boot volume.
   curl --fail -H "Authorization: Bearer Oracle" -L0 http://169.254.169.254/opc/v2/instance/metadata/oke_init_script | base64 --decode >/var/run/oke-init.sh
   bash /var/run/oke-init.sh
   /usr/libexec/oci-growfs -y
   ```
3. Click **Create Node Pool**.

## Connect to OKE

1. Install the OCI CLI if you haven't already.
2. Retrieve the OKE cluster credentials using the **Access Cluster** button on the Cluster details page in the console:

```bash
oci ce cluster create-kubeconfig --cluster-id <cluster OCID> --file $HOME/.kube/config --region <region> --token-version 2.0.0 --kube-endpoint PUBLIC_ENDPOINT
```

3. Verify the connection to your OKE cluster:

```bash
kubectl get nodes
```
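
Once the GPU node pool is active, you can also confirm that the nodes advertise GPUs as an allocatable resource:

```bash
# Each GPU node should report a non-empty nvidia.com/gpu count.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```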
133 changes: 133 additions & 0 deletions cloud-service-providers/oci/oke/terraform/README.md
@@ -0,0 +1,133 @@
# orm-stack-oke-helm-deployment-nim

## Getting started

This stack deploys an OKE cluster with two nodepools:
- one nodepool with flexible shapes
- one nodepool with GPU shapes

And several supporting applications using helm:
- nginx
- cert-manager
- jupyterhub

The goal is to demonstrate the self-hosted model capabilities of [NVIDIA NIM LLM](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html).

**Note:** For Helm deployments, it is necessary to create a bastion and an operator host (with the associated policy for the operator to manage the cluster), **or** to configure a cluster with a public API endpoint.

If the bastion and operator hosts are not created, it is a prerequisite to have the following tools already installed and configured:
- bash
- helm
- jq
- kubectl
- oci-cli

[![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://cloud.oracle.com/resourcemanager/stacks/create?zipUrl=https://github.com/ionut-sturzu/nim_on_oke/archive/refs/heads/main.zip)


## Helm Deployments

### Nginx

[Nginx](https://kubernetes.github.io/ingress-nginx/deploy/) is deployed and configured as the default ingress controller.

### Cert-manager

[Cert-manager](https://cert-manager.io/docs/) is deployed to handle TLS certificate configuration for the configured ingress resources. Currently it uses the [staging Let's Encrypt endpoint](https://letsencrypt.org/docs/staging-environment/).

### Jupyterhub

[Jupyterhub](https://jupyterhub.readthedocs.io/en/stable/) will be accessible at [https://jupyter.a.b.c.d.nip.io](https://jupyter.a.b.c.d.nip.io), where a.b.c.d is the public IP address of the load balancer associated with the NGINX ingress controller.

JupyterHub uses a dummy authentication scheme (username/password), and access is secured using the variables:

```
jupyter_admin_user
jupyter_admin_password
```

It also supports automatically cloning a Git repo when a user connects, making it available under the `examples` directory.

### NIM

The LLM is deployed using [NIM](https://docs.nvidia.com/nim/index.html).

Parameters:
- `nim_image_repository` and `nim_image_tag` - used to specify the container image location
- `NGC_API_KEY` - required to authenticate with NGC services

Models with a large context length require GPUs with a lot of memory. For Mistral, with a context length of 32k, the deployment fails on A10 instances with the default container settings.

To work around this issue, limit the context length using the `--max-model-len` argument for vLLM, the underlying inference engine used by NIM.

For Mistral models, create a `nim_user_values_override.yaml` file with the content below and provide it as input during ORM stack variable configuration.
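
The exact override keys depend on the NIM Helm chart version. A minimal sketch, assuming the chart forwards entries of an `env` list to the NIM container:

```yaml
# nim_user_values_override.yaml -- hypothetical sketch; verify the key names
# against your NIM Helm chart's values schema before using it.
env:
  - name: VLLM_ARGS                  # assumed variable name, not confirmed here
    value: "--max-model-len 16384"   # cap the context window so it fits A10 memory
```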

## How to deploy?

1. Deploy directly to OCI using the below button:

[![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://cloud.oracle.com/resourcemanager/stacks/create?zipUrl=https://github.com/ionut-sturzu/nim_on_oke/archive/refs/heads/main.zip)


2. Deploy via ORM
- Create a new stack
- Upload the TF configuration files
- Configure the variables
- Apply

3. Local deployment

- Create a file called `terraform.auto.tfvars` with the required values.

```hcl
# ORM injected values

region = "uk-london-1"
tenancy_ocid = "ocid1.tenancy.oc1..aaaaaaaaiyavtwbz4kyu7g7b6wglllccbflmjx2lzk5nwpbme44mv54xu7dq"
compartment_ocid = "ocid1.compartment.oc1..aaaaaaaaqi3if6t4n24qyabx5pjzlw6xovcbgugcmatavjvapyq3jfb4diqq"

# OKE Terraform module values
create_iam_resources = false
create_iam_tag_namespace = false
ssh_public_key = "<ssh_public_key>"

## NodePool with non-GPU shape is created by default with size 1
simple_np_flex_shape = { "instanceShape" = "VM.Standard.E4.Flex", "ocpus" = 2, "memory" = 16 }

## NodePool with GPU shape is created by default with size 0
gpu_np_size = 1
gpu_np_shape = "VM.GPU.A10.1"

## OKE Deployment values
cluster_name = "oke"
vcn_name = "oke-vcn"
compartment_id = "ocid1.compartment.oc1..aaaaaaaaqi3if6t4n24qyabx5pjzlw6xovcbgugcmatavjvapyq3jfb4diqq"

# Jupyter Hub deployment values
jupyter_admin_user = "oracle-ai"
jupyter_admin_password = "<admin-password>"
playbooks_repo = "https://github.com/ionut-sturzu/nim_notebooks.git"

# NIM Deployment values
nim_image_repository = "nvcr.io/nim/meta/llama3-8b-instruct"
nim_image_tag = "latest"
NGC_API_KEY = "<ngc_api_key>"
```

- Execute the commands

```bash
terraform init
terraform plan
terraform apply
```

After the deployment succeeds, get the Jupyter URL from the Terraform output and open it in your browser.
Log in with the username/password that you previously set.
Open and run the **NVIDIA_NIM_model_interaction.ipynb** notebook.
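
A minimal sketch for retrieving the URL; the exact output name is defined by this stack's `outputs.tf` and is an assumption here:

```bash
# List all stack outputs, then print the JupyterHub URL (hypothetical output name).
terraform output
terraform output -raw jupyterhub_url
```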

## Known Issues

If `terraform destroy` fails, manually remove the LoadBalancer resource configured for the Nginx Ingress Controller.

After `terraform destroy`, the block volumes corresponding to the PVCs used by the applications in the cluster won't be removed. You have to manually remove them.
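
A sketch for locating the leftovers with the OCI CLI before deleting them in the console; `<compartment-ocid>` is a placeholder:

```bash
# Find any load balancer left behind by the NGINX ingress controller.
oci lb load-balancer list --compartment-id <compartment-ocid> --all --output table

# Find detached block volumes that backed the cluster's PVCs.
oci bv volume list --compartment-id <compartment-ocid> --lifecycle-state AVAILABLE --all --output table
```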
14 changes: 14 additions & 0 deletions cloud-service-providers/oci/oke/terraform/common.tf
@@ -0,0 +1,14 @@
# Copyright (c) 2022, 2024 Oracle Corporation and/or its affiliates.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl

locals {
state_id = coalesce(var.state_id, random_string.state_id.id)
}

resource "random_string" "state_id" {
length = 6
lower = true
numeric = false
special = false
upper = false
}
49 changes: 49 additions & 0 deletions cloud-service-providers/oci/oke/terraform/datasources.tf
@@ -0,0 +1,49 @@
# Copyright (c) 2022, 2024 Oracle Corporation and/or its affiliates.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl

data "oci_identity_tenancy" "tenant_details" {

tenancy_id = var.tenancy_ocid
}

data "oci_identity_regions" "home_region" {

filter {
name = "key"
values = [data.oci_identity_tenancy.tenant_details.home_region_key]
}
}

data "oci_identity_availability_domains" "ads" {

compartment_id = var.tenancy_ocid
}

data "oci_core_shapes" "gpu_shapes" {
for_each = { for entry in data.oci_identity_availability_domains.ads.availability_domains : entry.name => entry.id }

compartment_id = var.compartment_id
availability_domain = each.key

filter {
name = "name"
values = [var.gpu_np_shape]
}
}

data "oci_load_balancer_load_balancers" "lbs" {

compartment_id = coalesce(var.compartment_id, var.compartment_ocid)

filter {
name = "freeform_tags.state_id"
values = [local.state_id]
}

filter {
name = "freeform_tags.application"
values = ["nginx"]
}

depends_on = [module.nginx]
}