Add OCI support #84

Open · wants to merge 15 commits into base: main
18 changes: 18 additions & 0 deletions cloud-service-providers/oci/oke/README.md
@@ -0,0 +1,18 @@
# NIM on Oracle Cloud Infrastructure (OCI) OKE
> **Collaborator:** Can you please add a link to this document on the main readme.md table that links to each CSP managed K8s deployment?


To deploy NIM on Oracle Cloud Infrastructure (OCI) successfully, it’s crucial to choose the correct GPU shapes and ensure that the appropriate NVIDIA drivers are installed.

When you select a GPU shape for a managed node pool or self-managed node in OKE, you must also select a compatible Oracle Linux GPU image that has the CUDA libraries pre-installed. The names of compatible images include 'GPU'. Because these Oracle Linux (OEL) images ship with the NVIDIA GPU drivers pre-installed, they simplify the deployment process for NIM.
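To see which compatible images are available, you can list Oracle Linux images whose names include 'GPU' with the OCI CLI. This is a sketch; `<compartment-ocid>` is a placeholder for your own compartment:

```bash
# List Oracle Linux images with 'GPU' in the name (CUDA libraries pre-installed).
oci compute image list \
  --compartment-id <compartment-ocid> \
  --operating-system "Oracle Linux" \
  --all \
  --query 'data[?contains("display-name", `GPU`)].{name: "display-name"}' \
  --output table
```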


## Prerequisites

Please follow the [prerequisite instructions](./prerequisites/README.md) to prepare for OKE cluster creation.

## Create OKE

Please follow the [Create OKE instructions](./setup/README.md) to create the OKE cluster.

## Deploy NIM

Please follow the [Deploy NIM instructions](../../../helm/README.md) to deploy NIM.
66 changes: 66 additions & 0 deletions cloud-service-providers/oci/oke/prerequisites/README.md
@@ -0,0 +1,66 @@
# OKE Prerequisites

This list summarizes the key prerequisites you need to set up before deploying an OKE cluster on OCI.

- **OCI Account and Tenancy**:
- Ensure you have an OCI account with the necessary permissions.
- Set up a compartment for your Kubernetes cluster.

- **Networking**:
- Create a Virtual Cloud Network (VCN) with appropriate subnets.
- Ensure internet gateway, NAT gateway, and service gateway are configured.
- Set up route tables and security lists for network traffic.

- **IAM Policies**:
  - Define IAM policies to allow the OKE service to manage resources in your compartment (see the example statements after this list).
- Grant required permissions to users or groups managing the Kubernetes cluster.

- **Service Limits**:
- Verify that your tenancy has sufficient service limits for compute instances, block storage, and other required resources.

- **CLI and SDK Tools**:
- Install and configure the OCI CLI for managing OKE.
- Optionally, set up OCI SDKs for automating tasks.

- **Kubernetes Version**:
- Decide on the Kubernetes version to deploy, ensuring compatibility with your applications and OCI features.

- **API Endpoint**:
- Choose between the public or private endpoint for the Kubernetes API server, based on your security requirements.
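
As a sketch, the IAM policy statements for a group administering OKE might look like the following; the group name `oke-admins` and `<compartment-name>` are placeholders to adapt to your tenancy:

```
Allow group oke-admins to manage cluster-family in compartment <compartment-name>
Allow group oke-admins to manage virtual-network-family in compartment <compartment-name>
Allow group oke-admins to manage instance-family in compartment <compartment-name>
```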

For more details, see the [OKE prerequisites documentation](https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengprerequisites.htm).


## Install OCI CLI

```bash
bash -c "$(curl -L https://raw.githubusercontent.com/oracle/oci-cli/master/scripts/install/install.sh)"
```

For more details, see the [OCI CLI installation guide](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm).
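
After installation, configure the CLI and verify that authentication works. `oci setup config` walks you through creating the config file; the region listing is just a quick sanity check:

```bash
# Interactive setup: prompts for your user/tenancy OCIDs, region, and an API key pair.
oci setup config

# Sanity check -- any authenticated call works; listing regions is a cheap one.
oci iam region list --output table
```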

## Install kubectl

```bash
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
echo "$(cat kubectl.sha256) kubectl" | sha256sum --check
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
kubectl version --client
```

For more details, see the [kubectl installation guide](https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/).

## Install Helm

```bash
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
```

For more details, see the [Helm installation guide](https://helm.sh/docs/intro/install/).

## Next step

[Continue to OKE creation](../setup/README.md)
85 changes: 85 additions & 0 deletions cloud-service-providers/oci/oke/setup/README.md
@@ -0,0 +1,85 @@
# Setup OCI Kubernetes Engine (OKE)

The key to creating an Oracle Kubernetes Engine (OKE) cluster for NIM is setting up a proper GPU node pool. The following steps will guide you through the process.

## Connect to OCI

1. Log in to your Oracle Cloud Infrastructure (OCI) Console.
2. Select the appropriate compartment where you want to create the OKE cluster.

## Identify GPU needed for NIM

- Refer to the [NIM support matrix](https://docs.nvidia.com/nim/large-language-models/latest/support-matrix.html) to identify the NVIDIA GPU you need. See also the list of available [OKE NVIDIA GPU node shapes](https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#vm-gpu).


## Confirm GPU availability in your region

Use the OCI CLI to search for GPU availability:

```bash
oci compute shape list --region <region-name> --compartment-id <your-compartment-id> --all --query 'data[*].shape' --output json | jq -r '.[]' | grep -i 'gpu'
```

Cross-reference with the [OCI Regions](https://www.oracle.com/cloud/data-regions.html) to select the best region.

## Request Quota

Ensure you have the necessary service limits (quota) for the GPU shapes. If needed, request an increase via the OCI Console:

1. Navigate to **Governance and Administration** > **Limits, Quotas, and Usage**.
2. Select **Request Service Limit Increase** for the relevant GPU shapes.
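
Before opening a request, you can also inspect current limits from the CLI. A sketch assuming the `compute` service scope; exact limit names vary by GPU shape:

```bash
# List compute service limits and grep for GPU-related entries.
oci limits value list \
  --compartment-id <tenancy-ocid> \
  --service-name compute \
  --all --output json | jq -r '.data[] | "\(.name) \(.value)"' | grep -i gpu
```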

## Create OKE

To easily create the OKE cluster and NVIDIA GPU node pool, use Quick Create to initially set up the cluster with a default node pool that contains a single, simple VM node. After the cluster is created, add a new node pool with GPU shapes, which require a larger boot volume size. This way the GPU nodes get the storage they need while you configure them manually as required.

1. In the OCI Console, navigate to **Developer Services** > **Kubernetes Clusters** > **OKE Clusters**.
2. Click **Create Cluster** and select **Quick Create**.
3. Configure the following:
- **Name**: Provide a name for your cluster.
- **Compartment**: Select the appropriate compartment.
- **Kubernetes Version**: Choose the latest stable version.
- **Kubernetes API endpoint**: Private or public.
- **Node type**: Managed.
- **Kubernetes worker nodes**: Private or public.
4. Under **Shape and image**:
- **Shape**: You can leave the default simple VM.Standard.x.
- **Node Count**: Start with 1 node (adjust as needed).
   - **Add an SSH key** (optional): provides SSH access to the nodes.
5. Click **Create Cluster** to start the provisioning process. This will provision a simple cluster, to which you can subsequently add a GPU nodepool.

## Create GPU nodepool on existing OKE cluster

1. For an existing OKE cluster, navigate to the **Node Pools** section.
2. Click **Add Node Pool** and configure:
- **Name**: Provide a name for the node pool.
- **Compartment**: Select the appropriate compartment.
   - **Version**: The Kubernetes version of the nodes; defaults to the current cluster version.
   - **Node Placement Configuration**: Select the availability domain and worker node subnet.
   - **Node Shape**: Select the desired GPU-enabled shape.
   - **Node Image**: Automatically populated with an OEL GPU image; you can change it to a different version.
   - **Node Count**: Set the number of nodes (adjust according to your needs).
   - **Boot volume**: Specify a size larger than the default 50 GB, for example 300 GB. To make use of the larger volume, also apply the custom cloud-init script in the next step.
- **Show advanced options** -> **Initialization script** -> **Paste Cloud-init Script** and paste:
   ```bash
   #!/bin/bash
   # Run the standard OKE node bootstrap from the instance metadata service,
   # then grow the root filesystem to use the full (enlarged) boot volume.
   curl --fail -H "Authorization: Bearer Oracle" -L0 http://169.254.169.254/opc/v2/instance/metadata/oke_init_script | base64 --decode >/var/run/oke-init.sh
   bash /var/run/oke-init.sh
   /usr/libexec/oci-growfs -y
   ```
3. Click **Create Node Pool**.

## Connect to OKE

1. Install the OCI CLI if you haven't already.
2. Retrieve the OKE cluster credentials using the **Access Cluster** button on the Cluster details page in the console:

```bash
oci ce cluster create-kubeconfig --cluster-id <cluster OCID> --file $HOME/.kube/config --region <region> --token-version 2.0.0 --kube-endpoint PUBLIC_ENDPOINT
```

3. Verify the connection to your OKE cluster:

```bash
kubectl get nodes
```
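
Once the GPU node pool is active, you can also confirm that the nodes advertise GPUs as an allocatable resource:

```bash
# Each GPU node should report a non-empty nvidia.com/gpu count.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```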
133 changes: 133 additions & 0 deletions cloud-service-providers/oci/oke/terraform/README.md
@@ -0,0 +1,133 @@
# orm-stack-oke-helm-deployment-nim

## Getting started

This stack deploys an OKE cluster with two nodepools:
- one nodepool with flexible shapes
- one nodepool with GPU shapes

And several supporting applications using helm:
- nginx
- cert-manager
- jupyterhub

The goal is to demonstrate the self-hosted model capabilities of [NVIDIA NIM LLM](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html).

**Note:** For Helm deployments, it is necessary to create a bastion and an operator host (with the associated policy for the operator to manage the cluster), **or** to configure a cluster with a public API endpoint.

If the bastion and operator hosts are not created, it is a prerequisite to have the following tools already installed and configured:
- bash
- helm
- jq
- kubectl
- oci-cli

[![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://cloud.oracle.com/resourcemanager/stacks/create?zipUrl=https://github.com/ionut-sturzu/nim_on_oke/archive/refs/heads/main.zip)


## Helm Deployments

### Nginx

[Nginx](https://kubernetes.github.io/ingress-nginx/deploy/) is deployed and configured as the default ingress controller.

### Cert-manager

[Cert-manager](https://cert-manager.io/docs/) is deployed to handle TLS certificate configuration for the configured ingress resources. Currently it uses the [staging Let's Encrypt endpoint](https://letsencrypt.org/docs/staging-environment/).

### Jupyterhub

[Jupyterhub](https://jupyterhub.readthedocs.io/en/stable/) will be accessible at [https://jupyter.a.b.c.d.nip.io](https://jupyter.a.b.c.d.nip.io), where a.b.c.d is the public IP address of the load balancer associated with the NGINX ingress controller.

JupyterHub uses a dummy authentication scheme (username/password), and access is secured using the variables:

```
jupyter_admin_user
jupyter_admin_password
```

It also supports automatically cloning a Git repo when a user connects, making it available under the `examples` directory.

### NIM

The LLM is deployed using [NIM](https://docs.nvidia.com/nim/index.html).

Parameters:
- `nim_image_repository` and `nim_image_tag` - used to specify the container image location
- `NGC_API_KEY` - required to authenticate with NGC services

Models with a large context length require GPUs with a lot of memory. For Mistral, with a context length of 32k, the deployment fails on A10 instances with the default container settings.

To work around this issue, limit the context length using the `--max-model-len` argument for vLLM, the underlying inference engine used by NIM.

For Mistral models, create a `nim_user_values_override.yaml` file with the content below and provide it as input during ORM stack variable configuration.
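
The exact override keys depend on the NIM Helm chart version. A minimal sketch, assuming the chart forwards entries of an `env` list to the NIM container:

```yaml
# nim_user_values_override.yaml -- hypothetical sketch; verify the key names
# against your NIM Helm chart's values schema before using it.
env:
  - name: VLLM_ARGS                  # assumed variable name, not confirmed here
    value: "--max-model-len 16384"   # cap the context window so it fits A10 memory
```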

## How to deploy?

1. Deploy directly to OCI using the below button:

[![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://cloud.oracle.com/resourcemanager/stacks/create?zipUrl=https://github.com/ionut-sturzu/nim_on_oke/archive/refs/heads/main.zip)


2. Deploy via ORM
- Create a new stack
- Upload the TF configuration files
- Configure the variables
- Apply

3. Local deployment

- Create a file called `terraform.auto.tfvars` with the required values.

```hcl
# ORM injected values

region = "uk-london-1"
tenancy_ocid = "ocid1.tenancy.oc1..aaaaaaaaiyavtwbz4kyu7g7b6wglllccbflmjx2lzk5nwpbme44mv54xu7dq"
compartment_ocid = "ocid1.compartment.oc1..aaaaaaaaqi3if6t4n24qyabx5pjzlw6xovcbgugcmatavjvapyq3jfb4diqq"

# OKE Terraform module values
create_iam_resources = false
create_iam_tag_namespace = false
ssh_public_key = "<ssh_public_key>"

## NodePool with non-GPU shape is created by default with size 1
simple_np_flex_shape = { "instanceShape" = "VM.Standard.E4.Flex", "ocpus" = 2, "memory" = 16 }

## NodePool with GPU shape is created by default with size 0
gpu_np_size = 1
gpu_np_shape = "VM.GPU.A10.1"

## OKE Deployment values
cluster_name = "oke"
vcn_name = "oke-vcn"
compartment_id = "ocid1.compartment.oc1..aaaaaaaaqi3if6t4n24qyabx5pjzlw6xovcbgugcmatavjvapyq3jfb4diqq"

# Jupyter Hub deployment values
jupyter_admin_user = "oracle-ai"
jupyter_admin_password = "<admin-password>"
playbooks_repo = "https://github.com/ionut-sturzu/nim_notebooks.git"

# NIM Deployment values
nim_image_repository = "nvcr.io/nim/meta/llama3-8b-instruct"
nim_image_tag = "latest"
NGC_API_KEY = "<ngc_api_key>"
```

- Execute the commands

```bash
terraform init
terraform plan
terraform apply
```

After the deployment succeeds, get the Jupyter URL from the Terraform output and open it in your browser.
Log in with the username/password that you previously set.
Open and run the **NVIDIA_NIM_model_interaction.ipynb** notebook.
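
A minimal sketch for retrieving the URL; the exact output name is defined by this stack's `outputs.tf` and is an assumption here:

```bash
# List all stack outputs, then print the JupyterHub URL (hypothetical output name).
terraform output
terraform output -raw jupyterhub_url
```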

## Known Issues

If `terraform destroy` fails, manually remove the LoadBalancer resource configured for the Nginx Ingress Controller.

After `terraform destroy`, the block volumes corresponding to the PVCs used by the applications in the cluster won't be removed. You have to manually remove them.
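
A sketch for locating the leftovers with the OCI CLI before deleting them in the console; `<compartment-ocid>` is a placeholder:

```bash
# Find any load balancer left behind by the NGINX ingress controller.
oci lb load-balancer list --compartment-id <compartment-ocid> --all --output table

# Find detached block volumes that backed the cluster's PVCs.
oci bv volume list --compartment-id <compartment-ocid> --lifecycle-state AVAILABLE --all --output table
```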
14 changes: 14 additions & 0 deletions cloud-service-providers/oci/oke/terraform/common.tf
@@ -0,0 +1,14 @@
# Copyright (c) 2022, 2024 Oracle Corporation and/or its affiliates.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl

locals {
state_id = coalesce(var.state_id, random_string.state_id.id)
}

resource "random_string" "state_id" {
length = 6
lower = true
numeric = false
special = false
upper = false
}
49 changes: 49 additions & 0 deletions cloud-service-providers/oci/oke/terraform/datasources.tf
@@ -0,0 +1,49 @@
# Copyright (c) 2022, 2024 Oracle Corporation and/or its affiliates.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl

data "oci_identity_tenancy" "tenant_details" {

tenancy_id = var.tenancy_ocid
}

data "oci_identity_regions" "home_region" {

filter {
name = "key"
values = [data.oci_identity_tenancy.tenant_details.home_region_key]
}
}

data "oci_identity_availability_domains" "ads" {

compartment_id = var.tenancy_ocid
}

data "oci_core_shapes" "gpu_shapes" {
for_each = { for entry in data.oci_identity_availability_domains.ads.availability_domains : entry.name => entry.id }

compartment_id = var.compartment_id
availability_domain = each.key

filter {
name = "name"
values = [var.gpu_np_shape]
}
}

data "oci_load_balancer_load_balancers" "lbs" {

compartment_id = coalesce(var.compartment_id, var.compartment_ocid)

filter {
name = "freeform_tags.state_id"
values = [local.state_id]
}

filter {
name = "freeform_tags.application"
values = ["nginx"]
}

depends_on = [module.nginx]
}