Add oci support #84
Open: adinadiana1234 wants to merge 15 commits into NVIDIA:main from adinadiana1234:add-oci-support
Commits (15):
1a9561d oke readme files (adinan-tech)
008b68a . (adinan-tech)
be43d3c . (adinan-tech)
60f7418 . (adinan-tech)
be9169e . (adinan-tech)
414801a . (adinan-tech)
1ccd947 . (adinan-tech)
5d57eb2 . (adinan-tech)
1ccb9d8 cloudinit (adinan-tech)
af97fcc . (adinan-tech)
f315dfc . (adinan-tech)
637c4a2 oke setup (adinan-tech)
9cda673 tf readme (adinan-tech)
1df5b21 , (adinan-tech)
2b2c652 add deploy to oci button (adinan-tech)
@@ -0,0 +1,18 @@
# NIM on Oracle Cloud Infrastructure (OCI) OKE
To deploy NIM on Oracle Cloud Infrastructure (OCI) successfully, it's crucial to choose the correct GPU shapes and ensure that the appropriate NVIDIA drivers are installed.
When you select a GPU shape for a managed node pool or self-managed node in OKE, you must also select a compatible Oracle Linux GPU image that has the CUDA libraries pre-installed. The names of compatible images include 'GPU'. OCI offers Oracle Linux (OEL) images with the NVIDIA GPU drivers pre-installed, which simplifies the deployment process for NIM.
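If you want to check which GPU node images are available before creating the node pool, the OCI CLI can list the image sources that OKE offers. This is a sketch, assuming the OCI CLI and `jq` are installed and configured for your tenancy:

```bash
# List the node image names OKE offers and keep only the GPU variants
oci ce node-pool-options get --node-pool-option-id all \
  --query 'data.sources[]."source-name"' --output json | jq -r '.[]' | grep -i 'gpu'
```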
## Prerequisites

Please follow the [Prerequisite instructions](./prerequisites/README.md) to prepare for OKE creation.
## Create OKE

Please follow the [Create OKE instructions](./setup/README.md) to create the OKE cluster.
## Deploy NIM

Please follow the [Deploy NIM instructions](../../../helm/README.md) to deploy NIM.
@@ -0,0 +1,66 @@
# OKE Prerequisites

This list summarizes the key prerequisites you need to set up before deploying an OKE cluster on OCI.
- **OCI Account and Tenancy**:
  - Ensure you have an OCI account with the necessary permissions.
  - Set up a compartment for your Kubernetes cluster.
- **Networking**:
  - Create a Virtual Cloud Network (VCN) with appropriate subnets.
  - Ensure an internet gateway, NAT gateway, and service gateway are configured.
  - Set up route tables and security lists for network traffic.
- **IAM Policies**:
  - Define IAM policies to allow the OKE service to manage resources in your compartment (see the policy sketch after this list).
  - Grant required permissions to users or groups managing the Kubernetes cluster.
- **Service Limits**:
  - Verify that your tenancy has sufficient service limits for compute instances, block storage, and other required resources.
- **CLI and SDK Tools**:
  - Install and configure the OCI CLI for managing OKE.
  - Optionally, set up the OCI SDKs for automating tasks.
- **Kubernetes Version**:
  - Decide on the Kubernetes version to deploy, ensuring compatibility with your applications and OCI features.
- **API Endpoint**:
  - Choose between the public or private endpoint for the Kubernetes API server, based on your security requirements.
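As a rough illustration of the IAM policies involved, the statements below grant a group the ability to manage OKE clusters and related resources in a compartment. This is a sketch rather than the exhaustive policy set; the group and compartment names are placeholders, and the complete list of required statements is in the Oracle documentation linked below.

```bash
# Create a policy allowing an admin group to manage OKE and related resources
# (group/compartment names are placeholders; review the official policy list first)
oci iam policy create \
  --compartment-id <compartment-ocid> \
  --name oke-admin-policy \
  --description "Allow oke-admins to manage OKE resources" \
  --statements '[
    "Allow group oke-admins to manage cluster-family in compartment <compartment-name>",
    "Allow group oke-admins to manage virtual-network-family in compartment <compartment-name>",
    "Allow group oke-admins to manage instance-family in compartment <compartment-name>"
  ]'
```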
For more details, please refer to the [OKE prerequisites documentation](https://docs.oracle.com/en-us/iaas/Content/ContEng/Concepts/contengprerequisites.htm).
## Install OCI CLI

```
bash -c "$(curl -L https://raw.githubusercontent.com/oracle/oci-cli/master/scripts/install/install.sh)"
```
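After the installer finishes, you typically verify the binary and create a configuration file pointing at your tenancy. A minimal sketch:

```bash
# Confirm the CLI is on your PATH
oci --version

# Interactive setup: writes ~/.oci/config with your user/tenancy OCIDs and an API signing key
oci setup config
```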
For more details, please refer to the [OCI CLI installation documentation](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm).
## Install kubectl

```
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
kubectl version --client
```
For more details, please refer to the [kubectl installation documentation](https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/).
## Install Helm

```
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
```
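As with kubectl, you can confirm the installation afterwards:

```bash
helm version
```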
For more details, please refer to the [Helm installation documentation](https://helm.sh/docs/intro/install/).

## Next step

[Continue to OKE creation](../setup/README.md)
@@ -0,0 +1,85 @@
# Set up OCI Kubernetes Engine (OKE)

The key to creating an Oracle Kubernetes Engine (OKE) cluster for NIM is to create a proper GPU node pool. The following steps will guide you through the process.
## Connect to OCI

1. Log in to your Oracle Cloud Infrastructure (OCI) Console.
2. Select the appropriate compartment where you want to create the OKE cluster.
## Identify the GPU needed for NIM

- Refer to the NIM documentation to identify the NVIDIA GPU you [need](https://docs.nvidia.com/nim/large-language-models/latest/support-matrix.html). Here is also a list of available [OKE NVIDIA GPU node shapes](https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#vm-gpu).
## Confirm GPU availability in your region

Use the OCI CLI to search for GPU shape availability:

```bash
oci compute shape list --region <region-name> --compartment-id <your-compartment-id> --all --query 'data[*].shape' --output json | jq -r '.[]' | grep -i 'gpu'
```
Cross-reference with the [OCI Regions](https://www.oracle.com/cloud/data-regions.html) to select the best region.
## Request Quota

Ensure you have the necessary service limits (quota) for the GPU shapes. If needed, request an increase via the OCI Console:

1. Navigate to **Governance and Administration** > **Limits, Quotas, and Usage**.
2. Select **Request Service Limit Increase** for the relevant GPU shapes.
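Before filing a request, you can also check your current compute limits from the CLI. This is a sketch; limit names vary by GPU shape family, and the tenancy OCID is a placeholder:

```bash
# List compute service limits that mention GPUs
oci limits value list --service-name compute \
  --compartment-id <your-tenancy-ocid> --all --output json \
  | jq -r '.data[] | "\(.name)\t\(.value)"' | grep -i 'gpu'
```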
## Create OKE

To easily create the OKE cluster and the NVIDIA GPU node pool, you can use Quick Create to initially set up the cluster with a default node pool that includes a single, simple VM node. After the cluster is created, you can add a new node pool with GPU shapes that require a larger boot volume size. This way, you can ensure the GPUs have the necessary storage while manually configuring the nodes as needed.
1. In the OCI Console, navigate to **Developer Services** > **Kubernetes Clusters** > **OKE Clusters**.
2. Click **Create Cluster** and select **Quick Create**.
3. Configure the following:
   - **Name**: Provide a name for your cluster.
   - **Compartment**: Select the appropriate compartment.
   - **Kubernetes Version**: Choose the latest stable version.
   - **Kubernetes API endpoint**: Private or public.
   - **Node type**: Managed.
   - **Kubernetes worker nodes**: Private or public.
4. Under **Shape and image**:
   - **Shape**: You can leave the default simple VM.Standard.x shape.
   - **Node Count**: Start with 1 node (adjust as needed).
   - **Add an SSH key** (optional): so that you can access the nodes over SSH.
5. Click **Create Cluster** to start the provisioning process. This provisions a simple cluster, to which you can subsequently add a GPU node pool.
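If you prefer the CLI, you can also watch the provisioning state of the cluster. This is a sketch; the compartment OCID is a placeholder:

```bash
# The cluster is ready once its lifecycle state reports ACTIVE
oci ce cluster list --compartment-id <compartment-ocid> \
  --query 'data[*].{name:name, state:"lifecycle-state"}' --output table
```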
## Create a GPU node pool on an existing OKE cluster

1. For an existing OKE cluster, navigate to the **Node Pools** section.
2. Click **Add Node Pool** and configure:
   - **Name**: Provide a name for the node pool.
   - **Compartment**: Select the appropriate compartment.
   - **Version**: The Kubernetes version of the nodes; defaults to the current cluster version.
   - **Node Placement Configuration**: Select the Availability Domain and worker node subnet.
   - **Node Shape**: Select the desired GPU-enabled shape.
   - **Node Image**: Automatically populated with an OEL GPU image, which you can change to a different version.
   - **Node Count**: Set the number of nodes (adjust according to your needs).
   - **Boot volume**: Specify a larger size than the default 50 GB, for example 300 GB. To complement this change, also apply the custom cloud-init script in the next step.
   - **Show advanced options** > **Initialization script** > **Paste Cloud-init Script** and paste:
   ```
   #!/bin/bash
   # Run the standard OKE node initialization script fetched from the instance metadata service
   curl --fail -H "Authorization: Bearer Oracle" -L0 http://169.254.169.254/opc/v2/instance/metadata/oke_init_script | base64 --decode >/var/run/oke-init.sh
   bash /var/run/oke-init.sh
   # Expand the root filesystem to use the larger boot volume configured above
   /usr/libexec/oci-growfs -y
   ```
3. Click **Create Node Pool**.
## Connect to OKE

1. Install the OCI CLI if you haven't already.
2. Retrieve the OKE cluster credentials using the **Access Cluster** button on the console Cluster details page:

```bash
oci ce cluster create-kubeconfig --cluster-id <cluster OCID> --file $HOME/.kube/config --region <region> --token-version 2.0.0 --kube-endpoint PUBLIC_ENDPOINT
```
3. Verify the connection to your OKE cluster:

```bash
kubectl get nodes
```
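Once the GPU node pool is active, you can also confirm that the GPU resource is exposed to Kubernetes. A minimal sketch (assumes the NVIDIA device plugin shipped with the GPU image is running):

```bash
# Show the allocatable nvidia.com/gpu count per node
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```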
@@ -0,0 +1,133 @@
# orm-stack-oke-helm-deployment-nim

## Getting started

This stack deploys an OKE cluster with two node pools:
- one node pool with flexible shapes
- one node pool with GPU shapes
It also deploys several supporting applications using Helm:
- nginx
- cert-manager
- jupyterhub

The goal is to demonstrate the self-hosted model capabilities of [NVIDIA NIM LLM](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html).

**Note:** For Helm deployments it is necessary to create a bastion and an operator host (with the associated policy allowing the operator to manage the cluster), **or** to configure a cluster with a public API endpoint.
If the bastion and operator hosts are not created, it is a prerequisite to have the following tools already installed and configured:
- bash
- helm
- jq
- kubectl
- oci-cli
[![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://cloud.oracle.com/resourcemanager/stacks/create?zipUrl=https://github.com/ionut-sturzu/nim_on_oke/archive/refs/heads/main.zip)
## Helm Deployments

### Nginx

[Nginx](https://kubernetes.github.io/ingress-nginx/deploy/) is deployed and configured as the default ingress controller.
### Cert-manager

[Cert-manager](https://cert-manager.io/docs/) is deployed to handle the configuration of TLS certificates for the configured ingress resources. Currently it uses the [staging Let's Encrypt endpoint](https://letsencrypt.org/docs/staging-environment/).
### Jupyterhub

[Jupyterhub](https://jupyterhub.readthedocs.io/en/stable/) will be accessible at the address [https://jupyter.a.b.c.d.nip.io](https://jupyter.a.b.c.d.nip.io), where a.b.c.d is the public IP address of the load balancer associated with the NGINX ingress controller.
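To find that public IP after the stack is deployed, you can look up the LoadBalancer Service created for the ingress controller. A minimal sketch (the namespace depends on how the chart was installed):

```bash
# The EXTERNAL-IP column of the ingress controller Service is the a.b.c.d address
kubectl get svc --all-namespaces | grep LoadBalancer
```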
JupyterHub uses a dummy authentication scheme (username/password), and access is secured using the variables:

```
jupyter_admin_user
jupyter_admin_password
```

It also supports automatically cloning a Git repository when a user connects, making it available under the `examples` directory.
### NIM

The LLM is deployed using [NIM](https://docs.nvidia.com/nim/index.html).

Parameters:
- `nim_image_repository` and `nim_image_tag` - used to specify the container image location
- `NGC_API_KEY` - required to authenticate with NGC services
Models with a large context length require GPUs with a lot of memory. For Mistral, with a context length of 32k, deployment on A10 instances fails with the default container settings.

To work around this issue, we can limit the context length using the `--max-model-len` argument of vLLM, the underlying inference engine used by NIM.

For Mistral models, create a `nim_user_values_override.yaml` file with the required override and provide it as input during ORM stack variable configuration.
## How to deploy?

1. Deploy directly to OCI using the button below:

   [![Deploy to Oracle Cloud](https://oci-resourcemanager-plugin.plugins.oci.oraclecloud.com/latest/deploy-to-oracle-cloud.svg)](https://cloud.oracle.com/resourcemanager/stacks/create?zipUrl=https://github.com/ionut-sturzu/nim_on_oke/archive/refs/heads/main.zip)

2. Deploy via ORM:
   - Create a new stack
   - Upload the TF configuration files
   - Configure the variables
   - Apply

3. Local deployment:

   - Create a file called `terraform.auto.tfvars` with the required values.
```
# ORM injected values

region           = "uk-london-1"
tenancy_ocid     = "ocid1.tenancy.oc1..aaaaaaaaiyavtwbz4kyu7g7b6wglllccbflmjx2lzk5nwpbme44mv54xu7dq"
compartment_ocid = "ocid1.compartment.oc1..aaaaaaaaqi3if6t4n24qyabx5pjzlw6xovcbgugcmatavjvapyq3jfb4diqq"

# OKE Terraform module values
create_iam_resources     = false
create_iam_tag_namespace = false
ssh_public_key           = "<ssh_public_key>"

## NodePool with non-GPU shape is created by default with size 1
simple_np_flex_shape = { "instanceShape" = "VM.Standard.E4.Flex", "ocpus" = 2, "memory" = 16 }

## NodePool with GPU shape is created by default with size 0
gpu_np_size  = 1
gpu_np_shape = "VM.GPU.A10.1"

## OKE Deployment values
cluster_name   = "oke"
vcn_name       = "oke-vcn"
compartment_id = "ocid1.compartment.oc1..aaaaaaaaqi3if6t4n24qyabx5pjzlw6xovcbgugcmatavjvapyq3jfb4diqq"

# Jupyter Hub deployment values
jupyter_admin_user     = "oracle-ai"
jupyter_admin_password = "<admin-password>"
playbooks_repo         = "https://github.com/ionut-sturzu/nim_notebooks.git"

# NIM Deployment values
nim_image_repository = "nvcr.io/nim/meta/llama3-8b-instruct"
nim_image_tag        = "latest"
NGC_API_KEY          = "<ngc_api_key>"
```
- Execute the commands:

```
terraform init
terraform plan
terraform apply
```
After the deployment is successful, get the Jupyter URL from the Terraform output and open it in the browser.
Log in with the user/password that you previously set.
Open and run the **NVIDIA_NIM_model_interaction.ipynb** notebook.
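For a local deployment, the Jupyter URL is exposed as a Terraform output. A quick way to inspect it (a sketch; the exact output name depends on the stack's outputs):

```bash
# Print all stack outputs and look for the JupyterHub address
terraform output
```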
## Known Issues

If `terraform destroy` fails, manually remove the LoadBalancer resource configured for the Nginx Ingress Controller.

After `terraform destroy`, the block volumes corresponding to the PVCs used by the applications in the cluster won't be removed. You have to remove them manually.
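A quick way to locate these leftover resources from the CLI before deleting them (a sketch; the compartment OCID is a placeholder):

```bash
# Load balancers left behind by the Nginx ingress Service
oci lb load-balancer list --compartment-id <compartment-ocid> --all \
  --query 'data[*].{name:"display-name", state:"lifecycle-state"}' --output table

# Block volumes left behind by the PVCs
oci bv volume list --compartment-id <compartment-ocid> --all \
  --query 'data[*].{name:"display-name", state:"lifecycle-state"}' --output table
```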
@@ -0,0 +1,14 @@
# Copyright (c) 2022, 2024 Oracle Corporation and/or its affiliates.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl

locals {
  # Use the provided state_id, or fall back to a random 6-character lowercase suffix
  state_id = coalesce(var.state_id, random_string.state_id.id)
}

resource "random_string" "state_id" {
  length  = 6
  lower   = true
  numeric = false
  special = false
  upper   = false
}
@@ -0,0 +1,49 @@
# Copyright (c) 2022, 2024 Oracle Corporation and/or its affiliates.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl

# Tenancy details, used to resolve the home region
data "oci_identity_tenancy" "tenant_details" {
  tenancy_id = var.tenancy_ocid
}

data "oci_identity_regions" "home_region" {
  filter {
    name   = "key"
    values = [data.oci_identity_tenancy.tenant_details.home_region_key]
  }
}

# Availability domains in the tenancy, used to check GPU shape availability
data "oci_identity_availability_domains" "ads" {
  compartment_id = var.tenancy_ocid
}

# Look up the requested GPU shape in each availability domain
data "oci_core_shapes" "gpu_shapes" {
  for_each = { for entry in data.oci_identity_availability_domains.ads.availability_domains : entry.name => entry.id }

  compartment_id      = var.compartment_id
  availability_domain = each.key

  filter {
    name   = "name"
    values = [var.gpu_np_shape]
  }
}

# Load balancer created for the Nginx ingress controller, matched by freeform tags
data "oci_load_balancer_load_balancers" "lbs" {
  compartment_id = coalesce(var.compartment_id, var.compartment_ocid)

  filter {
    name   = "freeform_tags.state_id"
    values = [local.state_id]
  }

  filter {
    name   = "freeform_tags.application"
    values = ["nginx"]
  }

  depends_on = [module.nginx]
}
Can you please add a link to this document on the main readme.md table that links to each CSP managed K8s deployment?