Azure: Add IB capabilities #200
@@ -60,6 +69,7 @@ resource "azurerm_linux_virtual_machine" "instances" {
location = var.location
resource_group_name = local.resource_group_name
network_interface_ids = [azurerm_network_interface.nic[each.key].id]
availability_set_id = contains(each.value["tags"], "node") ? azurerm_availability_set.avset.id : null
This is just an example of how this could be done; perhaps you'd prefer an additional tag?
The availability set is required for IB support (see #128). Even though I used the CentOS-HPC image:
IB did not work out of the box. This image also came with some Azure packages that were flooding the logs with errors until I removed them:
so I am not sure these images are the best option. (I also just saw in https://github.com/Azure/azhpc-images/blob/master/centos/common/hpc-tuning.sh that they even do this themselves.) You can install the Mellanox drivers and get working IB with:
(this mostly comes from https://github.com/Azure/azhpc-images/blob/master/centos/centos-7.x/centos-7.9-hpc/install_mellanoxofed.sh). Looking at Azure/azhpc-images#119, perhaps running just the last line would have been enough to make IB work...
Could probably automate the installation of the drivers with something like:
diff --git a/common/instance_config/puppet.yaml b/common/instance_config/puppet.yaml
index 0ef6741..9a693db 100644
--- a/common/instance_config/puppet.yaml
+++ b/common/instance_config/puppet.yaml
@@ -74,6 +74,13 @@ runcmd:
- "(tar xf aws-efa-installer-latest.tar.gz && cd aws-efa-installer && ./efa_installer.sh --yes --minimal)"
- rm -fr aws-efa-installer aws-efa-installer-latest.tar.gz
%{ endif }
+# Azure IB installation
+%{ if contains(tags, "node") }
+ - "(export VERSION=5.5-1.0.3.2 && wget --retry-connrefused --tries=3 --waitretry=5 http://content.mellanox.com/ofed/MLNX_OFED-${VERSION}/MLNX_OFED_LINUX-${VERSION}-rhel7.9-x86_64.tgz && tar -zxvf MLNX_OFED_LINUX-${VERSION}-rhel7.9-x86_64.tgz )"
+ - "(KERNEL=( $(rpm -q kernel | tail -1 |sed 's/kernel\-//g') ) ./MLNX_OFED_LINUX*/mlnxofedinstall --kernel $KERNEL --kernel-sources /usr/src/kernels/${KERNEL} --add-kernel-support --skip-repo --skip-unsupported-devices-check --without-fw-update && dracut -f && /etc/init.d/openibd force-restart)"
+ - rm -fr MLNX_OFED_LINUX-*
+%{ endif }
+
write_files:
- content: |
(though this doesn't seem to work when I tested it)
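One possible contributor to the failure (a guess, not something confirmed in this thread): in the second `runcmd` line, `KERNEL=(...)` is written as a prefix to the `mlnxofedinstall` command, but the shell expands `$KERNEL` before that prefix assignment takes effect, so `--kernel` would receive an empty value. The sketch below shows the pitfall on sample data and an alternative that assigns the variable first. The helper name `kernel_version_from_rpm` and the sample `rpm -q kernel` output are made up for illustration. Separately, if `puppet.yaml` is rendered as a Terraform template, `${VERSION}` and `${KERNEL}` would probably need to be written as `$${VERSION}` and `$${KERNEL}` to reach the shell unchanged.

```shell
# Sketch only: shows why `VAR=value cmd $VAR` passes an empty argument,
# and one way to set the kernel version before it is used.

# Hypothetical helper (not part of the PR): reads `rpm -q kernel` output on
# stdin and prints the last listed kernel without the "kernel-" prefix.
kernel_version_from_rpm() {
  tail -n 1 | sed 's/^kernel-//'
}

# Pitfall: "$DEMO" is expanded before the prefix assignment applies,
# so printf receives an empty string.
unset DEMO
pitfall_result=$(DEMO=5.5 printf '%s' "$DEMO")

# Alternative: assign first, then use the variable in a later command.
# The two lines below are sample data standing in for `rpm -q kernel`.
sample_rpm_output='kernel-3.10.0-1160.el7.x86_64
kernel-3.10.0-1160.45.1.el7.x86_64'
KERNEL=$(printf '%s\n' "$sample_rpm_output" | kernel_version_from_rpm)

echo "pitfall: '${pitfall_result}'"
echo "kernel:  '${KERNEL}'"
```

On a real node the sample data would be replaced by `rpm -q kernel`, and the installer call would follow the assignment, joined with `&&` rather than prefixed by it.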
Just confirmed that using the HPC images can give functional IB simply by doing:
So it looks like we could probably just install the drivers ourselves on a basic image using the approach in the comment above, but that is a bit of a pain to keep up to date. If we use the distributed HPC images (as a recommendation), this becomes simpler, since we would just need to run
BUT it seems it needs to run after Puppet does its thing. I tried placing it in
Closes #199