Azure: Add IB capabilities #200


ocaisa commented Jan 12, 2022

Closes #199

@@ -60,6 +69,7 @@ resource "azurerm_linux_virtual_machine" "instances" {
location = var.location
resource_group_name = local.resource_group_name
network_interface_ids = [azurerm_network_interface.nic[each.key].id]
availability_set_id = contains(each.value["tags"], "node") ? azurerm_availability_set.avset.id : null

This is just an example of how this could be done; perhaps you'd prefer an additional tag?

ocaisa commented Jan 13, 2022

The availability set is required for IB support (see #128).

Even though I used the CentOS-HPC image:

    publisher = "OpenLogic",
    offer     = "CentOS-HPC",
    sku       = "7_9-gen2"

IB did not work out of the box. This image also came with some Azure packages that flooded the logs with errors until I removed them:

sudo yum erase -y azure-security-2.14.0-64.x86_64 azsec-monitor-0.9.0-64.x86_64 azure-mdsd.x86_64

so I am not sure these images are the best option. (I also just saw in https://github.com/Azure/azhpc-images/blob/master/centos/common/hpc-tuning.sh that they even do this themselves with yum remove -y azsec-monitor.)

You can install the Mellanox drivers and get working IB with:

VERSION="5.5-1.0.3.2"
MLNX_OFED_DOWNLOAD_URL=http://content.mellanox.com/ofed/MLNX_OFED-${VERSION}/MLNX_OFED_LINUX-${VERSION}-rhel7.9-x86_64.tgz
wget --retry-connrefused --tries=3 --waitretry=5 $MLNX_OFED_DOWNLOAD_URL
tar -zxvf MLNX_OFED_LINUX-${VERSION}-rhel7.9-x86_64.tgz 
cd MLNX_OFED_LINUX-${VERSION}-rhel7.9-x86_64/
KERNEL=( $(rpm -q kernel | sed 's/kernel\-//g') )
KERNEL=${KERNEL[-1]}
sudo ./mlnxofedinstall --kernel $KERNEL --kernel-sources /usr/src/kernels/${KERNEL} --add-kernel-support --skip-repo --skip-unsupported-devices-check --without-fw-update
sudo dracut -f
sudo /etc/init.d/openibd force-restart

(This mostly comes from https://github.com/Azure/azhpc-images/blob/master/centos/centos-7.x/centos-7.9-hpc/install_mellanoxofed.sh.) Looking at Azure/azhpc-images#119, running just the last line might have been enough to get IB working...
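The kernel-selection step in the commands above takes the last entry printed by `rpm -q kernel`. A minimal sketch of that parsing as a shell function, reading from stdin so it can be tried without rpm. The name `latest_kernel` is made up for illustration, and the sketch assumes the last listed kernel is the one to build for (`uname -r` would give the running kernel instead, which can differ before a reboot):

```sh
# Hypothetical helper: print the version-release.arch of the last
# kernel package listed on stdin. Input is expected to look like the
# output of `rpm -q kernel`, one "kernel-<version>" entry per line.
latest_kernel() {
    # Strip the leading "kernel-" prefix, ignore any other lines,
    # and keep only the final entry.
    sed -n 's/^kernel-//p' | tail -n 1
}
```

On a real node this would be used as `KERNEL=$(rpm -q kernel | latest_kernel)`, which avoids the bash array and works in plain `sh` as well.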

@ocaisa changed the title from "Azure: Use spot specific values only when using spot instances" to "Azure: Use spot specific values only when using spot instances, add IB capabilities" on Jan 13, 2022
ocaisa commented Jan 13, 2022

We could probably automate the installation of the drivers with something like:

diff --git a/common/instance_config/puppet.yaml b/common/instance_config/puppet.yaml
index 0ef6741..9a693db 100644
--- a/common/instance_config/puppet.yaml
+++ b/common/instance_config/puppet.yaml
@@ -74,6 +74,13 @@ runcmd:
   - "(tar xf aws-efa-installer-latest.tar.gz && cd aws-efa-installer && ./efa_installer.sh --yes --minimal)"
   - rm -fr aws-efa-installer aws-efa-installer-latest.tar.gz
 %{ endif }
+# Azure IB installation
+%{ if contains(tags, "node") }
+  - "(export VERSION=5.5-1.0.3.2 && wget --retry-connrefused --tries=3 --waitretry=5 http://content.mellanox.com/ofed/MLNX_OFED-${VERSION}/MLNX_OFED_LINUX-${VERSION}-rhel7.9-x86_64.tgz && tar -zxvf MLNX_OFED_LINUX-${VERSION}-rhel7.9-x86_64.tgz )" 
+  - "(KERNEL=( $(rpm -q kernel | tail -1 |sed 's/kernel\-//g') ) ./MLNX_OFED_LINUX*/mlnxofedinstall --kernel $KERNEL --kernel-sources /usr/src/kernels/${KERNEL} --add-kernel-support --skip-repo --skip-unsupported-devices-check --without-fw-update && dracut -f && /etc/init.d/openibd force-restart)"
+  - rm -fr MLNX_OFED_LINUX-*
+%{ endif }
+
 
 write_files:
   - content: |

(though this didn't work when I tested it)
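One possible reason, not verified on an actual deployment: in the second `runcmd` entry `KERNEL=...` is written as a prefix to the installer command, and the shell expands `$KERNEL` in the command's arguments before a prefix assignment takes effect, so the installer would receive empty values. Separately, if puppet.yaml is rendered through Terraform's `templatefile` (the `%{ if }` directives suggest it is), `${VERSION}` and `${KERNEL}` would be read as template interpolations and would probably need to be written `$${VERSION}` and `$${KERNEL}`. A small sketch of the first point, using a made-up variable name `DEMO_KERNEL`:

```sh
# Prefix form: "$DEMO_KERNEL" in the arguments is expanded before the
# prefix assignment is applied, so the command sees an empty string.
prefix_form() {
    unset DEMO_KERNEL
    DEMO_KERNEL=3.10.0 echo "kernel=[$DEMO_KERNEL]"
}

# Sequential form: the assignment is its own statement, so the value
# is already set when the next command's arguments are expanded.
sequential_form() {
    unset DEMO_KERNEL
    DEMO_KERNEL=3.10.0
    echo "kernel=[$DEMO_KERNEL]"
}

prefix_form      # prints kernel=[]
sequential_form  # prints kernel=[3.10.0]
```

Under that assumption, splitting the entry into `KERNEL=$(...) && ./MLNX_OFED_LINUX*/mlnxofedinstall --kernel "$KERNEL" ...` would be the first thing to try.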

ocaisa commented Jan 13, 2022

Just confirmed that the HPC images give functional IB simply by running:

[centos@node2 ~]$ ibstat
[centos@node2 ~]$ sudo /etc/init.d/openibd force-restart
Unloading HCA driver:                                      [  OK  ]
Loading HCA driver and Access Layer:                       [  OK  ]
[centos@node2 ~]$ ibstat
CA 'mlx5_0'
	CA type: MT4120
	Number of ports: 1
	Firmware version: 16.28.4000
	Hardware version: 0
	Node GUID: 0x00155dfffe3400e6
	System image GUID: 0x98039b0300c9ce34
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 100
		Base lid: 1172
		LMC: 0
		SM lid: 1
		Capability mask: 0x2651ec48
		Port GUID: 0x00155dfffd3400e6
		Link layer: InfiniBand
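The before/after check shown in this session could be scripted so that the restart only happens when no port is active. A minimal sketch, assuming the `ibstat` output format shown here; the function name `ib_port_active` is made up for illustration:

```sh
# Hypothetical helper: succeed if the ibstat-style text on stdin
# reports at least one port whose logical state is "Active".
# The pattern is anchored so that "Physical state: ..." lines
# are not matched by mistake.
ib_port_active() {
    grep -q '^[[:space:]]*State:[[:space:]]*Active[[:space:]]*$'
}
```

Usage on a node would be something like `ibstat | ib_port_active || sudo /etc/init.d/openibd force-restart`. Empty `ibstat` output, as in the first call in the session, counts as "not active" and triggers the restart.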

ocaisa commented Jan 14, 2022

So, it looks like we could probably just install the drivers ourselves on a basic image using the approach in the comment above, but that is a bit of a pain to keep up to date. If we use the distributed HPC images (as a recommendation), this becomes simpler, since we would just need to run

sudo /etc/init.d/openibd force-restart

BUT it seems this needs to run after Puppet does its thing: I tried placing it in common/instance_config/puppet.yaml and that didn't work.

@cmd-ntrf self-assigned this on Jan 14, 2022
@ocaisa changed the title from "Azure: Use spot specific values only when using spot instances, add IB capabilities" to "Azure: Add IB capabilities" on Feb 17, 2022

Successfully merging this pull request may close these issues.

Azure: some properties of VM should only be set when using spot instances