Gpupt Roll

This roll enables GPU pass-through to a guest VM. The roll installs rocks commands for GPU management on a frontend and on vm-container nodes.

  1. This roll assumes that the GPU cards are NVIDIA cards and that each card exposes a video function and an audio function, designated as function 0 and function 1 respectively on the PCI bus. For example, to list the NVIDIA devices:

    [root@gpu-1-6]# lspci -D -d 10de:
    0000:02:00.0 3D controller: NVIDIA Corporation GF100GL [Tesla T20 Processor] (rev a3)
    0000:02:00.1 Audio device: NVIDIA Corporation GF100 High Definition Audio Controller (rev a1)
    
  2. This roll assumes that the GPU cards on each physical host are given logical names with the prefix gpupci. The names are unique on a single host, and the sequence starts with gpupci1. The names are not unique across different hosts.

  3. A single GPU card will be assigned to a VM host. The logical name for a GPU card on any VM is always gpupci, without an index. See the example below for usage.

  4. The CUDA roll is installed on the vm-containers and on the guest VMs.
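
The first assumption can be checked mechanically. The sketch below is only an illustration: gpu_functions is a hypothetical helper name, not something the roll installs. It reads `lspci -D -d 10de:` output on standard input and prints the addresses that end in function 0, that is, the GPU functions rather than the audio functions.

```shell
# Hypothetical helper, not part of the gpupt roll.
# Reads `lspci -D -d 10de:` output on stdin and prints the PCI address of
# every function 0 (the GPU itself), skipping function 1 (the audio device).
gpu_functions() {
    awk '$1 ~ /\.0$/ { print $1 }'
}

# Example using the sample output shown above.
printf '%s\n' \
  '0000:02:00.0 3D controller: NVIDIA Corporation GF100GL [Tesla T20 Processor] (rev a3)' \
  '0000:02:00.1 Audio device: NVIDIA Corporation GF100 High Definition Audio Controller (rev a1)' \
  | gpu_functions
# prints: 0000:02:00.0
```

On a live host the same function could be fed directly, as in `lspci -D -d 10de: | gpu_functions`.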

To build the roll, execute:

# make roll

A successful build will create a gpupt-*.x86_64*.iso file.

To add this roll to an existing cluster, execute these instructions on a Rocks frontend:

# rocks add roll gpupt-*.x86_64.disk1.iso
# rocks enable roll gpupt
# (cd /export/rocks/install; rocks create distro)
# rocks run roll gpupt > add-roll.sh
# bash add-roll.sh

On the vm-container nodes (only GPU-enabled):

# yum clean all
# yum install rocks-command-gpupt
  1. The following commands are enabled with the gpupt roll:

    rocks add host gpu ...
    rocks dump host gpu ...
    rocks list host gpu ...
    rocks remove host gpu ...
    rocks report host gpu ...
    rocks set host gpu ...
    
  2. A plugin, plugin_device.py, that manages PCI addressing for GPU pass-through in guest VMs. It is used by the rocks command rocks report host vm config.

  3. A command, gpupci, that manages PCI addressing of GPU cards (list, detach, attach). This command is executed on the GPU-enabled hosts to get information (list) or to make a GPU card available or unavailable on the physical host's PCI bus. For more information, run gpupci -h.

The Intel VT-d extensions provide hardware support for assigning a physical device to a guest VM. Enabling the extensions has two parts (assuming that the hardware supports them). The changes are made on the physical host that has the GPU cards and will be hosting VMs.

  1. Enable VT-d extensions in the BIOS. Verify that your processor supports VT-d extensions. The extensions differ among manufacturers. Consult the BIOS settings.

  2. Activate VT-d in the kernel. Append the following flags to the end of the kernel line in boot.grub:

    intel_iommu=on iommu=pt pci=realloc rdblacklist=nvidia
    

    The last flag prevents the nvidia driver from loading.

  3. Uninstall the nvidia driver. This step is important: otherwise, when VMs are booted later, the following errors may appear in /var/log/libvirt/qemu/VMNAME.log and the VM will not boot:

    Failed to assign device "hostdev0" : Device or resource busy
    2017-08-31T22:17:28.117713Z qemu-kvm: -device pci-assign,host=02:00.0,id=hostdev0, ...  Device 'pci-assign' could not be initialized
    

    To uninstall the driver:

    /opt/cuda/driver/uninstall-driver
    more /var/log/nvidia-uninstall.log
    

Reboot.

When the host is rebooted, check if the changes are enabled:

# cat /proc/cmdline
ro root=UUID=575b0aac-0b20-4024-8a2d-26f8d3cc460b rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet intel_iommu=on iommu=pt pci=realloc  rdblacklist=nvidia

The output should contain the added flags.
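
The flag check can also be scripted. The following is a minimal sketch, assuming the four flags listed earlier are the complete set to verify; missing_iommu_flags is a hypothetical helper name, not a command provided by the roll. It takes a kernel command line as its argument, prints any flag that is absent, and returns a non-zero status if something is missing.

```shell
# Hypothetical helper, not part of the gpupt roll.
# Prints each required flag that is missing from the kernel command line
# given as $1. Returns 0 when all flags are present, 1 otherwise.
missing_iommu_flags() {
    cmdline=$1
    rc=0
    for flag in intel_iommu=on iommu=pt pci=realloc rdblacklist=nvidia; do
        # Pad with spaces so that only whole, space-separated words match.
        case " $cmdline " in
            *" $flag "*) ;;
            *) echo "$flag"; rc=1 ;;
        esac
    done
    return $rc
}

# On a live host: missing_iommu_flags "$(cat /proc/cmdline)"
```

Matching whole words keeps iommu=pt from being confused with intel_iommu=on.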

The following two commands should show PCI-DMA and IOMMU:

# dmesg | grep -i PCI-DMA
PCI-DMA: Intel(R) Virtualization Technology for Directed I/O
# grep -i IOMMU /var/log/messages
Aug 28 15:06:23 gpu-1-6 kernel: Command line: ro root=UUID=575b0aac-0b20-4024-8a2d-26f8d3cc460b rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet intel_iommu=on iommu=pt pci=realloc  rdblacklist=nvidia
Aug 28 15:06:23 gpu-1-6 kernel: Kernel command line: ro root=UUID=575b0aac-0b20-4024-8a2d-26f8d3cc460b rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet intel_iommu=on iommu=pt pci=realloc  rdblacklist=nvidia
Aug 28 15:06:23 gpu-1-6 kernel: Intel-IOMMU: enabled
Aug 28 15:06:23 gpu-1-6 kernel: dmar: IOMMU 0: reg_base_addr fbffe000 ver 1:0 cap c90780106f0462 ecap f020fe
Aug 28 15:06:23 gpu-1-6 kernel: IOMMU 0xfbffe000: using Queued invalidation
Aug 28 15:06:23 gpu-1-6 kernel: IOMMU: hardware identity mapping for device 0000:00:00.0
...
Aug 31 10:57:53 gpu-1-6 kernel: IOMMU: hardware identity mapping for device 0000:04:00.1
Aug 31 10:57:53 gpu-1-6 kernel: IOMMU: Setting RMRR:
Aug 31 10:57:53 gpu-1-6 kernel: IOMMU: Prepare 0-16MiB unity mapping for LPC

Check that the nvidia driver is not loaded:

lsmod | grep nvidia

The command should return nothing.

The commands to detach GPU cards from physical hosts are run once for each GPU card on each host. The list below includes some informational commands.

  1. Run the gpupci -l command on all GPU-enabled vm-containers to get information about the GPU cards. For example, on vm-container-0-15 the output is:

    # gpupci -l
    gpupci1 pci_0000_02_00_0
    gpupci2 pci_0000_03_00_0
    

    The output means there are 2 GPU cards, each with a logical GPU name and its PCI bus info.

  2. Run commands to add this information to the rocks database:

    # rocks add host gpu vm-container-0-15 gpupci1 pci_0000_02_00_0
    # rocks add host gpu vm-container-0-15 gpupci2 pci_0000_03_00_0
    
  3. Verify that the GPU info is now in the database:

    # rocks list host gpu
    HOST               GPU     PCI_BUS
    vm-container-0-15: gpupci1 pci_0000_02_00_0
    vm-container-0-15: gpupci2 pci_0000_03_00_0
    
  4. Detach the GPU cards from the physical host. This command actually detaches the GPU from the physical host's PCI bus. It needs to be run once for each GPU card before any VM can use the GPU in PCI pass-through mode. It can be run as a single command for all cards:

    # rocks run host vm-container-0-15 "gpupci -d all"
    

    or using a specific logical name for a single GPU card on a given host:

    # rocks run host vm-container-0-2 "gpupci -d gpupci1"
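
The registration step in the list above can be generated from the gpupci -l output instead of typed by hand. This is a sketch only, assuming the output has exactly the two-column format shown in the example; gpu_add_commands is a hypothetical helper name, not part of the roll. It prints the commands rather than running them, so they can be reviewed first.

```shell
# Hypothetical helper, not part of the gpupt roll.
# Reads `gpupci -l` output (logical name, PCI bus name) on stdin and prints
# the matching `rocks add host gpu` command for the host given as $1.
gpu_add_commands() {
    awk -v host="$1" 'NF == 2 { print "rocks add host gpu", host, $1, $2 }'
}

# Example using the sample output shown for vm-container-0-15.
printf '%s\n' 'gpupci1 pci_0000_02_00_0' 'gpupci2 pci_0000_03_00_0' \
  | gpu_add_commands vm-container-0-15
# prints:
# rocks add host gpu vm-container-0-15 gpupci1 pci_0000_02_00_0
# rocks add host gpu vm-container-0-15 gpupci2 pci_0000_03_00_0
```

The printed lines are meant to be run on the frontend once they have been checked against the actual gpupci -l output of the host.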
    

Once a GPU card is detached from a physical host, it is ready for use by a guest VM. We assume that a single GPU card is assigned to a VM and that the VM runs on a GPU-enabled vm-container. For example, if there is a VM rocks-33 that is created and running on vm-container-0-15 and we want to assign a GPU to it:

rocks stop host vm rocks-33
rocks add host gpu rocks-33 gpupci pci_0000_02_00_0
rocks report host vm config rocks-33

The first command stops the VM, and the add command adds a GPU attribute to the VM in the rocks database. The report command verifies that the XML file that describes the VM configuration has device information for the GPU card. For this example, the output would contain:

...
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </source>
  </hostdev>
</devices>

At the next start of the VM the GPU card will be available to the VM.
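
The address element in the XML above follows directly from the PCI bus name given to rocks add host gpu. The sketch below illustrates that mapping. It is not the code in plugin_device.py, and pci_name_to_address is a hypothetical name used only for this example.

```shell
# Hypothetical helper, not part of the gpupt roll or of plugin_device.py.
# Splits a name such as pci_0000_02_00_0 into its domain, bus, slot, and
# function fields and prints them as a libvirt <address/> element.
pci_name_to_address() {
    old_ifs=$IFS
    IFS=_
    set -- $1          # now: $1=pci $2=domain $3=bus $4=slot $5=function
    IFS=$old_ifs
    printf "<address domain='0x%s' bus='0x%s' slot='0x%s' function='0x%s'/>\n" \
        "$2" "$3" "$4" "$5"
}

pci_name_to_address pci_0000_02_00_0
# prints: <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
```

Comparing this line with the output of rocks report host vm config is a quick way to confirm that the VM points at the intended card.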

  1. PCI bus address

    On the VM, the GPU PCI bus address will be different from the GPU PCI address on the physical host. For example, a GPU card on a physical host

    [root@gpu-1-6]# lspci -D -d 10de:
    0000:02:00.0 3D controller: NVIDIA Corporation GF100GL [Tesla T20 Processor] (rev a3)
    

    shows on a VM as

    [root@rocce-vm3 ~]# lspci -d 10de:
    00:06.0 3D controller: NVIDIA Corporation GF100GL [Tesla T20 Processor] (rev a3)
    
  2. check that the nvidia driver is loaded

    # lsmod | grep nvidia
    nvidia_uvm             63294  0
    nvidia               8368623  1 nvidia_uvm
    i2c_core               29964  2 nvidia,i2c_piix4
    
  3. check if the GPU card is present

    # nvidia-smi
    Thu Aug 31 17:37:32 2017
    +------------------------------------------------------+
    | NVIDIA-SMI 346.59     Driver Version: 346.59         |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla M2050         On   | 0000:00:06.0     Off |                    0 |
    | N/A   N/A    P1    N/A /  N/A |      6MiB /  2687MiB |      0%   E. Process |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    
  4. run a few commands from the NVIDIA toolkit to get more info about the GPU card

    nvidia-smi -q
    /opt/cuda/bin/deviceQuery
    /opt/cuda/bin/deviceQueryDrv
    

The first set of commands can be run on physical and virtual hosts; the rest are run on a physical host.

  1. listing of pci devices

    lspci -D -n
    lspci -D -n -d 10de:
    lspci -D -nn -d 10de:
    lspci -vvv -s 0000:03:00.0
    

    For example, the output below shows info for 2 GPU cards, for video and audio components

    # lspci -D -n -d 10de:
    0000:02:00.0 0302: 10de:06de (rev a3)
    0000:02:00.1 0403: 10de:0be5 (rev a1)
    0000:03:00.0 0302: 10de:06de (rev a3)
    0000:03:00.1 0403: 10de:0be5 (rev a1)
    

    The address of the video component ends in 0 and the address of the audio component ends in 1.

  2. virsh info for the devices as a tree

    virsh nodedev-list --tree
    

    Note that the 4 devices from the lspci output above appear in the output of this command as

    +- pci_0000_00_03_0            (comment: parent pci device)
    |   |
    |   +- pci_0000_02_00_0
    |   +- pci_0000_02_00_1
    |
    +- pci_0000_00_07_0            (comment: parent pci device)
    |   |
    |   +- pci_0000_03_00_0
    |   +- pci_0000_03_00_1
    

    This syntax for PCI bus addresses is used in all virsh commands below.

  3. virsh detach and reattach devices

    virsh nodedev-detach pci_0000_02_00_0
    virsh nodedev-detach pci_0000_02_00_1
    virsh nodedev-reattach pci_0000_02_00_1
    
  4. GPU cards info

    virsh nodedev-dumpxml pci_0000_02_00_0 > pci-gpu1
    virsh nodedev-dumpxml pci_0000_03_00_0 > pci-gpu2
    
  5. check device symbolic links

    readlink /sys/bus/pci/devices/0000\:02\:00.0/driver
    
  6. check xml definition of the VM

    virsh dumpxml rocce-vm3 > vm3.out
    

    For a GPU-enabled VM, the hostdev section described above should be in the output.
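
The lspci and virsh commands in this list name the same device in two ways. As a small sketch of the conversion, assuming the naming pattern seen in the outputs above (pci_addr_to_nodedev is a hypothetical helper name, not part of the roll or of virsh):

```shell
# Hypothetical helper, not part of the gpupt roll.
# Converts an lspci -D address (0000:02:00.0) to the virsh node device
# name (pci_0000_02_00_0) by replacing ':' and '.' with '_'.
pci_addr_to_nodedev() {
    printf 'pci_%s\n' "$(printf '%s' "$1" | tr ':.' '__')"
}

pci_addr_to_nodedev 0000:02:00.0
# prints: pci_0000_02_00_0
```

The result can be passed to commands such as virsh nodedev-dumpxml in place of a hand-typed name.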

Useful links for enabling PCI passthrough devices