Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MSIX Implementation for Linux Kernel >= 5.0.0 #5

Open
wants to merge 1 commit into
base: vfio-over-socket
Choose a base branch
from

Conversation

atomass
Copy link

@atomass atomass commented Jul 9, 2020

Some comments on this PR:

  • I would have gladly used the already existing flags member of lm_reg_info_t instead of adding the new is_msix one. But I couldn't because flags is directly used as VFIO flags.
  • From PCIe specification capabilities don't need to be contiguous but this break somehow libmuser cap.c implementation. So I ended up with caps->caps[i].end = prev_end = caps->caps[i].start + ALIGN(lm_caps[i].size, 8) - 1; to ensure DWORD alignment. This is not the best option because there could be a (very rare/impossible) case were to the callback function it will be asked to read/write an offset greater than what it's expecting.
    I needed the DWORD alignment because PCI_CAP_ID_MSIX is 12bytes
  • When QEMU finds a region that expose VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability, it will mmap it as follows:
        region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot,
                                     MAP_SHARED, region->vbasedev->fd,
                                     region->fd_offset +
                                     region->mmaps[i].offset);

The problem is that libmuser derives the fd_offset of each region accordingly to the function region_to_offset -> return (uint64_t)region << LM_REGION_SHIFT;, a huge shift. That's why I decided that MSIX structures should reside on BAR 0.

@tmakatos
Copy link
Owner

tmakatos commented Jul 13, 2020

Thanks for the patch!

First of all, what exactly doe this patch fix specifically for kernel >= 5? This is a based on the vfio-over-socet branch so the host kernel doesn't really come into play here, right?

The fix for the alignment is legitimate. Howerver I don't understand the problem regarding enforcing the capabilities to be contiguous.

there could be a (very rare/impossible) case were to the callback function it will be asked to read/write an offset greater than what it's expecting.
Coud you elaborate?

Are you trying to allow the non-standard path of the PCI header (>= 0x40) to be memory mapped? If so, why? And if there really is a legitimate for that then I think the fix would be to keep the capabilities on a chunk of memory the same way we do for the PCI header instead of keeping in the in a list.

@tmakatos tmakatos self-assigned this Jul 13, 2020
caps->caps[i].start = prev_end + 1;
caps->caps[i].end = prev_end = caps->caps[i].start + lm_caps[i].size - 1;
caps->caps[i].end = prev_end = caps->caps[i].start + ALIGN(lm_caps[i].size, 8) - 1;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make this fix a separate patch?

@atomass
Copy link
Author

atomass commented Jul 13, 2020

First of all, what exactly doe this patch fix specifically for kernel >= 5?

VFIO_REGION_INFO_CAP_MSIX_MAPPABLE capability has been introduced in kernel 5.0 (more on this, here: https://patchwork.kernel.org/patch/10106225/)
For older kernels, or in the case this capability is missing, QEMU will try to understand were your MSI(X) structures are located and how to mmap them (check https://github.com/qemu/qemu/blob/v4.2.0/hw/vfio/pci.c#L1295). I have to be honest, I had hard times to figure out why it wasn't working. Then I just went for the newest and best practice road.

The fix for the alignment is legitimate. However, I don't understand the problem regarding enforcing the capabilities to be contiguous.

I had to force them to be contiguous otherwise something, from libmuser side, didn't work as expected. I tried to understand what was wrong but without any success. So I decided to come up with this workaround

Can you elaborate?

Sure. Take for example the case of PCI_CAP_ID_MSIX capability. It's defined to be 12 bytes long (that it's clearly not dword aligned) with my fix it will become dword aligned but the size of it will become 16bytes.
That is because how libmuser is managing capabilities right now, after having them setup it converts lm_cap_t to struct cap (that does not really make 100% sense to me).
On one side lm_cap_t is used as an array that contains the size of the capability. On the other hand, struct cap contains start and end of the capability (that, ok, conceptually is the same thing so: why convert them, in a first place?)
But go back to my concern:

there could be a (very rare/impossible) case were to the callback function it will be asked to read/write an offset greater than what it's expecting.

So, let we say that, in your device, you have defined a lm_cap like this:

.caps[0] = {
            .id = PCI_CAP_ID_MSIX,
            .size = 12,
            .fn = &cap_msix
        },

in the function struct caps *caps_create(const lm_cap_t *lm_caps, int nr_caps), with my changes, it will be transformed to

struct cap {
    start = prev_end + 1;
    end = caps->caps[i].start + ALIGN(lm_caps[i].size, 8) - 1;
   ...
}

so you can easily understand that the size now will be size = end - start = ALIGN(lm_caps[i].size, 8) that, in our example, will be 16 != 12. So, if to libmuser will arrive a request from the driver to read 14 bytes out of that capability, it will be passing this request to the fn associated to that capability because it will look legitimate, even if it shouldn't.
Hope to have made it more clear now.

Are you trying to allow the non-standard path of the PCI header (>= 0x40) to be memory mapped?

No

@tmakatos
Copy link
Owner

tmakatos commented Jul 13, 2020

I'm looking at the documentation of VFIO_REGION_INFO_CAP_MSIX_MAPPABLE:

/*
 * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
 * which allows direct access to non-MSIX registers which happened to be within
 * the same system page.

MUSER allows any BAR to be mmaped, however my understanding is that this is not a hard requirement for MSI-X to work, correct? I understand that this might be broken in MUSER because capabilities aren't properly aligned, but hopefully your fix addresses this. Are you trying to fix two things in the same patch (fix alignment + add support for VFIO_REGION_INFO_CAP_MSIX_MAPPABLE)? If so, then we should separate them into separate commits.

Regarding the alignment problem, at the time we were implementing capabilities it seemed easier from an API point of view to implement it the way we did, however it turns out that having all capabilities in a buffer right after the PCI header would solve multiple problems.

Regarding adding support for VFIO_REGION_INFO_CAP_MSIX_MAPPABLE, if you really need it then we should add it, however I feel that we should first fix/refactor capabilities (replace the list with a buffer).

Is my understanding correct?

I had realised that capabilities need improvement and had created issue nutanix#14. In fact I had started working on it (https://github.com/tmakatos/muser/commits/issue-14-rework) but then decided to stop because it would make creating an lm_ctx_t slightly more complicated. Moreover, I erroneouly thought that by doing so we would have to handle PCI capabilities in MUSER and didn't want to do that at the moment as it would have been a huge task. I now feel that it's fine to require from the user a few extra lines of code in order to create an lm_ctx_t. Also, we have some know-how regarding hanlding MSI-X in https://github.com/tmakatos/spdk/blob/master/lib/nvmf/muser.c, so we could at least offer some experimental support of halding some capabilities.

@tmakatos
Copy link
Owner

tmakatos commented Jul 14, 2020

Take for example the case of PCI_CAP_ID_MSIX capability. It's defined to be 12 bytes long (that it's clearly not dword aligned)

I just checked the PCI spec (rev 3.0) and it says that capabilities must be DWORD aligned (p. 667), and DWORD is defined as 4 bytes (p. 31), not 8. I'm looking at /usr/include/linux/pci_regs.h and I see that all capability sizes (e.g. PCI_CAP_MSIX_SIZEOF) are DWORD aligned. Can you check whether using the correct alignment (4 bytes) solves your problem?

@atomass
Copy link
Author

atomass commented Jul 14, 2020

MUSER allows any BAR to be mmaped, however my understanding is that this is not a hard requirement for MSI-X to work, correct?

No, I think you're not correct. Check what I wrote previously:

For older kernels, or in the case this capability is missing, QEMU will try to understand were your MSI(X) structures are located and how to mmap them (check https://github.com/qemu/qemu/blob/v4.2.0/hw/vfio/pci.c#L1295).

So, in one way (VFIO_REGION_INFO_CAP_MSIX_MAPPABLE) or another (https://github.com/qemu/qemu/blob/v4.2.0/hw/vfio/pci.c#L1295) QEMU needs to mmap the BAR (fully or partially) where MSIX structures reside.

Can you check whether using the correct alignment (4 bytes) solves your problem?

To my (very little and maybe wrong) understanding, DWORD alignment implies that the size must be an even multiplier of DWORD. So, because 12bytes are 3 DWORDs, to my understanding, it's not DWORD aligned. Am I wrong?

Btw, if you prefer, we can split the discussion between capabilities and MSIX. They are related to some extent but not that deeply.

@tmakatos
Copy link
Owner

tmakatos commented Jul 14, 2020

Yes I think we need to split the discussion.

DWORD alignment means that that the size must be a multiple of DWORD: 3 (size o MSI-X) * 4 (size of DWORD) == 12. Check https://en.wikipedia.org/wiki/Data_structure_alignment:

A memory address a is said to be n-byte aligned when a is a multiple of n bytes (where n is a power of 2). In this context, a byte is the smallest unit of memory access, i.e. each memory address specifies a different byte. An n-byte aligned address would have a minimum of log2(n) least-significant zeros when expressed in binary.

Regarding VFIO_REGION_INFO_CAP_MSIX_MAPPABLE, MUSER does allow any BAR to be memory mapped, you just need to set the LM_REG_FLAG_MMAP attribute and provide a callback when defining the region. We already do that in SPDK, however we use a 4.x kernel. I think I'm still missing something here :-/.

@tmakatos
Copy link
Owner

@atomass I discovered a bug in libmuser where the capability ID is always returned, this seems to be affected by alignment so I suppose that's why it works with QWORD alignment. I'll be fixing this shortly.

@tmakatos
Copy link
Owner

@atomass I've refactored PCI capabilities (which sligtly changed the API) and the bug is now gone. The code now is actually smaller, simpler and most importantly correct. There some more room for improvement. Please try it out and let me know if it works for you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants