Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MEP-15: HAL Improvements #232

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from
Draft

MEP-15: HAL Improvements #232

wants to merge 5 commits into from

Conversation

Gerrit91
Copy link
Contributor

Description

I am not very certain about this MEP, but somehow I felt the need that we should think about this topic again and add our learnings from the past in order to improvate hardware support.

@metal-robot
Copy link
Contributor

metal-robot bot commented Nov 19, 2024

Thanks for contributing a pull request to the metal-stack docs!

A rendered preview of your changes will be available at: https://docs.metal-stack.io/previews/PR232/

I am not very certain about this MEP, but somehow I felt the need that we should think about this topic again and add our learnings from the past in order to improvate hardware support.
@Gerrit91 Gerrit91 force-pushed the mep-15-hal-improvements branch from 569233c to b4a27ee Compare November 19, 2024 14:58
Copy link
Contributor

@majst01 majst01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice writeup so far.

docs/src/development/proposals/MEP15/index.md Outdated Show resolved Hide resolved

## metal-bmc CLI

The CLI of the new metal-bmc API must become a first-class citizen in order to simplify testing the API. The entire new API should be generically implemented such that operators can run commands easily against a BMC.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One key aspect of test-ability is that for every exposed API call a full set of CRUD operations exists, for example if there is a GetIdentifierLEDState call present, there must be a SetIdentifierLEDState available to be able to test the this single functionality. Same must apply for all other endpoints if possible. Sometimes its not possible for example GetBoardInfo.

Copy link
Contributor Author

@Gerrit91 Gerrit91 Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is CRUD then really the correct term? Because I think we cannot "create" something through this API. I think you want an opportunity to get the current state in case we have a setter for something, right?


In order to have earliest possible discovery of a machine and allow potential BMC management without having to run the metal-hammer, a new table in the metal-api named `bmc` is proposed. The primary key for this table is a BMC's mac address.

This table contains the available drivers to access a machine with, which is tried to be automatically discovered through the metal-bmc. It may be that the table entries do not have an association to a machine ID directly. This is also not required in order to issue commands against the machines. A relation can be established at a later point (in most cases automatically done by the metal-hammer), such that the existing commands like `metalctl machine power/boot/...` continue to work.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot imagine to have a single driver per machine type which covers all of our requirements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A vendor implementation may provide a fallback to e.g. the IPMI driver if it is available.

Copy link
Contributor Author

@Gerrit91 Gerrit91 Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe we can include fallback into our API directly and specify a list of drivers:

mac: 27:53:57:51:6b:c8
drivers: ["Redfish", "IPMI"]

If one driver errors out, it will automatically fallback to the next.


If it does not find an entity in the database, it performs an "auto discovery". In this process, the metal-bmc tries to automatically discover available BMC drivers for this server (e.g. for IPMI through RMCP like [idiscover](https://linux.die.net/man/8/idiscover), etc.).

It then reports this BMC to the metal-api containing the mac address, IP and possible drivers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the past we see often that the BMC renews its IP address with a new one, most often after a hard reset of the BMC or if the firmware of the BMC got updated. This needs to be considered.


A user might provide connection details for specific drivers or select a different default driver for BMC management. It is now theoretically possible to interact with the machine BMC through the metal-api. Note that the metal-hammer was not yet involved.

If there is already an entity found in the `bmc` table, the metal-bmc attempts to update the BMC information. If, in addition to that, credentials are already provided to access the machine, the metal-bmc can additionally figure out a machine UUID related to the BMC address it can establish a relation between BMC table and machine table by updating the machine ID field in the `bmc` table and also update information about the board.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UUID is already calculated and send back to the metal-api within the pixiecore by inspecting the DHCP Bootp Request. This must be adopted eventually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pixiecore only knows the UUID but not the BMC mac address. I think it can stay as it is. The pixiecore can create the empty machine entity, but it cannot establish the relation between BMC and machine.


The metal-hammer gets access to the BMC API as well as to the metal-api through the pixiecore. The metal-hammer will lookup the BMC in the metal-api by the locally discovered UUID. If there is a relation between the machine and the BMC already, the metal-hammer does not need to do anything specific. It may call the new BMC API at any given point during the provisioning sequence.

If there is no relation yet, the metal-hammer attempts to establish this relation by using IPMI inband information. The metal-hammer tries to figure out the BMC mac address and attempts to generate a privileged IPMI user and password. If this works, then the metal-hammer updates the BMC table with working access credentials. This way, it is not strictly required for operators to manually insert connection data into the BMC table, but the metal-hammer can generate them through inband capabilities. If it does not work, an operator has to manually provide credentials.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requires IPMI inband access which might not be taken for granted.


### Firmware Update Functionality

This feature will be dropped because it was not completely worked out at the point of implementation. It also seems like nobody is actively using it. This brings so many challenges that we should create another MEP in order to bring it back when it's required.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably drop BIOS update and keep BMC Update functionality because the later does not require a system power reset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But how often did we really use it? 😅
I think the current implementation is kind of dangerous because it is not prevented from us that multiple update requests are issued in parallel against a single BMC. The operator does not even know if an update is currently in progress or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but this is a implementation flaw which must be addressed

@majst01
Copy link
Contributor

majst01 commented Nov 19, 2024

As a next step we should draw a first draft of how the API would look like to get it more dialed.


For every DHCP entry, the metal-bmc looks up the BMC in the new metal-api `bmc` table.

If it does not find an entity in the database, it performs an "auto discovery". In this process, the metal-bmc tries to automatically discover available BMC drivers for this server (e.g. for IPMI through RMCP like [idiscover](https://linux.die.net/man/8/idiscover), etc.).
Copy link
Contributor

@mwindower mwindower Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idiscover is pretty old, we could just do a golang based port scan for the BMC network:

  • use open port 623 as indication for ipmi
  • use open port 443 as indication for redfish
  • if both are open: prefer redfish

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idiscover was just mentioned as an example of how a discovery can be done. Of course it should be implemented properly in Golang.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants