monitor: Fan controller malfunction monitoring #48
The MAX31785 fan controller has, on several occasions in the field, reported tach readings of exactly 29104 RPM, leading to fan callouts for healthy fans and sometimes to system shutdowns. This is considered a malfunction, since there is no known reason for the controller to report this value.
It is suspected that resetting the chip will clear the error and allow the system to keep functioning without an immediate need for service. On systems that have it wired, the reset can be performed using a GPIO that runs from the BMC to the reset input on the MAX31785.
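As a rough illustration of the idea only (not the code in this commit), the detect-and-reset logic could be sketched as below. The class name TachMalfunctionSketch, the resetChip callback standing in for the GPIO pulse, and the maxResets limit are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Tach value described above as the malfunction signature.
constexpr std::uint32_t malfunctionRpm = 29104;

// Hypothetical helper for illustration; not the MalfunctionMonitor class.
class TachMalfunctionSketch
{
  public:
    // resetChip: assumed callback that pulses the reset GPIO.
    // maxResets: assumed cap on reset attempts before giving up.
    TachMalfunctionSketch(std::function<void()> resetChip,
                          unsigned maxResets) :
        resetChip_(std::move(resetChip)), maxResets_(maxResets)
    {
        assert(resetChip_); // a reset action must be supplied
    }

    // Returns true if the reading matched the malfunction value.
    // Requests a chip reset only while attempts remain.
    bool onTachReading(std::uint32_t rpm)
    {
        if (rpm != malfunctionRpm)
        {
            return false;
        }
        if (resetsDone_ < maxResets_)
        {
            ++resetsDone_;
            resetChip_();
        }
        return true;
    }

    unsigned resetsDone() const
    {
        return resetsDone_;
    }

  private:
    std::function<void()> resetChip_;
    unsigned maxResets_;
    unsigned resetsDone_ = 0;
};
```

Capping the number of resets is one possible way to let the normal fan fault handling take over if the value keeps reappearing.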
This commit creates a new MalfunctionMonitor class to handle this workaround. It does the following:
Tested: Injected various high tach readings at different times using special lab-only code.
Examples of traces of a reset:
And when the fan doesn't recover:
Change-Id: I493be2cfd2058f770299f6cfde8d782b530b7df9