monitor: Fan controller malfunction monitoring #48

Merged Dec 2, 2024 (1 commit)

@spinler spinler commented Nov 21, 2024

The MAX31785 fan controller has several times in the field reported tach readings of exactly 29104 RPM, leading to callouts of healthy fans and sometimes system shutdowns. This is considered a controller malfunction, since there is no known reason for it to happen.

It is suspected that resetting the chip will clear the error and allow the system to keep functioning without an immediate need for service. On systems that have a GPIO wired from the BMC to the reset input on the MAX31785, the reset can be done using that GPIO.

This commit creates a new MalfunctionMonitor class to handle this workaround. It does the following:

  • Reads the reset GPIO name and tach trigger value out of an optional section in the JSON config file. If it isn't present, then the functionality won't be enabled.
  • Watches for tach sensor values over that trigger value. On the first reading that exceeds it, it will:
    • Reset the fan controller using the GPIO.
    • Stop the fan monitor function for 15s to let the fan RPMs recover.
    • Create an informational event log saying the reset occurred.
    • Record that a reset was done and which sensor triggered it.
  • Only one reset is done per power on. If the same sensor or a different one goes over the limit again, the controller is not reset again, but the code still keeps track of which fans have hit the limit.
  • If a fan ends up getting called out, then the code will check if that fan has hit the malfunction as noted above. If it has, then a unique event log will be created indicating that there was a malfunction. Otherwise, the typical fan fault event log will be created.
  • On power off, the state is cleared so the next power on can start fresh.
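
The reset-once-per-power-on bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the actual MalfunctionMonitor code from this PR: the class name `MalfunctionSketch`, its method names, and the `resetFn` callback (standing in for the GPIO reset, the 15s monitoring pause, and the informational event log) are all assumptions made for the example.

```cpp
#include <functional>
#include <set>
#include <string>
#include <utility>

// Hypothetical sketch only; names and structure are not taken from the PR.
class MalfunctionSketch
{
  public:
    // limit:   tach trigger value (would come from the JSON config).
    // resetFn: stand-in for toggling the reset GPIO, pausing the fan
    //          monitor, and creating the informational event log.
    MalfunctionSketch(double limit, std::function<void()> resetFn) :
        _limit(limit), _reset(std::move(resetFn))
    {}

    // Called for each new tach reading. Returns true only if this
    // reading caused the controller to be reset.
    bool onTachReading(const std::string& sensor, double rpm)
    {
        if (rpm <= _limit)
        {
            return false;
        }

        // Always remember which sensors went over the limit.
        _sensorsOverLimit.insert(sensor);

        // Only one reset per power on.
        if (_resetDone)
        {
            return false;
        }

        _resetDone = true;
        _reset();
        return true;
    }

    // Used at callout time to choose between the malfunction event log
    // and the typical fan fault event log.
    bool hitMalfunction(const std::string& sensor) const
    {
        return _sensorsOverLimit.count(sensor) != 0;
    }

    // Clear all state so the next power on starts fresh.
    void onPowerOff()
    {
        _resetDone = false;
        _sensorsOverLimit.clear();
    }

  private:
    double _limit;
    std::function<void()> _reset;
    bool _resetDone = false;
    std::set<std::string> _sensorsOverLimit;
};
```

In this sketch, a fan callout path would call `hitMalfunction()` with the sensor name and create the unique malfunction event log only when it returns true.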

Tested: Injected various high tach readings at different times using special lab-only code.

Examples of traces of a reset:

phosphor-fan-monitor[2449]: FanCtlr malfunction detected. Tach /xyz/openbmc_project/sensors/fan_tach/fan4_0 value 29104 is over limit.
phosphor-fan-monitor[2449]: Resetting fan controller to recover

And when the fan doesn't recover:

phosphor-fan-monitor[2449]: Creating event log for malfunctioning fan ctlr and sensor /xyz/openbmc_project/sensors/fan_tach/fan4_0

Change-Id: I493be2cfd2058f770299f6cfde8d782b530b7df9

Signed-off-by: Matt Spinler <[email protected]>
@spinler spinler force-pushed the 1110_max31785_malfunction branch from 83cb23b to ff76fc4 Compare November 21, 2024 20:05
@rfrandse rfrandse merged commit bc1e7af into ibm-openbmc:1110 Dec 2, 2024
1 check passed