NHC Suspect Mode

Configuration recommendations for NHC Suspect Mode.

Upon entering Suspect Mode, NHC immediately allows healthy nodes to be returned to the resource pool. The remaining nodes, all of which are in the suspect state, are given an opportunity to recover. Any node that has not become healthy by the end of Suspect Mode (determined by the suspectend global variable; see below) has its state set to admindown. For more information about how Suspect Mode functions, see the intro_NHC(8) man page.

Important: Suspect Mode is enabled in the default configuration. Cray recommends that sites run NHC with Suspect Mode enabled.

The default NHC configuration file sets the following Suspect Mode variables:

suspectenable:
Enables Suspect Mode; valid values are y and n.

Default: y

suspectbegin:
Sets the Suspect Mode timer. Suspect Mode starts after the number of seconds indicated by suspectbegin has elapsed.

Default: 180

suspectend:
Suspect Mode ends after the number of seconds indicated by suspectend has elapsed. This timer starts only after NHC has entered Suspect Mode.

Default: 2100

Considerations when evaluating shortening the length of Suspect Mode:

  • The length of Suspect Mode can be shortened if there are no external file systems, such as Lustre, for NHC to check.
  • Cray recommends that the length of Suspect Mode be at least a few seconds longer than the longest time-out value for any of the NHC tests. For example, if the Filesystem test has the longest time-out value, 900 seconds, then the length of Suspect Mode should be at least 905 seconds.
  • The longer the Suspect Mode, the longer nodes have to recover from any unhealthy situations. Making Suspect Mode too short reduces this recovery time and increases the likelihood of nodes being marked admindown prematurely.
  • Cray recommends that the length of Suspect Mode be increased on systems containing Intel® Xeon Phi™ ("KNL") compute nodes to avoid premature test failures. Time-out values for individual tests may also need to be increased on these systems.
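
The time-out recommendation above comes down to simple arithmetic, which the following Python sketch illustrates. It is not part of NHC. Only the Filesystem value of 900 seconds comes from the example above; the other test names and time-outs, the 5-second margin constant, and the function name are hypothetical.

```python
# Illustrative sketch only; NHC does not ship this helper.
# Only the Filesystem time-out (900 s) comes from the documentation example.
# The other entries are hypothetical placeholders.
test_timeouts = {
    "Filesystem": 900,
    "hypothetical_test_a": 300,
    "hypothetical_test_b": 60,
}

# "A few seconds" of headroom; 5 matches the 905-second example.
MARGIN_SECONDS = 5


def minimum_suspectend(timeouts, margin=MARGIN_SECONDS):
    """Return a suspectend value slightly longer than the slowest test time-out."""
    return max(timeouts.values()) + margin


min_suspectend = minimum_suspectend(test_timeouts)
print(min_suspectend)  # 905
```

A site that raises the time-out of any test can rerun the same calculation to confirm that suspectend still exceeds it.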

recheckfreq:
During Suspect Mode, NHC rechecks the health of the nodes in the suspect state at the interval specified by recheckfreq. This value is in seconds.

For a detailed description of NHC actions during the recheck process, see the intro_NHC(8) man page.

Default: 300
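
To see how the three timers relate, the following Python sketch works through the default values. It is not part of NHC and the function name is hypothetical. Times are measured from whenever the suspectbegin timer starts. The interval count is only the number of whole recheckfreq periods that fit into Suspect Mode; it is an assumption for illustration, not a statement of the recheck schedule NHC actually follows (see the intro_NHC(8) man page for that).

```python
# Illustrative sketch only; values are the documented defaults.
SUSPECTBEGIN = 180   # seconds before Suspect Mode starts
SUSPECTEND = 2100    # seconds Suspect Mode lasts once entered
RECHECKFREQ = 300    # seconds between rechecks


def suspect_timeline(begin, end, freq):
    """Return (entry time, exit time, whole recheck intervals that fit)."""
    enter = begin          # Suspect Mode starts once suspectbegin has elapsed
    leave = begin + end    # the suspectend timer starts only at entry
    intervals = end // freq  # assumption: counts whole periods, not actual rechecks
    return enter, leave, intervals


enter, leave, intervals = suspect_timeline(SUSPECTBEGIN, SUSPECTEND, RECHECKFREQ)
print(enter, leave, intervals)  # 180 2280 7
```

With the defaults, Suspect Mode would end 2280 seconds after the suspectbegin timer starts, which is the point at which nodes still in suspect state are set to admindown.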