Global Configuration Variables that Affect all NHC Tests

The following global configuration variables may be set in the /etc/opt/cray/nodehealth/nodehealth.conf file to alter the behavior of all NHC tests. The global configuration variables are case-insensitive.

Runtests: Frequency
Determines how frequently NHC tests are run on the compute nodes. Frequency may be either errors or always. When the value errors is specified, the NHC tests are run only when an application terminates with a non-zero error code or terminates abnormally. When the value always is specified, the NHC tests are run after every application termination. If the Runtests global variable is not specified, the implicit default is errors.

This variable applies only to tests in the application set; reservations do not terminate abnormally.
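
The Runtests rule described above can be modeled in a few lines. This is only an illustrative sketch of the documented behavior, not NHC's implementation; the function name and its parameters are hypothetical.

```python
from typing import Optional


def should_run_tests(runtests: Optional[str], exit_code: int,
                     abnormal: bool = False) -> bool:
    """Model the documented Runtests rule (hypothetical helper, not NHC code)."""
    # When Runtests is not specified, the implicit default is "errors".
    mode = runtests if runtests is not None else "errors"
    if mode == "always":
        # Tests run after every application termination.
        return True
    if mode == "errors":
        # Tests run only on a non-zero exit code or an abnormal termination.
        return exit_code != 0 or abnormal
    raise ValueError(f"Frequency must be 'errors' or 'always', got {mode!r}")
```

For example, should_run_tests(None, 0) is False because a clean exit under the default errors setting does not trigger the tests, while should_run_tests("always", 0) is True.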

Connecttime: TimeoutSeconds
Specifies the amount of time, in seconds, that NHC waits for a node to respond to requests for the TCP connection to be established. If Suspect Mode is disabled and a particular node does not respond after connecttime has elapsed, then the node is marked admindown. If Suspect Mode is enabled and a particular node does not respond after connecttime has elapsed, then the node is marked suspect. NHC will then attempt to contact the node with a frequency established by the recheckfreq variable.

If the Connecttime global variable is not specified, NHC does not enforce a time-out of its own on these connections, and the implicit default TCP time-out applies. The default NHC configuration file sets Connecttime to 60 seconds.
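
As a rough illustration of how a connect time-out combines with Suspect Mode, the sketch below attempts a TCP connection and maps a failure to the node state described above. The function name, the port argument, and the "responded" label are assumptions made for the example; the sketch also simplifies by treating any connection failure, including a refused connection, as a node that did not respond.

```python
import socket
from typing import Optional


def probe_node(host: str, port: int, connecttime: Optional[float],
               suspect_mode: bool) -> str:
    """Hypothetical probe: return 'responded', 'suspect', or 'admindown'."""
    try:
        # timeout=None means no time-out is enforced by this code, which
        # mirrors the case where Connecttime is not specified.
        with socket.create_connection((host, port), timeout=connecttime):
            return "responded"
    except OSError:
        # Simplification: every connection failure counts as "no response".
        return "suspect" if suspect_mode else "admindown"
```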

The following global variables control the interaction of NHC and dumpd, the SMW daemon that initiates automatic dump and reboot of nodes.

maxdumps: MaximumNodes
Specifies the maximum number of nodes that will be dumped among those that fail with the dump or dumpreboot action. For example, if NHC checks 10 nodes and all of them fail tests with the dump or dumpreboot action, only the number of nodes specified by maxdumps is dumped, not all 10. The default value is 1.

To disable dumps of failed nodes with dump or dumpreboot actions, set maxdumps: 0.
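
The effect of maxdumps can be sketched as a simple cap on the list of failed nodes. The function name is hypothetical, and the choice of which nodes are dumped (here, the first ones in list order) is an assumption; the text above specifies only how many are dumped.

```python
from typing import List, Sequence


def select_nodes_to_dump(failed_nodes: Sequence[str],
                         maxdumps: int = 1) -> List[str]:
    """Cap the nodes to dump at maxdumps (hypothetical helper, not NHC code)."""
    if maxdumps < 0:
        raise ValueError("maxdumps must be zero or greater")
    # Assumption: nodes are taken in list order; maxdumps=0 disables dumps.
    return list(failed_nodes)[:maxdumps]
```

With ten failed nodes and the default maxdumps of 1, a single node is selected; with maxdumps set to 0, the result is an empty list.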

downaction: action
Specifies the action NHC takes when it encounters a down node. Valid actions are log, dump, reboot, and dumpreboot. The default action is log.

downdumps: number_dumps
Specifies the maximum number of dumps that NHC will initiate for a given APID when the downaction variable is set to either dump or dumpreboot. These dumps are in addition to any dumps that occur because of NHC test failures. The default value is 1.
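
The relationship between downaction and downdumps can be sketched as follows. This is a hypothetical model for one APID that computes only how many dumps would be initiated for down nodes; it does not describe what NHC does with down nodes beyond the limit, which the text above does not state.

```python
VALID_DOWNACTIONS = ("log", "dump", "reboot", "dumpreboot")


def down_node_dump_count(num_down_nodes: int, downaction: str = "log",
                         downdumps: int = 1) -> int:
    """Dumps initiated for down nodes of one APID (hypothetical helper)."""
    if downaction not in VALID_DOWNACTIONS:
        raise ValueError(f"invalid downaction: {downaction!r}")
    if downaction in ("dump", "dumpreboot"):
        # downdumps caps the dumps taken for down nodes of this APID.
        return min(num_down_nodes, downdumps)
    # The log and reboot actions do not produce dumps.
    return 0
```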

The following global variables control the interaction between NHC, ALPS, and the SDB.

alps_recheck_max: number of seconds
NHC will attempt to verify its view of the states of the nodes with the ALPS view. If NHC is unable to contact ALPS, this variable controls the maximum delay between rechecks.

Default value: 10 seconds

alps_sync_timeout: number of seconds
If NHC is unable to contact ALPS to verify the states of the nodes, this variable controls the length of time before NHC gives up and aborts.

Default value: 1200 seconds

alps_warn_time: number of seconds
If NHC is unable to contact ALPS to verify the states of the nodes, this variable controls how often warnings are issued.

Default value: 120 seconds

sdb_recheck_max: number of seconds
NHC will contact the SDB to query for the states of the nodes. If NHC is unable to contact the SDB, this variable controls the maximum delay between rechecks.

Default value: 10 seconds

sdb_warn_time: number of seconds
If NHC is unable to contact the SDB, this variable controls how often warnings are issued.

Default value: 120 seconds

node_no_contact_warn_time: number of seconds
If NHC is unable to contact a specific node, this variable controls how often warnings are issued.

Default value: 600 seconds
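
The timing variables above and their documented defaults can be collected in a small lookup, shown here with a case-insensitive match on variable names. This is a sketch only: the parsing of "name: value" lines is a simplifying assumption about the file layout, and the function name is hypothetical.

```python
from typing import Dict

# Defaults as documented above, in seconds.
TIMING_DEFAULTS: Dict[str, int] = {
    "alps_recheck_max": 10,
    "alps_sync_timeout": 1200,
    "alps_warn_time": 120,
    "sdb_recheck_max": 10,
    "sdb_warn_time": 120,
    "node_no_contact_warn_time": 600,
}


def load_timing_settings(conf_text: str) -> Dict[str, int]:
    """Overlay 'name: value' lines from conf_text on the documented defaults."""
    settings = dict(TIMING_DEFAULTS)
    for line in conf_text.splitlines():
        name, sep, value = line.partition(":")
        key = name.strip().lower()  # variable names are case-insensitive
        if sep and key in settings:
            settings[key] = int(value.strip())
    return settings
```

Lines that name other variables, such as Runtests, are ignored by this sketch, so only the six timing values are returned.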

The following global variables control NHC's use of node states.

unhealthy_state: swdown
When a node is deemed unhealthy, it is normally set to admindown. This variable permits a different state to be chosen instead.

Default: not set

unavail_state: rebootq
When a node is going to be rebooted, it is normally set to Unavail. This variable permits a different state to be chosen instead.

Default: not set
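
The two state overrides described above amount to a fallback: use the configured state if one is set, otherwise use the normal state. The sketch below models that choice; the function, its event labels, and its parameter names are hypothetical and are not NHC identifiers.

```python
from typing import Optional


def node_state_for(event: str,
                   unhealthy_override: Optional[str] = None,
                   reboot_override: Optional[str] = None) -> str:
    """Pick the state for a node (hypothetical helper, not NHC code)."""
    if event == "unhealthy":
        # Normal state is admindown unless an override is configured.
        return unhealthy_override or "admindown"
    if event == "reboot":
        # Normal state is Unavail unless an override is configured.
        return reboot_override or "Unavail"
    raise ValueError(f"unknown event: {event!r}")
```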