Global Configuration Variables that Affect all NHC Tests
NHC basics.
The following global configuration variables may be set in the /etc/opt/cray/nodehealth/nodehealth.conf file to alter the behavior of all NHC tests. The global configuration variables are case-insensitive.
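Because the variable names are case-insensitive, a tool that reads these settings can normalize each name before looking it up. The following is a minimal Python sketch, not NHC's own parser. The helper parse_globals is hypothetical, and it assumes each global variable sits on its own line in the name: value form (as in maxdumps: 0). Any other line that happens to contain a colon would also be picked up, so a real reader of nodehealth.conf would need stricter rules.

```python
def parse_globals(text):
    """Parse 'name: value' lines into a dict keyed by lowercased name.

    Hypothetical helper for illustration; not part of NHC.
    """
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or ":" not in line:
            continue  # skip blank lines and lines without a 'name: value' shape
        name, value = line.split(":", 1)
        # Names are case-insensitive, so store them lowercased.
        settings[name.strip().lower()] = value.strip()
    return settings


sample = """
Runtests: always
CONNECTTIME: 60
maxdumps: 0
"""
settings = parse_globals(sample)
print(settings)  # {'runtests': 'always', 'connecttime': '60', 'maxdumps': '0'}
```

Values are kept as strings here; converting them to integers or validating them is left to the caller.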
Runtests: Frequency - Determines how frequently NHC tests are run on the compute nodes. Frequency may be either errors or always. When the value errors is specified, the NHC tests are run only when an application terminates with a non-zero error code or terminates abnormally. When the value always is specified, the NHC tests are run after every application termination. If the Runtests global variable is not specified, the implicit default is errors. This variable applies only to tests in the application set; reservations do not terminate abnormally.
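The Runtests rule can be written as a small decision function. This Python sketch only models the behavior described above; the function name and its arguments are hypothetical and are not taken from NHC.

```python
def should_run_nhc_tests(runtests, exit_code, terminated_abnormally):
    """Model of the Runtests rule; illustrative only, not NHC source code.

    runtests is "errors", "always", or None when the variable is not set.
    """
    # An unset Runtests variable falls back to the implicit default, errors.
    frequency = "errors" if runtests is None else runtests
    if frequency == "always":
        return True
    if frequency == "errors":
        return exit_code != 0 or terminated_abnormally
    raise ValueError(f"unknown Runtests frequency: {runtests!r}")


print(should_run_nhc_tests(None, 0, False))      # False
print(should_run_nhc_tests("errors", 1, False))  # True
print(should_run_nhc_tests("always", 0, False))  # True
```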
Connecttime: TimeoutSeconds - Specifies the amount of time, in seconds, that NHC waits for a node to respond to requests for the TCP connection to be established. If Suspect Mode is disabled and a particular node does not respond after connecttime has elapsed, then the node is marked admindown. If Suspect Mode is enabled and a particular node does not respond after connecttime has elapsed, then the node is marked suspect. NHC will then attempt to contact the node with a frequency established by the recheckfreq variable. If the Connecttime global variable is not specified, then the implicit default TCP time-out value is used. NHC will not enforce time-out on the connections if none is specified. The Connecttime: TimeoutSeconds value provided in the default NHC configuration file is 60 seconds.
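The two outcomes of an elapsed connecttime, and the meaning of an unset value, can be summarized in a short Python sketch. Both functions are hypothetical illustrations. The settings argument is assumed to be a dict keyed by lowercased variable name, and None is used here to stand for "NHC enforces no time-out of its own".

```python
def connect_timeout(settings):
    """Return the Connecttime value in seconds, or None when it is not set.

    Illustrative only: None means NHC enforces no time-out of its own and
    the implicit default TCP time-out applies.
    """
    value = settings.get("connecttime")
    return None if value is None else int(value)


def state_for_unresponsive_node(suspect_mode_enabled):
    """State given to a node that has not responded once connecttime elapses."""
    return "suspect" if suspect_mode_enabled else "admindown"


print(connect_timeout({"connecttime": "60"}))  # 60
print(connect_timeout({}))                     # None
print(state_for_unresponsive_node(False))      # admindown
print(state_for_unresponsive_node(True))       # suspect
```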
maxdumps: MaximumNodes - Specifies the maximum number of nodes that will be dumped among those that fail with the dump or dumpreboot action. For example, if NHC was checking on 10 nodes that all failed tests with the dump or dumpreboot actions, only the number of nodes specified by maxdumps would be dumped, instead of all of them. The default value is 1. To disable dumps of failed nodes with dump or dumpreboot actions, set maxdumps: 0.
downaction: action - Specifies the action NHC takes when it encounters a down node. Valid actions are log, dump, reboot, and dumpreboot. The default action is log.
downdumps: number_dumps - Specifies the maximum number of dumps that NHC will perform for a given APID, assuming that the downaction variable is either dump or dumpreboot. These dumps are in addition to any dumps that occur because of NHC test failures. The default value is 1.
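The maxdumps example with 10 failed nodes can be worked through in a few lines of Python. This is a sketch under stated assumptions: nodes_to_dump is a hypothetical helper, and it picks the first eligible nodes in the order given, whereas the text above does not say which of the failed nodes NHC actually selects.

```python
DUMP_ACTIONS = {"dump", "dumpreboot"}


def nodes_to_dump(failures, maxdumps=1):
    """Return the failed nodes that would be dumped under a maxdumps limit.

    failures is a list of (node_id, action) pairs. Taking the first eligible
    nodes is an assumption made for illustration only.
    """
    eligible = [node for node, action in failures if action in DUMP_ACTIONS]
    return eligible[:max(maxdumps, 0)]


# Ten nodes that all failed a test whose action is dump.
failures = [(nid, "dump") for nid in range(10)]
print(nodes_to_dump(failures))              # [0]
print(nodes_to_dump(failures, maxdumps=3))  # [0, 1, 2]
print(nodes_to_dump(failures, maxdumps=0))  # []
```

With the default of 1, only a single node is dumped even though all ten failed, and maxdumps: 0 suppresses dumps entirely.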
alps_recheck_max: number of seconds - NHC will attempt to verify its view of the states of the nodes with the ALPS view. If NHC is unable to contact ALPS, this variable controls the maximum delay between rechecks.
Default value: 10 seconds
alps_sync_timeout: number of seconds - If NHC is unable to contact ALPS to verify the states of the nodes, this variable controls the length of time before NHC gives up and aborts.
Default value: 1200 seconds
alps_warn_time: number of seconds - If NHC is unable to contact ALPS to verify the states of the nodes, this variable controls how often warnings are issued.
Default value: 120 seconds
sdb_recheck_max: number of seconds - NHC will contact the SDB to query for the states of the nodes. If NHC is unable to contact the SDB, this variable controls the maximum delay between rechecks.
Default value: 10 seconds
sdb_warn_time: number of seconds - If NHC is unable to contact the SDB, this variable controls how often warnings are issued.
Default value: 120 seconds
node_no_contact_warn_time: number of seconds - If NHC is unable to contact a specific node, this variable controls how often warnings are issued.
Default value: 600 seconds
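To see how alps_warn_time and alps_sync_timeout interact at their default values, the following Python sketch lists when warnings would appear during an ALPS outage and whether NHC would give up. It is a rough model, not NHC's implementation. The function name is hypothetical, and it assumes a warning at every whole multiple of alps_warn_time and an abort exactly when alps_sync_timeout is reached.

```python
def alps_outage_timeline(outage_seconds, warn_time=120, sync_timeout=1200):
    """Return (warning_times, aborted) for an ALPS outage of the given length.

    Assumed model: one warning per warn_time interval, abort at sync_timeout.
    """
    aborted = outage_seconds >= sync_timeout
    end = min(outage_seconds, sync_timeout)
    warning_times = list(range(warn_time, end + 1, warn_time))
    return warning_times, aborted


print(alps_outage_timeline(300))  # ([120, 240], False)
warnings, aborted = alps_outage_timeline(1500)
print(len(warnings), aborted)     # 10 True
```

Under these assumptions, the defaults give roughly ten warnings before NHC aborts after 1200 seconds without ALPS contact.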
unhealthy_state: swdown - When a node is deemed unhealthy, it is normally set to admindown. This variable permits a different state to be chosen instead.
Default: not set
unhealthy_state: rebootq - When a node is going to be rebooted, it is normally set to Unavail. This variable permits a different state to be chosen instead.
Default: not set