Standard Variables that Affect Individual NHC Tests

NHC basics.

The following variables apply to every NHC test and must be set for each test. All variables are case-insensitive. The default NHC configuration file supplies values for these variables for each NHC test.

Specific NHC tests require additional variables, which are defined in the nodehealth configuration file.

action
Specifies the action to perform if the compute node fails the given NHC test. action may have one of the following values:
log
Logs the failure to the system console log. The log action will not cause a compute node's state to be set to admindown.
Important: Tests that have an action of log do not run in Suspect Mode. A plugin script with an action of log is run only once, in Normal Mode. This makes log collection and other maintenance tasks easier to code.
admindown
Sets the compute node's state to admindown (no more applications will be scheduled on that node) and logs the failure to the system console log.

If Suspect Mode is enabled, the node will first be set to suspect state, and if the test continues to fail, the node will be set to admindown at the end of Suspect Mode.

die
Halts the compute node so that no processes can run on it, sets the compute node's state to admindown, and logs the failure to the system console log. (The die action is the equivalent of a kernel panic.) This action is useful for debugging because the state of the processes is preserved and can be dumped at a later time.

If the advanced_features variable is enabled, die is not allowed.

Each subsequent action includes the actions that preceded it; for example, the die action encompasses the admindown and log actions.
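
The cumulative ordering of log, admindown, and die can be sketched in a few lines of Python. This is only an illustrative model, not NHC code: the function name effective_actions and the advanced_features parameter are hypothetical, and lowercasing the value assumes that action values, like variable names, are matched without regard to case.

```python
# Illustrative model only; this is not part of NHC.
# Basic actions in increasing order of severity, as described above.
ACTION_ORDER = ["log", "admindown", "die"]


def effective_actions(action, advanced_features=False):
    """Return every basic action implied by the configured action."""
    name = action.lower()  # assumption: values are case-insensitive
    if name not in ACTION_ORDER:
        raise ValueError("not a basic action: " + action)
    if name == "die" and advanced_features:
        # die is not allowed when advanced_features is enabled
        raise ValueError("die is not allowed with advanced_features enabled")
    # Each action includes the actions that precede it in the ordering.
    return ACTION_ORDER[: ACTION_ORDER.index(name) + 1]


print(effective_actions("admindown"))  # ['log', 'admindown']
print(effective_actions("Die"))        # ['log', 'admindown', 'die']
```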

If NHC is running in Normal Mode and cannot contact a compute node, and if Suspect Mode is not enabled, NHC will set the compute node's state to admindown.

The following actions control the NHC and dumpd interaction.

dump
Sets the compute node's state to admindown and requests a dump from the SMW, in accordance with the maxdumps configuration variable.
reboot
Sets the compute node's state to unavail and requests a reboot from the SMW. The unavail state is used rather than the admindown state when nodes are to be rebooted because a node that is set to admindown and subsequently rebooted stays in the admindown state. The unavail state does not have this limitation.
dumpreboot
Sets the compute node's state to unavail and requests a dump and reboot from the SMW.
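
As a quick summary, the three dumpd-related actions can be written as a lookup table. The Python below is only a sketch for comparing the actions side by side; the names DUMPD_ACTIONS and describe are hypothetical and do not correspond to anything in NHC or dumpd.

```python
# Illustrative summary only; this is not part of NHC or dumpd.
# Maps each action to the node state it sets and what it requests from the SMW.
DUMPD_ACTIONS = {
    "dump": ("admindown", ("dump",)),
    "reboot": ("unavail", ("reboot",)),
    "dumpreboot": ("unavail", ("dump", "reboot")),
}


def describe(action):
    """Return (node_state, smw_requests) for a dumpd-related action."""
    return DUMPD_ACTIONS[action.lower()]  # assumption: case-insensitive values


state, requests = describe("dumpreboot")
print(state, requests)  # unavail ('dump', 'reboot')
```

Both actions that reboot the node use unavail, because a node set to admindown would remain in admindown after the reboot.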
warntime
Specifies the amount of elapsed test time, in seconds, before xtcheckhealth logs a warning message to the console file. This allows an administrator to take corrective action, if necessary, before the timeout is reached.
timeout
Specifies the total time, in seconds, that a test should run before an error is returned by xtcheckhealth and the specified action is taken.
restartdelay
Valid only when NHC is running in Suspect Mode. Specifies how long NHC will wait, in seconds, to restart the test after the test fails. The minimum restart delay is one second.
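
The relationship between warntime and timeout can be illustrated with a small Python sketch. It is a model of the description above, not xtcheckhealth code: the function name timing_phase is hypothetical, the values 30 and 60 are example settings, and the sketch assumes that warntime is smaller than timeout and that each threshold takes effect as soon as the elapsed time reaches it.

```python
# Illustrative model only; this is not xtcheckhealth code.
def timing_phase(elapsed, warntime, timeout):
    """Classify a running test by its elapsed time in seconds."""
    if elapsed >= timeout:
        return "timeout"   # an error is returned and the action is taken
    if elapsed >= warntime:
        return "warning"   # a warning message is logged to the console file
    return "running"


# Hypothetical settings: warntime of 30 seconds, timeout of 60 seconds.
for elapsed in (10, 30, 45, 60):
    print(elapsed, timing_phase(elapsed, warntime=30, timeout=60))
```

The window between the two thresholds (seconds 30 through 59 in this example) is the period in which an administrator can take corrective action before the test times out.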
sets
Indicates when to run a test. The default NHC configuration runs one group of tests after application completion and an alternate group of tests at reservation end.

When ALPS calls NHC at the end of the application, tests marked with Sets: Application are run. By default, these tests are: Filesystem, Accelerator, ugni_nhc_plugins, Application Exited Check, Apinit Ping Test, and Apinit Log and Core File Recovery.

At the end of the reservation, ALPS calls NHC to run the tests marked Sets: Reservation. By default, these tests are: Free Memory Check, ugni_nhc_plugins, Reservation, and Hugepages Check.

If no set is specified for a test, it will default to Application, and run when ALPS calls NHC at the end of the application. If NHC is launched manually, using the xtcheckhealth command, and the -m sets argument is not specified on the command line, then xtcheckhealth defaults to running the Application set.

If a test is marked Sets: All, it will always run, regardless of how NHC is invoked.