Standard Variables that Affect Individual NHC Tests
NHC basics.
The following variables are used with each NHC test; set each variable for each test. All variables are case-insensitive. Each NHC test has values supplied for these variables in the default NHC configuration file.
Specific NHC tests require additional variables, which are defined in the nodehealth configuration file.
- action
- Specifies the action to perform if the compute node fails the given NHC test. action may have one of the following values:
  - log
  - Logs the failure to the system console log. The log action will not cause a compute node's state to be set to admindown.
    Important: Tests that have an action of Log do not run in Suspect Mode. If using plugin scripts with an action of Log, the script will only be run once, in Normal Mode. This makes log collecting and various other maintenance tasks easier to code.
  - admindown
  - Sets the compute node's state to admindown (no more applications will be scheduled on that node) and logs the failure to the system console log.
    If Suspect Mode is enabled, the node will first be set to suspect state, and if the test continues to fail, the node will be set to admindown at the end of Suspect Mode.
  - die
  - Halts the compute node so that no processes can run on it, sets the compute node's state to admindown, and logs the failure to the system console log. (The die action is the equivalent of a kernel panic.) This action is good for catching bugs because the state of the processes is preserved and can be dumped at a later time.
    If the advanced_features variable is enabled, die is not allowed.
Each subsequent action includes the actions that preceded it; for example, the die action encompasses the admindown and log actions.
If NHC is running in Normal Mode and cannot contact a compute node, and if Suspect Mode is not enabled, NHC will set the compute node's state to admindown.
The following actions control the NHC and dumpd interaction.
  - dump
  - Sets the compute node's state to admindown and requests a dump from the SMW, in accordance with the maxdumps configuration variable.
  - reboot
  - Sets the compute node's state to unavail and requests a reboot from the SMW. The unavail state is used rather than the admindown state when nodes are to be rebooted because a node that is set to admindown and subsequently rebooted stays in the admindown state. The unavail state does not have this limitation.
  - dumpreboot
  - Sets the compute node's state to unavail and requests a dump and reboot from the SMW.
- warntime
- Specifies the amount of elapsed test time, in seconds, before xtcheckhealth logs a warning message to the console file. This allows an administrator to take corrective action, if necessary, before the timeout is reached.
- timeout
- Specifies the total time, in seconds, that a test should run before an error is returned by xtcheckhealth and the specified action is taken.
- restartdelay
- Valid only when NHC is running in Suspect Mode. Specifies how long NHC will wait, in seconds, to restart the test after the test fails. The minimum restart delay is one second.
- sets
- Indicates when to run a test. The default NHC configuration runs one group of tests after application completion and an alternate group of tests at reservation end. When ALPS calls NHC at the end of the application, tests marked with Sets: Application are run. By default, these tests are: Filesystem, Accelerator, ugni_nhc_plugins, Application Exited Check, Apinit Ping Test, and Apinit Log and Core File Recovery. At the end of the reservation, ALPS calls tests marked Sets: Reservation. By default, these are: Free Memory Check, ugni_nhc_plugins, Reservation, and Hugepages Check.
  If no set is specified for a test, it will default to Application, and run when ALPS calls NHC at the end of the application. If NHC is launched manually, using the xtcheckhealth command, and the -m sets argument is not specified on the command line, then xtcheckhealth defaults to running the Application set.
  If a test is marked Sets: All, it will always run, regardless of how NHC is invoked.
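The set selection rules described for the sets variable can be illustrated with a small Python model. This is only a sketch of the documented behavior, not NHC code: the function names select_tests and normalize are hypothetical, the tests "Unmarked Test" and "Always Test" are invented for illustration, and the sketch assumes that set names are compared case-insensitively and that a test appearing in both default groups (ugni_nhc_plugins) can be modeled as carrying two set names.

```python
DEFAULT_SET = "application"  # documented default when no set is specified


def normalize(sets):
    """Lower-case the set names of a test; fall back to the default set if none are given."""
    names = {s.strip().lower() for s in (sets or []) if s.strip()}
    return names or {DEFAULT_SET}


def select_tests(tests, requested_set=None):
    """Return the names of the tests that would run for the requested set.

    tests: mapping of test name -> list of set names (None when no set is specified).
    requested_set: the set asked for by the caller; None models running
    xtcheckhealth without the -m argument (assumption of this sketch).
    """
    wanted = (requested_set or DEFAULT_SET).strip().lower()
    selected = []
    for name, sets in tests.items():
        marked = normalize(sets)
        # A test marked All always runs; otherwise the requested set must match.
        if "all" in marked or wanted in marked:
            selected.append(name)
    return selected


tests = {
    "Filesystem": ["Application"],
    "Free Memory Check": ["Reservation"],
    "ugni_nhc_plugins": ["Application", "Reservation"],
    "Unmarked Test": None,    # hypothetical test with no set specified
    "Always Test": ["All"],   # hypothetical test marked All
}

print(select_tests(tests))                 # end of application, or no -m argument
print(select_tests(tests, "Reservation"))  # end of reservation
```

In this model, the unmarked test runs only with the Application set, while the test marked All appears in both results.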