Configure Node Health Checker Tests

List of NHC tests and configuration file description.

ALPS automatically invokes Node Health Checker (NHC) upon the termination of an application and passes it a list of associated CNL compute nodes. Compute Node Cleanup (CNCU) is called to remove as many application-created, in-memory data objects as possible in order to return compute nodes to the pool of available nodes with as much free memory as possible. NHC performs specified tests to determine if these compute nodes are healthy enough to support running subsequent applications. If not, it removes them from the resource pool.

The CLE installation and upgrade processes automatically install and enable NHC software; there is no need to change any installation configuration parameters or issue any commands. The NHC advanced_features control variable must be enabled to use CNCU, and by default it is set to on. For further information, see NHC Control Variables.

Use the cray_node_health_worksheet.yaml file or configurator to configure the NHC tests, which test CNL compute node functionality. All tests that are enabled will run when NHC is in either Normal Mode or in Suspect Mode. Tests run in parallel, independently of each other, except for the Free Memory Check test, which requires that the Application Exited Check test passes before the Free Memory Check test begins.
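The scheduling rule described above (all tests in parallel, with Free Memory Check gated on Application Exited Check) can be illustrated with a small sketch. This is not NHC's implementation; the function name run_tests and the callables passed to it are hypothetical, and each test is modeled as a zero-argument function that returns True on a pass.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tests(independent, app_exited_check, free_memory_check):
    """Run tests concurrently; gate Free Memory Check on Application Exited Check.

    independent: dict mapping a test name to a zero-argument callable.
    All names here are hypothetical and only illustrate the ordering rule.
    """
    def gated_chain():
        exited_ok = app_exited_check()
        # Free Memory Check begins only if Application Exited Check passed.
        free_ok = free_memory_check() if exited_ok else None
        return exited_ok, free_ok

    with ThreadPoolExecutor() as pool:
        chain = pool.submit(gated_chain)
        others = {name: pool.submit(fn) for name, fn in independent.items()}
        results = {name: fut.result() for name, fut in others.items()}
        exited_ok, free_ok = chain.result()

    results["Application Exited Check"] = exited_ok
    results["Free Memory Check"] = free_ok  # None means the test was not run
    return results
```

In this sketch a result of None for Free Memory Check records that the test never started because its prerequisite failed.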

The xtcheckhealth binary runs the NHC tests; for information about the xtcheckhealth binary, see the intro_NHC(8) and xtcheckhealth(8) man pages.

The NHC tests are listed below. In the default NHC configuration file, each test that is enabled starts with an action of admindown, except for Free Memory Check, which starts with an action of log.

Also read important test usage information in Guidance for the Accelerator Test, Guidance for the Application Exited Check and Apinit Ping Tests, Guidance for the Filesystem Test, Guidance for the Hugepages Test, and Guidance for the NHC Lustre File System Test.

Accelerator
Tests the health of any accelerators present on the node. It is an application set test and should not be run in the reservation set.

The global accelerator test (gat) script detects the type of accelerator(s) present on the node and then launches a test specific to the accelerator type. The test fails if it is unable to run successfully on the accelerator, or if the amount of allocated memory on the accelerator exceeds the amount specified using the gat -m argument.

Default: enabled

Application Exited Check
Verifies that any remaining processes from the most recent application have terminated. It is an application set test and should not be run in the reservation set because an application is not associated with a reservation cancellation.

The Application Exited Check test checks locally on the compute node for processes running under the ID of the application (APID). If processes are running, NHC waits a period of time (defined in the configuration file) to determine if the application processes exit properly. If the processes do not exit within that time, this test fails.

Default: enabled
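The wait-then-fail behavior of the Application Exited Check can be sketched as a polling loop. This is an illustration only, not the xtcheckhealth implementation; wait_for_exit and its list_app_pids callback (which would return the process IDs still running under the APID) are hypothetical names.

```python
import time


def wait_for_exit(list_app_pids, timeout_s, poll_s=1.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Return True if no application processes remain before timeout_s elapses.

    list_app_pids is a hypothetical callback returning the remaining PIDs.
    """
    deadline = clock() + timeout_s
    while True:
        if not list_app_pids():
            return True   # all processes have exited: the test passes
        if clock() >= deadline:
            return False  # processes still present at the deadline: the test fails
        sleep(poll_s)
```

The sleep and clock parameters exist only so the loop can be exercised without real delays.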

Apinit Log and Core File Recovery
A plugin script to copy apinit core dump and log files to a login/service node. It is an application set test.

Default: not enabled. Apinit Log and Core File Recovery should not be enabled until a destination directory is determined and specified in the NHC configuration file.

Apinit Ping
Verifies that the ALPS daemon is running on the compute node and is responsive. It is an application set test.

The Apinit Ping test queries the status of the apinit daemon locally on each compute node; if the apinit daemon does not respond to the query, then this test fails.

Default: enabled

DataWarp
A plugin script to check that any reservation-affiliated DataWarp mount points have been removed. Note that the plugin can only detect a problem after the last reservation on a node completes.

Default: disabled

Free Memory Check
Examines how much memory is consumed on a compute node while applications are not running. Use it only as a reservation test because an application within a reservation may leave data for another application in a reservation. If run in the application set, Free Memory Check could consider data that was intentionally left for the next application to be leaked memory and mark the node admindown. Run the Free Memory Check only after the Reservation test passes successfully.

Default: enabled (action is log only)

Filesystem
Ensures that the compute node is able to perform simple I/O to the specified file system. It is configured as an application set test in the default configuration, but it can be run in the reservation set. For a file system that is mounted read-write, the test verifies I/O by creating, writing, flushing, syncing, and deleting a file. If a mount point is not explicitly specified, the mount points from the compute node /etc/fstab file are used, and a Filesystem test is created for each mount point found in the file. If a mount point is explicitly specified, then only that file system is checked. An administrator can specify multiple Filesystem tests by placing multiple Filesystem lines in the configuration file. For example, one line could specify the implicit Filesystem test, and the next line could specify a specific file system that does not appear in /etc/fstab. Further lines can be added for any other file systems.

When enabling the Filesystem test, an administrator can use the excluding setting in the configuration to list mount points that the Filesystem test should not check. This allows specific mount points to be excluded intentionally even though they appear in the fstab file, which prevents NHC from setting nodes to admindown because of errors on relatively benign file systems. Explicitly specified mount points cannot be excluded in this fashion; if they should not be checked, then they should simply not be specified.
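The interaction between implicit mount points, exclusions, and explicitly specified mount points can be sketched as follows. This is a minimal illustration under stated assumptions, not NHC code: select_mount_points and its parameters are hypothetical, and the fstab content is passed in as text rather than read from the node.

```python
def select_mount_points(fstab_text, excluded=(), explicit=()):
    """Return the mount points a Filesystem-style check would cover.

    Hypothetical helper: exclusions apply only to mount points found in
    the fstab text; explicitly specified mount points are always kept.
    """
    implicit = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) >= 2:
            implicit.append(fields[1])  # second fstab field is the mount point

    excluded = set(excluded)
    selected = [m for m in implicit if m not in excluded]
    # Explicit mount points cannot be excluded, so they are appended as-is.
    selected.extend(explicit)
    return selected
```

Note that a mount point listed in both the exclusions and the explicit list is still selected here, which mirrors the rule that exclusions do not apply to explicitly specified mount points.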

The Filesystem test creates its temporary files in a subdirectory (.nodehealth.fstest) of the file system root. An error message is written to the console when the unlink of a file created by this test fails.

Default: enabled
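The create, write, flush, sync, and delete sequence that the Filesystem test performs can be approximated with the sketch below. It is an assumption-laden illustration rather than the actual test: check_filesystem_io is a hypothetical name, and only the .nodehealth.fstest subdirectory name comes from the description above.

```python
import os
import tempfile


def check_filesystem_io(mount_point, subdir=".nodehealth.fstest"):
    """Create, write, flush, sync, and delete a file; return True on success.

    Hypothetical sketch of a simple I/O probe, not the NHC implementation.
    """
    test_dir = os.path.join(mount_point, subdir)
    try:
        os.makedirs(test_dir, exist_ok=True)
        fd, path = tempfile.mkstemp(dir=test_dir)   # create the file
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(b"nhc-io-check\n")          # write
                f.flush()                           # flush Python's buffer to the OS
                os.fsync(f.fileno())                # ask the OS to commit to storage
        finally:
            os.unlink(path)                         # delete
        return True
    except OSError:
        # Any failure in the sequence, including the unlink, counts as a failed check.
        return False
```

A caller would run this once per selected mount point and apply the configured action to any node where it returns False.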

Hugepages
Calculates the amount of memory available in a specified page size with respect to a percentage of /proc/boot_freemem. It is a reservation set test.

This test will continue to check until either the memory clears up or the time-out is reached. The default time-out is 300 seconds.

Default: disabled

Sigcont Plugin
Sends a SIGCONT signal to the processes of the current APID. It is an application set test.

Default: disabled

Plugin
Allows scripts and executables not built into NHC to be run, provided they are accessible on the compute node.

Default: disabled so that local configuration settings may be used

ugni_nhc_plugins
Tests the User level Gemini Network Interface (uGNI) on compute nodes. It is a reservation set test and an application set test. By extension, testing the uGNI interface also tests the proper operation of parts of the network interface card (NIC). The test sends a datagram packet out to the node's NIC and back again.

Reservation
Checks for the existence of the /proc/reservations/rid directory, where rid is the reservation ID. It is a reservation set test, and should not be run in the application set.

If this directory still exists, the test will attempt to end the reservation and then wait for the specified timeout value for the directory to disappear. If the test fails and Suspect Mode is enabled, NHC enters Suspect Mode. In Suspect Mode, Reservation continues running, repeatedly requesting that the kernel clean up the reservation, until the test passes or until Suspect Mode times out. If the directory does not disappear in that time, the test prints information to the console and exits with a failure.

Default: enabled with a timeout value of 300 seconds
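The Reservation test's retry behavior (request cleanup, then wait for the directory to disappear or the timeout to expire) can be sketched as below. This is a simplified illustration that folds the Normal Mode wait and the Suspect Mode retries into one loop; wait_for_reservation_cleanup and the request_cleanup callback are hypothetical, and proc_root is a parameter only so the sketch can be tried against an ordinary directory.

```python
import os
import time


def wait_for_reservation_cleanup(rid, request_cleanup, timeout_s=300,
                                 poll_s=1.0, proc_root="/proc/reservations"):
    """Return True once the reservation directory is gone, False on timeout.

    request_cleanup is a hypothetical hook standing in for the request
    that asks the kernel to clean up the reservation.
    """
    path = os.path.join(proc_root, str(rid))
    deadline = time.monotonic() + timeout_s
    while os.path.isdir(path):
        if time.monotonic() >= deadline:
            return False          # directory still present: the test fails
        request_cleanup(rid)      # ask again for the reservation to be ended
        time.sleep(poll_s)
    return True
```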

CCM plugin
Validates the cleanup of a cluster compatibility mode (CCM) environment at the end of a reservation. It is a reservation set test, and it will not run if it is misconfigured as an application test.

This test runs on a compute node only when /var/crayccm is detected. The test removes the /var/lib/{empty,debus} directories, unmounts CCM mount points if they still exist, and unmounts /dsl/dev/random and /dsl/dev/pts. If the unmounts are successful, the test removes the /var/crayccm, /var/lib/rpcbind, and /var/spool/{PBS,torque} directories.

The CCM plugin is not included in a site's NHC configuration file. Administrators must add the test to their configuration in order to use it. See the cray_node_health_worksheet.yaml file for CCM plugin settings to copy into a site's NHC configuration file.

Individual tests may appear multiple times in the configuration, with different variable values. Every time a test is specified, NHC will run that test, so if the same line is specified five times, NHC will try to run that same test five times. This functionality is mainly used with the Plugin test, allowing the administrator to specify as many additional tests as have been written for the site, and with the Filesystem test, allowing the administrator to specify as many additional file systems as wanted. However, any test can be specified to run any number of times, and different parameters and test actions can be set for each instance. For example, this could be used to set up hard limits and soft limits for the Free Memory Check test. Two Free Memory Check tests could be specified in the configuration file: the first configured to only warn about small amounts of non-free memory, and the second configured to admindown a node that has large amounts of non-free memory. See the cray_node_health_worksheet.yaml file for configuration information.
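The soft-limit and hard-limit arrangement for Free Memory Check can be modeled as two independent checks with different thresholds and actions. The sketch below is illustrative only; evaluate_free_memory is a hypothetical name, and the threshold values in CHECKS are invented for the example rather than taken from any NHC default.

```python
def evaluate_free_memory(non_free_mb, checks):
    """Return the actions triggered by each configured check, in order.

    checks: list of (limit_mb, action) pairs, one per configured test instance.
    """
    return [action for limit_mb, action in checks if non_free_mb > limit_mb]


# Hypothetical thresholds in MB: a soft limit that only logs,
# and a hard limit that marks the node admindown.
CHECKS = [(64, "log"), (1024, "admindown")]
```

With these assumed values, a node holding 200 MB of non-free memory would trigger only the log action, while a node holding 2000 MB would trigger both.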