Configure the Node Health Checker (NHC)

NHC basic concepts.

For an overview of NHC (sometimes referred to as NodeKARE), see the intro_NHC(8) man page. For additional information about ALPS and how ALPS cooperates with NHC to perform application cleanup, see About Modules and Modulefiles.

The nodehealth Modulefile

To gain access to the NHC functions, the nodehealth module must be loaded. The admin-modules modulefile loads the nodehealth module; alternatively, load nodehealth directly, as follows:

% module load nodehealth

The Base-opts.default.local file includes the admin-modules modulefile. For additional information about the Base-opts.default.local file, see System-wide Default Modulefiles.

Introduction

NHC can be run under two basic circumstances:

  • immediately after applications within a reservation have terminated, and immediately after the reservation itself has terminated
  • when a node boots

To support running NHC at boot time and after applications and reservations complete, NHC uses two separate and independent configuration files, which enable NHC to be configured differently for these situations.

After Application and Reservation Termination

The configuration file that controls NHC behavior after a job has terminated is /etc/opt/cray/nodehealth/nodehealth.conf, located in the shared root. The CLE installation and upgrade processes automatically install this file and enable the NHC software, so there is no need to change any installation configuration parameters or issue any commands. The system administrator may, however, edit this configuration file to customize NHC behavior. Changes to the file take effect the next time NHC runs.

Compute Node Cleanup (CNCU) is called after an application or reservation terminates. Its objective is to efficiently return compute nodes to the pool of available nodes with as much free memory as they had when they were first booted. ALPS invokes NHC after every application completes and after every reservation completes. The NHC tests that run after an application exits make up an application set. The NHC tests that run after a reservation exits make up a reservation set. Because multiple test sets execute, CNCU requires more than one instance of NHC to run simultaneously. The advanced_features NHC control variable must be enabled to use CNCU. The default setting of advanced_features in the example NHC configuration file is on.
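Because CNCU depends on advanced_features, it can be useful to confirm the current setting before troubleshooting cleanup behavior. The following is a minimal sketch, not part of the NHC software: nhc_get_var is a hypothetical helper, and it assumes that control variables appear in the configuration file as "name: value" lines and that lines beginning with # are comments. Check that assumption against the actual file before relying on the result.

```shell
# Hypothetical helper: print the value of a control variable from an NHC
# configuration file. ASSUMPTION: variables are written as "name: value"
# and lines whose first non-blank character is '#' are comments.
nhc_get_var() {
    conf_file=$1
    var_name=$2
    awk -v name="$var_name" '
        /^[ \t]*#/ { next }                      # skip comment lines
        {
            line = $0
            sub(/^[ \t]+/, "", line)             # trim leading whitespace
            prefix = name ":"
            if (index(line, prefix) == 1) {      # line starts with "name:"
                value = substr(line, length(prefix) + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", value)
                print value
                exit                             # first match wins
            }
        }
    ' "$conf_file"
}
```

For example, nhc_get_var /etc/opt/cray/nodehealth/nodehealth.conf advanced_features prints the value of the first uncommented advanced_features line, or nothing if no such line exists.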

When a Node Boots

The configuration file that controls NHC behavior on boot is located on the compute node. To change this file, the administrator must instead change its template, which is located on the SMW in one of two locations:
  • On non-partitioned systems, the SMW template is /opt/xt-images/templates/default/etc/opt/cray/nodehealth/nodehealth.conf.
  • On partitioned systems, the SMW template is /opt/xt-images/templates/default-pN/etc/opt/cray/nodehealth/nodehealth.conf, where pN is the partition number.

In either case, after modifying the nodehealth.conf file, it is necessary to remake the boot image for the compute node and reboot the node with the new boot image in order for the changes to take effect.
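The two template locations differ only in the template directory name, which the following sketch makes explicit. nhc_template_path is a hypothetical helper introduced here for illustration; it only builds the path string from the rules stated above and does not check that the file exists.

```shell
# Hypothetical helper: print the SMW template path for the boot-time
# nodehealth.conf. Pass a partition name such as "p2" on partitioned
# systems; pass nothing on non-partitioned systems.
nhc_template_path() {
    partition=${1:-}
    if [ -n "$partition" ]; then
        template_dir="default-$partition"    # partitioned: default-pN
    else
        template_dir="default"               # non-partitioned
    fi
    printf '%s\n' "/opt/xt-images/templates/$template_dir/etc/opt/cray/nodehealth/nodehealth.conf"
}
```

Calling nhc_template_path p2 yields the default-p2 template path, and calling it with no argument yields the non-partitioned path. Editing the file at that path still requires remaking the boot image and rebooting the node, as described above.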

Each CLE release package also includes an example NHC configuration file, /opt/cray/nodehealth/default/etc/nodehealth.conf.example. The nodehealth.conf.example file is a copy of the /etc/opt/cray/nodehealth/nodehealth.conf file provided for an initial installation.

Important: The /etc/opt/cray/nodehealth/nodehealth.conf file is not overwritten during a CLE upgrade if the file already exists. This preserves the site-specific modifications previously made to the file. However, the system administrator should compare the /etc/opt/cray/nodehealth/nodehealth.conf file content with the /opt/cray/nodehealth/default/etc/nodehealth.conf.example file provided with each release to identify any changes, and then update the /etc/opt/cray/nodehealth/nodehealth.conf file accordingly.
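One way to perform this comparison is with the standard diff utility. The sketch below wraps diff in a hypothetical function, compare_nhc_conf, whose defaults are the two paths named above; it is an illustration of the comparison step, not a tool shipped with CLE.

```shell
# Hypothetical helper: compare the active NHC configuration file with the
# example file shipped in the release. Arguments are optional and default
# to the paths named in this section.
compare_nhc_conf() {
    active_file=${1:-/etc/opt/cray/nodehealth/nodehealth.conf}
    example_file=${2:-/opt/cray/nodehealth/default/etc/nodehealth.conf.example}
    status=0
    # In the unified diff, lines starting with "+" exist only in the
    # example file; lines starting with "-" exist only in the active file.
    diff -u "$active_file" "$example_file" || status=$?
    case $status in
        0) echo "Files are identical." ;;
        1) echo "Files differ; review the output above." ;;
        *) echo "diff could not compare the files." >&2 ;;
    esac
    return "$status"
}
```

The function returns the diff exit status (0 for identical files, 1 for differences, greater than 1 for errors), so it can also be used in a post-upgrade script. Site-specific changes show up as differences too, so the output needs to be reviewed rather than applied wholesale.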

If the /etc/opt/cray/nodehealth/nodehealth.conf file does not exist, then the /opt/cray/nodehealth/default/etc/nodehealth.conf.example file is copied to the /etc/opt/cray/nodehealth/nodehealth.conf file.

To use an alternate NHC configuration file with the xtcleanup_after script, specify the file with the -f option: xtcleanup_after -f alt_NHCconfigurationfile. For additional information, see the xtcleanup_after(8) man page.

NHC can also be configured to automatically dump, reboot, or dump and reboot nodes that have failed tests. This behavior is controlled by two things: the action variable, which is specified in the NHC configuration file for each NHC test, and the /etc/opt/cray-xt-dumpd/dumpd.conf configuration file. For additional information, see Using dumpd to Automatically Dump and Reboot Nodes, the dumpd(8) man page, and the dumpd.conf configuration file on the SMW.