NHC Messages
NHC messages are written on the SMW to /var/opt/cray/log/sessionid/nhc-YYYYMMDD. Each message contains the tag '<node_health:M.m>', where M is the major and m is the minor NHC revision number. All NHC messages are also visible in the console file.
NHC prints a summary message per node at the end of Normal Mode and Suspect Mode when at least one test has failed on a node. For example:
<node_health:3.1> APID:100 (xtnhc) FAILURES: The following tests have failed in normal mode:
<node_health:3.1> APID:100 (xtnhc) FAILURES: (Admindown) Apinit_Ping
<node_health:3.1> APID:100 (xtnhc) FAILURES: (Admindown) Plugin /example/plugin
<node_health:3.1> APID:100 (xtnhc) FAILURES: (Log Only ) Filesystem_Test on /mydir
<node_health:3.1> APID:100 (xtnhc) FAILURES: (Admindown) Free_Memory_Check
<node_health:3.1> APID:100 (xtnhc) FAILURES: End of list of 5 failed test(s)
The xtcheckhealth error and warning messages include node IDs and application IDs and are written to the console file on the SMW; for example:
[2010-04-05 23:07:09][c1-0c2s0n0]<node_health:3.0> APID:2773749
(check_apid) WARNING: Failure: File /dev/cpuset/2773749/tasks exists and is not empty. \
The following processes are running under expired APID
2773749:
[2010-04-05 23:07:09][c1-0c2s0n1]<node_health:3.0> APID:2773749
(check_apid) WARNING: Pid: 300 Name: (marys_program) State: D
The xtcleanup_after script writes its normal launch information to the /var/log/xtcheckhealth_log file, which resides on the login nodes. The xtcleanup_after launch information includes the time that xtcleanup_after was launched and the time xtcleanup_after called xtcheckhealth.
The xtcleanup_after script writes error output (launch failure information) to the /var/log/xtcheckhealth_log file, to the console file on the SMW, and to the syslog.
Example xtcleanup_after output follows:
Thu Apr 22 17:48:18 CDT 2010 <node_health> (xtcleanup_after)
/opt/cray/nodehealth/3.0-1.0000.20840.30.8.ss/bin/xtcheckhealth -a 10515
-e 1 /tmp/apsysLVNqO9 /etc/opt/cray/nodehealth/nodehealth.conf
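The NHC message layouts shown in this section are regular enough to be picked apart with a short script. The following is a minimal Python sketch, not part of NHC or any Cray tool. The field layout is an assumption inferred only from the sample lines above, and the names parse_nhc_line and failed_tests are hypothetical. It also assumes that each message has been joined onto a single line (the console examples above are shown wrapped).

```python
import re

# Assumed layout, inferred only from the sample lines in this section:
#   [timestamp][node]<node_health:M.m> APID:n (source) text
# The bracketed timestamp and node ID appear only in console-file lines,
# so that prefix is optional here.
NHC_LINE = re.compile(
    r"^(?:\[(?P<timestamp>[^\]]+)\]\[(?P<node>[^\]]+)\])?"
    r"<node_health:(?P<major>\d+)\.(?P<minor>\d+)>\s+"
    r"APID:(?P<apid>\d+)\s+"
    r"\((?P<source>[^)]+)\)\s+"
    r"(?P<text>.*)$"
)

# Summary entries look like "FAILURES: (Admindown) Apinit_Ping".
FAILURE_ENTRY = re.compile(
    r"^FAILURES:\s+\((?P<action>[^)]*)\)\s+(?P<test>.+)$"
)


def parse_nhc_line(line):
    """Return a dict of fields for one NHC message, or None if no match."""
    match = NHC_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields; timestamp and node stay None when absent.
    for key in ("major", "minor", "apid"):
        fields[key] = int(fields[key])
    return fields


def failed_tests(lines):
    """Collect (action, test) pairs from FAILURES summary lines."""
    results = []
    for line in lines:
        fields = parse_nhc_line(line)
        if fields is None:
            continue
        entry = FAILURE_ENTRY.match(fields["text"])
        if entry:
            results.append(
                (entry.group("action").strip(), entry.group("test").strip())
            )
    return results
```

Passing the lines of the summary example to failed_tests yields one (action, test) pair per failed test. The header and "End of list" lines are skipped because they carry no parenthesized action after FAILURES:.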