Monitor the Health of PCIe Channels

Use the xtpcimon command to monitor the health of PCIe channels.

Processors are connected to the high-speed interconnect network (HSN) ASIC through PCIe channels.

The xtpcimon command runs from the System Management Workstation (SMW) and is started during the boot process.

xtpcimon reports any PCIe-related errors to stdout unless its output is directed to a log file.

xtpcimon also displays CLE-originated GHAL-based Advanced Error Reporting (AER) errors for PCIe.

If the optional /opt/cray/hss/default/etc/xtpcimon.ini initialization file is present, the xtpcimon command uses the settings provided in the file.

For more information, see the xtpcimon(8) man page.

Report PCIe-related errors to stdout

crayadm@smw> xtpcimon
starting
----> connection to event router made
121017 04:57:01  #############  #################   ##################
121017 04:57:01  Node           Category            Description
121017 04:57:01  #############  #################   ##################
Received all responses to request to start monitoring
121017 04:58:01  c0-0c0s7a0n1   CorrectableMemErr   0:0:0 AER Correctable: Non-fatal \
                                                      error (mask bit: 1)
121008 05:42:00  c0-0c1s6a0n2   CorrectableMemErr   Link CRC error (cnt: 3)
121008 05:43:30  c0-0c1s6a0n2   Info                Correctable/CRC error
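
When xtpcimon output has been captured to a file, the error rows can be picked out for counting or filtering. The following Python sketch is not part of xtpcimon and rests on assumptions drawn only from the sample above: each error row starts with a six-digit date (taken here to be YYMMDD) and a time, followed by a component name beginning with cX-Y, a category, and a free-text description, and a trailing backslash means the row continues on the next line. The names PcieEvent, join_continuations, and parse_events are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple


class PcieEvent(NamedTuple):
    when: datetime
    node: str
    category: str
    description: str


# Assumed row layout: timestamp, component name (cX-Y...), category, description.
_EVENT_RE = re.compile(
    r"^(\d{6} \d{2}:\d{2}:\d{2})\s+(c\d+-\d+\S*)\s+(\S+)\s+(.*)$"
)


def join_continuations(lines: Iterable[str]) -> Iterator[str]:
    """Merge a line that ends in a backslash with the line that follows it."""
    pending = ""
    for raw in lines:
        line = raw.rstrip("\n")
        if line.rstrip().endswith("\\"):
            # Drop the backslash and keep a single space before the remainder.
            pending += line.rstrip()[:-1].rstrip() + " "
            continue
        yield pending + (line.strip() if pending else line)
        pending = ""
    if pending:
        yield pending.rstrip()


def parse_events(lines: Iterable[str]) -> Iterator[PcieEvent]:
    """Yield one PcieEvent per error row; skip banners, headers, and status lines."""
    for line in join_continuations(lines):
        match = _EVENT_RE.match(line)
        if match is None:
            continue
        stamp, node, category, description = match.groups()
        # Assumption: the date field is YYMMDD.
        when = datetime.strptime(stamp, "%y%m%d %H:%M:%S")
        yield PcieEvent(when, node, category, description.strip())


if __name__ == "__main__":
    sample = r"""starting
121017 04:57:01  Node           Category            Description
121017 04:58:01  c0-0c0s7a0n1   CorrectableMemErr   0:0:0 AER Correctable: Non-fatal \
                                                      error (mask bit: 1)
121008 05:42:00  c0-0c1s6a0n2   CorrectableMemErr   Link CRC error (cnt: 3)
121008 05:43:30  c0-0c1s6a0n2   Info                Correctable/CRC error
"""
    events = list(parse_events(sample.splitlines()))
    # Tally events per node to see which components report most often.
    for node, count in Counter(e.node for e in events).most_common():
        print(node, count)
```

The pattern matches only rows whose second field looks like a component name, so the column headers and the "#" separator rows are skipped even though they also begin with a timestamp. If the actual output on a given system differs from the sample, the regular expression would need to be adjusted.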

For information about using system event rules with the Cray Simple Event Correlator (SEC) and the related check_xt utility, see the XC Series SEC and check_xt Guide (S-2542).