Capture and Analyze System-level and Node-level Dumps

Multiple methods are available to manage dumps.

The xtdumpsys command collects and analyzes information from a Cray system that is failing, has failed or crashed, or is hung. The analysis covers sources such as event log data, active heartbeat probing, voltages, temperatures, health faults, in-memory console buffers, and high-speed interconnect network errors. When xtdumpsys finds failed components, it gathers detailed information from them.

To collect similar information for components that have not failed, invoke the xtdumpsys command with the --add option and name the components from which to collect data. The HSS xtdumpsys command saves dump information in /var/opt/cray/dump/timestamp by default.
Note: When using the --add option to add multiple components, separate components with spaces, not commas.
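A component list that comes from another tool is often comma-separated and must be converted before it is passed to --add. The following shell sketch shows one way to do that. The component names and variable names are illustrative assumptions, and the sketch assumes the space-separated names follow --add directly on the command line (see xtdumpsys(8) for the exact syntax). The command is only printed, not run, so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical comma-separated component list (names are examples only).
component_csv="c0-0c0s0,c0-0c0s1,c0-0c0s2"

# Replace every comma with a space, as the --add option expects.
components=$(printf '%s' "$component_csv" | tr ',' ' ')

# Build the command line and print it for review rather than running it.
cmd="xtdumpsys --add $components"
echo "$cmd"
```

Printing the command instead of running it keeps the sketch safe to try on any host; on the SMW, the printed line can be pasted at the crayadm prompt.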

Dump information about a working component

For this example, dump the entire system and collect detailed information from the blade controller in slot 0 of chassis 0 in cabinet c0-0:

crayadm@smw> xtdumpsys --add c0-0c0s0 
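Because dumps are saved under /var/opt/cray/dump by default, the newest dump can be located by modification time once the command finishes. This is a minimal sketch, not part of xtdumpsys itself: the DUMP_ROOT variable is a hypothetical override added for illustration, and the sketch assumes each dump is stored in its own subdirectory of that location.

```shell
#!/bin/sh
# Default dump location named in this section; DUMP_ROOT is a hypothetical
# override so the sketch can be pointed at a different directory.
DUMP_ROOT="${DUMP_ROOT:-/var/opt/cray/dump}"

# List subdirectories newest first by modification time and keep the first one.
latest=$(ls -1td "$DUMP_ROOT"/*/ 2>/dev/null | head -n 1)

if [ -n "$latest" ]; then
    echo "Most recent dump directory: $latest"
else
    echo "No dump directories found under $DUMP_ROOT"
fi
```

Sorting by modification time avoids any assumption about how the dump directory names are formatted.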

The xtdumpsys command is written in Python and supports plug-ins written in Python. A number of plug-in scripts are included in the software release. Run xtdumpsys --list to view the included plug-ins and their respective directories. The xtdumpsys command also supports configuration files that specify xtdumpsys presets, so the presets do not have to be entered on the command line.

For more information, see the xtdumpsys(8) man page.