Capture and Analyze System-level and Node-level Dumps
Multiple methods are available to manage dumps.
The xtdumpsys command collects and analyzes information from a Cray system that is failing or has failed, has crashed, or is hung. Analysis is performed on, for example, event log data, active heartbeat probing, voltages, temperatures, health faults, in-memory console buffers, and high-speed interconnection network errors. When failed components are found, detailed information is gathered from them.
Dump information about a working component
For this example, dump the entire system and collect detailed information from all blade controllers in chassis 0 of cabinet 0:
crayadm@smw> xtdumpsys --add c0-0c0s0
The xtdumpsys command is written in Python and supports plug-ins written in Python. A number of plug-in scripts are included in the software release. Call xtdumpsys --list to view a list of included plug-ins and their respective directories. The xtdumpsys command also now supports the use of configuration files to specify xtdumpsys presets, rather than entering them via the command line.
For more information, see the xtdumpsys(8) man page.