SMW HA Overview

The Cray System Management Workstation (SMW) High Availability system supports SMW failover. An SMW High Availability (HA) system is a Cray XC system with two second-generation high-end SMWs (also called rack-mount SMWs) that run the SUSE Linux Enterprise High Availability (SLEHA) Extension and the Cray SMW High Availability Extension (SMWHA) release package. The two SMWs must be installed and configured as specified in the install guide.

The SMW failover feature improves the reliability, availability, and serviceability (RAS) of the SMW, allowing the mainframe to operate correctly and at full speed. It adds SMW failover, fencing, health monitoring, and failover notification. Administrators can be notified of SMW software or hardware problems in real time and can react either by manually shutting down nodes or by letting the software manage the problems. In the event of a hardware failure or an rsms (HSS) daemon failure, the software fails over to the passive SMW node, which becomes the active node. Once repaired, the failed node can be returned to the configuration as the passive node.
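
The active/passive role swap described above can be illustrated with a small Python model. This is only a sketch for reasoning about the roles, not part of the SMWHA software: the class name SmwPair, its methods, and the host names smw1 and smw2 are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class SmwPair:
    """Toy model of a two-SMW active/passive pair (all names hypothetical)."""
    active: str = "smw1"
    passive: Optional[str] = "smw2"
    failed: Set[str] = field(default_factory=set)

    def fail_active(self) -> None:
        """Model a hardware or HSS daemon failure on the active SMW."""
        if self.passive is None:
            raise RuntimeError("no passive SMW is available to take over")
        self.failed.add(self.active)
        # The passive SMW becomes the active node; no passive node remains.
        self.active, self.passive = self.passive, None

    def return_repaired(self, name: str) -> None:
        """Return a repaired SMW to the configuration as the passive node."""
        if name not in self.failed:
            raise ValueError(f"{name} is not marked as failed")
        self.failed.remove(name)
        self.passive = name


if __name__ == "__main__":
    pair = SmwPair()
    pair.fail_active()            # smw1 fails, smw2 takes over
    pair.return_repaired("smw1")  # smw1 rejoins as the passive node
    print(pair.active, pair.passive)
```

Note that in this model the repaired SMW does not reclaim the active role; it rejoins as the passive node, matching the behavior described above.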

The SUSE Pacemaker Cluster Resource Manager (CRM) provides administration and monitoring of the SMW HA system with a command-line interface (crm). With this interface and associated commands, the SMW administrator can display cluster status, monitor the HSS daemons (configured as cluster resources), configure automatic failover notification by email, and customize the SMW failover thresholds for each resource. The Cray SMWHA software includes scripts and commands to simplify the CRM interface and help with specific tasks on an SMW HA system.
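
As an illustration of the crm interface, the following Python sketch builds two generic crm command lines: one to display cluster status and one to set a resource's failover threshold through the Pacemaker migration-threshold meta attribute. It assumes the standard crmsh syntax; the helper names and the resource name used in the example are hypothetical, and the Cray SMWHA scripts may be the preferred way to perform these tasks on an SMW HA system.

```python
import subprocess
from typing import List


def status_command() -> List[str]:
    """Command line that displays overall cluster status."""
    return ["crm", "status"]


def migration_threshold_command(resource: str, failures: int) -> List[str]:
    """Command line that sets how many failures a resource tolerates
    on a node before Pacemaker moves it away from that node."""
    if failures < 0:
        raise ValueError("failures must be zero or a positive integer")
    return ["crm", "resource", "meta", resource,
            "set", "migration-threshold", str(failures)]


def run(cmd: List[str]) -> str:
    """Run a command on the SMW and return its standard output.
    Assumes crm is installed and the caller has sufficient privileges."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # "example-rsc" is a placeholder, not a real SMW HA resource name.
    print(" ".join(status_command()))
    print(" ".join(migration_threshold_command("example-rsc", 3)))
```

Building each command as a list of arguments, rather than a single shell string, avoids quoting problems when the commands are passed to subprocess.run.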

The Pacemaker Cluster Resource Manager uses the term "node" to refer to a host in a CRM cluster. On an SMW HA system, a CRM node is an SMW, not a Cray XC compute or service node.