Restore Normal Operations After SMW Failover

While failover is automatic, adding the failed SMW back into the cluster requires manual intervention: identify the reason for the failover, take corrective action if needed, and return the failed SMW to an online state. Another failover (that is, a "failback" to the originally active SMW) is not possible until the failed SMW returns to online status and its failcounts are cleared, making it eligible to run all cluster resources.

  1. Identify and fix the problems that caused the failover (such as a hardware fault, kernel panic, or HSS daemon issues). Use the following methods to help diagnose problems:
    1. Examine the log file /var/log/smwha.log on the new active SMW. For more information, see Examine the SMW HA Log File to Determine SMW Failover Cause.
    2. Execute the show_failcounts command and note any resources with non-zero failcounts.
    3. From the active SMW, examine /var/opt/cray/log/smwmessages-yyyymmdd for relevant messages.
    4. Examine the failing SMW for additional clues.
    5. For a non-STONITH failover: In most cases, the failing SMW will still be running; additional clues may be available in dmesg output or via other commands on that SMW.
    6. For a STONITH failover: The failing SMW will be powered off. Before powering it back on, place the SMW into standby mode so that it does not automatically attempt to rejoin the cluster at startup before the node has been verified healthy. For more information, see Restart Stopped Resources.
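
    The diagnostic checks above can be run from the active SMW roughly as follows (a sketch only; smw1 and smw2 are placeholder host names, the grep pattern is an example filter, and the log file date must match the failover date):

      smw1# tail -100 /var/log/smwha.log
      smw1# show_failcounts
      smw1# grep -i fail /var/opt/cray/log/smwmessages-yyyymmdd

    For a STONITH failover, place the failed SMW into standby before powering it back on:

      smw1# crm node standby smw2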
  2. Log on to the failing SMW (either from the console or remotely, using its actual host name rather than the cluster's virtual host name). Identify the reason for the failure and take corrective action as needed. This might include administrative actions, such as freeing space on a file system that has filled up, or hardware actions, such as replacing a failing component.
  3. After the SMW is ready to rejoin the cluster, run the clean_resources command as described in Restart Stopped Resources. This command also resets all failcounts to zero.

    After running clean_resources, wait several minutes for cluster activity to settle. You can check cluster status with the crm_mon -r1 command.
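
    As a sketch, the sequence on the active SMW looks like this (the wait time between the two commands depends on the cluster and may be several minutes):

      smw1# clean_resources
      (wait for cluster activity to settle)
      smw1# crm_mon -r1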

  4. Return the SMW to online status as the passive SMW.
    Replace smw2 with the host name of the failed SMW.
    smw1# crm node online smw2
  5. If the boot node mounts any SMW directories, and passwordless access between the boot node and the SMW is not configured, the boot node's mount points for those SMW directories become stale after the failover. To refresh the mount points:
    1. Log into the boot node.
    2. Unmount then remount the SMW directories.
    3. Restart bnd.
      boot# /etc/init.d/bnd restart
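
    For example, assuming a hypothetical mount point /home for an SMW directory (the actual mount points depend on the site configuration), the refresh might look like:

      boot# umount /home
      boot# mount /home
      boot# /etc/init.d/bnd restart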