Critical Events That Cause SMW HA Failover

The following critical events cause a failover from the active SMW to the passive SMW:

  • Hardware fault on the active SMW.
  • Lost heartbeat between the two SMWs.
  • Kernel fault (panic) on the active SMW.
  • Failed resource (an HSS daemon or cluster service). If a resource stops, the cluster manager automatically restarts it and increments its failcount by 1. When the failcount exceeds the migration threshold (by default, 1,000,000), a failover occurs.
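
The restart-and-count behavior for a failed resource can be pictured with a small model. This is a hypothetical sketch of the logic, not the cluster manager's actual implementation; only the default threshold value of 1,000,000 comes from the text above.

```python
# Hypothetical model of the cluster manager's failcount handling.
# MIGRATION_THRESHOLD matches the stated default of 1,000,000.
MIGRATION_THRESHOLD = 1_000_000

class Resource:
    """Minimal stand-in for a cluster-managed resource (e.g., an HSS daemon)."""

    def __init__(self, name: str):
        self.name = name
        self.failcount = 0

    def record_failure(self) -> bool:
        """The resource stopped: restart it and increment its failcount by 1.

        Returns True when the failcount now exceeds the migration
        threshold, i.e., when a failover would be triggered.
        """
        self.failcount += 1
        return self.failcount > MIGRATION_THRESHOLD

daemon = Resource("hss-daemon")
assert daemon.record_failure() is False   # restarted in place, no failover
daemon.failcount = MIGRATION_THRESHOLD   # simulate accumulated failures
assert daemon.record_failure() is True    # threshold exceeded: failover
```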

The failover type (STONITH or non-STONITH) depends on whether the newly active SMW can determine the health of the failing SMW. A STONITH failover occurs only if there is no other way for the new SMW to ensure the integrity of the cluster.
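
The decision between the two failover types reduces to a single question: can the new SMW verify the failing SMW's health? The following one-line predicate is a hypothetical illustration of that rule; the names and the simplification are assumptions, not the cluster manager's real logic.

```python
def choose_failover_type(can_verify_peer_health: bool) -> str:
    """Hypothetical sketch: STONITH is used only when there is no other
    way for the new SMW to ensure the integrity of the cluster."""
    return "non-STONITH" if can_verify_peer_health else "STONITH"

# If the failing SMW's health can be determined, a non-STONITH
# failover suffices; otherwise the peer must be powered off.
assert choose_failover_type(True) == "non-STONITH"
assert choose_failover_type(False) == "STONITH"
```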

  • In the case of a STONITH failover, the original SMW is powered off (via the STONITH capability) if it is not already off. This guarantees that file synchronization is stopped and that the failed SMW no longer holds any cluster-managed resources, so the new SMW has exclusive access to those resources.

  • In the case of a non-STONITH failover, the original SMW remains powered on. In addition, the following actions occur on the original SMW:
    • HSS daemons are stopped on the original SMW.
    • Lightweight Log Manager (LLM) logging to shared disk is stopped.
    • File synchronization (csync2) between SMWs is stopped.
    • The shared file systems on the boot RAID are unmounted on the original SMW.
    • Network connections using the eth0, eth1, eth2, eth3, and eth4 virtual IP addresses are dropped, and those interfaces then accept connections only on their actual IP addresses.

For both types of failover, the following actions then occur on the new SMW:

  • The eth0, eth1, eth2, eth3, eth4, and eth5 (optional) interfaces begin accepting connections using the virtual IP addresses in addition to their actual IP addresses.
  • The shared file systems on the boot RAID are mounted on the new SMW.
  • File synchronization (csync2) between SMWs usually resumes (depending on the reason for failover).
  • LLM logging to the shared disk resumes.
  • The HSS database (MySQL) is started on the new SMW.
  • HSS daemons are started on the new SMW (including, if necessary, any xtbootsys-initiated daemons).
  • Failcounts and failed actions are written to the log file /var/log/smwha.log on the newly active SMW.
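
The takeover steps above can be sketched as an ordered sequence. The step names below paraphrase the list and are hypothetical labels, not actual cluster resource or resource-agent names; only /var/log/smwha.log comes from the text.

```python
# Hypothetical ordered model of takeover on the newly active SMW.
# Step labels paraphrase the documented actions; they are not real
# resource names.
TAKEOVER_STEPS = (
    "accept connections on virtual IP addresses (eth0-eth5)",
    "mount shared boot-RAID file systems",
    "resume csync2 file synchronization (when possible)",
    "resume LLM logging to shared disk",
    "start the HSS database (MySQL)",
    "start HSS daemons",
    "write failcounts and failed actions to /var/log/smwha.log",
)

def take_over(perform_step=print) -> int:
    """Run each takeover step in order; returns the number of steps."""
    for step in TAKEOVER_STEPS:
        perform_step(step)  # in practice, a cluster-managed action
    return len(TAKEOVER_STEPS)
```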

Important: When the SMW HA system is running, do not remove power to the active SMW and its iDRAC to force a failover. Doing so puts the HA cluster in a frozen state in which no resources are online, and the cluster cannot recover from this state automatically.