Configure SDB Node Failover
Configure a tier1 service node to be a backup SDB node for SDB node failover.
- Both the SDB node and the backup SDB node must have a Fibre Channel or SAS connection to the boot RAID.
- Both the SDB node and the backup SDB node must have an Ethernet connection to the network shared with the SMW in order to PXE boot and transfer data as a tier1 node.
- The primary and backup nodes must not be on the same blade.
- The SDB and boot nodes must not be on the same blade.
The system must be shut down before invoking the xtcli halt command, which is used in this procedure.
If a secondary (backup) service database (SDB) node is configured, SDB node failover will occur automatically if the primary SDB node fails. This procedure configures the system for SDB node failover. If SDB node failover was configured during an SMW/CLE software installation or update, this procedure is not needed.
For the examples in this procedure, the cname of the primary SDB node is c0-0c0s3n1, and the cname of the backup SDB node is c0-4c0s3n1.
- Configure cray_multipath for the backup node, if cray_multipath is enabled.
cray_multipath is in the global config set and may be inherited by the CLE config set. If the global cray_multipath is enabled and the CLE cray_multipath is set to inherit from the global config set, then make the changes in the global cray_multipath service. If the CLE cray_multipath service is enabled and not set to inherit from the global config set, then make the changes in the CLE cray_multipath service.
Enter the list of multipath nodes: change cray_multipath.settings.multipath.data.node_list so that it includes both the primary SDB node and the backup SDB node.
This example shows a list of five nodes: an SMW with host ID 1eac4e0c, a primary boot node with cname c0-0c0s4n1, a backup boot node with cname c0-2c0s4n1, a primary SDB node with cname c0-0c0s3n1, and a backup SDB node with cname c0-4c0s3n1.
cray_multipath.settings.multipath.data.node_list:
- 1eac4e0c
- c0-0c0s4n1
- c0-2c0s4n1
- c0-0c0s3n1
- c0-4c0s3n1
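If preferred, the node list can be edited in the interactive configurator. This is a hedged sketch: the config set names (global and p0) follow this procedure's examples, and which command to run depends on whether the CLE cray_multipath service inherits from the global config set, as described above.
smw# cfgset update --service cray_multipath --mode interactive global
smw# cfgset update --service cray_multipath --mode interactive p0
Run only the command that matches where the change belongs (global or CLE config set).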
- Configure cray_node_groups to add the backup SDB node.
In the CLE config set, the cray_node_groups service should have an sdb_nodes node group with the primary SDB node (c0-0c0s3n1) and the backup SDB node (c0-4c0s3n1) as members.
cray_node_groups.settings.groups.data.group_name.sdb_nodes: null
cray_node_groups.settings.groups.data.sdb_nodes.description: Default node group which contains the primary and failover (if applicable) SDB nodes associated with the current partition.
cray_node_groups.settings.groups.data.sdb_nodes.members:
- c0-0c0s3n1
- c0-4c0s3n1
- Configure cray_persistent_data to add the sdb_nodes node group.
Ensure that this setting includes the boot_nodes node group and the sdb_nodes node group.
cray_persistent_data.settings.mounts.data./var/lib/nfs.client_groups:
- boot_nodes
- sdb_nodes
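Both services can also be opened in the interactive configurator to make these changes. This is a hedged example; it assumes the CLE config set is named p0, as in the rest of this procedure.
smw# cfgset update --service cray_node_groups --mode interactive p0
smw# cfgset update --service cray_persistent_data --mode interactive p0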
- Configure cray_scalable_services to add the sdb_nodes node group.
Ensure that this setting includes the boot_nodes node group and the sdb_nodes node group.
cray_scalable_services.settings.scalable_service.data.tier1_groups:
- boot_nodes
- sdb_nodes
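To spot-check the result without opening the configurator, the service YAML in the config set can be inspected directly. This is a hedged sketch: the path assumes the default config set location under /var/opt/cray/imps/config/sets and a CLE config set named p0; adjust for the site.
smw# grep -A 3 tier1_groups /var/opt/cray/imps/config/sets/p0/config/cray_scalable_services_config.yaml
The returned setting should list both boot_nodes and sdb_nodes.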
- Configure cray_net to add the backup SDB node.
These settings configure a host as the backup SDB node (backup_sdbnode) when using SDB node failover. Ensure that the standby_node variable is set to true.
Note: The host names for the primary and backup SDB nodes should both be set to sdb. The aliases can be different so that the /etc/hosts entry for the cname has the host name alias.
cray_net.settings.hosts.data.common_name.backup_sdbnode: null
cray_net.settings.hosts.data.backup_sdbnode.description: backup SDB node for the system
cray_net.settings.hosts.data.backup_sdbnode.aliases:
- cray-sdb2
cray_net.settings.hosts.data.backup_sdbnode.hostid: c0-4c0s3n1
cray_net.settings.hosts.data.backup_sdbnode.host_type: admin
cray_net.settings.hosts.data.backup_sdbnode.hostname: sdb
cray_net.settings.hosts.data.backup_sdbnode.standby_node: true
cray_net.settings.hosts.data.backup_sdbnode.interfaces.common_name.hsn_boot_alias: null
cray_net.settings.hosts.data.backup_sdbnode.interfaces.hsn_boot_alias.name: ipogif0:1
cray_net.settings.hosts.data.backup_sdbnode.interfaces.hsn_boot_alias.description: Well known address used for SDB node services.
cray_net.settings.hosts.data.backup_sdbnode.interfaces.hsn_boot_alias.vlan_id: ''
cray_net.settings.hosts.data.backup_sdbnode.interfaces.hsn_boot_alias.vlan_etherdevice: ''
cray_net.settings.hosts.data.backup_sdbnode.interfaces.hsn_boot_alias.bonding_slaves: []
cray_net.settings.hosts.data.backup_sdbnode.interfaces.hsn_boot_alias.bonding_module_opts: mode=active-backup miimon=100
cray_net.settings.hosts.data.backup_sdbnode.interfaces.hsn_boot_alias.aliases: []
cray_net.settings.hosts.data.backup_sdbnode.interfaces.hsn_boot_alias.network: hsn
cray_net.settings.hosts.data.backup_sdbnode.interfaces.hsn_boot_alias.ipv4_address: 10.131.255.253
cray_net.settings.hosts.data.backup_sdbnode.interfaces.hsn_boot_alias.ipv4_secondary_addresses: []
cray_net.settings.hosts.data.backup_sdbnode.interfaces.hsn_boot_alias.mac: ''
cray_net.settings.hosts.data.backup_sdbnode.interfaces.hsn_boot_alias.startmode: auto
cray_net.settings.hosts.data.backup_sdbnode.interfaces.hsn_boot_alias.bootproto: static
cray_net.settings.hosts.data.backup_sdbnode.interfaces.hsn_boot_alias.mtu: ''
cray_net.settings.hosts.data.backup_sdbnode.interfaces.hsn_boot_alias.extra_attributes: []
cray_net.settings.hosts.data.backup_sdbnode.interfaces.hsn_boot_alias.module: ''
cray_net.settings.hosts.data.backup_sdbnode.interfaces.hsn_boot_alias.params: ''
#cray_net.settings.hosts.data.backup_sdbnode.interfaces.hsn_boot_alias.unmanaged_interface: false
cray_net.settings.hosts.data.backup_sdbnode.interfaces.common_name.primary_ethernet: null
cray_net.settings.hosts.data.backup_sdbnode.interfaces.primary_ethernet.name: eth0
cray_net.settings.hosts.data.backup_sdbnode.interfaces.primary_ethernet.description: Ethernet connecting SDB node to the SMW.
cray_net.settings.hosts.data.backup_sdbnode.interfaces.primary_ethernet.vlan_id: ''
cray_net.settings.hosts.data.backup_sdbnode.interfaces.primary_ethernet.vlan_etherdevice: ''
cray_net.settings.hosts.data.backup_sdbnode.interfaces.primary_ethernet.bonding_slaves: []
cray_net.settings.hosts.data.backup_sdbnode.interfaces.primary_ethernet.bonding_module_opts: mode=active-backup miimon=100
cray_net.settings.hosts.data.backup_sdbnode.interfaces.primary_ethernet.aliases: []
cray_net.settings.hosts.data.backup_sdbnode.interfaces.primary_ethernet.network: admin
cray_net.settings.hosts.data.backup_sdbnode.interfaces.primary_ethernet.ipv4_address: 10.3.1.253
cray_net.settings.hosts.data.backup_sdbnode.interfaces.primary_ethernet.ipv4_secondary_addresses: []
cray_net.settings.hosts.data.backup_sdbnode.interfaces.primary_ethernet.mac: ''
cray_net.settings.hosts.data.backup_sdbnode.interfaces.primary_ethernet.startmode: auto
cray_net.settings.hosts.data.backup_sdbnode.interfaces.primary_ethernet.bootproto: static
cray_net.settings.hosts.data.backup_sdbnode.interfaces.primary_ethernet.mtu: ''
cray_net.settings.hosts.data.backup_sdbnode.interfaces.primary_ethernet.extra_attributes: []
cray_net.settings.hosts.data.backup_sdbnode.interfaces.primary_ethernet.module: ''
cray_net.settings.hosts.data.backup_sdbnode.interfaces.primary_ethernet.params: ''
#cray_net.settings.hosts.data.backup_sdbnode.interfaces.primary_ethernet.unmanaged_interface: false
- Update the config set to regenerate the hosts file so that it contains the appropriate backup node settings.
smw# cfgset update p0
smw# cfgset validate p0
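As an optional spot-check after the update, confirm that the backup_sdbnode host is present in the cray_net data in the config set. This is a hedged example; the path assumes the default config set location under /var/opt/cray/imps/config/sets and a CLE config set named p0.
smw# grep backup_sdbnode /var/opt/cray/imps/config/sets/p0/config/cray_net_config.yaml
smw# grep standby_node /var/opt/cray/imps/config/sets/p0/config/cray_net_config.yaml
Both searches should return matches, and standby_node should be true for the backup SDB node.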
- Halt the primary and backup SDB nodes using their cnames.
crayadm@smw> xtcli halt c0-0c0s3n1,c0-4c0s3n1
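Before proceeding, it can be useful to confirm that both nodes show a halted state. This hedged example queries the same cnames with xtcli status:
crayadm@smw> xtcli status c0-0c0s3n1,c0-4c0s3n1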
- Set the primary and backup SDB nodes using the xtcli command. Use the -d argument for an SDB node.
crayadm@smw> xtcli part_cfg update p0 -d c0-0c0s3n1,c0-4c0s3n1
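To verify the result, the partition configuration can be displayed. This is a hedged example; it assumes the partition is p0, as elsewhere in this procedure.
crayadm@smw> xtcli part_cfg show p0
The output should list both c0-0c0s3n1 and c0-4c0s3n1 as SDB nodes for the partition.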
- Add SDB node failover to the boot automation file, auto.hostname.start.
When SDB node failover is used, add settings to the boot automation file to ensure that STONITH is enabled on the blades that contain the primary and backup SDB nodes. The STONITH setting does not survive a power cycle or any other action that causes the bcsysd daemon to restart. Adding these lines to the boot automation file maintains that setting.
Set STONITH for the blades that contain the primary and backup SDB nodes. In the example, the primary SDB node is c0-0c0s3n1, so its blade is c0-0c0s3; the backup SDB node is c0-4c0s3n1, so its blade is c0-4c0s3. Add these lines before the line for booting the SDB node.
# Set STONITH for primary SDB node
lappend actions {crms_exec "xtdaemonconfig c0-0c0s3 stonith=true"}
# Set STONITH for the backup SDB node
lappend actions {crms_exec "xtdaemonconfig c0-4c0s3 stonith=true"}
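After the system is booted (see the final step), the STONITH setting on each blade can be checked by running xtdaemonconfig with just the blade cname to display its daemon configuration. This is a hedged example; verify the query form against the xtdaemonconfig man page for the installed release.
crayadm@smw> xtdaemonconfig c0-0c0s3 | grep -i stonith
crayadm@smw> xtdaemonconfig c0-4c0s3 | grep -i stonith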
- Enable the xtfailover_halt command in the auto.hostname.stop file.
Uncomment the second of these lines in auto.hostname.stop. This file in /opt/cray/hss/default/etc is normally copied from auto.xtshutdown to auto.hostname.stop during a fresh install. The xtfailover_halt command ensures that the xtbootsys shutdown process sends a STOP NMI to the failover nodes.
# Enable the following line if boot or sdb failover is enabled:
lappend actions { crms_exec \
"/opt/cray/hss/default/bin/xtfailover_halt --partition $data(partition,given) --shutdown" }
If the above lines are not present in the site auto.hostname.stop automation file for shutting down CLE, add them.
- Assign the boot image to the backup SDB node.
Check which NIMS group and boot image are being used for the primary SDB node and the backup SDB node. (The cnode and cmap commands replace the nimscli command, which was deprecated in CLE 6.0.UP04 and removed in CLE 6.0.UP05. Be sure to change any scripts that reference nimscli.)
smw# cnode list c0-0c0s3n1
smw# cnode list c0-4c0s3n1
If the backup SDB node does not have the same NIMS group and boot image assigned, update the backup SDB node.
Remove the old NIMS group from the backup SDB node.
smw# cnode update -G oldNIMSgroup c0-4c0s3n1
Assign the primary SDB node's NIMS group and boot image to the backup SDB node.
smw# cnode update -g primaryNIMSgroup \
-i /path/to/primary/bootimage c0-4c0s3n1
Confirm the change.
smw# cnode list c0-4c0s3n1
- Boot the system.
crayadm@smw> xtbootsys -a auto.hostname.start
Trouble? If a node that is on a blade with STONITH enabled fails to boot, try adjusting the heartbeat timeout setting for that node (see the xtdaemonconfig man page). For all other problems booting CLE, see the XC™ Series Boot Troubleshooting Guide (S-2565).