Disable Boot Node Failover
While the CLE system is shut down, disable boot node failover by changing settings in several config services (cray_multipath, cray_node_groups, and cray_net), halting both the primary and backup boot nodes, and then updating the default boot configuration, STONITH settings, and NIMS information.
The CLE system must be shut down before invoking xtcli halt, which is used in this procedure.
For the examples in this procedure, the cname of the primary boot node is c0-0c0s4n1, and the cname of the backup boot node is c0-2c0s4n1.
- Save a copy of the config set before making modifications. Use a meaningful name for the archival copy of the config set.
smw# cfgset create --clone p0 p0_before_disabling_failover
- Determine the active NIMS map and save a copy of it before making modifications. Use a meaningful name for the archival copy of the NIMS map. The second command in this example uses the name of the NIMS map output from the grep command.
smw# cmap list | grep -i true
p0
smw# cmap create --clone p0 p0_before_disabling_failover
Trouble? The cmap command will first verify that the CLE config set associated with the NIMS map exists. If it does not exist, the command will fail with an error message to that effect.
- If the config set is expected to be missing (for example, during an installation when the CLE config set has not yet been created), then repeat the cmap create command with the --no-verify flag.
- If the config set is NOT expected to be missing, then create/locate the missing config set, set it as the default config set for the NIMS map, and repeat the cmap create command.
- Update cray_multipath to remove the backup boot node from the list.
smw# cfgset modify --remove c0-2c0s4n1 \
cray_multipath.settings.multipath.data.node_list
- Update cray_node_groups to remove the backup boot node.
smw# cfgset modify --remove c0-2c0s4n1 \
cray_node_groups.settings.groups.data.boot_nodes.members
- Update cray_net to remove the host entry for the backup boot node. It may have another key, but typically it is called backup_bootnode.
- List the defined hosts to determine the correct key to use.
smw# cfgset get cray_net.settings.hosts
- Remove the backup boot node entry.
smw# cfgset modify --delete cray_net.settings.hosts.data.backup_bootnode
- Update and validate the global and CLE config sets. This will regenerate the CLE /etc/hosts file so that it contains none of the backup node settings.
smw# cfgset update -m prepare global
smw# cfgset validate global
smw# cfgset update -m prepare p0
smw# cfgset validate p0
- Halt the primary and backup boot nodes.
smw# su - crayadm
crayadm@smw> xtcli halt c0-0c0s4n1,c0-2c0s4n1
- Update the default boot configuration. Note that this command is used for only the primary node when there is no failover node.
crayadm@smw> xtcli boot_cfg update -b c0-0c0s4n1
- Use xtdaemonconfig to update the HSS daemon to remove STONITH from the blades containing the primary and backup boot nodes.
crayadm@smw> xtdaemonconfig c0-0c0s4 stonith=false
crayadm@smw> xtdaemonconfig c0-2c0s4 stonith=false
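The xtdaemonconfig commands above take the blade cname, not the node cname. As a minimal sketch, assuming node cnames always end in an nN component as in this procedure's examples, the blade names can be derived and the commands previewed before running them. The function names blade_of and stonith_off_cmds are hypothetical helpers, not Cray tools, and the script only prints the commands.

```shell
#!/bin/sh
# Hypothetical helper (not a Cray tool): derive the blade cname from a node
# cname by stripping the trailing node component, e.g. c0-2c0s4n1 -> c0-2c0s4.
blade_of() {
    printf '%s\n' "$1" | sed 's/n[0-9][0-9]*$//'
}

# Print (do not execute) the STONITH-disable command for each node given.
stonith_off_cmds() {
    for node in "$@"; do
        echo "xtdaemonconfig $(blade_of "$node") stonith=false"
    done
}

# Preview the commands for the primary and backup boot nodes in this procedure.
stonith_off_cmds c0-0c0s4n1 c0-2c0s4n1
```

Printing rather than executing keeps the sketch safe to run anywhere; review the output, then enter the commands as crayadm on the SMW.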
- Remove or comment out the following lines in the auto.hostname.start boot automation file that set stonith=true for the blades containing the boot nodes.
# Set STONITH for primary boot node
lappend actions {crms_exec "xtdaemonconfig c0-0c0s4 stonith=true"}
# Set STONITH for backup boot node
lappend actions {crms_exec "xtdaemonconfig c0-2c0s4 stonith=true"}
- Remove or comment out the following lines in the auto.hostname.stop file. Skip this step if SDB node failover will NOT be disabled. This step applies only if both boot and SDB node failover will be disabled.
# Enable the following line if boot or sdb failover is enabled:
lappend actions { crms_exec \
"/opt/cray/hss/default/bin/xtfailover_halt --partition $data(partition,given) --shutdown" }
- Update NIMS for the backup boot node. Change the NIMS group and boot image of the node being removed as the backup boot node so that it looks like other service nodes instead of like the primary boot node.
- Determine which NIMS group and boot image are being used for the primary and backup boot nodes.
crayadm@smw> exit
smw# cnode list c0-0c0s4n1
smw# cnode list c0-2c0s4n1
smw# cnode list other_service_node
- Remove the old NIMS group from the backup boot node.
smw# cnode update -G oldNIMSgroup c0-2c0s4n1
- Assign the service node NIMS group and boot image to the backup boot node.
smw# cnode update -g serviceNIMSgroup \
-i /path/to/service/bootimage c0-2c0s4n1
- Boot the system to confirm these changes.
smw# su - crayadm
crayadm@smw> xtbootsys -a auto.hostname.start
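For the earlier step that comments out the stonith=true lines in the boot automation file, the edit can be previewed with sed instead of being made by hand. This is a minimal sketch, assuming each such line contains an xtdaemonconfig call with stonith=true on a single line, as in the example in this procedure. The function name comment_out_stonith is hypothetical, and the path to the automation file is left as an argument because this procedure does not give it.

```shell
#!/bin/sh
# Hypothetical helper: print the given boot automation file with every
# not-yet-commented line that sets stonith=true via xtdaemonconfig prefixed
# by "# ". The input file itself is not modified.
comment_out_stonith() {
    sed '/^[[:space:]]*#/!s/^\(.*xtdaemonconfig .* stonith=true.*\)$/# \1/' "$1"
}

# Usage (file path is site-specific and supplied by the administrator):
#   comment_out_stonith /path/to/auto.hostname.start > auto.hostname.start.new
#   diff /path/to/auto.hostname.start auto.hostname.start.new
```

Writing the result to a new file and comparing it with diff lets the change be reviewed before it replaces the original automation file.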