Add New Hardware to a System
Add new blades, cabinets, and so forth, to a system.
Whether adding a single compute or service blade, several components in a full cabinet, or several cabinets, the process is similar.
- Add new components to system partition.
- If the system is partitioned, add the new components to the specific partition. If the system is not partitioned, this step can be skipped.
crayadm@smw> xtcli part_cfg show p2
crayadm@smw> xtcli part_cfg deactivate p2
- Update the members of the partition to include both the old components and the new components, then reactivate the partition.
crayadm@smw> xtcli part_cfg update p2 -m c2-0c0s0,c2-0c0s1,c2-0c0s7,c0-0c0s9,c2-0c0s11,c2-0c0s13,c2-0c0s15,c2-0c0s3
crayadm@smw> xtcli part_cfg activate p2
- Ensure that the slots for the new components are not disabled and are assigned to the desired partition. Disabled slots will not be discovered. Slots that are not assigned to a partition will not be bounced during the xtdiscover process and therefore will not be properly discovered.
All systems:
crayadm@smw> xtcli status s0
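As a quick way to spot problem slots before running xtdiscover, saved `xtcli status` output can be scanned for the word "disabled". This is a minimal sketch only; the sample output embedded below is illustrative, since the exact status format varies by SMW release.

```shell
# Sketch: find slots marked "disabled" in captured xtcli status output.
# The sample text is a stand-in for real `xtcli status s0` output.
sample_status='c2-0c0s2: service,compute ready
c2-0c0s4: - disabled
c2-0c0s6: service,compute ready'

# Print the component name (first field) of any line mentioning "disabled".
printf '%s\n' "$sample_status" | awk '/disabled/ { print $1 }'
```

On a live SMW, the pipeline would read from `xtcli status s0` directly instead of the sample variable.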
- Discover the new hardware.
Full system:
crayadm@smw> su -
smw# xtdiscover
smw# exit
Partitioned system:
crayadm@smw> su -
smw# xtdiscover
smw# exit
- Run rtr --discover if there is a significant change modifying the routing configuration.
Full system:
crayadm@smw> rtr --discover
If this is a partitioned system, first deactivate the partitions, run rtr for the full system, and then activate the partitions again. This is most important when xtdiscover has identified a hardware change.
Partitioned system:
crayadm@smw> xtcli part_cfg deactivate p1
crayadm@smw> xtcli part_cfg deactivate p2
crayadm@smw> xtcli part_cfg activate p0
crayadm@smw> rtr --discover
crayadm@smw> xtcli part_cfg deactivate p0
crayadm@smw> xtcli part_cfg activate p1
crayadm@smw> xtcli part_cfg activate p2
- Confirm that the new components are now seen.
crayadm@smw> xtcli status s0
If the new components do not show up properly in the status output, do not continue. Power cycle the whole system and try xtdiscover again. If the components still do not appear, there may be a problem with the new hardware.
- Update firmware on new components. Check whether any firmware needs to be updated on the various controllers.
crayadm@smw> xtzap -r -v s0
If any firmware is out of date, the xtzap command produces output like the following, and the firmware needs to be updated.
Individual Revision Mismatches:
Type       ID                Expected   Installed
---------- ----------------- ---------- ----------
cc_bios    c0-0              0013       0012
bc_bios    c0-0c0s0          0013       0012
bc_bios    c0-0c0s1          0013       0012
bc_bios    c0-0c0s2          0013       0012
bc_bios    c0-0c0s3          0013       0012
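For a large report it can help to reduce the mismatch table to the distinct firmware types that need reflashing, which in turn suggests which xtzap option to use. This is a sketch under the assumption that the table layout matches the sample above (two header lines, then one row per mismatch).

```shell
# Sketch: list the distinct out-of-date firmware types from a saved
# "Individual Revision Mismatches" table (captured from xtzap -r -v output).
mismatches='Type       ID                Expected   Installed
---------- ----------------- ---------- ----------
cc_bios    c0-0              0013       0012
bc_bios    c0-0c0s0          0013       0012
bc_bios    c0-0c0s1          0013       0012'

# Skip the two header lines, print the Type column, deduplicate.
printf '%s\n' "$mismatches" | awk 'NR > 2 { print $1 }' | sort -u
```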
- Update firmware, if not all current.
CAUTION: The xtzap command is normally intended for use by Cray Service personnel only. Improper use of this restricted command can cause serious damage to the computer system.
If the output of xtzap includes a "Revision Mismatches" section, then some firmware is out of date and needs to be reflashed. To update, run xtzap with one or more of the options described in the next paragraph.
While the xtzap -a command can be used to update all components with a single command, it may be faster to use the xtzap -b command when only blade types need to be updated, or the xtzap -t command when only a single type needs to be updated. On larger systems, this can save significant time.
This is the list of all cabinet-level components:
cc_mc (CC Microcontroller)
cc_bios (CC Tolapai BIOS)
cc_fpga (CC FPGA)
chia_fpga (CHIA FPGA)
This is the list of all blade-level components:
cbb_mc (CBB BC Microcontroller)
ibb_mc (IBB BC Microcontroller)
anc_mc (ANC BC Microcontroller)
bc_bios (BC Tolapai BIOS)
lod_fpga (LOD FPGA)
node_bios (Node BIOS)
loc_fpga (LOC FPGA)
qloc_fpga (QLOC FPGA)
If the output of the xtzap command shows that only a specific type needs to be updated, then use the -t option with that type (this example uses the node_bios type):
crayadm@smw> xtzap -t node_bios s0
If the output of the xtzap command shows that only blade component types need to be updated, then use the -b option:
crayadm@smw> xtzap -b s0
If the output of the xtzap command shows that only cabinet component types need to be updated, then use the -c option:
crayadm@smw> xtzap -c s0
If the output of the xtzap command shows that both blade- and cabinet-level component types need to be updated, or if unsure of what needs to be updated, then use the -a option:
crayadm@smw> xtzap -a s0
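The option choice above can be mechanized: classify each mismatched type against the cabinet-level list and pick -c, -b, or -a accordingly. This sketch only prints the suggested command rather than running xtzap; the example mismatch list is hypothetical.

```shell
# Sketch: suggest an xtzap option given a set of mismatched firmware types.
# Cabinet-level types from the list above; everything else is blade-level.
cabinet_types="cc_mc cc_bios cc_fpga chia_fpga"
mismatched="bc_bios node_bios"   # example input, taken from an xtzap report

need_cab=no; need_blade=no
for t in $mismatched; do
    case " $cabinet_types " in
        *" $t "*) need_cab=yes ;;     # type is in the cabinet-level list
        *)        need_blade=yes ;;   # otherwise it is blade-level
    esac
done

if   [ "$need_cab" = yes ] && [ "$need_blade" = yes ]; then echo "xtzap -a s0"
elif [ "$need_blade" = yes ]; then echo "xtzap -b s0"
else echo "xtzap -c s0"
fi
```

With the example input (two blade-level types), this prints `xtzap -b s0`.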
- Perform xtbounce --linktune, if not all current. Force xtbounce to do a linktune on the full system before checking firmware again.
crayadm@smw> xtbounce --linktune=all s0
- Check firmware after the update and linktune. Confirm that all components were updated.
crayadm@smw> xtzap -r -v s0
- Check routing configuration of the system.
The rtr -R command produces no output unless there is a routing problem.
Full system:
crayadm@smw> rtr -R s0
Partitioned system:
crayadm@smw> rtr -R p1
crayadm@smw> rtr -R p2
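Because rtr -R is silent on success, a wrapper can turn "any output at all" into an explicit pass/fail result. This is a sketch only: a stub function stands in for rtr so the logic runs anywhere, and on the SMW the stub would be removed so the real command is used.

```shell
# Sketch: treat any output from `rtr -R` as a routing problem.
rtr() { :; }   # stub standing in for the real rtr (healthy system, no output)

out=$(rtr -R s0 2>&1)
if [ -n "$out" ]; then
    echo "routing problem detected:"
    printf '%s\n' "$out"
else
    echo "routing check passed"
fi
```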
- Update config sets with the new components. If the desired images are already available, assign the images to the new nodes. If images are not yet available, create new images with imgbuilder. This will generate a new /etc/hosts file for the CLE nodes.
Full system:
crayadm@smw> su -
smw# cfgset update global
smw# cfgset update p0
smw# exit
Partitioned system:
crayadm@smw> su -
smw# cfgset update global
smw# cfgset update p1
smw# cfgset update p2
smw# exit
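On systems with several partitions, the update sequence can be generated as a dry run for review before it is executed as root. This is a minimal sketch; the partition list p1 p2 is a placeholder taken from the examples above, and the sketch only prints the commands rather than running cfgset.

```shell
# Sketch: print the cfgset update sequence (dry run) for a partitioned system.
partitions="p1 p2"              # placeholder: adjust to the site's partitions
for target in global $partitions; do
    echo "cfgset update $target"
done
```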
- Update NIMS for new components. Now that the new components have been added and the firmware is up to date, NIMS may need to be updated to reflect the hardware change.
- View settings for already existing similar nodes.
crayadm@smw> cnode list -p p0
Note: Use the appropriate -p for the partition and -c for the config set.
- If this blade was swapped out and replaced with a different type (for example, a compute blade swapped for a service blade), remove it from the old group.
crayadm@smw> cnode update -p p0 -G netroot_compute \ c0-0c0s1n0 c0-0c0s1n1 c0-0c0s1n2 c0-0c0s1n3
- Assign the nodes to the correct config set, group (compute_aarch64, compute, netroot_compute, service, login, dal, etc.), and image.
If the nodes being added are aarch64 nodes, add them to the compute_aarch64 group rather than the compute group. This ensures that the correct aarch64 images will be mapped to these nodes when building images with the default cray_image_groups.yaml config file for imgbuilder. If these are the first aarch64 compute nodes added to the system, an appropriate image may not exist for them yet.
crayadm@smw> cnode update -p p0 -c p0 -g service \ -i /var/opt/cray/imps/boot_images/service_XXX.cpio \ c0-0c0s1n0 c0-0c0s1n1 c0-0c0s1n2 c0-0c0s1n3
- If this is a netroot_compute node, assign the netroot key (this can be combined with the config set, group, and image assignment in the command above).
crayadm@smw> cnode update -p p0 -s netroot=compute-large_cle_XXX \ c0-0c0s1n0 c0-0c0s1n1 c0-0c0s1n2 c0-0c0s1n3
- If this node was a netroot_compute node and no longer is, remove the netroot key.
crayadm@smw> cnode update -p p0 -K netroot \ c0-0c0s1n0 c0-0c0s1n1 c0-0c0s1n2 c0-0c0s1n3
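The cnode examples above repeat the same four node names on every command line. A small loop can build that list once, which avoids typos when many blades are involved. This is a sketch only; the blade name c0-0c0s1 is taken from the examples above.

```shell
# Sketch: build the list of the four node names on a blade once, then reuse
# it in cnode command lines (e.g. cnode update -p p0 -K netroot $nodes).
nodes=""
for n in 0 1 2 3; do
    nodes="$nodes c0-0c0s1n$n"
done
nodes=${nodes# }   # trim the leading space
echo "$nodes"
```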
- Update any workload manager (WLM) configuration as specified in the associated WLM documentation.
- PBS Professional™ is a commercial product licensed by Altair Engineering, Inc. For documentation, see http://www.pbsworks.com/PBSProductGT.aspx?n=PBS-Professional&c=Overview-and-Capabilities&d=PBS-Professional,-Documentation.
- Moab™ and TORQUE are commercial products licensed by Adaptive Computing. For general product information, see http://www.adaptivecomputing.com.
- Slurm (Simple Linux Utility for Resource Management) is an open source application that is commercially supported by SchedMD, among others. For Cray-specific installation/configuration instructions, see XC™ Series Slurm Installation Guide (S-2538).
- Boot the system using the standard boot procedure. If these are the first aarch64 nodes being added to the system, skip this step until all other steps in the following sections are performed; the images will not have been built and mapped to these nodes yet.
The admin may also re-purpose some of the aarch64 compute nodes as service nodes so that they can be used as aarch64 login nodes. The steps to re-purpose compute nodes as service nodes and to build and map images are described in Repurpose a Compute or Service Node and Build aarch64 Compute and Login Images.
If this is an air-cooled XC system (XC-AC), then when the system has completed booting, perform the procedure in Check Cabinet Cooling Parameters for an Air-Cooled XC System.