Add New Hardware to a System

Add new blades, cabinets, and so forth, to a system.

The process is similar whether adding a single compute or service blade, several components in a full cabinet, or several cabinets.

  1. Add the new components to the system partition.
    1. If the system is partitioned, display the current configuration of the partition that will receive the new components, then deactivate that partition. If the system is not partitioned, skip step 1.
      crayadm@smw> xtcli part_cfg show p2 
      crayadm@smw> xtcli part_cfg deactivate p2
      
    2. Update the members of the partition with the old components and the new components.
      crayadm@smw> xtcli part_cfg update p2 -m c2-0c0s0,c2-0c0s1,c2-0c0s7,c0-0c0s9,c2-0c0s11,c2-0c0s13,c2-0c0s15,c2-0c0s3
      crayadm@smw> xtcli part_cfg activate p2
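
The -m argument above is a single comma-separated list that must contain both the existing and the new members. As a minimal sketch, the list could be assembled as shown here. The join_members helper is hypothetical (it is not a Cray tool), the component names are illustrative, and the command is only printed, not run.

```shell
#!/bin/sh
# join_members: hypothetical helper that merges component names into the
# comma-separated form used with "xtcli part_cfg update <partition> -m".
# Duplicates are dropped and the names are sorted.
join_members() {
    # Unquoted $* is intentional: names are split on whitespace.
    printf '%s\n' $* | sort -u | paste -sd, -
}

old="c2-0c0s0 c2-0c0s1 c2-0c0s7"   # members already in the partition (example)
new="c2-0c0s3"                     # newly added blade (example)

# Dry run: print the command for review instead of executing it.
echo "xtcli part_cfg update p2 -m $(join_members $old $new)"
```

Printing the command first makes it easy to confirm that no existing member was dropped from the list before the partition is updated.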
      
  2. Ensure new components are not disabled and are assigned to the desired partition. If they are disabled, they will not be discovered. If they are not assigned to a partition, they will not be bounced during the xtdiscover process, and therefore will not be properly discovered.
    Full system:
    crayadm@smw> xtcli status s0
    
    Partitioned system:
    crayadm@smw> xtcli status p1
    crayadm@smw> xtcli status p2
    
  3. Discover the new hardware.
    Full system:
     crayadm@smw> su -
     smw# xtdiscover
     smw# exit
    
    Partitioned system:
     crayadm@smw> su -
     smw# xtdiscover
     smw# exit
    
    1. Run rtr --discover if there is a significant change modifying the routing configuration.
      Full system:
      crayadm@smw> rtr --discover
      
      If this is a partitioned system, first deactivate the partitions, run rtr for the full system, and then activate the partitions again. This is most important when xtdiscover has identified a hardware change.
      Partitioned system:
      crayadm@smw> xtcli part_cfg deactivate p1
      crayadm@smw> xtcli part_cfg deactivate p2
      crayadm@smw> xtcli part_cfg activate p0
      
      crayadm@smw> rtr --discover
      
      crayadm@smw> xtcli part_cfg deactivate p0
      crayadm@smw> xtcli part_cfg activate p1
      crayadm@smw> xtcli part_cfg activate p2
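
The ordering above matters: every partition is deactivated, p0 is activated for the full-system rtr run, and then the original partitions are restored. A minimal sketch that prints this sequence for an arbitrary set of partitions follows. The rtr_discover_plan function is hypothetical and only prints commands, so the output can be reviewed before anything is run on the SMW.

```shell
#!/bin/sh
# rtr_discover_plan: hypothetical helper that prints (does not run) the
# command sequence for route discovery on a partitioned system.
# Arguments: the names of the partitions that are normally active.
rtr_discover_plan() {
    for p in "$@"; do
        echo "xtcli part_cfg deactivate $p"
    done
    echo "xtcli part_cfg activate p0"
    echo "rtr --discover"
    echo "xtcli part_cfg deactivate p0"
    for p in "$@"; do
        echo "xtcli part_cfg activate $p"
    done
}

rtr_discover_plan p1 p2
```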
      
    2. Confirm that the new components are now seen.
      crayadm@smw> xtcli status s0
      

      If the new components do not show up properly in the status output, do not continue. Power cycle the whole system, then run xtdiscover again. If the components still do not appear, there may be a problem with the new hardware.

  4. Update firmware on new components. Check whether any firmware needs to be updated on the various controllers.
    crayadm@smw> xtzap -r -v s0
    

    If any firmware is out of date, xtzap prints output like the following, and the firmware must be updated.

    Individual Revision Mismatches:
    
    Type       ID                Expected   Installed
    ---------- ----------------- ---------- ----------------------------------------
    cc_bios    c0-0              0013       0012
    bc_bios    c0-0c0s0          0013       0012
    bc_bios    c0-0c0s1          0013       0012
    bc_bios    c0-0c0s2          0013       0012
    bc_bios    c0-0c0s3          0013       0012
    
    1. Update firmware, if not all current.

      CAUTION: The xtzap command is normally intended for use by Cray Service personnel only. Improper use of this restricted command can cause serious damage to the computer system.

      If the output of xtzap includes a "Revision Mismatches" section, then some firmware is out of date and needs to be reflashed. To update, run xtzap with one or more of the options described in the next paragraph.

      Although the xtzap -a command updates all components with a single command, it may be faster to use xtzap -b when only blade types need to be updated, xtzap -c when only cabinet types need to be updated, or xtzap -t when only a single type needs to be updated. On larger systems, this can save significant time.

      This is the list of all cabinet-level components:
      cc_mc (CC Microcontroller)
      cc_bios (CC Tolapai BIOS)
      cc_fpga (CC FPGA)
      chia_fpga (CHIA FPGA)

      This is the list of all blade-level components:
      cbb_mc (CBB BC Microcontroller)
      ibb_mc (IBB BC Microcontroller)
      anc_mc (ANC BC Microcontroller)
      bc_bios (BC Tolapai BIOS)
      lod_fpga (LOD FPGA)
      node_bios (Node BIOS)
      loc_fpga (LOC FPGA)
      qloc_fpga (QLOC FPGA)

      If the output of the xtzap command shows that only a specific type needs to be updated, then use the -t option with that type (this example uses the node_bios type).
      crayadm@smw> xtzap -t node_bios s0
      
      If the output of the xtzap command shows that only blade component types need to be updated, then use the -b option:
      crayadm@smw> xtzap -b s0
      
      If the output of the xtzap command shows that only cabinet component types need to be updated, then use the -c option:
      crayadm@smw> xtzap -c s0
      
      If the output of the xtzap command shows that both blade- and cabinet-level component types need to be updated, or if unsure of what needs to be updated, then use the -a option:
      crayadm@smw> xtzap -a s0
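
The choice among -t, -b, -c, and -a can be read off the Type column of the mismatch table. The following is a minimal sketch of that decision, assuming the mismatch rows have the four-column layout shown earlier (Type, ID, Expected, Installed). The suggest_xtzap_option helper is hypothetical, reads a saved report on standard input, and only prints a suggestion; it never runs xtzap.

```shell
#!/bin/sh
# suggest_xtzap_option: hypothetical helper. Reads an "xtzap -r -v" report
# on stdin and prints which option (-t TYPE, -b, -c, -a) covers the listed
# mismatches, or "none" if no mismatch rows are found.
suggest_xtzap_option() {
    awk '
        # Mismatch rows have four fields and a type ending in _mc, _bios or _fpga.
        NF == 4 && $1 ~ /_(mc|bios|fpga)$/ {
            if (!($1 in seen)) { seen[$1] = 1; ntypes++; last = $1 }
            # Cabinet-level types as listed in this procedure; all others are blade-level.
            if ($1 ~ /^(cc_mc|cc_bios|cc_fpga|chia_fpga)$/) cab = 1
            else blade = 1
        }
        END {
            if (ntypes == 0)       print "none"
            else if (ntypes == 1)  print "-t " last
            else if (cab && blade) print "-a"
            else if (blade)        print "-b"
            else                   print "-c"
        }'
}

# Example with rows taken from the sample output in this procedure.
suggest_xtzap_option <<EOF
cc_bios    c0-0              0013       0012
bc_bios    c0-0c0s0          0013       0012
EOF
```

On the SMW, the report could be saved with output redirection and then passed to the function, which keeps the restricted xtzap update commands under manual control.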
      
    2. Perform xtbounce --linktune if the firmware was not all current.
      Force xtbounce to do a linktune on the full system before checking firmware again.
      crayadm@smw> xtbounce --linktune=all s0
      
    3. Check the firmware again after the update and linktune to confirm that all components are now current.
      crayadm@smw> xtzap -r -v s0
      
  5. Check routing configuration of the system.

    The rtr -R command produces no output unless there is a routing problem.

    Full system:
    crayadm@smw> rtr -R s0
    

    Partitioned system:

    crayadm@smw> rtr -R p1
    crayadm@smw> rtr -R p2
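
Because silence means success here, a script needs to treat any output as a failure. A minimal sketch of such a check follows. The expect_silent wrapper is hypothetical, and capturing both standard output and standard error is an assumption, since this procedure does not say which stream rtr uses to report problems.

```shell
#!/bin/sh
# expect_silent: hypothetical wrapper that runs a command and returns
# non-zero if the command writes anything to stdout or stderr.
expect_silent() {
    out=$("$@" 2>&1)
    if [ -n "$out" ]; then
        printf 'unexpected output from: %s\n%s\n' "$*" "$out" >&2
        return 1
    fi
    return 0
}

# Example use on the SMW (not executed here):
#   expect_silent rtr -R s0 && echo "routing OK"
```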
    
  6. Update NIMS for new components.
    Now that the new components have been added and the firmware is up to date, several NIMS commands are needed.
    Note: The cnode and cmap commands replace the nimscli command, which was deprecated in CLE 6.0.UP04 and removed in CLE 6.0.UP05. Be sure to change any scripts that reference nimscli.
    1. View settings for already existing similar nodes.
      crayadm@smw> cnode list -p p0
      
    2. If this blade was swapped out and replaced with a different type (for example, a compute blade replaced by a service blade), remove its nodes from the old group.
      crayadm@smw> cnode update --partition p1 -c p1 -G netroot_compute \
      c0-0c0s1n0 c0-0c0s1n1 c0-0c0s1n2 c0-0c0s1n3
      
    3. Assign the nodes to the correct config set, group (compute, netroot_compute, service, login, dal, etc.), and image.
      crayadm@smw> cnode update --partition p1 -c p1 -g service \
      -i /var/opt/cray/imps/boot_images/service_XXX.cpio \
      c0-0c0s1n0 c0-0c0s1n1 c0-0c0s1n2 c0-0c0s1n3
      
    4. If this is a netroot_compute node, assign the key for netroot (this can be combined with the config set, group, and image assignment in the command above).
      crayadm@smw> cnode update --partition p1 -s netroot=compute-large_cle_XXX \
      c0-0c0s1n0 c0-0c0s1n1 c0-0c0s1n2 c0-0c0s1n3
      
    5. If this was a netroot_compute node and no longer is, remove the netroot key.
      crayadm@smw> cnode update --partition p1 -K netroot \
      c0-0c0s1n0 c0-0c0s1n1 c0-0c0s1n2 c0-0c0s1n3
      
    6. If this was a compute node and is now a service node, remove the remaining extraneous keys.
      crayadm@smw> cnode update --partition p1 -c p1 -K hsn_ipv4_mask \
      c0-0c0s1n0 c0-0c0s1n1 c0-0c0s1n2 c0-0c0s1n3
      crayadm@smw> cnode update --partition p1 -c p1 -K hsn_ipv4_net \
      c0-0c0s1n0 c0-0c0s1n1 c0-0c0s1n2 c0-0c0s1n3
      crayadm@smw> cnode update --partition p1 -c p1 -K sdbnodeip \
      c0-0c0s1n0 c0-0c0s1n1 c0-0c0s1n2 c0-0c0s1n3
      crayadm@smw> cnode update --partition p1 -c p1 -K bootnodeip \
      c0-0c0s1n0 c0-0c0s1n1 c0-0c0s1n2 c0-0c0s1n3
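
The four key removals above differ only in the key name, so they can be generated with a loop. The following is a minimal sketch. The remove_key_cmds function is hypothetical, it assumes the config set has the same name as the partition (as in the example commands), and it only prints the cnode commands so they can be reviewed before being run.

```shell
#!/bin/sh
# remove_key_cmds: hypothetical helper that prints one
# "cnode update ... -K <key>" command per key to remove.
#   $1 = partition name (also used as the config set name, an assumption)
#   $2 = space-separated node list
#   remaining arguments = keys to remove
remove_key_cmds() {
    part=$1
    node_list=$2
    shift 2
    for key in "$@"; do
        echo "cnode update --partition $part -c $part -K $key $node_list"
    done
}

nodes="c0-0c0s1n0 c0-0c0s1n1 c0-0c0s1n2 c0-0c0s1n3"
remove_key_cmds p1 "$nodes" hsn_ipv4_mask hsn_ipv4_net sdbnodeip bootnodeip
```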
      
  7. Update config sets with the new components.
    This will generate a new /etc/hosts file for the CLE nodes.
    Full system:
    crayadm@smw> su -
    smw# cfgset update global
    smw# cfgset update p0
    smw# exit
    crayadm@smw>
    
    Partitioned system:
    crayadm@smw> su -
    smw# cfgset update global
    smw# cfgset update p1
    smw# cfgset update p2
    smw# exit
    crayadm@smw>
    
  8. Update any workload manager (WLM) configuration as specified in the associated WLM documentation.
  9. Boot the system using the standard boot procedure.

––––––––––––––––––––––––––––––––––––––––––––––––––––

If this is an air-cooled XC system (XC-AC), then when the system has completed booting, perform the procedure in Check Cabinet Cooling Parameters for an Air-Cooled XC System.