Run Applications Using the aprun Command

Detailed information about the aprun command.

On systems using the native Slurm workload manager, applications are launched using the srun utility, which is documented separately.

On systems using the PBS Professional or Moab/Torque workload manager, the aprun utility launches applications on compute nodes. The utility submits applications to the Application Level Placement Scheduler (ALPS) for placement and execution, forwards the user's login node environment to the assigned compute nodes, forwards signals, and manages the stdin, stdout, and stderr streams.

Use the aprun command to specify required resources, request application placement, and initiate application launch. The basic format of the aprun command is as follows:
aprun [global_options] [command_options] cmd1 [: [command_options] cmd2 [: ...] ] [--help] [--version]

Use the colon character (:) to separate the different options for separate binaries when running in MPMD (Multiple Program Multiple Data) mode. Use the --help option to display detailed aprun command line usage information. Use the --version option to display the ALPS version information.
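For example, an MPMD launch of two binaries might look like the following (the executable names and PE counts here are illustrative, not taken from ALPS documentation):

```shell
% aprun -n 1 ./master : -n 4 ./worker
```

Here one PE runs ./master while four PEs run ./worker, all within a single ALPS application.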

The aprun command supports two general sets of arguments: global options and command options. The global options apply to the execution command line as a whole and are as follows:

-b | --bypass-app-transfer
Bypass application transfer to compute node
-B | --batch-args
Get values from Batch reservation for -n, -N, -d, and -m
-C | --reconnect
Reconnect fanout control tree around failed nodes
-D | --debug level
Debug level bitmask (0-7)
-e | --environment-override env

Set an environment variable on the compute nodes

Must use format VARNAME=value

Set multiple environment variables using multiple -e arguments

-m | --memory-per-pe size

Per PE memory limit in megabytes (default node memory/number of processors)

K|M|G suffix supported (16 == 16M == 16 megabytes)

Add an 'h' suffix to request per PE huge page memory

Add an 's' to the 'h' suffix to make the per PE huge page memory size strict (required)

-P | --pipes pipes
Write[,read] pipes (not applicable for general use)
-p | --protection-domain pdi
Protection domain identifier
-q | --quiet
Quiet mode; suppress aprun non-fatal messages
-R | --relaunch max_shrink
Relaunch application; max_shrink is the maximum number of PEs (zero or more) to give up for a relaunch
-T | --sync-output
Use synchronous TTY
-t | --cpu-time-limit sec
Per PE CPU time limit in seconds (default unlimited)

The command options apply to individual binaries and can be set differently for each binary when operating in MPMD mode. The command options are as follows:

-a | --architecture arch
Architecture type
-cc | --cpu-binding cpu_list

CPU binding list or keyword

([cpu#[,cpu# | cpu1-cpu2] | x]... | keyword)

-cp | --cpu-binding-file file
CPU binding placement filename
-d | --cpus-per-pe depth
Number of CPUs allocated per PE (number of threads)
-E | --exclude-node-list node_list
List of nodes to exclude from placement
--exclude-node-list-file node_list_file
File with a list of nodes to exclude from placement
-F | --access-mode flag
Exclusive or share node resources flag
-j | --cpus-per-cu CPUs
CPUs to use per Compute Unit (CU)
-L | --node-list node_list
Manual placement list (node[,node | node1-node2]...)
-l | --node-list-file node_list_file
File with manual placement list
-N | --pes-per-node pes
PEs per node
-n | --pes width
Number of PEs requested
--p-governor governor_name
Specify application performance governor
--p-state pstate
Specify application p-state in kHz
-r | --specialized-cpus CPUs
Restrict this many CPUs per node to specialization
-S | --pes-per-numa-node pes
PEs per NUMA node
-ss | --strict-memory-containment
Strict memory containment per NUMA node

In more detail, the aprun options are as follows:
-a arch
Specifies the architecture type of the compute node on which the application will run; arch is xt. If using aprun to launch a compiled and linked executable, do not include the -a option; ALPS can determine the compute node architecture type from the ELF header (see the elf(5) man page).
-b

Bypasses the transfer of the executable program to the compute nodes. By default, the executable is transferred to the compute nodes during the aprun process of launching an application.

-B

Reuses the width, depth, nppn, and memory request options that were specified with the batch reservation. This option obviates the need to specify aprun options -n, -d, -N, and -m. aprun will exit with errors if these options are specified with the -B option.

-C

Attempts to reconnect the application-control fan-out tree around failed nodes and complete application execution. To use this option, the application must use a programming model that supports reconnect. Options -C and -R are mutually exclusive.

-cc cpu_list|keyword

Binds processing elements (PEs) to CPUs. CNL does not migrate processes that are bound to a CPU. This option applies to all multicore compute nodes. The cpu_list is not used for placement decisions, but is used only by CNL during application execution. For further information about binding (CPU affinity), see The aprun CPU Affinity Option.

The cpu_list is a comma-separated list of logical CPU numbers and/or hyphen-separated CPU ranges. It controls the CPU affinity of each PE and of each descendant thread or process of each PE (collectively, "app tasks") as they are created. When an app task is created, it is bound to the CPU in cpu_list corresponding to the number of app tasks created up to that point. For example, the first PE created is bound to the first CPU in cpu_list; the second PE created is bound to the second CPU in cpu_list (assuming the first PE has not yet created any children or threads). If more app tasks are created than there are entries in cpu_list, binding starts over at the beginning of cpu_list.

Instead of a CPU number, an x may be specified in any position or positions in cpu_list. The app task that corresponds to this position will not be bound to any CPU.

The above behavior can result in undesirable and/or unpredictable binding when more than one PE on a node creates children or threads without synchronizing among themselves. Because app tasks are bound to the CPUs in cpu_list in the order in which they are created, an unpredictable creation order leads to unpredictable binding. To prevent this, specify one cpu_list per PE, separating multiple cpu_lists with colons (:).

% aprun -n 2 -d 3 -cc 0,1,2:4,5,6 ./a.out

The example above contains two cpu_lists. The first (0,1,2) is applied to the first PE created and any threads or child processes that result. The second (4,5,6) is applied to the second PE created and any threads or child processes that result.

Out-of-range cpu_list values are ignored unless all CPU values are out of range, in which case an error message is issued. For example, to bind PEs starting with the highest CPU on a compute node and work down from there, use this -cc option:
% aprun -n 8 -cc 10-4 ./a.out

If the PEs were placed on Cray XE6 24-core compute nodes, the specified -cc range would be valid. However, if the PEs were placed on Cray XK6 eight-core compute nodes, CPUs 10-8 would be out of range and therefore not used.

Instead of a cpu_list, the argument to the -cc option may be one of the following keywords:

  • The cpu keyword (the default) binds each PE to a CPU within the assigned NUMA node. Indicating a specific CPU is not necessary.
  • If a depth per PE (aprun -d depth) is specified, the PEs are constrained to CPUs with a distance of depth between them, so that each PE's threads are constrained to the CPUs closest to the PE's CPU.
  • The -cc cpu option is the typical use case for an MPI application.

    Tip: If CPUs are oversubscribed for an OpenMP application, Cray recommends not using the -cc cpu default. Test the -cc none and -cc numa_node options and compare results to determine which option produces the better performance.
  • The depth keyword can improve MPI rank and thread placement on Cray XC30 nodes by assigning to a PE and its children a cpumask with -d (depth) bits set. If the -j option is also used, only -j PEs will be assigned per compute unit.
  • The numa_node keyword constrains PEs to the CPUs within the assigned NUMA node. CNL can migrate a PE among the CPUs in the assigned NUMA node but not off the assigned NUMA node.
  • If PEs create threads, the threads are constrained to the same NUMA-node CPUs as the PEs. There is one exception. If depth is greater than the number of CPUs per NUMA node, when the number of threads created by the PE has exceeded the number of CPUs per NUMA node, the remaining threads are constrained to CPUs within the next NUMA node on the compute node. For example, on an 8-core XK node where CPUs 0-3 are on NUMA node 0 and CPUs 4-7 are on NUMA node 1, if depth is 5, threads 0-3 are constrained to CPUs 0-3 and thread 4 is constrained to CPUs 4-7.
  • The none keyword allows PE migration within the assigned NUMA nodes.
-D value

The -D option value is an integer bitmask setting that controls debug verbosity, where:

  • A value of 1 provides a small level of debug messages
  • A value of 2 provides a medium level of debug messages
  • A value of 4 provides a high level of debug messages

Because this option is a bitmask setting, value can be set to get any or all of the above levels of debug messages. Therefore, valid values are 0 through 7. For example, -D 3 provides all small and medium level debug messages.
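Because the levels are bit flags, a combined setting is simply the bitwise OR of the individual levels. A quick shell sanity check:

```shell
# Debug levels are bit flags: 1 (small), 2 (medium), 4 (high).
# A combined -D value is their bitwise OR:
echo $((1 | 2))      # 3 -> small + medium (aprun -D 3)
echo $((1 | 2 | 4))  # 7 -> all levels     (aprun -D 7)
```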

-d depth

Specifies the number of CPUs for each PE and its threads. ALPS allocates the number of CPUs equal to depth times pes. The -cc cpu_list option can restrict the placement of threads, resulting in more than one thread per CPU.

The default depth is 1.

For OpenMP applications, use both the OMP_NUM_THREADS environment variable to specify the number of threads and the aprun -d option to specify the number of CPUs hosting the threads. ALPS creates -n pes instances of the executable, and the executable spawns OMP_NUM_THREADS-1 additional threads per PE. For an OpenMP example, see Run an OpenMP Application.
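A typical OpenMP launch pairs the two settings as follows (the executable name is illustrative):

```shell
% export OMP_NUM_THREADS=4
% aprun -n 8 -d 4 ./omp_app
```

ALPS creates 8 PEs, each PE spawns 3 additional threads, and -d 4 reserves 4 CPUs per PE to host them.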

The maximum permissible depth value depends on the types of CPUs installed on the Cray system.

-e env

Sets an environment variable on the compute nodes. The assignment must use the form VARNAME=value. Set multiple environment variables by repeating the flag with separate assignments, e.g., -e VARNAME1=value1 -e VARNAME2=value2.

-F exclusive|share

exclusive mode provides a program with exclusive access to all the processing and memory resources on a node. When combined with the -cc option, processes are bound to the CPUs listed in the affinity string. share mode restricts the application-specific cpuset contents to only the cores and memory reserved for the application, on NUMA node boundaries; the application cannot access cores and memory on the other NUMA nodes of that compute node. Because exclusive access mode is the default, the exclusive option does not need to be specified. However, if nodeShare is set to share in alps.conf, use -F exclusive to override the policy set in that file. Check the value of nodeShare by executing apstat -svv | grep access.
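For example, to check the site policy and then force exclusive access (an illustrative command sequence):

```shell
% apstat -svv | grep access       # shows the current nodeShare setting
% aprun -F exclusive -n 16 ./a.out
```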

-j num_cpus

Specifies how many CPUs to use per compute unit for an ALPS job. For more information on compute unit affinity, see XC™ Series Compute Unit Affinity Configuration Guide.

-L node_list

Specifies the candidate nodes to constrain application placement. The syntax allows a comma-separated list of nodes (such as -L 32,33,40), a hyphen-separated range of nodes (such as -L 41-87), or a combination of both formats. Node values can be expressed in decimal, octal (preceded by 0), or hexadecimal (preceded by 0x). The first number in a range must be less than the second number (8-6, for example, is invalid), but the nodes in a list can be in any order.

If the placement node list contains fewer nodes than the number required, a fatal error is produced. If resources are not currently available, aprun continues to retry.

The cnselect command is a common source of node lists. See the cnselect(1) man page for details.
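A common pattern is to feed a cnselect result into -L. The selection expression below is illustrative; see cnselect(1) for the attributes available on a given system:

```shell
% cnselect numcores.eq.24          # hypothetical query: nodes with 24 cores
% aprun -n 48 -L 40-87 ./a.out     # place on a range returned by cnselect
```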

-m size[h|hs]

Specifies the per-PE required Resident Set Size (RSS) memory size in megabytes. K, M, and G suffixes (case insensitive) are supported (16M = 16m = 16 megabytes, for example). If the -m option is not included, the default amount of memory available to each PE equals (compute node memory size) / (number of PEs) calculated for each compute node.

Use the h or hs suffix to allocate huge pages (2 MB) for an application.

The use of the -m option is not required on Cray systems because the kernel allows the dynamic creation of huge pages. However, it is advisable to specify this option and preallocate an appropriate number of huge pages, when memory requirements are known, to reduce operating system overhead.

-m sizeh

Requests memory to be allocated to each PE, where memory is preferentially allocated out of the huge page pool. Each node uses as much huge page memory as it is able to allocate, falling back to 4 KB base pages thereafter.

-m sizehs

Requests memory to be allocated to each PE, where memory is allocated out of the huge page pool. If the request cannot be satisfied, an error message is issued and the application launch is terminated.

To use huge pages, first link the application with hugetlbfs:

% cc -c my_hugepages_app.c 
% cc -o my_hugepages_app my_hugepages_app.o -lhugetlbfs
Set the huge pages environment variable at run time:
% setenv HUGETLB_MORECORE yes
Or
% export HUGETLB_MORECORE=yes

See the intro_hugepages(1) man page for further details.
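Combining the pieces above, a launch that strictly requires huge page memory per PE might look like this (the size and executable name are illustrative):

```shell
% export HUGETLB_MORECORE=yes
% aprun -m 700hs -n 4 ./my_hugepages_app
```

The hs suffix makes the 700-megabyte huge page request strict: if it cannot be satisfied, the launch is terminated rather than falling back to base pages.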

-n pes

Specifies the number of processing elements (PEs) that the application requires. A PE is an instance of an ALPS-launched executable. Express the number of PEs in decimal, octal, or hexadecimal form. If pes has a leading 0, it is interpreted as octal (-n 16 specifies 16 PEs, but -n 016 is interpreted as 14 PEs). If pes has a leading 0x, it is interpreted as hexadecimal (-n 16 specifies 16 PEs, but -n 0x16 is interpreted as 22 PEs). The default value is 1.
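The leading-0 and leading-0x interpretation follows C numeric-constant rules, which the shell's printf utility also applies; this aside illustrates the conversions:

```shell
# C-style numeric parsing: a leading 0 means octal, 0x means hexadecimal.
printf '%d\n' 16    # decimal: 16
printf '%d\n' 016   # octal:   14
printf '%d\n' 0x16  # hex:     22
```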

-N pes_per_node

Specifies the number of PEs to place per node. For Cray systems, the default is the number of cores on the node.

-p pdi

Requests use of a protection domain, identified by the user-pre-allocated protection domain identifier pdi. If protection domains are already allocated by system services, this option cannot be used. Every application in a cooperating set must specify the same aprun -p option to have access to the shared protection domain. aprun returns an error if the protection domain identifier is not recognized or if the user is not the owner of the specified protection domain.

--p-governor governor_name
--p-governor sets a performance governor on the compute nodes used by the application. Choices are performance, powersave, userspace, ondemand, and conservative. See /usr/src/linux/Documentation/cpu-freq/governors.txt for details. --p-governor cannot be used with --p-state.
--p-state pstate
Specifies the CPU frequency used by the compute node kernel while running the application. --p-state cannot be used with --p-governor.
-q

Specifies quiet mode and suppresses all aprun-generated non-fatal messages. Do not use this option with the -D (debug) option; aprun terminates the application if both options are specified. Even with the -q option, aprun writes its help message and any ALPS fatal messages when exiting. Normally, this option should not be used.

-r cores

When cores > 0, core specialization is enabled. On each compute node, cores CPUs will be dedicated to system tasks, and system tasks will not run on the CPUs on which the application is placed.

Whenever core specialization is enabled, the highest-numbered CPU will be used as one of these system CPUs. When cores > 1, the additional system CPUs will be chosen from the CPUs not selected for the application by the usual affinity options.

It is an error to specify cores > 0 and to include the highest-numbered CPU in the -cc cpu_list option. It is an error to specify cores and a -cc cpu_list where the number of CPUs in the cpu_list plus cores is greater than the number of CPUs on the node.
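For example, to dedicate one CPU per node to system tasks while binding the application to the remaining CPUs (CPU numbers illustrative for a 24-core node, where CPU 23 is reserved as the system CPU):

```shell
% aprun -r 1 -n 23 -cc 0-22 ./a.out
```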

-R pe_dec

Enables application relaunch so that, should the application experience certain system failures, ALPS will attempt to relaunch it and complete in a degraded manner. pe_dec is the processing element (PE) decrement tolerance. If pe_dec is non-zero, aprun attempts to relaunch with at most pe_dec fewer PEs. If pe_dec is 0, aprun attempts relaunch with the same number of PEs specified with the original launch. Relaunch is supported per aprun instance. A decrement count greater than zero will fail for MPMD launches with more than one element. aprun attempts relaunch only for ec_node_failed and ec_node_halt hardware supervisory system events. Options -C and -R are mutually exclusive.

-S pes_per_numa_node

Specifies the number of PEs to allocate per NUMA node. Use this option to reduce the number of PEs per NUMA node, thereby making more resources (such as memory) available per PE.

The allowable values for this option vary depending on the types of CPUs installed on the system. A zero value is not allowed and causes a fatal error. For further information, see The aprun Memory Affinity Options.

-ss

Specifies strict memory containment per NUMA node. When -ss is specified, a PE can allocate only the memory that is local to its assigned NUMA node.

The default is to allow remote-NUMA-node memory allocation to all assigned NUMA nodes. Use this option to find out if restricting each PE's memory access to local-NUMA-node memory affects performance.

-T

Synchronizes the application's stdout and stderr to prevent interleaving of its output.

-t sec

Specifies the per-PE CPU time limit in seconds. The sec time limit is constrained by the CPU time limit on the login node. For example, if the time limit on the login node is 3600 seconds but a -t value of 5000 is specified, the application is constrained to 3600 seconds per PE. If the time limit on the login node is unlimited, the sec value is used (or, if not specified, the time per-PE is unlimited). Determine the CPU time limit by using the limit command (csh) or the ulimit -a command (bash).

For OpenMP or other multithreaded applications, where processes may have child tasks, the time used by the child tasks accumulates against the parent process. It may therefore be necessary to multiply the sec value by the depth value to get a limit approximately equivalent to the same sec value for the PE of a non-threaded application.
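For example, to give each PE of a 4-thread OpenMP application roughly 900 CPU-seconds of work per thread, scale the limit by the depth (all values illustrative):

```shell
% export OMP_NUM_THREADS=4
% aprun -t 3600 -d 4 -n 8 ./omp_app   # 900 s x 4 threads accumulated per PE
```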

: (colon)

Separates the names of executables and their associated options for Multiple Program, Multiple Data (MPMD) mode. A space is required before and after the colon.