OpenMP Overview

The OpenMP API provides a parallel programming model that is portable across shared memory architectures from Cray and other vendors. The OpenMP specification is accessible at https://www.openmp.org. Beginning with CCE 9.0, OpenMP is disabled by default and must be explicitly enabled using the -homp option.

Supported Version

CCE supports the OpenMP API, Version 4.5, with the following exceptions and/or enhancements. The most up-to-date exceptions are listed in the crayftn(1) and intro_openmp(7) man pages:
  • OMP_DISPLAY_AFFINITY, an OpenMP 5.0 feature, is supported in this release. Cray recommends the use of OMP_DISPLAY_AFFINITY instead of CRAY_OMP_CHECK_AFFINITY. Affinity display environment variables and API calls (OMP_DISPLAY_AFFINITY, OMP_AFFINITY_FORMAT, omp_set_affinity_format, omp_get_affinity_format, omp_display_affinity, and omp_capture_affinity) are supported. A usage sketch follows these lists.
  • OpenMP depend clauses are supported on taskwait directives. This is an OpenMP 5.0 feature supported in this release.
  • For Cray Classic C/C++, support for OpenMP Random Access Iterators (RAIs) in the C++ Standard Template Library (STL) is deferred. This limitation does not apply to Cray Clang (LLVM-based) C/C++.
  • For Cray Classic C/C++, cancellation does not destruct/deallocate implicitly private local variables. It correctly handles explicitly private variables.
  • The device clause is not supported. The other mechanisms for selecting a default device are supported: OMP_DEFAULT_DEVICE and omp_set_default_device.
  • The only API calls allowed in target regions are: omp_is_initial_device, omp_get_thread_num, omp_get_num_threads, omp_get_team_num, and omp_get_num_teams.
  • User-defined reductions are not supported in target regions.
  • Individual structure members are not supported in the map clause or the target update construct. Instead, CCE only supports mapping and updating entire structure variables, handling all of the members together as a single aggregate object.
The following additional notes apply to CCE's OpenMP support:
  • An untied task that starts execution on a thread and suspends will always resume execution on that same thread.
  • simd functions will not vectorize if inlining is disabled or the function definition is not visible at the callsite.
  • simd loops containing function calls will not vectorize if inlining is disabled or the function definitions are not visible.
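
The following minimal C sketch uses the affinity display API calls noted above; the format string is an assumption chosen for this example, and the same information can be obtained by setting OMP_DISPLAY_AFFINITY=TRUE at execution time.

    #include <omp.h>

    int main(void) {
        /* Equivalent in effect to OMP_DISPLAY_AFFINITY=TRUE, but driven from
           the API; the fields are standard OpenMP 5.0 format specifiers
           (%H host, %P process id, %n thread number, %A affinity binding). */
        omp_set_affinity_format("host=%H pid=%P thread=%n binding=%A");

        #pragma omp parallel
        omp_display_affinity(NULL);   /* NULL means "use the current format" */

        return 0;
    }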

OpenMP Offloading Support

OpenMP 4.5 target directives are supported for targeting NVIDIA GPUs or the current CPU target. An appropriate accelerator target module must be loaded in order to use target directives. When the accelerator target is an NVIDIA GPU, CCE generally maps omp teams to GPU threadblocks and omp simd to GPU threads within a threadblock.

For CCE Classic, omp parallel constructs are limited to a single observable GPU thread, but CCE performs aggressive autothreading and will often map omp do or omp for loops to GPU threads. The loopmark listing file will indicate how each construct maps to GPU parallelism.

For CCE Clang, omp parallel constructs will map to GPU threads if all function calls in the construct have visible definitions and the construct contains only the following OpenMP constructs and API calls: omp barrier, omp for (with a static schedule), omp_get_thread_num, and omp_get_num_threads.

When the accelerator target is the host and a teams construct is encountered, the number of teams that execute the region is determined by the num_teams clause if it is present; otherwise by the nthreads-var ICV, if nthreads-var is set to a value greater than 1; otherwise the region executes with one team.
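
As a minimal sketch of this mapping (assuming an appropriate accelerator target module is loaded), the loop below uses the OpenMP 4.5 composite target teams distribute simd construct; on an NVIDIA GPU target, the teams level generally becomes CUDA thread blocks and the simd level the threads within each block.

    /* Offload a saxpy-style loop; whole arrays are mapped, consistent with
       CCE's support for mapping entire objects rather than structure members. */
    void saxpy(int n, float a, const float *x, float *y)
    {
        #pragma omp target teams distribute simd \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }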

Compiling

OpenMP is disabled by default and must be explicitly enabled. These CCE options affect OpenMP applications:
  • -h [no]omp
  • -h threadn

Executing

For OpenMP applications, use both the OMP_NUM_THREADS environment variable to specify the number of threads and the aprun -d depth option to specify the number of CPUs hosting the threads. The number of threads specified by OMP_NUM_THREADS should not exceed the number of cores in the CPU. If neither the OMP_NUM_THREADS environment variable nor the omp_set_num_threads call is used to set the number of OpenMP threads, the system defaults to 1 thread. For further information, including example OpenMP programs, see the Cray Application Developer's Environment User's Guide.
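
The following minimal C sketch can be used to confirm the thread count at run time; the launch commands in the comment are illustrative only.

    #include <omp.h>
    #include <stdio.h>

    /* Example launch (illustrative):
       export OMP_NUM_THREADS=8
       aprun -d 8 ./a.out          # reserve 8 CPUs per PE for the threads */
    int main(void) {
        #pragma omp parallel
        {
            #pragma omp single
            printf("running with %d OpenMP threads\n", omp_get_num_threads());
        }
        return 0;
    }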

Debugging

The -g option is compatible with the -homp option, and together the options provide debugging support for OpenMP directives. The -g option, when specified with no optimization options or with -O0, provides debugging support identical to specifying the -G0 option. If any optimization is specified, -g is ignored.

OpenMP Implementation Defined Behavior

The OpenMP Application Program Interface Specification presents a list of implementation defined behaviors. The Cray implementation is described in the following sections.

When multiple threads access the same shared memory location and at least one of the accesses is a write, the threads should be ordered by explicit synchronization to avoid data races and the potential for non-deterministic results. Always use explicit synchronization for any access smaller than one byte (for example, bit-field accesses).
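
As a brief illustration, the sketch below orders concurrent updates to a shared counter with an atomic directive; without the explicit synchronization, the unordered updates would constitute a data race.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int counter = 0;

        #pragma omp parallel
        {
            /* At least one thread writes the shared location, so the accesses
               are ordered explicitly with an atomic update. */
            #pragma omp atomic
            counter += 1;
        }

        printf("counter = %d\n", counter);   /* equals the number of threads */
        return 0;
    }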

Table 1. Initial Values of OpenMP ICVs

ICV                     Initial Value   Note
nthreads-var            1
dyn-var                 TRUE            Behaves according to Algorithm 2-1 of the specification.
run-sched-var           static
stacksize-var           128 MB
wait-policy-var         ACTIVE
thread-limit-var        64              Threads may be dynamically created up to an upper limit of 4 times the number of cores per node. It is up to the programmer to limit oversubscription.
max-active-levels-var   4095
def-sched-var           static          The chunk size is rounded up to improve alignment for vectorized loops.

Dynamic Adjustment of Threads

The ICV dyn-var is enabled by default. Threads may be dynamically created up to an upper limit of 4 times the number of cores per node; it is up to the programmer to limit oversubscription.

If a parallel region is encountered while dynamic adjustment of the number of threads is disabled, and the number of threads specified for the parallel region exceeds the number that the runtime system can supply, the program terminates. The number of physical processors actually hosting the threads at any given time is fixed at program startup and is specified by the aprun -d depth option. The OMP_NESTED environment variable and the omp_set_nested() call control nested parallelism. To enable nesting, set OMP_NESTED to true or use the omp_set_nested() call. Nesting is disabled by default.
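
A minimal sketch of enabling dynamic adjustment and nested parallelism from the API (equivalent to setting OMP_DYNAMIC and OMP_NESTED) might look like this:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        omp_set_dynamic(1);   /* allow the runtime to adjust team sizes (dyn-var) */
        omp_set_nested(1);    /* enable nested parallelism (nest-var) */

        #pragma omp parallel num_threads(4)
        {
            #pragma omp parallel num_threads(2)   /* nested inner region */
            {
                #pragma omp single
                printf("inner team of %d threads at nesting level %d\n",
                       omp_get_num_threads(), omp_get_level());
            }
        }
        return 0;
    }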

Directives and Clauses

  • atomic directive
    • When supported by the target architecture, atomic directives are lowered into hardware atomic instructions. Otherwise, atomicity is guaranteed with a lock. OpenMP atomic directives are compatible with C11 and C++11 atomic operations, as well as GNU atomic builtins.
  • for directive
    • For the schedule(guided,chunk) clause, the size of the initial chunk for the master thread and other team members is approximately equal to the trip count divided by the number of threads.
    • For the schedule(runtime) clause, the schedule type and, optionally, chunk size can be chosen at runtime by setting the OMP_SCHEDULE environment variable. If this environment variable is not set, the default behavior of the schedule(runtime) clause is as if the schedule(static) clause appeared instead.
    • In the absence of the schedule clause, the default schedule is static and the default chunk size is approximately the number of iterations divided by the number of threads.
    • The integer type or kind used to compute the iteration count of a collapsed loop is a signed 64-bit integer, regardless of how the original induction variables and loop bounds are defined. If the schedule(runtime) clause is specified and run-sched-var is auto, the Cray implementation generates a static schedule. A sketch illustrating schedule(runtime) with a collapsed loop follows this list.
  • parallel directive
    • If a parallel region is encountered while dynamic adjustment of the number of threads is disabled, and the number of threads specified for the parallel region exceeds the number that the runtime system can supply, the program terminates.
    • The number of physical processors actually hosting the threads at any given time is fixed at program startup and is specified by the aprun -d depth option.
    • The OMP_NESTED environment variable and the omp_set_nested() call control nested parallelism. To enable nesting, set OMP_NESTED to true or use the omp_set_nested() call. Nesting is disabled by default.
  • private clause
    • If a variable is declared as private, the variable is referenced in the definition of a statement function, and the statement function is used within the lexical extent of the directive construct, then the statement function references the private version of the variable.
  • sections construct
    • Multiple structured blocks within a single sections construct are scheduled in lexical order and an individual block is assigned to the first thread that reaches it. It is possible for a different thread to execute each section block, or for a single thread to execute multiple section blocks. There is no guaranteed order of execution of the structured blocks within a sections construct.
  • single directive
    • A single block is assigned to the first thread in the team to reach the block; this thread may or may not be the master thread.
  • thread_limit clause
    • The thread_limit clause places a limit on the number of threads that a teams construct may create. For NVIDIA GPU accelerator targets, this clause controls the number of CUDA threads per thread block. Only constant integer expressions are supported. If CCE does not support a thread_limit expression, then it will issue a warning message indicating the default value that will be used instead.
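
The sketch below illustrates the runtime schedule and collapsed-loop behavior described above; setting OMP_SCHEDULE (for example, to "guided,64") selects the schedule at run time, and static is used if the variable is unset.

    /* The two loops are collapsed into one iteration space whose trip count is
       computed with signed 64-bit arithmetic; the schedule comes from
       OMP_SCHEDULE at run time. */
    void scale(int n, int m, double **a, double s)
    {
        #pragma omp parallel for collapse(2) schedule(runtime)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < m; ++j)
                a[i][j] *= s;
    }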

Library Routines

  • omp_set_num_threads
    • Sets nthreads-var to a positive integer. If the argument is < 1, then set nthreads-var to 1.
  • omp_set_schedule
    • Sets the schedule type as defined by the current specification. There are no implementation defined schedule types.
  • omp_set_max_active_levels
    • Sets the max-active-levels-var ICV. Defaults to 4095. If the argument is < 1, the value is set to 1. A usage sketch follows this list.
  • omp_set_dynamic()
    • The omp_set_dynamic() routine enables or disables dynamic adjustment of the number of threads available for the execution of subsequent parallel regions by setting the value of the dyn-var ICV. The default is on.
  • omp_set_nested()
    • The omp_set_nested() routine enables or disables nested parallelism, by setting the nest-var internal control variable (ICV). The default is false.
  • omp_get_max_active_levels
    • There is a single max-active-levels-var ICV for the entire runtime system. Thus, a call to omp_get_max_active_levels will bind to all threads, regardless of which thread calls it.
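
A short sketch combining several of these routines (the specific values are illustrative only):

    #include <omp.h>

    int main(void) {
        omp_set_num_threads(4);                   /* values < 1 are clamped to 1 */
        omp_set_schedule(omp_sched_dynamic, 8);   /* used by schedule(runtime) loops */
        omp_set_max_active_levels(2);             /* cap nested parallelism */

        #pragma omp parallel for schedule(runtime)
        for (int i = 0; i < 100; ++i) {
            /* work */
        }
        return 0;
    }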

Environment Variables

OMP_SCHEDULE
The default value for this environment variable is static. For the schedule(runtime) clause, the schedule type and, optionally, chunk size can be chosen at run time by setting the OMP_SCHEDULE environment variable.
OMP_NUM_THREADS
If this environment variable is not set and the omp_set_num_threads() routine is not used to set the number of OpenMP threads, the default is 1 thread. The maximum number of threads per compute node is 4 times the number of allocated processors. If the requested value of OMP_NUM_THREADS is more than the number of threads an implementation can support, the behavior of the program depends on the value of the OMP_DYNAMIC environment variable. If OMP_DYNAMIC is false, the program terminates. If OMP_DYNAMIC is true, it uses up to 4 times the number of allocated processors.
OMP_PROC_BIND

When set to false, the OpenMP runtime does not attempt to set or change affinity binding for OpenMP threads. When not false, this environment variable controls the policy for binding threads to places. Care must be taken when using OpenMP affinity binding with other binding mechanisms. For example, when launching an application with ALPS aprun, the -cc cpu affinity binding option (the default) should only be used with OMP_PROC_BIND=false or OMP_PROC_BIND=auto; otherwise, the ALPS/CLE binding will severely over-constrain OpenMP binding. When setting OMP_PROC_BIND to a value other than false or auto, applications should be launched with -cc depth or -cc none. Using -cc depth is particularly important when running multiple PEs per compute node, since it allows each PE to bind to CPUs in non-overlapping subsets of the node. Valid values for this environment variable are true, false, or auto, or a comma-separated list of spread, close, and master. A value of true maps to spread.

The default value for OMP_PROC_BIND is auto, a Cray-specific extension. The auto binding policy directs the OpenMP runtime library to select the affinity binding setting that it determines to be most appropriate for a given situation. If there is only a single place in the place-partition-var ICV, and that place corresponds to the initial affinity mask of the master thread, then the auto binding policy maps to false (i.e., binding is disabled). Otherwise, the auto binding policy causes threads to bind in a manner that partitions the available places across OpenMP threads.

OMP_PLACES

This environment variable has no effect if OMP_PROC_BIND=false; when OMP_PROC_BIND is not false, OMP_PLACES defines a set of places, or CPU affinity masks, to which threads are bound. When using the threads, cores, and sockets keywords, places are constructed according to the CPU topology presented by Linux. However, the place list is always constrained by the initial affinity mask of the master thread. As a result, specific numeric CPU identifiers appearing in OMP_PLACES map onto CPUs in the initial CPU affinity mask. If an application is launched with -cc none, numeric CPU identifiers exactly match Linux CPU numbers. If instead it is launched with -cc depth, numeric CPU identifier 0 maps to the first CPU in the initial affinity mask of the master thread, identifier 1 maps to the second CPU in the initial mask, and so on. This allows the same OMP_PLACES value to be used for all PEs, even when launching multiple PEs per node: the -cc depth setting ensures that each PE begins executing with a non-overlapping initial affinity mask, allowing each instance of the OpenMP runtime to assign thread affinity within those non-overlapping masks.

The default value of OMP_PLACES depends on the value of OMP_PROC_BIND. If OMP_PROC_BIND is auto, then the default value for OMP_PLACES is cores. Otherwise, the default value of OMP_PLACES is threads.
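
To see how the runtime interpreted OMP_PROC_BIND and OMP_PLACES for a given launch, a program can query the standard OpenMP 4.5 affinity routines; a minimal sketch:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        /* Report the binding policy and the place list derived from
           OMP_PROC_BIND and OMP_PLACES. */
        printf("proc-bind policy: %d, number of places: %d\n",
               (int)omp_get_proc_bind(), omp_get_num_places());

        #pragma omp parallel
        {
            #pragma omp critical
            printf("thread %d is executing in place %d\n",
                   omp_get_thread_num(), omp_get_place_num());
        }
        return 0;
    }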

OMP_DYNAMIC
The default value is true.
OMP_NESTED
The default value is false.
OMP_STACKSIZE
The default value for this environment variable is 128 MB.
OMP_WAIT_POLICY
Provides a hint to an OpenMP implementation about the desired behavior of waiting threads by setting the wait-policy-var ICV. Possible values are ACTIVE and PASSIVE, as defined by the OpenMP specification, and AUTO, a Cray-specific extension. The default value for this environment variable is AUTO, which directs the OpenMP runtime library to select the most appropriate wait policy for the situation. In general, the AUTO policy behaves like ACTIVE, unless the number of OpenMP threads or affinity binding results in oversubscription of the available hardware processors. If oversubscription is detected, the AUTO policy behaves like PASSIVE.
OMP_MAX_ACTIVE_LEVELS
The default value is 4095.
OMP_THREAD_LIMIT
Sets the number of OpenMP threads to use for the entire OpenMP program by setting the thread-limit-var ICV. The Cray implementation defaults to 4 times the number of available processors.

Cray-specific OpenMP API

This section describes OpenMP API extensions specific to Cray.

void cray_omp_set_wait_policy( const char *policy );
This routine allows dynamic modification of the wait-policy-var ICV value, which corresponds to the OMP_WAIT_POLICY environment variable. The policy argument provides a hint to the OpenMP runtime library environment about the desired behavior of waiting threads; acceptable values are AUTO, ACTIVE, or PASSIVE (case insensitive). It is an error to call this routine in an active parallel region. The OpenMP runtime library supports a "wait policy" and a "contention policy," both of which can be set with the following environment variables:
OMP_WAIT_POLICY=(AUTO|ACTIVE|PASSIVE)
CRAY_OMP_CONTENTION_POLICY=(Automatic|Standard|MonitorMwait)
These environment variables allow the policies to be set once at program launch for the entire execution. However, in some circumstances it would be useful for the programmer to explicitly change the policy at various points during a program's execution. This Cray-specific routine allows the programmer to dynamically change the wait policy (and potentially the contention policy). This addresses the situation when an application needs OpenMP for the first part of program execution, but there is a clear point after which OpenMP is no longer used. Unfortunately, the idle OpenMP threads still consume resources since they are waiting for more work, resulting in performance degradation for the remainder of the application. A passive-waiting policy might eliminate the performance degradation after OpenMP is no longer needed, but the developer may still want an active-waiting policy for the OpenMP-intensive region of the application. This routine notifies all threads of the policy change at the same time, regardless of whether they are idle or active (to avoid deadlock from waiting and signaling threads using different policies).
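
A minimal usage sketch follows; the explicit prototype is repeated here only in case the routine is not declared by the headers in use (consult the intro_openmp(7) man page for the declaring header).

    #include <omp.h>

    void cray_omp_set_wait_policy(const char *policy);   /* as documented above */

    void end_of_openmp_phase(void)
    {
        /* The OpenMP-intensive part of the run is finished; switch idle threads
           to passive waiting so they stop consuming processor resources.
           Must not be called from inside an active parallel region. */
        cray_omp_set_wait_policy("passive");   /* argument is case insensitive */
    }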

CRAY_OMP_CHECK_AFFINITY

Set the CRAY_OMP_CHECK_AFFINITY variable to TRUE at execution time to display affinity binding for each OpenMP thread. The messages contain hostname, process identifier, OS thread identifier, OpenMP thread identifier, and affinity binding.

OpenMP Accelerator Support

The OpenMP 4.5 target directives are supported for targeting NVIDIA GPUs or the current CPU target. An appropriate accelerator target module must be loaded to use target directives.

When targeting NVIDIA GPUs, teams constructs are mapped to CUDA thread blocks and simd constructs are mapped to CUDA threads within a thread block. For teams regions that do not contain any simd constructs, CCE will still take advantage of all available CUDA parallelism, either by automatically parallelizing nested loops across CUDA threads, or by mapping the teams parallelism across both CUDA thread blocks and threads. Currently, parallel constructs appearing within a teams construct are executed with a single thread. CCE will attempt to select an appropriate number of CUDA threads and thread blocks for each construct based on the code that appears in it. For a given teams construct, users may use the num_teams and thread_limit clauses to specify the number of CUDA thread blocks and threads per thread block, respectively.
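
For example, the following sketch requests a specific launch shape with the num_teams and thread_limit clauses; on an NVIDIA GPU target this selects the number of CUDA thread blocks and the (constant) limit on threads per block.

    /* Initialize an array on the device using 128 teams of at most 256 threads;
       thread_limit must be a constant integer expression. */
    void device_init(int n, double *a)
    {
        #pragma omp target teams distribute simd \
                num_teams(128) thread_limit(256) map(from: a[0:n])
        for (int i = 0; i < n; ++i)
            a[i] = 0.0;
    }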