OpenMP Overview
The OpenMP API provides a parallel programming model that is portable across shared memory architectures from Cray and other vendors.
OpenMP is enabled with the -homp option.
Supported Version
OMP_DISPLAY_AFFINITY, an OpenMP 5.0 feature, is supported in this release. Cray recommends the use of OMP_DISPLAY_AFFINITY instead of CRAY_OMP_CHECK_AFFINITY. Affinity display environment variables and API calls (OMP_DISPLAY_AFFINITY, OMP_AFFINITY_FORMAT, omp_set_affinity_format, omp_get_affinity_format, omp_display_affinity, and omp_capture_affinity) are supported.
- OpenMP depend clauses are supported on taskwait directives. This is an OpenMP 5.0 feature supported in this release.
- For Cray Classic C/C++, support for OpenMP Random Access Iterators (RAIs) in the C++ Standard Template Library (STL) is deferred. This limitation does not apply to Cray Clang (LLVM-based) C/C++.
- For Cray Classic C/C++, cancellation does not destruct/deallocate implicitly private local variables. It correctly handles explicitly private variables.
- The device clause is not supported. The other mechanisms for selecting a default device are supported: OMP_DEFAULT_DEVICE and omp_set_default_device.
- The only API calls allowed in target regions are: omp_is_initial_device, omp_get_thread_num, omp_get_num_threads, omp_get_team_num, and omp_get_num_teams.
- User-defined reductions are not supported in target regions.
- Individual structure members are not supported in the map clause or the target update construct. Instead, CCE only supports mapping and updating entire structure variables, handling all of the members together as a single aggregate object.
- An untied task that starts execution on a thread and suspends will always resume execution on that same thread.
- simd functions will not vectorize if inlining is disabled or the function definition is not visible at the callsite.
- simd loops containing function calls will not vectorize if inlining is disabled or the function definitions are not visible.
OpenMP Offloading Support
OpenMP 4.5 target directives are supported for targeting NVIDIA GPUs or the current CPU target. An appropriate accelerator target module must be loaded in order to use target directives. When the accelerator target is an NVIDIA GPU, CCE generally maps omp teams to GPU threadblocks and omp simd to GPU threads within a threadblock.
For CCE Classic, omp parallel constructs are limited to a single observable GPU thread, but CCE performs aggressive autothreading and will often map omp do or omp for loops to GPU threads. The loopmark listing file will indicate how each construct maps to GPU parallelism.
For CCE Clang, omp parallel constructs will map to GPU threads if all function calls in the construct have visible definitions and the construct contains only the following OpenMP constructs and API calls: omp barrier, omp for (with a static schedule), omp_get_thread_num, and omp_get_num_threads.
When the accelerator target is the host and a teams construct is encountered, the number of teams that execute the region is determined as follows: the num_teams clause, if present; otherwise the nthreads-var ICV, if it is set to a value greater than 1; otherwise the region executes with one team.
Compiling
OpenMP is disabled by default and must be explicitly enabled. These CCE options affect OpenMP applications: -h [no]omp and -h threadn.
Executing
For OpenMP applications, use both the OMP_NUM_THREADS environment variable to specify the number of threads and the aprun -d depth option to specify the number of CPUs hosting the threads. The number of threads specified by OMP_NUM_THREADS should not exceed the number of cores in the CPU. If neither the OMP_NUM_THREADS environment variable nor the omp_set_num_threads call is used to set the number of OpenMP threads, the system defaults to 1 thread. For further information, including example OpenMP programs, see the Cray Application Developer's Environment User's Guide.
Debugging
The -g option is compatible with the -homp option, and together the options provide debugging support for OpenMP directives. The -g option, when specified with no optimization options or with -O0, provides debugging support identical to specifying the -G0 option. If any optimization is specified, -g is ignored.
OpenMP Implementation Defined Behavior
The OpenMP Application Program Interface Specification presents a list of implementation defined behaviors. The Cray implementation is described in the following sections.
When multiple threads access the same shared memory location and at least one of the accesses is a write, the accesses should be ordered by explicit synchronization to avoid data races and the potential for non-deterministic results. Always use explicit synchronization for any access smaller than one byte.
| ICV | Initial Value | Note |
|---|---|---|
| nthreads-var | 1 | |
| dyn-var | TRUE | Behaves according to Algorithm 2-1 of the specification. |
| run-sched-var | static | |
| stacksize-var | 128 MB | |
| wait-policy-var | ACTIVE | |
| thread-limit-var | 64 | Threads may be dynamically created up to an upper limit of 4 times the number of cores per node. It is up to the programmer to limit oversubscription. |
| max-active-levels-var | 4095 | |
| def-sched-var | static | The chunk size is rounded up to improve alignment for vectorized loops. |
Dynamic Adjustment of Threads
The ICV dyn-var is enabled by default. Threads may be dynamically created up to an upper limit of 4 times the number of cores per node. It is up to the programmer to limit oversubscription.
If a parallel region is encountered while dynamic adjustment of the number of threads is disabled, and the number of threads specified for the parallel region exceeds the number that the runtime system can supply, the program terminates. The number of physical processors actually hosting the threads at any given time is fixed at program startup and is specified by the aprun -d depth option.

The OMP_NESTED environment variable and the omp_set_nested() call control nested parallelism. To enable nesting, set OMP_NESTED to true or use the omp_set_nested() call. Nesting is disabled by default.
Directives and Clauses
- atomic directive
- When supported by the target architecture, atomic directives are lowered into hardware atomic instructions. Otherwise, atomicity is guaranteed with a lock. OpenMP atomic directives are compatible with C11 and C++11 atomic operations, as well as GNU atomic builtins.
- for directive
- For the schedule(guided,chunk) clause, the size of the initial chunk for the master thread and other team members is approximately equal to the trip count divided by the number of threads.
- For the schedule(runtime) clause, the schedule type and, optionally, chunk size can be chosen at runtime by setting the OMP_SCHEDULE environment variable. If this environment variable is not set, the default behavior of the schedule(runtime) clause is as if the schedule(static) clause appeared instead.
- In the absence of the schedule clause, the default schedule is static and the default chunk size is approximately the number of iterations divided by the number of threads.
- The integer type or kind used to compute the iteration count of a collapsed loop is a signed 64-bit integer, regardless of how the original induction variables and loop bounds are defined. If schedule(runtime) is specified and run-sched-var is auto, the Cray implementation generates a static schedule.
- parallel directive
- If a parallel region is encountered while dynamic adjustment of the number of threads is disabled, and the number of threads specified for the parallel region exceeds the number that the runtime system can supply, the program terminates.
- The number of physical processors actually hosting the threads at any given time is fixed at program startup and is specified by the aprun -d depth option.
- The OMP_NESTED environment variable and the omp_set_nested() call control nested parallelism. To enable nesting, set OMP_NESTED to true or use the omp_set_nested() call. Nesting is disabled by default.
- private clause
- If a variable is declared as private, the variable is referenced in the definition of a statement function, and the statement function is used within the lexical extent of the directive construct, then the statement function references the private version of the variable.
- sections construct
- Multiple structured blocks within a single sections construct are scheduled in lexical order and an individual block is assigned to the first thread that reaches it. It is possible for a different thread to execute each section block, or for a single thread to execute multiple section blocks. There is no guaranteed order of execution of the structured blocks within a sections construct.
- single directive
- A single block is assigned to the first thread in the team to reach the block; this thread may or may not be the master thread.
- thread_limit clause
- The thread_limit clause places a limit on the number of threads that a teams construct may create. For NVIDIA GPU accelerator targets, this clause controls the number of CUDA threads per thread block. Only constant integer expressions are supported. If CCE does not support a thread_limit expression, then it will issue a warning message indicating the default value that will be used instead.
Library Routines
- omp_set_num_threads
- Sets nthreads-var to a positive integer. If the argument is less than 1, nthreads-var is set to 1.
- omp_set_schedule
- Sets the schedule type as defined by the current specification. There are no implementation defined schedule types.
- omp_set_max_active_levels
- Sets the max-active-levels-var ICV. Defaults to 4095. If the argument is less than 1, the ICV is set to 1.
- omp_set_dynamic()
- The omp_set_dynamic() routine enables or disables dynamic adjustment of the number of threads available for the execution of subsequent parallel regions by setting the value of the dyn-var ICV. The default is on.
- omp_set_nested()
- The omp_set_nested() routine enables or disables nested parallelism, by setting the nest-var internal control variable (ICV). The default is false.
- omp_get_max_active_levels
- There is a single max-active-levels-var ICV for the entire runtime system. Thus, a call to omp_get_max_active_levels will bind to all threads, regardless of which thread calls it.
Environment Variables
- OMP_SCHEDULE - The default value for this environment variable is static. For the schedule(runtime) clause, the schedule type and, optionally, chunk size can be chosen at run time by setting the OMP_SCHEDULE environment variable.
- OMP_NUM_THREADS - If this environment variable is not set and the omp_set_num_threads() routine is not used to set the number of OpenMP threads, the default is 1 thread. The maximum number of threads per compute node is 4 times the number of allocated processors. If the requested value of OMP_NUM_THREADS is more than the number of threads an implementation can support, the behavior of the program depends on the value of the OMP_DYNAMIC environment variable. If OMP_DYNAMIC is false, the program terminates. If OMP_DYNAMIC is true, the program uses up to 4 times the number of allocated processors.
- OMP_PROC_BIND - When set to false, the OpenMP runtime does not attempt to set or change affinity binding for OpenMP threads. When not false, this environment variable controls the policy for binding threads to places. Care must be taken when using OpenMP affinity binding with other binding mechanisms. For example, when launching an application with ALPS aprun, the -cc cpu affinity binding option (the default) should only be used with OMP_PROC_BIND=false or OMP_PROC_BIND=auto; otherwise, the ALPS/CLE binding will severely over-constrain OpenMP binding. When setting OMP_PROC_BIND to a value other than false or auto, applications should be launched with -cc depth or -cc none. Using -cc depth is particularly important when running multiple PEs per compute node, since it allows each PE to bind to CPUs in non-overlapping subsets of the node. Valid values for this environment variable are true, false, or auto; or a comma-separated list of spread, close, and master. A value of true maps to spread.
  The default value for OMP_PROC_BIND is auto, a Cray-specific extension. The auto binding policy directs the OpenMP runtime library to select the affinity binding setting that it determines to be most appropriate for a given situation. If there is only a single place in the place-partition-var ICV, and that place corresponds to the initial affinity mask of the master thread, then the auto binding policy maps to false (i.e., binding is disabled). Otherwise, the auto binding policy causes threads to bind in a manner that partitions the available places across OpenMP threads.
- OMP_PLACES - This environment variable has no effect if OMP_PROC_BIND=false; when OMP_PROC_BIND is not false, OMP_PLACES defines a set of places, or CPU affinity masks, to which threads are bound. When using the threads, cores, and sockets keywords, places are constructed according to the CPU topology presented by Linux. However, the place list is always constrained by the initial affinity mask of the master thread. As a result, specific numeric CPU identifiers appearing in OMP_PLACES will map onto CPUs in the initial CPU affinity mask. If an application is launched with -cc none, then numeric CPU identifiers will exactly match Linux CPU numbers. If instead it is launched with -cc depth, then numeric CPU identifier 0 will map to the first CPU in the initial affinity mask for the master thread, identifier 1 will map to the second CPU in the initial mask, and so on. This allows the same OMP_PLACES value to be used for all PEs, even when launching multiple PEs per node; the -cc depth setting ensures that each PE begins executing with a non-overlapping initial affinity mask, allowing each instance of the OpenMP runtime to assign thread affinity within those non-overlapping masks.
  The default value of OMP_PLACES depends on the value of OMP_PROC_BIND. If OMP_PROC_BIND is auto, then the default value for OMP_PLACES is cores. Otherwise, the default value of OMP_PLACES is threads.
- OMP_DYNAMIC - The default value is true.
- OMP_NESTED - The default value is false.
- OMP_STACKSIZE - The default value for this environment variable is 128 MB.
- OMP_WAIT_POLICY - Provides a hint to an OpenMP implementation about the desired behavior of waiting threads by setting the wait-policy-var ICV. Possible values are ACTIVE and PASSIVE, as defined by the OpenMP specification, and AUTO, a Cray-specific extension. The default value for this environment variable is AUTO, which directs the OpenMP runtime library to select the most appropriate wait policy for the situation. In general, the AUTO policy behaves like ACTIVE, unless the number of OpenMP threads or affinity binding results in oversubscription of the available hardware processors. If oversubscription is detected, the AUTO policy behaves like PASSIVE.
- OMP_MAX_ACTIVE_LEVELS - The default value is 4095.
- OMP_THREAD_LIMIT - Sets the number of OpenMP threads to use for the entire OpenMP program by setting the thread-limit-var ICV. The Cray implementation defaults to 4 times the number of available processors.
Cray-specific OpenMP API
This section describes OpenMP API routines specific to Cray.
void cray_omp_set_wait_policy( const char *policy );

This routine dynamically sets the wait policy, which is otherwise controlled by the OMP_WAIT_POLICY environment variable. The policy argument provides a hint to the OpenMP runtime library about the desired behavior of waiting threads; acceptable values are AUTO, ACTIVE, or PASSIVE (case insensitive). It is an error to call this routine in an active parallel region.

The OpenMP runtime library supports a "wait policy" and a "contention policy," both of which can be set with the following environment variables:

OMP_WAIT_POLICY=(AUTO|ACTIVE|PASSIVE)
CRAY_OMP_CONTENTION_POLICY=(Automatic|Standard|MonitorMwait)

These environment variables allow the policies to be set once at program launch for the entire execution. However, in some circumstances it is useful to change the policy explicitly at various points during a program's execution. This Cray-specific routine allows the programmer to dynamically change the wait policy (and potentially the contention policy). It addresses the situation in which an application needs OpenMP for the first part of program execution, but there is a clear point after which OpenMP is no longer used. The idle OpenMP threads still consume resources while waiting for more work, degrading performance for the remainder of the application. A passive-waiting policy can eliminate that degradation after OpenMP is no longer needed, while the developer may still want an active-waiting policy for the OpenMP-intensive region of the application. This routine notifies all threads of the policy change at the same time, regardless of whether they are idle or active (to avoid deadlock from waiting and signaling threads using different policies).

CRAY_OMP_CHECK_AFFINITY
Set the CRAY_OMP_CHECK_AFFINITY variable to TRUE at execution time to display affinity binding for each OpenMP thread. The messages contain hostname, process identifier, OS thread identifier, OpenMP thread identifier, and affinity binding.
OpenMP Accelerator Support
The OpenMP 4.5 target directives are supported for targeting NVIDIA GPUs or the current CPU target. An appropriate accelerator target module must be loaded to use target directives.
When targeting NVIDIA GPUs, teams constructs are mapped to CUDA thread blocks and simd constructs are mapped to CUDA threads within a thread block. For teams regions that do not contain any simd constructs, CCE will still take advantage of all available CUDA parallelism, either by automatically parallelizing nested loops across CUDA threads, or by mapping the teams parallelism across both CUDA thread blocks and threads. Currently, parallel constructs appearing within a teams construct are executed with a single thread. CCE will attempt to select an appropriate number of CUDA threads and thread blocks for each construct based on the code that appears in it. For a given teams construct, users may use the num_teams and thread_limit clauses to specify the number of CUDA thread blocks and threads per thread block, respectively.