OpenMP Overview
The OpenMP API provides a parallel programming model that is portable across shared memory architectures from Cray and other vendors.
Supported Version
The supported OpenMP specification is implemented with the following limitations:
- Task switching is not implemented. The thread that starts executing a task is the thread that finishes the task.
- Support for OpenMP Random Access Iterators (RAIs) in the C++ Standard Template Library (STL) is deferred.
- Cancellation does not destruct/deallocate implicitly private local variables. It correctly handles explicitly private variables.
- simd functions will not vectorize if the function definition is not visible for inlining at the callsite.
- The device clause is not supported. The other mechanisms for selecting a default device are supported: OMP_DEFAULT_DEVICE and omp_set_default_device.
- The only API calls allowed in target regions are: omp_is_initial_device, omp_get_thread_num, omp_get_num_threads, omp_get_team_num, and omp_get_num_teams.
- Parallel constructs are supported in target regions, but they are limited to one thread.
- User-defined reductions are not supported in target regions.
- Individual structure members are not supported in the map clause or the target update construct. Instead, CCE only supports mapping and updating entire structure variables, handling all of the members together as a single aggregate object.
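A minimal Fortran sketch (the derived type and variable names are hypothetical) of the supported form: map the entire structure variable rather than an individual member.

TYPE GRID
   REAL :: X(1000), Y(1000)
END TYPE GRID
TYPE(GRID) :: G
!$OMP TARGET MAP(TOFROM: G)
! ... compute on G%X and G%Y on the device ...
!$OMP END TARGET

A clause such as map(tofrom: G%X) names an individual member and is not supported; map the whole variable G instead.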
Compiling
By default, the CCE compiler recognizes OpenMP directives. These CCE options affect OpenMP applications:
- -h [no]omp
- -h threadn
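For example, assuming the ftn compiler driver and a hypothetical source file name, the first command compiles with OpenMP directives recognized (the default) and the second ignores them:

ftn -h omp app.f90
ftn -h noomp app.f90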
Executing
For OpenMP applications, use both the OMP_NUM_THREADS environment variable to specify the number of threads and the aprun -d depth option to specify the number of CPUs hosting the threads. The number of threads specified by OMP_NUM_THREADS should not exceed the number of cores in the CPU. If neither the OMP_NUM_THREADS environment variable nor the omp_set_num_threads call is used to set the number of OpenMP threads, the system defaults to 1 thread. For further information, including example OpenMP programs, see the Cray Application Developer's Environment User's Guide.
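For example (assuming a bash-like shell), to run with 8 threads and reserve 8 CPUs to host them:

export OMP_NUM_THREADS=8
aprun -d 8 ./ompProgram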
Debugging
The -g option provides debugging support for OpenMP directives. When specified with no optimization options or with -O0, -g provides debugging support identical to specifying the -G0 option. If any optimization is specified, -g is ignored. This level of debugging implies -homp, which means that most optimizations are disabled while OpenMP directives are still recognized, and -h fp0. To debug without OpenMP, use -g -xomp or -g -hnoomp, which disables OpenMP and turns on debugging.
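For example, assuming the ftn driver and a hypothetical source file, the first command compiles for debugging with OpenMP directives still recognized, and the second disables OpenMP entirely:

ftn -g -O0 app.f90
ftn -g -h noomp app.f90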
OpenMP Implementation Defined Behavior
The OpenMP Application Program Interface Specification presents a list of implementation defined behaviors. The Cray implementation is described in the following sections.
When multiple threads access the same shared memory location and at least one of the accesses is a write, the threads should be ordered by explicit synchronization to avoid data races and the potential for non-deterministic results. Always use explicit synchronization for any access smaller than one byte.
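A minimal Fortran sketch (SUM, A, and N are hypothetical names) of ordering conflicting updates to a shared location with explicit synchronization; a reduction clause would usually perform better, but this illustrates the explicit ordering described above.

!$OMP PARALLEL DO PRIVATE(I)
DO I = 1, N
!$OMP ATOMIC UPDATE
   SUM = SUM + A(I)
END DO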
| ICV | Initial Value | Note |
|---|---|---|
| nthreads-var | 1 | |
| dyn-var | TRUE | Behaves according to Algorithm 2-1 of the specification. |
| run-sched-var | static, 0 | |
| stacksize-var | 128 MB | |
| wait-policy-var | ACTIVE | |
| thread-limit-var | 64 | Threads may be dynamically created up to an upper limit of 4 times the number of cores per node. It is up to the programmer to limit oversubscription. |
| max-active-levels-var | 4095 | |
| def-sched-var | static, 0 | The chunk size is rounded up to improve alignment for vectorized loops. |
Dynamic Adjustment of Threads
The ICV dyn-var is enabled by default. Threads may be dynamically created up to an upper limit which is 4 times the number of cores/node. It is up to the programmer to try to limit oversubscription.
If a parallel region is encountered while dynamic adjustment of the number of threads is disabled, and the number of threads specified for the parallel region exceeds the number that the runtime system can supply, the program terminates. The number of physical processors actually hosting the threads at any given time is fixed at program startup and is specified by the aprun -d depth option. The OMP_NESTED environment variable and the omp_set_nested() call control nested parallelism. To enable nesting, set OMP_NESTED to true or use the omp_set_nested() call. Nesting is disabled by default.
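A minimal sketch (assuming the omp_lib module) of controlling these settings from the program rather than the environment:

USE omp_lib
CALL omp_set_dynamic(.FALSE.)   ! disable dynamic adjustment of the thread count
CALL omp_set_nested(.TRUE.)     ! enable nested parallel regions
CALL omp_set_num_threads(8)     ! request 8 threads for subsequent parallel regions

With dynamic adjustment disabled, requesting more threads than the runtime can supply terminates the program, as described above.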
Tasks
There are no untied tasks in this implementation of OpenMP. There are also no implementation-defined task scheduling points.
Directives and Clauses
- atomic directive
- When supported by the target architecture, atomic directives are lowered into hardware atomic instructions. Otherwise, atomicity is guaranteed with a lock. OpenMP atomic directives are compatible with C11 and C++11 atomic operations, as well as GNU atomic builtins.
- do and parallel do directives
- For the schedule(guided,chunk) clause, the size of the initial chunk for the master thread and other team members is approximately equal to the trip count divided by the number of threads.
- For the schedule(runtime) clause, the schedule type and chunk size can be chosen at run time by setting the OMP_SCHEDULE environment variable. If this environment variable is not set, the schedule type and chunk size default to static and 0, respectively (a short sketch follows this list).
- In the absence of the schedule clause, the default schedule is static and the default chunk size is approximately the number of iterations divided by the number of threads.
- parallel directive
- If a parallel region is encountered while dynamic adjustment of the number of threads is disabled, and the number of threads specified for the parallel region exceeds the number that the runtime system can supply, the program terminates.
- The number of physical processors actually hosting the threads at any given time is fixed at program startup and is specified by the aprun -d depth option.
- The OMP_NESTED environment variable and the omp_set_nested() call control nested parallelism. To enable nesting, set OMP_NESTED to true or use the omp_set_nested() call. Nesting is disabled by default.
- loop directive
- The integer type or kind used to compute the iteration count of a collapsed loop is a signed 64-bit integer, regardless of how the original induction variables and loop bounds are defined. If schedule(runtime) is specified and run-sched-var is auto, the Cray implementation generates a static schedule.
- private clause
- If a variable is declared as private, the variable is referenced in the definition of a statement function, and the statement function is used within the lexical extent of the directive construct, then the statement function references the private version of the variable.
- sections construct
- Multiple structured blocks within a single sections construct are scheduled in lexical order, and an individual block is assigned to the first thread that reaches it. It is possible for a different thread to execute each section block, or for a single thread to execute multiple section blocks. There is no guaranteed order of execution of the structured blocks within a sections construct.
- single directive
- A single block is assigned to the first thread in the team to reach the block; this thread may or may not be the master thread.
- threadprivate directive
- The threadprivate directive specifies that variables are replicated, with each thread having its own copy. If the dynamic threads mechanism is enabled, the definition and association status of a thread's copy of the variable is undefined, and the allocation status of an allocatable array is undefined.
- thread_limit clause
- The thread_limit clause places a limit on the number of threads that a teams construct may create. For NVIDIA GPU accelerator targets, this clause controls the number of CUDA threads per thread block. Only constant integer expressions are supported. If CCE does not support a thread_limit expression, then it will issue a warning message indicating the default value that will be used instead.
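As a brief illustration of the schedule-related items above (the loop and array names are hypothetical), the following sketch defers the schedule choice to run time:

!$OMP PARALLEL DO SCHEDULE(RUNTIME) PRIVATE(I)
DO I = 1, N
   A(I) = A(I) + X*B(I)
END DO

Setting OMP_SCHEDULE, for example to "guided,4", then selects the schedule at launch; if the variable is unset, the static schedule with chunk size 0 described above is used.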
Library Routines
- omp_set_num_threads
- Sets nthreads-var to a positive integer. If the argument is less than 1, nthreads-var is set to 1.
- omp_set_schedule
- Sets the schedule type as defined by the current specification. There are no implementation defined schedule types.
- omp_set_max_active_levels
- Sets the max-active-levels-var ICV. The default is 4095. If the argument is less than 1, the value is set to 1.
- omp_set_dynamic()
- The omp_set_dynamic() routine enables or disables dynamic adjustment of the number of threads available for the execution of subsequent parallel regions by setting the value of the dyn-var ICV. The default is on.
- omp_set_nested()
- The omp_set_nested() routine enables or disables nested parallelism, by setting the nest-var internal control variable (ICV). The default is false.
- omp_get_max_active_levels
- There is a single max-active-levels-var ICV for the entire runtime system. Thus, a call to omp_get_max_active_levels will bind to all threads, regardless of which thread calls it.
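A minimal sketch (assuming the omp_lib module) of using these routines with the defaults described above:

USE omp_lib
CALL omp_set_num_threads(16)                ! arguments less than 1 are raised to 1
CALL omp_set_max_active_levels(2)           ! default is 4095
CALL omp_set_schedule(omp_sched_guided, 8)  ! only the standard schedule kinds are defined
PRINT *, omp_get_max_active_levels()        ! a single ICV, shared by all threads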
Runtime Library Definitions
It is implementation defined whether the include file omp_lib.h or the module omp_lib (or both) is provided. It is implementation defined whether any of the OpenMP runtime library routines that take an argument are extended with a generic interface so that arguments of different Fortran KIND types can be accommodated. Both omp_lib.h and the module omp_lib are provided. Cray Fortran uses generic interfaces for these routines. If an OMP runtime library routine is defined to be generic, use of arguments of kinds other than those specified by the OMP_*_KIND constants is undefined.
Environment Variables
OMP_SCHEDULE - The default values for this environment variable are static for type and 0 for chunk. For the schedule(guided,chunk) clause, the size of the initial chunk for the master thread and other team members is approximately equal to the trip count divided by the number of threads. For the schedule(runtime) clause, the schedule type and chunk size can be chosen at run time by setting the OMP_SCHEDULE environment variable. If this environment variable is not set, the schedule type and chunk size default to static and 0, respectively. In the absence of the schedule clause, the default schedule is static and the default chunk size is approximately the number of iterations divided by the number of threads.
OMP_NUM_THREADS - If this environment variable is not set and the omp_set_num_threads() routine is not used to set the number of OpenMP threads, the default is 1 thread. The maximum number of threads per compute node is 4 times the number of allocated processors. If the requested value of OMP_NUM_THREADS is more than the number of threads an implementation can support, the behavior of the program depends on the value of the OMP_DYNAMIC environment variable. If OMP_DYNAMIC is false, the program terminates. If OMP_DYNAMIC is true, it uses up to 4 times the number of allocated processors. For example, on an 8-core Cray XE system, this means the program can use up to 32 threads per compute node.
OMP_PROC_BIND - The default value is false. When set to false, the OpenMP runtime does not attempt to set or change affinity binding for OpenMP threads. When not false, this environment variable controls the policy for binding threads to places (as specified by the OMP_PLACES environment variable). Care must be taken when using OpenMP affinity binding with other binding mechanisms. For example, when launching an application with ALPS aprun, the -cc cpu affinity binding option (the default) should only be used with OMP_PROC_BIND=false. Otherwise, the ALPS/CLE binding will severely over-constrain OpenMP binding. When setting OMP_PROC_BIND to a value other than false, applications should be launched with -cc depth or -cc none. Using -cc depth is particularly important when running multiple PEs per compute node, since it allows each PE to bind to CPUs in non-overlapping subsets of the node. Valid values for this environment variable are true, false, spread, close, and master. A value of true is equivalent to spread.
OMP_PLACES - The default value is threads. This environment variable has no effect if OMP_PROC_BIND=false (the default); when OMP_PROC_BIND is not false, OMP_PLACES defines a set of places, or CPU affinity masks, to which threads are bound. When using the threads, cores, and sockets keywords, places are constructed according to the CPU topology presented by Linux. However, the place list is always constrained by the initial CPU mask of the master thread. As a result, specific numeric CPU identifiers appearing in OMP_PLACES will map onto CPUs in the initial CPU affinity mask. If an application is launched with -cc none, then numeric CPU identifiers will exactly match Linux CPU numbers. If instead it is launched with -cc depth, then numeric CPU identifier 0 will map to the first CPU in the initial affinity mask for the master thread, identifier 1 will map to the second CPU in the initial mask, and so on. This allows the same OMP_PLACES value to be used for all PEs, even when launching multiple PEs per node; the -cc depth setting ensures that each PE begins executing with a non-overlapping initial affinity mask, allowing each instance of the OpenMP runtime to assign thread affinity within those non-overlapping masks.
OMP_DYNAMIC - The default value is true.
OMP_NESTED - The default value is false.
OMP_STACKSIZE - The default value for this environment variable is 128 MB.
OMP_WAIT_POLICY - Provides a hint to an OpenMP implementation about the desired behavior of waiting threads by setting the wait-policy-var ICV. A compliant OpenMP implementation may or may not abide by the setting of the environment variable. The default value for this environment variable is active.
OMP_MAX_ACTIVE_LEVELS - The default value is 4095.
OMP_THREAD_LIMIT - Sets the number of OpenMP threads to use for the entire OpenMP program by setting the thread-limit-var ICV. The Cray implementation defaults to 4 times the number of available processors.
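As a hedged example (assuming a bash-like shell, an ALPS launch, and a hypothetical program name), the affinity-related variables above might be combined as follows:

export OMP_NUM_THREADS=8
export OMP_PROC_BIND=close
export OMP_PLACES=cores
aprun -n 4 -N 4 -d 8 -cc depth ./ompProgram

Launching with -cc depth gives each PE a non-overlapping initial affinity mask, so the same OMP_PLACES value can be used for every PE, as described above.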
Cray-specific OpenMP API
This section describes OpenMP API features specific to Cray.
subroutine cray_omp_set_wait_policy ( policy )
character(*), intent(in) :: policy
This routine dynamically sets the wait policy, providing run-time control similar to the OMP_WAIT_POLICY environment variable. The policy argument provides a hint to the OpenMP runtime library environment about the desired behavior of waiting threads; acceptable values are ACTIVE or PASSIVE (case insensitive). It is an error to call this routine in an active parallel region. The OpenMP runtime library supports a "wait policy" and a "contention policy," both of which can be set with the following environment variables:

OMP_WAIT_POLICY=(ACTIVE|PASSIVE)
CRAY_OMP_CONTENTION_POLICY=(Automatic|Standard|MonitorMwait)

These environment variables allow the policies to be set once at program launch for the entire execution. However, in some circumstances it is useful for the programmer to explicitly change the policy at various points during a program's execution. This Cray-specific routine allows the programmer to dynamically change the wait policy (and potentially the contention policy). This addresses the situation in which an application needs OpenMP for the first part of program execution, but there is a clear point after which OpenMP is no longer used. Unfortunately, the idle OpenMP threads still consume resources since they are waiting for more work, resulting in performance degradation for the remainder of the application. A passive-waiting policy might eliminate the performance degradation after OpenMP is no longer needed, but the developer may still want an active-waiting policy for the OpenMP-intensive region of the application. This routine notifies all threads of the policy change at the same time, regardless of whether they are idle or active (to avoid deadlock from waiting and signaling threads using different policies).
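For example, an application that has finished its OpenMP-intensive phase might switch the remaining idle threads to passive waiting (a sketch; the call must not appear in an active parallel region):

CALL cray_omp_set_wait_policy('PASSIVE')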
CRAY_OMP_CHECK_AFFINITY
Set the CRAY_OMP_CHECK_AFFINITY environment variable to TRUE at execution time to display affinity binding for each OpenMP thread. The messages contain hostname, process identifier, OS thread identifier, OpenMP thread identifier, and affinity binding.
omp_lib
If the omp_lib module is not used and the kind of the actual argument does not match the kind of the dummy argument, the behavior of the procedure is undefined.
omp_get_wtime, omp_get_wtick
These procedures return real(kind=8) values instead of double precision values.
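A minimal sketch (assuming the omp_lib module) of timing a region with these procedures:

USE omp_lib
REAL(KIND=8) :: T0, T1
T0 = omp_get_wtime()
! ... work to be timed ...
T1 = omp_get_wtime()
PRINT *, 'elapsed seconds:', T1 - T0, ' timer resolution:', omp_get_wtick()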
OpenMP Accelerator Support
The OpenMP 4.5 target directives are supported for targeting NVIDIA GPUs or the current CPU target. An appropriate accelerator target module must be loaded to use target directives.
When targeting NVIDIA GPUs, teams constructs are mapped to CUDA thread blocks and simd constructs are mapped to CUDA threads within a thread block. For teams regions that do not contain any simd constructs, CCE will still take advantage of all available CUDA parallelism, either by automatically parallelizing nested loops across CUDA threads, or by mapping the teams parallelism across both CUDA thread blocks and threads. Currently, parallel constructs appearing within a teams construct are executed with a single thread. CCE will attempt to select an appropriate number of CUDA threads and thread blocks for each construct based on the code that appears in it. For a given teams construct, users may use the num_teams and thread_limit clauses to specify the number of CUDA thread blocks and threads per thread block, respectively.
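A minimal sketch (array and bound names are hypothetical) of the mapping described above: the teams construct supplies the CUDA thread blocks and the simd construct supplies the CUDA threads within each block.

!$OMP TARGET TEAMS DISTRIBUTE MAP(TOFROM: A) MAP(TO: B) NUM_TEAMS(256) THREAD_LIMIT(128)
DO I = 1, N
!$OMP SIMD
   DO J = 1, M
      A(I,J) = A(I,J) + B(I,J)
   END DO
END DO
!$OMP END TARGET TEAMS DISTRIBUTE

The thread_limit value must be a constant integer expression, as noted in the thread_limit clause description above.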
Optimizations
A certain amount of overhead is associated with multiprocessing a loop. If the work occurring in the loop is small, the loop can actually run slower by multiprocessing than by single processing. To avoid this, make the amount of work inside the multiprocessed region as large as possible, as is shown in the following examples.
DO K = 1, N
DO I = 1, N
DO J = 1, N
A(I,J) = A(I,J) + B(I,K) * C(K,J)
END DO
END DO
END DO

For the preceding code fragment, parallelize the J loop or the I loop. The K loop cannot be parallelized because different iterations of the K loop read and write the same values of A(I,J). Try to parallelize the outermost DO loop if possible, because it encloses the most work. In this example, that is the I loop. For this example, use the technique called loop interchange. Although the parallelizable loops are not the outermost ones, the loops can be reordered to make one of them outermost.
!$OMP PARALLEL DO PRIVATE(I, J, K)
DO I = 1, N
DO K = 1, N
DO J = 1, N
A(I,J) = A(I,J) + B(I,K) * C(K,J)
END DO
END DO
END DO

Now the parallelizable loop encloses more work and shows better performance. In practice, relatively few loops can be reordered in this way. However, it does occasionally happen that several loops in a nest of loops are candidates for parallelization. In such a case, it is usually best to parallelize the outermost one.
Occasionally, the only loop available to be parallelized has a fairly small amount of work. It may be worthwhile to force certain loops to run without parallelism or to select between a parallel version and a serial version, on the basis of the length of the loop.
!$OMP PARALLEL DO IF (N .GE. 1000), PRIVATE(I)
DO I = 1, N
A(I) = A(I) + X*B(I)
END DO

aprun Options
The -d depth option of the aprun command is required to reserve more than one physical processor for an OpenMP process. For best performance, depth should be the same as the maximum number of threads the program uses. The maximum number of threads per compute node is 4 times the number of allocated processors.
aprun -d depth ompProgram

If neither the OMP_NUM_THREADS environment variable nor the omp_set_num_threads() call is used to set the number of OpenMP threads, the system defaults to 1 thread.
The aprun options -n processes and -N processes_per_node are compatible with OpenMP but do not directly affect the execution of OpenMP programs.