Vectorization Directives

This topic describes the directives used for vectorization.

Because vector operations cannot be expressed directly in the compiler, the compiler must be capable of transforming scalar operations into equivalent vector operations. The candidates for vectorization are operations in loops and assignments of structures. Compiler directives may be used to control vectorization.

`COPY_ASSUMED_SHAPE`

!DIR$ COPY_ASSUMED_SHAPE array , array ...

array: The name of an array to be copied to temporary storage. If no array names are specified, all assumed-shape dummy arrays are copied to temporary contiguous storage upon entry to the procedure. When the procedure is exited, the arrays in temporary storage are copied back to the dummy argument arrays. If one or more arrays are specified, only those arrays specified are copied. The arrays specified must not have the TARGET attribute.; All arrays specified, or all assumed-shape dummy arrays (if specified without array arguments), on a single COPY_ASSUMED_SHAPE directive must be shape conformant with each other. Incorrect code may be generated if the arrays are not. The -R c command line option can be used to verify whether the arrays are shape conformant.

The COPY_ASSUMED_SHAPE directive copies assumed-shape dummy array arguments into contiguous local temporary storage upon entry to the procedure in which the directive appears. During execution, it is the temporary storage that is used when the assumed-shape dummy array argument is referenced or defined.

The COPY_ASSUMED_SHAPE directive applies only to the program unit in which it appears.

Assumed-shape dummy arguments cannot be assumed to be stored in contiguous storage. In the case of multidimensional arrays, the elements cannot be assumed to be stored with uniform stride between each element of the array. These conditions can arise, for example, when an actual array argument associated with an assumed-shape dummy array is a non-unit strided array slice or section.

If the compiler cannot determine whether an assumed-shape dummy array is stored contiguously or with a uniform stride between each element, some optimizations are inhibited in order to ensure that correct code is generated. If an assumed-shape dummy array is passed to a procedure and becomes associated with an explicit-shape dummy array argument, additional copy-in and copy-out operations may occur at the call site. For multidimensional assumed-shape arrays, some classes of loop optimizations cannot be performed when an assumed-shape dummy array is referenced or defined in a loop or an array assignment statement. The lost optimizations and the additional copy operations performed can significantly reduce the performance of a procedure that uses assumed-shape dummy arrays when compared to an equivalent procedure that uses explicit-shape array dummy arguments.

The COPY_ASSUMED_SHAPE directive causes a single copy to occur upon entry and again on exit. The compiler generates a test at run time to determine wheteher the array is contiguous. If the array is contiguous, the array is not copied. This directive allows the compiler to perform all the optimizations it would otherwise perform if explicit-shape dummy arrays were used. If there is sufficient work in the procedure using assumed-shape dummy arrays, the performance improvements gained by the compiler outweigh the cost of the copy operations upon entry and exit of the procedure.

`CONCURRENT`

!DIR$ CONCURRENT [ SAFE_DISTANCE=n]

n: An integer number that represents the number of additional consecutive loop iterations that can be executed in parallel without danger of data conflict. n must be an integer constant > 0. If SAFE_DISTANCE=n is not specified, the distance is assumed to be infinite, and the compiler ignores all cross-iteration data dependencies. The CONCURRENT directive is ignored if the SAFE_DISTANCE argument is used and vectorization is requested on the command line.

The CONCURRENT directive indicates that no data dependence exists between array references in different iterations of the loop. This directive affects the loop that immediately follows it. This can be useful for vectorization optimizations.

Consider the following code:

!DIR$ CONCURRENT SAFE_DISTANCE=3
      DO I = K+1, N
        X(I) = A(I) + X(I-K)
      ENDDO

The CONCURRENT directive in this example informs the optimizer that the relationship K>3 is true. This allows the compiler to load all of the following array references safely during the Ith loop iteration:

X(I-K)
X(I-K+1)
X(I-K+2)
X(I-K+3)

`HAND_TUNED`

!DIR$ HAND_TUNED

Assert that the code in the next loop nest has been arranged by hand for maximum performance, and the compiler should restrict some of the more aggressive automatic expression rewrites. The compiler should still fully optimize and vectorize the loop within the constraints of the directive. The hand_tuned directive applies to the next loop in the same manner as the CONCURRENT and SAFE_ADDRESS directives.

Use of this directive may severely impede performance. Use carefully and evaluate performance before and after employing this directive.

`IVDEP`

!DIR$ IVDEP [ SAFEVIL=vlen | INFINITEVL]

vlen: Specifieds a vector length in which no dependency will occur. vlen must be an integer between 1 and 1024 inclusive.
INFINITEVL: Specifies an infinite safe vector length. That is, no dependency will occur at any vector length.

When the IVDEP directive appears before a loop, the compiler ignores vector dependencies, including explicit dependencies, in any attempt to vectorize the loop. ivdep applies only to the first for loop or while loop that follows the directive within the same program unit.

For array operations, Fortran requires that the complete right-hand side (RHS) expression be evaluated before the assignment to the array or array section on the left-hand side (LHS). If possible dependencies exist between the RHS expression and the LHS assignment target, the compiler creates temporary storage to hold the RHS expression result. If an IVDEP directive appears before an array syntax statement, the compiler ignores potential dependencies and suppresses the creation and use of array temporaries for that statement. Using array syntax statements allows the reference of referencing arrays in a compact manner. Array syntax allows the use of either the array name, or the array name with a section subscript, to specify actions on all the elements of an array, or array section, without using DO loops.

If no vector length is specified, the vector length used is infinity.

If a loop with an IVDEP directive is enclosed within another loop with an IVDEP directive, the IVDEP directive on the outer loop is ignored. When the Cray compiler vectorizes a loop, it may reorder the statements in the source code to remove vector dependencies. When IVDEP is specified, the statements in the loop or array syntax statement are assumed to contain no dependencies as written, and the Cray compiler does not reorder loop statements.

`LOOP_INFO`

!DIR$ LOOP_INFO min_trips(c) est_trips(c) max_trips(c) cache( symbol , symbol ... ) cache_nt( symbol , symbol ... ) prefetch noprefetch

c: An expression that evaluates to an integer constant at compilation time.
min_trips: Specifies guaranteed minimum number of trips.
est_trips: Specifies estimated or average number of trips.
max_trips: Specifies guaranteed maximum number of trips.
cache: Specifies that symbol is to be allocated in cache; this is the default if no hint is specified and the cache_nt directive is not specified.
cache_nt: Specifies that symbol is to use non-temporal reads and writes.
symbol: The base name of the object that should (cache) or should not (cache_nt) be placed into cache. This can be the base name of any object such as an array or scalar structure without member references. If specifying a pointer in the list, only the references, not the pointer itself, are subject to the cache or cache_nt instruction.
prefetch: Specifies a preference that prefetches be performed for the following loop.
noprefetch: Specifies a preference that no prefetches be performed for the following loop.

The LOOP_INFO directive allows additional information to be specified about the behavior of a loop, including run time trip count, hints on cache allocation strategy, and threading preference. The LOOP_INFO directive provides information to the optimizer and can produce faster code sequences.

Use LOOP_INFO immediately before a for loop to indicate minimum, maximum, estimated trip count. The compiler will diagnose misuse at compile time when able, or when option -h dir_check is specified at run time.

For cache allocation hints, use the LOOP_INFO directive to override default settings, cache or cache_nt directives, or override automatic cache management decisions. The cache hints are local and apply only to the specified loop nest.

Use the LOOP_INFO PREFER_THREAD directive to indicate the preference that the loop following the directive be threaded. The LOOP_INFO PREFER_NOTHREAD indicates the preference that the loop following the directive should not be threaded.

The prefetch clause instructs the compiler to preload scalar data into the first-level cache to improve the frequency of cache hits and to lower latency. They are generated in situations where the compiler expects them to improve performance. Strategic use of prefetch instructions can hide latency for scalar loads that feed vector instructions or scalar loads in purely scalar loops. Prefetch instructions are generated at default and higher levels of optimization. Thus, they are turned off at -O0 or -O1. Prefetch can be turned off at the loop level via the following directive:

!DIR$ LOOP_INFO NOPREFETCH
     DO I = 1, N

`LOOP_INFO PREFER_[NO]THREAD`

!DIR$ LOOP_INFO PREFER_THREAD
     DO I = 1, N

!DIR$ LOOP_INFO PREFER_THREAD
     DO J = 1, N

The PREFER_THREAD and PREFER_NOTHREAD directives are special cases of the LOOP_INFO advisory directive. Use these directives to indicate a preference for turning threading on or off for the subsequent loop. Use the LOOP_INFO PREFER_THREAD directive to indicate the preference that the loop following the directive be threaded. Use the LOOP_INFO PREFER_NOTHREAD directive to indicate that the loop should not be threaded.

`NEXTSCALAR`

!DIR$ NEXTSCALAR

The NEXTSCALAR directive disables vectorization for the first DO loop or DO WHILE loop that follows the directive. The directive applies to only one loop, the first loop that appears after the directive within the same program unit. NEXTSCALAR is ignored if vectorization has been disabled.

If the NEXTSCALAR directive appears prior to any array syntax statement, it disables vectorization for the array syntax statement.

`[NO]PATTERN`

!DIR$ PATTERN

!DIR$ NOPATTERN

The nopattern directive disables pattern matching for the loop immediately following the directive. By default, the compiler detects coding patterns in source code sequences and replaces these sequences with calls to optimized library functions. In most cases, this replacement improves performance. There are cases, however, in which this substitution degrades performance. This can occur, for example, in loops with very low trip counts. In such a case, use the nopattern directive to disable pattern matching and cause the compiler to generate inline code.

The nopattern directive disables pattern matching for the loop immediately following the directive.

By default, the compiler would detect that the following loop is a matrix multiply and replace it with a call to a matrix multiply library routine. By preceding the loop with a !DIR$ NOPATTERN directive, however, pattern matching is inhibited and no replacement is done.

!DIR$ NOPATTERN
       DO k= 1,n
         DO i= 1,n
           DO j= 1,m
             A(i,j) = A(i,j) + B(i,k) * C(k,j)
           END DO
         END DO
       END DO

`[NO]VECTOR`

!DIR$ VECTOR [clause[, clause]... ]

!DIR$ NOVECTOR

!dec$ vector [clause[, clause]... ]

ALWAYS: Vectorize the loop that immediately follows the directive. This directive states a vectorization preference and does not guarantee that the loop has no memory-dependence hazard. This directive has the same effect as the prefervector directive.
ALIGNED: Directs the compiler to generate aligned data movement instructions for array references when vectorizing. For current Intel processors, data alignment is necessary for efficient vectorization. Use with care to improve performance. If some of the access patterns are actually unaligned, using the ALIGNED clause may generate incorrect code. This directive also directs the compiler to ignore explicit and implicit vector dependencies.
UNALIGNED: Directs the compiler to generate unaligned data movement instructions for all array references when vectorizing.

The novector directive suppresses compiler attempts to vectorize loops and array syntax statements. It overrides any other vectorization-related directives, as well as the -h vector and -O vectorn command line options. These directives are ignored if vectorization or scalar optimization has been disabled.

In Fortran, NOVECTOR applies to the rest of the program unit unless it is superseded by a VECTOR directive. When NOVECTOR has been used within the same program unit, VECTOR causes the compiler to resume its attempts to vectorize loops and array syntax statements. After a VECTOR directive is specified, automatic vectorization is enabled for all loop nests.

`PERMUTATION`

!DIR$ PERMUTATION (symbol [, symbol ] ...)

ia: Integer array that has no repeated values for the entire routine.

Specifies that an integer array has no repeated values. This directive is useful when the integer array is used as a subscript for another array (vector-valued subscript). This directive may improve code performance.

In a sequence of array accesses that read array element values from the specified symbols with no intervening accesses that modify the array element values, each of the accessed elements will have a distinct value.

When an array with a vector-valued subscript appears on the left side of the equal sign in a loop, many-to-one assignment is possible. Many-to-one assignment occurs if any repeated elements exist in the subscripting array. If it is known that the integer array is used merely to permute the elements of the subscripted array, it can often be determined that many-to-one assignment does not exist with that array reference.

`permutation` Directive

Sometimes a vector-valued subscript is used as a means of indirect addressing because the elements of interest in an array are sparsely distributed; in this case, an integer array is used to select only the desired elements, and no repeated elements exist in the integer array. The permutation directive does not apply to the array a. Rather, it applies to the pointer used to index into it, ipnt. By knowing that ipnt is a permutation, the compiler can safely generate an unordered scatter for the write to a.

!DIR$ PERMUTATION(IPNT) ! IPNT has no repeated values
      ...
      DO I = 1, N
         A(IPNT(I)) = B(I) + C(I)
      END DO

`[NO]PIPELINE`

!DIR$ PIPELINE

!DIR$ NOPIPELINE

Software-based vector pipelining (software vector pipelining) provides additional optimization beyond the normal hardware-based vector pipelining. In software vector pipelining, the compiler analyzes all vector loops and automatically attempts to pipeline a loop if doing so can be expected to produce a significant performance gain. This optimization also performs any necessary loop unrolling.

In some cases the compiler either does not pipeline a loop that could be pipelined or pipelines a loop without producing performance gains. In these situations, use the PIPELINE or NOPIPELINE directive to advise the compiler to pipeline or not pipeline the loop immediately following the directive.

Software vector pipelining is valid only for the innermost loop of a loop nest. These directives are advisory only. While the NOPIPELINE directive can be used to inhibit automatic pipelining, and the PIPELINE directive can be used to attempt to override the compiler's decision not to pipeline a loop, the compiler cannot be forced to pipeline a loop that cannot be pipelined.

Vector loops that have been pipelined generate compile-time messages to that effect, if optimization messaging is enabled (-O msgs).

`PREFERVECTOR`

!DIR$ PREFERVECTOR

Directs the compiler to vectorize the loop immediately following the directive if the loop contains more than one loop in the nest that can be vectorized. The directive states a vectorization preference and does not guarantee that the loop has no memory-dependence hazard.

In the following example, both loops can be vectorized, but the compiler generates vector code for the outer DO I loop:

!DIR$ PREFERVECTOR
      DO I = 1, N
        DO J = 1, M
          A(I) = A(I) + B(J,I)
        END DO
      END DO

`PROBABILITY`

!DIR$ probability const

!DIR$ probability_almost_always

!DIR$ probability_almost_never

const: Expression that evaluates to a floating point constant at compilation time. (0.0 <= const <= 1.0.)

The probability directives specify information used by interprocedure analysis (IPA) and the optimizer to produce faster code sequences. The specified probability is a hint, rather than a statement of fact. This information is used to guide inlining decisions, branch elimination optimizations, branch hint marking, and the choice of the optimal algorithmic approach to the vectorization of conditional code. These directives can appear anywhere executable code is legal. Each directive applies to the block of code where it appears. It is important to realize that the directive should not be applied to a conditional test directly; rather, it should be used to indicate the relative probability of a THEN or ELSE branch being executed.

Specify almost_never and almost_always by using the probability const values 0.0 and 1.0, respectively.

`PROBABILITY` directive

This example states that the probability of entering the block of code with the assignment statement is 0.3 or 30%. This also means that a[i] is expected to be greater than b[i] 30% of the time. Note that the probability directive appears within the conditional block of code, rather than before it. This removes some of the ambiguity that has plagued other implementations that tie the directive directly to the conditional code.

     IF ( A(I) > B(I) ) THEN
!DIR$ PROBABILITY 0.3
         A(I) = B(I)
      ENDIF

For vector IF code, a probability of very low (<0.1) or probability_almost_never will cause the compiler to use the vector gather/scatter methods used for sparse IF vector code instead of the vector merge methods used for denser IF code. For example:

      DO I = 1,N
         IF ( A(I) > 0.0 ) THEN
!DIR$ PROBABILITY_ALMOST_NEVER
            B(I) = B(I)/A(I) + A(I)/B(I) ! EVALUATE USING SPARSE METHODS
         ENDIF
      ENDDO

Note that the PROBABILITY directive appears within the conditional, rather than before the condition. This removes some of the ambiguity of tying the directive directly to the conditional test.

`SAFE_ADDRESS`

Specifies that it is safe to speculatively execute memory references within all conditional branches of a loop; these memory references can be safely executed in each iteration of the loop. For most code, this directive can improve performance significantly by preloading vector expressions. However, most loops do not require this directive to have preloading performed. SAFE_ADDRESS is required only when the safety of the operation cannot be determined or index expressions are very complicated.

The SAFE_ADDRESS directive is an advisory directive. That is, the compiler may override the directive if it determines the directive is not beneficial. If the directive is not used on a loop and the compiler determines that it would benefit from the directive, it issues a message indicating such. The message is similar to this:

  DO I = 1,N
FTN-6375 FTN_DRIVER.EXE: VECTOR X7, FILE = 10928.F, LINE = 110
  A LOOP STARTING AT LINE 110 WOULD BENEFIT FROM "!DIR$ SAFE_ADDRESS".

If using the directive on a loop and the compiler determines that it does not benefit from the directive, it issues a message that states the directive is superfluous and can be removed.

To see the messages, use the -O msgs option.

Incorrect use of the directive can result in segmentation faults, bus errors, or excessive page faulting. However, it should not result in incorrect answers. Incorrect usage can result in very severe performance degradations or program aborts.

`SAFE_ADDRESS` directive

In this example, the compiler will not preload vector expressions, because the value of j is unknown. However, if it is known that references to b (i,j) are safe to evaluate for all iterations of the loop, regardless of the condition, the SAFE_ADDRESS directive can be used. With the directive, the compiler can load b (i,j) with a full vector mask, merge 0.0 where the condition is true, and store the resulting vector using a full mask.

SUBROUTINE X3( A, B, N, M, J )
REAL A(N), B(N,M)

!DIR$ SAFE_ADDRESS
DO I = 1,64            ! VECTORIZED LOOP
   IF ( A(I).NE.0.0 ) THEN
      B(I,J) = 0.0     ! VALUE OF 'J' IS UNKNOWN
   ENDIF
ENDDO
END

`SAFE_CONDITIONAL`

!DIR$ SAFE_CONDITIONAL

Specifies that it is safe to execute all memory references and arithmetic operations within all conditional branches of the subsequent scalar or vector loop nest. It can improve performance by allowing the hoisting of invariant expressions from conditional code and by allowing prefetching of memory references.

The SAFE_CONDITIONAL directive is an advisory directive. The compiler may override the directive if it determines the directive is not beneficial.

`SAFE_CONDITIONAL` directive

In the example below, the compiler cannot precompute the invariant expression s1*s2 because these values are unknown and may cause an arithmetic trap if executed unconditionally. However, if the condition is known to be true at least once, then it is safe to use the SAFE_CONDITIONAL directive and execute s1*s2 speculatively. With the directive, the compiler evaluates s1*s2 outside of the loop, rather than under control of the conditional code. In addition, all control flow is removed from the body of the vector loop as s1*s2 no longer poses a safety risk.

      SUBROUTINE SAFE_COND( A, N, S1, S2 )
      REAL A(N), S1, S2

!DIR$ SAFE_CONDITIONAL
      DO I = 1,N
         IF ( A(I) /= 0.0 ) THEN
            A(I) = A(I) + S1*S2
         ENDIF 
      ENDDO
      END

`SAME_TBS`

!DIR$ SAME_TBS array, array, array)

array: Two or more array arguments are required. array is the name of an assumed-shape dummy array. The arrays specified must not have the TARGET attribute. All arrays, specified on a single SAME_TBS directive must have the same element type, bounds, and strides. Use the -Rd command line option to verify that the arrays have the same element type, bounds, and strides.

The SAME_TBS directive informs the compiler that the specified assumed size arrays are of the same rank and type, and that they have identical low-bound, extent, and stride multiplier information for corresponding dimensions. See the SAME_TBS(7) man page.

This information allows the compiler to generate more efficient code by reducing the number of potentially distinct intermediate values required for array element accesses. This may offer significant execution performance improvement when using assumed-shape dummy arrays of corresponding type, low-bound, extent, and stride.

The SAME_TBS directive applies only to the program unit in which it appears.

Oridnarily, for multidimensional assumed-shape arrays, some classes of loop optimizations cannot be performed when an assumed-shape dummy array is referenced or defined in a loop or an array assignment statement. The lost optimizations and the additional copy operations performed can significantly reduce the performance of a procedure that uses assumed-shape dummy arrays when compared to an equivalent procedure that uses explicit-shape array dummy arguments. This directive may provide significant performance improvement depending on certain factors such as greater numbers of assumed-shape arrays and smaller array sizes.

`[NO]UNROLL`

!DIR$ UNROLL

!DIR$ NOUNROLL

n: Specifies the total number of loop body copies to be generated. n is an integer value from 0 through 1024.; If a value for n is specified, the compiler unrolls the loop by that amount. If n is not specified, the compiler determines if it is appropriate to unroll the loop, and if so, the unroll amount.; The subsequent DO loop is not unrolled if UNROLL0, UNROLL1, or NOUNROLL are specified. These directives are equivalent.

Scope: Local

The unroll directive allows the user to control unrolling for individual loops or to specify no unrolling of a loop. Loop unrolling can improve program performance by revealing cross-iteration memory optimization opportunities such as read-after-write and read-after-read. The effects of loop unrolling also include:

Improved loop scheduling by increasing basic block size
Reduced loop overhead
Improved chances for cache hits

Disable loop unrolling for the next loop. The NOUNROLL directive is functionally equivalent to the UNROLL 0 and UNROLL 1 directives. The n argument applies only to the UNROLL directive. If a value for n is not specified, the compiler will determine the number of copies to generate based on the number of statements in the loop nest. Note: The compiler cannot always safely unroll non-innermost loops due to data dependencies. In these cases, the directive is ignored. The UNROLL directive can be used only on loops with iteration counts that can be calculated before entering the loop. If UNROLL is specified on a loop that is not the innermost loop in a loop nest, the inner loops must be nested perfectly. That is, all loops in the nest can contain only one loop, and the innermost loop can contain work.

Unroll outer loops

Assume that the outer loop of the following nest will be unrolled by two:

!DIR$ UNROLL 2
      DO I = 1, 10
        DO J = 1,100
              A(J,I) = B(J,I) + 1
        END DO
      END DO

With outer loop unrolling, the compiler produces the following nest, in which the two bodies of the inner loop are adjacent to each other:

DO I = 1, 10, 2
  DO J = 1,100
        A(J,I) = B(J,I) + 1
  END DO
  DO J = 1,100
        A(J,I+1) = B(J,I+1) + 1
  END DO
END DO

The compiler jams, or fuses, the inner two loop bodies together, producing the following nest:

DO I = 1, 10, 2
  DO J = 1,100
        A(J,I)   = B(J,I) + 1
        A(J,I+1) = B(J,I+1) + 1
  END DO
END DO

Illegal unrolling of outer loops

Outer loop unrolling is not always legal because the transformation can change the semantics of the original program. For example, unrolling the following loop nest on the outer loop would change the program semantics because of the dependency between A(...,I) and A(...,I+1):

!DIR$ UNROLL 2
      DO I = 1, 10
        DO J = 1,100
              A(J,I) = A(J-1,I+1) + 1
        END DO
      END DO

Unrolling nearest neighbor pattern

The following example shows unrolling with nearest neighbor pattern. This allows register reuse and reduces memory references from 2 per trip to 1.5 per trip.

!DIR$ UNROLL 2
      DO J = 1,N
         DO I = 1,N      ! VECTORIZE
            A(I,J) = B(I,J) + B(I,J+1)
         ENDDO
      ENDDO

The preceding code fragment is converted to the following code:

DO J = 1,N,2       ! UNROLLED FOR REUSE OF B(I,J+1)
   DO I = 1,N      ! VECTORIZED
      A(I,J) = B(I,J) + B(I,J+1)
      A(I,J+1) = B(I,J+1) + B(I,J+2)
   END DO
END DO

Vectorization Directives

COPY_ASSUMED_SHAPE

CONCURRENT

HAND_TUNED

IVDEP

LOOP_INFO

LOOP_INFO PREFER_[NO]THREAD

NEXTSCALAR

[NO]PATTERN

[NO]VECTOR

PERMUTATION

permutation Directive

[NO]PIPELINE

PREFERVECTOR

PROBABILITY

PROBABILITY directive

SAFE_ADDRESS

SAFE_ADDRESS directive

SAFE_CONDITIONAL

SAFE_CONDITIONAL directive

SAME_TBS

[NO]UNROLL