Scalar Optimization Directives

Scalar optimization directives control aspects of code generation, register storage, and other scalar operations.

blockable

#pragma _CRI blockable(num_loops)
num_loops
Number of subsequent loops to be blocked

The blockable directive specifies that it is legal and desirable to cache block the subsequent loop nest, even when the compiler has not made such a determination. To be legally blockable, the nest must be perfect (without code between constituent loops), rectangular (trip counts of member loops are fixed over the lifetime of the nest), and fully permutable (loop interchange and unrolling are legal at all levels). This directive both permits and requests blocking of the indicated loop nest.

If a blockingsize directive is also provided for the indicated loop, the following rules apply:
  • If blockingsize is at least two, the indicated blockingsize is used.
  • If blockingsize is zero, the loop itself is not blocked and it is treated as an inner loop (as part of the nest that traverses the cache block tile).
  • If blockingsize is one, the loop itself is not blocked and it is treated as an outer loop (as a loop in the nest that moves from tile to tile).
When no blockingsize directive is supplied the compiler chooses the blockingsize according to its own heuristics.
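
To make the tile-to-tile and within-tile roles concrete, a blockable nest can be blocked by hand. The sketch below is an illustration only, not compiler output: the names stencil_plain, stencil_blocked, and blocked_matches_plain, the array size, and the loop bounds are assumptions made for this example, and the code the compiler actually generates may differ.

```c
#include <assert.h>

#define N 256                      /* hypothetical problem size */

static float A_plain[N][N];        /* result of the unblocked nest */
static float A_blocked[N][N];      /* result of the hand-blocked nest */
static float B[N][N];              /* input data */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Reference nest: perfect, rectangular, and fully permutable. */
void stencil_plain(void)
{
    for (int i = 1; i <= N - 2; ++i)
        for (int j = 1; j <= N - 2; ++j)
            A_plain[i][j] = B[i-1][j-1] + B[i-1][j+1]
                          + B[i+1][j-1] + B[i+1][j+1];
}

/* Hand-blocked version: the ii/jj loops move from tile to tile,
   and the i/j loops traverse one bi x bj tile. */
void stencil_blocked(int bi, int bj)
{
    assert(bi > 0 && bj > 0);      /* a zero step would never terminate */
    for (int ii = 1; ii <= N - 2; ii += bi)
        for (int jj = 1; jj <= N - 2; jj += bj)
            for (int i = ii; i <= min_int(N - 2, ii + bi - 1); ++i)
                for (int j = jj; j <= min_int(N - 2, jj + bj - 1); ++j)
                    A_blocked[i][j] = B[i-1][j-1] + B[i-1][j+1]
                                    + B[i+1][j-1] + B[i+1][j+1];
}

/* Returns 1 when both versions produce identical results, 0 otherwise. */
int blocked_matches_plain(int bi, int bj)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            B[i][j] = (float)((i * 31 + j * 17) % 97);
            A_plain[i][j] = 0.0f;
            A_blocked[i][j] = 0.0f;
        }
    stencil_plain();
    stencil_blocked(bi, bj);
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            if (A_plain[i][j] != A_blocked[i][j])
                return 0;
    return 1;
}
```

Calling blocked_matches_plain(32, 128) uses the same block sizes as the example in this section. Any positive pair of sizes should give the same result, because blocking changes only the order in which the iterations run.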

blockable and blockingsize Directives

% cat blk.c
#define N 1000

float A[N][N];
float B[N][N];

void
func(int n)
{
#pragma _CRI blockable(2)
#pragma _CRI blockingsize( 32 )
    for (int i = 2; i <= N-1; ++i) {
#pragma _CRI blockingsize( 128 )
        for (int j = 2; j <= N-1; ++j) {
            A[i][j] = B[i-1][j-1]
                    + B[i-1][j+1]
                    + B[i+1][j-1]
                    + B[i+1][j+1];
        }
    }
}
% cc -c -hlist=md blk.c; cat blk.lst
...
    7.              func(int n)
    8.              {
    9.              #pragma _CRI blockable(2)
   10.              #pragma _CRI blockingsize( 32 )
   11.  + b-------<     for (int i = 2; i <= N-1; ++i)  {
   12.    b         #pragma _CRI blockingsize( 128 )
   13.    b Vbr4--< 	for (int j = 2; j <= N-1; ++j)  {
   14.    b Vbr4    	    A[i][j] = B[i-1][j-1]
   15.    b Vbr4    	            + B[i-1][j+1]
   16.    b Vbr4    	            + B[i+1][j-1]
   17.    b Vbr4    	            + B[i+1][j+1];
   18.    b Vbr4--> 	}
   19.    b------->     }
   20.              }

CC-6294 CC: VECTOR File = blk.c, Line = 11 
  A loop was not vectorized because a better candidate was found at line 13.

CC-6051 CC: SCALAR File = blk.c, Line = 11 
  A loop was blocked according to user directive with block size 32.

CC-6051 CC: SCALAR File = blk.c, Line = 13 
  A loop was blocked according to user directive with block size 128.
...

blockingsize

#pragma _CRI blockingsize(n1 [,n2])
#pragma _CRI noblocking
n1
Specify a value greater than or equal to 0 for the primary cache.
n2
Specify a value less than or equal to 2**30 for the secondary cache.
If n1 or n2 is 0, the loop itself is not blocked, but the entire loop is inside the block.

The blockingsize directive asserts that the loop following the directive is involved in a cache blocking situation for the primary or secondary cache.

The noblocking directive prevents the compiler from involving the subsequent loop in a cache blocking situation.

If the loop is involved in a blocking situation, it will have a block size of n1 for the primary cache and n2 for the secondary cache. The compiler attempts to include this loop within such a block but cannot guarantee inclusion.

blockingsize Directive

The compiler makes 20 x 20 blocks when blocking, but it could block the loop nest such that loop K is not included in the tile.


      SUBROUTINE AMAT(X,Y,Z,N,M,MM)
      REAL(KIND=8) X(100,100), Y(100,100), Z(100,100)
      DO K = 1, N
!DIR$ BLOCKABLE(J,I)
!DIR$ BLOCKINGSIZE(20)
         DO J = 1, M
!DIR$ BLOCKINGSIZE(20)
            DO I = 1, MM
               Z(I,K) = Z(I,K) + X(I,J)*Y(J,K)
            END DO
         END DO
      END DO
      END
If K is excluded from the tile, add a BLOCKINGSIZE(0) directive just before loop K. This specifies that K itself is not blocked but runs entirely inside the tile, so the compiler should generate a loop nest such as the following example:

      SUBROUTINE AMAT(X,Y,Z,N,M,MM)
      REAL(KIND=8) X(100,100), Y(100,100), Z(100,100)
      DO JJ = 1, M, 20
         DO II = 1, MM, 20
            DO K = 1, N
               DO J = JJ, MIN(M, JJ+19)
                  DO I = II, MIN(MM, II+19)
                     Z(I,K) = Z(I,K) + X(I,J)*Y(J,K)
                  END DO
               END DO
            END DO
         END DO
      END DO
      END

noblocking

#pragma _CRI noblocking

The noblocking directive asserts that the loop following the directive should not be cache blocked for the primary or secondary cache. It is an error to place a noblocking directive before a loop that is part of a blockable collection.

[no]collapse

#pragma _CRI collapse(loop-number1, loop-number2 [,loop-number3] ... )
loop-number
Specify a value greater than or equal to 0.
#pragma _CRI nocollapse
Scope: Local

When the collapse directive is applied to a loop nest, the loop numbers of the participating loops must be listed in order of increasing access stride. Loop numbers range from 1 to the nesting level of the most deeply nested loop. The directive enables the compiler to assume appropriate conformity between trip counts. The compiler diagnoses misuse at compile time when it is able to or, if -h dir_check is specified, at run time.

The nocollapse directive disqualifies the immediately following loop from collapsing with any other loop. Collapse is almost always desirable, so use this directive sparingly.

Loop collapse is a special form of loop coalescing. Any perfect loop nest may be coalesced into a single loop, with explicit rediscovery of the intermediate values of the original loop control variables. The rediscovery cost, which generally involves integer division, is quite high. Coalescing is therefore rarely suitable for vectorization, although it may be beneficial for multithreading. By definition, loop collapse occurs when loops can be coalesced without the rediscovery overhead. To meet this requirement, all memory accesses must have uniform stride.
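
The difference between coalescing and collapse can be sketched by hand. The following is a minimal illustration, not compiler output; the function names and the flat rows x cols array layout are assumptions made for this example.

```c
#include <assert.h>

/* Original perfect nest over a rows x cols array stored contiguously. */
void scale_nest(double *a, int rows, int cols, double s)
{
    assert(rows >= 0 && cols >= 0);
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            a[i * cols + j] *= s;
}

/* Coalesced: one loop, with i and j rediscovered from k by
   integer division and remainder on every iteration. */
void scale_coalesced(double *a, int rows, int cols, double s)
{
    assert(rows >= 0 && cols >= 0);
    for (int k = 0; k < rows * cols; ++k) {
        int i = k / cols;          /* rediscovery of the outer index */
        int j = k % cols;          /* rediscovery of the inner index */
        a[i * cols + j] *= s;
    }
}

/* Collapsed: every access has the same stride, so k indexes the
   data directly and no rediscovery is needed. */
void scale_collapsed(double *a, int rows, int cols, double s)
{
    assert(rows >= 0 && cols >= 0);
    for (int k = 0; k < rows * cols; ++k)
        a[k] *= s;
}

/* Returns 1 when all three forms produce identical results. */
int collapse_variants_agree(void)
{
    enum { ROWS = 6, COLS = 8 };
    double x[ROWS * COLS], y[ROWS * COLS], z[ROWS * COLS];
    for (int k = 0; k < ROWS * COLS; ++k)
        x[k] = y[k] = z[k] = (double)k;
    scale_nest(x, ROWS, COLS, 1.5);
    scale_coalesced(y, ROWS, COLS, 1.5);
    scale_collapsed(z, ROWS, COLS, 1.5);
    for (int k = 0; k < ROWS * COLS; ++k)
        if (x[k] != y[k] || x[k] != z[k])
            return 0;
    return 1;
}
```

The collapsed form is possible here only because the array is contiguous and accessed with unit stride. With padded rows or non-uniform strides, only the coalesced form, with its division and remainder, would remain available.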

[no]interchange

#pragma _CRI interchange(loop_number1, loop_number2[, loop_number3] ...)
loop_number
Number from 1 to nesting depth of the most deeply nested loop
#pragma _CRI nointerchange
Scope: Local

The interchange control directives specify whether the order of the following two or more perfectly nested loops should be interchanged. These directives apply to the subsequent loops.

The interchange directive specifies two or more loop numbers, ranging from 1 to the nesting depth of the most deeply nested loop, specified in any order. The compiler reorders perfectly nested loops. If they are not perfectly nested, unexpected results may occur.

The nointerchange directive inhibits loop interchange on the loop that immediately follows the directive.

interchange Directive

The interchange directive reorders the loops; the k loop becomes the outermost and the i loop the innermost:


#define N 100

float A[N][N][N];

void
f(int n)
{
  int i, j, k;

#pragma _CRI interchange( 2, 3, 1 )
  for (i=0; i < n; i++) {
    for (k=0; k < n; k++) {
      for (j = 0; j < n; j++) {
        A[k][j][i] = 1.0;
      }
    }
  }
}
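
For comparison, the loop order requested by interchange( 2, 3, 1 ) can be written out by hand. This is a sketch under stated assumptions rather than the compiler's output: the function names, the smaller array size, and the index-dependent values stored (instead of 1.0) are hypothetical and were chosen so that the two versions can be compared.

```c
#include <assert.h>

#define N 16                       /* hypothetical, smaller than the 100 above */

static float A_orig[N][N][N];      /* filled in source order */
static float A_swap[N][N][N];      /* filled in interchanged order */

/* Loop order as written in the source: i, k, j. */
void fill_original(int n)
{
    assert(n >= 0 && n <= N);
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                A_orig[k][j][i] = (float)((k * N + j) * N + i);
}

/* Loop order after interchange(2, 3, 1): k outermost, then j, then i.
   With i innermost, the last subscript varies fastest, so the
   accesses are unit stride in C's row-major layout. */
void fill_interchanged(int n)
{
    assert(n >= 0 && n <= N);
    for (int k = 0; k < n; k++)
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                A_swap[k][j][i] = (float)((k * N + j) * N + i);
}

/* Returns 1 when both loop orders store the same values. */
int interchange_preserves_result(int n)
{
    fill_original(n);
    fill_interchanged(n);
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                if (A_orig[k][j][i] != A_swap[k][j][i])
                    return 0;
    return 1;
}
```

Both versions store the same values because each iteration writes a distinct element and reads nothing written by another iteration. Only the traversal order changes.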

suppress

#pragma _CRI suppress func

Scope: Global

#pragma _CRI suppress [var]

Scope: Local

This directive suppresses optimization in two ways, determined by its use with either global or local scope.
  • The global scope suppress directive specifies that all associated local variables are to be written to memory before a call to the specified function. This ensures that the value of the variables will always be current.
  • The local scope suppress directive stores current values of the specified variables in memory. If the directive lists no variables, all variables are stored to memory. This directive causes the values of these variables to be reloaded from memory at the first reference following the directive. The net effect of the local suppress directive is similar to declaring the affected variables to be volatile except that the volatile qualifier affects the entire program, whereas the local suppress directive affects only the block of code in which it resides.

[no]unroll

#pragma _CRI unroll [n]
n
Specifies no loop unrolling (n = 0 or 1) or the total number of loop body copies to be generated (2 ≤ n ≤ 63)
#pragma _CRI nounroll
Scope: Local
The unroll directive allows the user to control unrolling for individual loops or to specify no unrolling of a loop. Loop unrolling can improve program performance by revealing cross-iteration memory optimization opportunities such as read-after-write and read-after-read. The effects of loop unrolling also include:
  • Improved loop scheduling by increasing basic block size
  • Reduced loop overhead
  • Improved chances for cache hits

The nounroll directive disables loop unrolling for the next loop. It is functionally equivalent to the unroll 0 and unroll 1 directives.

The n argument applies only to the unroll directive. If a value for n is not specified, the compiler determines the number of copies to generate based on the number of statements in the loop nest.

Note: The compiler cannot always safely unroll non-innermost loops due to data dependencies. In these cases, the directive is ignored.

The unroll directive can be used only on loops with iteration counts that can be calculated before entering the loop. If unroll is specified on a loop that is not the innermost loop in a loop nest, the inner loops must be nested perfectly. That is, each loop in the nest may contain only one loop, and only the innermost loop may contain work.

unroll Directive

The following example unrolls the outer loop by 2:


#pragma _CRI unroll 2
for (i = 0; i < 10; i++) {
  for (j = 0; j < 100; j++) {
    a[i][j] = b[i][j] + 1;
  }
}

With outer loop unrolling, the compiler produces the following nest, in which the two bodies of the inner loop are adjacent:


for (i = 0; i < 10; i += 2) {
  for (j = 0; j < 100; j++) {
    a[i][j] = b[i][j] + 1;
  }
  for (j = 0; j < 100; j++) {
    a[i+1][j] = b[i+1][j] + 1;
  }
}

The compiler then jams, or fuses, the inner two loop bodies, producing the following nest:


for (i = 0; i < 10; i += 2) {
  for (j = 0; j < 100; j++) {
    a[i][j] = b[i][j] + 1;
    a[i+1][j] = b[i+1][j] + 1;
  }
}

Illegal unrolling of outer loops

Outer loop unrolling is not always legal because the transformation can change the semantics of the original program. For example, unrolling the following loop nest on the outer loop would change the program semantics because of the dependency between a[i][...] and a[i+1][...]; honoring the directive here would produce incorrect code.


#pragma _CRI  unroll 2
for (i = 0; i < 10; i++) {
  for (j = 1; j < 100; j++) {
    a[i][j] = a[i+1][j-1] + 1;
  }
}
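
The change in semantics can be checked by applying unroll-and-jam to this nest by hand and comparing the results. The sketch below is illustrative only; the function names, the int element type, the 11 x 100 array shape, and the all-zero initial contents are assumptions made for this example.

```c
#include <assert.h>
#include <string.h>

#define ROWS 11                    /* rows 0..10: the nest reads row i+1 */
#define COLS 100

/* The nest as written: each row i reads the not-yet-updated row i+1. */
void nest_original(int a[ROWS][COLS])
{
    for (int i = 0; i < 10; i++)
        for (int j = 1; j < COLS; j++)
            a[i][j] = a[i+1][j-1] + 1;
}

/* Outer loop unrolled by 2 and the two inner bodies jammed.
   Row i now reads elements of row i+1 that the second statement
   already overwrote in an earlier j iteration. */
void nest_unroll_and_jam(int a[ROWS][COLS])
{
    for (int i = 0; i < 10; i += 2)
        for (int j = 1; j < COLS; j++) {
            a[i][j]   = a[i+1][j-1] + 1;
            a[i+1][j] = a[i+2][j-1] + 1;
        }
}

/* Runs both versions on zero-filled arrays and returns the number
   of elements whose final values differ. */
int count_mismatches(void)
{
    static int x[ROWS][COLS], y[ROWS][COLS];
    int mismatches = 0;
    assert(ROWS >= 11);            /* the jammed body reads row 10 when i = 8 */
    memset(x, 0, sizeof x);
    memset(y, 0, sizeof y);
    nest_original(x);
    nest_unroll_and_jam(y);
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            if (x[i][j] != y[i][j])
                mismatches++;
    return mismatches;
}
```

With zero-filled input, count_mismatches() returns a nonzero value. For instance, a[0][2] ends up as 1 in the original nest but as 2 after unroll-and-jam, because a[1][1] has already been overwritten when it is read.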

nofission

#pragma _CRI nofission func

Scope: Local

The nofission directive instructs the compiler not to split statements in a given loop into distinct loops. Fission is prevented only for the loop specified; loops nested within the indicated loop remain fission candidates unless likewise annotated.

[no]fusion

#pragma _CRI fusion
#pragma _CRI nofusion

Scope: Local

The nofusion directive instructs the compiler to not attempt loop fusion on the following loop even when the -h fusion option was specified on the compiler command line. The fusion directive instructs the compiler to attempt loop fusion on the following loop unless -h nofusion was specified on the compiler command line.
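
Fission and fusion are inverse transformations, which the following hand-written sketch illustrates. It is not compiler output; the function names are hypothetical, and the arrays x, y, and z are assumed not to overlap.

```c
#include <assert.h>

/* Fused form: one loop containing two independent statements. */
void compute_fused(const double *x, double *y, double *z, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        y[i] = 2.0 * x[i];
        z[i] = x[i] + 1.0;
    }
}

/* Fissioned form: the same statements split into two distinct loops.
   Fusing these two loops gives back the form above. */
void compute_fissioned(const double *x, double *y, double *z, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        y[i] = 2.0 * x[i];
    for (int i = 0; i < n; ++i)
        z[i] = x[i] + 1.0;
}

/* Returns 1 when both forms produce identical results. */
int fission_preserves_result(void)
{
    enum { LEN = 32 };
    double x[LEN], y1[LEN], z1[LEN], y2[LEN], z2[LEN];
    for (int i = 0; i < LEN; ++i)
        x[i] = (double)i * 0.5;
    compute_fused(x, y1, z1, LEN);
    compute_fissioned(x, y2, z2, LEN);
    for (int i = 0; i < LEN; ++i)
        if (y1[i] != y2[i] || z1[i] != z2[i])
            return 0;
    return 1;
}
```

The two forms are interchangeable here because neither statement reads a value that the other writes. When such a dependency exists, splitting or merging the loops can change the results.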