Scalar Optimization Directives
Scalar optimization directives control aspects of code generation, register storage, and other scalar operations.
blockable
#pragma _CRI blockable(num_loops)
- num_loops: Number of subsequent loops to be blocked.
The blockable directive specifies that it is legal and desirable to cache block the subsequent loop nest, even when the compiler has not made that determination itself. To be legally blockable, the nest must be perfect (no code between the constituent loops), rectangular (the trip counts of the member loops are fixed over the lifetime of the nest), and fully permutable (loop interchange and unrolling are legal at all levels). This directive both permits and requests blocking of the indicated loop nest.
If a blockingsize directive is also provided for the indicated loop, the following rules apply:
- If blockingsize is at least two, the indicated blockingsize is used.
- If blockingsize is zero, the loop itself is not blocked and is treated as an inner loop (part of the nest that traverses the cache block tile).
- If blockingsize is one, the loop itself is not blocked and is treated as an outer loop (a loop in the nest that moves from tile to tile).
blockable and blockingsize Directives
% cat blk.c
#define N 1000
float A[N][N];
float B[N][N];
void
func(int n)
{
#pragma _CRI blockable(2)
#pragma _CRI blockingsize( 32 )
for (int i = 2; i <= N-1; ++i) {
#pragma _CRI blockingsize( 128 )
for (int j = 2; j <= N-1; ++j) {
A[i][j] = B[i-1][j-1]
+ B[i-1][j+1]
+ B[i+1][j-1]
+ B[i+1][j+1];
}
}
}
% cc -c -hlist=md blk.c; cat blk.lst
...
7. func(int n)
8. {
9. #pragma _CRI blockable(2)
10. #pragma _CRI blockingsize( 32 )
11. + b-------< for (int i = 2; i <= N-1; ++i) {
12. b #pragma _CRI blockingsize( 128 )
13. b Vbr4--< for (int j = 2; j <= N-1; ++j) {
14. b Vbr4 A[i][j] = B[i-1][j-1]
15. b Vbr4 + B[i-1][j+1]
16. b Vbr4 + B[i+1][j-1]
17. b Vbr4 + B[i+1][j+1];
18. b Vbr4--> }
19. b-------> }
20. }
CC-6294 CC: VECTOR File = blk.c, Line = 11
A loop was not vectorized because a better candidate was found at line 13.
CC-6051 CC: SCALAR File = blk.c, Line = 11
A loop was blocked according to user directive with block size 32.
CC-6051 CC: SCALAR File = blk.c, Line = 13
A loop was blocked according to user directive with block size 128.
...
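Conceptually, blocking the blk.c nest with sizes 32 and 128 splits each loop into a loop that moves from tile to tile and a loop that traverses one tile. The hand-written sketch below shows that shape; it is an illustration under stated assumptions, not the code the compiler actually emits. The function names, the A_ref array, and the fill pattern for B are hypothetical, and the sketch uses the 0-based interior bounds 1 to N-2 so that every B[i±1][j±1] reference stays inside the N x N arrays.

```c
#define N 1000
#define TI 32   /* tile size for the i loop, as in blockingsize( 32 )  */
#define TJ 128  /* tile size for the j loop, as in blockingsize( 128 ) */

static float A[N][N], A_ref[N][N], B[N][N];

static int min_int(int a, int b) { return a < b ? a : b; }

/* Reference (untiled) nest over the interior points. */
static void stencil_ref(void)
{
    for (int i = 1; i < N - 1; ++i)
        for (int j = 1; j < N - 1; ++j)
            A_ref[i][j] = B[i-1][j-1] + B[i-1][j+1]
                        + B[i+1][j-1] + B[i+1][j+1];
}

/* Hand-tiled nest: ii and jj move from tile to tile; i and j
   traverse one TI x TJ tile, clipped at the edge of the interior. */
static void stencil_tiled(void)
{
    for (int ii = 1; ii < N - 1; ii += TI)
        for (int jj = 1; jj < N - 1; jj += TJ)
            for (int i = ii; i < min_int(ii + TI, N - 1); ++i)
                for (int j = jj; j < min_int(jj + TJ, N - 1); ++j)
                    A[i][j] = B[i-1][j-1] + B[i-1][j+1]
                            + B[i+1][j-1] + B[i+1][j+1];
}

/* Fill B with small integers, run both nests, and compare results. */
int tiled_matches_reference(void)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            B[i][j] = (float)((7 * i + 3 * j) % 11);
    stencil_ref();
    stencil_tiled();
    for (int i = 1; i < N - 1; ++i)
        for (int j = 1; j < N - 1; ++j)
            if (A[i][j] != A_ref[i][j])
                return 0;
    return 1;
}
```

Each element of A is computed by the same expression in both versions; tiling changes only the order in which elements are visited, which is why the two results agree exactly.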
blockingsize
#pragma _CRI blockingsize(n1 [,n2])
#pragma _CRI noblocking
- n1: Specify a value greater than or equal to 0 for the primary cache.
- n2: Specify a value less than or equal to 2**30 for the secondary cache.
The blockingsize directive asserts that the loop following the directive is involved in a cache blocking situation for the primary or secondary cache.
The noblocking directive prevents the compiler from involving the subsequent loop in a cache blocking situation.
If the loop is involved in a blocking situation, it will have a block size of n1 for the primary cache and n2 for the secondary cache. The compiler attempts to include this loop within such a block but cannot guarantee inclusion.
blockingsize Directive
The compiler makes 20 x 20 blocks when blocking, but it could block the loop nest such that loop K is not included in the tile.
SUBROUTINE AMAT(X,Y,Z,N,M,MM)
REAL(KIND=8) X(100,100), Y(100,100), Z(100,100)
DO K = 1, N
!DIR$ BLOCKABLE(J,I)
!DIR$ BLOCKING SIZE (20)
DO J = 1, M
!DIR$ BLOCKING SIZE (20)
DO I = 1, MM
Z(I,K) = Z(I,K) + X(I,J)*Y(J,K)
END DO
END DO
END DO
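END
One possible blocked form of this nest, in which loop K is not included in the tile, is the following:
SUBROUTINE AMAT(X,Y,Z,N,M,MM)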
REAL(KIND=8) X(100,100), Y(100,100), Z(100,100)
DO JJ = 1, M, 20
DO II = 1, MM, 20
DO K = 1, N
DO J = JJ, MIN(M, JJ+19)
DO I = II, MIN(MM, II+19)
Z(I,K) = Z(I,K) + X(I,J)*Y(J,K)
END DO
END DO
END DO
END DO
END DO
END
noblocking
#pragma _CRI noblocking
The noblocking directive asserts that the loop following the directive should not be cache blocked for the primary or secondary cache. It is an error to place a noblocking directive before a loop that is part of a blockable collection.
[no]collapse
#pragma _CRI collapse(loop-number1, loop-number2 [,loop-number3] ... )
- loop-number: Specify a loop number from 1 to the nesting depth of the most deeply nested loop.
#pragma _CRI nocollapse
When the collapse directive is applied to a loop nest, the loop numbers of the participating loops must be listed in order of increasing access stride. Loop numbers range from 1 to the nesting level of the most deeply nested loop. The directive enables the compiler to assume appropriate conformity between the trip counts of the participating loops. The compiler diagnoses misuse at compile time when it can, or at run time if -h dir_check is specified.
The nocollapse directive disqualifies the immediately following loop from collapsing with any other loop. Because collapse is almost always desirable, use this directive sparingly.
Loop collapse is a special form of loop coalescing. Any perfect loop nest can be coalesced into a single loop, with explicit rediscovery of the intermediate values of the original loop control variables. This rediscovery, which generally involves integer division, is costly, so coalescing is rarely suitable for vectorization, although it may be beneficial for multithreading. By definition, loop collapse occurs when the loops can be coalesced without the rediscovery overhead; to meet this requirement, all memory accesses must have uniform stride.
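The difference between coalescing and collapsing can be seen by writing both transformations by hand. In this sketch (the function names are hypothetical, and the code illustrates the transformations rather than the compiler's output), a two-level nest over a contiguous row-major array is rewritten first as a coalesced loop that recovers i and j with / and %, and then as a collapsed loop that needs no recovery because both arrays are accessed with uniform unit stride.

```c
/* Original perfect nest over a rows x cols array stored row-major. */
void scale_nest(double *a, const double *b, int rows, int cols)
{
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            a[i * cols + j] = 2.0 * b[i * cols + j];
}

/* Coalesced form: one loop whose body rediscovers i and j
   with an integer division and a remainder on every iteration. */
void scale_coalesced(double *a, const double *b, int rows, int cols)
{
    for (int t = 0; t < rows * cols; ++t) {
        int i = t / cols;   /* rediscovery of the outer index */
        int j = t % cols;   /* rediscovery of the inner index */
        a[i * cols + j] = 2.0 * b[i * cols + j];
    }
}

/* Collapsed form: every access has uniform stride, so the single
   loop indexes memory directly and no rediscovery is needed. */
void scale_collapsed(double *a, const double *b, int rows, int cols)
{
    for (int t = 0; t < rows * cols; ++t)
        a[t] = 2.0 * b[t];
}

/* Runs all three forms on the same input and reports whether they agree. */
int collapse_forms_agree(void)
{
    enum { ROWS = 3, COLS = 5, SIZE = ROWS * COLS };
    double b[SIZE], a1[SIZE], a2[SIZE], a3[SIZE];
    for (int t = 0; t < SIZE; ++t)
        b[t] = (double)t;
    scale_nest(a1, b, ROWS, COLS);
    scale_coalesced(a2, b, ROWS, COLS);
    scale_collapsed(a3, b, ROWS, COLS);
    for (int t = 0; t < SIZE; ++t)
        if (a1[t] != a2[t] || a1[t] != a3[t])
            return 0;
    return 1;
}
```

The collapsed loop is the form that vectorizes well: its body is a plain unit-stride copy-and-scale, with no division in the loop.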
[no]interchange
#pragma _CRI interchange(loop_number1, loop_number2[, loop_number3] ...)
- loop_number: Number from 1 to the nesting depth of the most deeply nested loop.
#pragma _CRI nointerchange
The interchange control directives specify whether the order of the following two or more perfectly nested loops should be interchanged. These directives apply to the subsequent loops.
The interchange directive specifies two or more loop numbers, ranging from 1 to the nesting depth of the most deeply nested loop. The numbers may be given in any order: the loop named first becomes the outermost loop and the loop named last becomes the innermost. The compiler reorders the loops, which must be perfectly nested; if they are not, unexpected results may occur.
The nointerchange directive inhibits loop interchange on the loop that immediately follows the directive.
interchange Directive
The interchange directive reorders the loops; the k loop becomes the outermost and the i loop the innermost:
#define N 100
double A[N][N][N];
void
f(int n)
{
int i, j, k;
#pragma _CRI interchange( 2, 3, 1 )
for (i=0; i < n; i++) {
for (k=0; k < n; k++) {
for (j = 0; j < n; j++) {
A[k][j][i] = 1.0;
}
}
}
}
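Written out by hand, the requested loop order looks like the sketch below (the names f_interchanged and element are hypothetical, and A is assumed to have type double). Because C stores arrays in row-major order, the innermost i loop now steps through the last subscript of A, so consecutive iterations store to adjacent memory locations; in the original order, the innermost j loop stepped through the middle subscript with a stride of N elements.

```c
#define N 100
static double A[N][N][N];

/* Loop order after interchange( 2, 3, 1 ): loop 2 (k) outermost,
   loop 3 (j) in the middle, loop 1 (i) innermost. Requires n <= N. */
void f_interchanged(int n)
{
    for (int k = 0; k < n; k++)
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                A[k][j][i] = 1.0;  /* i varies fastest: unit-stride stores */
}

/* Hypothetical accessor used to inspect the result. */
double element(int k, int j, int i)
{
    return A[k][j][i];
}
```

Calling f_interchanged(10) sets the 10 x 10 x 10 corner of A to 1.0 and leaves the rest of the zero-initialized array unchanged.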
suppress
#pragma _CRI suppress func
Scope: Global
#pragma _CRI suppress [var]
Scope: Local
- The global scope suppress directive specifies that all associated local variables are to be written to memory before a call to the specified function. This ensures that the values of the variables are always current.
- The local scope suppress directive stores the current values of the specified variables in memory. If the directive lists no variables, all variables are stored to memory. This directive causes the values of these variables to be reloaded from memory at the first reference following the directive. The net effect of the local suppress directive is similar to declaring the affected variables to be volatile, except that the volatile qualifier affects the entire program, whereas the local suppress directive affects only the block of code in which it resides.
[no]unroll
#pragma _CRI unroll [n]
- n: Specifies no loop unrolling (n = 0 or 1) or the total number of loop body copies to be generated (2 ≤ n ≤ 63).
#pragma _CRI nounroll
Loop unrolling can provide the following benefits:
- Improved loop scheduling by increasing basic block size
- Reduced loop overhead
- Improved chances for cache hits
The nounroll directive disables loop unrolling for the next loop; it is functionally equivalent to the unroll 0 and unroll 1 directives. The n argument applies only to the unroll directive. If n is not specified, the compiler determines the number of copies to generate based on the number of statements in the loop nest.
Note: The compiler cannot always safely unroll non-innermost loops due to data dependencies. In these cases, the directive is ignored.
The unroll directive can be used only on loops with iteration counts that can be calculated before entering the loop. If unroll is specified on a loop that is not the innermost loop in a loop nest, the inner loops must be perfectly nested; that is, each loop in the nest contains only one loop, and only the innermost loop contains work.
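For an innermost loop, unroll 4 asks for four copies of the loop body in total. The hand-written equivalent below is a sketch (the function name is hypothetical, and the compiler's generated code may differ); it also shows why the iteration count must be computable before the loop starts: the split between the unrolled main loop and the remainder loop depends on n.

```c
/* Original loop, with the directive placed as in the examples:
 *     #pragma _CRI unroll 4
 *     for (i = 0; i < n; i++) a[i] = b[i] + 1;
 * Hand-unrolled equivalent: 4 body copies per trip, plus a
 * remainder loop for the last n % 4 iterations. */
void add_one_unrolled4(int *a, const int *b, int n)
{
    int i = 0;
    for (; i + 3 < n; i += 4) {     /* main loop: 4 copies of the body */
        a[i]     = b[i]     + 1;
        a[i + 1] = b[i + 1] + 1;
        a[i + 2] = b[i + 2] + 1;
        a[i + 3] = b[i + 3] + 1;
    }
    for (; i < n; i++)              /* remainder: at most 3 iterations */
        a[i] = b[i] + 1;
}
```

With n = 10, the main loop runs twice (8 elements) and the remainder loop handles the last 2.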
unroll Directive
In the following nest, the outer loop is unrolled by 2:
#pragma _CRI unroll 2
for (i = 0; i < 10; i++) {
for (j = 0; j < 100; j++) {
a[i][j] = b[i][j] + 1;
}
}
With outer loop unrolling, the compiler produces the following nest, in which the two bodies of the inner loop are adjacent:
for (i = 0; i < 10; i += 2) {
for (j = 0; j < 100; j++) {
a[i][j] = b[i][j] + 1;
}
for (j = 0; j < 100; j++) {
a[i+1][j] = b[i+1][j] + 1;
}
}
The compiler then jams, or fuses, the inner two loop bodies, producing the following nest:
for (i = 0; i < 10; i += 2) {
for (j = 0; j < 100; j++) {
a[i][j] = b[i][j] + 1;
a[i+1][j] = b[i+1][j] + 1;
}
}
Illegal unrolling of outer loops
Outer loop unrolling is not always legal because the transformation can change the semantics of the original program. For example, unrolling the following loop nest on the outer loop would change the program semantics because of the dependency between a[i][...] and a[i+1][...]; applying the directive here causes incorrect code to be generated.
#pragma _CRI unroll 2
for (i = 0; i < 10; i++) {
for (j = 1; j < 100; j++) {
a[i][j] = a[i+1][j-1] + 1;
}
}
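The effect of the dependency can be checked by running both versions. The sketch below assumes that a is an int array of 11 x 100 elements (the example does not show its declaration; at least 11 rows are needed because a[i+1] is read for i = 9), starts from all zeros, and compares element a[0][2]. The function names are hypothetical. In the original nest, row 1 has not yet been updated when a[0][2] is computed, so the result is 1; in the unrolled-and-jammed nest, a[1][1] has already been updated, so the result is 2.

```c
#define ROWS 11   /* assumed size: a[i+1] is read for i up to 9 */
#define COLS 100

/* The nest as written: row i reads row i+1 before row i+1 is updated. */
static void original_nest(int a[ROWS][COLS])
{
    for (int i = 0; i < 10; i++)
        for (int j = 1; j < COLS; j++)
            a[i][j] = a[i+1][j-1] + 1;
}

/* Outer loop unrolled by 2 and jammed: row i+1 is now updated in the
   same j sweep, before row i reads it at the next value of j. */
static void unrolled_and_jammed(int a[ROWS][COLS])
{
    for (int i = 0; i < 10; i += 2)
        for (int j = 1; j < COLS; j++) {
            a[i][j]   = a[i+1][j-1] + 1;
            a[i+1][j] = a[i+2][j-1] + 1;
        }
}

/* Each function starts from an all-zero array and returns a[0][2]. */
int a02_original(void)
{
    int a[ROWS][COLS] = {0};
    original_nest(a);
    return a[0][2];
}

int a02_unrolled(void)
{
    int a[ROWS][COLS] = {0};
    unrolled_and_jammed(a);
    return a[0][2];
}
```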
nofission
#pragma _CRI nofission func
Scope: Local
Instructs the compiler not to split statements in a given loop into distinct loops. Fission is prevented only for the loop specified; loops nested within the indicated loop remain fission candidates unless likewise annotated.
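Fission is the transformation that nofission suppresses. The hand-written sketch below (hypothetical function names; the compiler's own fission decisions may differ) shows the idea: a loop that mixes a recurrence with an independent statement is split into one loop per statement. Splitting typically lets the compiler vectorize the independent loop even though the recurrence cannot be vectorized straightforwardly.

```c
/* Loop as written: a recurrence (s) and an independent statement (x). */
void one_loop(double *s, double *x, const double *a, int n)
{
    for (int i = 1; i < n; i++) {
        s[i] = s[i-1] + a[i];   /* depends on the previous iteration */
        x[i] = 2.0 * a[i];      /* independent of other iterations */
    }
}

/* After fission: each statement in its own loop. */
void after_fission(double *s, double *x, const double *a, int n)
{
    for (int i = 1; i < n; i++)
        s[i] = s[i-1] + a[i];
    for (int i = 1; i < n; i++)
        x[i] = 2.0 * a[i];
}

/* Runs both forms on the same input and reports whether they agree. */
int fission_preserves_results(void)
{
    enum { LEN = 8 };
    double a[LEN], s1[LEN], s2[LEN], x1[LEN], x2[LEN];
    for (int i = 0; i < LEN; i++) {
        a[i] = i + 1.0;
        s1[i] = s2[i] = 0.0;
        x1[i] = x2[i] = 0.0;
    }
    one_loop(s1, x1, a, LEN);
    after_fission(s2, x2, a, LEN);
    for (int i = 0; i < LEN; i++)
        if (s1[i] != s2[i] || x1[i] != x2[i])
            return 0;
    return 1;
}
```

Fission is legal here because the x statement neither reads nor writes s; placing nofission before such a loop keeps both statements in a single loop.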
[no]fusion
#pragma _CRI fusion
#pragma _CRI nofusion
Scope: Local
The nofusion directive instructs the compiler not to attempt loop fusion on the following loop, even when the -h fusion option is specified on the compiler command line. The fusion directive instructs the compiler to attempt loop fusion on the following loop unless -h nofusion is specified on the compiler command line.
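Loop fusion is the reverse of fission: adjacent loops over the same iteration range are merged into one. In the sketch below (hypothetical names, illustrating the transformation rather than compiler output), fusion is legal because the second loop reads only x[i], which the first loop writes in the same iteration. If the second loop read x[i+1] instead, the fused loop would read that element before it was written, and fusion would change the result.

```c
/* Two adjacent loops over the same range; the second reads only the
   x[i] written in the same iteration of the first. */
void separate_loops(double *x, double *y, const double *a, int n)
{
    for (int i = 0; i < n; i++)
        x[i] = a[i] * a[i];
    for (int i = 0; i < n; i++)
        y[i] = x[i] + 1.0;
}

/* Fused form: one loop, so x[i] can be reused while it is still in a
   register or cache instead of being reloaded in a second pass. */
void fused_loop(double *x, double *y, const double *a, int n)
{
    for (int i = 0; i < n; i++) {
        x[i] = a[i] * a[i];
        y[i] = x[i] + 1.0;
    }
}
```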