Coarray C++ Use

Coarray C++ is a template library that implements the coarray concept for Partitioned Global Address Space (PGAS) programming in C++. The template library specifications are contained in a set of *.html pages that the CCE installation copies to /opt/cray/cce/version/doc/html/ on the Cray platform. They may be copied to any location that serves HTML content for the site, or to any location that site-local web browsers can access.

The coarray concept used in Coarray C++ is intentionally very similar to Fortran (ISO/IEC 1539-1:2010) coarrays. Users familiar with Fortran coarrays will notice that terminology and even function names are identical, although the syntax follows C++ conventions.

A coarray adds an additional dimension, called a codimension, to a normal scalar or array type. The codimension spans instances of a Single-Program Multiple-Data (SPMD) application, called images, such that each image contains a slice of the coarray equivalent in shape to the original scalar or array type. Each image has immediate access via processor loads and stores to its own slice of the coarray, which resides in that image's local partition of the global address space. By specifying an image number in the cosubscript of the codimension, each image also has access to the slices residing in other images' partitions.

Images are an orthogonal concept to threads, such as those provided by C++11 or OpenMP. Threads are used for shared memory programming where each thread has immediate access to the address space of a single process and possibly some thread-local storage to which only it has access. Images are a broader concept intended to provide communication among cooperating processes that each have their own address space. The mechanism for this cooperation varies by implementation. Typically it involves network communication between processes that have arranged to have identical virtual memory layouts. This communication is one-sided such that a programmer can have an image read or write data that belongs to a different image without writing any code for the second image. Note that images and threads may coexist in the same application; a large networked system with multicore nodes could use coarrays to communicate among nodes but use threads within each node to exploit the multicore parallelism.

In Coarray C++, a coarray is presented as a class template that collectively allocates an object of a specified type within the address space of each image. The coarray object is responsible for managing storage for the object that it allocates. When used in an expression context, the coarray object automatically converts to its managed object so that an image can access its own slice of the coarray without using special syntax. Accessing a slice that belongs to a different image requires specifying the image number as a cosubscript in parentheses immediately following the coarray object, before any array subscripts. Therefore, the codimension is the slowest-running array dimension, just as in Fortran.

The subscript order is backwards from Fortran because in Fortran the slowest-running dimension is rightmost whereas in C++ it is leftmost.
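
The behavior described above (implicit conversion to the local object, and a cosubscript in parentheses for other images) can be pictured with a small single-process model. This is only a sketch in standard C++: ToyCoarray and toy_coarray_demo are hypothetical names, the "images" are just slots in a vector, and none of this is the Coarray C++ implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy, single-process model of the coarray idea (NOT the Coarray C++ API):
// one slice per "image", stored in an ordinary vector.
template <typename T>
class ToyCoarray {
public:
    ToyCoarray(std::vector<T>& slices, std::size_t this_image)
        : slices_(slices), me_(this_image) {}

    // Local access: the object converts to its own slice.
    operator T&() { return slices_.at(me_); }
    ToyCoarray& operator=(const T& value) {
        slices_.at(me_) = value;
        return *this;
    }

    // Empty parentheses name the local slice explicitly.
    T& operator()() { return slices_.at(me_); }

    // A cosubscript selects the slice that belongs to another image.
    T& operator()(std::size_t image) { return slices_.at(image); }

private:
    std::vector<T>& slices_;
    std::size_t me_;
};

// Runs a tiny scenario from the point of view of image 1 of 4.
inline int toy_coarray_demo() {
    std::vector<int> slices(4, 0);
    ToyCoarray<int> x(slices, 1);
    x = 42;           // writes image 1's slice
    x(3) = 7;         // writes image 3's slice
    int y = x + 1;    // reads image 1's slice, giving 43
    return y + x(3);  // 43 + 7
}
```

In the model, an out-of-range cosubscript makes vector::at throw std::out_of_range, which loosely mirrors the library rejecting an invalid image number.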

In addition to providing the fundamental ability to allocate and access a coarray, Coarray C++ provides image synchronization, atomic operations, and collectives.

Although this chapter presents Cray's implementation, Coarray C++ is designed to allow portable applications to be written for a variety of computing platforms in the sense that the template library interface is platform independent and can be compiled by any C++03 (ISO/IEC 14882:2003) or C++11 (ISO/IEC 14882:2011) compliant compiler. The implementation of the template library is likely to differ for each platform due to different transport layers (e.g., shared memory or various networks) for communicating data between images.

Compile Coarray C++

The following program is the Coarray C++ equivalent of the classic "Hello World" program. The header file coarray_cpp.h provides all Coarray C++ declarations within namespace coarray_cpp. Normally a program imports all of the declarations into its namespace with a using directive, but having the namespace gives the programmer flexibility to deal with name conflicts.
#include <iostream>
#include <coarray_cpp.h>
using namespace coarray_cpp;
int main( int argc, char* argv[] )
{
    std::cout << "Hello from image " << this_image()
    << " of " << num_images() << std::endl;
    return 0;
}
The program is compiled with the Cray compiler and executed using four images as follows:
> module load PrgEnv-cray
> CC -o hello hello.cpp
> aprun -n4 ./hello
Hello from image 0 of 4
Hello from image 1 of 4
Hello from image 2 of 4
Hello from image 3 of 4

Declare and Access Coarrays

The general form of a coarray declaration is:
coarray<T> name;
Where T is the type of the object that will be allocated in the address space of each image.
A coarray declaration may appear anywhere that a C++ object can be declared. Therefore, a coarray may be declared as a global variable, local variable, static local variable, or as part of a struct or class. It may be allocated statically or dynamically. The only restriction is that a coarray allocation must be executed collectively by all images. The C++ language ensures that this restriction is met for global and static local coarray declarations, but the programmer is responsible for ensuring that local and dynamically-allocated coarrays are declared collectively. For example:
coarray<int> x; // global
void
foo( void )
{
   static coarray<int> y; // static local
   coarray<int> z; // local
   coarray<int>* p = new coarray<int>; // dynamically allocated
   ...
   delete p;
}  // z is automatically destroyed here

Basic Types

A coarray of a basic C++ type is the simplest kind of coarray. Each image has an instance of the basic type that is managed by its coarray object. A coarray of type int is declared as:
coarray<int> x;
The declaration may pass an initial value to the constructor. Different images may pass different initial values:
coarray<int> x(2);
The initializer syntax below is not supported. If it were permitted, then automatic conversion from int to coarray<int> would be allowed, which would loosen type checking and lead to unexpected collective allocations:
coarray<int> x = 2;
This coarray object will behave as if it were the int that it manages. Assigning to the coarray object will assign a value to the int that is managed by the coarray object:
x = 42;
Likewise, using the coarray object in any expression where an int is expected shall read the value of the managed int:
int y = x + 1;
If the coarray object needs to be used in an expression where no particular type is expected, then the managed object can be accessed explicitly via empty parentheses:
// prints the address of the coarray object
std::cout << &x << std::endl;
// prints the address of the int managed by the coarray object
std::cout << &x() << std::endl;
Accessing an int that is managed by another image requires specifying the image number within the parentheses:
x(5) = 42; // set x = 42 within the address space of image 5
int y = x(2); // obtain the value of x from the address space of image 2
Finally, consider an enhanced version of the Hello World program. In this program, all images write their image number to their local object and then call sync_all(), which synchronizes control flow across all images. After the sync_all(), each image computes the image number of its left and right neighbors in the image space and prints the values that were written by its neighbors.
#include <iostream>
#include <coarray_cpp.h>
using namespace coarray_cpp;
int main( int argc, char* argv[] )
{
    coarray<int> x;
    x = this_image();
    sync_all();
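    // Add num_images() before subtracting so that image 0 wraps
    // around to the last image for any number of images.
    const int left = ( this_image() + num_images() - 1 ) % num_images();
    const int right = ( this_image() + 1 ) % num_images();
    std::cout << "Hello from image " << x
         << " where x(left) = " << x(left) << " and x(right) = "
         << x(right) << std::endl;
    return 0;
}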

> CC -o hello2 hello2.cpp
> aprun -n4 ./hello2
Hello from image 0 where x(left) = 3 and x(right) = 1
Hello from image 3 where x(left) = 2 and x(right) = 0
Hello from image 2 where x(left) = 1 and x(right) = 3
Hello from image 1 where x(left) = 0 and x(right) = 2
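
Ring-neighbor arithmetic of this kind can be isolated and checked in plain C++. The sketch below assumes zero-based image numbers held in an unsigned type such as size_t; the helper names left_neighbor and right_neighbor are hypothetical. Adding n before subtracting avoids relying on unsigned wraparound, which produces the right neighbor only when n happens to be a power of two.

```cpp
#include <cassert>
#include <cstddef>

// Left neighbor on a ring of n images numbered 0..n-1 (n must be > 0).
// Adding n before subtracting keeps the unsigned value from wrapping.
inline std::size_t left_neighbor(std::size_t image, std::size_t n) {
    return (image + n - 1) % n;
}

// Right neighbor on the same ring; the last image wraps to image 0.
inline std::size_t right_neighbor(std::size_t image, std::size_t n) {
    return (image + 1) % n;
}
```

For example, with three images the left neighbor of image 0 is image 2, which the expression (0 - 1) % 3 would not yield in unsigned arithmetic.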

Arrays

A coarray of an array type gives every image an array of the same shape. An example of a statically-sized coarray is below. The complete array type, including all extents, is provided as the coarray template's type argument:
// Declares a coarray of an array of 10 arrays of 20 ints
coarray<int[10][20]> x;
The following declaration is very different:
// Declares an array of 10 arrays of 20 coarrays
// of type int. Legal, but very inefficient!
coarray<int> bad[10][20];
A coarray of a multidimensional array type is not achieved via nested coarray types. Although such declarations are legal, they are strange and not particularly useful:
// Declares a coarray of an array of 10 coarrays of arrays of 20 ints
coarray< coarray<int[20]>[10] > weird;
In a dynamically-sized coarray declaration, the extent of the leading dimension is left unbounded. The size of this extent cannot be part of the template type because it is not known at compile time. Instead, the size is passed as a constructor argument:
coarray<int[][20]> y(n); // each image must pass the same value
Later, the extent of the leading dimension can be extracted from the coarray object via the extent() member function:
size_t y_extent = y.extent();
An individual element of the local array managed by the coarray object is accessed by applying subscripts directly to the coarray object. When accessing part of the coarray managed by another image, the cosubscript appears in parentheses before the subscripts:
x[4][5] = 1; // set x[4][5] = 1 within this image's address space
y(3)[6][7] = 2; // set y[6][7] = 2 within the address space of image 3
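
One way to picture the cosubscript as the leading, slowest-running dimension is to flatten (image, row, column) into a single conceptual index. This is purely illustrative: the slices actually live in separate address spaces, and conceptual_index is a hypothetical helper, not part of the library.

```cpp
#include <cassert>
#include <cstddef>

// Conceptual position of element [i][j] on image `image` for a shape like
// coarray<int[ROWS][COLS]>, treating the codimension as the leading
// dimension of one large row-major array.
template <std::size_t ROWS, std::size_t COLS>
std::size_t conceptual_index(std::size_t image, std::size_t i, std::size_t j) {
    return (image * ROWS + i) * COLS + j;
}
```

Stepping the image number by one moves the index by ROWS * COLS elements, while stepping the last subscript moves it by one, which is what "slowest-running" means here.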

Pointers

A coarray of pointers is typically used to implement a "ragged array" where different images need to allocate a different amount of memory as part of the same coarray. An example of a coarray of pointers is:
coarray<int*> x;
Each image allocates additional memory independently from the collective allocation of the coarray object itself:
x = new int[n]; // n usually varies per image
Due to the independent allocations, the allocated memory might not be located at the same address within every image's address space. Therefore, accessing the data requires an additional read of the pointer from the target image before a normal read or write can occur. This additional read happens automatically as part of the usual syntax for accessing the data:
x(i)[3] = 4; // set x[3] = 4 within the address space of image i
The address stored within the pointer may be valid only on the allocating image, unless the program is careful to target the pointer at only symmetric virtual addresses. Great care should be taken with the following code pattern:
int* p = x(i); // get an address from image i
p[3] = 4;      // and dereference it on this image
Finally, the program must ensure prior to performing any accesses that other images have allocated their memory:
coarray<int*> x;
x = new int[n];
sync_all();
x(i)[3] = 4;
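
The two-step nature of an access through a coarray of pointers can be modeled in a single process. In this sketch, which uses only standard C++ and hypothetical names (RaggedSlices, allocate_ragged, remote_store), each inner vector stands in for the memory that one image allocated independently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model (NOT the Coarray C++ API): slot k stands in for the int* held
// by image k, and each image sizes its own allocation independently.
typedef std::vector<std::vector<int> > RaggedSlices;

inline RaggedSlices allocate_ragged(std::size_t num_images) {
    RaggedSlices slices(num_images);
    for (std::size_t image = 0; image < num_images; ++image) {
        slices[image].assign(image + 4, 0);  // length varies per image
    }
    return slices;
}

// Models x(image)[index] = value: first look up the target image's
// allocation (the extra pointer read), then store into it.
inline void remote_store(RaggedSlices& slices, std::size_t image,
                         std::size_t index, int value) {
    std::vector<int>& target = slices.at(image);  // step 1: fetch "pointer"
    target.at(index) = value;                     // step 2: write element
}
```

The model allocates everything before any store, which plays the role that sync_all() plays in the real program.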

Structs, Unions, and Classes

A coarray of a struct, union, or class behaves like a coarray of a basic type when the entire object is accessed; however, special syntax is required for member access due to limitations of C++ operator overloading:
struct Point { int x, y; };

coarray<Point> pt;
Point p;

pt = p;    // set pt = p in this image's address space
pt(2) = p; // set pt = p within the address space of image 2

pt->x = 0;  // set pt.x = 0 in this image's address space
pt().x = 0; // alternate syntax

// set pt.x = 1 within address space of image i
pt(i).member( &Point::x ) = 1;
Calling a member function of an object that resides in the address space of another image (i.e., a remote procedure call) is not supported. By default, when a struct, union, or class is copied between images, it is treated as a Plain Old Data (POD) type such that a bitwise copy occurs. This behavior is not appropriate if the type contains pointers to allocated data. The default behavior can be changed by creating a specialization of coarray_traits where is_trivially_gettable is false. C++ requires that the specialization be placed in the same namespace as the general template:
struct my_string {
    char* data;
    size_t length;
};

namespace coarray_cpp {
    template < >
    struct coarray_traits<my_string> {
        static const bool is_trivially_gettable = false;
        static const bool is_trivially_puttable = false;
    };
}
When is_trivially_gettable is false for a type, Coarray C++ expects the type to have a special constructor and a special assignment operator to facilitate reading an object from a remote image:
struct my_string {
  char* data;
  size_t length;

  // remote constructor
  my_string( const_coref<my_string> ref );
  // remote assignment operator
  my_string& operator = ( const_coref<my_string> ref );
};

The role of the remote constructor or remote assignment operator is to read the POD parts of the object from the other image, use that data to calculate how much memory needs to be allocated, allocate the memory, then read the rest of the object into the newly allocated memory.

Typically, if is_trivially_gettable is false for a type, then is_trivially_puttable should also be false. When is_trivially_puttable is false for a type, a compile-time error will occur if the program attempts to copy an instance of the type to another image.
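
The sequence a remote constructor follows (read the POD parts, size and allocate local storage, then read the rest) can be sketched in one process. ToyRemoteString and fetch_string are hypothetical stand-ins written in standard C++; the "remote" object is ordinary memory here, and the real constructor would read through a const_coref instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Stand-in for a my_string object living on another image. In this
// single-process toy, "remote" memory is simply ordinary memory.
struct ToyRemoteString {
    const char* data;
    std::size_t length;
};

// Two-phase read in the spirit of a remote constructor:
//   1. read the POD members (pointer and length),
//   2. allocate local storage of that size,
//   3. read the pointed-to bytes into the new storage.
inline std::string fetch_string(const ToyRemoteString& remote) {
    const std::size_t length = remote.length;         // phase 1: POD parts
    std::string local(length, '\0');                  // phase 2: allocate
    if (length > 0) {
        std::memcpy(&local[0], remote.data, length);  // phase 3: bulk read
    }
    return local;
}
```

The result owns its own storage, so it stays valid even if the source object is later changed or freed, which a bitwise copy of the pointer would not guarantee.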

Type System

The Coarray C++ type system is modeled closely on the C++ type system. In addition to the coarray type that extends the C++ array concept across images, there are coreferences and copointers that extend the C++ concepts of references and pointers to refer to objects on other images.

A coreference is returned when a cosubscript is applied to a coarray. Like a C++ reference, a coreference is always associated with an object, called its referent, can never be rebound to a different object, and can never be null. Typically a coreference is either immediately converted to its referent or subscripted, such that it is not necessary to declare a coreference and its fleeting presence can be ignored. Nevertheless, explicit coreferences are useful in some situations. Suppose that a function needs to have access to an object in another image's address space, but does not need to know anything about the coarray containing the object. For example:
void foo( coref<int> );
int main( int argc, char* argv[] ){
    coarray<int> x;
    coarray<int[10]> y;
    ...
    foo( x(2) );
    foo( y(3)[4] );
    ...
    return 0;
}
In the above code, function foo can access an int that is part of either x or y even though x and y have different shapes. If foo were to require a coarray parameter instead, then it could accept either x or y but not both because the coarrays have different types. Furthermore, foo's coreference parameter makes it clear to someone reading the code that the function's effect is narrow, limited to one object instead of an entire coarray. Two other uses of coreferences are to operate on coarray slices that are larger than a single object and to move data in bulk between images. To make these techniques more useful, coreferences can be created for local objects:
int main( int argc, char* argv[] ){
    coarray<int[5][10]> x;
    int local[10];
    coref<int[10]> local_ref( local );
    ...
    // local[0...9] = x(2)[1][0...9]
    local_ref = x(2)[1];
    ...
    // x(3)[4][0...9] = local[0...9]
    x(3)[4] = local_ref;
    ...
    return 0;
}
For convenience, the make_coref and make_const_coref functions create coreferences for local objects without requiring the programmer to write the type of the local object:
int main( int argc, char* argv[] )
{
    coarray<int[5][10]> x;
    int local[10];
    ...
    // local[0...9] = x(2)[1][0...9]
    make_coref( local ) = x(2)[1];
    ...
    // x(3)[4][0...9] = local[0...9]
    x(3)[4] = make_const_coref( local );
    ...
    return 0;
}

A const_coref behaves exactly like a coref except that it cannot be used to modify its referent.

A coreference can be converted to a copointer by calling its address function; the address-of operator is not overloaded. Local pointers are automatically convertible to copointers. Unlike coreferences, a copointer can be reassociated and can be unassociated or null. Arithmetic on a copointer changes the address to which it points but never changes the image to which it points. Comparisons between two copointers are allowed provided that both copointers point to the same image. Copointers can be used as iterators with standard C++ function templates. For example, the following code will not assert:
#include <algorithm>
#include <cassert>
#include <coarray_cpp.h>
using namespace coarray_cpp;

int
main( int argc, char* argv[] )
{
    coarray<int[10]> x;
    const size_t image = this_image();
    const size_t n = num_images();
    // Add n before subtracting so the unsigned value cannot wrap.
    const size_t left = ( image + n - 1 ) % n;
    const size_t right = ( image + 1 ) % n;
    coptr<int> begin = x(right)[0].address();
    coptr<int> end = x(right)[10].address();
    // Apply a standard algorithm, using a coptr as an iterator.
    std::fill( begin, end, image );
    sync_all();
    for ( int i = 0; i < 10; ++i ) {
        assert( x[i] == left );
    }
    return 0;
}
Copointers can also be used to form linked lists spanning images. The list can even include links that point to local data:
#include <iostream>
#include <coarray_cpp.h>
using namespace coarray_cpp;

template < typename T >
struct Link {
    T data;
    coptr< Link<T> > next;
};

coarray< Link<int> > global_links;

int main( int argc, char* argv[] )
{
    Link<int> local_link;
    global_links->data = 2 * this_image();
    global_links->next = &local_link;
    local_link.data = 2 * this_image() + 1;
    if ( this_image() < num_images() - 1 ) {
        local_link.next = global_links(this_image() + 1).address();
    }
    else {
        local_link.next = 0;
    }
    sync_all(); // ensure every image has setup the data

    if ( this_image() == 0 ) {
       for ( coptr< Link<int> > p = global_links(0).address();
             p != NULL; p = p->member( &Link<int>::next ) ) {
           std::cout << p->member( &Link<int>::data ) << std::endl;
       }
    }
    // ensure local_link is not destroyed before it's read by image 0
    sync_all();
    return 0;
}
Compiling and executing the above program:
> CC -o list list.cpp
> aprun -n4 ./list

A const_coptr behaves exactly like a coptr except that it cannot be used to modify its target.
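
The copointer rules stated earlier (arithmetic changes the address but never the image, and two copointers may be combined only when they point to the same image) can be captured in a small model. ToyCoptr is a hypothetical type in standard C++ that reduces the address to an element offset; it throws std::logic_error where the library documents mismatched_image_error.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Toy model (NOT the Coarray C++ API): a copointer pairs an image number
// with an address, here reduced to an element offset.
struct ToyCoptr {
    std::size_t image;
    std::ptrdiff_t offset;
};

// Arithmetic moves the address but never the image.
inline ToyCoptr operator+(ToyCoptr p, std::ptrdiff_t n) {
    p.offset += n;
    return p;
}

// Subtraction is meaningful only within one image; the toy throws
// std::logic_error where the library documents mismatched_image_error.
inline std::ptrdiff_t operator-(const ToyCoptr& a, const ToyCoptr& b) {
    if (a.image != b.image) {
        throw std::logic_error("copointers refer to different images");
    }
    return a.offset - b.offset;
}
```

Because the image never changes under arithmetic, a begin/end pair built from the same image always describes a range within that one image, which is what lets copointers act as iterators.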

Various different array types have the same number of elements even though they have a different shape. For example, int[100], int[2][50], and int[2][2][25] all have 100 elements. A reference or pointer to a coarray of one of these types can be reinterpreted as a coarray of any of the others via a shape_cast, which has the same syntax as the standard C++ static_cast, dynamic_cast, reinterpret_cast, and const_cast. A shape_cast converts between coarray types of the same ultimate type that have different shapes. For example, a shape_cast cannot be used to reinterpret a coarray<int[100]>& as a coarray<float[100]>&; that conversion will throw a std::bad_cast exception. A shape_cast can be used to convert to a smaller shape but not to a larger shape. For example, a coarray<int[100]>& may be converted to a coarray<int[50]>&, in which case the new coarray can access only the first 50 elements of the original, but it may not be converted to a coarray<int[200]>& because that requires more storage and will throw a std::bad_cast exception. The example code below shows various legal shape_casts:
#include <cassert>
#include <iostream>
#include <typeinfo>  // std::bad_cast
#include <coarray_cpp.h>

using namespace coarray_cpp;

void foo( const coarray<int[]>& y ) { }
void foo10( const coarray<int[10]>& y ) { }
void foo5( const coarray<int[][5]>& y ) { }
void foo10_5( const coarray<int[10][5]>& y ) { }
void foo50( const coarray<int[50]>& y ) { }
int
main( int argc, char* argv[] )
{
    int extent = 10;
    coarray<int[10]> x_10_s;
    coarray<int[]> x_10_d(extent);
    coarray<int[10][5]> x_10_5_s;
    coarray<int[][5]> x_10_5_d(extent);
    coarray<int> y;

    // Perform all valid combinations of passing the coarrays to the functions,
    // using shape_cast when necessary.
    foo( x_10_s );
    foo( x_10_d );
    foo( shape_cast<int[]>( x_10_5_s ) );
    foo( shape_cast<int[]>( x_10_5_d ) );
    foo10( x_10_s );
    foo10( x_10_d );
    foo5( shape_cast<int[2][5]>( x_10_s ) );
    foo5( shape_cast<int[][5]>( x_10_d ) );

    foo5( x_10_5_s );
    foo5( x_10_5_d );
    foo10_5( x_10_5_s );
    foo10_5( x_10_5_d );
    foo50( shape_cast<int[50]>( x_10_5_s ) );
    foo50( shape_cast<int[50]>( x_10_5_d ) );

    // Trivial reshape to same shape.
    shape_cast<int>( y );

    // shape_cast from scalar to array.
    shape_cast<int[1]>( y );

    // shape_cast from array to scalar.
    shape_cast<int>( x_10_s );

    // shape_cast to smaller array.
    shape_cast<int[5]>( x_10_s );

    // shape_cast to larger array.
    bool passed = false;
    try {
        shape_cast<int[25]>( x_10_s );
    } catch ( std::bad_cast& e ){
        passed = true;
    }
    assert( passed );

    return 0;
}
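
The rule that governs which reshapes are legal (same ultimate element type, and no more elements than the source provides) can be expressed as a compile-time check. The traits below are hypothetical helpers in standard C++11 and handle bounded shapes only; the library itself reports violations at run time with std::bad_cast.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Number of elements in a scalar or bounded array type,
// e.g. int[2][50] -> 100 and int -> 1.
template <typename T>
struct element_count {
    typedef typename std::remove_all_extents<T>::type element_type;
    static const std::size_t value = sizeof(T) / sizeof(element_type);
};

// Models the documented rule: same element type, and the target shape
// must not need more storage than the source provides.
template <typename From, typename To>
struct shape_cast_allowed {
    static const bool value =
        std::is_same<typename element_count<From>::element_type,
                     typename element_count<To>::element_type>::value &&
        element_count<To>::value <= element_count<From>::value;
};
```

Under this model, int[10] to int[5] is allowed and int[10] to int[25] is rejected, matching the last two cases in the example program.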

Control Flow and Synchronization

Coarray C++ follows the Single-Program Multiple-Data model where all images begin executing the same main program but may operate on different data. Conditional code is used to restrict execution to certain images:
#include <iostream>
#include <coarray_cpp.h>

using namespace coarray_cpp;

int main( int argc, char* argv[] )
{
    if ( this_image() % 2 == 0 ){
       std::cout << "Hello from even image "
          << this_image() << std::endl;
    }
    else {
        std::cout << "Hello from odd image "
           << this_image() << std::endl;
    }
    return 0;
}
> aprun -n4 ./a.out
Hello from odd image 3
Hello from even image 0
Hello from even image 2
Hello from odd image 1

A call to sync_all() acts as a barrier: no image may proceed beyond the sync_all() that it executed until every image has executed a sync_all(). The images need not all execute the same sync_all() call in the source code; each must simply execute some sync_all(). If any image fails to participate, the program deadlocks.

A coarray may be passed to a function via a reference or a pointer, but may not be passed by value. If a coarray could be passed by value, the call would have to be collective. There would be a collective allocation of a temporary coarray, the data within the original coarray would need to be copied into the temporary coarray, and eventually the temporary coarray would need to be collectively destroyed. Pass by value is expensive and there are better alternatives, like passing a coarray as a const reference, so it is a compile-time error. No matter how a coarray parameter is declared, the type of the actual argument must agree. Automatic conversions are provided between bounded and unbounded arrays; a conversion from unbounded to bounded performs a run-time check to ensure that the extents match and may throw a mismatched_extent_error exception.

The coatomic template is similar to the C++11 std::atomic template, but provides operations that are atomic with respect to images rather than threads. Specializations exist for all basic types and the same operations are supported as for the C++11 std::atomic template. Similar convenience typedefs are provided as well so that, for example, coatomic_long can be used in place of coatomic<long>.
coarray< coatomic<long> > x; // or coarray<coatomic_long>

x(i) ^= 3; // atomic update x = x ^ 3 on image i

long old_value = x(i)++; // atomic increment, saving the old value

long new_value = ++x(i); // atomic increment, saving the new value
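
Since coatomic is described as supporting the same operations as std::atomic, the thread-level analog of the three statements above can be run with standard C++11 alone. The function name atomic_ops_demo is hypothetical, and the sketch says nothing about how the image-level operations are implemented.

```cpp
#include <atomic>
#include <cassert>

// Thread-level analog using std::atomic<long>; per the text, coatomic
// offers the same operations with atomicity across images instead.
inline long atomic_ops_demo() {
    std::atomic<long> x(5);
    x ^= 3;                             // atomic x = x ^ 3, giving 6
    long old_value = x++;               // returns 6, x becomes 7
    long new_value = ++x;               // x becomes 8, returns 8
    return old_value * 10 + new_value;  // 6 * 10 + 8
}
```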

coevent

A coevent permits point-to-point synchronization between images. It wraps a coatomic_long that acts as a counter and provides two operations, post and wait. Post atomically increments the counter and wait blocks execution of the calling image until it can atomically decrement the counter to a non-negative value.
coarray<coevent> x;

if ( this_image() == 0 ) {
   // do something, then notify image 1
   x(1).post();
}
else if ( this_image() == 1 ) {
    // wait for notification from another image
    x().wait(); // then do something
}
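
The counter semantics of post and wait can be sketched with std::atomic. ToyEvent is a hypothetical single-process model, not the coevent class, and it replaces the blocking wait with a non-blocking try_wait that reports whether the decrement could be made without the counter going negative.

```cpp
#include <atomic>
#include <cassert>

// Toy event counter (NOT the Coarray C++ coevent): post() increments,
// try_wait() succeeds only if it can decrement without going negative.
class ToyEvent {
public:
    ToyEvent() : count_(0) {}

    void post() { count_.fetch_add(1); }

    // Non-blocking stand-in for wait(): returns false instead of blocking.
    bool try_wait() {
        long seen = count_.load();
        while (seen > 0) {
            // On failure, compare_exchange_weak reloads `seen`.
            if (count_.compare_exchange_weak(seen, seen - 1)) {
                return true;
            }
        }
        return false;
    }

private:
    std::atomic<long> count_;
};
```

A blocking wait would amount to retrying try_wait until it succeeds; each post releases exactly one waiter.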
A comutex provides mutual exclusion. The lock function blocks until the mutex can be acquired and the unlock function releases the mutex. The try_lock function attempts to acquire the lock and returns true upon success.
coarray<comutex> m;

m(i).lock();
// critical section, typically guarding access to data on image i
m(i).unlock();

Collectives

Coarray C++ provides broadcast and reduction collectives.

cobroadcast replicates the value of a coarray on one image across all other images.
#include <cassert>
#include <iostream>
#include <coarray_cpp.h>

using namespace coarray_cpp;

int
main( int argc, char* argv[] )
{
    coarray<int> x;
    size_t image = this_image();
    size_t n = num_images();
    if ( image == 0 ) {
        x = 42;
    }
    sync_all();

    // Make x on every image equal the x on image 0.
    cobroadcast( x, 0 );
    sync_all();
    assert( x == 42 );
    return 0;
}
coreduce applies a function across the coarray values of all images. For convenience, template specializations of coreduce are provided for the addition, min, and max operations from the C++ functional header. Implementations are likely to provide optimized versions of at least these reductions.
#include <cassert>
#include <iostream>
#include <coarray_cpp.h>
using namespace coarray_cpp;
int
main( int argc, char* argv[] )
{
    coarray<int> sum;
    coarray<int> min;
    coarray<int> max;
    size_t image = this_image();
    size_t n = num_images();
    sum = image;
    min = image;
    max = image;

    sync_all();

    cosum( sum ); // equivalent to coreduce( sum, std::plus<int> )
    comin( min ); // equivalent to coreduce( min, std::less<int> )
    comax( max ); // equivalent to coreduce( max, std::greater<int> )
    sync_all();

    assert( sum == ( n * ( n - 1 ) / 2 ) );
    assert( min == 0 );
    assert( max == ( n - 1 ) );
    return 0;
}
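
The values that the assertions in this program expect can be reproduced with a single-process stand-in, where a vector holds one contribution per image. ToyReduction and reduce_image_numbers are hypothetical names, and the standard algorithms used here only model the result of the reductions, not how the collectives communicate.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

struct ToyReduction {
    int sum;
    int min;
    int max;
};

// Reduces one contribution per image, as a single-process stand-in for
// cosum, comin, and comax. Image k contributes the value k.
// Requires num_images >= 1.
inline ToyReduction reduce_image_numbers(std::size_t num_images) {
    std::vector<int> values(num_images);
    for (std::size_t k = 0; k < num_images; ++k) {
        values[k] = static_cast<int>(k);
    }
    ToyReduction r;
    r.sum = std::accumulate(values.begin(), values.end(), 0);
    r.min = *std::min_element(values.begin(), values.end());
    r.max = *std::max_element(values.begin(), values.end());
    return r;
}
```

With n images the sum is n * (n - 1) / 2, the minimum is 0, and the maximum is n - 1, which is exactly what the program asserts.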

Exceptions

Coarray C++ throws standard C++ exceptions, like std::bad_cast, but also throws some special exceptions for coarray-specific errors.
invalid_image_error
This exception is thrown whenever a cosubscript is invalid. For example, given a coarray x in a program executed with 4 images, x(4) triggers an exception because the only valid image numbers are 0, 1, 2, and 3.
invalid_put_error
This exception is thrown whenever a user-defined type is copied to a different image, but that type has coarray_traits that specify that it is not trivially puttable.
mismatched_extent_error
This exception is thrown when two arrays in an array assignment have a different shape.
mismatched_image_error
This exception is thrown when two copointers are compared or subtracted, but the copointers point to objects on different images.

Memory Consistency Model

The atomic_image_fence() function is the Coarray C++ equivalent of the C++11 std::atomic_thread_fence() function. It has the same behavior with respect to images as std::atomic_thread_fence() has with respect to threads. Typically, it is used to ensure that all memory accesses made by the calling image are visible to all images before performing subsequent memory accesses.

The effect of two memory accesses made by an image to its own address space is governed by the C++ memory consistency model. The C++ memory consistency model depends on which version of the C++ standard is implemented by the compiler. In general, a C++03 compiler assumes that an image is single-threaded and offers no memory consistency guarantees if multiple threads perform the accesses, whereas a C++11 compiler provides a detailed memory consistency model that can be used to reason about the effect of memory accesses within a multithreaded image.

A memory access of an object of size N bytes shall be treated as if it were performed as N arbitrarily ordered single-byte memory accesses. For example, the target image of a write shall not rely on the Nth byte being written last to detect whether the full object has been written.

The execution of a program contains a data race if it contains two conflicting actions in different images, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior. For example, if two images both write to the same object without any synchronization:
if ( this_image() == 0 ) {
    x(i) = 0;
} else if ( this_image() == 1 ) {
    x(i) = 1;
}
Then the final value of the object is undefined. Various forms of synchronization can impose a specific order, such as in this example:
if ( this_image() == 0 ) {
    x(i) = 0;
}

sync_all();

if ( this_image() == 1 ) {
    x(i) = 1;
}
Where the assignment by image 0 happens before the assignment by image 1 because of the sync_all().

Two atomic operations issued by different images to the same coatomic object have the same ordering relationship as two C++11 threads that perform the same atomic operations on the same object.

Two memory accesses issued by the same image to non-conflicting memory addresses are unordered.

Two memory accesses issued by the same image to conflicting memory addresses within the address space of a single, different image shall have the same order as if they were made within the issuing image's address space. For example, in the following code:
x(i) = 1;
int y = x(i);
the value of y will be 1 provided that there are no data races. Therefore, a Coarray C++ implementation for a shared memory system could inline x(i) as a direct memory access, allowing the compiler to make the following optimization (forward substitution):
x(i) = 1;
int y = 1;

For distributed memory systems, providing this ordering guarantee is unfortunately somewhat onerous, but it is consistent with ordering guarantees of other PGAS languages, namely UPC and Fortran. Two memory accesses issued by an image to the same distant memory location typically will pass through the issuing processor's memory system, a high-speed communication network, and finally the target processor's memory system. Each hardware component is likely to contain multiple data pathways to increase bandwidth and resiliency, such that two memory accesses traveling on different pathways could bypass each other. Providing the ordering guarantee may require constraining two memory accesses to the same target location to always take the same hardware path to prevent bypass. Alternatively, software can track outstanding memory accesses and defer issuing an access if there is a conflict; however, software ordering adds overhead to each memory access to check for conflicts as well as storage overhead to track the accesses.
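
The software alternative mentioned above, tracking outstanding accesses and deferring a conflicting one, can be sketched as a small bookkeeping class. ToyAccessTracker is a hypothetical illustration in standard C++; it tracks exact (image, address) pairs only and ignores access sizes and overlap, which a real implementation would have to handle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>

// Toy tracker for outstanding remote accesses, keyed by (image, address).
class ToyAccessTracker {
    typedef std::pair<std::size_t, std::uintptr_t> Key;

public:
    // Returns true if the access may be issued now. Returns false if an
    // earlier access to the same location is still in flight, in which
    // case the caller must defer this access.
    bool try_issue(std::size_t image, std::uintptr_t address) {
        return outstanding_.insert(Key(image, address)).second;
    }

    // Records that a previously issued access has completed.
    void complete(std::size_t image, std::uintptr_t address) {
        outstanding_.erase(Key(image, address));
    }

private:
    std::set<Key> outstanding_;
};
```

Every access pays for a set lookup and every in-flight access occupies storage, which is the per-access and storage overhead that the paragraph attributes to software ordering.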

Blocking Versus Non-blocking Accesses

When an image makes a blocking read or write access, it does not proceed to execute its next operation until the access fully completes. By contrast, a non-blocking read or write access permits an image to proceed to execute its next operation before the access fully completes and provides some mechanism for ensuring that the operation has completed later.

Neither the target image nor any other image besides the issuing image is required to be able to observe the effects of a write until some form of image synchronization occurs. Therefore, an implementation is permitted to issue non-blocking writes for all writes provided that it can ensure that conflicting accesses issued by the same image occur in program order. Whether this guarantee is provided by software or hardware depends on the implementation. To explicitly issue and manage completion of a non-blocking write, see Cofutures.

A Coarray C++ read access is blocking in order to provide a value for use in an arbitrary expression context:
coarray<int> x;
...
int y = x(i) + 1; // read of x(i) shall block
A non-blocking read is performed via an explicit get() member function of coref:
int y;
x(i).get( &y );
... // some code that does not access y
atomic_image_fence();
++y;

The get() member function issues a non-blocking read that is not guaranteed to complete until the next fence. The atomic_image_fence() ensures completion of all previously issued memory accesses. The get() plus fence solution is appropriate in many cases, but it may be too broad if the fence would force completion of other accesses on which the issuing image does not yet need to wait. To explicitly issue and manage completion of a non-blocking read, see Cofutures.

Cofutures

Coarray C++ provides explicit completion management of a non-blocking access via a cofuture, which is modeled on C++11's std::future. A coref plays a similar role to C++11's std::promise, providing member functions that create a cofuture. Here is an example of a non-blocking read where the storage for the value is contained within the cofuture. The value cannot be accidentally used before the operation has completed, but existing storage cannot be used as the target of the read:
coarray<int> x;
...
cofuture<int> f = x(i).get_cofuture(); // or just x(i)
...
int z = f + 1; // using f waits then implicitly returns the value
For convenience, a coref can automatically convert to a cofuture so that the get_cofuture() call can be omitted. Here is an example of a non-blocking read where the storage for the value is external to the cofuture. Care must be taken to not access the storage until wait() has been called:
int y[100];
...
cofuture<void> f = x(i).get_cofuture( y );
...      // code that does not read or write y
f.wait();
...      // code that reads or writes y

Note that the cofuture's parameter type is void because it does not store any value.

Here is an example of a non-blocking write. Care must be taken to not overwrite the source of the write until wait() has been called.
coarray<int> x;
int y;
...
cofuture<void> f = x(i).put_cofuture( y );
...       // code that does not write y
f.wait(); // ensure that the x(i) = y assignment completed

Note that the cofuture's parameter type is void because a cofuture for a write never stores a value.
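
The idea that a value held in a future cannot be used before the operation completes can be illustrated with a deferred-evaluation model. ToyFuture and read_remote_value are hypothetical and written in standard C++. Unlike a real non-blocking access, which starts the transfer when it is issued, this toy runs the stored "transfer" only when the value is first needed.

```cpp
#include <cassert>
#include <functional>

// Toy deferred future (NOT the Coarray C++ cofuture): the "transfer" is
// a stored callable that runs the first time the value is needed.
template <typename T>
class ToyFuture {
public:
    explicit ToyFuture(std::function<T()> transfer)
        : transfer_(transfer), ready_(false), value_() {}

    // Completes the transfer if it has not completed yet.
    void wait() {
        if (!ready_) {
            value_ = transfer_();
            ready_ = true;
        }
    }

    // Waits, then returns the transferred value.
    T get() {
        wait();
        return value_;
    }

    bool ready() const { return ready_; }

private:
    std::function<T()> transfer_;
    bool ready_;
    T value_;
};

// Stand-in for reading a value from another image.
inline int read_remote_value() { return 41; }
```

Because the only way to obtain the value is through get(), which waits first, the caller cannot observe a half-completed read.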

Code Patterns

When a coarray is included as a member of a class, it can be allocated with the class object or it can be allocated later:
// An X must be allocated and destroyed
// collectively because it contains a coarray.
class X {
    coarray<int> x;
    ...
};

// But a Y defers its "collectiveness" until
// it needs to allocate the coarray.
class Y {
    coarray<int>* y;
    ...
};

These two options provide flexibility for implementing collective objects, or coobjects, which can encapsulate coarray data movement.

When a coarray of pointer type is accessed within a loop, there may be unnecessary reads of the pointer from the target image if the same image is accessed repeatedly:
coarray<int*> x;
...
for ( int i = 0; i < n; ++i ) {
    int y = x(1)[i]; // reads pointer x(1) each time
    ...
}
A coptr or const_coptr can be used to hoist the read of the pointer:

coarray<int*> x;
...
const_coptr<int> p = x(1)[0].address(); // reads pointer x(1) once
for ( int i = 0; i < n; ++i ) {
    int y = p[i];
    ...
}
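
The saving from hoisting can be made visible by counting pointer reads in a single-process model. ToyRemoteImage, sum_naive, and sum_hoisted are hypothetical names in standard C++; each call to fetch_pointer stands for one read of the pointer from the target image.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy remote image holding a separately allocated array. Every lookup of
// the array through the image counts as one remote pointer read.
class ToyRemoteImage {
public:
    explicit ToyRemoteImage(std::size_t n) : data_(n, 1), pointer_reads_(0) {}

    const std::vector<int>& fetch_pointer() {
        ++pointer_reads_;
        return data_;
    }

    std::size_t pointer_reads() const { return pointer_reads_; }

private:
    std::vector<int> data_;
    std::size_t pointer_reads_;
};

// Re-reads the pointer on every iteration, like x(1)[i] in a loop.
inline int sum_naive(ToyRemoteImage& image, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += image.fetch_pointer()[i];
    }
    return total;
}

// Reads the pointer once before the loop, like hoisting into a coptr.
inline int sum_hoisted(ToyRemoteImage& image, std::size_t n) {
    const std::vector<int>& p = image.fetch_pointer();
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += p[i];
    }
    return total;
}
```

Both functions compute the same sum, but for n elements the first performs n pointer reads and the second performs one.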