Coarray C++ Use
Coarray C++ is a template library that implements the coarray concept for Partitioned Global Address Space (PGAS) programming in C++. The template library specifications are contained on a set of *.html pages that the CCE installation copies to /opt/cray/cce/version/doc/html/ on the Cray platform; they may be copied to any location which provides HTML web content for the site, or any location that can be accessed by site local web browsers.
The coarray concept used in Coarray C++ is intentionally very similar to Fortran (ISO/IEC 1539-1:2010) coarrays. Users familiar with Fortran coarrays will notice that terminology and even function names are identical, although the syntax follows C++ conventions.
A coarray adds an additional dimension, called a codimension, to a normal scalar or array type. The codimension spans instances of a Single-Program Multiple-Data (SPMD) application, called images, such that each image contains a slice of the coarray equivalent in shape to the original scalar or array type. Each image has immediate access via processor loads and stores to its own slice of the coarray, which resides in that image's local partition of the global address space. By specifying an image number in the cosubscript of the codimension, each image also has access to the slices residing in other images' partitions.
Images are an orthogonal concept to threads, such as those provided by C++11 or OpenMP. Threads are used for shared memory programming where each thread has immediate access to the address space of a single process and possibly some thread-local storage to which only it has access. Images are a broader concept intended to provide communication among cooperating processes that each have their own address space. The mechanism for this cooperation varies by implementation. Typically it involves network communication between processes that have arranged to have identical virtual memory layouts. This communication is one-sided such that a programmer can have an image read or write data that belongs to a different image without writing any code for the second image. Note that images and threads may coexist in the same application; a large networked system with multicore nodes could use coarrays to communicate among nodes but use threads within each node to exploit the multicore parallelism.
In Coarray C++, a coarray is presented as a class template that collectively allocates an object of a specified type within the address space of each image. The coarray object is responsible for managing storage for the object that it allocates. When used in an expression context, the coarray object automatically converts to its managed object so that an image can access its own slice of the coarray without using special syntax. Accessing a slice that belongs to a different image requires specifying the image number as a cosubscript in parentheses immediately following the coarray object, before any array subscripts. Therefore, the codimension is the slowest-running array dimension, just as in Fortran.
The subscript order is backwards from Fortran because in Fortran the slowest-running dimension is rightmost whereas in C++ it is leftmost.
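As a rough mental model only, a coarray of int[3] over four images can be pictured as a two-dimensional array whose leftmost index is the image number. The sketch below is plain standard C++ running in a single process; it is not the Coarray C++ implementation, and the names ToyCoarray, offset, and toy_coarray_demo are hypothetical. It only illustrates why the cosubscript precedes the array subscripts.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-process model: one slice of `Extent` ints per image.
// Real coarrays place each slice in a different image's address space.
template <std::size_t Extent>
class ToyCoarray {
public:
    explicit ToyCoarray(std::size_t num_images)
        : num_images_(num_images), storage_(num_images * Extent, 0) {}

    // Cosubscript: select the slice that belongs to `image`.
    int* operator()(std::size_t image) {
        assert(image < num_images_);
        return &storage_[image * Extent];
    }

    // Linear offset of element `i` in the slice of `image`; the image
    // number is the slowest-running (leftmost) index.
    std::size_t offset(std::size_t image, std::size_t i) const {
        return image * Extent + i;
    }

private:
    std::size_t num_images_;
    std::vector<int> storage_;
};

// Writes to element 1 of image 2's slice, then reads it back.
inline int toy_coarray_demo() {
    ToyCoarray<3> x(4);   // 4 images, each holding int[3]
    x(2)[1] = 42;         // cosubscript first, then array subscript
    return x(2)[1];
}
```

In this model, x(2)[1] lands at linear offset 2 * 3 + 1, which mirrors the statement that the codimension is the slowest-running dimension.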
In addition to providing the fundamental ability to allocate and access a coarray, Coarray C++ provides image synchronization, atomic operations, and collectives.
Although this chapter presents Cray's implementation, Coarray C++ is designed to allow portable applications to be written for a variety of computing platforms in the sense that the template library interface is platform independent and can be compiled by any C++03 (ISO/IEC 14882:2003) or C++11 (ISO/IEC 14882:2011) compliant compiler. The implementation of the template library is likely to differ for each platform due to different transport layers (e.g., shared memory or various networks) for communicating data between images.
Compile Coarray C++
#include <iostream>
#include <coarray_cpp.h>
using namespace coarray_cpp;
int main( int argc, char* argv[] )
{
    std::cout << "Hello from image " << this_image()
              << " of " << num_images() << std::endl;
    return 0;
}

> module load PrgEnv-cray
> CC -o hello hello.cpp
> aprun -n4 ./hello
Hello from image 0 of 4
Hello from image 1 of 4
Hello from image 2 of 4
Hello from image 3 of 4

Declare and Access Coarrays
coarray<T> name;

Where T is the type of the object that will be allocated in the address space of each image.

coarray<int> x; // global
void
foo( void )
{
    static coarray<int> y; // static local
    coarray<int> z; // local
    coarray<int>* p = new coarray<int>; // dynamically allocated
    ...
    delete p;
} // z is automatically destroyed here

Basic Types
coarray<int> x;

coarray<int> x(2);

coarray<int> x = 2;

x = 42;

int y = x + 1;

// prints the address of the coarray object
std::cout << &x << std::endl;
// prints the address of the int managed by the coarray object
std::cout << &x() << std::endl;

x(5) = 42; // set x = 42 within the address space of image 5
int y = x(2); // obtain the value of x from the address space of image 2

#include <iostream>
#include <coarray_cpp.h>
using namespace coarray_cpp;
int main( int argc, char* argv[] )
{
    coarray<int> x;
    x = this_image();
    sync_all();
    // Add num_images() before subtracting so that image 0 wraps to the last image.
    const int left = ( this_image() + num_images() - 1 ) % num_images();
    const int right = ( this_image() + 1 ) % num_images();
    std::cout << "Hello from image " << x
              << " where x(left) = " << x(left)
              << " and x(right) = " << x(right) << std::endl;
    return 0;
}

> CC -o hello2 hello2.cpp
> aprun -n4 ./hello2
Hello from image 0 where x(left) = 3 and x(right) = 1
Hello from image 3 where x(left) = 2 and x(right) = 0
Hello from image 2 where x(left) = 1 and x(right) = 3
Hello from image 1 where x(left) = 0 and x(right) = 2

Arrays
// Declares a coarray of an array of 10 arrays of 20 ints
coarray<int[10][20]> x;

// Declares an array of 10 arrays of 20 coarrays
// of type int. Legal, but very inefficient!
coarray<int> bad[10][20];

// Declares a coarray of an array of 10 coarrays of arrays of 20 ints
coarray< coarray<int[20]>[10] > weird;

coarray<int[][20]> y(n); // each image must pass the same value

size_t y_extent = y.extent();

x[4][5] = 1; // set x[4][5] = 1 within this image's address space
y(3)[6][7] = 2; // set y[6][7] = 2 within the address space of image 3

Pointers
coarray<int*> x;

x = new int[n]; // n usually varies per image

x(i)[3] = 4; // set x[3] = 4 within the address space of image i

int* p = x(i); // get an address from image i
p[3] = 4; // and dereference it on this image

coarray<int*> x;
x = new int[n];
sync_all();
x(i)[3] = 4;

Structs, Unions, and Classes
struct Point { int x, y; };
coarray<Point> pt;
Point p;
pt = p; // set pt = p in this image's address space
pt(2) = p; // set pt = p within the address space of image 2

pt->x = 0; // set pt.x = 0 in this image's address space
pt().x = 0; // alternate syntax
// set pt.x = 1 within address space of image i
pt(i).member( &Point::x ) = 1;

struct my_string {
    char* data;
    size_t length;
};
namespace coarray_cpp {
    template < >
    struct coarray_traits<my_string> {
        static const bool is_trivially_gettable = false;
        static const bool is_trivially_puttable = false;
    };
}

struct my_string {
    char* data;
    size_t length;
    // remote constructor
    my_string( const_coref<my_string> ref );
    // remote assignment operator
    my_string& operator = ( const_coref<my_string> ref );
};

The role of the remote constructor or remote assignment operator is to read the POD parts of the object from the other image, use that data to calculate how much memory needs to be allocated, allocate the memory, then read the rest of the object into the newly allocated memory.
Typically, if is_trivially_gettable is false for a type, then is_trivially_puttable should also be false. When is_trivially_puttable is false for a type, a compile-time error occurs if the program attempts to copy an instance of the type to another image.
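The three steps that a remote constructor performs (read the POD part, allocate, read the rest) can be sketched in plain standard C++. This is only an illustration under stated assumptions: the "remote" object is simply another object in the same process, and ToyString, deep_copy, and deep_copy_demo are hypothetical names that do not belong to Coarray C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for the my_string type; "remote" here is just
// another object in the same process.
struct ToyString {
    char*       data;
    std::size_t length;
};

// Two-phase copy that mirrors the description above:
//   1. read the POD part (length) from the source,
//   2. allocate local memory of that size,
//   3. read the remaining payload into the new memory.
inline ToyString deep_copy(const ToyString& remote) {
    ToyString local;
    local.length = remote.length;                       // step 1
    local.data   = new char[local.length];              // step 2
    std::memcpy(local.data, remote.data, local.length); // step 3
    return local;
}

// Copies a small string and checks that the copy owns separate storage.
inline std::string deep_copy_demo() {
    char text[] = "coarray";
    ToyString remote = { text, sizeof(text) - 1 };
    ToyString local  = deep_copy(remote);
    std::string result(local.data, local.length);
    const bool distinct = (local.data != remote.data);  // own storage
    delete[] local.data;
    return distinct ? result : std::string();
}
```

A bitwise copy of the struct would instead duplicate only the pointer value, which is meaningless in another image's address space; that is why such a type is marked as not trivially gettable.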
Type System
The Coarray C++ type system is modeled closely on the C++ type system. In addition to the coarray type that extends the C++ array concept across images, there are coreferences and copointers that extend the C++ concepts of references and pointers to refer to objects on other images.
void foo( coref<int> );
int main( int argc, char* argv[] )
{
    coarray<int> x;
    coarray<int[10]> y;
    ...
    foo( x(2) );
    foo( y(3)[4] );
    ...
    return 0;
}

int main( int argc, char* argv[] )
{
    coarray<int[5][10]> x;
    int local[10];
    coref<int[10]> local_ref( local );
    ...
    // local[0...9] = x(2)[1][0...9]
    local_ref = x(2)[1];
    ...
    // x(3)[4][0...9] = local[0...9]
    x(3)[4] = local_ref;
    ...
    return 0;
}

int main( int argc, char* argv[] )
{
    coarray<int[5][10]> x;
    int local[10];
    ...
    // local[0...9] = x(2)[1][0...9]
    make_coref( local ) = x(2)[1];
    ...
    // x(3)[4][0...9] = local[0...9]
    x(3)[4] = make_const_coref( local );
    ...
    return 0;
}

A const_coref behaves exactly like a coref except that it cannot be used to modify its referent.
#include <algorithm>
#include <cassert>
#include <coarray_cpp.h>
using namespace coarray_cpp;
int
main( int argc, char* argv[] )
{
    coarray<int[10]> x;
    // Add num_images() before subtracting so that image 0 wraps to the last image.
    const size_t left = ( this_image() + num_images() - 1 ) % num_images();
    const size_t right = ( this_image() + 1 ) % num_images();
    coptr<int> begin = x(right)[0].address();
    coptr<int> end = x(right)[10].address();
    // Apply a standard algorithm, using a coptr as an iterator.
    std::fill( begin, end, this_image() );
    sync_all();
    for ( int i = 0; i < 10; ++i ) {
        assert( x[i] == left );
    }
    return 0;
}

#include <iostream>
#include <coarray_cpp.h>
using namespace coarray_cpp;
template < typename T >
struct Link {
    T data;
    coptr< Link<T> > next;
};
coarray< Link<int> > global_links;
int main( int argc, char* argv[] )
{
    Link<int> local_link;
    global_links->data = 2 * this_image();
    global_links->next = &local_link;
    local_link.data = 2 * this_image() + 1;
    if ( this_image() < num_images() - 1 ) {
        local_link.next = global_links(this_image() + 1).address();
    }
    else {
        local_link.next = 0;
    }
    sync_all(); // ensure every image has set up the data
    if ( this_image() == 0 ) {
        for ( coptr< Link<int> > p = global_links(0).address();
              p != NULL; p = p->member( &Link<int>::next ) ) {
            std::cout << p->member( &Link<int>::data ) << std::endl;
        }
    }
    // ensure local_link is not destroyed before it's read by image 0
    sync_all();
    return 0;
}

> CC -o list list.cpp
> aprun -n4 ./list

A const_coptr behaves exactly like a coptr except that it cannot be used to modify its target.
#include <cassert>
#include <iostream>
#include <typeinfo> // std::bad_cast
#include <coarray_cpp.h>
using namespace coarray_cpp;
void foo( const coarray<int[]>& y ) { }
void foo10( const coarray<int[10]>& y ) { }
void foo5( const coarray<int[][5]>& y ) { }
void foo10_5( const coarray<int[10][5]>& y ) { }
void foo50( const coarray<int[50]>& y ) { }
int
main( int argc, char* argv[] )
{
    int extent = 10;
    coarray<int[10]> x_10_s;
    coarray<int[]> x_10_d(extent);
    coarray<int[10][5]> x_10_5_s;
    coarray<int[][5]> x_10_5_d(extent);
    coarray<int> y;
    // Perform all valid combinations of passing the coarrays to the functions,
    // using shape_cast when necessary.
    foo( x_10_s );
    foo( x_10_d );
    foo( shape_cast<int[]>( x_10_5_s ) );
    foo( shape_cast<int[]>( x_10_5_d ) );
    foo10( x_10_s );
    foo10( x_10_d );
    foo5( shape_cast<int[2][5]>( x_10_s ) );
    foo5( shape_cast<int[][5]>( x_10_d ) );
    foo5( x_10_5_s );
    foo5( x_10_5_d );
    foo10_5( x_10_5_s );
    foo10_5( x_10_5_d );
    foo50( shape_cast<int[50]>( x_10_5_s ) );
    foo50( shape_cast<int[50]>( x_10_5_d ) );
    // Trivial reshape to same shape.
    shape_cast<int>( y );
    // shape_cast from scalar to array.
    shape_cast<int[1]>( y );
    // shape_cast from array to scalar.
    shape_cast<int>( x_10_s );
    // shape_cast to smaller array.
    shape_cast<int[5]>( x_10_s );
    // shape_cast to larger array.
    bool passed = false;
    try {
        shape_cast<int[25]>( x_10_s );
    } catch ( std::bad_cast& e ) {
        passed = true;
    }
    assert( passed );
    return 0;
}

Control Flow and Synchronization
#include <iostream>
#include <coarray_cpp.h>
using namespace coarray_cpp;
int main( int argc, char* argv[] )
{
    if ( this_image() % 2 == 0 ) {
        std::cout << "Hello from even image "
                  << this_image() << std::endl;
    }
    else {
        std::cout << "Hello from odd image "
                  << this_image() << std::endl;
    }
    return 0;
}

> aprun -n4 ./a.out
Hello from odd image 3
Hello from even image 0
Hello from even image 2
Hello from odd image 1

A call to sync_all() acts as a barrier: no image may proceed beyond its sync_all() until every image has executed a sync_all(). The images need not all execute the same sync_all() statement in the source code; each image must simply execute some sync_all(). If any image fails to participate, the program deadlocks.
A coarray may be passed to a function via a reference or a pointer, but may not be passed by value. If a coarray could be passed by value, the call would have to be collective. There would be a collective allocation of a temporary coarray, the data within the original coarray would need to be copied into the temporary coarray, and eventually the temporary coarray would need to be collectively destroyed. Pass by value is expensive and there are better alternatives, like passing a coarray as a const reference, so it is a compile-time error. No matter how a coarray parameter is declared, the type of the actual argument must agree. Automatic conversions are provided between bounded and unbounded arrays; a conversion from unbounded to bounded performs a run-time check to ensure that the extents match and may throw a mismatched_extent_error exception.
coarray< coatomic<long> > x; // or coarray<coatomic_long>
x(i) ^= 3; // atomic update x = x ^ 3 on image i
long old_value = x(i)++; // atomic increment, saving the old value
long new_value = ++x(i); // atomic increment, saving the new value

coevent

coarray<coevent> x;
if ( this_image() == 0 ) {
    // do something, then notify image 1
    x(1).post();
}
else if ( this_image() == 1 ) {
    // wait for notification from another image
    x().wait(); // then do something
}

coarray<comutex> m;
m(i).lock();
// critical section, typically guarding access to data on image i
m(i).unlock();

Collectives
Coarray C++ provides broadcast and reduction collectives.
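As a rough model of what a sum, minimum, or maximum reduction computes, the contributions of all images can be pictured as one vector with one element per image. The following plain standard C++ sketch runs in a single process and makes no claim about how Coarray C++ implements its collectives; ToyReduction, toy_reduce, and toy_reduce_image_numbers are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical model: element i holds the value contributed by image i.
struct ToyReduction {
    int sum;
    int min;
    int max;
};

// Computes what a sum, min, and max reduction would leave on every image.
inline ToyReduction toy_reduce(const std::vector<int>& per_image) {
    assert(!per_image.empty());
    ToyReduction r;
    r.sum = std::accumulate(per_image.begin(), per_image.end(), 0);
    r.min = *std::min_element(per_image.begin(), per_image.end());
    r.max = *std::max_element(per_image.begin(), per_image.end());
    return r;
}

// Each of n images contributes its own image number, 0 .. n-1.
inline ToyReduction toy_reduce_image_numbers(std::size_t n) {
    std::vector<int> per_image(n);
    std::iota(per_image.begin(), per_image.end(), 0);
    return toy_reduce(per_image);
}
```

When every image contributes its own image number, the sum is n * (n - 1) / 2, the minimum is 0, and the maximum is n - 1.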
#include <cassert>
#include <iostream>
#include <coarray_cpp.h>
using namespace coarray_cpp;
int
main( int argc, char* argv[] )
{
    coarray<int> x;
    size_t image = this_image();
    size_t n = num_images();
    if ( image == 0 ) {
        x = 42;
    }
    sync_all();
    // Make x on every image equal the x on image 0.
    cobroadcast( x, 0 );
    sync_all();
    assert( x == 42 );
    return 0;
}

#include <cassert>
#include <iostream>
#include <coarray_cpp.h>
using namespace coarray_cpp;
int
main( int argc, char* argv[] )
{
    coarray<int> sum;
    coarray<int> min;
    coarray<int> max;
    size_t image = this_image();
    size_t n = num_images();
    sum = image;
    min = image;
    max = image;
    sync_all();
    cosum( sum ); // equivalent to coreduce( sum, std::plus<int> )
    comin( min ); // equivalent to coreduce( min, std::less<int> )
    comax( max ); // equivalent to coreduce( max, std::greater<int> )
    sync_all();
    assert( sum == ( n * ( n - 1 ) / 2 ) );
    assert( min == 0 );
    assert( max == ( n - 1 ) );
    return 0;
}

Exceptions
- invalid_image_error
  This exception is thrown whenever a cosubscript is invalid. For example, given a coarray x in a program executed with 4 images, x(4) triggers an exception because the only valid image numbers are 0, 1, 2, and 3.
- invalid_put_error
  This exception is thrown whenever a user-defined type is copied to a different image, but that type has coarray_traits that specify that it is not trivially puttable.
- mismatched_extent_error
  This exception is thrown when two arrays in an array assignment have a different shape.
- mismatched_image_error
  This exception is thrown when two copointers are compared or subtracted, but the copointers point to objects on different images.
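The cosubscript check described for invalid_image_error can be illustrated with a toy bounds-checked accessor. This sketch is plain standard C++ in a single process; ToyScalarCoarray and cosubscript_is_rejected are hypothetical names, and std::out_of_range is used only as a stand-in for the library's own exception type.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical model of a coarray<int>: one int per image.
class ToyScalarCoarray {
public:
    explicit ToyScalarCoarray(std::size_t num_images) : values_(num_images, 0) {}

    // Cosubscript with validation; std::out_of_range stands in for
    // invalid_image_error in this sketch.
    int& operator()(std::size_t image) {
        if (image >= values_.size()) {
            throw std::out_of_range("invalid image number");
        }
        return values_[image];
    }

private:
    std::vector<int> values_;
};

// Returns true if the cosubscript was rejected.
inline bool cosubscript_is_rejected(std::size_t num_images, std::size_t image) {
    ToyScalarCoarray x(num_images);
    try {
        x(image) = 1;
    } catch (const std::out_of_range&) {
        return true;
    }
    return false;
}
```

With four images, cosubscript 3 is accepted and cosubscript 4 is rejected, matching the example in the list above.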
Memory Consistency Model
The atomic_image_fence() function is the Coarray C++ equivalent of the C++11 std::atomic_thread_fence() function. It has the same behavior with respect to images as std::atomic_thread_fence() has with respect to threads. Typically, it is used to ensure that all memory accesses made by the calling image are visible to all images before performing subsequent memory accesses.
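For readers unfamiliar with std::atomic_thread_fence(), the following C++11 sketch shows the thread-level pattern that the comparison refers to: one thread publishes ordinary data behind a release fence, and another thread reads it after an acquire fence. It uses only the standard library and threads within one process; it does not call atomic_image_fence(), and the function name fence_demo is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Thread-level analogue only; with images, atomic_image_fence() plays
// the role that std::atomic_thread_fence() plays here.
inline int fence_demo() {
    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready(false);  // flag used to publish the data
    int observed = -1;

    std::thread producer([&] {
        payload = 42;
        // Order the payload write before the flag store.
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    });

    std::thread consumer([&] {
        while (!ready.load(std::memory_order_relaxed)) {
            std::this_thread::yield();
        }
        // Pairs with the release fence in the producer.
        std::atomic_thread_fence(std::memory_order_acquire);
        observed = payload;
    });

    producer.join();
    consumer.join();
    return observed;
}
```

Without the two fences, the consumer would have no guarantee of seeing the payload write even after observing the flag.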
The effect of two memory accesses made by an image to its own address space is governed by the C++ memory consistency model. The C++ memory consistency model depends on which version of the C++ standard is implemented by the compiler. In general, a C++03 compiler assumes that an image is single-threaded and offers no memory consistency guarantees if multiple threads perform the accesses, whereas a C++11 compiler provides a detailed memory consistency model that can be used to reason about the effect of memory accesses within a multithreaded image.
A memory access of an object of size N bytes shall be treated as if it was performed as N arbitrarily ordered single-byte memory accesses. For example, the target image of a write shall not rely on the Nth byte being written last to detect whether the full object has been written.
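The byte-ordering rule can be made concrete with a small simulation. The sketch below, in plain standard C++, delivers the bytes of a 4-byte object in reverse order and stops after the first byte, which shows why seeing the last byte says nothing about the rest of the object. The delivery order and the names partial_write and last_byte_is_misleading are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Simulates a write that is delivered one byte at a time in an arbitrary
// order (here: last byte first) and observed after `bytes_delivered`
// bytes have arrived. Purely illustrative.
inline void partial_write(unsigned char* target, const unsigned char* source,
                          std::size_t size, std::size_t bytes_delivered) {
    for (std::size_t k = 0; k < bytes_delivered && k < size; ++k) {
        const std::size_t index = size - 1 - k;  // reverse order
        target[index] = source[index];
    }
}

// Returns true if the last byte has arrived but the object is incomplete.
inline bool last_byte_is_misleading() {
    const unsigned char source[4] = { 0x11, 0x22, 0x33, 0x44 };
    unsigned char target[4] = { 0, 0, 0, 0 };
    partial_write(target, source, 4, 1);  // only one byte delivered so far
    const bool last_arrived = (target[3] == source[3]);
    const bool complete = (std::memcmp(target, source, 4) == 0);
    return last_arrived && !complete;
}
```

A target image that needs to know when a write is complete should rely on synchronization rather than on inspecting the data.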
If two images write to the same object without synchronization, as in this example:

if ( this_image() == 0 ) {
    x(i) = 0;
}
else if ( this_image() == 1 ) {
    x(i) = 1;
}

then the final value of the object is undefined. Various forms of synchronization can impose a specific order, such as in this example:

if ( this_image() == 0 ) {
    x(i) = 0;
}
sync_all();
if ( this_image() == 1 ) {
    x(i) = 1;
}

Here the assignment by image 0 happens before the assignment by image 1 because of the sync_all().

Two atomic operations issued by different images to the same coatomic object have the same ordering relationship as two C++11 threads that perform the same atomic operations on the same object.
Two memory accesses issued by the same image to non-conflicting memory addresses are unordered.
By contrast, conflicting accesses issued by the same image occur in program order. For example, after

x(i) = 1;
int y = x(i);

the value of y will be 1 provided that there are no data races. Therefore, a Coarray C++ implementation for a shared memory system could inline x(i) as a direct memory access, allowing the compiler to make the following optimization (forward substitution):

x(i) = 1;
int y = 1;

For distributed memory systems, providing this ordering guarantee is unfortunately somewhat onerous, but it is consistent with ordering guarantees of other PGAS languages, namely UPC and Fortran. Two memory accesses issued by an image to the same distant memory location typically will pass through the issuing processor's memory system, a high-speed communication network, and finally the target processor's memory system. Each hardware component is likely to contain multiple data pathways to increase bandwidth and resiliency, such that two memory accesses traveling on different pathways could bypass each other. Providing the ordering guarantee may require constraining two memory accesses to the same target location to always take the same hardware path to prevent bypass. Alternatively, software can track outstanding memory accesses and defer issuing an access if there is a conflict; however, software ordering adds overhead to each memory access to check for conflicts as well as storage overhead to track the accesses.
Blocking Versus Non-blocking Accesses
When an image makes a blocking read or write access, it does not proceed to execute its next operation until the access fully completes. By contrast, a non-blocking read or write access permits an image to proceed to execute its next operation before the access fully completes and provides some mechanism for ensuring that the operation has completed later.
Neither the target image nor any other image besides the issuing image is required to be able to observe the effects of a write until some form of image synchronization occurs. Therefore, an implementation is permitted to issue non-blocking writes for all writes provided that it can ensure that conflicting accesses issued by the same image occur in program order. Whether this guarantee is provided by software or hardware depends on the implementation. To explicitly issue and manage completion of a non-blocking write, see Cofutures.
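One way to picture this freedom is a queue of writes that the issuing image has started but not yet completed. The sketch below is a single-process toy in plain standard C++, not a description of any real implementation; ToyWriteQueue and its member names are hypothetical. Writes become visible to the target only at a fence, while a read by the issuing image first completes earlier writes so that conflicting accesses still appear in program order.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <map>
#include <utility>

// Hypothetical model of one target image's memory plus the writes that
// the issuing image has started but not yet completed.
class ToyWriteQueue {
public:
    // Non-blocking write: record it and return immediately.
    void put(int address, int value) {
        pending_.push_back(std::make_pair(address, value));
    }

    // What the target image sees right now (0 if never written).
    int target_view(int address) const {
        std::map<int, int>::const_iterator it = memory_.find(address);
        return it == memory_.end() ? 0 : it->second;
    }

    // Read issued by the same image. This sketch conservatively completes
    // all earlier writes first, which is enough to keep conflicting
    // accesses in program order.
    int get(int address) {
        fence();
        return target_view(address);
    }

    // Complete every pending write, in issue order.
    void fence() {
        while (!pending_.empty()) {
            memory_[pending_.front().first] = pending_.front().second;
            pending_.pop_front();
        }
    }

private:
    std::map<int, int> memory_;
    std::deque<std::pair<int, int> > pending_;
};
```

A less conservative design would complete only the pending writes whose address conflicts with the read, at the cost of the per-access conflict check described earlier.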
coarray<int> x;
...
int y = x(i) + 1; // read of x(i) shall block

int y;
x(i).get( &y );
... // some code that does not access y
atomic_image_fence();
++y;

The get() member function issues a non-blocking read that is not guaranteed to complete until the next fence. The atomic_image_fence() ensures completion of all previously issued memory accesses. The get() plus fence solution is appropriate in many cases, but it may be too broad if the fence would force completion of other accesses on which the issuing image does not yet need to wait. To explicitly issue and manage completion of a non-blocking read, see Cofutures.
coarray<int> x;
...
cofuture<int> f = x(i).get_cofuture(); // or just x(i)
...
int z = f + 1; // using f waits then implicitly returns the value

int y[100];
...
cofuture<void> f = x(i).get_cofuture( y );
... // code that does not read or write y
f.wait();
... // code that reads or writes y

Note that the cofuture's parameter type is void because it does not store any value.

coarray<int> x;
int y;
...
cofuture<void> f = x(i).put_cofuture( y );
... // code that does not write y
f.wait(); // ensure that the x(i) = y assignment completed

Note that the cofuture's parameter type is void because a cofuture for a write never stores a value.
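The value-carrying and void forms of a cofuture have a loose analogue in the C++11 std::future class. The sketch below uses only the standard library and threads within a single process; it is an analogy rather than the cofuture API, and the function names future_value_demo and future_void_demo are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Value-returning form: the future stores the value, like cofuture<int>.
inline int future_value_demo() {
    std::future<int> f = std::async(std::launch::async, [] { return 41; });
    // ... unrelated work could overlap with the operation here ...
    return f.get() + 1;  // get() waits, then yields the value
}

// Buffer-filling form: the result lands in caller-provided storage, so
// the future carries no value, like cofuture<void>.
inline int future_void_demo() {
    std::vector<int> y(100, 0);
    std::future<void> f = std::async(std::launch::async, [&y] {
        for (std::size_t i = 0; i < y.size(); ++i) {
            y[i] = 7;
        }
    });
    // ... code that does not read or write y ...
    f.wait();            // after this, y may be read or written
    return y[99];
}
```

In both forms the caller decides where to wait, which is the finer-grained control that a fence over all outstanding accesses cannot offer.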
Code Patterns
// An X must be allocated and destroyed
// collectively because it contains a coarray.
class X {
    coarray<int> x;
    ...
};
// But a Y defers its "collectiveness" until
// it needs to allocate the coarray.
class Y {
    coarray<int>* y;
    ...
};

These two options provide flexibility for implementing collective objects, or coobjects, which can encapsulate coarray data movement.
coarray<int*> x;
...
for ( int i = 0; i < n; ++i ) {
    int y = x(1)[i]; // reads pointer x(1) each time
    ...
}

A coptr or const_coptr can be used to hoist the read of the pointer:

coarray<int*> x;
...
const_coptr<int> p = x(1)[0].address(); // reads pointer x(1) once
for ( int i = 0; i < n; ++i ) {
    int y = p[i];
    ...
}
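The benefit of hoisting can be estimated by counting remote reads. The following plain standard C++ sketch models image 1 as an object that counts every access made through it; ToyRemote and the two counting functions are hypothetical names, and real costs depend on the implementation and the network.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of image 1: it owns an array and a pointer to it.
// Every access through ToyRemote counts as one remote read.
struct ToyRemote {
    std::vector<int> array;
    std::size_t      remote_reads;

    explicit ToyRemote(std::size_t n) : array(n, 1), remote_reads(0) {}

    const int* read_pointer() { ++remote_reads; return array.data(); }
    int read_element(const int* p, std::size_t i) { ++remote_reads; return p[i]; }
};

// Pointer fetched inside the loop: 2 * n remote reads.
inline std::size_t reads_without_hoisting(std::size_t n) {
    ToyRemote image1(n);
    int sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += image1.read_element(image1.read_pointer(), i);
    }
    (void)sum;
    return image1.remote_reads;
}

// Pointer fetched once before the loop: n + 1 remote reads.
inline std::size_t reads_with_hoisting(std::size_t n) {
    ToyRemote image1(n);
    const int* p = image1.read_pointer();
    int sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += image1.read_element(p, i);
    }
    (void)sum;
    return image1.remote_reads;
}
```

For a loop of ten iterations this model counts 20 remote reads without hoisting and 11 with it.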