Summary: This module introduces the Partitioned Global Address Space (PGAS) programming paradigm. This paradigm provides both a data and execution model that has the potential to dramatically improve runtime performance and programmer productivity on increasingly ubiquitous multi-core architectures. The fundamental principles of the PGAS paradigm are presented in comparison with traditional parallel programming models. Furthermore, the richer 'Asynchronous PGAS' programming model is also introduced in association with the next-generation parallel programming languages Chapel and X10.
While Moore's Law continues to predict the doubling of transistors on an integrated circuit every 18 months, performance and power considerations have forced chip designers to embrace multi-core processors in place of higher frequency uni-core processors.
As desktop and high-performance computing architectures tend towards distributed collections of multi-core nodes1, a new parallel programming paradigm is required to fully exploit the complex distributed and shared-memory hierarchies of these evolutionary systems.
Recently, a programming model has been developed that has the potential to exploit the best features of this distributed shared-memory architecture. Not only does this model promise improved runtime performance on distributed clusters of SMPs, its data and execution semantics support increased programmer productivity. This model is called the Partitioned Global Address Space (PGAS) model.
The shared-memory programming model typically exploits a shared memory system, where any memory location is directly accessible by any of the computing processes (i.e. there is a single global address space). This programming model is similar in some respects to the sequential single-processor programming model with the addition of new constructs for synchronising multiple access to shared variables and memory locations.
The shared-memory architecture and programming model is illustrated in Figure 1. In this model any processing element (PE) can make a reference to any memory address within the global address space. The memory request is forwarded through an interconnection network to memory, with the result returned back, via the network, to the PE.
Typically, a shared-memory system will implement a multi-threaded programming model, where each processing element executes program thread. Currently, OpenMP and Pthreads are the two application programming interfaces (API) predominantly used to implement multi-threaded applications on shared-memory systems.
Which of the following properties of the shared-memory programming model (e.g using OpenMP) are generally true:
(a) It's a simpler programming model than the programming models for distributed-memory systems
(b) Provides precise control over data locality and processor affinity
(c) Supports finer-grain parallelism e.g., loop-level parallelism
(d) Supports incremental parallelization
(e) Shared-memory programming bugs are easier to track down
(a) TRUE - It's a simpler programming model than the programming models for distributed-memory systems
!$OMP DO SCHEDULE(DYNAMIC,CHUNK)
DO I = 1, N
C(I) = A(I) + B(I)
WRITE(*,100) TID,I,C(I)
100 FORMAT(' Thread',I2,': C(',I3,')=',F8.2)
ENDDO
!$OMP END DO NOWAIT
The distributed-memory programming model exploits a distributed-memory system where each processor maintains its own local memory and has no direct knowledge about another processor's memory (a "share nothing" approach). For data to be shared, it must be passed from one processor to another as a message.
The distributed-memory architecture and programming model is illustrated in Figure 2. If a processing element (PE) requires data from another PE to continue its execution, it must request the data. The request is made by sending a message, through the interconnection network, to the destination PE. When the data is available on the destination PE it is returned within a message, through the interconnection network, to the requesting PE.
Typically, a distributed-memory system will implement a message-passing programming model, where processing elements communicate data by sending and receiving messages.
Currently, MPI is the de facto standard specification for implementing message-passing on distributed-memory systems. MPI programs generally follow a Single Program Multiple Data (SPMD) programming style.
Which of the following properties of the distributed-memory programming model (using MPI) are generally true:
(a) For many architectures, it can result in near-optimal performance
(b) Provides precise control over data locality and processor affinity
(c) Allows natural mapping of algorithms to implementation
(d) Supports incremental parallelization
(e) Runs on most parallel platforms
(a) TRUE - For many architectures, it can result in near-optimal performance
(b) TRUE - Provides precise control over data locality and processor affinity
Ideally a successful parallel programming model should hope to marry the performance and data locality (partitioning) features of MPI with the programmability and data referencing simplicity of a shared-memory (global address space) model.
The PGAS programming model aims to achieve these characteristics by providing:
The PGAS programming model is illustrated in Figure 3. In this model variables and arrays can be either shared or local. Each process has private memory for local data items and shared memory for globally shared data values. While the shared-memory is partitioned among the cooperating processes (each process will contribute memory to to the shared global memory), a process can directly access any data item within the global address space with a single address.
In Figure 3, five data objects have been declared within the PGAS programming model:
Each instance of variable x can directly write and read values to/from the global address space. Likewise an address in the global address space can be directly read or written to from another address within the global address space.
Currently there are three (3) PGAS programming languages that are becoming commonplace on modern computing systems:
For your curiosity, the following code examples illustrate some of the PGAS language features provided by UPC, CAF and Titanium respectively. For each language, a primitive matrix multiplication operation has been implemented.
#include <upc_relaxed.h>
// a and c are blocked shared matrices, initialization is not currently implemented
shared [N*P /THREADS] int a[N][P], c[N][M];
shared[M/THREADS] int b[P][M];
int b_local[P][M];
void main (void) {
int i, j , l; // private variables
upc_memget(b_local, b, P*M*sizeof(int));
upc_forall(i = 0 ; i<N ; i++; &c[i][0]) {
for (j=0 ; j<M ;j++) {
c[i][j] = 0;
for (l= 0 ; l<P ; l++) c[i][j] += a[i][l]*b_local[l][j];
}
}
}Matrix Multiplication in UPC
real,dimension(n,n)[p,*] :: a,b,c
do k=1,n
do q=1,p
c(i,j) = c(i,j) + a(i,k)[myP, q]*b(k,j)[q,myQ]
end do
end do
Matrix Multiplication in Co-Array Fortran
public static void matMul(double [2d] a,
double [2d] b,
double [2d] c) {
foreach (ij in c.domain()) {
double [1d] aRowi = a.slice(1, ij[1]);
double [1d] bColj = b.slice(2, ij[2]);
foreach (k in aRowi.domain()) {
c[ij] += aRowi[k] * bColj[k];
}
}
}
Matrix Multiplication in Titanium
Within the U.S., the DARPA HPCS project has the goal to raise high-performance computing (HPC) user productivity by a factor of 10 by the year 2010.
As part of this project two novel parallel programming languages are being funded for development, with the aim of improving programmer productivity on next-generation computing architectures.
The two languages under development are:
Both these languages implement an asynchronous partitioned global address space (APGAS) programming model; a term originally coined within the X10 language project. The APGAS model extends the PGAS programming model by providing a richer execution framework than the SPMD style generally used by the traditional PGAS languages.
In particular, asynchrony is achieved by:
This example illustrates asynchronous task creation in Chapel using Chapel's Locales concept.
// Chapel programs begin running on locale 0 by default
var x, y: real; // allocate x and y on locale 0
x=1;y=2;
begin on Locales(1) { // asynchronously create task on locale 1
var z: real; // allocate z on locale 1
writeln(x.locale.id); // print “0”
writeln(z.locale.id); // print “1”
z = x + y; // requires “get” for x and y
writeln(z);
}Asynchronous Task Creation in Chapel
This example illustrates asynchronous task creation in X10 using X10's async concept.
// global dist. array
final double a[D] = …;
final int k = …;
async ( a.distribution[99] ) {
// executed at A[99]’s place
atomic a[99] = k;
}
Asynchronous Task Creation in X10
Chapel and X10 provide many more high-level language features to improve programmer productivity during the development of parallel applications. These features will be explored in further detail in separate modules.
If you are interested in learning more about the next-generation parallel programming languages X10 and Chapel, please view the following video and audio introductions, presented by the technical leads of each project.
You can download the transcript of the audio presentation here.
Google Seattle Conference on Scalability 2008: Chapel - Productive Parallel Programming at Scale.
The X10 asynchronous execution language construct.
async (P) S:
Cascade High-Productivity Language: a new parallel programming language being developed by Cray Inc. as part of the DARPA-led High Productivity Computing Systems program (HPCS).
Further information: http://chapel.cs.washington.edu/
Explicit PGAS language extensions to Fortran 95.
Further information: http://www.co-array.org/
A distributed-memory system is a multiple-processor computer system in which each processor has its own private memory. Computational tasks can only operate on local data, and if remote data is required, the computational task must communicate with one or more remote processors.
Further information: http://en.wikipedia.org/wiki/Distributed_memory
The DARPA High Productivity Computing Systems project is focused on providing a new generation of economically viable high productivity computing systems for national security and for the industrial user community.
Further information: http://www.highproductivity.org/
Chapel's abstract representation of the target architecture. A locales properties include:
A programming model in which a process only has direct access to the local components of a distributed data structure.
A communication protocol where data is sent and received in packets between communicating processes.
Further information: http://en.wikipedia.org/wiki/Message_passing
Message-Passing Interface: a language-independent communications protocol used to program parallel computers using both point-to-point and collective communications. MPI defines a message-passing application programmer interface, together with protocol and semantic specifications for how its features must behave in any implementation
Further information: http://www.mpi-forum.org/
A multi-core processor combines two or more independent cores (normally a CPU) into a single package composed of a single integrated circuit (IC).
Further information: http://en.wikipedia.org/wiki/Multi_core
Multi-threading is a popular programming and execution model that allows multiple threads to exist within the context of a single process. These threads share the process' resources but are able to execute independently.
Further information: http://en.wikipedia.org/wiki/Thread_(computer_science)
Traditional interprocess communication (used by MPI) requires cooperation and synchronization between sending and receiving processes; this is referred to as two-sided communication.
In one-sided communication, a process can update or interrogate the memory of another process without any intervention from the destination process.
OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform shared memory multiprocessing programming (typically using a fork-join model) in C, C++ and Fortran on many architectures. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.
Further information: http://openmp.org/wp/
A programming paradigm or model is a mental framework that governs the entire programming lifecycle including design, coding, testing, debugging and tuning.
Pthreads or POSIX Threads, is a POSIX standard that defines an API for creating and manipulating threads.
Further information: http://en.wikipedia.org/wiki/POSIX_Threads
Symmetric Multiprocessing: a multiprocessor computer-architecture where two or more identical processors can connect to a single shared main memory.
Further information: http://en.wikipedia.org/wiki/Symmetric_multiprocessing
Single Program Multiple Data: multiple autonomous processors simultaneously execute the same program at independent points.
Further information: http://en.wikipedia.org/wiki/SPMD
Explicit PGAS language extensions to Java.
Further information: http://titanium.cs.berkeley.edu/
Explicit PGAS language extensions to ANSI C.
Further information: http://upc.gwu.edu/
Experimental new language currently under development at IBM in collaboration with academic partners. The X10 effort is part of the IBM PERCS project (Productive Easy-to-use Reliable Computer Systems) in the DARPA program on High Productivity Computer Systems.
Further information: http://domino.research.ibm.com/comm/research_projects.nsf/pages/x10.index.html