An Introduction to the Partitioned Global Address Space (PGAS) Programming Model

Module by: Tim Stitt, Ph.D.

Summary: This module introduces the Partitioned Global Address Space (PGAS) programming paradigm. This paradigm provides both a data and execution model that has the potential to dramatically improve runtime performance and programmer productivity on increasingly ubiquitous multi-core architectures. The fundamental principles of the PGAS paradigm are presented in comparison with traditional parallel programming models. Furthermore, the richer 'Asynchronous PGAS' programming model is also introduced in association with the next-generation parallel programming languages Chapel and X10.


Introduction

The Rise of 'Multi-core' Machines

While Moore's Law continues to predict the doubling of transistors on an integrated circuit every 18 months, performance and power considerations have forced chip designers to embrace multi-core processors in place of higher frequency uni-core processors.

As desktop and high-performance computing architectures tend towards distributed collections of multi-core nodes [1], a new parallel programming paradigm is required to fully exploit the complex distributed and shared-memory hierarchies of these evolving systems.

Recently, a programming model has been developed that has the potential to exploit the best features of this distributed shared-memory architecture. Not only does this model promise improved runtime performance on distributed clusters of SMPs, its data and execution semantics support increased programmer productivity. This model is called the Partitioned Global Address Space (PGAS) model.

A Brief Review of Conventional Parallel Programming Models

Shared-Memory Model

The shared-memory programming model typically exploits a shared memory system, where any memory location is directly accessible by any of the computing processes (i.e. there is a single global address space). This programming model is similar in some respects to the sequential single-processor programming model with the addition of new constructs for synchronising multiple access to shared variables and memory locations.

Figure 1: Shared-Memory Architecture and Programming Model

The shared-memory architecture and programming model is illustrated in Figure 1. In this model any processing element (PE) can make a reference to any memory address within the global address space. The memory request is forwarded through an interconnection network to memory, with the result returned back, via the network, to the PE.

Typically, a shared-memory system will implement a multi-threaded programming model, where each processing element executes a program thread. Currently, OpenMP and Pthreads are the two programming interfaces predominantly used to implement multi-threaded applications on shared-memory systems.
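
For illustration only (this sketch is not part of the original module), the following C program uses Pthreads to express the shared-memory model: every thread reads the shared array directly, each thread accumulates a private partial sum, and a mutex synchronises updates to the shared result.

/* Illustrative sketch (not from the original module). */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000

static double a[N];                  /* shared: directly visible to all threads */
static double sum = 0.0;             /* shared accumulator                      */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *partial_sum(void *arg)
{
    long tid = (long)arg;
    double local = 0.0;              /* private to this thread                  */

    /* Each thread sums its own slice of the shared array. */
    for (long i = tid * (N / NTHREADS); i < (tid + 1) * (N / NTHREADS); i++)
        local += a[i];

    pthread_mutex_lock(&lock);       /* synchronise access to the shared sum    */
    sum += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];

    for (long i = 0; i < N; i++)
        a[i] = 1.0;                  /* initialise the shared data              */

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, partial_sum, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);

    printf("sum = %f\n", sum);       /* expect 1000.000000                      */
    return 0;
}

A Shared-Memory (Pthreads) Sketch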

Exercise 1

Which of the following properties of the shared-memory programming model (e.g. using OpenMP) are generally true:

(a) It's a simpler programming model than the programming models for distributed-memory systems
(b) Provides precise control over data locality and processor affinity
(c) Supports finer-grain parallelism e.g., loop-level parallelism
(d) Supports incremental parallelization
(e) Shared-memory bugs are easier to track down

Solution

(a) It's a simpler programming model than the programming models for distributed-memory systems

Aside:
Arguably, a single global address space simplifies memory references (e.g. array and variable references) within the program code, akin to the sequential programming model.

(c) Supports finer-grain parallelism e.g., loop-level parallelism
Aside:
Generally, with a multi-threaded programming model, program loops can be parallelized by executing independent iterations on different threads. In OpenMP this can be accomplished with the DO work-sharing directive inside a parallel region (or with the combined PARALLEL DO directive):

! A, B, C, N, CHUNK and TID are assumed to be declared earlier;
! OMP_GET_THREAD_NUM is provided by the OMP_LIB module
!$OMP PARALLEL SHARED(A,B,C,CHUNK) PRIVATE(I,TID)
      TID = OMP_GET_THREAD_NUM()
!$OMP DO SCHEDULE(DYNAMIC,CHUNK)
      DO I = 1, N
        C(I) = A(I) + B(I)
        WRITE(*,100) TID, I, C(I)
 100    FORMAT(' Thread',I2,': C(',I3,')=',F8.2)
      END DO
!$OMP END DO NOWAIT
!$OMP END PARALLEL

(d) Supports incremental parallelization
Aside:
As illustrated in the example above, individual code segments can be parallelized by simply wrapping them in OpenMP directives.

Distributed-Memory Model

The distributed-memory programming model exploits a distributed-memory system where each processor maintains its own local memory and has no direct knowledge about another processor's memory. For data to be shared, it must be passed from one processor to another as a message.

Important:

Data that resides in the local memory of a process can be accessed much more quickly than data that resides in the memory of another process.

Figure 2: Distributed-Memory Architecture and Programming Model

The distributed-memory architecture and programming model is illustrated in Figure 2. If a processing element (PE) requires data from another PE to continue its execution, it must request the data. The request is made by sending a message, through the interconnection network, to the destination PE. When the data is available on the destination PE it is returned within a message, through the interconnection network, to the requesting PE.

Typically, a distributed-memory system will implement a message-passing programming model, where processing elements communicate data by sending and receiving messages.

Currently, MPI is the de facto standard specification for implementing message-passing on distributed-memory systems. MPI programs generally follow a Single Program Multiple Data (SPMD) or local-view programming style.
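
For illustration only (this sketch is not part of the original module), the following C program shows the SPMD message-passing style in MPI: every process runs the same program, and data held by one process becomes visible to another only through an explicit send and a matching receive.

/* Illustrative sketch (not from the original module). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double value = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many processes?  */

    if (rank == 0 && size > 1) {
        value = 3.14;                        /* data local to rank 0 */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Rank 1 cannot read rank 0's memory; it must receive a message. */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %f from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}

A Message-Passing (MPI) Sketch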

Exercise 2

Which of the following properties of the distributed-memory programming model (using MPI) are generally true:

(a) For many architectures, it can result in near-optimal performance
(b) Provides precise control over data locality and processor affinity
(c) Allows natural mapping of algorithms to implementation
(d) Supports incremental parallelization
(e) Runs on most parallel platforms

Solution

(a) For many architectures, it can result in near-optimal performance
(b) Provides precise control over data locality and processor affinity

Aside:
Each MPI process has direct access to its local memory for reading and writing data values. Accessing local data is more efficient than accessing data on a remote process.

(e) Runs on most parallel platforms

The Partitioned Global Address Space (PGAS) Model

The 'Best' of Both Worlds

Ideally, a successful parallel programming model should marry the performance and data-locality (partitioning) features of MPI with the programmability and data-referencing simplicity of a shared-memory (global address space) model.

The PGAS programming model aims to achieve these characteristics by providing:

  1. a local-view programming style (which differentiates between local and remote data partitions)
  2. a global address space (which is directly accessible by any process)
  3. compiler-introduced communication to resolve remote references
  4. one-sided communication for improved inter-process performance
  5. support for distributed data structures
Figure 3: PGAS Programming Model

The PGAS programming model is illustrated in Figure 3. In this model variables and arrays can be either shared or local. Each process has private memory for local data items and shared memory for globally shared data values. While the shared memory is partitioned among the cooperating processes (each process contributes memory to the shared global memory), a process can directly access any data item within the global address space with a single address.

Important:

Greater performance is achieved when a process accesses data which is held locally (whether in its private memory or a partition of the global address space).

In Figure 3, five data objects have been declared within the PGAS programming model:

  • Each process has declared a private copy of variable x
  • Process 2 has declared a shared variable y
  • A shared array A is distributed across the global address space

Each instance of variable x can directly read and write values from/to the global address space. Likewise, data at one location in the global address space can be read from, or written to, another location in the global address space directly.

Note:

Typically, PGAS language compilers generate the necessary communication during program compilation.
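
The sketch below is not how any particular PGAS compiler is implemented; it is a hand-written analogy (in C, using MPI's one-sided operations) intended to show what a one-sided access to a remote partition of a global address space amounts to: each process contributes a partition to an MPI window, and another process reads that partition with MPI_Get without any action by the owner.

/* Illustrative analogy only (not from the original module); a PGAS compiler
   would generate equivalent communication automatically.                    */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    int *partition;                  /* this process's share of the global space */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process contributes one integer to the globally accessible space. */
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &partition, &win);
    *partition = 100 + rank;         /* initialise the local partition          */

    MPI_Win_fence(0, win);           /* all processes open an access epoch      */
    if (rank == 0 && size > 1) {
        int remote;
        /* One-sided read of rank 1's partition: rank 1 takes no explicit action. */
        MPI_Get(&remote, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_fence(0, win);       /* complete the transfer                   */
        printf("rank 0 read %d from rank 1's partition\n", remote);
    } else {
        MPI_Win_fence(0, win);       /* matching fence on the other processes   */
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

A One-Sided Access (MPI) Sketch of the PGAS Idea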

PGAS Programming Languages

Currently, three PGAS programming languages are in widespread use on modern computing systems:

  • Unified Parallel C (UPC): explicit PGAS language extensions to ANSI C
  • Co-array Fortran (CAF): explicit PGAS language extensions to Fortran 95
  • Titanium: explicit PGAS language extensions to Java

Matrix Multiplication with UPC, CAF and Titanium

For your curiosity, the following code examples illustrate some of the PGAS language features provided by UPC, CAF and Titanium respectively. For each language, a primitive matrix multiplication operation has been implemented.

Example 1


#include <upc_relaxed.h>

// N, P and M are assumed to be compile-time constants defined elsewhere
// a and c are blocked shared matrices; initialization is not shown here
shared [N*P/THREADS] int a[N][P], c[N][M];
shared [M/THREADS] int b[P][M];
int b_local[P][M];               // each thread keeps a private copy of b

int main(void) {
  int i, j, l;                   // private loop indices
  upc_memget(b_local, b, P*M*sizeof(int));

  // each thread computes the rows of c that have affinity to it
  upc_forall(i = 0; i < N; i++; &c[i][0]) {
    for (j = 0; j < M; j++) {
      c[i][j] = 0;
      for (l = 0; l < P; l++)
        c[i][j] += a[i][l] * b_local[l][j];
    }
  }
  return 0;
}

Matrix Multiplication in UPC

Example 2


! a, b and c are n x n blocks of a matrix distributed over a p x p grid of
! images; (myP, myQ) are this image's grid coordinates and (i, j) index an
! element of the local block (all are assumed to be set elsewhere)
real, dimension(n,n)[p,*] :: a, b, c

do k = 1, n
  do q = 1, p
    c(i,j) = c(i,j) + a(i,k)[myP,q] * b(k,j)[q,myQ]
  end do
end do

Matrix Multiplication in Co-Array Fortran

Example 3


public static void matMul(double [2d] a,
                          double [2d] b,
                          double [2d] c) {
  // iterate over every (i,j) index in the domain of c
  foreach (ij in c.domain()) {
    double [1d] aRowi = a.slice(1, ij[1]);   // row i of a
    double [1d] bColj = b.slice(2, ij[2]);   // column j of b
    foreach (k in aRowi.domain()) {
      c[ij] += aRowi[k] * bColj[k];
    }
  }
}

Matrix Multiplication in Titanium

The Asynchronous Partitioned Global Address Space Model

Within the U.S., the DARPA HPCS project has the goal of raising high-performance computing (HPC) user productivity by a factor of 10 by the year 2010.

Note:

Productivity = Performance + Programmability + Portability + Robustness

As part of this project, two novel parallel programming languages are being funded for development, with the aim of improving programmer productivity (and achieving satisfactory performance) on next-generation computing architectures.

The two languages under development are:

  • Chapel, developed by Cray Inc.
  • X10, developed by IBM

Both of these languages implement an asynchronous partitioned global address space (APGAS) programming model, a term originally coined within the X10 language project. The APGAS model extends the PGAS programming model by providing a richer execution framework than the SPMD style generally used by the traditional PGAS languages.

In particular, asynchrony is achieved by:

  1. permitting each node to execute multiple tasks from a task pool
  2. permitting nodes to invoke work on other nodes

Example 4

This example illustrates asynchronous task creation in Chapel using Chapel's Locales concept.


// Chapel programs begin running on locale 0 by default
var x, y: real;          // allocate x & y on locale 0
on Locales(1) {          // migrate task to locale 1
  var z: real;           // allocate z on locale 1
  writeln(x.locale.id);  // prints "0"
  writeln(z.locale.id);  // prints "1"
  z = x + y;             // requires "get" for x and y
  on Locales(0) do       // migrate back to locale 0
    z = x + y;           // requires "get" for z
                         // return to locale 1
}                        // return

Asynchronous Task Creation in Chapel

Example 5

This example illustrates asynchronous task creation in X10 using X10's async concept.


// global distributed array
final double a[D] = …;
final int k = …;

async (a.distribution[99]) {
    // executed at a[99]'s place
    atomic a[99] = k;
}

Asynchronous Task Creation in X10

Chapel and X10 provide many more high-level language features to improve programmer productivity during the development of parallel applications. These features will be explored in further detail in separate modules.

If you are interested in learning more about the next-generation parallel programming languages X10 and Chapel, please view the following video and audio introductions, presented by the technical leads of each project.

An Audio Introduction to X10 by Vijay Saraswat (IBM)

Audio: X10_audio.mp3

You can download the transcript of the audio presentation here.

A Video Introduction to Chapel by Brad Chamberlain (Cray Inc.)

Seattle Conference on Scalability 2008: Chapel - Productive Parallel Programming at Scale.

Footnotes

  1. We can view a multi-core processor and its associated memory as a traditional symmetric multiprocessing (SMP) node.

Glossary

Multi-core Processor:

A multi-core processor combines two or more independent cores (normally CPUs) into a single package composed of a single integrated circuit (IC).

Further information: http://en.wikipedia.org/wiki/Multi_core

SMP:

Symmetric Multiprocessing: a multiprocessor computer architecture where two or more identical processors connect to a single shared main memory.

Further information: http://en.wikipedia.org/wiki/Symmetric_multiprocessing

Shared-Memory:

A shared memory system is one that typically provides a large block of random access memory that can be accessed by several different central processing units (CPUs) in a multiple-processor computer system.

Further information: http://en.wikipedia.org/wiki/Shared_memory

Multi-threaded Programming Model:

Multi-threading is a popular programming and execution model that allows multiple threads to exist within the context of a single process. These threads share the process' resources but are able to execute independently.

Further information: http://en.wikipedia.org/wiki/Thread_(computer_science)

OpenMP:

OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform shared memory multiprocessing programming (typically using a fork-join model) in C, C++ and Fortran on many architectures. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

Further information: http://openmp.org/wp/

Pthreads:

Pthreads or POSIX Threads, is a POSIX standard that defines an API for creating and manipulating threads.

Further information: http://en.wikipedia.org/wiki/POSIX_Threads

Distributed-Memory System:

A distributed-memory system is a multiple-processor computer system in which each processor has its own private memory. Computational tasks can only operate on local data, and if remote data is required, the computational task must communicate with one or more remote processors.

Further information: http://en.wikipedia.org/wiki/Distributed_memory

Message-Passing Model:

A communication model where data is sent and received in messages between communicating processes.

Further information: http://en.wikipedia.org/wiki/Message_passing

MPI:

Message-Passing Interface: a language-independent communications protocol used to program parallel computers using both point-to-point and collective communications. MPI defines a message-passing application programmer interface, together with protocol and semantic specifications for how its features must behave in any implementation.

Further information: http://www.mpi-forum.org/

SPMD:

Single Program Multiple Data: multiple autonomous processors simultaneously execute the same program at independent points.

Further information: http://en.wikipedia.org/wiki/SPMD

See Also: Local-view Model
Local-view Model:

A programming model in which a process only has direct access to the local components of a distributed data structure.

See Also: SPMD
One-sided Communication:

Traditional interprocess communication (used by MPI) requires cooperation and synchronization between sending and receiving processes; this is referred to as two-sided communication. In one-sided communication, a process can update or interrogate the memory of another process without any intervention from the destination process.

Unified Parallel C (UPC):

Explicit PGAS language extensions to ANSI C.

Further information: http://upc.gwu.edu/

Co-array Fortran (CAF):

Explicit PGAS language extensions to Fortran 95.

Further information: http://www.co-array.org/

Titanium:

Explicit PGAS language extensions to Java.

Further information: http://titanium.cs.berkeley.edu/

HPCS:

The DARPA High Productivity Computing Systems project is focused on providing a new generation of economically viable high productivity computing systems for national security and for the industrial user community.

Further information: http://www.highproductivity.org/

X10:

Experimental new language currently under development at IBM in collaboration with academic partners. The X10 effort is part of the IBM PERCS project (Productive Easy-to-use Reliable Computer Systems) in the DARPA program on High Productivity Computer Systems.

Further information: http://domino.research.ibm.com/comm/research_projects.nsf/pages/x10.index.html

Chapel:

Cascade High-Productivity Language: a new parallel programming language being developed by Cray Inc. as part of the DARPA-led High Productivity Computing Systems program (HPCS).

Further information: http://chapel.cs.washington.edu/

Locales:

Chapel's abstract representation of the target architecture. A locale's properties include:

  1. threads within a locale have uniform access to local memory
  2. memory within other locales is accessible, but at a price
  3. locales are defined for a given architecture by a Chapel compiler e.g., a multicore processor or SMP node could be a locale

async:

The X10 asynchronous execution language construct.

async (P) S:

  1. creates a new child activity at place P, that executes statement S
  2. returns immediately
