In this example, we implement our heat flow problem in MPI using a similar decomposition to the PVM example. There are several ways to approach the prob- lem. We could almost translate PVM calls to corresponding MPI calls using the MPI_COMM_WORLD communicator. However, to showcase some of the MPI features, we create a Cartesian communicator:
PROGRAM MHEATC
INCLUDE ’mpif.h’
INCLUDE ’mpef.h’
INTEGER ROWS,COLS,TOTCOLS
PARAMETER(MAXTIME=200)
* This simulation can be run on MINPROC or greater processes.
* It is OK to set MINPROC to 1 for testing purposes
* For a large number of rows and columns, it is best to set MINPROC
* to the actual number of runtime processes
PARAMETER(MINPROC=2) PARAMETER(ROWS=200,TOTCOLS=200,COLS=TOTCOLS/MINPROC)
DOUBLE PRECISION RED(0:ROWS+1,0:COLS+1),BLACK(0:ROWS+1,0:COLS+1)
INTEGER S,E,MYLEN,R,C
INTEGER TICK,MAXTIME
CHARACTER*30 FNAME
The basic data structures are much the same as in the PVM example. We allocate a subset of the heat arrays in each process. In this example, the amount of space allocated in each process is set by the compile-time variable MINPROC. The simulation can execute on more than MINPROC processes (wasting some space in each process), but it can’t execute on less than MINPROC processes, or there won’t be sufficient total space across all of the processes to hold the array:
INTEGER COMM1D,INUM,NPROC,IERR
INTEGER DIMS(1),COORDS(1)
LOGICAL PERIODS(1)
LOGICAL REORDER
INTEGER NDIM
INTEGER STATUS(MPI_STATUS_SIZE)
INTEGER RIGHTPROC, LEFTPROC
These data structures are used for our interaction with MPI. As we will be doing a one-dimensional Cartesian decomposition, our arrays are dimensioned to one. If you were to do a two-dimensional decomposition, these arrays would need two elements:
PRINT *,’Calling MPI_INIT’
CALL MPI_INIT( IERR )
PRINT *,’Back from MPI_INIT’
CALL MPI_COMM_SIZE( MPI_COMM_WORLD, NPROC, IERR )
The call to MPI_INIT creates the appropriate number of processes. Note that in the output, the PRINT statement before the call only appears once, but the second PRINT appears once for each process. We call MPI_COMM_SIZE to determine the size of the global communicator MPI_COMM_WORLD. We use this value to set up our Cartesian topology:
* Create new communicator that has a Cartesian topology associated
* with it - MPI_CART_CREATE returns COMM1D - A communicator descriptor
DIMS(1) = NPROC
PERIODS(1) = .FALSE.
REORDER = .TRUE.
NDIM = 1
CALL MPI_CART_CREATE(MPI_COMM_WORLD, NDIM, DIMS, PERIODS,
+ REORDER, COMM1D, IERR)
Now we create a one-dimensional (NDIM=1) arrangement of all of our processes (MPI_COMM_WORLD). All of the parameters on this call are input values except for COMM1D and IERR. COMM1D is an integer “communicator handle.” If you print it out, it will be a value such as 134. It is not actually data, it is merely a handle that is used in other calls. It is quite similar to a file descriptor or unit number used when performing input-output to and from files.
The topology we use is a one-dimensional decomposition that isn’t periodic. If we specified that we wanted a periodic decomposition, the far-left and far-right processes would be neighbors in a wrapped-around fashion making a ring. Given that it isn’t periodic, the far-left and far-right processes have no neighbors.
In our PVM example above, we declared that Process 0 was the far-right process, Process NPROC-1 was the far-left process, and the other processes were arranged linearly between those two. If we set REORDER to .FALSE., MPI also chooses this arrangement. However, if we set REORDER to .TRUE., MPI may choose to arrange the processes in some other fashion to achieve better performance, assuming that you are communicating with close neighbors.
Once the communicator is set up, we use it in all of our communication operations:
* Get my rank in the new communicator
CALL MPI_COMM_RANK( COMM1D, INUM, IERR)
Within each communicator, each process has a rank from zero to the size of the communicator minus 1. The MPI_COMM_RANK tells each process its rank within the communicator. A process may have a different rank in the COMM1D communicator than in the MPI_COMM_WORLD communicator because of some reordering.
Given a Cartesian topology communicator, we can extract information from the communicator using the MPI_CART_GET routine:
* Given a communicator handle COMM1D, get the topology, and my position
* in the topology
CALL MPI_CART_GET(COMM1D, NDIM, DIMS, PERIODS, COORDS, IERR)
In this call, all of the parameters are output values rather than input values as in the MPI_CART_CREATE call. The COORDS variable tells us our coordinates within the communicator. This is not so useful in our one-dimensional example, but in a two-dimensional process decomposition, it would tell our current position in that two-dimensional grid:
* Returns the left and right neighbors 1 unit away in the zeroth dimension
* of our Cartesian map - since we are not periodic, our neighbors may
* not always exist - MPI_CART_SHIFT handles this for us
CALL MPI_CART_SHIFT(COMM1D, 0, 1, LEFTPROC, RIGHTPROC, IERR)
CALL MPE_DECOMP1D(TOTCOLS, NPROC, INUM, S, E)
MYLEN = ( E - S ) + 1
IF ( MYLEN.GT.COLS ) THEN
PRINT *,’Not enough space, need’,MYLEN,’ have ’,COLS
PRINT *,TOTCOLS,NPROC,INUM,S,E
STOP
ENDIF
PRINT *,INUM,NPROC,COORDS(1),LEFTPROC,RIGHTPROC, S, E
We can use MPI_CART_SHIFT to determine the rank number of our left and right neighbors, so we can exchange our common points with these neighbors. This is necessary because we can’t simply send to INUM-1 and INUM+1 if MPI has chosen to reorder our Cartesian decomposition. If we are the far-left or far-right process, the neighbor that doesn’t exist is set to MPI_PROC_NULL, which indicates that we have no neighbor. Later when we are performing message sending, it checks this value and sends messages only to real processes. By not sending the message to the “null process,” MPI has saved us an IF test.
To determine which strip of the global array we store and compute in this process, we call a utility routine called MPE_DECOMP1D that simply does several calculations to evenly split our 200 columns among our processes in contiguous strips. In the PVM version, we need to perform this computation by hand.
The MPE_DECOMP1D routine is an example of an extended MPI library call (hence the MPE prefix). These extensions include graphics support and logging tools in addition to some general utilities. The MPE library consists of routines that were useful enough to standardize but not required to be supported by all MPI implementations. You will find the MPE routines supported on most MPI implementations.
Now that we have our communicator group set up, and we know which strip each process will handle, we begin the computation:
* Start Cold
DO C=0,COLS+1
DO R=0,ROWS+1
BLACK(R,C) = 0.0
ENDDO
ENDDO
As in the PVM example, we set the plate (including boundary values) to zero.
All processes begin the time step loop. Interestingly, like in PVM, there is no need for any synchronization. The messages implicitly synchronize our loops.
The first step is to store the permanent heat sources. We need to use a routine because we must make the store operations relative to our strip of the global array:
* Begin running the time steps
DO TICK=1,MAXTIME
* Set the persistent heat sources
CALL STORE(BLACK,ROWS,COLS,S,E,ROWS/3,TOTCOLS/3,10.0,INUM)
CALL STORE(BLACK,ROWS,COLS,S,E,2*ROWS/3,TOTCOLS/3,20.0,INUM)
CALL STORE(BLACK,ROWS,COLS,S,E,ROWS/3,2*TOTCOLS/3,-20.0,INUM)
CALL STORE(BLACK,ROWS,COLS,S,E,2*ROWS/3,2*TOTCOLS/3,20.0,INUM)
All of the processes set these values independently depending on which process has which strip of the overall array.
Now we exchange the data with our neighbors as determined by the Cartesian communicator. Note that we don’t need an IF test to determine if we are the far-left or far-right process. If we are at the edge, our neighbor setting is MPI_PROC_NULL and the MPI_SEND and MPI_RECV calls do nothing when given this as a source or destination value, thus saving us an IF test.
Note that we specify the communicator COMM1D because the rank values we are using in these calls are relative to that communicator:
* Send left and receive right
CALL MPI_SEND(BLACK(1,1),ROWS,MPI_DOUBLE_PRECISION,
+ LEFTPROC,1,COMM1D,IERR)
CALL MPI_RECV(BLACK(1,MYLEN+1),ROWS,MPI_DOUBLE_PRECISION,
+ RIGHTPROC,1,COMM1D,STATUS,IERR)
* Send Right and Receive left in a single statement
CALL MPI_SENDRECV(
+ BLACK(1,MYLEN),ROWS,COMM1D,RIGHTPROC,2,
+ BLACK(1,0),ROWS,COMM1D,LEFTPROC, 2,
+ MPI_COMM_WORLD, STATUS, IERR)
Just to show off, we use both the separate send and receive, and the combined send and receive. When given a choice, it’s probably a good idea to use the combined operations to give the runtime environment more flexibility in terms of buffering. One downside to this that occurs on a network of workstations (or any other high-latency interconnect) is that you can’t do both send operations first and then do both receive operations to overlap some of the communication delay.
Once we have all of our ghost points from our neighbors, we can perform the algorithm on our subset of the space:
* Perform the flow
DO C=1,MYLEN
DO R=1,ROWS
RED(R,C) = ( BLACK(R,C) +
+ BLACK(R,C-1) + BLACK(R-1,C) +
+ BLACK(R+1,C) + BLACK(R,C+1) ) / 5.0
ENDDO
ENDDO
* Copy back - Normally we would do a red and black version of the loop
DO C=1,MYLEN
DO R=1,ROWS
BLACK(R,C) = RED(R,C)
ENDDO
ENDDO
ENDDO
Again, for simplicity, we don’t do the complete red-black computation. We have no synchronization at the bottom of the loop because the messages implicitly synchronize the processes at the top of the next loop.
Again, we dump out the data for verification. As in the PVM example, one good test of basic correctness is to make sure you get exactly the same results for varying numbers of processes:
* Dump out data for verification
IF ( ROWS .LE. 20 ) THEN
FNAME = ’/tmp/mheatcout.’ // CHAR(ICHAR(’0’)+INUM)
OPEN(UNIT=9,NAME=FNAME,FORM=’formatted’)
DO C=1,MYLEN
WRITE(9,100)(BLACK(R,C),R=1,ROWS)
100 FORMAT(20F12.6)
ENDDO
CLOSE(UNIT=9)
ENDIF
To terminate the program, we call MPI_FINALIZE:
* Lets all go together
CALL MPI_FINALIZE(IERR)
END
As in the PVM example, we need a routine to store a value into the proper strip of the global array. This routine simply checks to see if a particular global element is in this process and if so, computes the proper location within its strip for the value. If the global element is not in this process, this routine simply returns doing nothing:
SUBROUTINE STORE(RED,ROWS,COLS,S,E,R,C,VALUE,INUM)
REAL*8 RED(0:ROWS+1,0:COLS+1)
REAL VALUE
INTEGER ROWS,COLS,S,E,R,C,I,INUM
IF ( C .LT. S .OR. C .GT. E ) RETURN
I = ( C - S ) + 1
* PRINT *,’STORE, INUM,R,C,S,E,R,I’,INUM,R,C,S,E,R,I,VALUE RED(R,I) = VALUE
RETURN
END
When this program is executed, it has the following output:
% mpif77 -c mheatc.f mheatc.f:
MAIN mheatc:
store:
% mpif77 -o mheatc mheatc.o -lmpe
% mheatc -np 4
Calling MPI_INIT
Back from MPI_INIT
Back from MPI_INIT
Back from MPI_INIT
Back from MPI_INIT
0 4 0 -1 1 1 50
2 4 2 1 3 101 150
3 4 3 2 -1 151 200
1 4 1 0 2 51 100
%
As you can see, we call MPI_INIT to activate the four processes. The PRINT statement immediately after the MPI_INIT call appears four times, once for each of the activated processes. Then each process prints out the strip of the array it will process. We can also see the neighbors of each process including -1 when a process has no neighbor to the left or right. Notice that Process 0 has no left neighbor, and Process 3 has no right neighbor. MPI has provided us the utilities to simplify message-passing code that we need to add to implement this type of grid- based application.
When you compare this example with a PVM implementation of the same problem, you can see some of the contrasts between the two approaches. Programmers who wrote the same six lines of code over and over in PVM combined them into a single call in MPI. In MPI, you can think “data parallel” and express your program in a more data-parallel fashion.
In some ways, MPI feels less like assembly language than PVM. However, MPI does take a little getting used to when compared to PVM. The concept of a Cartesian communicator may seem foreign at first, but with understanding, it becomes a flexible and powerful tool.
"The purpose of Chuck Severence's book, High Performance Computing has always been to teach new programmers and scientists about the basics of High Performance Computing. This book is for learners […]"