This next example is a rather complicated application that implements the heat flow problem in PVM. In many ways, it gives some insight into the work that is performed by the HPF environment. We will solve a heat flow in a two-dimensional plate with four heat sources and the edges in zero-degree water, as shown in Figure 1.
The data will be spread across all of the processes using a (*, BLOCK) distribution. Columns are distributed to processes in contiguous blocks, and all the row elements in a column are stored on the same process. As with HPF, the process that “owns” a data cell performs the computations for that cell after retrieving any data necessary to perform the computation.
We use a red-black approach but for simplicity, we copy the data back at the end of each iteration. For a true red-black, you would perform the computation in the opposite direction every other time step.
Note that instead of spawning slave process, the parent process spawns additional copies of itself. This is typical of SPMD-style programs. Once the additional processes have been spawned, all the processes wait at a barrier before they look for the process numbers of the members of the group. Once the processes have arrived at the barrier, they all retrieve a list of the different process numbers:
% cat pheat.f
PROGRAM PHEAT
INCLUDE ’../include/fpvm3.h’
INTEGER NPROC,ROWS,COLS,TOTCOLS,OFFSET
PARAMETER(NPROC=4,MAXTIME=200)
PARAMETER(ROWS=200,TOTCOLS=200)
PARAMETER(COLS=(TOTCOLS/NPROC)+3)
REAL*8 RED(0:ROWS+1,0:COLS+1), BLACK(0:ROWS+1,0:COLS+1)
LOGICAL IAMFIRST,IAMLAST
INTEGER INUM,INFO,TIDS(0:NPROC-1),IERR
INTEGER I,R,C
INTEGER TICK,MAXTIME
CHARACTER*30 FNAME
* Get the SPMD thing going - Join the pheat group
CALL PVMFJOINGROUP(’pheat’, INUM)
* If we are the first in the pheat group, make some helpers
IF ( INUM.EQ.0 ) THEN
DO I=1,NPROC-1
CALL PVMFSPAWN(’pheat’, 0, ’anywhere’, 1, TIDS(I), IERR)
ENDDO
ENDIF
* Barrier to make sure we are all here so we can look them up
CALL PVMFBARRIER( ’pheat’, NPROC, INFO )
* Find my pals and get their TIDs - TIDS are necessary for sending
DO I=0,NPROC-1
CALL PVMFGETTID(’pheat’, I, TIDS(I))
ENDDO
At this point in the code, we have NPROC processes executing in an SPMD mode. The next step is to determine which subset of the array each process will compute. This is driven by the INUM variable, which ranges from 0 to 3 and uniquely identifies these processes.
We decompose the data and store only one quarter of the data on each process. Using the INUM variable, we choose our continuous set of columns to store and compute. The OFFSET variable maps between a “global” column in the entire array and a local column in our local subset of the array. Figure 2 shows a map that indicates which processors store which data elements. The values marked with a B are boundary values and won’t change during the simulation. They are all set to 0. This code is often rather tricky to figure out. Performing a (BLOCK, BLOCK) distribution requires a two-dimensional decomposition and exchanging data with the neighbors above and below, in addition to the neighbors to the left and right:
* Compute my geometry - What subset do I process? (INUM=0 values)
* Actual Column = OFFSET + Column (OFFSET = 0)
* Column 0 = neighbors from left
* Column 1 = send to left
* Columns 1..mylen My cells to compute
* Column mylen = Send to right (mylen=50)
* Column mylen+1 = Neighbors from Right (Column 51)
IAMFIRST = (INUM .EQ. 0)
IAMLAST = (INUM .EQ. NPROC-1)
OFFSET = (ROWS/NPROC * INUM )
MYLEN = ROWS/NPROC
IF ( IAMLAST ) MYLEN = TOTCOLS - OFFSET
PRINT *,’INUM:’,INUM,’ Local’,1,MYLEN,
+ ’ Global’,OFFSET+1,OFFSET+MYLEN
* Start Cold
DO C=0,COLS+1
DO R=0,ROWS+1
BLACK(R,C) = 0.0
ENDDO
ENDDO
Now we run the time steps. The first act in each time step is to reset the heat sources. In this simulation, we have four heat sources placed near the middle of the plate. We must restore all the values each time through the simulation as they are modified in the main loop:
* Begin running the time steps
DO TICK=1,MAXTIME
* Set the heat persistent sources
CALL STORE(BLACK,ROWS,COLS,OFFSET,MYLEN,
+ ROWS/3,TOTCOLS/3,10.0,INUM)
CALL STORE(BLACK,ROWS,COLS,OFFSET,MYLEN,
+ 2*ROWS/3,TOTCOLS/3,20.0,INUM)
CALL STORE(BLACK,ROWS,COLS,OFFSET,MYLEN,
+ ROWS/3,2*TOTCOLS/3,-20.0,INUM)
CALL STORE(BLACK,ROWS,COLS,OFFSET,MYLEN,
+ 2*ROWS/3,2*TOTCOLS/3,20.0,INUM)
Now we perform the exchange of the “ghost values” with our neighboring processes. For example, Process 0 contains the elements for global column 50. To compute the next time step values for column 50, we need column 51, which is stored in Process 1. Similarly, before Process 1 can compute the new values for column 51, it needs Process 0’s values for column 50.
Figure 3 shows how the data is transferred between processors. Each process sends its leftmost column to the left and its rightmost column to the right. Because the first and last processes border unchanging boundary values on the left and right respectively, this is not necessary for columns one and 200. If all is done properly, each process can receive its ghost values from their left and right neighbors.
The net result of all of the transfers is that for each space that must be computed, it’s surrounded by one layer of either boundary values or ghost values from the right or left neighbors:
* Send left and right
IF ( .NOT. IAMFIRST ) THEN
CALL PVMFINITSEND(PVMDEFAULT,TRUE)
CALL PVMFPACK( REAL8, BLACK(1,1), ROWS, 1, INFO )
CALL PVMFSEND( TIDS(INUM-1), 1, INFO )
ENDIF
IF ( .NOT. IAMLAST ) THEN
CALL PVMFINITSEND(PVMDEFAULT,TRUE)
CALL PVMFPACK( REAL8, BLACK(1,MYLEN), ROWS, 1, INFO )
CALL PVMFSEND( TIDS(INUM+1), 2, INFO )
ENDIF
* Receive right, then left
IF ( .NOT. IAMLAST ) THEN
CALL PVMFRECV( TIDS(INUM+1), 1, BUFID )
CALL PVMFUNPACK ( REAL8, BLACK(1,MYLEN+1), ROWS, 1, INFO
ENDIF
IF ( .NOT. IAMFIRST ) THEN
CALL PVMFRECV( TIDS(INUM-1), 2, BUFID )
CALL PVMFUNPACK ( REAL8, BLACK(1,0), ROWS, 1, INFO)
ENDIF
This next segment is the easy part. All the appropriate ghost values are in place, so we must simply perform the computation in our subspace. At the end, we copy back from the RED to the BLACK array; in a real simulation, we would perform two time steps, one from BLACK to RED and the other from RED to BLACK, to save this extra copy:
* Perform the flow
DO C=1,MYLEN
DO R=1,ROWS
RED(R,C) = ( BLACK(R,C) +
+ BLACK(R,C-1) + BLACK(R-1,C) +
+ BLACK(R+1,C) + BLACK(R,C+1) ) / 5.0
ENDDO
ENDDO
* Copy back - Normally we would do a red and black version of the loop
DO C=1,MYLEN
DO R=1,ROWS
BLACK(R,C) = RED(R,C)
ENDDO
ENDDO
ENDDO
Now we find the center cell and send to the master process (if necessary) so it can be printed out. We also dump out the data into files for debugging or later visualization of the results. Each file is made unique by appending the instance number to the filename. Then the program terminates:
CALL SENDCELL(RED,ROWS,COLS,OFFSET,MYLEN,INUM,TIDS(0),
+ ROWS/2,TOTCOLS/2)
* Dump out data for verification
IF ( ROWS .LE. 20 ) THEN
FNAME = ’/tmp/pheatout.’ // CHAR(ICHAR(’0’)+INUM)
OPEN(UNIT=9,NAME=FNAME,FORM=’formatted’)
DO C=1,MYLEN
WRITE(9,100)(BLACK(R,C),R=1,ROWS)
100 FORMAT(20F12.6)
ENDDO
CLOSE(UNIT=9)
ENDIF
* Lets all go together
CALL PVMFBARRIER( ’pheat’, NPROC, INFO )
CALL PVMFEXIT( INFO )
END
The SENDCELL routine finds a particular cell and prints it out on the master process. This routine is called in an SPMD style: all the processes enter this routine although all not at precisely the same time. Depending on the INUM and the cell that we are looking for, each process may do something different.
If the cell in question is in the master process, and we are the master process, print it out. All other processes do nothing. If the cell in question is stored in another process, the process with the cell sends it to the master processes. The master process receives the value and prints it out. All the other processes do nothing.
This is a simple example of the typical style of SPMD code. All the processes execute the code at roughly the same time, but, based on information local to each process, the actions performed by different processes may be quite different:
SUBROUTINE SENDCELL(RED,ROWS,COLS,OFFSET,MYLEN,INUM,PTID,R,C)
INCLUDE ’../include/fpvm3.h’
INTEGER ROWS,COLS,OFFSET,MYLEN,INUM,PTID,R,C
REAL*8 RED(0:ROWS+1,0:COLS+1)
REAL*8 CENTER
* Compute local row number to determine if it is ours
I = C - OFFSET
IF ( I .GE. 1 .AND. I.LE. MYLEN ) THEN
IF ( INUM .EQ. 0 ) THEN
PRINT *,’Master has’, RED(R,I), R, C, I
ELSE
CALL PVMFINITSEND(PVMDEFAULT,TRUE)
CALL PVMFPACK( REAL8, RED(R,I), 1, 1, INFO )
PRINT *, ’INUM:’,INUM,’ Returning’,R,C,RED(R,I),I
CALL PVMFSEND( PTID, 3, INFO )
ENDIF
ELSE
IF ( INUM .EQ. 0 ) THEN
CALL PVMFRECV( -1 , 3, BUFID )
CALL PVMFUNPACK ( REAL8, CENTER, 1, 1, INFO)
PRINT *, ’Master Received’,R,C,CENTER
ENDIF
ENDIF
RETURN
END
Like the previous routine, the STORE routine is executed on all processes. The idea is to store a value into a global row and column position. First, we must determine if the cell is even in our process. If the cell is in our process, we must compute the local column (I) in our subset of the overall matrix and then store the value:
SUBROUTINE STORE(RED,ROWS,COLS,OFFSET,MYLEN,R,C,VALUE,INUM)
REAL*8 RED(0:ROWS+1,0:COLS+1)
REAL VALUE
INTEGER ROWS,COLS,OFFSET,MYLEN,R,C,I,INUM
I = C - OFFSET
IF ( I .LT. 1 .OR. I .GT. MYLEN ) RETURN
RED(R,I) = VALUE
RETURN
END
When this program executes, it has the following output:
% pheat
INUM: 0 Local 1 50 Global 1 50
Master Received 100 100 3.4722390023541D-07
%
We see two lines of print. The first line indicates the values that Process 0 used in its geometry computation. The second line is the output from the master process of the temperature at cell (100,100) after 200 time steps.
One interesting technique that is useful for debugging this type of program is to change the number of processes that are created. If the program is not quite moving its data properly, you usually get different results when different numbers of processes are used. If you look closely, the above code performs correctly with one process or 30 processes.
Notice that there is no barrier operation at the end of each time step. This is in contrast to the way parallel loops operate on shared uniform memory multiprocessors that force a barrier at the end of each loop. Because we have used an “owner computes” rule, and nothing is computed until all the required ghost data is received, there is no need for a barrier. The receipt of the messages with the proper ghost values allows a process to begin computing immediately without regard to what the other processes are currently doing.
This example can be used either as a framework for developing other grid-based computations, or as a good excuse to use HPF and appreciate the hard work that the HPF compiler developers have done. A well-done HPF implementation of this simulation should outperform the PVM implementation because HPF can make tighter optimizations. Unlike us, the HPF compiler doesn’t have to keep its generated code readable.
"The purpose of Chuck Severence's book, High Performance Computing has always been to teach new programmers and scientists about the basics of High Performance Computing. This book is for learners […]"