If it were all that simple, you wouldn’t need this book. While compilers are extremely clever, there is still a lot of ways to improve the performance of your code without sacrificing its portability. Instead of converting the whole program to C and using a thread library, you can assist the compiler by adding compiler directives to our source code.
Compiler directives are typically inserted in the form of stylized FORTRAN comments. This is done so that a nonparallelizing compiler can ignore them and just look at the FORTRAN code, sans comments. This allows to you tune your code for parallel architectures without letting it run badly on a wide range of single-processor systems.
There are two categories of parallel-processing comments:
Assertions tell the compiler certain things that you as the programmer know about the code that it might not guess by looking at the code. Through the assertions, you are attempting to assuage the compiler’s doubts about whether or not the loop is eligible for parallelization. When you use directives, you are taking full responsibility for the correct execution of the program. You are telling the compiler what to parallelize and how to do it. You take full responsibility for the output of the program. If the program produces meaningless results, you have no one to blame but yourself.
In a previous example, we compiled a program and received the following output:
E6000: f77 -O3 -autopar -loopinfo -o dep dep.f
dep.f:
"dep.f", line 6: not parallelized, call may be unsafe
"dep.f", line 8: not parallelized, unsafe dependence (a)
E6000:
An uneducated programmer who has not read this book (or has not looked at the code) might exclaim, “What unsafe dependence, I never put one of those in my code!” and quickly add a no dependencies assertion. This is the essence of an assertion. Instead of telling the compiler to simply parallelize the loop, the programmer is telling the compiler that its conclusion that there is a dependence is incorrect. Usually the net result is that the compiler does indeed parallelize the loop.
We will briefly review the types of assertions that are typically supported by these compilers. An assertion is generally added to the code using a stylized comment.
A no dependencies or ignore dependencies directive tells the compiler that references don’t overlap. That is, it tells the compiler to generate code that may execute incorrectly if there are dependencies. You’re saying, “I know what I’m doing; it’s OK to overlap references.” A no dependencies directive might help the following loop:
DO I=1,N
A(I) = A(I+K) * B(I)
ENDDO
If you know that k is greater than -1 or less than -n, you can get the compiler to parallelize the loop:
C$ASSERT NO_DEPENDENCIES
DO I=1,N
A(I) = A(I+K) * B(I)
ENDDO
Of course, blindly telling the compiler that there are no dependencies is a prescription for disaster. If k equals -1, the example above becomes a recursive loop.
You will often see loops that contain some potential dependencies, making them bad candidates for a no dependencies directive. However, you may be able to supply some local facts about certain variables. This allows partial parallelization without compromising the results. In the code below, there are two potential dependencies because of subscripts involving k and j:
for (i=0; i<n; i++) {
a[i] = a[i+k] * b[i];
c[i] = c[i+j] * b[i];
}
Perhaps we know that there are no conflicts with references to a[i] and a[i+k]. But maybe we aren’t so sure about c[i] and c[i+j]. Therefore, we can’t say in general that there are no dependencies. However, we may be able to say something explicit about k (like “k is always greater than -1”), leaving j out of it. This information about the relationship of one expression to another is called a relation assertion. Applying a relation assertion allows the compiler to apply its optimization to the first statement in the loop, giving us partial parallelization.1
Again, if you supply inaccurate testimony that leads the compiler to make unsafe optimizations, your answer may be wrong.
As we have seen elsewhere, when elements of an array are indirectly addressed, you have to worry about whether or not some of the subscripts may be repeated. In the code below, are the values of K(I) all unique? Or are there duplicates?
DO I=1,N
A(K(I)) = A(K(I)) + B(I) * C
END DO
If you know there are no duplicates in K (i.e., that A(K(I)) is a permutation), you can inform the compiler so that iterations can execute in parallel. You supply the information using a permutation assertion.
Equivalenced arrays in FORTRAN programs provide another challenge for the compiler. If any elements of two equivalenced arrays appear in the same loop, most compilers assume that references could point to the same memory storage location and optimize very conservatively. This may be true even if it is abundantly apparent to you that there is no overlap whatsoever.
You inform the compiler that references to equivalenced arrays are safe with a no equivalences assertion. Of course, if you don’t use equivalences, this assertion has no effect.
Each loop can be characterized by an average number of iterations. Some loops are never executed or go around just a few times. Others may go around hundreds of times:
C$ASSERT TRIPCOUNT>100
DO I=L,N
A(I) = B(I) + C(I)
END DO
Your compiler is going to look at every loop as a candidate for unrolling or parallelization. It’s working in the dark, however, because it can’t tell which loops are important and tries to optimize them all. This can lead to the surprising experience of seeing your runtime go up after optimization!
A trip count assertion provides a clue to the compiler that helps it decide how much to unroll a loop or when to parallelize a loop.2 Loops that aren’t important can be identified with low or zero trip counts. Important loops have high trip counts.
If your compiler supports procedure inlining, you can use directives and command-line switches to specify how many nested levels of procedures you would like to inline, thresholds for procedure size, etc. The vendor will have chosen reasonable defaults.
Assertions also let you choose subroutines that you think are good candidates for inlining. However, subject to its thresholds, the compiler may reject your choices. Inlining could expand the code so much that increased memory activity would claim back gains made by eliminating the procedure call. At higher optimization levels, the compiler is often capable of making its own choices for inlining candidates, provided it can find the source code for the routine under consideration.
Some compilers support a feature called interprocedural analysis. When this is done, the compiler looks across routine boundaries for its data flow analysis. It can perform significant optimizations across routine boundaries, including automatic inlining, constant propagation, and others.
Without interprocedural analysis, when looking at a loop, if there is a subroutine call in the middle of the loop, the compiler has to treat the subroutine as if it will have the worst possible side effects. Also, it has to assume that there are dependencies that prevent the routine from executing simultaneously in two different threads.
Many routines (especially functions) don’t have any side effects and can execute quite nicely in separate threads because each thread has its own private call stack and local variables. If the routine is meaty, there will be a great deal of benefit in executing it in parallel.
Your computer may allow you to add a directive that tells you if successive sub-routine calls are independent:
C$ASSERT NO_SIDE_EFFECTS
DO I=1,N
CALL BIGSTUFF (A,B,C,I,J,K)
END DO
Even if the compiler has all the source code, use of common variables or equivalences may mask call independence.
At some point, you get tired of giving the compiler advice and hoping that it will reach the conclusion to parallelize your loop. At that point you move into the realm of manual parallelism. Luckily the programming model provided in FORTRAN insulates you from much of the details of exactly how multiple threads are managed at runtime. You generally control explicit parallelism by adding specially formatted comment lines to your source code. There are a wide variety of formats of these directives. In this section, we use the syntax that is part of the OpenMP (see ) standard. You generally find similar capabilities in each of the vendor compilers. The precise syntax varies slightly from vendor to vendor. (That alone is a good reason to have a standard.)
The basic programming model is that you are executing a section of code with either a single thread or multiple threads. The programmer adds a directive to summon additional threads at various points in the code. The most basic construct is called the parallel region.
In a parallel region, the threads simply appear between two statements of straight-line code. A very trivial example might be the following using the OpenMP directive syntax:
PROGRAM ONE
EXTERNAL OMP_GET_THREAD_NUM, OMP_GET_MAX_THREADS
INTEGER OMP_GET_THREAD_NUM, OMP_GET_MAX_THREADS
IGLOB = OMP_GET_MAX_THREADS()
PRINT *,’Hello There’
C$OMP PARALLEL PRIVATE(IAM), SHARED(IGLOB)
IAM = OMP_GET_THREAD_NUM()
PRINT *, ’I am ’, IAM, ’ of ’, IGLOB
C$OMP END PARALLEL
PRINT *,’All Done’
END
The C$OMP is the sentinel that indicates that this is a directive and not just another comment. The output of the program when run looks as follows:
% setenv OMP_NUM_THREADS 4
% a.out
Hello There
I am 0 of 4
I am 3 of 4
I am 1 of 4
I am 2 of 4
All Done
%
Execution begins with a single thread. As the program encounters the PARALLEL directive, the other threads are activated to join the computation. So in a sense, as execution passes the first directive, one thread becomes four. Four threads execute the two statements between the directives. As the threads are executing independently, the order in which the print statements are displayed is somewhat random. The threads wait at the END PARALLEL directive until all threads have arrived. Once all threads have completed the parallel region, a single thread continues executing the remainder of the program.
In Figure 1, the PRIVATE(IAM) indicates that the IAM variable is not shared across all the threads but instead, each thread has its own private version of the variable. The IGLOB variable is shared across all the threads. Any modification of IGLOB appears in all the other threads instantly, within the limitations of the cache coherency.
| Data interactions during a parallel region |
|---|
![]() |
During the parallel region, the programmer typically divides the work among the threads. This pattern of going from single-threaded to multithreaded execution may be repeated many times throughout the execution of an application.
Because input and output are generally not thread-safe, to be completely correct, we should indicate that the print statement in the parallel section is only to be executed on one processor at any one time. We use a directive to indicate that this section of code is a critical section. A lock or other synchronization mechanism ensures that no more than one processor is executing the statements in the critical section at any one time:
C$OMP CRITICAL
PRINT *, ’I am ’, IAM, ’ of ’, IGLOB
C$OMP END CRITICAL
Quite often the areas of the code that are most valuable to execute in parallel are loops. Consider the following loop:
DO I=1,1000000
TMP1 = ( A(I) ** 2 ) + ( B(I) ** 2 )
TMP2 = SQRT(TMP1)
B(I) = TMP2
ENDDO
To manually parallelize this loop, we insert a directive at the beginning of the loop:
C$OMP PARALLEL DO
DO I=1,1000000
TMP1 = ( A(I) ** 2 ) + ( B(I) ** 2 )
TMP2 = SQRT(TMP1)
B(I) = TMP2
ENDDO
C$OMP END PARALLEL DO
When this statement is encountered at runtime, the single thread again summons the other threads to join the computation. However, before the threads can start working on the loop, there are a few details that must be handled. The PARALLEL DO directive accepts the data classification and scoping clauses as in the parallel section directive earlier. We must indicate which variables are shared across all threads and which variables have a separate copy in each thread. It would be a disaster to have TMP1 and TMP2 shared across threads. As one thread takes the square root of TMP1, another thread would be resetting the contents of TMP1. A(I) and B(I) come from outside the loop, so they must be shared. We need to augment the directive as follows:
C$OMP PARALLEL DO SHARED(A,B) PRIVATE(I,TMP1,TMP2)
DO I=1,1000000
TMP1 = ( A(I) ** 2 ) + ( B(I) ** 2 )
TMP2 = SQRT(TMP1)
B(I) = TMP2
ENDDO
C$OMP END PARALLEL DO
The iteration variable I also must be a thread-private variable. As the different threads increment their way through their particular subset of the arrays, they don’t want to be modifying a global value for I.
There are a number of other options as to how data will be operated on across the threads. This summarizes some of the other data semantics available:
Each vendor may have different terms to indicate these data semantics, but most support all of these common semantics. Figure 2 shows how the different types of data semantics operate.
Now that we have the data environment set up for the loop, the only remaining problem that must be solved is which threads will perform which iterations. It turns out that this is not a trivial task, and a wrong choice can have a significant negative impact on our overall performance.
There are two basic techniques (along with a few variations) for dividing the iterations in a loop between threads. We can look at two extreme examples to get an idea of how this works:
C VECTOR ADD
DO IPROB=1,10000
A(IPROB) = B(IPROB) + C(IPROB)
ENDDO
C PARTICLE TRACKING
DO IPROB=1,10000
RANVAL = RAND(IPROB)
CALL ITERATE_ENERGY(RANVAL) ENDDO
ENDDO| Variables during a parallel region |
|---|
![]() |
In both loops, all the computations are independent, so if there were 10,000 processors, each processor could execute a single iteration. In the vector-add example, each iteration would be relatively short, and the execution time would be relatively constant from iteration to iteration. In the particle tracking example, each iteration chooses a random number for an initial particle position and iterates to find the minimum energy. Each iteration takes a relatively long time to complete, and there will be a wide variation of completion times from iteration to iteration.
These two examples are effectively the ends of a continuous spectrum of the iteration scheduling challenges facing the FORTRAN parallel runtime environment:
At the beginning of a parallel loop, each thread takes a fixed continuous portion of iterations of the loop based on the number of threads executing the loop.
With dynamic scheduling, each thread processes a chunk of data and when it has completed processing, a new chunk is processed. The chunk size can be varied by the programmer, but is fixed for the duration of the loop.
These two example loops can show how these iteration scheduling approaches might operate when executing with four threads. In the vector-add loop, static scheduling would distribute iterations 1–2500 to Thread 0, 2501–5000 to Thread 1, 5001–7500 to Thread 2, and 7501–10000 to Thread 3. In Figure 3, the mapping of iterations to threads is shown for the static scheduling option.
| Iteration assignment for static scheduling |
|---|
![]() |
Since the loop body (a single statement) is short with a consistent execution time, static scheduling should result in roughly the same amount of overall work (and time if you assume a dedicated CPU for each thread) assigned to each thread per loop execution.
An advantage of static scheduling may occur if the entire loop is executed repeatedly. If the same iterations are assigned to the same threads that happen to be running on the same processors, the cache might actually contain the values for A, B, and C from the previous loop execution.3 The runtime pseudo-code for static scheduling in the first loop might look as follows:
C VECTOR ADD - Static Scheduled
ISTART = (THREAD_NUMBER * 2500 ) + 1
IEND = ISTART + 2499
DO ILOCAL = ISTART,IEND
A(ILOCAL) = B(ILOCAL) + C(ILOCAL)
ENDDO
It’s not always a good strategy to use the static approach of giving a fixed number of iterations to each thread. If this is used in the second loop example, long and varying iteration times would result in poor load balancing. A better approach is to have each processor simply get the next value for IPROB each time at the top of the loop.
That approach is called dynamic scheduling, and it can adapt to widely varying iteration times. In Figure 4, the mapping of iterations to processors using dynamic scheduling is shown. As soon as a processor finishes one iteration, it processes the next available iteration in order.
| Iteration assignment in dynamic scheduling |
|---|
![]() |
If a loop is executed repeatedly, the assignment of iterations to threads may vary due to subtle timing issues that affect threads. The pseudo-code for the dynamic scheduled loop at runtime is as follows:
C PARTICLE TRACKING - Dynamic Scheduled
IPROB = 0
WHILE (IPROB <= 10000 )
BEGIN_CRITICAL_SECTION
IPROB = IPROB + 1
ILOCAL = IPROB
END_CRITICAL_SECTION
RANVAL = RAND(ILOCAL)
CALL ITERATE_ENERGY(RANVAL)
ENDWHILE
ILOCAL is used so that each thread knows which iteration is currently processing. The IPROB value is altered by the next thread executing the critical section.
While the dynamic iteration scheduling approach works well for this particular loop, there is a significant negative performance impact if the programmer were to use the wrong approach for a loop. For example, if the dynamic approach were used for the vector-add loop, the time to process the critical section to determine which iteration to process may be larger than the time to actually process the iteration. Furthermore, any cache affinity of the data would be effectively lost because of the virtually random assignment of iterations to processors.
In between these two approaches are a wide variety of techniques that operate on a chunk of iterations. In some techniques the chunk size is fixed, and in others it varies during the execution of the loop. In this approach, a chunk of iterations are grabbed each time the critical section is executed. This reduces the scheduling overhead, but can have problems in producing a balanced execution time for each processor. The runtime is modified as follows to perform the particle tracking loop example using a chunk size of 100:
IPROB = 1
CHUNKSIZE = 100
WHILE (IPROB <= 10000 )
BEGIN_CRITICAL_SECTION
ISTART = IPROB
IPROB = IPROB + CHUNKSIZE
END_CRITICAL_SECTION
DO ILOCAL = ISTART,ISTART+CHUNKSIZE-1
RANVAL = RAND(ILOCAL)
CALL ITERATE_ENERGY(RANVAL)
ENDDO
ENDWHILE
The choice of chunk size is a compromise between overhead and termination imbalance. Typically the programmer must get involved through directives in order to control chunk size.
Part of the challenge of iteration distribution is to balance the cost (or existence) of the critical section against the amount of work done per invocation of the critical section. In the ideal world, the critical section would be free, and all scheduling would be done dynamically. Parallel/vector supercomputers with hardware assistance for load balancing can nearly achieve the ideal using dynamic approaches with relatively small chunk size.
Because the choice of loop iteration approach is so important, the compiler relies on directives from the programmer to specify which approach to use. The following example shows how we can request the proper iteration scheduling for our loops:
C VECTOR ADD
C$OMP PARALLEL DO PRIVATE(IPROB) SHARED(A,B,C) SCHEDULE(STATIC)
DO IPROB=1,10000
A(IPROB) = B(IPROB) + C(IPROB)
ENDDO
C$OMP END PARALLEL DO
C PARTICLE TRACKING
C$OMP PARALLEL DO PRIVATE(IPROB,RANVAL) SCHEDULE(DYNAMIC)
DO IPROB=1,10000
RANVAL = RAND(IPROB)
CALL ITERATE_ENERGY(RANVAL)
ENDDO
C$OMP END PARALLEL DO
"The purpose of Chuck Severence's book, High Performance Computing has always been to teach new programmers and scientists about the basics of High Performance Computing. This book is for learners […]"