Understanding Parallelism - Loop-Carried Dependencies

Module by: Charles Severance, Kevin Dowd.

The notion of data dependence is particularly important when we look at loops, the hub of activity inside numerical applications. A well-designed loop can produce millions of operations that can all be performed in parallel. However, a single misplaced dependency in the loop can force it all to be run in serial. So the stakes are higher when looking for dependencies in loops.

Some constructs are completely independent, right out of the box. The question we want to ask is “Can two different iterations execute at the same time, or is there a data dependency between them?” Consider the following loop:


DO I=1,N
   A(I) = A(I) + B(I)
ENDDO

For any two values of I and K, can we calculate the value of A(I) and A(K) at the same time? Below, we have manually unrolled several iterations of the previous loop, so they can be executed together:


A(I)   = A(I)   + B(I)
A(I+1) = A(I+1) + B(I+1)
A(I+2) = A(I+2) + B(I+2)

You can see that none of the results are used as an operand for another calculation. For instance, the calculation for A(I+1) can occur at the same time as the calculation for A(I) because the calculations are independent; you don’t need the results of the first to determine the second. In fact, mixing up the order of the calculations won’t change the results in the least. Relaxing the serial order imposed on these calculations makes it possible to execute this loop very quickly on parallel hardware.
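As a concrete illustration, here is a minimal sketch of how such an independent loop might be handed to parallel hardware, assuming an OpenMP-aware Fortran compiler (the directive is our addition, not part of the original loop):

!$OMP PARALLEL DO
DO I=1,N
   A(I) = A(I) + B(I)   ! each iteration reads and writes only its own
ENDDO                   ! elements, so iterations may run in any order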

Flow Dependencies

For comparison, look at the next code fragment:


DO I=2,N
   A(I) = A(I-1) + B(I)
ENDDO

This loop has the regularity of the previous example, but one of the subscripts is changed. Again, it’s useful to manually unroll the loop and look at several iterations together:


A(I)   = A(I-1) + B(I)
A(I+1) = A(I)   + B(I+1)
A(I+2) = A(I+1) + B(I+2)

In this case, there is a dependency problem. The value of A(I+1) depends on the value of A(I), the value of A(I+2) depends on A(I+1), and so on; every iteration depends on the result of a previous one. Dependencies that extend back to a previous calculation, and perhaps to a previous iteration (like this one), are called loop-carried flow dependencies or backward dependencies. You often see such dependencies in applications that perform Gaussian elimination on certain types of matrices, or numerical solutions to systems of differential equations. However, it is impossible to run such a loop in parallel (as written); the processor must wait for intermediate results before it can proceed.

In some cases, flow dependencies are impossible to fix; calculations are so dependent upon one another that we have no choice but to wait for previous ones to complete. Other times, dependencies are a function of the way the calculations are expressed. For instance, the loop above can be changed to reduce the dependency. By replicating some of the arithmetic, we can make it so that the second and third iterations depend on the first, but not on one another. The operation count goes up — we have an extra addition that we didn’t have before — but we have reduced the dependency between iterations:


DO I=2,N,2
   A(I)   = A(I-1) + B(I)
   A(I+1) = A(I-1) + B(I) + B(I+1)
ENDDO

The speed increase on a workstation won’t be great (most machines run the recast loop more slowly). However, some parallel computers can trade off additional calculations for reduced dependency and chalk up a net win.
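The same trick extends further. Here is a hypothetical four-way version, again with no preconditioning loop, so it assumes N-1 is a multiple of four; all four statements depend only on A(I-1) and can execute simultaneously, at the cost of still more redundant additions:

DO I=2,N,4
   A(I)   = A(I-1) + B(I)
   A(I+1) = A(I-1) + B(I) + B(I+1)
   A(I+2) = A(I-1) + B(I) + B(I+1) + B(I+2)
   A(I+3) = A(I-1) + B(I) + B(I+1) + B(I+2) + B(I+3)
ENDDO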

Antidependencies

It’s a different story when there is a loop-carried antidependency, as in the code below:


DO I=1,N
   A(I) = B(I) * E
   B(I) = A(I+2) * C
ENDDO

In this loop, there is an antidependency between the use of A(I+2) in the second statement and its redefinition by the first statement two iterations later. That is, you must be sure that the instruction that uses A(I+2) does so before a later iteration redefines it. Clearly, this is not a problem if the loop is executed serially, but remember, we are looking for opportunities to overlap instructions. Again, it helps to pull the loop apart and look at several iterations together. We have recast the loop by making many copies of the first statement, followed by copies of the second:


A(I)   = B(I)   * E
A(I+1) = B(I+1) * E
A(I+2) = B(I+2) * E
...
B(I)   = A(I+2) * C  ← this assignment makes use of the new
B(I+1) = A(I+3) * C    value of A(I+2), which is incorrect
B(I+2) = A(I+4) * C

The reference to A(I+2) needs to access an “old” value, rather than one of the new ones being calculated. If you perform all of the first statement followed by all of the second statement, the answers will be wrong. If you perform all of the second statement followed by all of the first statement, the answers will also be wrong. In a sense, to run the iterations in parallel, you must either save the A values needed by the second statement or store all of the B values in a temporary area until the loop completes.
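As a sketch of the first option, we can save the needed values of A into a hypothetical work array AOLD before the loop runs (the copy loop itself has no dependencies):

DO I=1,N
   AOLD(I) = A(I+2)     ! capture the "old" values before any are overwritten
ENDDO
DO I=1,N
   A(I) = B(I) * E      ! no iteration now reads a value that another
   B(I) = AOLD(I) * C   ! iteration writes, so all can run in parallel
ENDDO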

We can also directly unroll the loop and find some parallelism:


1  A(I)   = B(I)   * E
2  B(I)   = A(I+2) * C  →
3  A(I+1) = B(I+1) * E  |
4  B(I+1) = A(I+3) * C  |  antidependency
5  A(I+2) = B(I+2) * E  ←
6  B(I+2) = A(I+4) * C

Statements 1–4 could all be executed simultaneously. Once those statements completed execution, statements 5 and 6 (along with the statements from further unrolling) could execute in parallel. Using this approach, there are sufficient intervening statements between the dependent statements that we can see some parallel performance improvement from a superscalar RISC processor.

Output Dependencies

The third class of data dependencies, output dependencies, is of particular interest to users of parallel computers, especially multiprocessors. Output dependencies involve getting the right values to the right variables when all calculations have been completed; if a later assignment to a variable is allowed to finish before an earlier one, an output dependency is violated. The loop below assigns new values to two elements of the vector A with each iteration:


DO I=1,N
   A(I)   = C(I) * 2.
   A(I+2) = D(I) + E
ENDDO

As always, we won’t have any problems if we execute the code sequentially. But if several iterations are performed together, and statements are reordered, then incorrect values can be assigned to the last elements of A. For example, in the naive vectorized equivalent below, A(I+2) takes the wrong value because the assignments occur out of order:


A(I)   = C(I)   * 2.
A(I+1) = C(I+1) * 2.
A(I+2) = C(I+2) * 2.
A(I+2) = D(I)   + E   ← output dependency violated
A(I+3) = D(I+1) + E
A(I+4) = D(I+2) + E

Whether or not you have to worry about output dependencies depends on whether you are actually parallelizing the code. Your compiler will be conscious of the danger, and will be able to generate legal code — and possibly even fast code, if it’s clever enough. But output dependencies occasionally become a problem for programmers.
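To see what “getting the right values to the right variables” means for the loop above, note that in the serial version the assignment to A(I) eventually overwrites every assignment to A(I+2) except the last two. A dependency-free equivalent, assuming N is at least 2, might therefore look like this sketch:

DO I=1,N
   A(I) = C(I) * 2.     ! these are the writes that survive
ENDDO
A(N+1) = D(N-1) + E     ! only the final two A(I+2) assignments
A(N+2) = D(N)   + E     ! are never overwritten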

Dependencies Within an Iteration

We have looked at dependencies that cross iteration boundaries but we haven’t looked at dependencies within the same iteration. Consider the following code fragment:


DO I = 1,N
   D = B(I) * 17
   A(I) = D + 14
ENDDO

When we look at the loop, the variable D has a flow dependency. The second statement cannot start until the first statement has completed. At first glance this might appear to limit parallelism significantly. When we look closer and manually unroll several iterations of the loop, the situation gets worse:


D = B(I) * 17
A(I) = D + 14
D = B(I+1) * 17
A(I+1) = D + 14
D = B(I+2) * 17
A(I+2) = D + 14

Now, the variable D has flow, output, and antidependencies. It looks like this loop has no hope of running in parallel. However, there is a simple solution to this problem at the cost of some extra memory space, using a technique called promoting a scalar to a vector. We define D as an array with N elements and rewrite the code as follows:


DO I = 1,N
   D(I) = B(I) * 17
   A(I) = D(I) + 14
ENDDO

Now the iterations are all independent and can be run in parallel. Within each iteration, the first statement must run before the second statement.
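If you are directing a parallelizing compiler explicitly, an alternative to promoting the scalar is to give each thread a private copy of D. A minimal sketch, assuming an OpenMP-aware compiler:

!$OMP PARALLEL DO PRIVATE(D)
DO I = 1,N
   D = B(I) * 17         ! each thread gets its own copy of D, so the
   A(I) = D + 14         ! flow, output, and antidependencies on D vanish
ENDDO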

Reductions

The sum of an array of numbers is one example of a reduction — so called because it reduces a vector to a scalar. The following loop to determine the total of the values in an array certainly looks as though it could be run in parallel:


SUM = 0.0
DO I=1,N
   SUM = SUM + A(I)
ENDDO

However, if we perform our unrolling trick, it doesn’t look very parallel:


SUM = SUM + A(I)
SUM = SUM + A(I+1)
SUM = SUM + A(I+2)

This loop also has all three types of dependencies and looks impossible to parallelize. If we are willing to accept the potential effect of rounding, we can add some parallelism to this loop as follows (again, we have omitted the preconditioning loop that would handle the leftover iterations when N is not a multiple of four):


SUM0 = 0.0
SUM1 = 0.0
SUM2 = 0.0
SUM3 = 0.0
DO I=1,N,4
   SUM0 = SUM0 + A(I)
   SUM1 = SUM1 + A(I+1)
   SUM2 = SUM2 + A(I+2)
   SUM3 = SUM3 + A(I+3)
ENDDO
SUM = SUM0 + SUM1 + SUM2 + SUM3

Again, this is not precisely the same computation, but all four partial sums can be computed independently. The partial sums are combined at the end of the loop.
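This transformation is common enough that directive-based compilers can apply it for you. A sketch using an OpenMP REDUCTION clause, assuming an OpenMP-aware compiler, produces the same partial-sum structure automatically:

SUM = 0.0
!$OMP PARALLEL DO REDUCTION(+:SUM)
DO I=1,N
   SUM = SUM + A(I)      ! each thread accumulates a private partial sum;
ENDDO                    ! the partial sums are combined when the loop ends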

Loops that look for the maximum or minimum elements in an array, or multiply all the elements of an array, are also reductions. Likewise, some of these can be reorganized into partial results, as with the sum, to expose more parallelism. Note that the maximum and minimum are associative operators, so the results of the reorganized loop are identical to those of the sequential loop.
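For example, a maximum reduction splits into interleaved partial results just like the sum; this sketch assumes, for simplicity, that N is even so that no preconditioning loop is needed:

AMAX1 = A(1)
AMAX2 = A(2)
DO I=3,N,2
   AMAX1 = MAX(AMAX1, A(I))     ! two independent partial maximums
   AMAX2 = MAX(AMAX2, A(I+1))
ENDDO
AMAX = MAX(AMAX1, AMAX2)        ! identical result, because MAX is associative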
