# Memory - Caches

Module by: Charles Severance, Kevin Dowd

Once we go beyond the registers in the memory hierarchy, we encounter caches. Caches are small amounts of SRAM that store a subset of the contents of the memory. The hope is that the cache will have the right subset of main memory at the right time.

The actual cache architecture has had to change as the cycle time of the processors has improved. The processors are so fast that off-chip SRAM chips are not even fast enough. This has led to a multilevel cache approach with one, or even two, levels of cache implemented as part of the processor. Table 1 shows the approximate speed of accessing the memory hierarchy on a 500-MHz DEC 21164 Alpha.

Table 1: Memory access speed on a 500-MHz DEC 21164 Alpha

| Level       | Access Time |
|-------------|-------------|
| Registers   | 2 ns        |
| L1 On-Chip  | 4 ns        |
| L2 On-Chip  | 5 ns        |
| L3 Off-Chip | 30 ns       |
| Memory      | 220 ns      |

When every reference can be found in a cache, you say that you have a 100% hit rate. Generally, a hit rate of 90% or better is considered good for a level-one (L1) cache. For a level-two (L2) cache, a hit rate above 50% is considered acceptable. Below that, application performance can drop off steeply.

One can characterize the average read performance of the memory hierarchy by examining the probability that a particular load will be satisfied at a particular level of the hierarchy. For example, assume a memory architecture with an L1 cache speed of 10 ns, L2 speed of 30 ns, and memory speed of 300 ns. If a memory reference were satisfied from L1 cache 75% of the time, L2 cache 20% of the time, and main memory 5% of the time, the average memory performance would be:

(0.75 * 10 ns) + (0.20 * 30 ns) + (0.05 * 300 ns) = 28.5 ns


You can easily see why it’s important to have an L1 cache hit rate of 90% or higher.
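As a quick sanity check, this weighted average is easy to compute directly. The sketch below (a hypothetical helper, not part of the original text) re-evaluates the example above and then tries a 90% L1 hit rate with the same latencies:

```c
#include <stdio.h>

/* Weighted average of the example latencies above: each level's
   access time is weighted by the fraction of loads it satisfies. */
static double avg_access_ns(double f_l1, double f_l2, double f_mem)
{
    const double L1_NS = 10.0, L2_NS = 30.0, MEM_NS = 300.0;
    return f_l1 * L1_NS + f_l2 * L2_NS + f_mem * MEM_NS;
}

int main(void)
{
    /* 75% L1 hits, 20% L2 hits, 5% to main memory */
    printf("%.1f ns\n", avg_access_ns(0.75, 0.20, 0.05)); /* 28.5 ns */

    /* With a 90% L1 hit rate the average drops substantially. */
    printf("%.1f ns\n", avg_access_ns(0.90, 0.08, 0.02)); /* 17.4 ns */
    return 0;
}
```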

Given that a cache holds only a subset of the main memory at any time, it’s important to keep an index of which areas of the main memory are currently stored in the cache. To reduce the amount of space that must be dedicated to tracking which memory areas are in cache, the cache is divided into a number of equal-sized slots known as lines. Each line contains some number of sequential main memory locations, generally four to sixteen integers or real numbers. Whereas the data within a line comes from the same part of memory, other lines can contain data that is far separated within your program, or perhaps data from somebody else’s program, as in Figure 1. When you ask for something from memory, the computer checks to see if the data is available within one of these cache lines. If it is, the data is returned with a minimal delay. If it’s not, your program may be delayed while a new line is fetched from main memory. Of course, if a new line is brought in, another has to be thrown out. If you’re lucky, it won’t be the one containing the data you are just about to need.
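One common scheme for keeping that index compact is a direct-mapped cache, where each address is carved into a byte offset within a line, a line index, and a tag recording which area of memory the line currently holds. The line size and line count below are assumptions for illustration, not figures from the text:

```c
#include <stdio.h>

/* Illustrative parameters: 64-byte lines, 512 lines -> a 32-KB cache. */
#define LINE_BYTES 64UL
#define NUM_LINES  512UL

int main(void)
{
    unsigned long addr = 0x12345678UL;  /* an arbitrary byte address */

    unsigned long offset = addr % LINE_BYTES;               /* byte within the line */
    unsigned long index  = (addr / LINE_BYTES) % NUM_LINES; /* which cache slot     */
    unsigned long tag    = addr / (LINE_BYTES * NUM_LINES); /* which area of memory */

    /* The cache stores one tag per line; a lookup compares the address's
       tag against the tag stored at the computed index. */
    printf("offset=%lu index=%lu tag=%lu\n", offset, index, tag);
    return 0;
}
```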

On multiprocessors (computers with several CPUs), written data must be returned to main memory so the rest of the processors can see it, or all other processors must be made aware of local cache activity. Perhaps they need to be told to invalidate old lines containing the previous value of the written variable so that they don’t accidentally use stale data. This is known as maintaining coherency between the different caches. The problem can become very complex in a multiprocessor system.[1]

Caches are effective because programs often exhibit characteristics that help keep the hit rate high. These characteristics are called spatial and temporal locality of reference; programs often make use of instructions and data that are near to other instructions and data, both in space and time. When a cache line is retrieved from main memory, it contains not only the information that caused the cache miss, but also some neighboring information. Chances are good that the next time your program needs data, it will be in the cache line just fetched or another one recently fetched.

Caches work best when a program is reading sequentially through the memory. Assume a program is reading 32-bit integers with a cache line size of 256 bits. When the program references the first word in the cache line, it waits while the cache line is loaded from main memory. Then the next seven references to memory are satisfied quickly from the cache. This is called unit stride because the address of each successive data element is incremented by one and all the data retrieved into the cache is used. The following loop is a unit-stride loop:


```fortran
      DO I=1,1000000
        SUM = SUM + A(I)
      END DO
```


When a program accesses a large data structure using “non-unit stride,” performance suffers because data is loaded into cache that is not used. For example:


```fortran
      DO I=1,1000000,8
        SUM = SUM + A(I)
      END DO
```


This code would experience the same number of cache misses as the previous loop, and the same amount of data would be loaded into the cache. However, the program needs only one of the eight 32-bit words loaded into cache. Even though this program performs one-eighth the additions of the previous loop, its elapsed time is roughly the same as the previous loop because the memory operations dominate performance.
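If you want to see this effect for yourself, a rough C translation of the two loops can be timed side by side. The array size, timing calls, and float element type are our choices here; on most cached systems the stride-8 pass takes nearly as long as the stride-1 pass despite doing one-eighth the additions:

```c
#include <stdio.h>
#include <time.h>

#define N 1000000

static float a[N];  /* 32-bit elements, as in the example above */

static double sum_stride(int stride)
{
    double sum = 0.0;
    for (int i = 0; i < N; i += stride)
        sum += a[i];
    return sum;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        a[i] = 1.0f;

    clock_t t0 = clock();
    double s1 = sum_stride(1);  /* uses every word of each cache line */
    clock_t t1 = clock();
    double s8 = sum_stride(8);  /* uses one 32-bit word per 256-bit line */
    clock_t t2 = clock();

    printf("stride 1: sum=%.0f in %ld ticks\n", s1, (long)(t1 - t0));
    printf("stride 8: sum=%.0f in %ld ticks\n", s8, (long)(t2 - t1));
    return 0;
}
```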

While this example may seem a bit contrived, there are several situations in which non-unit strides occur quite often. First, when a FORTRAN two-dimensional array is stored in memory, successive elements in the first column are stored sequentially followed by the elements of the second column. If the array is processed with the row iteration as the inner loop, it produces a unit-stride reference pattern as follows:


```fortran
      REAL*4 A(200,200)
      DO J = 1,200
        DO I = 1,200
          SUM = SUM + A(I,J)
        END DO
      END DO
```


Interestingly, a FORTRAN programmer would most likely write the loop (in alphabetical order) as follows, producing a non-unit stride of 800 bytes between successive load operations:


```fortran
      REAL*4 A(200,200)
      DO I = 1,200
        DO J = 1,200
          SUM = SUM + A(I,J)
        END DO
      END DO
```


Some compilers can detect this suboptimal loop order and interchange the loops to make the best use of the memory system. As we will see in (Reference), however, this code transformation may produce different results, and so you may have to give the compiler “permission” to interchange these loops in this particular example (or, after reading this book, you could just code it properly in the first place).
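As an aside for C programmers: C lays out two-dimensional arrays in row-major order, the opposite of FORTRAN, so in a C version of this loop nest it is the column loop that must be innermost to get unit stride. A minimal sketch:

```c
/* C stores a[i][j] row by row, so consecutive values of j are
   adjacent in memory; keeping j innermost gives unit stride. */
float a[200][200];

double sum_all(void)
{
    double sum = 0.0;
    for (int i = 0; i < 200; i++)
        for (int j = 0; j < 200; j++)  /* unit stride in C */
            sum += a[i][j];
    return sum;
}
```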

The second situation occurs when a program traverses a linked list:

```c
while ( ptr != NULL )
    ptr = ptr->next;
```


The next element that is retrieved is based on the contents of the current element. This type of loop bounces all around memory in no particular pattern. This is called pointer chasing, and there are no good ways to improve the performance of this code.
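A self-contained version of that traversal might look like the sketch below; the node layout is an assumption for illustration. The essential problem is visible in the loop body: the address of the next node comes from a load, so the processor cannot start fetching one node until the previous one has arrived from memory.

```c
#include <stddef.h>

/* An assumed node layout, for illustration only. */
struct node {
    int          value;
    struct node *next;  /* the next address is itself data in memory */
};

long count_nodes(struct node *ptr)
{
    long n = 0;
    while (ptr != NULL) {
        n++;
        ptr = ptr->next;  /* each load may miss; none can be overlapped */
    }
    return n;
}
```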

A third pattern often found in certain types of codes is called gather (or scatter) and occurs in loops such as:

```fortran
      DO I=1,N
        SUM = SUM + ARR( IND(I) )
      END DO
```

where the IND array contains offsets into the ARR array. Again, like the linked list, the exact pattern of memory references is known only at runtime, when the values stored in the IND array are known. Some special-purpose systems have hardware support to accelerate this particular operation.
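A C version of the gather might look like the following (the names are ours). Because ind[] is filled in at runtime, neither the compiler nor the cache can anticipate where the next reference into arr[] will fall:

```c
/* Gather: the index array decides the reference pattern at runtime. */
double gather_sum(const float *arr, const int *ind, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += arr[ind[i]];  /* may touch a different cache line each time */
    return sum;
}
```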

## Footnotes

[1] (Reference) describes cache coherency in more detail.
