Subroutine Profiling

Module by: Charles Severance, Kevin Dowd

Sometimes you want more detail than the overall timing of the application can provide. But you don’t have time to modify the code to insert several hundred etime calls by hand. Profiles are also very useful when you have been handed a strange 20,000-line application program and told to figure out how it works and then improve its performance.

Most compilers provide a facility to automatically insert timing calls into your code at the entry and exit of each routine at compile time. While your program runs, the entry and exit times are recorded and then dumped into a file. A separate utility summarizes the execution patterns and produces a report that shows the percentage of the time spent in each of your routines and the library routines.
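
Conceptually, the inserted instrumentation behaves as if each routine had been wrapped in timing calls by hand. Here is a minimal sketch of the idea in C (the function names, the running totals, and the use of gettimeofday are our illustration, not any particular compiler's output):


#include <sys/time.h>

/* Wall-clock time in seconds, from gettimeofday. */
static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, (void *) 0);
    return tv.tv_sec + tv.tv_usec * 1.0e-6;
}

static double work_seconds = 0.0;   /* accumulated time spent in work() */
static long   work_calls   = 0;     /* number of entries into work() */

void work(void)
{
    double entry = now();            /* what the compiler inserts at entry */
    /* ... body of the routine ... */
    work_seconds += now() - entry;   /* ... and at exit */
    work_calls++;
}

The report utility then only has to read the accumulated totals and turn them into percentages, call counts, and time per call.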

The profile gives you a sense of the shape of the execution profile. That is, you can see that 10% of the time is spent in subroutine A, 5% in subroutine B, etc. Naturally, if you add all of the routines together they should account for 100% of the overall time spent. From these percentages you can construct a picture — a profile — of how execution is distributed when the program runs. Though not representative of any particular profiling tool, the histograms in Figure 1 and Figure 2 depict these percentages, sorted from left to right, with each vertical column representing a different routine. They help illustrate different profile shapes.

Figure 1
Sharp profile — dominated by routine 1
[Histogram of % time by routine, with a dashed curve tracing the falloff: routine 1 at 65%, routine 2 at 20%, routine 3 at 10%, routine 4 at 5%, routine 5 at roughly 2%.]

A sharp profile says that most of the time is spent in one or two procedures, and if you want to improve the program’s performance you should focus your efforts on tuning those procedures. A minor optimization in a heavily executed line of code can sometimes have a great effect on the overall runtime, given the right opportunity. A flat profile,[1] on the other hand, tells you that the runtime is spread across many routines, and effort spent optimizing any one or two will have little benefit in speeding up the program. Of course, there are also programs whose execution profile falls somewhere in the middle.

Figure 2
Flat profile — no routine predominates
[Histogram of % time by routine, with a dashed curve tracing the tops: routine 1 at roughly 22%, routine 2 at 20%, routine 3 at roughly 18%, routine 4 at roughly 22%, routine 5 at roughly 18%.]

We cannot predict with absolute certainty what you are likely to find when you profile your programs, but there are some general trends. For instance, engineering and scientific codes built around matrix solutions often exhibit very sharp profiles. The runtime is dominated by the work performed in a handful of routines. To tune the code, you need to focus your efforts on those routines to make them more efficient. It may involve restructuring loops to expose parallelism, providing hints to the compiler, or rearranging memory references. In any case, the challenge is tangible; you can see the problems you have to fix.

There are limits to how much tuning one or two routines will improve your runtime, of course. An often-quoted rule of thumb is Amdahl’s Law, derived from remarks made in 1967 by Gene Amdahl, one of the designers of the IBM 360 series and founder of Amdahl Corporation. Strictly speaking, his remarks were about the performance potential of parallel computers, but people have adapted Amdahl’s Law to describe other things too. For our purposes, it goes like this: Say you have a program with two parts, one that can be optimized so that it goes infinitely fast and another that can’t be optimized at all. Even if the optimizable portion makes up 50% of the initial runtime, at best you will be able to cut the total runtime in half. That is, your runtime will eventually be dominated by the portion that can’t be optimized. This puts an upper limit on your expectations when tuning.
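
In symbols (our notation, not the authors'): if a fraction f of the runtime can be sped up by a factor s, the best overall speedup is

\[
\text{speedup} = \frac{1}{(1-f) + f/s},
\qquad
\lim_{s \to \infty} \text{speedup} = \frac{1}{1-f}
\]

With f = 0.5, even an infinitely fast optimized portion yields a speedup of only 1/(1 - 0.5) = 2: exactly the factor-of-two ceiling described above.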

Even given the finite return on effort suggested by Amdahl’s Law, tuning a program with a sharp profile can be rewarding. Programs with flat profiles are much more difficult to tune. These are often system codes, nonnumeric applications, and varieties of numerical codes without matrix solutions. It takes a global tuning approach to reduce, to any justifiable degree, the runtime of a program with a flat profile. For instance, you can sometimes optimize instruction cache usage, which is complicated because of the program’s equal distribution of activity among a large number of routines. It can also help to reduce subroutine call overhead by folding callees into callers. Occasionally, you can find a memory reference problem that is endemic to the whole program — and one that can be fixed all at once.
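
To make "folding callees into callers" concrete, here is a hypothetical before-and-after sketch in C (the routine names and the transformation are ours, invented for illustration):


/* Before: a tiny routine invoked from a hot loop. For work this
   small, the call/return overhead can rival the work itself. */
double scale(double x)
{
    return 2.0 * x + 1.0;
}

void update_before(double *a, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = scale(a[i]);
}

/* After: the callee folded into the caller by hand, eliminating
   n call/return sequences and giving the compiler one larger
   loop body to optimize. */
void update_after(double *a, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = 2.0 * a[i] + 1.0;
}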

When you look at a profile, you might find an unusually large percentage of time spent in the library routines such as log, exp, or sin. Often these functions are done in software routines rather than inline. You may be able to rewrite your code to eliminate some of these operations. Another important pattern to look for is when a routine takes far longer than you expect. Unexpected execution time may indicate you are accessing memory in a pattern that is bad for performance or that some aspect of the code cannot be optimized properly.
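
As one illustration of rewriting code to eliminate such operations: when the argument to a library routine does not change inside a loop, the call can be hoisted out and computed once. A hypothetical sketch (the function names are ours; exp is the standard math library routine):


#include <math.h>

/* Before: exp(r) is recomputed on every iteration even though
   r never changes inside the loop. */
void scale_slow(double *a, int n, double r)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = a[i] * exp(r);
}

/* After: the loop-invariant call is hoisted, so the (possibly
   software-implemented) exp routine runs once instead of n times. */
void scale_fast(double *a, int n, double r)
{
    int i;
    double er = exp(r);
    for (i = 0; i < n; i++)
        a[i] = a[i] * er;
}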

In any case, to get a profile, you need a profiler. One or two subroutine profilers come standard with the software development environments on all UNIX machines. We discuss two of them: prof and gprof. In addition, we mention a few line-by-line profilers. Subroutine profilers can give you a general overall view of where time is being spent. You probably should start with prof, if you have it (most machines do). Otherwise, use gprof. After that, you can move to a line-by-line profiler if you need to know which statements take the most time.

prof

prof is the most common of the UNIX profiling tools. In a sense, it is an extension of the compiler, linker, and object libraries, plus a few extra utilities, so it is hard to look at any one thing and say “this profiles your code.” prof works by periodically sampling the program counter as your application runs. To enable profiling, you must recompile and relink using the –p flag. For example, if your program has two modules, stuff.c and junk.c, you need to compile and link according to the following code:


% cc stuff.c -p -O -c
% cc junk.c -p -O -c
% cc stuff.o junk.o -p -o stuff

This creates a stuff binary that is ready for profiling. You don’t need to do anything special to run it. Just treat it normally by entering stuff. Because runtime statistics are being gathered, it takes a little longer than usual to execute.[2] At completion, there is a new file called mon.out in the directory where you ran it. This file contains the history of stuff in binary form, so you can’t look at it directly. Use the prof utility to read mon.out and create a profile of stuff. By default, the information is written to your screen on standard output, though you can easily redirect it to a file:


% prof stuff > stuff.prof

To explore how the prof command works, we have created the following ridiculous little application, loops.c. It contains a main routine and three subroutines for which you can predict the time distribution just by looking at the code.


main ()
{
    int l;
    for (l=0; l<1000; l++) {
        if (l == 2*(l/2)) foo();
        bar();
        baz();
    }
}

foo ()
{
    int j;
    for (j=0; j<200; j++);
}

bar ()
{
    int i;
    for (i=0; i<200; i++);
}

baz ()
{
    int k;
    for (k=0; k<300; k++);
}
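
Before running it, you can estimate the time distribution by counting inner-loop iterations (our arithmetic; foo is called on only the 500 even values of l):

\[
\texttt{baz}:\; 1000 \times 300 = 300{,}000 \qquad
\texttt{bar}:\; 1000 \times 200 = 200{,}000 \qquad
\texttt{foo}:\; 500 \times 200 = 100{,}000
\]

That is an expected split of roughly 50%, 33%, and 17%, in reasonable agreement with the measured 56.8%, 27.3%, and 15.9% shown below; call and loop overhead presumably accounts for the difference.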

Again, you need to compile and link loops with the –p flag, run the program, and then run the prof utility to extract a profile, as follows:


% cc loops.c -p -o loops
% ./loops
% prof loops > loops.prof

The following example shows what a loops.prof should look like. There are six columns.


%Time  Seconds  Cumsecs  #Calls  msec/call  Name
 56.8     0.50     0.50    1000      0.500  _baz
 27.3     0.24     0.74    1000      0.240  _bar
 15.9     0.14     0.88     500      0.28   _foo
  0.0     0.00     0.88       1      0.     _creat
  0.0     0.00     0.88       2      0.     _profil
  0.0     0.00     0.88       1      0.     _main
  0.0     0.00     0.88       3      0.     _getenv
  0.0     0.00     0.88       1      0.     _strcpy
  0.0     0.00     0.88       1      0.     _write

The columns can be described as follows:

  • %Time Percentage of CPU time consumed by this routine
  • Seconds CPU time consumed by this routine
  • Cumsecs A running total of time consumed by this and all preceding routines in the list
  • #Calls The number of times this particular routine was called
  • msec/call Time consumed by the routine divided by the number of calls, giving the average duration of each invocation; the figure is reported in milliseconds
  • Name The name of this routine

The top three routines listed are from loops.c itself. You can see an entry for the “main” routine more than halfway down the list. Depending on the vendor, the names of the routines may contain leading or trailing underscores, and there will always be some routines listed that you don’t recognize. These are contributions from the C library and possibly the FORTRAN libraries, if you are using FORTRAN. Profiling also introduces some overhead into the run, which often shows up as one or two subroutines in the prof output. In this case, the entry for _profil represents code inserted by the linker for collecting runtime profiling data.

If it was our intention to tune loops, we would consider a profile like the one above to be a fairly good sign. The lead routine takes more than half of the runtime, so at least there is a chance we could do something with it that would have a significant impact on the overall runtime. (Of course with a program as trivial as loops, there is plenty we can do, since loops does nothing.)

gprof

Just as it’s important to know how time is distributed when your program runs, it’s also valuable to be able to tell who called whom in the list of routines. Imagine, for instance, if something labeled _exp showed up high in the list in the prof output. You might say: “Hmmm, I don’t remember calling anything named exp(). I wonder where that came from.” A call tree helps you find it.

Subroutines and functions can be thought of as members of a family tree. The top of the tree, or root, is actually a routine that precedes the main routine you coded for the application. It calls your main routine, which in turn calls others, and so on, all the way down to the leaf nodes of the tree. This tree is properly known as a call graph.[3] The relationship between routines and nodes in the graph is one of parents and children. Nodes separated by more than one hop are referred to as ancestors and descendants.

Figure 3 graphically depicts the kind of call graph you might see in a small application. main is the parent or ancestor of most of the rest of the routines. G has two parents, E and C. Another routine, A, doesn’t appear to have any ancestors or descendants at all. This can happen when routines are not compiled with profiling enabled, or when they aren’t invoked with a subroutine call — such as would be the case if A were an exception handler.

The UNIX profiler that can extract this kind of information is called gprof. It replicates the abilities of prof, plus it gives a call graph profile so you can see who calls whom, and how often. The call graph profile is handy if you are trying to figure out how a piece of code works or where an unknown routine came from, or if you are looking for candidates for subroutine inlining.

To use call graph profiling you need to go through the same steps as with prof, except that a –pg flag is substituted for the –p flag.[4] Additionally, when it comes time to produce the actual profile, you use the gprof utility instead of prof. One other difference is that the name of the statistics file is gmon.out instead of mon.out:


% cc -pg stuff.c -c
% cc stuff.o -pg -o stuff
% stuff
% gprof stuff > stuff.gprof
Figure 3
Simple call graph
[Flowchart: MAIN calls B and C; B calls D and E; E and C both call G; routine A stands apart with no connections.]

The output from gprof is divided into three sections:

  • Call graph profile
  • Timing profile
  • Index

The first section textually maps out the call graph. The second section lists routines, the percentage of time devoted to each, the number of calls, etc. (similar to prof). The third section is a cross reference so that you can locate routines by number, rather than by name. This section is especially useful for large applications because routines are sorted based on the amount of time they use, and it can be difficult to locate a particular routine by scanning for its name.

Let’s invent another trivial application to illustrate how gprof works. Figure 4 shows a short piece of FORTRAN code, along with a diagram of how the routines are connected together. Subroutines A and B are both called by MAIN, and, in turn, each calls C. The following example shows a section of the output from gprof’s call graph profile:[5]

Figure 4
FORTRAN example
[Diagram: MAIN calls A and B; A and B each call C. The accompanying code:]

      program main
      call a
      call b
      end

      subroutine a
      call c
      do 10 i=1,5000000
10    continue
      end

      subroutine b
      call c
      do 10 i=1,10000000
10    continue
      end

      subroutine c
      do 10 i=1,5000000
10    continue
      end

                                  called/total      parents
index  %time    self descendants  called+self    name      index
                                  called/total      children

 ....   ....    ....
                0.00        8.08       1/1        _main [2]
[3]     99.9    0.00        8.08       1          _MAIN_ [3]
                3.23        1.62       1/1            _b_ [4]
                1.62        1.62       1/1            _a_ [5]
-----------------------------------------------
                3.23        1.62       1/1        _MAIN_ [3]
[4]     59.9    3.23        1.62       1          _b_ [4]
                1.62        0.00       1/2            _c_ [6]
-----------------------------------------------
                1.62        1.62       1/1        _MAIN_ [3]
[5]     40.0    1.62        1.62       1          _a_ [5]
                1.62        0.00       1/2            _c_ [6]
-----------------------------------------------
                1.62        0.00       1/2        _a_ [5]
                1.62        0.00       1/2        _b_ [4]
[6]     39.9    3.23        0.00       2          _c_ [6]

Sandwiched between each set of dashed lines is information describing a given routine and its relationship to parents and children. It is easy to tell which routine the block represents because the name is shifted farther to the left than the others. Parents are listed above, children below. As with prof, underscores are tacked onto the labels.[6] A description of each of the columns follows:

  • index You will notice that each routine name is associated with a number in brackets ([n]). This is a cross-reference for locating the routine elsewhere in the profile. If, for example, you were looking at the block describing _MAIN_ and wanted to know more about one of its children, say _a_, you could find it by scanning down the left side of the page for its index, [5].
  • %time The meaning of the %time field is a little different than it was for prof. In this case it describes the percentage of time spent in this routine plus the time spent in all of its children. It gives you a quick way to determine where the busiest sections of the call graph can be found.
  • self Listed in seconds, the self column has different meanings for parents, the routine in question, and its children. Starting with the middle entry — the routine itself — the self figure shows how much overall time was dedicated to the routine. In the case of _b_, for instance, this amounts to 3.23 seconds.
    Each self column entry shows the amount of time that can be attributed to calls from the parents. If you look at routine _c_, for example, you will see that it consumed a total time of 3.23 seconds. But note that it had two parents: 1.62 seconds of the time was attributable to calls from _a_, and 1.62 seconds to _b_.
    For the children, the self figure shows how much time was spent executing each child due to calls from this routine. The children may have consumed more time overall, but the only time accounted for is time attributable to calls from this routine. For example, _c_ accumulated 3.23 seconds overall, but if you look at the block describing _b_, you see _c_ listed as a child with only 1.62 seconds. That’s the total time spent executing _c_ on behalf of _b_.
  • descendants As with the self column, figures in the descendants column have different meanings for the routine, its parents, and children. For the routine itself, it shows the number of seconds spent in all of its descendants.
    For the routine’s parents, the descendants figure describes how much time spent in the routine can be traced back to calls by each parent. Looking at routine _c_ again, you can see that of its total time, 3.23 seconds, 1.62 seconds were attributable to each of its two parents, _a_ and _b_.
    For the children, the descendants column shows how much of the child’s time can be attributed to calls from this routine. The child may have accumulated more time overall, but the only time displayed is time associated with calls from this routine.
  • calls The calls column shows the number of times each routine was invoked, as well as the distribution of those calls associated with both parents and children. Starting with the routine itself, the figure in the calls column shows the total number of entries into the routine. In situations where the routine called itself, you will also see a +n immediately appended, showing that n additional calls were made recursively.
    Parent and child figures are expressed as ratios. For the parents, the ratio m/n says “of the n times the routine was called, m of those calls came from this parent.” For the child, it says “of the n times this child was called, m of those calls came from this routine.”

gprof’s Flat Profile

As we mentioned previously, gprof also produces a timing profile (also called a “flat” profile, just to confuse things) similar to the one produced by prof. A few of the fields are different from prof, and there is some extra information, so it will help if we explain it briefly. The following example shows the first few lines from a gprof flat profile for stuff. You will recognize the top three routines from the original program. The others are library functions included at link-time.


   %  cumulative     self                self     total
time      seconds  seconds    calls  ms/call   ms/call  name
39.9         3.23     3.23        2  1615.07   1615.07  _c_ [6]
39.9         6.46     3.23        1  3230.14   4845.20  _b_ [4]
20.0         8.08     1.62        1  1620.07   3235.14  _a_ [5]
 0.1         8.09     0.01        3     3.33      3.33  _ioctl [9]
 0.0         8.09     0.00       64     0.00      0.00  .rem [12]
 0.0         8.09     0.00       64     0.00      0.00  _f_clos [177]
 0.0         8.09     0.00       20     0.00      0.00  _sigblock [178]
 ...         ....     ....        .        .         .  ......

Here’s what each column means:

  • %time Again, we see a field that describes the runtime for each routine as a percentage of the overall time taken by the program. As you might expect, all the entries in this column should total 100% (nearly).
  • cumulative seconds For any given routine, the column called “cumulative seconds” tallies a running sum of the time taken by all the preceding routines plus its own time. As you scan towards the bottom, the numbers asymptotically approach the total runtime for the program.
  • self seconds Each routine’s individual contribution to the runtime.
  • calls The number of times this particular routine was called.
  • self ms/call Time spent inside the routine divided by the number of calls, giving the average duration of each invocation. The figure is presented in milliseconds.
  • total ms/call Time spent inside the routine plus its descendants, divided by the number of calls; also presented in milliseconds.
  • name The name of the routine. Notice that the cross-reference number appears here too.
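
The columns cross-check against one another. Taking _b_ as an example (our arithmetic; the 0.01 ms discrepancy is rounding):

\[
\underbrace{3230.14}_{\text{self ms/call}} + \underbrace{1615.07}_{\text{its one call into \_c\_}} \approx 4845.20 = \text{total ms/call}
\]

Each call to _b_ spends its own 3.23 seconds plus one 1.615-second call to its descendant _c_, which is exactly what the total ms/call entry reports.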

Accumulating the Results of Several gprof Runs

It is possible to accumulate statistics from multiple runs so that you can get a picture of how a program is doing with a variety of data sets. For instance, say that you wanted to profile an application — call it bar — with three different sets of input data. You could perform the runs separately, saving the gmon.out files as you go, and then combine the results into a single profile at the end:


% f77 -pg bar.f -o bar
% bar < data1.input
% mv gmon.out gmon.1
% bar < data2.input
% mv gmon.out gmon.2
% bar < data3.input
% gprof bar -s gmon.1 gmon.2 gmon.out > gprof.summary.out

In the example profile, each run along the way creates a new gmon.out file that we renamed to make room for the next one. At the end, gprof combines the information from each of the data files to produce a summary profile of bar in the file gprof.summary.out. Additionally (you don’t see it here), gprof creates a file named gmon.sum that contains the merged data from the original three data files. gmon.sum has the same format as gmon.out, so you can use it as input for other merged profiles down the road.

In form, the output from a merged profile looks exactly the same as for an individual run. There are a couple of interesting things you will note, however. For one thing, the main routine appears to have been invoked more than once — one time for each run, in fact. Furthermore, depending on the application, multiple runs tend to either smooth the contour of the profile or exaggerate its features. You can imagine how this might happen. If a single routine is consistently called while others come and go as the input data changes, it takes on increasing importance in your tuning efforts.

A Few Words About Accuracy

For processors running at 600 MHz and faster, the time between profiler samples taken at 60 Hz or 100 Hz is a veritable eternity. Furthermore, you can experience quantization errors when the sampling frequency is fixed, as is true of steady 1/100th or 1/60th of a second samples. To take an exaggerated example, assume that the timeline in Figure 5 shows alternating calls to two subroutines, BAR and FOO. The tick marks represent the sample points for profiling.

Figure 5
Quantization errors in profiling
[Timeline of alternating segments: wider segments labeled bar, narrower segments labeled foo, with tick marks at the fixed sampling points.]

BAR and FOO take turns running. In fact, BAR takes more time than FOO. But because the sampling interval closely matches the frequency at which the two subroutines alternate, we get a quantizing error: most of the samples happen to be taken while FOO is running. Therefore, the profile tells us that FOO took more CPU time than BAR.
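
A small simulation reproduces the effect. The sketch below is our construction, not from the original text: each fixed-interval sample is charged to whichever routine is "running" at that instant, with BAR occupying the first 6 ms and FOO the last 4 ms of every 10 ms cycle, and the sampler firing every 10 ms as well:


#include <stdio.h>

int main(void)
{
    int bar_hits = 0, foo_hits = 0;
    int t;

    /* Sample every 10 ms -- the same period as one BAR+FOO cycle --
       starting 7 ms into the run, which happens to fall inside FOO. */
    for (t = 7; t < 10000; t += 10) {
        if (t % 10 < 6)      /* first 6 ms of each cycle: BAR running */
            bar_hits++;
        else                 /* last 4 ms of each cycle: FOO running */
            foo_hits++;
    }

    /* BAR really uses 60% of the time, yet every sample lands in FOO. */
    printf("BAR samples: %d   FOO samples: %d\n", bar_hits, foo_hits);
    return 0;
}

Every one of the 1,000 samples is attributed to FOO, so the profile would report FOO at 100% even though BAR consumes 60% of the real time. Varying the sampling interval, or running long enough that the phases drift, is what breaks up this kind of alignment.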

We have described the tried and true UNIX subroutine profilers that have been available for years. In many cases, vendors have much better tools available for the asking or for a fee. If you are doing some serious tuning, ask your vendor representative to look into other tools for you.

Footnotes

  1. The term “flat profile” is a little overloaded. We are using it to describe a profile that shows an even distribution of time throughout the program. You will also see the label flat profile used to draw distinction from a call graph profile, as described below.
  2. Remember: code with profiling enabled takes longer to run. You should recompile and relink the whole thing without the –p flag when you have finished profiling.
  3. It doesn’t have to be a tree. Any subroutine can have more than one parent. Furthermore, recursive subroutine calls introduce cycles into the graph, in which a child calls one of its parents.
  4. On HP machines, the flag is –G.
  5. In the interest of conserving space, we clipped out the section most relevant to our discussion and included it in this example. There was a lot more to it, including calls of setup and system routines, the likes of which you will see when you run gprof.
  6. You may have noticed that there are two main routines: _MAIN_ and _main. In a FORTRAN program, _MAIN_ is the actual FORTRAN main routine. It’s called as a subroutine by _main, provided from a system library at link time. When you’re profiling C code, you won’t see _MAIN_.
