Summary: This module introduces the interactive optimizing compiler combination VPO/VISTA. VPO (Very Portable Optimizer) is a retargetable compiler developed at the University of Virginia. VISTA (VPO Interactive System for Tuning Applications), developed at Florida State University, is a GUI that allows a user to interactively select optimization steps with VPO feedback. Instructions are given on how to apply VPO/VISTA optimization to three routines provided in the power spectral density (PSD) estimator of the previous lab. Students are then asked to optimize these routines twice: once using VISTA and once using any method desired.
In this lab you will apply an interactive optimizing compiler to the IIR filtering and autocorrelation routines of the previous lab. You will also optimize the pseudo-noise (PN) generation, IIR filtering, and autocorrelation routines, using any technique desired, to get code that executes as fast as possible.
Most code optimizations are done on an intermediate representation of the source code or at the assembly level (post-pass optimization). Read the Wikipedia [link] article on compiler optimization [link] and learn about optimization problems, types, factors, and techniques. Remember, no compiler optimization can effectively overcome a poor algorithm design or implementation.
Since compilers group transformations into phases, the user cannot choose exactly which transformations are performed nor can the user specify the order in which they are performed. Furthermore, the user cannot see the results of individual transformations. A particularly unfortunate case is when only one transformation in a particular phase causes buggy code to be produced. Designing compilers to overcome some of these limitations is an active area of research. One such optimizer and compiler that is being developed is VPO/VISTA.
VPO is a compiler-independent and language-independent optimizer. [1]
"VPO employs a paradigm of compilation that has proven to be flexible and adaptable - all code improving transformations are performed on a single target-specific representation of the program, called RTLs [link] (Register Transfer Lists). RTLs are a low-level, machine and language independent representation that encodes machine-specific instructions. The comprehensive use of RTLs in VPO has several important consequences. One advantage of using RTLs as the sole intermediate representation is that the synthesis phases of the compiler can be invoked in any order and repeatedly if necessary. This largely eliminates code inefficiencies often caused by the ordering of the phases. In contrast, a more conventional compiler system will perform optimizations on various different representations."[2]VPO is also easily retargeted to new architectures and it can be extended to accomodate new architectural features.
VISTA is a front end to VPO that enables the user to optimize the code interactively. One view displays assembly or RTL instructions and another view takes input on which optimizations to perform. Each optimization phase and transformation can be stepped through to see which which instructions are modified.
Now we will apply the interactive research compiler, VPO/VISTA, to the
autocorrelation routine. VPO runs on Solaris, while VISTA runs in Java.
We will run Java from a local windows machine since Code Composer runs
on Windows. After
compilation into an internal representation is completed, Java is used
to start a "code server" on a Solaris machine over a network port, and a
Java client is run locally to interactively optimize the code. The port
over which the connection is made changes each time the optimizer is run.
Your TA
will give you instructions on which machine to connect to and your account
information. You will transfer iirfilt.c,
autocorr.c, and required header files to the remote
machine. You will then compile autocorr.c and apply a
a set of VISTA optimization sequences to the code. Next, you will
then transfer autocorr.vpo.asm to the local machine and
assemble it with the rest of PSD estimation application. Once you
understand the VISTA optimization process and can interpret the results,
you will use VISTA to fully optimize iirfilt.c.
First, change the value of M in your local copy of
lab4b.h to 23. In Windows, open a
"WinSCP" file browser
by clicking Start->All Programs->WinSCP. Log in to the
remote Solaris machine, make a working directory on the remote
machine (e.g. lab5),
and transfer your local copies of
iirfilt.c, autocorr.c,
lab4b.h, and intrinsics.h
to the remote machine.
Now open a secure terminal by selecting
Start->All Programs->Putty. Login in and change the
working directory to the one you placed your files in.
Type vpocompile autocorr.c to compile the file
into autocorr.cex. This new file contains RTLs
that can be optimized by VISTA.
If VISTA has been previously run on this code and you do not want
any information from the last VISTA session kept, delete the files
named autocorr.trans and autocorr.transout
by typing rm autocorr.trans* in the SSH terminal.
Now the RTL file must be served
from the remote machine to the local Windows machine by typing
vposerver autocorr.cex in the SSH terminal. Note that
the exact command to run VISTA (see next step) is printed on the
screen, and it can be copied into the local command prompt.
Run VISTA on the local Windows machine by typing
vista_viewer port_number
hostname
at a command prompt. The port number and hostname is given in the
output of the vposerver command.
When VISTA is run a JAVA GUI will be opened. The right half of the GUI contains the assembly instructions, represented in the VPO-internal register-transfer-list (RTL) format. The left half contains the list of phases. Each phase contains multiple transformations; the "CD-player-like" buttons allow the user to move to the beginning and end of the entire phase list ( |< and >| ), move backward and forward one phase ( << and >> ), and move backward and forward one transformation ( < and > ). The buttons directly below the phase list allow phases to be added to the list.
![]() |
First, select Ctl Flow(square) from the drop-down box.
Can you identify what each loop/branch (depicted via arrows)
corresponds to in the source code?
![]() |
To see how the RTL instructions originally listed in the right
half of the GUI relate to assembly instructions, select
Assembly in
the drop-down box to see the corresponding assembly
instructions. Note that the accumulators and registers are labeled with
increasing numbers. These labels help indicate to the optimizer when a
register is "dead", meaning that its contents are no longer needed.
These "registers" can be mapped to physical registers by selecting
Setup Trans Sequence, Register Assignment, and
Done.
Note that after Register Assignment
is selected, Eval Order Deter and
Global Inst Select are no longer available and others
are made available. After
Done is selected, you will then see familiar registers
like A, B, AR1, etc. Next to
"Register Assignment" in the left half of the window are the number
of transformations and the code size in percent of the original code.
Now step through the individual register assigments.
The > button will step through each
transformation in two steps: pressing it once will highlight in yellow
the instructions that may be modified (before-transformation state), and
pressing it again will highlight the new instructions that replace the
previous ones in light red (after-transformation state). Click on the
<< button to move backwards through the phases list, then step
through some transformations using > and observe
the results. Stepping through transformations in
Inst Selection is especially informative.
Note the code size, then go back and try the disabled
optimizations. First, click |< to go back to the beginning of
optimization,
then select Option->
Discard changes after current state to remove
Register Assignment. Now enter these optimizations:
Eval Order Deter, Global Inst Select,
and Register Assignment. Note the improvement in code
size. What happens if Global Inst Select is instead
selected first?
![]() |
Now remove all optimizations and enter the optimizations shown in
figure 3. Note that Fix Entry Exit
and Inst Scheduling are required before assembly code can
be written. If Fix Entry Exit is left out, no code is
generated to preserve and restore register states at the beginning and
end of the function. Inst Scheduling orders the instructions
to avoid pipeline conflicts. If this optimization is left out,
nops are entered after each instruction. Note that even
though Inst Scheduling increases the reported size of
the code, if it is left out the assembly code that VISTA writes to file
will be even larger.
![]() |
To finish the VISTA session and use the optimized code, click the
Exit button in the
lower portion of the GUI to generate a vpo.asm file that
can be linked with the program. If you click on the "X" in
the right side of the title bar, a vpo.asm file
will not be generated.
Now the optimized file is ready to be linked to the rest of the PSD
estimator. Using the SSH file browser, transfer
autocorr.vpo.asm to the directory containing the PSD
files on the local Windows machine.
On the local windows machine, rename autocorr.vpo.asm
to autocorrvpo.asm since the Windows version of the TI
compiler doesn't handle multiple periods in a filename properly.
Compile and link the PSD estimator with the optimized
autocorr.c by typing
c_asm lab4bmain.c pn.c iirfilt.c autocorrvpo.asm
c_fft_given_iirc.asm
at the Windows command prompt.
Load this executable onto the DSP and measure the number of clock
cycles autocorr() consumes. Is this code correct?
Is it executing before the next buffer of N samples is filled?
How can you tell?
After inspecting the source code of the PSD estimator, you
will likely discover inefficiencies. This implementation is provided as
the "reference implementation" of the optimization process and to
define the expected input and output of the application.
The computational efficiency of your code will be judged against
this implementation.
Since a primary purpose of this lab is to learn optimization and
efficient code techniques, the optimization portion of your lab
grade will be based on the total execution
time of rand_fillbuffer, iirfilt, and
autocorr, which are contained in pn.c,
iirfilt.c, and autocorr.c, respectively.
You are not required to optimize memory use.
You will not change lab4bmain.c, and you will preserve the input
and output behavior of each function.
Note that by execution
time we mean cycle count, not the number of instructions in
your program. Remember that several of the TMS320C54xx
instructions take more than one cycle. The multicycle
instructions are primarily the multi-word instructions,
including instructions that take immediates, like
stm, and instructions using direct addressing
of memory (such as ld *(temp),A). Branch and
repeat statements also require several cycles to execute.
Most C instructions take more than one cycle. The debugger
can be used to determine the exact number of cycles used by
your code; ask your TA to demonstrate. However, since the
number of execution cycles used by an instruction is usually
determined by the number of words in its encoding, the
easiest way to estimate the number of cycles used by your
code is to count the number of instruction words in the
.lst file or the disassembly window in the
debugger.
You will create an optimized version of this code using any technique
desired, and you will create an optimized version of iirfilt
with VISTA.
Optimize iirfilt in iirfilt.c.
If VISTA is
applied to other functions, note that a separate VISTA session must
be run for each file. Also note that the reference implementation of
pn.c contains two functions; after the first function has
been optimized, and
Fix Entry Exit and Inst Scheduling have been
selected, select Option->
Proceed to next function to optimize the second function.
Descriptions of many of the optimizations can be found in Appendix A. Some general guidelines for optimization:
Register Assignment before selecting
Register Assignment Inst Selection,
Dead Variable Elim, and
Common Subexpr Elim Deopt Reg Allocation
can be used to open up new optimization possibilities for future
phases if changes then iirfilt to 25%
(or less) of the original before
Fix Entry Exit and Inst Scheduling
are selected.
.vpo.asm file for correctness,
do not link any other VISTA-generated files in the executable.
For example, if you are testing autocorr.vpo.asm,
pass pn.c and iirfilt.c through the TI
compiler. Another reason the functions may not work properly
is because it cannot keep up with the samples coming in. For
some of the provided functions, any optimized version that is
above 40% the original size is almost surely too large to
execute quickly enough.
Fix Entry Exit and Inst Scheduling
are selected, a .vpo.asm file is produced. To run
VISTA again on the same function requires
rm <filename>.trans* to be executed before
vposerver is run again.
start writing in to start
recording to a local file with the name that has been entered
into the edit box next
to this button. When you have completed the transformations
that you want to record, select the record button again to stop
recording. When you want to execute transformations from
a file, enter the filename into the edit box and select
execute from file.
Recording transformations can save a
significant amount of time when you want to
try variations of the same transformation sequence.
If you spot inefficiencies with a function's C code before you
have tried optimizing it VISTA, or
if you are not satisfied with your best
reference-implementation optimization using VISTA, you
may change the C code and re-optimize with VISTA.
Recall that optimizing
in C is a step before optimizing with hand-coded assembly.
Look carefully at iirfilt.c and included
files before optimizing.
Some improvements may be made simply by modifying the C code.
Modifications are valid as long as
the modified code takes a "live" input and produces
exactly the same output as the reference
iirfilt code.
Note that hand-optimizing the VISTA-generated assembly code
counts as "freestyle-optimized" code, not VISTA-only-optimized
code. Be sure to write down your best VISTA optimization
sequence and be prepared to show it to the TA. Restart VISTA
and check to see that the
sequence gives the same assembly code and reduction in code size
percentage.
While you are optimizing with VISTA, step through some
transformations and figure out what they do, especially
Inst Selection,
Common Subexpr Elim,
Dead Variable Elimination,
Register Allocation,
Merge Basic Blocks, and
Elim Empty Blocks.
Also, correlate the VISTA blocks and assembly instructions with the C
code. What does each block correspond to, and what part of the C code
are the assembly instructions implementing? You may find it useful to
look for key instructions like ones that perform multiplication.
For this optimization activity you can use any technique desired.
While the reference code might serve as a starting point, you should
do whatever you need to do to make your code as efficient
as possible, while operating in an equivalent manner as the given
code. Be sure to
review
the steps involved in optimization.
You may use VISTA-generated assembly code as a starting point or for
general guidance. Similarly, you can use TI-generated assembly code
or VPO-only code. To generate VPO-only code, use the command
vpocompile2 on the remote machine. This command will
run VPO with a fixed optimization sequence. A .s
assembly file will be created that can be transferred to the local
machine and assembled with the rest of the PSD estimator.
TI-generated assembly code is produced for each file that is
passed to c_asm. While you are optimizing the code in a
"free-style" manner, be sure to note what optimizing compilers can
and can't do in optimizing code.
If you are programming the PN generator in assembly, you may wish to refer to the description of assembly instructions for logical operations in Section 2-2 of the C54x Mnemonic Instruction Set reference. Initialize the shift register to one. You can debug the PN output by comparing it to the output of the MATLAB code. Be prepared, as part of your quiz, to prove to a TA that your PN generator works properly.
Your IIR filtering routine can debugged by placing an impulse followed
by zeros in autocorr_in instead of a PN sequence.
Your autocorrelation routine can be debugged by commenting out the
IIR-filtering routine and placing the maximum DC value into
autocorr_in in a similar manner as described the
IIR-debugging step. Note that each of these tips is the most useful if
the output is inspected in memory.
We will grade you based on the number of cycles used in the statements
rand_fillbuffer();, iirfilt();, and
autocorr();.
Note that some instructions, like RPT, are
non-repeatable instructions; their use may cause
unnecessary glitches in I/O. For grading simplicity, your final
code should not have modifications except in these three functions,
and M should be set to 23.
If the number of cycles used in each function is variable, the maximum
possible number of cycles will be counted for each function.
You must use the core.asm file in
v:\ece320\54x\dsplib\core.asm or the C core
file in v:\ece320\54x\dspclib\core.asm as
provided by the TAs; these files may not be
modified. We reserve the right to test your code by
modifying the inputs.
You are also required to record some of your iirfilt
VISTA results to a file so
that improvements to VISTA can be identified. Each time you
test a VISTA-generated function on the DSP,
(1) record (yes or no) whether it works, and
(2) how many cycles it takes to execute this
function. You are not required to test every sequence you generate; if
you are not happy with the code size of a sequence, keep trying until you
have one worthy of testing. Once you test the code, record the results.
When you have finished optimizing a function with VISTA, (1) write down a
rough estimate of the total time you spent optimizing the function with
VISTA (to the nearest 15 minutes or less), and (2) write down the VISTA
optimization sequence that can be used to obtain your optimized code.
Finally, record at what stage, if any, you altered the C code of the
reference implementation. Valid answers are not at all, before VISTA
was used, or after some VISTA optimization was attempted.
Record how much time you spent on the "free-style" optimization activity, as well as whether you used generated code as a starting point or "started from scratch". If you used VISTA, VPO, or TI generated code only for guidance this is to be recorded as "starting from scratch" or "None" in the free-style results table. If you modified a generated assembly file, record this as "VISTA", "VPO", or "TI" in the free-style results table.
Email your results at or before the time they are due. The address will be given to you by your TA. Enter in the subject line "[VISTA]<space>", without the quotes and a space where <space> appears. After the space you may put anything you want. Email your C code used in optimization (if it is different from the reference implementation), your free-style-optimized assembly code, and the results of your optimization process. Please attach the results as text; if you used a spreadsheet to record the results, export the file to a CSV file and attach the CSV file. Optionally, add any comments you might have about your experience with VISTA. Here are two tables listing what is required:
| VISTA Optimization | Optimized function | |
|---|---|---|
| iirfilt() | ||
| Iteration | Works? | Cycle time |
| 1 | <Yes/No> | <In cycles> |
| 2 | ||
| 3 | ||
| ... | ||
| Best VISTA | Yes | |
| Best VISTA Optimization Sequence | <Insert sequence here> | |
| Time spent using VISTA | <Time to nearest 15 min. or less> | |
| When was C code altered? | <Never, before VISTA, or between VISTA iterations> | |
| "Free-style" optimization | Optimized function | |
|---|---|---|
| iirfilt() | ||
| Time Spent Optimizing | Cycle time | |
| Best Free-style | <To 15 min. or less> | <In cycles> |
| Which generated code modified? | <None,VISTA,VPO,TI> | |
This lab is to be completed in two weeks, but the VISTA portion is to be completed in one week and the VISTA log should be sent in at that time. Grading for this lab will be a bit different from past labs:
iirfilt, due the first week.
iirfilt version
optimized with VISTA, due the first week.
| Optimization Phase | Description |
|---|---|
| branch chaining | Replaces a branch or jump target with the target of the last jump in the chain |
| eliminate empty block | Removes empty blocks from the control flow graph |
| useless jump elimination | Remove useless transfers of control like a jump to the next block in the control flow |
| dead code elimination | Remove block unreachable from the top block |
| reverse branches | Reverses a conditional branch when it branches over a jump to eliminate the jump |
| block reordering | Removes a jump by reordering basic blocks when the target of the jump has only a single predecessor |
| merge basic blocks | Merges two consecutive basic blocks when the predecessor has no transfer of control and the successor has only one predecessor |
| instruction selection | Combine instructions together when the combined effect is in a legal instruction |
| evaluation order determination | Reorder instructions within a basic block to calculate expressions that require the most registers first |
| minimize loop jumps | Remove an unconditional jump at the end of a loop or one that jumps into a loop, by replicating a portion of the loop |
| dead assignment elimination | Removes assignments where the assignment value is never used |
| register allocation | Replaces references to a variable within a specific live range with a register |
| common subexpression elimination | Eliminates fully redundant calculations |
| loop transformations | Performs loop-invariant code motion, recurrence elimination, loop strength reduction, and induction variable elimination on each loop ordered by loop nesting level. Each of these transformations can also be individually selected by the user. |
| strength reduction | Replaces an expensive instruction with one or more cheaper ones |