National Energy Research Scientific Computing Center
  A DOE Office of Science User Facility
  at Lawrence Berkeley National Laboratory
 
Package    Platform  Version  Module     Docs
totalview  pdsf      6.3.1-0  totalview  NERSC, Vendor
totalview  seaborg   6.2-0-0  totalview  Vendor
(*) Denotes limited support

Using Totalview on Seaborg

Etnus TotalView Documentation

Etnus provides a TotalView User's Guide and a web-based Tutorial. See also LLNL's TotalView Tutorial.


Introduction

One way of trying to debug a code involves placing WRITE statements in the code, running the code, and looking for variables that have "wrong" values. This method involves a lot of tinkering with the code and recompiling. With large parallel codes this can quickly become very time-consuming and cumbersome.

An alternative debugging method involves using a debugger like totalview. A debugger allows you to:

  • compile only one time.
  • monitor all values in the code.
  • make changes while executing the code.
  • examine the core files that are produced when a program crashes.

TotalView is a full-featured, source-level, graphical debugger for applications written in C, C++, Fortran (77 and 90), assembler, and mixed source/assembler codes. It is a multiprocess, multithread debugger that supports multiple parallel programming paradigms including MPI, PVM and OpenMP.

Compiling an Example Program

In order to use the debugger, code must be compiled with the -g option. This will produce a larger executable that may run relatively slowly, so be sure to recompile without the -g option once you are ready to execute production runs.

The example program, ex1.f, can be compiled with the following command:

% mpxlf -o ex1 -g ex1.f
** ex1   === End of Compilation 1 ===
1501-510  Compilation successful for file ex1.f.

The source code for this example program can be found by loading the training module and then looking in the $EXAMPLES/Totalview directory.

Running the program, without the debugger, produces the following output:

% poe ./ex1 -nodes 2 -procs 4 -stdoutmode ordered
 All these values should be the same:
 Processor Number : 0  Before send  = 0.6456298828
 x1(3) on Processor No. : 0  After  recv  = 0.6456298828
 x1(3) on Processor No. : 1  After  recv  = 0.0000000000E+00
 x1(3) on Processor No. : 2  After  recv  = 0.0000000000E+00
 x1(3) on Processor No. : 3  After  recv  = 0.6456298828

Processors 1 and 2 contain unexpected values in array x1.
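
With more tasks it is easier to let a script pick out the mismatches than to read them by eye. The following is a minimal sketch, assuming the output format shown above (one "Before send" line followed by one "After recv" line per task); the function name check_x1 is hypothetical and not part of any NERSC tool.

```shell
#!/bin/sh
# Hypothetical helper: read the program output on stdin and report every
# "After recv" line whose value differs from the "Before send" value.
check_x1() {
    awk -F'=' '
        /Before send/ { ref = $2 + 0; next }   # reference value printed by task 0
        /After/ {
            split($1, parts, ":")              # parts[2] begins with the task number
            task = parts[2] + 0
            val = $2 + 0
            if (val != ref) print "task " task " differs"
        }
    '
}
```

It could be used as `poe ./ex1 -nodes 2 -procs 4 -stdoutmode ordered | check_x1`, which for the run above would report tasks 1 and 2.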

Starting the Debugger

The totalview program is contained in the totalview module, so we must load that first.

% module load totalview

When running Totalview v6.2 for the first time, users of previous Totalview versions on Seaborg may experience problems debugging parallel codes. If you receive the following error message

attach_to_cluster: cluster -2: The TotalView Debugger Server in the
cluster is obsolete

you must take one of two actions:

  1. Delete the file named preferences.tvd in your $HOME/.totalview directory and then restart totalview, or
  2. Create a file named $HOME/.totalview/.tvdrc which contains these lines
    dset TV::server_launch_string {%C %R -n "%B/tvdsvr
                    -working_directory %D
                    -callback %L -set_pw %P -verbosity %V"}
    
    dset TV::visualizer_launch_string {%B/visualize}
    
    and restart totalview.

If you choose option (1) above, you will no longer be able to run the previous versions of totalview that are contained in NERSC's totalview modules.
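
Option (2) can be scripted. The sketch below is one possible way to do it, assuming a POSIX shell; the function name write_tvdrc and its optional directory argument are illustrative and not part of TotalView. It copies the two dset lines exactly as given above and refuses to overwrite an existing .tvdrc.

```shell
#!/bin/sh
# Hypothetical helper: create .tvdrc containing the two dset lines given above.
# The optional argument (default: $HOME/.totalview) is an assumption added so
# the function can be tried in a scratch directory first.
write_tvdrc() {
    dir=${1:-"$HOME/.totalview"}
    mkdir -p "$dir" || return 1
    if [ -e "$dir/.tvdrc" ]; then
        echo "$dir/.tvdrc already exists; not overwriting" >&2
        return 1
    fi
    # Quoted heredoc delimiter: the shell leaves %C, %B, etc. untouched.
    cat > "$dir/.tvdrc" <<'EOF'
dset TV::server_launch_string {%C %R -n "%B/tvdsvr
                -working_directory %D
                -callback %L -set_pw %P -verbosity %V"}

dset TV::visualizer_launch_string {%B/visualize}
EOF
}
```

Calling `write_tvdrc` with no argument writes to $HOME/.totalview/.tvdrc; restart totalview afterwards.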

Interactive parallel programs on the SP are actually executed by the poe command. So to use totalview, we start up poe under the control of totalview and pass the executable file name and poe options following the -a command-line switch for totalview, e.g.

% totalview poe -a ./ex1 -nodes 2 -procs 4 -stdoutmode ordered

This code will use two nodes with a total of four processes to run the executable ex1. The above command will open two windows.

Some aliases have been known to break the totalview shell script that actually runs the binary program, e.g. alias -x ls='ls -F'. If the program will not start, you may want to examine your dot files.
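
One way to look for such aliases is to search your shell startup files for alias definitions. The sketch below assumes the file names listed in it, which vary by shell and by site; the function name find_aliases is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list alias definitions found in common startup files.
# The file list is an assumption; adjust it for your login shell.
find_aliases() {
    dir=${1:-"$HOME"}
    for f in .profile .kshrc .cshrc .login .bashrc; do
        if [ -f "$dir/$f" ]; then
            # -n prints line numbers; /dev/null makes grep print the file name
            grep -n '^[[:space:]]*alias' "$dir/$f" /dev/null || true
        fi
    done
    return 0
}
```

Running `find_aliases` prints each match as file:line:text, so a suspect alias can be commented out and totalview retried.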

Image: AIX TotalView window

Image: poe window

Debugging the Example Program

At the top of the poe window, click on "Go". Poe and totalview will load your parallel program.

If you want to interact with totalview (set breakpoints, etc.), answer Yes in the dialog window that appears.

Another window will open with the source code displayed in it. The window is titled poe<executable name>.0, where 0 is the task number.

Image: poe window.

The small AIX Totalview window shows the status of each task.

Image: AIX Totalview window.

Setting a Breakpoint

To set a breakpoint, left-click on the line number. For example, left-click on line 18 to create a breakpoint there.

Image: poe window.

Switching Between Task Views

The breakpoint has been set on all tasks.

To see what's happening on another task, right-click in the small "AIX TotalView" window on the line corresponding to that task, then choose "Dive" from the popup menu ("Dive Anew" will open a new window). For example, if you select "Dive" on the line corresponding to task 2 in the AIX TotalView window, the "poe" window will show information pertaining to task 2. You can also step to adjacent tasks with the "P-" and "P+" buttons in the upper right corner of the "poe" window.

Image: poe window.

Advancing to Breakpoints

Go back to the window showing processor 0 and click on the "Go" button.

The program will start executing on all processors and will stop at line 18.

Image: poe window

Finding the Error

Now let's try to find the problem. Allow the program to advance through all the MPI calls and stop it right afterward by setting a breakpoint at line 52.

After setting the breakpoint, click "Go". The program will stop at line 52.

Image: poe window

Examining Variables

You examine variables by right-clicking on the variable name in the "poe" window and selecting "Dive" from the popup menu. For example, dive on the x1 variable name on line 46 of the "poe" window. A new window containing the values of x1 will open.

Now click on the "P+" button in the "poe" window. Then dive again on x1 in the "poe" window. Another new window will open showing the values of x1 on processor 1.

Similarly, we can examine the values on processor 2 and values on processor 3.

Solving the Problem

By looking at the values contained in the x1 array, we get a big clue to finding the solution. Since each processor has a number of non-zero elements that depends on its processor number, we suspect the problem is contained in one of the loops that performs the MPI_SENDs and/or MPI_RECVs.

If we first convince ourselves that line number 46 is OK, we are led to take a look at line 34. There we see that we're sending i elements of the array x1, not all im1 elements as we had intended. Once we make the change, recompile and run the program, we get the following output:

 All these values should be the same:
 Processor Number : 0  Before send  = 0.6456298828
 x1(3) on Processor No. : 0  After  recv  = 0.6456298828
 x1(3) on Processor No. : 1  After  recv  = 0.6456298828
 x1(3) on Processor No. : 2  After  recv  = 0.6456298828
 x1(3) on Processor No. : 3  After  recv  = 0.6456298828

And the code has been fixed!
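
The pattern in the original output follows directly from the message length. As a rough illustration (a model only, not the ex1.f source, which is not reproduced here), assume that task 0 sends task i a message whose count is i in the buggy version and im1 in the corrected version, and that the receiving arrays start out as zero. The value im1=5 and the function name x1_3_status are hypothetical.

```shell
#!/bin/sh
# Model of the bug, not the real ex1.f: task i receives "count" elements of x1,
# so x1(3) is filled in only when count >= 3.
# im1=5 is a hypothetical array length; the page does not give the real one.
x1_3_status() {
    mode=$1          # "buggy" sends i elements, "fixed" sends im1 elements
    im1=5
    for i in 1 2 3; do
        if [ "$mode" = buggy ]; then count=$i; else count=$im1; fi
        if [ "$count" -ge 3 ]; then
            echo "task $i: x1(3) received"
        else
            echo "task $i: x1(3) still zero"
        fi
    done
}

x1_3_status buggy
x1_3_status fixed
```

In the buggy mode only task 3 receives enough elements to fill x1(3), which matches the first run, where tasks 1 and 2 printed zero.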


Page last modified: May 17 2004 14:04:13.
Page URL: http://www.nersc.gov/nusers/resources/software/ibm/totalview/
Contact: webmaster@nersc.gov