Using Totalview on Seaborg
Etnus TotalView Documentation
Etnus provides a TotalView User's Guide and a web-based Tutorial. See also LLNL's TotalView Tutorial.

Introduction
One way of trying to debug a code involves placing WRITE statements in the code, running the code, and looking for variables that have "wrong" values. This method involves a lot of tinkering with the code and recompiling. With large parallel codes this can quickly become very time-consuming and cumbersome.

An alternative debugging method involves using a debugger like totalview. A debugger allows you to:
TotalView is a full-featured, source-level, graphical debugger for applications written in C, C++, Fortran (77 and 90), assembler, and mixed source/assembler codes. It is a multiprocess, multithread debugger that supports multiple parallel programming paradigms including MPI, PVM and OpenMP.

Compiling an Example Program
In order to use the debugger, code must be compiled with the -g option. This will produce a larger executable that may run relatively slowly, so be sure to recompile without the -g option once you are ready to execute production runs. The example program, ex1.f, can be compiled with the following command:

    % mpxlf -o ex1 -g ex1.f
    ** ex1 === End of Compilation 1 ===
    1501-510 Compilation successful for file ex1.f.

The source code for this example program can be found by loading the training module and then looking in the $EXAMPLES/Totalview directory.

Running the program, without the debugger, produces the following output:

    % poe ./ex1 -nodes 2 -procs 4 -stdoutmode ordered
    All these values should be the same:
    Processor Number : 0 Before send = 0.6456298828
    x1(3) on Processor No. : 0 After recv = 0.6456298828
    x1(3) on Processor No. : 1 After recv = 0.0000000000E+00
    x1(3) on Processor No. : 2 After recv = 0.0000000000E+00
    x1(3) on Processor No. : 3 After recv = 0.6456298828

Processors 1 and 2 contain unexpected values in array x1.

Starting the Debugger
The totalview program is contained in the totalview module, so we must load that first.

    % module load totalview

When running Totalview v6.2 for the first time, users of previous Totalview versions on Seaborg may experience problems debugging parallel codes. If you receive the following error message

    attach_to_cluster: cluster -2: The TotalView Debugger Server in the cluster is obsolete

you must take one of two actions:
If you choose option (1) above, you will no longer be able to run the previous versions of totalview that are contained in NERSC's totalview modules.

Interactive parallel programs on the SP are actually executed by the poe command. So to use totalview, we start up poe under the control of totalview and pass the executable file name and poe options following the -a command-line switch for totalview, e.g.

    % totalview poe -a ./ex1 -nodes 2 -procs 4 -stdoutmode ordered

This command runs the executable ex1 on two nodes with a total of four processes. It opens two windows.

Some aliases have been known to break the totalview shell script which actually runs the binary program, e.g. alias -x ls='ls -F'. If the program will not start, you may want to examine your dot files.

Image: AIX TotalView window
Image: poe window

Debugging the Example Program
At the top of the poe window, click on "Go". Poe and totalview will load your parallel program. If you want to interact with totalview (set breakpoints, etc.) then answer Yes in this window. Another window will open with the source code displayed in it. The window is titled poe<executable name>.0 where 0 is the task number.

Image: poe window

The small AIX Totalview window shows the status of each task.

Image: AIX Totalview window

Setting a Breakpoint
To set a breakpoint, left-click with the mouse on the line number. Left-click on line 18 to create a breakpoint there.

Image: poe window

Switching Between Task Views
The breakpoint has been set on all tasks. To see what's happening on another task, right-click in the small "AIX TotalView" window on the line corresponding to that task, then choose "Dive" from the popup menu ("Dive Anew" will open a new window). For example, if you select "Dive" on the line corresponding to task 2 in the AIX Totalview window, the "poe" window will show information pertaining to task 2.
You can also step to adjacent tasks with the "P-" and "P+" buttons in the upper right corner of the "poe" window.

Image: poe window

Advancing to Breakpoints
Go back to the window showing processor 0 and click on the "Go" button. The program will start executing on all processors and will stop at line 18.

Image: poe window

Finding the Error
Now let's try to find the problem. Allow the program to advance through all the MPI calls and stop it right afterward by setting a breakpoint at line 52. After setting the breakpoint, click "Go". The program will stop at line 52.

Image: poe window

Examining Variables
You examine variables by right-clicking with the mouse on the variable name in the "poe" window and selecting "Dive" from the popup menu. For example, "dive" on the x1 variable name on line 46 of the "poe" window. A new window containing the values of x1 will open. Now click on the "P+" button in the "poe" window. Then dive again on x1 in the "poe" window. Another new window will open showing the values of x1 on processor 1. Similarly, we can examine the values on processor 2 and processor 3.

Solving the Problem
By looking at the values contained in the x1 array, we get a big clue to finding the solution. Since each processor has a number of non-zero elements that depends on its processor number, we suspect the problem is contained in one of the loops that performs the MPI_SENDs and/or MPI_RECVs. If we first convince ourselves that line number 46 is OK, we are led to take a look at line 34. There we see that we're sending i elements of the array x1, not all im1 elements as we had intended. Once we make the change, recompile, and run the program, we get the following output:

    All these values should be the same:
    Processor Number : 0 Before send = 0.6456298828
    x1(3) on Processor No. : 0 After recv = 0.6456298828
    x1(3) on Processor No. : 1 After recv = 0.6456298828
    x1(3) on Processor No. : 2 After recv = 0.6456298828
    x1(3) on Processor No. : 3 After recv = 0.6456298828

And the code has been fixed!
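
The fix comes down to the count argument of the send call. The sketch below is a hypothetical reconstruction for illustration only, not the original ex1.f (the real source is in $EXAMPLES/Totalview). The program name sendfix, the array length, the fill values, and the message tag are all assumptions; only the names x1, i, and im1 come from the walkthrough above. It shows the buggy form of the call as a comment next to the corrected one.

```fortran
      program sendfix
c     Hypothetical sketch, not the original ex1.f: rank 0 fills x1
c     and sends it to every other rank, which then prints x1(3).
      implicit none
      include 'mpif.h'
      integer im1
      parameter (im1 = 10)
      real x1(im1)
      integer rank, nprocs, i, ierr
      integer status(MPI_STATUS_SIZE)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

c     Every rank starts from zeros so missing data is easy to spot.
      do i = 1, im1
         x1(i) = 0.0
      end do

      if (rank .eq. 0) then
c        Arbitrary non-zero fill values (an assumption).
         do i = 1, im1
            x1(i) = 0.5 * real(i)
         end do
         do i = 1, nprocs - 1
c           Buggy form: the count is the loop index i, so rank i
c           receives only the first i elements of x1:
c              call MPI_SEND(x1, i, MPI_REAL, i, 0,
c    &                       MPI_COMM_WORLD, ierr)
c           Corrected form: send all im1 elements.
            call MPI_SEND(x1, im1, MPI_REAL, i, 0,
     &                    MPI_COMM_WORLD, ierr)
         end do
      else
c        The receive count is an upper bound, so a short message
c        is accepted and the remaining elements stay at zero.
         call MPI_RECV(x1, im1, MPI_REAL, 0, 0,
     &                 MPI_COMM_WORLD, status, ierr)
      end if

      print *, 'x1(3) on rank', rank, ' = ', x1(3)

      call MPI_FINALIZE(ierr)
      end
```

If the commented-out call were used instead, a four-process run would deliver one element to rank 1, two to rank 2, and three to rank 3, so x1(3) would remain zero on ranks 1 and 2. That is the same pattern as in the faulty output shown earlier. Assuming the sketch is saved as sendfix.f (a hypothetical file name), it should build with the same mpxlf command form used for ex1.f.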
Page last modified: May 17 2004 14:04:13. Page URL: http://www.nersc.gov/nusers/resources/software/ibm/totalview/ Contact: webmaster@nersc.gov