
Guidance on Selecting Mathematical Libraries on seaborg

The NERSC IBM SP system, seaborg, provides access to a large number of mathematical libraries. This document offers users some general direction on the choice of libraries, along with performance data that allows the libraries to be compared.

Please bear in mind that specific performance numbers mentioned below represent performance on a particular date, in a particular environment, and for a particular test. Values are subject to change.


The Variety of Mathematical Libraries

This discussion addresses some of the most significant of the many mathematical libraries available on the NERSC IBM SP seaborg. For a complete list of software available on seaborg, please see the NERSC software list.

Some of the vendor libraries are:

  • ESSL, the IBM single-processor Engineering and Scientific Subroutine Library. Although the library is a single-processor implementation, it can be used with MPI, with each MPI task running the library routines independently (see the example sketch below).
  • ESSL-SMP, the SMP, thread-safe version of ESSL, which can use all processors on an SP node under a single multi-threaded task.
  • PESSL, the IBM Parallel ESSL, which is quite similar to ScaLAPACK. PESSL is designed for distributed, multi-node computing, with data for a single computation distributed across many nodes. PESSL uses BLACS for communication between nodes.

These vendor-supplied and vendor-supported libraries generally provide the best performance; however, certain vendor-specific implementation details may affect code portability.
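For illustration only (this sketch is not from the original page), the fragment below shows the general pattern of calling the standard BLAS DGEMM interface, which ESSL provides, from a C program. The dgemm_ symbol name (Fortran routine with a trailing underscore) and the -lessl / -lesslsmp link flags are assumptions about the build environment and should be checked against the ESSL and compiler documentation on seaborg.

  /*
   * Minimal sketch: calling the standard BLAS DGEMM interface, which ESSL
   * provides, from C.  Link flags below are assumptions; check the ESSL
   * documentation on seaborg.
   *
   *   serial ESSL :  xlc_r mm.c -lessl
   *   ESSL-SMP    :  xlc_r mm.c -lesslsmp   (same source; the SMP library
   *                                          threads DGEMM across the node)
   */
  #include <stdio.h>
  #include <stdlib.h>

  /* Fortran BLAS prototype: every argument is passed by reference. */
  void dgemm_(const char *transa, const char *transb,
              const int *m, const int *n, const int *k,
              const double *alpha, const double *a, const int *lda,
              const double *b, const int *ldb,
              const double *beta, double *c, const int *ldc);

  int main(void)
  {
      int n = 5000;                     /* 5,000 x 5,000 dense matrices */
      double alpha = 1.0, beta = 0.0;
      double *a = malloc((size_t)n * n * sizeof *a);
      double *b = malloc((size_t)n * n * sizeof *b);
      double *c = malloc((size_t)n * n * sizeof *c);
      if (!a || !b || !c) { fprintf(stderr, "out of memory\n"); return 1; }

      for (long i = 0; i < (long)n * n; i++) { a[i] = 1.0; b[i] = 2.0; }

      /* C = alpha*A*B + beta*C, column-major storage, no transposes. */
      dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);

      printf("c[0] = %g\n", c[0]);      /* expect 2*n = 10000 */
      free(a); free(b); free(c);
      return 0;
  }

Under MPI, each task would make the same serial DGEMM call on its own local data; the distributed PESSL and ScaLAPACK routine PDGEMM, by contrast, operates on a single matrix spread across tasks, as in the multi-node results below.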

There are also a number of third-party software libraries available on seaborg which provide many of the same functions as the vendor libraries. Among them are:

  • NAG, a widely available, commercial, single-processor mathematical library from the Numerical Algorithms Group.
  • NAG-SMP, an SMP version of NAG for multi-thread computing on a single node.
  • NAG-PAR, a multi-processor, distributed version of NAG, which uses BLACS for communication.

It should be clear that this set of NAG libraries has the same basic structure as the IBM libraries: (1) a single-processor version; (2) an SMP version; and (3) a parallel version.

Other significant third-party mathematical libraries are:

  • IMSL, a widely available, commercial, single-processor mathematical library from Visual Numerics.
  • ScaLAPACK, a widely available, public-domain, distributed, multi-processor linear algebra library.

Best Performance from IBM Libraries

In tests by the NERSC consulting staff, the best performance has generally been obtained with the IBM libraries (ESSL, ESSL-SMP, and PESSL). A third-party routine may occasionally approach the vendor library's performance, but this is the exception rather than the rule.

As the system vendor, IBM is uniquely positioned to tune its mathematical libraries for the architecture, and it does so effectively.


The Cost of Portability Using Third-Party Libraries

If user codes are designed to run on a variety of platforms in addition to the NERSC IBM SP seaborg, it may be necessary to consider portability and the use of a common, third-party software library.

Tests by NERSC consulting staff and users indicate that third-party library routines show a wide range of performance relative to the vendor libraries, and the variation depends on the particular routine chosen rather than on the library as a whole. It is not uncommon, however, for third-party library routines to be ten to one hundred times slower than equivalent IBM library routines.

Users should carefully evaluate the performance impact of using third-party mathematical library routines. If portability considerations strongly favor a third-party library, it may still be worth studying the code to see whether one or two significant library routines could be replaced with their IBM equivalents; this can produce a large overall performance increase with only a minimal impact on portability. In that case the code would use both the third-party and the IBM libraries when ported to seaborg, as sketched below.
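As a hypothetical sketch of this mixed-library approach (the wrapper name my_matmul and the USE_IBM_ESSL macro are illustrative, not part of the original page), a performance-critical multiply can be hidden behind a thin wrapper so that the portable build is unchanged while the seaborg build calls the IBM routine. On other platforms, the portable path would call whatever third-party routine the code already uses (a plain triple loop stands in for it here).

  /* Thin wrapper isolating one performance-critical routine. */
  void dgemm_(const char *transa, const char *transb,
              const int *m, const int *n, const int *k,
              const double *alpha, const double *a, const int *lda,
              const double *b, const int *ldb,
              const double *beta, double *c, const int *ldc);

  /* C = A * B for square n-by-n column-major matrices. */
  void my_matmul(int n, const double *a, const double *b, double *c)
  {
  #ifdef USE_IBM_ESSL
      /* On seaborg: the BLAS DGEMM interface supplied by ESSL / ESSL-SMP. */
      double alpha = 1.0, beta = 0.0;
      dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);
  #else
      /* On other platforms: call the portable third-party routine here
       * (for example, the IMSL matrix-multiply routine used in the tests
       * below); a plain triple loop is shown as a placeholder. */
      for (int j = 0; j < n; j++)
          for (int i = 0; i < n; i++) {
              double s = 0.0;
              for (int kk = 0; kk < n; kk++)
                  s += a[i + (long)kk * n] * b[kk + (long)j * n];
              c[i + (long)j * n] = s;
          }
  #endif
  }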


Comparison of Matrix-Matrix Multiply Routines

The following table provides sample performance data in millions of floating point operations per second obtained using the HPMcount performance monitoring tool. Please bear in mind that these numbers are dependent on the specific hardware and software environment at the time the tests were made, and they may be subject to variability.

Matrix-Matrix Multiply of 5,000 by 5,000 dense matrix; Single Node

The following numbers are obtained by using hpmcount or poe+ on the entire code, including minimal I/O and matrix construction.

Mflip/s  Wall sec   Library
-------  --------   -------------------------------------------
 8,300       30     PESSL PDGEMM (16 processors)
 7,900       32     ScaLAPACK routine PDGEMM (16 processors)
 7,900       32     ESSL-SMP routine DGEMM (16 threads)
 7,900       32     NAG-SMP routine F01CKF (16 threads)
 1,200      213     ESSL routine DGEMM
 1,150      218     IMSL routine DMRRRR

The multitask (PESSL and ScaLAPACK) and multithread (ESSL-SMP and NAG-SMP) libraries give comparable performance. Compiling with -O3 or using the IBM MASS library had little impact on these results (results last updated June 25, 2002).

Matrix-Matrix Multiply of 20,000 by 20,000 dense matrix; Multi-Node

The following numbers are obtained by using poe+ on the entire code, including minimal I/O and matrix construction.

Mflip/s  Wall sec   Library and configuration
-------  --------   -------------------------------------------
158,900     100     ScaLAPACK PDGEMM (256 proc, 16 nodes) 
146,200     110     PESSL PDGEMM (256 proc, 16 nodes) 
105,400     150     ScaLAPACK PDGEMM (144 proc, 9 nodes, block 128) 
100,960     160     PESSL PDGEMM (144 proc, 9 nodes, block 128) 
 79,400     200     PESSL PDGEMM (144 proc, 9 nodes, block 1024) 
 74,800     214     ScaLAPACK PDGEMM (144 proc, 9 nodes, block 1024) 
 55,000     290     PESSL PDGEMM (64 proc, 4 nodes) 
 50,000     320     ScaLAPACK PDGEMM (64 proc, 4 nodes) 
 27,160     590     PESSL PDGEMM (32 proc, 2 nodes) 
 25,630     625     ScaLAPACK PDGEMM (32 proc, 2 nodes) 
 15,800   1,010     PESSL PDGEMM (16 Proc, 1 node)
 15,600   1,025     ScaLAPACK PDGEMM (16 Proc, 1 node)

Here are two plots of representative samples of the above data:

[Plot: execution times]
[Plot: speedups]

The problem size is now about 10 gigabytes for the matrices, which is too large to fit in a single process; only the multi-process, distributed routines can be used.
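(Each 20,000 by 20,000 double-precision matrix occupies 20,000 x 20,000 x 8 bytes, about 3.2 gigabytes, and the product C = A x B involves three such matrices, for roughly 9.6 gigabytes in all.)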

To provide some information on scalability, the results are rescaled below in terms of Gflip/s per node, relative wall-clock time (normalized to the fastest run), and speed-up per node (assuming a time of 1,010 wall-clock seconds for one node). All runs used a block-cyclic distribution of data on a square processor grid, except the 32-processor runs, which used an 8x4 grid. Note that for the cases run on 144 processors (9 nodes), the difference in block size changed performance by more than 30%.

Gflip/s  Relative Speed-
per node Wall sec up / N Library and configuration
-------- -------- ------ ------------------------------------------------
 9.93     1.0    0.63   ScaLAPACK PDGEMM (256 proc, 16 nodes, block 128)
 9.14     1.1    0.57   PESSL PDGEMM (256 proc, 16 nodes, block 128)
11.71     1.5    0.75   ScaLAPACK PDGEMM (144 proc, 9 nodes, block 128)
11.22     1.6    0.70   PESSL PDGEMM (144 proc, 9 nodes, block 128)
 8.82     2.0    0.56   PESSL PDGEMM (144 proc, 9 nodes, block 1024)
 8.31     2.1    0.52   ScaLAPACK PDGEMM (144 proc, 9 nodes, block 1024)
13.75     2.9    0.87   PESSL PDGEMM (64 proc, 4 nodes, block 1024)
12.50     3.2    0.79   ScaLAPACK PDGEMM (64 proc, 4 nodes, block 1024)
13.60     5.9    0.86   PESSL PDGEMM (32 proc, 2 nodes, block 128)
12.80     6.2    0.81   ScaLAPACK PDGEMM (32 proc, 2 nodes, block 128)
15.80    10.1    1.00   PESSL PDGEMM (16 Proc, 1 node, block 128)
15.60    10.3    0.99   ScaLAPACK PDGEMM (16 Proc, 1 node, block 128)

For a problem of this size, the best balance of reduced wall-clock time and good processor utilization was obtained using 9 nodes (144 processors) with a block size of 128 for the block-cyclic distribution of the 20,000 by 20,000 dense matrices.
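As a rough sketch of the setup behind these runs (not taken from the original page), the fragment below shows how a process grid, a block size of 128, and the PDGEMM call fit together when the Fortran ScaLAPACK/PESSL symbols are called from C. The 12 x 12 grid shape, the trailing-underscore name mangling, and the allocation details are assumptions and should be checked against the ScaLAPACK and PESSL documentation on seaborg.

  #include <stdlib.h>

  void Cblacs_pinfo(int *mypnum, int *nprocs);
  void Cblacs_get(int icontxt, int what, int *val);
  void Cblacs_gridinit(int *icontxt, const char *order, int nprow, int npcol);
  void Cblacs_gridinfo(int icontxt, int *nprow, int *npcol, int *myrow, int *mycol);
  void Cblacs_gridexit(int icontxt);
  void Cblacs_exit(int cont);
  int  numroc_(const int *n, const int *nb, const int *iproc,
               const int *isrcproc, const int *nprocs);
  void descinit_(int *desc, const int *m, const int *n, const int *mb,
                 const int *nb, const int *irsrc, const int *icsrc,
                 const int *ictxt, const int *lld, int *info);
  void pdgemm_(const char *transa, const char *transb,
               const int *m, const int *n, const int *k,
               const double *alpha, const double *a, const int *ia,
               const int *ja, const int *desca,
               const double *b, const int *ib, const int *jb, const int *descb,
               const double *beta, double *c, const int *ic, const int *jc,
               const int *descc);

  int main(void)
  {
      int n = 20000, nb = 128;          /* matrix order and block size   */
      int nprow = 12, npcol = 12;       /* 144 tasks laid out as 12 x 12 */
      int izero = 0, ione = 1, info, ictxt, me, ntasks, myrow, mycol;
      double alpha = 1.0, beta = 0.0;

      /* Assumes the job was started (e.g. under poe) with nprow*npcol tasks. */
      Cblacs_pinfo(&me, &ntasks);
      Cblacs_get(-1, 0, &ictxt);
      Cblacs_gridinit(&ictxt, "Row-major", nprow, npcol);
      Cblacs_gridinfo(ictxt, &nprow, &npcol, &myrow, &mycol);

      /* Local dimensions of the block-cyclically distributed n x n matrices. */
      int mloc = numroc_(&n, &nb, &myrow, &izero, &nprow);
      int nloc = numroc_(&n, &nb, &mycol, &izero, &npcol);
      int lld  = mloc > 1 ? mloc : 1;

      int desc[9];
      descinit_(desc, &n, &n, &nb, &nb, &izero, &izero, &ictxt, &lld, &info);

      double *a = calloc((size_t)mloc * nloc, sizeof *a);
      double *b = calloc((size_t)mloc * nloc, sizeof *b);
      double *c = calloc((size_t)mloc * nloc, sizeof *c);
      /* ... fill the local pieces of A and B here ... */

      /* C = A * B; all three matrices share the same distribution/descriptor. */
      pdgemm_("N", "N", &n, &n, &n, &alpha, a, &ione, &ione, desc,
              b, &ione, &ione, desc, &beta, c, &ione, &ione, desc);

      free(a); free(b); free(c);
      Cblacs_gridexit(ictxt);
      Cblacs_exit(0);
      return 0;
  }

Changing nb from 128 to 1024 in this setup corresponds to the slower 144-processor rows in the tables above.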

Matrix-Matrix Multiply of Larger Dense Matrix; Multi-Node

The following numbers are obtained by using poe+ on the entire code, including minimal I/O and matrix construction.

Gflip/s Wall sec Size    Library and configuration
------- -------- -------  -------------------------------------------
163.6   1,529   50,000  ScaLAPACK PDGEMM (256 proc, 16 nodes)
163.4   1,531   50,000  PESSL PDGEMM (256 proc, 16 nodes)
179.6  11,141  100,000  PESSL PDGEMM (256 proc, 16 nodes, 128 block)
210.7   9,495  100,000  ScaLAPACK PDGEMM (256 proc, 16 nodes, 128 block)

Comparison of Matrix Inversion Routines

The following table provides sample performance data in millions of floating point operations per second obtained using the HPMcount performance monitoring tool. Please bear in mind that these numbers are dependent on the specific hardware and software environment at the time the tests were made, and they may be subject to variability.

Matrix Inversion of 1,000 by 1,000 dense matrix; Single Node

2,582 Mflip/s   ESSL-SMP routine DGEICD (16 threads)
1,209 Mflip/s   ESSL routine DGEICD
  462 Mflip/s   IMSL routine DLINRG

Matrix Inversion of 5,000 by 5,000 dense matrix; Single Node

13,899 Mflip/s   ESSL-SMP routine DGEICD (16 threads)
 1,185 Mflip/s   ESSL routine DGEICD
   303 Mflip/s   IMSL routine DLINRG

Comparison of Matrix-Vector Solution Routines

The following table provides sample performance data in millions of floating point operations per second obtained using the HPMcount performance monitoring tool. Please bear in mind that these numbers are dependent on the specific hardware and software environment at the time the tests were made, and they may be subject to variability.

Solution of Vector of Length 1,000 with 1,000 by 1,000 dense matrix; Single Node

780 Mflip/s   ESSL routines DGEF & DGES [0.9 sec wall clock]
 90 Mflip/s   ScaLAPACK routine PDGEV (4 procs) [7.6 sec wall clock]
 85 Mflip/s   PESSL routine PDGEV (4 procs) [7.9 sec wall clock]

Here the problem size (8 megabytes) easily fits on a single node, and there is no reason to accept the additional overhead of the parallel library versions for a problem this small.

Solution of Vector of Length 10,000 with 10,000 by 10,000 dense matrix; Single Node

10,100 Mflip/s   PESSL routine PDGEV (16 procs) [66 sec wall clock]
 8,900 Mflip/s   ScaLAPACK routine PDGEV (16 procs) [75 sec wall clock]
 1,170 Mflip/s   ESSL routines DGEF & DGES [570 sec wall clock]

Here the problem size is larger (760 megabytes); although it still fits on a single node, there is a benefit to using more than a single processor on the node.

Solution of Vector of Length 50,000 with 50,000 by 50,000 dense matrix; Multi-Node

155,000 Mflip/s   PESSL routine PDGEV (256 proc, 16 nodes) [540 sec wall]
148,000 Mflip/s   ScaLAPACK routine PDGEV (256 proc, 16 nodes) [560 sec wall]

Here the problem size is too large for a single node, requiring about 20 gigabytes for the data. Thus only the parallel, distributed libraries can be used. The per-node performance is about 9,500 Mflip/s.


Conclusion

The best performance is generally obtained by using the routines from the IBM mathematical library best suited to your problem size and parallelism: single-processor (ESSL), SMP multi-threaded (ESSL-SMP), or distributed parallel (PESSL).

If portability concerns favor the use of a third-party library, carefully evaluate the performance costs; replacing even a single key routine with its IBM equivalent can yield a significant overall improvement.

If a third-party library provides a specific capability that is not available in the IBM libraries, a prudent first step is to use only that particular third-party routine and to use the IBM libraries for everything else. To a first approximation, this mixed-library approach gives the best performance while still providing the functionality that the IBM libraries lack.

