This research led to a number of important new software tools and standards. Recognizing that a message passing standard was necessary to ensure the easy portability of the prototype libraries, the project helped in initiating and promoting the development of the MPI (Message Passing Interface). Standards specific to parallel linear algebra were also developed. The PBLAS (Parallel Basic Linear Algebra Subprograms) are message passing versions of most of the sequential BLAS routines, and have a similar interface. The BLACS (Basic Linear Algebra Communication Subprograms) are a set of routines for communicating rectangular and trapezoidal sub-matrices between processes. The BLACS and PBLAS are important building blocks of the ScaLAPACK library.

The initial public release of ScaLAPACK occurred last year, and included routines for the solution of a general system of linear equations via LU and Cholesky factorizations, orthogonal factorizations (QR, RQ, LQ, and QL), reduction to condensed form (upper Hessenberg, tridiagonal form, and bidiagonal), and the symmetric eigenproblem.

A subsequent release occurred earlier this year and increased the functionality of the package by including routines for the symmetric positive definite banded linear systems of equations, condition estimation and iterative refinement, for LU and Cholesky factorization, full-rank linear least squares problems, generalized orthogonal factorizations, reduction of a symmetric-definite generalized eigenproblem to standard form, the generalized symmetric and the nonsymmetric eigenproblem.

Since these releases, many improvements have been underway to increase the flexibility and functionality of the library. Flexibility is being increased by expanding the spectrum of PBLAS operations. For instance the ability to operate on non-aligned data not only greatly simplifies the writing of ScaLAPACK codes but also makes it possible to express divide-and-conquer algorithms in terms of calls to ScaLAPACK and PBLAS routines. The functionality of ScaLAPACK is being expanded by the introduction of new routines such as rank deficient linear least squares, generalized linear least squares, singular value decomposition, general banded routines, and tridiagonal solvers.

A number of new scalable algorithms have been developed. A new version of the dense symmetric eigenvalue routine PDSYEVX is faster in most common cases and eliminates the need to reorthogonalize in the clustered eigenvalue case. A second eigensolver was also developed that is slower than PDSYEVX, but is more reliable.

We continued our investigation of new parallel algorithms for the non-symmetric eigenvalue problem. The sign function presents an algorithm for the dense nonsymmetric eigenvalue problem. It is relatively straightforward to parallelize, compared to the conventional sequential algorithm, but does several times as many floating point operations. It is the only scalable algorithm for this problem currently available. We have explored its numerical properties, performance, algorithmic variations, and suitability for mixed parallelism which was reported in a number of publications.

The ScaLAPACK library for dense linear algebra computations is in the process of transferring some of the technology to the commercial marketplace. IBM Corporation and Cray Research have incorporated parts of ScaLAPACK into their parallel scientific subroutine libraries. NAG Inc. has also incorporated parts of ScaLAPACK into its parallel library.

The goal of the workshop was to stimulate thought, discussion, and comment on the future development of a set of standards for basic matrix data structures, both dense and sparse, as well as calling sequences for a set of low-level computational kernels for the parallel and sequential settings. These new ``standards'' are needed to complement and supplement the existing ones for sparse and parallel computation. One of the major aims of these standards will be to enable linear algebra libraries (both public domain and commercial) to interoperate efficiently and easily.

The meeting was attended by 52 people. There were eight sessions of talks with a total of 28 talks. The principle observations and conclusions reached in this workshop were as follows:

- It may be too early to establish a standard for the Parallel BLAS, but there is sufficient interest in forming a group to investigate further.
- We need to consider the facilities and functionality offered by the BLAS following the experience with using them in existing software packages.
- The audience for the BLAS has expanded to encompass application scientists in general.
- Similar meeting style as in HPF and MPI forum should be adopted.
- Lower-level routines are needed to insure high-performance on small problems.
- Interoperability between programming languages needs to be stressed.
- Extensions and alternative entry points (to allow different levels of error checking and profiling) to the BLAS seem appropriate.

Additional information, including many of the presentations for the workshop, can be found here.

As a result of this meeting and a follow up meeting held at Supercomputer '95 in San Diego it was decided to organize a series of meetings similar to the MPI Forum to continue the BLAS standardization work. We expect this activity to be on going through FY 1997. Refer to the BLAS Technical Forum homepage for further information.

A new generation of even more massively parallel computers will soon emerge. Concurrent with the development of these more powerful parallel systems is a shift in the computing practices of many scientists and researchers. Increasingly, the tendency is to use a variety of distributed computing resources, with each individual task assigned to the most appropriate architecture, rather than to use a single, monolithic machine. The pace of these two developments, the emergence of highly parallel machines and the move to a more distributed computing environment, have been so rapid that software developers have been unable to keep up. Part of the problem has been that supporting system software has inhibited this development. Consequently, exploiting the power of these technological advances has become more and more difficult. Much of the existing reusable scientific software, such as that found in commercial libraries and in public domain packages, is no longer adequate for the new architectures. If the full power of these new machines is to be realized, then scalable libraries, comparable in scope and quality to those that currently exist, must be developed. One of our goals as software designers is to communicate to the high-performance computing community algorithms and methods for the solution of system of linear equations. In the past we have provided black-box software in the form of a mathematical software library, such as LAPACK, LINPACK, NAG, and IMSL. These software libraries provide for:

- Easy interface with hidden details
- Reliability; the code should fail as rarely as possible
- Speed.

On the other hand, the high-performance computing community, which wants to solve complex, large-scale problems as quickly as possible, seems to want

- Speed
- Access to internal details to tune data structures to the application
- Algorithms that are fast for the particular application even if not reliable as general methods.

These differing priorities make for different approaches to algorithms and software. The first set of priorities leads us to produce ``black boxes" for general problem classes. The second set of priorities seems to lead us to produce ``template codes" or ``toolboxes" where the users can assemble, modify and tune building blocks starting from well-documented subparts. This leads to software which is not going to be reliable on all problems, and requires extensive user tuning to make it work. This is just what the block-box users do not want.

Our eigenvalue template design will expect that a non-expert user confronted with an eigenvalue problem would be willing to supply

- as few facts as necessary about the operator or operators whose spectral information is desired,
- the kind of spectral information desired (including accuracy), and
- the kind of computer available to solve it (perhaps).

In return, the user would want

- the software to solve the problem,
- an estimate of how long it will take to solve, and
- a way to assess the accuracy (perhaps).

In the (likely) case that no single best algorithm exists, we expect the user would want a list of reasonable alternatives and a discussion of the tradeoffs in time, space and accuracy among them. For example, it might be reasonable to use a dense algorithm on a sparse problem if high accuracy is desired, the problem is not too large, and/or the problem is to be solved just once. Much of our effort will center on educating the user as to which facts must be supplied to make a decision. To this end we have decided to categorize available methods along five (mostly independent) axes:

- Mathematical Properties of the Problem,
- Desired Spectral Information,
- Problem Representation,
- Available Transformations, and
- Costs (including dependence on accuracy, computer type).

Ideally, the user would supply a ``coordinate'' for each of these five axes, thus uniquely identifying a problem class for which we could identify a ``best algorithm,'' software implementing it, a performance model to predict its execution time, and a method to assess the accuracy.

Realistically, only a few regions of this five-dimensional space are populated with interesting or useful algorithms, and we expect to have a more decision-tree like approach to guide the user's expectations. This work will continue into FY 1997.