Publications Search

Solving elliptic finite element systems in near-linear time with support preconditioners

Proposed for publication in the SIAM Journal on Matrix Analysis.

We consider linear systems arising from the use of the finite element method for solving a certain class of linear elliptic problems. Our main result is that these linear systems, which are symmetric and positive semidefinite, are well approximated by symmetric diagonally dominant matrices. Our framework for defining matrix approximation is support theory. Significant graph theoretic work has already been developed in the support framework for preconditioners in the diagonally dominant case, and in particular it is known that such systems can be solved with iterative methods in nearly linear time. Thus, our approximation result implies that these graph theoretic techniques can also solve a class of finite element problems in nearly linear time. We show that the quality of our approximation, which controls the number of iterations in the preconditioned iterative solver, depends primarily on a mesh quality measure but not on the problem size or shape of the domain.

TYPE Journal Article YEAR 2005

Solving elliptic finite element systems in near-linear time with support preconditioners

Proposed for publication in the SIAM Journal on Matrix Analysis.

Boman, Erik G.; Hendrickson, Bruce A.

We consider linear systems arising from the use of the finite element method for solving scalar linear elliptic problems. Our main result is that these linear systems, which are symmetric and positive semidefinite, are well approximated by symmetric diagonally dominant matrices. Our framework for defining matrix approximation is support theory. Significant graph theoretic work has already been developed in the support framework for preconditioners in the diagonally dominant case, and in particular it is known that such systems can be solved with iterative methods in nearly linear time. Thus, our approximation result implies that these graph theoretic techniques can also solve a class of finite element problems in nearly linear time. We show that the support number bounds, which control the number of iterations in the preconditioned iterative solver, depend on mesh quality measures but not on the problem size or shape of the domain.
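For readers unfamiliar with support theory, the quantity behind the "support number bounds" can be sketched as follows. This is the standard textbook form, stated here as an assumption about notation rather than quoted from the paper: $A$ and $B$ are symmetric positive semidefinite with the same null space, $\sigma$ denotes the support number, and $\kappa$ the generalized condition number of $A$ preconditioned by $B$.

```latex
\sigma(A, B) = \min \{ \tau \in \mathbb{R} : \tau B - A \succeq 0 \},
\qquad
\kappa(A, B) \le \sigma(A, B)\, \sigma(B, A).
```

Because preconditioned conjugate gradients typically needs on the order of $\sqrt{\kappa}$ iterations, a bound on both support numbers that is independent of problem size translates into an iteration count that is independent of problem size.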

TYPE Journal Article YEAR 2004

LDRD report : parallel repartitioning for optimal solver performance

Devine, Karen; Boman, Erik G.; Heaphy, Robert T.; Hendrickson, Bruce A.; Heroux, Michael A.

We have developed infrastructure, utilities and partitioning methods to improve data partitioning in linear solvers and preconditioners. Our efforts included incorporation of data repartitioning capabilities from the Zoltan toolkit into the Trilinos solver framework (allowing dynamic repartitioning of Trilinos matrices); implementation of efficient distributed data directories and unstructured communication utilities in Zoltan and Trilinos; development of a new multi-constraint geometric partitioning algorithm (which can generate one decomposition that is good with respect to multiple criteria); and research into hypergraph partitioning algorithms (which provide up to 56% reduction of communication volume compared to graph partitioning for a number of emerging applications). This report includes descriptions of the infrastructure and algorithms developed, along with results demonstrating the effectiveness of our approaches.

TYPE Report YEAR 2004

Supercomputing and discrete algorithms : a symbiotic relationship

Hendrickson, Bruce A.; Hart, William E.; Phillips, Cynthia A.

Abstract not provided.

TYPE Conference YEAR 2003

Interprocessor communication with memory constraints

Hendrickson, Bruce A.

Many parallel applications require periodic redistribution of workloads and associated data. In a distributed memory computer, this redistribution can be difficult if limited memory is available for receiving messages. The authors propose a model for optimizing the exchange of messages under such circumstances which they call the minimum phase remapping problem. They first show that the problem is NP-Complete, and then analyze several methodologies for addressing it. First, they show how the problem can be phrased as an instance of multi-commodity flow. Next, they study a continuous approximation to the problem. They show that this continuous approximation has a solution which requires at most two more phases than the optimal discrete solution, but the question of how to consistently obtain a good discrete solution from the continuous problem remains open. Finally, they devise a simple and practical approximation algorithm for the problem with a bound of 1.5 times the optimal number of phases.

TYPE Conference YEAR 2000

Load balancing fictions, falsehoods and fallacies

Applied Mathematical Modelling

Hendrickson, Bruce A.

Effective use of a parallel computer requires that a calculation be carefully divided among the processors. This load balancing problem appears in many guises and has been a fervent area of research for the past decade or more. Although great progress has been made, and useful software tools developed, a number of challenges remain. It is the conviction of the author that these challenges will be easier to address if we first come to terms with some significant shortcomings in our current perspectives. This paper tries to identify several areas in which the prevailing point of view is either mistaken or insufficient. The goal is to motivate new ideas and directions for this important field. © 2000 Elsevier Science Inc.

TYPE Journal Article YEAR 2000

Design of dynamic load-balancing tools for parallel applications

Proceedings of the International Conference on Supercomputing

Devine, Karen; Hendrickson, Bruce A.; Boman, Erik G.; Vaughan, Courtenay T.

The design of general-purpose dynamic load-balancing tools for parallel applications is more challenging than the design of static partitioning tools. Both algorithmic and software engineering issues arise. We have addressed many of these issues in the design of the Zoltan dynamic load-balancing library. Zoltan has an object-oriented interface that makes it easy to use and provides separation between the application and the load-balancing algorithms. It contains a suite of dynamic load-balancing algorithms, including both geometric and graph-based algorithms. Its design makes it valuable both as a partitioning tool for a variety of applications and as a research test-bed for new algorithmic development. In this paper, we describe Zoltan's design and demonstrate its use in an unstructured-mesh finite element application.

TYPE Conference YEAR 2000

Scalable Parallel Crash Simulations

Hendrickson, Bruce A.

We are pleased to submit our efforts in parallelizing the PRONTO application suite for consideration in the SuParCup 99 competition. PRONTO is a finite element transient dynamics simulator which includes a smoothed particle hydrodynamics (SPH) capability; it is similar in scope to the well-known DYNA, PamCrash, and ABAQUS codes. Our efforts over the last few years have produced a fully parallel version of the entire PRONTO code which (1) runs fast and scalably on thousands of processors, (2) has performed the largest finite-element transient dynamics simulations we are aware of, and (3) includes several new parallel algorithmic ideas that have solved some difficult problems associated with contact detection and SPH scalability. We motivate this work, describe the novel algorithmic advances, give performance numbers for PRONTO running on Sandia's Intel Teraflop machine, and highlight two prototypical large-scale computations we have performed with the parallel code. We have successfully parallelized a large-scale production transient dynamics code with a novel algorithmic approach that utilizes multiple decompositions for different key segments of the computations. To be able to simulate a model of more than ten million elements in a few tenths of a second per timestep is unprecedented for solid dynamics simulations, especially when full global contact searches are required. The key reason is our new algorithmic ideas for efficiently parallelizing the contact detection stage. To our knowledge, scalability of this computation had never before been demonstrated on more than 64 processors. This has enabled parallel PRONTO to become the only solid dynamics code we are aware of that can run effectively on thousands of processors. More importantly, our parallel performance compares very favorably to the original serial PRONTO code which is optimized for vector supercomputers. On the container crush problem, a Teraflop node is as fast as a single processor of the Cray Jedi. This means that on the Teraflop machine we can now run simulations with tens of millions of elements thousands of times faster than we could on the Jedi! This is enabling transient dynamics simulations of unprecedented scale and fidelity. Not only can previous applications be run with vastly improved resolution and speed, but qualitatively new and different analyses have been made possible.

TYPE Conference YEAR 1999

Graph Partitioning Models for Parallel Computing

Parallel Computing

Hendrickson, Bruce A.

Calculations can naturally be described as graphs in which vertices represent computation and edges reflect data dependencies. By partitioning the vertices of a graph, the calculation can be divided among processors of a parallel computer. However, the standard methodology for graph partitioning minimizes the wrong metric and lacks expressibility. We survey several recently proposed alternatives and discuss their relative merits.
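One common reading of the "wrong metric" remark is that the edge cut does not count the data actually sent. The sketch below is an illustration under that assumption, not code from the paper: it compares the edge cut with a communication-volume count in which each vertex value is sent once to every remote part that needs it. The function names and the four-vertex example are hypothetical.

```python
from collections import defaultdict


def edge_cut(edges, part):
    """Count edges whose two endpoints are assigned to different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])


def communication_volume(edges, part):
    """Count (vertex, remote part) pairs.

    A vertex's value is sent once to each other part that owns at least
    one of its neighbors, no matter how many neighbors live there.
    """
    remote_parts = defaultdict(set)
    for u, v in edges:
        if part[u] != part[v]:
            remote_parts[u].add(part[v])
            remote_parts[v].add(part[u])
    return sum(len(parts) for parts in remote_parts.values())


# Hypothetical example: vertex 0 on part 0 has three neighbors on part 1.
edges = [(0, 1), (0, 2), (0, 3)]
part = {0: 0, 1: 1, 2: 1, 3: 1}

cut = edge_cut(edges, part)
volume = communication_volume(edges, part)
print(cut, volume)  # prints "3 4"
```

Counting one transfer per cut edge in each direction would predict six values exchanged, while only four are needed here, because vertex 0 is sent to part 1 once rather than three times.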

TYPE Journal Article YEAR 1999

Skewed graph partitioning

Hendrickson, Bruce A.

Graph partitioning is an important abstraction used in solving many scientific computing problems. Unfortunately, the standard partitioning model does not incorporate considerations that are important in many settings. We address this by describing a generalized partitioning model which incorporates the notion of partition skew and is applicable to a variety of problems. We then develop enhancements to several important partitioning algorithms necessary to solve the generalized partitioning problem. Finally we demonstrate the benefit of employing several of these generalized methods to static decomposition of parallel computing problems.

TYPE Conference YEAR 1997

Transient dynamics simulations: Parallel algorithms for contact detection and smoothed particle hydrodynamics

Hendrickson, Bruce A.

Transient dynamics simulations are commonly used to model phenomena such as car crashes, underwater explosions, and the response of shipping containers to high-speed impacts. Physical objects in such a simulation are typically represented by Lagrangian meshes because the meshes can move and deform with the objects as they undergo stress. Fluids (gasoline, water) or fluid-like materials (earth) in the simulation can be modeled using the techniques of smoothed particle hydrodynamics. Implementing a hybrid mesh/particle model on a massively parallel computer poses several difficult challenges. One challenge is to simultaneously parallelize and load-balance both the mesh and particle portions of the computation. A second challenge is to efficiently detect the contacts that occur within the deforming mesh and between mesh elements and particles as the simulation proceeds. These contacts impart forces to the mesh elements and particles which must be computed at each timestep to accurately capture the physics of interest. In this paper we describe new parallel algorithms for smoothed particle hydrodynamics and contact detection which turn out to have several key features in common. Additionally, we describe how to join the new algorithms with traditional parallel finite element techniques to create an integrated particle/mesh transient dynamics simulation. Our approach to this problem differs from previous work in that we use three different parallel decompositions, a static one for the finite element analysis and dynamic ones for particles and for contact detection. We have implemented our ideas in a parallel version of the transient dynamics code PRONTO-3D and present results for the code running on a large Intel Paragon.

TYPE Conference YEAR 1996

A new parallel algorithm for contact detection in finite element methods

Hendrickson, Bruce A.

In finite-element, transient dynamics simulations, physical objects are typically modeled as Lagrangian meshes because the meshes can move and deform with the objects as they undergo stress. In many simulations, such as computations of impacts or explosions, portions of the deforming mesh come in contact with each other as the simulation progresses. These contacts must be detected and the forces they impart to the mesh must be computed at each timestep to accurately capture the physics of interest. While the finite-element portion of these computations is readily parallelized, the contact detection problem is difficult to implement efficiently on parallel computers and has been a bottleneck to achieving high performance on large parallel machines. In this paper we describe a new parallel algorithm for detecting contacts. Our approach differs from previous work in that we use two different parallel decompositions, a static one for the finite element analysis and a dynamic one for contact detection. We present results for this algorithm in a parallel version of the transient dynamics code PRONTO-3D running on a large Intel Paragon.

TYPE Conference YEAR 1996

Enhancing data locality by using terminal propagation

Hendrickson, Bruce A.

Terminal propagation is a method developed in the circuit placement community for adding constraints to graph partitioning problems. This paper adapts and expands this idea, and applies it to the problem of partitioning data structures among the processors of a parallel computer. We show how the constraints in terminal propagation can be used to encourage partitions in which messages are communicated only between architecturally near processors. We then show how these constraints can be handled in two important partitioning algorithms, spectral bisection and multilevel-KL. We compare the quality of partitions generated by these algorithms to each other and to partitions generated by more familiar techniques.

TYPE Conference YEAR 1995

An efficient parallel algorithm for matrix-vector multiplication

Hendrickson, Bruce A.

The multiplication of a vector by a matrix is the kernel computation of many algorithms in scientific computation. A fast parallel algorithm for this calculation is therefore necessary if one is to make full use of the new generation of parallel supercomputers. This paper presents a high performance, parallel matrix-vector multiplication algorithm that is particularly well suited to hypercube multiprocessors. For an $n \times n$ matrix on $p$ processors, the communication cost of this algorithm is $O(n/\sqrt{p} + \log p)$, independent of the matrix sparsity pattern. The performance of the algorithm is demonstrated by employing it as the kernel in the well-known NAS conjugate gradient benchmark, where a run time of 6.09 seconds was observed. This is the best published performance on this benchmark achieved to date using a massively parallel supercomputer.
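To give a feel for where an $n/\sqrt{p}$ term can come from, the sketch below serially simulates a generic two-dimensional block decomposition on a $q \times q$ grid of $p = q^2$ processors. It is an illustration under that assumption, not the hypercube algorithm of the report; the function name and test matrix are hypothetical.

```python
def blocked_matvec(A, x, q):
    """Serially simulate y = A x on a q-by-q grid of p = q*q processors.

    Processor (i, j) owns one b-by-b block of A with b = n / q, so it only
    needs b entries of x and contributes a partial result of length b.
    """
    n = len(A)
    if n % q != 0:
        raise ValueError("q must divide the matrix dimension")
    b = n // q
    y = [0.0] * n
    for i in range(q):            # processor row
        for j in range(q):        # processor column
            cols = range(j * b, (j + 1) * b)
            for r in range(i * b, (i + 1) * b):
                # Local multiply, then accumulate into the row-wise sum.
                y[r] += sum(A[r][c] * x[c] for c in cols)
    return y


# Hypothetical 4-by-4 example on a 2-by-2 processor grid.
A = [[float(4 * r + c) for c in range(4)] for r in range(4)]
x = [1.0, 2.0, 3.0, 4.0]
y = blocked_matvec(A, x, q=2)
print(y)  # prints [20.0, 60.0, 100.0, 140.0]
```

In this layout each simulated processor reads and produces vectors of length $n/q = n/\sqrt{p}$, regardless of which entries of its block are nonzero.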

TYPE Report YEAR 1993

An improved spectral graph partitioning algorithm for mapping parallel computations

Hendrickson, Bruce A.

Efficient use of a distributed memory parallel computer requires that the computational load be balanced across processors in a way that minimizes interprocessor communication. We present a new domain mapping algorithm that extends recent work in which ideas from spectral graph theory have been applied to this problem. Our generalization of spectral graph bisection involves a novel use of multiple eigenvectors to allow for division of a computation into four or eight parts at each stage of a recursive decomposition. The resulting method is suitable for scientific computations like irregular finite elements or differences performed on hypercube or mesh architecture machines. Experimental results confirm that the new method provides better decompositions arrived at more economically and robustly than with previous spectral methods. We have also improved upon the known spectral lower bound for graph bisection.
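As background, the baseline that this report generalizes can be sketched in a few lines. The code below is plain spectral bisection with a single eigenvector (the Fiedler vector), not the multiple-eigenvector method of the report; it assumes NumPy is available, and the function name and the two-triangle graph are hypothetical.

```python
import numpy as np


def spectral_bisection(n, edges):
    """Split vertices 0..n-1 into two equal halves using the Fiedler vector."""
    # Build the graph Laplacian L = D - A.
    L = np.zeros((n, n))
    for u, v in edges:
        L[u, u] += 1.0
        L[v, v] += 1.0
        L[u, v] -= 1.0
        L[v, u] -= 1.0
    # eigh returns eigenvalues in ascending order for symmetric matrices,
    # so column 1 belongs to the second-smallest eigenvalue.
    _, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]
    # Sort vertices by their Fiedler entry and cut at the median.
    order = np.argsort(fiedler)
    half = n // 2
    return set(order[:half].tolist()), set(order[half:].tolist())


# Hypothetical example: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
left, right = spectral_bisection(6, edges)
print(sorted(left), sorted(right))
```

The sign of an eigenvector is arbitrary, so which triangle ends up in `left` may vary, but the split separates the two triangles and cuts only the bridging edge.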

TYPE Report YEAR 1992

Parallel QR factorization on a hypercube using the torus wrap mapping

Hendrickson, Bruce A.

We present an algorithm for the QR factorization of a dense matrix without column pivoting on a hypercube multiprocessor. The algorithm combines the optimal numerical efficiency of Householder reflections with the excellent communication properties of the torus wrap mapping. Analytical results indicate that the communication cost for this algorithm is less than that for other common approaches. Numerical results on an nCUBE 2 confirm the efficiency of our technique.
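The numerical building block named in the abstract, QR factorization by Householder reflections without column pivoting, can be sketched serially as below. This is an assumed textbook formulation using NumPy; it says nothing about the torus wrap mapping or the parallel algorithm itself, and the function name and test matrix are hypothetical.

```python
import numpy as np


def householder_qr(A):
    """Return Q (orthogonal) and R (upper triangular) with A = Q @ R."""
    R = np.array(A, dtype=float)
    m, n = R.shape
    Q = np.eye(m)
    for k in range(min(m - 1, n)):
        x = R[k:, k]
        norm_x = np.linalg.norm(x)
        if norm_x == 0.0:
            continue  # column is already zero below the diagonal
        # Reflector v maps x onto a multiple of the first unit vector;
        # the sign choice avoids cancellation in v[0].
        v = x.copy()
        v[0] += np.copysign(norm_x, x[0])
        v /= np.linalg.norm(v)
        # Apply H = I - 2 v v^T to the trailing rows of R and accumulate Q.
        R[k:, :] -= 2.0 * np.outer(v, v @ R[k:, :])
        Q[:, k:] -= 2.0 * np.outer(Q[:, k:] @ v, v)
    return Q, R


# Hypothetical 4-by-3 test matrix.
A = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0],
              [1.0, 0.0, 1.0]])
Q, R = householder_qr(A)
print(np.allclose(Q @ R, A))
```

Each step touches only the trailing rows and columns, which is the property a parallel data layout has to balance across processors as the factorization proceeds.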

TYPE Conference YEAR 1991
