Publications

66 Results

Search results

Jump to search filters

Center for Computing Research Highlights

Hendrickson, Bruce A.; Alvin, Kenneth F.; Miller, Leann A.; Collis, Samuel S.

Sandia has a legacy of leadership in the advancement of high performance computing (HPC) at extreme scales. First-of-a-kind scalable distributed-memory parallel platforms such as the Intel Paragon, ASCI Red (the world’s first teraflops computer), and Red Storm (co-developed with Cray) helped form the basis for one of the most successful supercomputer product lines ever: the Cray XT series. Sandia also has pioneered system software elements—including lightweight operating systems, the Portals network programming interface, advanced interconnection network designs, and scalable I/O— that are critical to achieving scalability on large computing systems.

More Details

Advances in Domain Mapping of Massively Parallel Scientific Computations

Leland, Robert W.; Hendrickson, Bruce A.

One of the most important concerns in parallel computing is the proper distribution of workload across processors. For most scientific applications on massively parallel machines, the best approach to this distribution is to employ data parallelism; that is, to break the datastructures supporting a computation into pieces and then to assign those pieces to different processors. Collectively, these partitioning and assignment tasks comprise the domain mapping problem.

More Details

CHALLENGES IN PARALLEL GRAPH PROCESSING

Parallel Processing Letters

Hendrickson, Bruce A.; Berry, Jonathan

Graph algorithms are becoming increasingly important for solving many problems in scientific computing, data mining and other domains. As these problems grow in scale, parallel computing resources are required to meet their computational and memory requirements. Unfortunately, the algorithms, software, and hardware that have worked well for developing mainstream parallel scientific applications are not necessarily effective for large-scale graph problems. In this paper we present the inter-relationships between graph problems, software, and parallel hardware in the current state of the art and discuss how those issues present inherent challenges in solving large-scale graph problems. The range of these challenges suggests a research agenda for the development of scalable high-performance software for graph problems.

More Details

LDRD final report : massive multithreading applied to national infrastructure and informatics

Barrett, Brian; Hendrickson, Bruce A.; Laviolette, Randall A.; Leung, Vitus J.; Mackey, Greg E.; Murphy, Richard C.; Phillips, Cynthia A.; Pinar, Ali P.

Large relational datasets such as national-scale social networks and power grids present different computational challenges than do physical simulations. Sandia's distributed-memory supercomputers are well suited for solving problems concerning the latter, but not the former. The reason is that problems such as pattern recognition and knowledge discovery on large networks are dominated by memory latency and not by computation. Furthermore, most memory requests in these applications are very small, and when the datasets are large, most requests miss the cache. The result is extremely low utilization. We are unlikely to be able to grow out of this problem with conventional architectures. As the power density of microprocessors has approached that of a nuclear reactor in the past two years, we have seen a leveling of Moores Law. Building larger and larger microprocessor-based supercomputers is not a solution for informatics and network infrastructure problems since the additional processors are utilized to only a tiny fraction of their capacity. An alternative solution is to use the paradigm of massive multithreading with a large shared memory. There is only one instance of this paradigm today: the Cray MTA-2. The proposal team has unique experience with and access to this machine. The XMT, which is now being delivered, is a Red Storm machine with up to 8192 multithreaded 'Threadstorm' processors and 128 TB of shared memory. For many years, the XMT will be the only way to address very large graph problems efficiently, and future generations of supercomputers will include multithreaded processors. Roughly 10 MTA processor can process a simple short paths problem in the time taken by the Gordon Bell Prize-nominated distributed memory code on 32,000 processors of Blue Gene/Light. We have developed algorithms and open-source software for the XMT, and have modified that software to run some of these algorithms on other multithreaded platforms such as the Sun Niagara and Opteron multi-core chips.

More Details

Tolerating the community detection resolution limit with edge weighting

Proposed for publication in the Proceedings of the National Academy of Sciences.

Hendrickson, Bruce A.; Laviolette, Randall A.; Phillips, Cynthia A.; Berry, Jonathan

Communities of vertices within a giant network such as the World-Wide-Web are likely to be vastly smaller than the network itself. However, Fortunato and Barthelemy have proved that modularity maximization algorithms for community detection may fail to resolve communities with fewer than {radical} L/2 edges, where L is the number of edges in the entire network. This resolution limit leads modularity maximization algorithms to have notoriously poor accuracy on many real networks. Fortunato and Barthelemy's argument can be extended to networks with weighted edges as well, and we derive this corollary argument. We conclude that weighted modularity algorithms may fail to resolve communities with fewer than {radical} W{epsilon}/2 total edge weight, where W is the total edge weight in the network and {epsilon} is the maximum weight of an inter-community edge. If {epsilon} is small, then small communities can be resolved. Given a weighted or unweighted network, we describe how to derive new edge weights in order to achieve a low {epsilon}, we modify the 'CNM' community detection algorithm to maximize weighted modularity, and show that the resulting algorithm has greatly improved accuracy. In experiments with an emerging community standard benchmark, we find that our simple CNM variant is competitive with the most accurate community detection methods yet proposed.

More Details

Solving elliptic finite element systems in near-linear time with support preconditioners

SIAM Journal on Numerical Analysis

Boman, Erik G.; Hendrickson, Bruce A.; Vavasis, Stephen

We consider linear systems arising from the use of the finite element method for solving scalar linear elliptic problems. Our main result is that these linear systems, which are symmetric and positive semidefinite, are well approximated by symmetric diagonally dominant matrices. Our framework for defining matrix approximation is support theory. Significant graph theoretic work has already been developed in the support framework for preconditioners in the diagonally dominant case, and, in particular, it is known that such systems can be solved with iterative methods in nearly linear time. Thus, our approximation result implies that these graph theoretic techniques can also solve a class of finite element problems in nearly linear time. We show that the support number bounds, which control the number of iterations in the preconditioned iterative solver, depend on mesh quality measures but not on the problem size or shape of the domain. © 2008 Society for Industrial and Applied Mathematics.

More Details

Graph analysis with high-performance computing

Computing in Science and Engineering

Hendrickson, Bruce A.; Berry, Jonathan W.

Large, complex graphs arise in many settings including the Internet, social networks, and communication networks. To study such data sets, the authors explored the use of highperformance computing (HPC) for graph algorithms. They found that the challenges in these applications are quite different from those arising in traditional HPC applications and that massively multithreaded machines are well suited for graph problems. © 2008 IEEE.

More Details

Combinatorial scientific computing: The enabling power of discrete algorithms in computational science

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Hendrickson, Bruce A.; Pothen, Alex

Combinatorial algorithms have long played a crucial, albeit under-recognized role in scientific computing. This impact ranges well beyond the familiar applications of graph algorithms in sparse matrices to include mesh generation, optimization, computational biology and chemistry, data analysis and parallelization. Trends in science and in computing suggest strongly that the importance of discrete algorithms in computational science will continue to grow. This paper reviews some of these many past successes and highlights emerging areas of promise and opportunity. © Springer-Verlag Berlin Heidelberg 2007.

More Details

Algorithms and architectures for high performance analysis of semantic graphs

Hendrickson, Bruce A.

Semantic graphs offer one promising avenue for intelligence analysis in homeland security. They provide a mechanism for describing a wide variety of relationships between entities of potential interest. The vertices are nouns of various types, e.g. people, organizations, events, etc. Edges in the graph represent different types of relationships between entities, e.g. 'is friends with', 'belongs-to', etc. Semantic graphs offer a number of potential advantages as a knowledge representation system. They allow information of different kinds, and collected in differing ways, to be combined in a seamless manner. A semantic graph is a very compressed representation of some of relationship information. It has been reported that the semantic graph can be two orders of magnitude smaller than the processed intelligence data. This allows for much larger portions of the data universe to be resident in computer memory. Many intelligence queries that are relevant to the terrorist threat are naturally expressed in the language of semantic graphs. One example is the search for 'interesting' relationships between two individuals or between an individual and an event, which can be phrased as a search for short paths in the graph. Another example is the search for an analyst-specified threat pattern, which can be cast as an instance of subgraph isomorphism. It is important to note than many kinds of analysis are not relationship based, so these are not good candidates for semantic graphs. Thus, a semantic graph should always be used in conjunction with traditional knowledge representation and interface methods. Operations that involve looking for chains of relationships (e.g. friend of a friend) are not efficiently executable in a traditional relational database. However, the semantic graph can be thought of as a pre-join of the database, and it is ideally suited for these kinds of operations. Researchers at Sandia National Laboratories are working to facilitate semantic graph analysis. Since intelligence datasets can be extremely large, the focus of this work is on the use of parallel computers. We have been working to develop scalable parallel algorithms that will be at the core of a semantic graph analysis infrastructure. Our work has involved two different thrusts, corresponding to two different computer architectures. The first architecture of interest is distributed memory, message passing computers. These machines are ubiquitous and affordable, but they are challenging targets for graph algorithms. Much of our distributed-memory work to date has been collaborative with researchers at Lawrence Livermore National Laboratory and has focused on finding short paths on distributed memory parallel machines. Our implementation on 32K processors of BlueGene/Light finds shortest paths between two specified vertices in just over a second for random graphs with 4 billion vertices.

More Details

Parallel hypergraph partitioning for scientific computing

Boman, Erik G.; Devine, Karen; Heaphy, Robert T.; Hendrickson, Bruce A.

Graph partitioning is often used for load balancing in parallel computing, but it is known that hypergraph partitioning has several advantages. First, hypergraphs more accurately model communication volume, and second, they are more expressive and can better represent nonsymmetric problems. Hypergraph partitioning is particularly suited to parallel sparse matrix-vector multiplication, a common kernel in scientific computing. We present a parallel software package for hypergraph (and sparse matrix) partitioning developed at Sandia National Labs. The algorithm is a variation on multilevel partitioning. Our parallel implementation is novel in that it uses a two-dimensional data distribution among processors. We present empirical results that show our parallel implementation achieves good speedup on several large problems (up to 33 million nonzeros) with up to 64 processors on a Linux cluster.

More Details

Combinatorial parallel and scientific computing

Proposed for publication as a book chapter in "Parallel Scientific Computing".

Hendrickson, Bruce A.

Combinatorial algorithms have long played a pivotal enabling role in many applications of parallel computing. Graph algorithms in particular arise in load balancing, scheduling, mapping and many other aspects of the parallelization of irregular applications. These are still active research areas, mostly due to evolving computational techniques and rapidly changing computational platforms. But the relationship between parallel computing and discrete algorithms is much richer than the mere use of graph algorithms to support the parallelization of traditional scientific computations. Important, emerging areas of science are fundamentally discrete, and they are increasingly reliant on the power of parallel computing. Examples include computational biology, scientific data mining, and network analysis. These applications are changing the relationship between discrete algorithms and parallel computing. In addition to their traditional role as enablers of high performance, combinatorial algorithms are now customers for parallel computing. New parallelization techniques for combinatorial algorithms need to be developed to support these nontraditional scientific approaches. This chapter will describe some of the many areas of intersection between discrete algorithms and parallel scientific computing. Due to space limitations, this chapter is not a comprehensive survey, but rather an introduction to a diverse set of techniques and applications with a particular emphasis on work presented at the Eleventh SIAM Conference on Parallel Processing for Scientific Computing. Some topics highly relevant to this chapter (e.g. load balancing) are addressed elsewhere in this book, and so we will not discuss them here.

More Details

Finding strongly connected components in distributed graphs

Journal of Parallel and Distributed Computing

McLendon, William; Hendrickson, Bruce A.; Plimpton, Steven J.; Rauchwerger, Lawrence

The traditional, serial, algorithm for finding the strongly connected components in a graph is based on depth first search and has complexity which is linear in the size of the graph. Depth first search is difficult to parallelize, which creates a need for a different parallel algorithm for this problem. We describe the implementation of a recently proposed parallel algorithm that finds strongly connected components in distributed graphs, and discuss how it is used in a radiation transport solver. © 2005 Elsevier Inc. All rights reserved.

More Details

A scalable distributed parallel breadth-first search algorithm on BlueGene/L

Proceedings of the ACM/IEEE 2005 Supercomputing Conference, SC'05

Yoo, Andy; Chow, Edmond; Henderson, Keith; McLendon, William; Hendrickson, Bruce A.; Çatalyürek, Ümit

Many emerging large-scale data science applications require searching large graphs distributed across multiple memories and processors. This paper presents a distributed breadth-first search (BFS) scheme that scales for random graphs with up to three billion vertices and 30 billion edges. Scalability was tested on IBM BlueGene/L with 32,768 nodes at the Lawrence Livermore National Laboratory. Scalability was obtained through a series of optimizations, in particular, those that ensure scalable use of memory. We use 2D (edge) partitioning of the graph instead of conventional ID (vertex) partitioning to reduce communication overhead. For Poisson random graphs, we show that the expected size of the messages is scalable for both 2D and ID partitionings. Finally, we have developed efficient collective communication functions for the 3D torus architecture of BlueGene/L that also take advantage of the structure in the problem. The performance and characteristics of the algorithm are measured and reported. © 2005 IEEE.

More Details

Solving elliptic finite element systems in near-linear time with support preconditioners

Proposed for publication in the SIAM Journal on Matrix Analysis.

Boman, Erik G.; Hendrickson, Bruce A.

We consider linear systems arising from the use of the finite element method for solving a certain class of linear elliptic problems. Our main result is that these linear systems, which are symmetric and positive semidefinite, are well approximated by symmetric diagonally dominant matrices. Our framework for defining matrix approximation is support theory. Significant graph theoretic work has already been developed in the support framework for preconditioners in the diagonally dominant case, and in particular it is known that such systems can be solved with iterative methods in nearly linear time. Thus, our approximation result implies that these graph theoretic techniques can also solve a class of finite element problems in nearly linear time. We show that the quality of our approximation, which controls the number of iterations in the preconditioned iterative solver, depends primarily on a mesh quality measure but not on the problem size or shape of the domain.

More Details

Solving elliptic finite element systems in near-linear time with support preconditioners

Proposed for publication in SIAM Journal of Matrix Analysis.

Boman, Erik G.; Hendrickson, Bruce A.

We consider linear systems arising from the use of the finite element method for solving scalar linear elliptic problems. Our main result is that these linear systems, which are symmetric and positive semidefinite, are well approximated by symmetric diagonally dominant matrices. Our framework for defining matrix approximation is support theory. Significant graph theoretic work has already been developed in the support framework for preconditioners in the diagonally dominant case, and in particular it is known that such systems can be solved with iterative methods in nearly linear time. Thus, our approximation result implies that these graph theoretic techniques can also solve a class of finite element problems in nearly linear time. We show that the support number bounds, which control the number of iterations in the preconditioned iterative solver, depend on mesh quality measures but not on the problem size or shape of the domain.

More Details

LDRD report : parallel repartitioning for optimal solver performance

Devine, Karen; Boman, Erik G.; Heaphy, Robert T.; Hendrickson, Bruce A.; Heroux, Michael A.

We have developed infrastructure, utilities and partitioning methods to improve data partitioning in linear solvers and preconditioners. Our efforts included incorporation of data repartitioning capabilities from the Zoltan toolkit into the Trilinos solver framework, (allowing dynamic repartitioning of Trilinos matrices); implementation of efficient distributed data directories and unstructured communication utilities in Zoltan and Trilinos; development of a new multi-constraint geometric partitioning algorithm (which can generate one decomposition that is good with respect to multiple criteria); and research into hypergraph partitioning algorithms (which provide up to 56% reduction of communication volume compared to graph partitioning for a number of emerging applications). This report includes descriptions of the infrastructure and algorithms developed, along with results demonstrating the effectiveness of our approaches.

More Details

Interprocessor communication with memory constraints

Hendrickson, Bruce A.

Many parallel applications require periodic redistribution of workloads and associated data. In a distributed memory computer, this redistribution can be difficult if limited memory is available for receiving messages. The authors propose a model for optimizing the exchange of messages under such circumstances which they call the minimum phase remapping problem. They first show that the problem is NP-Complete, and then analyze several methodologies for addressing it. First, they show how the problem can be phrased as an instance of multi-commodity flow. Next, they study a continuous approximation to the problem. They show that this continuous approximation has a solution which requires at most two more phases than the optimal discrete solution, but the question of how to consistently obtain a good discrete solution from the continuous problem remains open. Finally, they devise a simple and practical approximation algorithm for the problem with a bound of 1.5 times the optimal number of phases.

More Details

Load balancing fictions, falsehoods and fallacies

Applied Mathematical Modelling

Hendrickson, Bruce A.

Effective use of a parallel computer requires that a calculation be carefully divided among the processors. This load balancing problem appears in many guises and has been a fervent area of research for the past decade or more. Although great progress has been made, and useful software tools developed, a number of challenges remain. It is the conviction of the author that these challenges will be easier to address if we first come to terms with some significant shortcomings in our current perspectives. This paper tries to identify several areas in which the prevailing point of view is either mistaken or insufficient. The goal is to motivate new ideas and directions for this important field. © 2000 Elsevier Science Inc.

More Details

Design of dynamic load-balancing tools for parallel applications

Proceedings of the International Conference on Supercomputing

Devine, Karen; Hendrickson, Bruce A.; Boman, Erik G.; Vaughan, Courtenay T.

The design of general-purpose dynamic load-balancing tools for parallel applications is more challenging than the design of static partitioning tools. Both algorithmic and software engineering issues arise. We have addressed many of these issues in the design of the Zoltan dynamic load-balancing library. Zoltan has an object-oriented interface that makes it easy to use and provides separation between the application and the load-balancing algorithms. It contains a suite of dynamic load-balancing algorithms, including both geometric and graph-based algorithms. Its design makes it valuable both as a partitioning tool for a variety of applications and as a research test-bed for new algorithmic development. In this paper, we describe Zoltan's design and demonstrate its use in an unstructured-mesh finite element application.

More Details

Scalable Parallel Crash Simulations

Hendrickson, Bruce A.

We are pleased to submit our efforts in parallelizing the PRONTO application suite for con- sideration in the SuParCup 99 competition. PRONTO is a finite element transient dynamics simulator which includes a smoothed particle hydrodynamics (SPH) capability; it is similar in scope to the well-known DYNA, PamCrash, and ABAQUS codes. Our efforts over the last few years have produced a fully parallel version of the entire PRONTO code which (1) runs fast and scalably on thousands of processors, (2) has performed the largest finite-element transient dynamics simulations we are aware of, and (3) includes several new parallel algorithmic ideas that have solved some difficult problems associated with contact detection and SPH scalability. We motivate this work, describe the novel algorithmic advances, give performance numbers for PRONTO running on Sandia's Intel Teraflop machine, and highlight two prototypical large-scale computations we have performed with the parallel code. We have successfully parallelized a large-scale production transient dynamics code with a novel algorithmic approach that utilizes multiple decompositions for different key segments of the computations. To be able to simulate a more than ten million element model in a few tenths of second per timestep is unprecedented for solid dynamics simulations, especially when full global contact searches are required. The key reason is our new algorithmic ideas for efficiently parallelizing the contact detection stage. To our knowledge scalability of this computation had never before been demonstrated on more than 64 processors. This has enabled parallel PRONTO to become the only solid dynamics code we are aware of that can run effectively on 1000s of processors. More importantly, our parallel performance compares very favorably to the original serial PRONTO code which is optimized for vector supercomputers. On the container crush problem, a Teraflop node is as fast as a single processor of the Cray Jedi. This means that on the Teraflop machine we can now run simulations with tens of millions of elements thousands of times faster than we could on the Jedi! This is enabling transient dynamics simulations of unprecedented scale and fidelity. Not only can previous applications be run with vastly improved resolution and speed, but qualitatively new and different analyses have been made possible.

More Details

Graph Partitioning Models for Parallel Computing

Parallel Computing

Hendrickson, Bruce A.

Calculations can naturally be described as graphs in which vertices represent computation and edges reflect data dependencies. By partitioning the vertices of a graph, the calculation can be divided among processors of a parallel computer. However, the standard methodology for graph partitioning minimizes the wrong metric and lacks expressibility. We survey several recently proposed alternatives and discuss their relative merits.

More Details

Skewed graph partitioning

Hendrickson, Bruce A.

Graph partitioning is an important abstraction used in solving many scientific computing problems. Unfortunately, the standard partitioning model does not incorporate considerations that are important in many settings. We address this by describing a generalized partitioning model which incorporates the notion of partition skew and is applicable to a variety of problems. We then develop enhancements to several important partitioning algorithms necessary to solve the generalized partitioning problem. Finally we demonstrate the benefit of employing several of these generalized methods to static decomposition of parallel computing problems.

More Details

Transient dynamics simulations: Parallel algorithms for contact detection and smoothed particle hydrodynamics

Hendrickson, Bruce A.

Transient dynamics simulations are commonly used to model phenomena such as car crashes, underwater explosions, and the response of shipping containers to high-speed impacts. Physical objects in such a simulation are typically represented by Lagrangian meshes because the meshes can move and deform with the objects as they undergo stress. Fluids (gasoline, water) or fluid-like materials (earth) in the simulation can be modeled using the techniques of smoothed particle hydrodynamics. Implementing a hybrid mesh/particle model on a massively parallel computer poses several difficult challenges. One challenge is to simultaneously parallelize and load-balance both the mesh and particle portions of the computation. A second challenge is to efficiently detect the contacts that occur within the deforming mesh and between mesh elements and particles as the simulation proceeds. These contacts impart forces to the mesh elements and particles which must be computed at each timestep to accurately capture the physics of interest. In this paper we describe new parallel algorithms for smoothed particle hydrodynamics and contact detection which turn out to have several key features in common. Additionally, we describe how to join the new algorithms with traditional parallel finite element techniques to create an integrated particle/mesh transient dynamics simulation. Our approach to this problem differs from previous work in that we use three different parallel decompositions, a static one for the finite element analysis and dynamic ones for particles and for contact detection. We have implemented our ideas in a parallel version of the transient dynamics code PRONTO-3D and present results for the code running on a large Intel Paragon.

More Details

A new parallel algorithm for contact detection in finite element methods

Hendrickson, Bruce A.

In finite-element, transient dynamics simulations, physical objects are typically modeled as Lagrangian meshes because the meshes can move and deform with the objects as they undergo stress. In many simulations, such as computations of impacts or explosions, portions of the deforming mesh come in contact with each other as the simulation progresses. These contacts must be detected and the forces they impart to the mesh must be computed at each timestep to accurately capture the physics of interest. While the finite-element portion of these computations is readily parallelized, the contact detection problem is difficult to implement efficiently on parallel computers and has been a bottleneck to achieving high performance on large parallel machines. In this paper we describe a new parallel algorithm for detecting contacts. Our approach differs from previous work in that we use two different parallel decompositions, a static one for the finite element analysis and dynamic one for contact detection. We present results for this algorithm in a parallel version of the transient dynamics code PRONTO-3D running on a large Intel Paragon.

More Details

Enhancing data locality by using terminal propagation

Hendrickson, Bruce A.

Terminal propagation is a method developed in the circuit placement community for adding constraints to graph partitioning problems. This paper adapts and expands this idea, and applies it to the problem of partitioning data structures among the processors of a parallel computer. We show how the constraints in terminal propagation can be used to encourage partitions in which messages are communicated only between architecturally near processors. We then show how these constraints can be handled in two important partitioning algorithms, spectral bisection and multilevel-KL. We compare the quality of partitions generated by these algorithms to each other and to Partitions generated by more familiar techniques.

More Details

An efficient parallel algorithm for matrix-vector multiplication

Hendrickson, Bruce A.

The multiplication of a vector by a matrix is the kernel computation of many algorithms in scientific computation. A fast parallel algorithm for this calculation is therefore necessary if one is to make full use of the new generation of parallel supercomputers. This paper presents a high performance, parallel matrix-vector multiplication algorithm that is particularly well suited to hypercube multiprocessors. For an n x n matrix on p processors, the communication cost of this algorithm is O(n/{radical}p + log(p)), independent of the matrix sparsity pattern. The performance of the algorithm is demonstrated by employing it as the kernel in the well-known NAS conjugate gradient benchmark, where a run time of 6.09 seconds was observed. This is the best published performance on this benchmark achieved to date using a massively parallel supercomputer.

More Details

An improved spectral graph partitioning algorithm for mapping parallel computations

Hendrickson, Bruce A.

Efficient use of a distributed memory parallel computer requires that the computational load be balanced across processors in a way that minimizes interprocessor communication. We present a new domain mapping algorithm that extends recent work in which ideas from spectral graph theory have been applied to this problem. Our generalization of spectral graph bisection involves a novel use of multiple eigenvectors to allow for division of a computation into four or eight parts at each stage of a recursive decomposition. The resulting method is suitable for scientific computations like irregular finite elements or differences performed on hypercube or mesh architecture machines. Experimental results confirm that the new method provides better decompositions arrived at more economically and robustly than with previous spectral methods. We have also improved upon the known spectral lower bound for graph bisection.

More Details

Parallel QR factorization on a hypercube using the torus wrap mapping

Hendrickson, Bruce A.

We present an algorithm for the QR factorization of a dense matrix without column pivoting on a hypercube multiprocessor. The algorithm combines the optimal numerical efficiency of Householder reflections with the excellent communication properties of the torus wrap mapping. Analytical results indicate that the communication cost for this algorithm is less than that for other common approaches. Numerical results on an nCUBE 2 confirm the efficiency of our technique. 23 refs., 5 figs., 1 tab.

More Details
66 Results
66 Results