Publications

186 Results

Lightweight Software Process Improvement Using Productivity and Sustainability Improvement Planning (PSIP)

Communications in Computer and Information Science

Heroux, Michael A.; Gonsiorowski, Elsa; Gupta, Rinku; Milewicz, Reed M.; Moulton, J.D.; Watson, Gregory R.; Willenbring, James M.; Zamora, Richard J.; Raybourn, Elaine M.

Productivity and Sustainability Improvement Planning (PSIP) is a lightweight, iterative workflow that allows software development teams to identify development bottlenecks and track progress to overcome them. In this paper, we present an overview of PSIP and how it compares to other software process improvement (SPI) methodologies, and provide two case studies that describe how the use of PSIP led to successful improvements in team effectiveness and efficiency.

Trust me. QED

SIAM News

Heroux, Michael A.

Consider a standard SIAM journal article containing theoretical results. Each theorem has a proof that typically builds on previous developments. Since every theorem stems from a firm foundation, the research community can trust a result without further evidence. One could thus argue that a theorem does not require a proof because surely an author would not publish it if no proof existed to back it up. Furthermore, respectable reviewers and editors expect proofs without exception, and papers containing proof-less theorems will likely go unpublished.

Toward a Compatible Reproducibility Taxonomy for Computational and Computing Sciences

Heroux, Michael A.; Barba, Lorena B.; Parashar, Manish P.; Stodden, Victoria S.; Taufer, Michela T.

Reproducibility is an essential ingredient of the scientific enterprise. The ability to reproduce results builds trust that we can rely on the results as foundations for future scientific exploration. Presently, the fields of computational and computing sciences provide two opposing definitions of reproducible and replicable. In computational sciences, reproducible research means authors provide all necessary data and computer codes to run analyses again, so others can re-obtain the results (J. Claerbout et al., 1992). The concept was adopted and extended by several communities, where it was distinguished from replication: collecting new data to address the same question, and arriving at consistent findings (Peng et al. 2006). The Association for Computing Machinery (ACM), representing computer science and industry professionals, recently established a reproducibility initiative, adopting essentially opposite definitions. The purpose of this report is to raise awareness of the opposite definitions and propose a path to a compatible taxonomy.

Scalable Failure Masking for Stencil Computations using Ghost Region Expansion and Cell to Rank Remapping

SIAM Journal on Scientific Computing

Gamell, Marc G.; Teranishi, Keita T.; Kolla, Hemanth K.; Mayo, Jackson M.; Heroux, Michael A.; Chen, Jacqueline H.; Parashar, Manish P.

In order to achieve exascale systems, application resilience needs to be addressed. Some programming models, such as task-DAG (directed acyclic graph) architectures, currently embed resilience features whereas traditional SPMD (single program, multiple data) and message-passing models do not. Since a large part of the community's code base follows the latter models, it is still necessary to take advantage of application characteristics to minimize the overheads of fault tolerance. To that end, this paper explores how recovering from hard process/node failures in a local manner is a natural approach for certain applications to obtain resilience at lower costs in faulty environments. In particular, this paper targets enabling online, semitransparent local recovery for stencil computations on current leadership-class systems as well as presents programming support and scalable runtime mechanisms. Also described and demonstrated in this paper is the effect of failure masking, which allows the effective reduction of impact on total time to solution due to multiple failures. Furthermore, we discuss, implement, and evaluate ghost region expansion and cell-to-rank remapping to increase the probability of failure masking. To conclude, this paper shows the integration of all aforementioned mechanisms with the S3D combustion simulation through an experimental demonstration (using the Titan system) of the ability to tolerate high failure rates (i.e., node failures every five seconds) with low overhead while sustaining performance at large scales. In addition, this demonstration also displays the failure masking probability increase resulting from the combination of both ghost region expansion and cell-to-rank remapping.
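
To make the ghost-region idea concrete: widening the halo lets a rank take several stencil steps between exchanges, so a neighbor that is briefly stalled by recovery is only waited on at the next exchange point. The following 1-D sketch reflects our own simplifying assumptions (radius-1 stencil, equal subdomains, periodic neighbors), not the paper's implementation.

    // 1-D stencil with an expanded ghost region of width G. One exchange of
    // G-deep ghost layers supplies enough data for G communication-free time
    // steps; the valid region shrinks by one cell on each side per step.
    #include <mpi.h>
    #include <vector>

    void run(std::vector<double>& u, int G, int steps, int left, int right,
             MPI_Comm comm) {
      const int n = static_cast<int>(u.size()) - 2 * G;  // interior cells
      for (int t = 0; t < steps; t += G) {
        // Send my leftmost interior cells left, fill my right ghost layer...
        MPI_Sendrecv(&u[G], G, MPI_DOUBLE, left, 0,
                     &u[n + G], G, MPI_DOUBLE, right, 0, comm, MPI_STATUS_IGNORE);
        // ...and vice versa.
        MPI_Sendrecv(&u[n], G, MPI_DOUBLE, right, 1,
                     &u[0], G, MPI_DOUBLE, left, 1, comm, MPI_STATUS_IGNORE);
        for (int s = 0; s < G; ++s) {  // G steps with no communication at all
          std::vector<double> next(u);
          for (int i = s + 1; i + s + 1 < static_cast<int>(u.size()); ++i)
            next[i] = 0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1];
          u.swap(next);
        }
      }
    }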

Modeling and simulating multiple failure masking enabled by local recovery for stencil-based applications at extreme scales

IEEE Transactions on Parallel and Distributed Systems

Gamell, Marc; Teranishi, Keita T.; Mayo, Jackson M.; Kolla, Hemanth K.; Heroux, Michael A.; Chen, Jacqueline H.; Parashar, Manish

Obtaining multi-process hard failure resilience at the application level is a key challenge that must be overcome before the promise of exascale can be fully realized. Previous work has shown that online global recovery can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. If online recovery is performed in a local manner, further scalability is enabled, not only due to the intrinsic lower costs of recovering locally, but also due to derived effects when using some application types. In this paper we model one such effect, namely multiple failure masking, that manifests when running stencil parallel computations in an environment where failures are recovered locally. First, the delay propagation shape of one or multiple failures recovered locally is modeled to enable several analyses of the probability of different levels of failure masking under certain stencil application behaviors. Our results indicate that failure masking is an extremely desirable effect at scale, whose manifestation becomes more evident and beneficial as the machine size or the failure rate increases.

xSDK foundations: Toward an extreme-scale scientific software development kit

Supercomputing Frontiers and Innovations

Bartlett, Roscoe B.; Demeshko, Irina; Gamblin, Todd; Hammond, Glenn E.; Heroux, Michael A.; Johnson, Jeffrey; Klinvex, Alicia M.; Li, Xiaoye; McInnes, Lois C.; Moulton, J.D.; Osei-Kuffuor, Daniel; Sarich, Jason; Smith, Barry; Willenbring, James M.; Yang, Ulrike M.

Extreme-scale computational science increasingly demands multiscale and multiphysics formulations. Combining software developed by independent groups is imperative: no single team has resources for all predictive science and decision support capabilities. Scientific libraries provide high-quality, reusable software components for constructing applications with improved robustness and portability. However, without coordination, many libraries cannot be easily composed. Namespace collisions, inconsistent arguments, lack of third-party software versioning, and additional difficulties make composition costly. The Extreme-scale Scientific Software Development Kit (xSDK) defines community policies to improve code quality and compatibility across independently developed packages (hypre, PETSc, SuperLU, Trilinos, and Alquimia) and provides a foundation for addressing broader issues in software interoperability, performance portability, and sustainability. The xSDK provides turnkey installation of member software and seamless combination of aggregate capabilities, and it marks first steps toward extreme-scale scientific software ecosystems from which future applications can be composed rapidly with assured quality and scalability.

Evaluating Online Global Recovery with Fenix Using Application-Aware In-Memory Checkpointing Techniques

Proceedings of the International Conference on Parallel Processing Workshops

Gamell, Marc; Katz, Daniel S.; Teranishi, Keita T.; Heroux, Michael A.; Van Der Wijngaart, Rob F.; Mattson, Timothy G.; Parashar, Manish

Exascale systems promise the potential for computation at unprecedented scales and resolutions, but achieving exascale by the end of this decade presents significant challenges. A key challenge is due to the very large number of cores and components and the resulting mean time between failures (MTBF) in the order of hours or minutes. Since the typical run times of target scientific applications are longer than this MTBF, fault tolerance techniques will be essential. An important class of failures that must be addressed is process or node failures. While checkpoint/restart (C/R) is currently the most widely accepted technique for addressing processor failures, coordinated, stable-storage-based global C/R might be unfeasible at exascale when the time to checkpoint exceeds the expected MTBF. This paper explores transparent recovery via implicitly coordinated, diskless, application-driven checkpointing as a way to tolerate process failures in MPI applications at exascale. The discussed approach leverages User Level Failure Mitigation (ULFM), which is being proposed as an MPI extension to allow applications to create policies for tolerating process failures. Specifically, this paper demonstrates how different implementations of application-driven in-memory checkpoint storage and recovery compare in terms of performance and scalability. We also experimentally evaluate the effectiveness and scalability of the Fenix online global recovery framework on a production system, the Titan Cray XK7 at ORNL, and demonstrate the ability of Fenix to tolerate dynamically injected failures using the execution of four benchmarks and mini-applications with different behaviors.
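
Of the in-memory checkpoint storage schemes compared, neighbor ("buddy") checkpointing is the easiest to sketch: each rank keeps a copy of a partner's state in RAM so a failed process can be restored without touching the parallel file system. The sketch below uses only standard MPI and is an illustration of the concept, not Fenix's interface.

    // Hedged sketch of diskless buddy checkpointing: ranks are paired and
    // exchange full copies of their local state each checkpoint epoch.
    #include <mpi.h>
    #include <vector>

    void buddy_checkpoint(const std::vector<double>& my_state,
                          std::vector<double>& partner_copy, MPI_Comm comm) {
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);
      const int partner = rank ^ 1;          // pairs (0,1), (2,3), ...
      if (partner >= size) return;           // odd rank count: last rank unpaired
      partner_copy.resize(my_state.size());  // assumes equal-sized states
      MPI_Sendrecv(my_state.data(), static_cast<int>(my_state.size()),
                   MPI_DOUBLE, partner, 0, partner_copy.data(),
                   static_cast<int>(partner_copy.size()), MPI_DOUBLE, partner,
                   0, comm, MPI_STATUS_IGNORE);
    }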

Performance Efficiency and Effectiveness of Supercomputers

Leland, Robert; Rajan, Mahesh R.; Heroux, Michael A.

Our first purpose here is to offer to a general technical and policy audience a perspective on whether the supercomputing community should focus on improving the efficiency of supercomputing systems and their use rather than on building larger and ostensibly more capable systems that are used at low efficiency. After first summarizing our content and defining some necessary terms, we give a concise answer to this question. We then set this in context by characterizing performance of current supercomputing systems on a variety of benchmark problems and actual problems drawn from workloads in the national security, industrial, and scientific context. Along the way we answer some related questions, identify some important technological trends, and offer a perspective on the significance of these trends. Our second purpose is to give a reasonably broad and transparent overview of the related issue space and thereby to better equip the reader to evaluate commentary and controversy concerning supercomputing performance. For example, questions repeatedly arise concerning the Linpack benchmark and its predictive power, so we consider this in moderate depth as an example. We also characterize benchmark and application performance for scientific and engineering use of supercomputers and offer some guidance on how to think about these. Examples here are drawn from traditional scientific computing. Other problem domains, for example, data analytics, have different performance characteristics that are better captured by different benchmark problems or applications, but the story in those domains is similar in character and leads to similar conclusions with regard to the motivating question. For more on this topic, see Large-Scale Data Analytics and Its Relationship to Simulation.

Sustainable & productive: Improving incentives for quality software

CEUR Workshop Proceedings

Heroux, Michael A.

Computational Science and Engineering (CSE) software can benefit substantially from an explicit focus on quality improvement. This is especially true as we face increased demands in both modeling and software complexities. At the same time, just desiring improved quality is not sufficient. We must work with the entities that provide CSE research teams with publication venues, funding, and professional recognition in order to increase incentives for improved software quality. In fact, software quality is precisely calibrated to the expectations, explicit and implicit, set by these entities. We will see broad improvements in sustainability and productivity only when publishers, funding agencies and employers raise their expectations for software quality. CSE software community leaders, those who are in a position to inform and influence these entities, have a unique opportunity to broadly and positively impact software quality by working to establish incentives that will spur creative and novel approaches to improve developer productivity and software sustainability.

Local recovery and failure masking for stencil-based applications at extreme scales

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Gamell, Marc; Teranishi, Keita T.; Heroux, Michael A.; Mayo, Jackson M.; Kolla, Hemanth K.; Chen, Jacqueline H.; Parashar, Manish

Application resilience is a key challenge that has to be addressed to realize the exascale vision. Online recovery, even when it involves all processes, can dramatically reduce the overhead of failures as compared to the more traditional approach where the job is terminated and restarted from the last checkpoint. In this paper we explore how local recovery can be used for certain classes of applications to further reduce overheads due to resilience. Specifically we develop programming support and scalable runtime mechanisms to enable online and transparent local recovery for stencil-based parallel applications on current leadership class systems. We also show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution. We integrate these mechanisms with the S3D combustion simulation, and experimentally demonstrate (using the Titan Cray-XK7 system at ORNL) the ability to tolerate high failure rates (i.e., node failures every 5 seconds) with low overhead while sustaining performance, at scales up to 262144 cores.

Assessing a mini-application as a performance proxy for a finite element method engineering application

Concurrency and Computation: Practice and Experience

Lin, Paul L.; Heroux, Michael A.; Williams, Alan B.; Barrett, Richard F.

The performance of a large-scale, production-quality science and engineering application (‘app’) is often dominated by a small subset of the code. Even within that subset, computational and data access patterns are often repeated, so that an even smaller portion can represent the performance-impacting features. If application developers, parallel computing experts, and computer architects can together identify this representative subset and then develop a small mini-application (‘miniapp’) that can capture these primary performance characteristics, then this miniapp can be used both to improve the performance of the app and to provide a tool for co-design for the high-performance computing community. However, a critical question is whether a miniapp can effectively capture key performance behavior of an app. This study provides a comparison of an implicit finite element semiconductor device modeling app on unstructured meshes with an implicit finite element miniapp on unstructured meshes. The goal is to assess whether the miniapp is predictive of the performance of the app. Single compute node performance is compared, as well as scaling up to 16,000 cores. Results indicate that the miniapp can be reasonably predictive of the performance characteristics of the app for a single iteration of the solver on a single compute node.

Exploring failure recovery for stencil-based applications at extreme scales

HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing

Gamell, Marc; Teranishi, Keita T.; Heroux, Michael A.; Mayo, Jackson M.; Kolla, Hemanth K.; Chen, Jacqueline H.; Parashar, Manish

Application resilience is a key challenge that must be addressed in order to realize the exascale vision. Previous work has shown that online recovery, even when done in a global manner (i.e., involving all processes), can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. In this paper we suggest going one step further, and explore how local recovery can be used for certain classes of applications to reduce the overheads due to failures. Specifically we study the feasibility of local recovery for stencil-based parallel applications and we show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution.

Editorial: ACM TOMS replicated computational results initiative

ACM Transactions on Mathematical Software

Heroux, Michael A.

The scientific community relies on the peer review process for assuring the quality of published material, the goal of which is to build a body of work we can trust. Computational journals such as the ACM Transactions on Mathematical Software (TOMS) use this process for rigorously promoting the clarity and completeness of content, and citation of prior work. At the same time, it is unusual to independently confirm computational results. ACM TOMS has established a Replicated Computational Results (RCR) review process as part of the manuscript peer review process. The purpose is to provide independent confirmation that results contained in a manuscript are replicable. Successful completion of the RCR process awards a manuscript with the Replicated Computational Results Designation. This issue of ACM TOMS contains the first [Van Zee and van de Geijn 2015] of what we anticipate to be a growing number of articles to receive the RCR designation, and the related RCR reviewer report [Willenbring 2015]. We hope that the TOMS RCR process will serve as a model for other publications and increase the confidence in and value of computational results in TOMS articles.

Assessing the role of mini-applications in predicting key performance characteristics of scientific and engineering applications

Journal of Parallel and Distributed Computing

Barrett, R.F.; Crozier, Paul C.; Doerfler, Douglas W.; Heroux, Michael A.; Lin, Paul L.; Thornquist, Heidi K.; Trucano, Timothy G.; Vaughan, Courtenay T.

Computational science and engineering application programs are typically large, complex, and dynamic, and are often constrained by distribution limitations. As a means of making tractable rapid explorations of scientific and engineering application programs in the context of new, emerging, and future computing architectures, a suite of "miniapps" has been created to serve as proxies for full scale applications. Each miniapp is designed to represent a key performance characteristic that does or is expected to significantly impact the runtime performance of an application program. In this paper we introduce a methodology for assessing the ability of these miniapps to effectively represent these performance issues. We applied this methodology to three miniapps, examining the linkage between them and an application they are intended to represent. Herein we evaluate the fidelity of that linkage. This work represents the initial steps required to begin to answer the question, "Under what conditions does a miniapp represent a key performance characteristic in a full app?"

Versioned distributed arrays for resilience in scientific applications: Global View Resilience

Procedia Computer Science

Chien, A.; Balaji, P.; Beckman, P.; Dun, N.; Fang, A.; Fujita, H.; Iskra, K.; Rubenstein, Z.; Zheng, Z.; Schreiber, R.; Hammond, J.; Dinan, J.; Laguna, I.; Richards, D.; Dubey, A.; Van Straalen, B.; Hoemmen, M.; Heroux, Michael A.; Teranishi, Keita T.; Siegel, A.

Exascale studies project reliability challenges for future high-performance computing (HPC) systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVR's interfaces to distributed arrays, versioning, and cross-layer error recovery. Using several large applications (OpenMC, the preconditioned conjugate gradient solver PCG, ddcMD, and Chombo), we evaluate the programmer effort to add resilience. The required changes are small (<2% LOC), localized, and machine-independent, requiring no software architecture changes. We also measure the overhead of adding GVR versioning and show that generally overheads <2% are achieved. We conclude that GVR's interfaces and implementation are flexible and portable and create a gentle-slope path to tolerate growing error rates in future systems.
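
The versioning concept is easy to convey with a toy class: the application commits versions of an array and rolls back to a committed version when an error is detected. GVR's real interface is a library over distributed arrays; the single-process sketch below only illustrates the idea and is not the GVR API.

    // Toy versioned array (illustrative only; not the GVR interface).
    #include <cstddef>
    #include <map>
    #include <vector>

    class VersionedArray {
      std::vector<double> current_;
      std::map<int, std::vector<double>> versions_;  // version id -> snapshot
      int next_version_ = 0;
     public:
      explicit VersionedArray(std::size_t n) : current_(n, 0.0) {}
      std::vector<double>& data() { return current_; }
      int commit() {  // persist the current state as a new version
        versions_[next_version_] = current_;
        return next_version_++;
      }
      void restore(int v) { current_ = versions_.at(v); }  // roll back on error
    };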

Toward local failure local recovery resilience model using MPI-ULFM

ACM International Conference Proceeding Series

Teranishi, Keita T.; Heroux, Michael A.

The current system reaction to the loss of a single MPI process is to kill all the remaining processes and restart the application from the most recent checkpoint. This approach will become unfeasible for future extreme scale systems. We address this issue using an emerging resilient computing model called Local Failure Local Recovery (LFLR) that provides application developers with the ability to recover locally and continue application execution when a process is lost. We discuss the design of our software framework to enable the LFLR model using MPI-ULFM and demonstrate the resilient version of MiniFE that achieves a scalable recovery from process failures.
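
A sketch of what the recovery path looks like on top of ULFM. The MPIX_ calls below follow the ULFM proposal and its reference implementation; the step and state-restoration functions are hypothetical placeholders, and shrinking to the survivors is only one possible policy (a framework may instead substitute spare processes).

    // Hedged LFLR-style recovery flow over MPI-ULFM primitives.
    #include <mpi.h>
    #include <mpi-ext.h>  // MPIX_ ULFM extensions (reference implementation)

    int do_application_step(MPI_Comm comm);  // hypothetical user code
    void restore_local_state();              // hypothetical: reload local state

    int step_with_recovery(MPI_Comm& comm) {
      int rc = do_application_step(comm);
      if (rc == MPIX_ERR_PROC_FAILED) {
        MPIX_Comm_revoke(comm);              // release ranks stuck in MPI calls
        MPI_Comm survivors;
        MPIX_Comm_shrink(comm, &survivors);  // communicator of surviving ranks
        MPI_Comm_free(&comm);
        comm = survivors;
        restore_local_state();               // continue from persisted state
      }
      return rc;
    }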

Report for the ASC CSSE L2 Milestone (4873) - Demonstration of Local Failure Local Recovery Resilient Programming Model

Heroux, Michael A.; Teranishi, Keita T.

Recovery from process loss during the execution of a distributed memory parallel application is presently achieved by restarting the program, typically from a checkpoint file. Future computer system trends indicate that the size of data to checkpoint, the lack of improvement in parallel file system performance and the increase in process failure rates will lead to situations where checkpoint restart becomes infeasible. In this report we describe and prototype the use of a new application level resilient computing model that manages persistent storage of local state for each process such that, if a process fails, recovery can be performed locally without requiring access to a global checkpoint file. LFLR provides application developers with the ability to recover locally and continue application execution when a process is lost. This report discusses what features are required from the hardware, OS and runtime layers, and what approaches application developers might use in the design of future codes, including a demonstration of LFLR-enabled MiniFE code from the Mantevo mini-application suite.

Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Yamazaki, Ichitaro; Rajamanickam, Sivasankaran R.; Boman, Erik G.; Hoemmen, Mark F.; Heroux, Michael A.; Tomov, Stanimire

Krylov subspace projection methods are widely used iterative methods for solving large-scale linear systems of equations. Researchers have demonstrated that communication avoiding (CA) techniques can improve Krylov methods' performance on modern computers, where communication is becoming increasingly expensive compared to arithmetic operations. In this paper, we extend these studies by two major contributions. First, we present our implementation of a CA variant of the Generalized Minimum Residual (GMRES) method, called CA-GMRES, for solving nonsymmetric linear systems of equations on a hybrid CPU/GPU cluster. Our performance results on up to 120 GPUs show that CA-GMRES gives a speedup of up to 2.5x in total solution time over standard GMRES on a hybrid cluster with twelve Intel Xeon CPUs and three Nvidia Fermi GPUs on each node. We then outline a domain decomposition framework to introduce a family of preconditioners that are suitable for CA Krylov methods. Our preconditioners do not incur any additional communication and allow the easy reuse of existing algorithms and software for the subdomain solves. Experimental results on the hybrid CPU/GPU cluster demonstrate that CA-GMRES with preconditioning achieves a speedup of up to 7.4x over CA-GMRES without preconditioning, and a speedup of up to 1.7x over GMRES with preconditioning in total solution time. These results confirm the potential of our framework to develop a practical and effective preconditioned CA Krylov method.
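
The core CA trade-off can be stated in one formula. Rather than interleaving each sparse matrix-vector product with a communication-heavy orthogonalization, an s-step method such as CA-GMRES builds the Krylov basis in one communication phase and orthogonalizes it as a block (e.g., with a tall-skinny QR). This is a schematic of the general CA idea, not the paper's exact formulation:

    % s-step Krylov basis generated per communication phase
    \[
      V_s = [\, v,\; Av,\; A^2 v,\; \dots,\; A^s v \,],
    \]
    % so latency costs are paid roughly once per s iterations instead of
    % once per iteration.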

HPCG Benchmark Technical Specification

Heroux, Michael A.

The High Performance Conjugate Gradient (HPCG) benchmark [cite SNL, UTK reports] is a tool for ranking computer systems based on a simple additive Schwarz, symmetric Gauss-Seidel preconditioned conjugate gradient solver. HPCG is similar to the High Performance Linpack (HPL), or Top 500, benchmark [1] in its purpose, but HPCG is intended to better represent how today’s applications perform. In this paper we describe the technical details of HPCG: how it is designed and implemented, what code transformations are permitted and how to interpret and report results.
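
For reference, the local preconditioner HPCG applies on each subdomain is the classical symmetric Gauss-Seidel operator. With A split into its strictly lower, diagonal, and strictly upper parts, the standard form (a known identity, stated here for context) is:

    \[
      A = L + D + U, \qquad M_{\mathrm{SGS}} = (D + L)\, D^{-1}\, (D + U),
    \]
    % applied as a forward sweep with (D + L) followed by a backward sweep
    % with (D + U), inside the additive Schwarz decomposition.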

Toward a new metric for ranking high performance computing systems

Heroux, Michael A.

The High Performance Linpack (HPL), or Top 500, benchmark [1] is the most widely recognized and discussed metric for ranking high performance computing systems. However, HPL is increasingly unreliable as a true measure of system performance for a growing collection of important science and engineering applications. In this paper we describe a new high performance conjugate gradient (HPCG) benchmark. HPCG is composed of computations and data access patterns more commonly found in applications. Using HPCG we strive for a better correlation to real scientific application performance and expect to drive computer system design and implementation in directions that will better impact performance improvement.
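
For readers unfamiliar with the kernel being timed, below is a minimal single-process sketch of the preconditioned conjugate gradient iteration; HPCG runs this loop on a distributed sparse problem with a symmetric Gauss-Seidel preconditioner, so the operators here are deliberately abstract.

    // Minimal PCG sketch: A and M^{-1} are supplied as black-box operators.
    #include <cmath>
    #include <functional>
    #include <vector>

    using Vec = std::vector<double>;
    using Op = std::function<Vec(const Vec&)>;

    static double dot(const Vec& a, const Vec& b) {
      double s = 0.0;
      for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
      return s;
    }

    Vec pcg(const Op& A, const Op& Minv, Vec x, const Vec& b,
            int maxit, double tol) {
      Vec ax = A(x), r(b.size());
      for (std::size_t i = 0; i < b.size(); ++i) r[i] = b[i] - ax[i];
      Vec z = Minv(r), p = z;
      double rz = dot(r, z);
      for (int k = 0; k < maxit && std::sqrt(dot(r, r)) > tol; ++k) {
        Vec ap = A(p);
        const double alpha = rz / dot(p, ap);   // step length
        for (std::size_t i = 0; i < x.size(); ++i) {
          x[i] += alpha * p[i];
          r[i] -= alpha * ap[i];
        }
        z = Minv(r);                            // apply preconditioner
        const double rz_new = dot(r, z);
        for (std::size_t i = 0; i < p.size(); ++i)
          p[i] = z[i] + (rz_new / rz) * p[i];   // new search direction
        rz = rz_new;
      }
      return x;
    }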

Trilinos developers SQE guide

Willenbring, James M.; Heroux, Michael A.

The Trilinos Project is an effort to develop algorithms and enabling technologies within an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific problems. A new software capability is introduced into Trilinos as a package. A Trilinos package is an integral unit and, although there are exceptions such as utility packages, each package is typically developed by a small team of experts in a particular algorithms area such as algebraic preconditioners, nonlinear solvers, etc. The Trilinos Developers SQE Guide is a resource for Trilinos package developers who are working under Advanced Simulation and Computing (ASC) and are therefore subject to the ASC Software Quality Engineering Practices as described in the Sandia National Laboratories Advanced Simulation and Computing (ASC) Software Quality Plan: ASC Software Quality Engineering Practices Version 3.0 document [1]. The Trilinos Developer Policies webpage [2] contains detailed information that is essential for all Trilinos developers. The Trilinos Software Lifecycle Model [3] defines the default lifecycle model for Trilinos packages and provides a context for many of the practices listed in this document.

ShyLU: A hybrid-hybrid solver for multicore platforms

Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012

Rajamanickam, Sivasankaran R.; Boman, Erik G.; Heroux, Michael A.

With the ubiquity of multicore processors, it is crucial that solvers adapt to the hierarchical structure of modern architectures. We present ShyLU, a "hybrid-hybrid" solver for general sparse linear systems that is hybrid in two ways: First, it combines direct and iterative methods. The iterative part is based on approximate Schur complements, where we compute the approximate Schur complement using a value-based dropping strategy or a structure-based probing strategy. Second, the solver uses two levels of parallelism via hybrid programming (MPI+threads). ShyLU is useful both in shared-memory environments and on large parallel computers with distributed memory. In the latter case, it should be used as a subdomain solver. We argue that with the increasing complexity of compute nodes, it is important to exploit multiple levels of parallelism even within a single compute node. We show the robustness of ShyLU against other algebraic preconditioners. ShyLU scales well up to 384 cores for a given problem size. We also study the MPI-only performance of ShyLU against a hybrid implementation and conclude that on present multicore nodes the MPI-only implementation is better. However, for future multicore machines (96 or more cores) hybrid/hierarchical algorithms and implementations are important for sustained performance.
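
The hybrid structure rests on the standard block factorization from domain decomposition. Ordering subdomain-interior unknowns first and interface unknowns last (notation ours, for context):

    \[
      A = \begin{pmatrix} D & C \\ R & G \end{pmatrix}, \qquad
      S = G - R\, D^{-1} C,
    \]
    % D (block diagonal over subdomain interiors) is factored directly in
    % parallel, while the interface system with the Schur complement S is
    % solved iteratively; the dropping/probing strategies above build the
    % approximation to S.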

Cooperative application/OS DRAM fault recovery

Hoemmen, Mark F.; Ferreira, Kurt; Heroux, Michael A.; Brightwell, Ronald B.

Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these resilience challenges to more effectively utilize future systems. In this paper, we describe work on a cross-layer application/OS framework to handle uncorrected memory errors. We illustrate the use of this framework through its integration with a new fault-tolerant iterative solver within the Trilinos library, and present initial convergence results.

MiniGhost: a miniapp for exploring boundary exchange strategies using stencil computations in scientific parallel computing

Barrett, Richard F.; Vaughan, Courtenay T.; Heroux, Michael A.

A broad range of scientific computation involves the use of difference stencils. In a parallel computing environment, this computation is typically implemented by decomposing the spatial domain, inducing a 'halo exchange' of process-owned boundary data. This approach adheres to the Bulk Synchronous Parallel (BSP) model. Because commonly available architectures provide strong inter-node bandwidth relative to latency costs, many codes 'bulk up' these messages by aggregating data into a message as a means of reducing the number of messages. A renewed focus on non-traditional architectures and architecture features provides new opportunities for exploring alternatives to this programming approach. In this report we describe miniGhost, a 'miniapp' designed for exploration of the capabilities of current as well as emerging and future architectures within the context of these sorts of applications. MiniGhost joins the suite of miniapps developed as part of the Mantevo project.
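
A hedged illustration of the aggregation strategy under study: instead of one message per variable per neighbor, the boundary faces of all variables are packed into a single buffer per neighbor, trading packing cost for fewer latency-bound messages. The buffer layout and symmetric face sizes are our assumptions, not miniGhost's actual code.

    // Pack all variables' boundary faces into one message per neighbor.
    #include <mpi.h>
    #include <vector>

    void exchange_aggregated(const std::vector<std::vector<double>>& face_of_var,
                             std::vector<double>& recv_faces, int neighbor,
                             MPI_Comm comm) {
      std::vector<double> packed;  // single send buffer for all variables
      for (const auto& face : face_of_var)
        packed.insert(packed.end(), face.begin(), face.end());
      recv_faces.resize(packed.size());  // assumes symmetric face sizes
      MPI_Sendrecv(packed.data(), static_cast<int>(packed.size()), MPI_DOUBLE,
                   neighbor, 0, recv_faces.data(),
                   static_cast<int>(recv_faces.size()), MPI_DOUBLE, neighbor,
                   0, comm, MPI_STATUS_IGNORE);
    }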

TriBITS lifecycle model. Version 1.0, a lean/agile software lifecycle model for research-based computational science and engineering and applied mathematical software

Willenbring, James M.; Heroux, Michael A.

Software lifecycles are becoming an increasingly important issue for computational science and engineering (CSE) software. The process by which a piece of CSE software begins life as a set of research requirements and then matures into a trusted high-quality capability is both commonplace and extremely challenging. Although an implicit lifecycle is obviously being used in any effort, the challenges of this process - respecting the competing needs of research vs. production - cannot be overstated. Here we describe a proposal for a well-defined software lifecycle process based on modern Lean/Agile software engineering principles. What we propose is appropriate for many CSE software projects that are initially heavily focused on research but also are expected to eventually produce usable high-quality capabilities. The model is related to TriBITS, a build, integration and testing system, which serves as a strong foundation for this lifecycle model, and aspects of this lifecycle model are ingrained in the TriBITS system. Here, we advocate three to four phases or maturity levels that address the appropriate handling of many issues associated with the transition from research to production software. The goals of this lifecycle model are to better communicate maturity levels with customers and to help to identify and promote Software Engineering (SE) practices that will help to improve productivity and produce better software. An important collection of software in this domain is Trilinos, which is used as the motivation and the initial target for this lifecycle model. However, many other related and similar CSE (and non-CSE) software projects can also make good use of this lifecycle model, especially those that use the TriBITS system. Indeed this lifecycle process, if followed, will enable large-scale sustainable integration of many complex CSE software efforts across several institutions.

LDRD final report: autotuning for scalable linear algebra

Heroux, Michael A.

This report summarizes the progress made as part of a one year lab-directed research and development (LDRD) project to fund the research efforts of Bryan Marker at the University of Texas at Austin. The goal of the project was to develop new techniques for automatically tuning the performance of dense linear algebra kernels. These kernels often represent the majority of computational time in an application. The primary outcome from this work is a demonstration of the value of model driven engineering as an approach to accurately predict and study performance trade-offs for dense linear algebra computations.

Factors impacting performance of multithreaded sparse triangular solve

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Wolf, Michael M.; Heroux, Michael A.; Boman, Erik G.

As computational science applications grow more parallel with multi-core supercomputers having hundreds of thousands of computational cores, it will become increasingly difficult for solvers to scale. Our approach is to use hybrid MPI/threaded numerical algorithms to solve these systems in order to reduce the number of MPI tasks and increase the parallel efficiency of the algorithm. However, we need efficient threaded numerical kernels to run on the multi-core nodes in order to achieve good parallel efficiency. In this paper, we focus on improving the performance of a multithreaded triangular solver, an important kernel for preconditioning. We analyze three factors that affect the parallel performance of this threaded kernel and obtain good scalability on the multi-core nodes for a range of matrix sizes.
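
One of the factors at play is how the kernel's row dependencies are scheduled across threads. A common approach, shown here as a generic sketch rather than the paper's exact scheme, is level scheduling: rows within the same dependency level have no mutual dependencies and can be solved concurrently.

    // Level-scheduled sparse lower-triangular solve; L is in CSR form with
    // unit diagonal, and `levels` is a precomputed dependency-level
    // partition of the rows.
    #include <cstddef>
    #include <vector>

    void lower_solve(const std::vector<int>& rowptr, const std::vector<int>& col,
                     const std::vector<double>& val,
                     const std::vector<std::vector<int>>& levels,
                     std::vector<double>& x) {  // in: rhs b, out: solution
      for (const auto& level : levels) {        // levels execute in order
    #pragma omp parallel for                    // rows in a level in parallel
        for (std::size_t k = 0; k < level.size(); ++k) {
          const int i = level[k];
          double s = x[i];
          for (int p = rowptr[i]; p < rowptr[i + 1]; ++p)
            if (col[p] < i) s -= val[p] * x[col[p]];
          x[i] = s;                             // unit diagonal assumed
        }
      }
    }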

Expanding the Trilinos developer community

Heroux, Michael A.

The Trilinos Project started approximately nine years ago as a small effort to enable research, development and ongoing support of small, related solver software efforts. The 'Tri' in Trilinos was intended to indicate the eventual three packages we planned to develop. In 2007 the project expanded its scope to include any package that was an enabling technology for technical computing. Presently the Trilinos repository contains over 55 packages covering a broad spectrum of reusable tools for constructing full-featured scalable scientific and engineering applications. Trilinos usage is now worldwide, and many applications have an explicit dependence on Trilinos for essential capabilities. Users come from other US laboratories, universities, industry and international research groups. Awareness and use of Trilinos is growing rapidly outside of Sandia. Members of the external research community are becoming more familiar with Trilinos, its design and collaborative nature. As a result, the Trilinos project is receiving an increasing number of requests from external community members who want to contribute to Trilinos as developers. To date, we have worked with external developers in an ad hoc fashion. Going forward, we want to develop a set of policies, procedures, tools and infrastructure to simplify interactions with external developers. As we go forward with multi-laboratory efforts such as CASL and X-Stack, and international projects such as IESP, we will need a more streamlined and explicit process for making external developers 'first-class citizens' in the Trilinos development community. This document is intended to frame the discussion for expanding the Trilinos community to all strategically important external members, while at the same time preserving Sandia's primary leadership role in the project.

Trilinos for emerging parallel computing systems

Heroux, Michael A.

Trilinos is an object-oriented software framework that enables the solution of large-scale, complex multiphysics engineering and scientific problems. Different Trilinos packages build on each other to create a stack providing the necessary capability: (1) Non-linear solver; (2) Linear solver/preconditioner; (3) Distributed linear algebra; and (4) Local linear algebra.

Factors impacting performance of multithreaded triangular solve

Wolf, Michael M.; Heroux, Michael A.; Boman, Erik G.

As computational science applications grow more parallel with multi-core supercomputers having hundreds of thousands of computational cores, it will become increasingly difficult for solvers to scale. Our approach is to use hybrid MPI/threaded numerical algorithms to solve these systems in order to reduce the number of MPI tasks and increase the parallel efficiency of the algorithm. However, we need efficient threaded numerical kernels to run on the multi-core nodes in order to achieve good parallel efficiency. In this paper, we focus on improving the performance of a multithreaded triangular solver, an important kernel for preconditioning. We analyze three factors that affect the parallel performance of this threaded kernel and obtain good scalability on the multi-core nodes for a range of matrix sizes.

Parallel phase model: A programming model for high-end parallel machines with manycores

Proceedings of the International Conference on Parallel Processing

Brightwell, Ronald B.; Heroux, Michael A.; Wen, Zhaofang W.; Wu, Junfeng

This paper presents a parallel programming model, Parallel Phase Model (PPM), for next-generation high-end parallel machines based on a distributed memory architecture consisting of a networked cluster of nodes with a large number of cores on each node. PPM has a unified high-level programming abstraction that facilitates the design and implementation of parallel algorithms to exploit both the parallelism of the many cores and the parallelism at the cluster level. The programming abstraction will be suitable for expressing both fine-grained and coarse-grained parallelism. It includes a few high-level parallel programming language constructs that can be added as an extension to an existing (sequential or parallel) programming language such as C. The implementation of PPM also includes a lightweight runtime library that runs on top of an existing network communication software layer (e.g., MPI). The design philosophy of PPM and details of the programming abstraction are also presented. Several unstructured applications that inherently require high-volume random fine-grained data accesses have been implemented in PPM with very promising results.

Robust algebraic preconditioners using IFPACK 3.0

Sala, Marzio S.; Heroux, Michael A.

IFPACK provides a suite of object-oriented algebraic preconditioners for use with iterative solvers. IFPACK constructors expect the (distributed) real sparse matrix to be an Epetra RowMatrix object. IFPACK can be used to define point and block relaxation preconditioners, various flavors of incomplete factorizations for symmetric and non-symmetric matrices, and one-level additive Schwarz preconditioners with variable overlap. Exact LU factorizations of the local submatrix can be accessed through the Amesos package. IFPACK, as part of the Trilinos Solver Project, interacts well with other Trilinos packages. In particular, IFPACK objects can be used as preconditioners for AztecOO, and as smoothers for ML. IFPACK is mainly written in C++, but only a limited subset of C++ features is used, in order to enhance portability.
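
A usage sketch following the factory pattern IFPACK documents; the headers and the parameter name are taken from IFPACK's documentation, but details vary across Trilinos releases, so treat this as approximate rather than definitive.

    // Create, parameterize, and compute an overlapping-Schwarz ILU
    // preconditioner via the Ifpack factory.
    #include "Ifpack.h"
    #include "Epetra_RowMatrix.h"
    #include "Teuchos_ParameterList.hpp"
    #include "Teuchos_RCP.hpp"

    Teuchos::RCP<Ifpack_Preconditioner> make_prec(Epetra_RowMatrix& A) {
      Ifpack Factory;
      const int OverlapLevel = 1;  // one layer of Schwarz overlap
      Teuchos::RCP<Ifpack_Preconditioner> Prec =
          Teuchos::rcp(Factory.Create("ILU", &A, OverlapLevel));
      Teuchos::ParameterList List;
      List.set("fact: level-of-fill", 1);  // ILU(1)
      Prec->SetParameters(List);
      Prec->Initialize();  // symbolic setup
      Prec->Compute();     // numeric factorization
      return Prec;         // usable wherever an Epetra_Operator is expected
    }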

AztecOO user guide

Heroux, Michael A.

The Trilinos Project is an effort to facilitate the design, development, integration and ongoing support of mathematical software libraries. AztecOO is a package within Trilinos that enables the use of the Aztec solver library [19] with Epetra [13] objects. AztecOO provides access to Aztec preconditioners and solvers by implementing the Aztec 'matrix-free' interface using Epetra. While Aztec is written in C and procedure-oriented, AztecOO is written in C++ and is object-oriented. In addition to providing access to Aztec capabilities, AztecOO also provides some significant new functionality. In particular, it provides an extensible status testing capability that allows expression of sophisticated stopping criteria as is needed in production use of iterative solvers. AztecOO also provides mechanisms for using Ifpack [2], ML [20] and AztecOO itself as preconditioners.
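
A minimal usage sketch consistent with the guide's description; the option constants are standard Aztec options, but consult the guide itself for the full option list and the status-testing interface.

    // Solve A x = b with GMRES and a domain-decomposition (Schwarz)
    // preconditioner using ILU subdomain solves.
    #include "AztecOO.h"
    #include "Epetra_LinearProblem.h"
    #include "Epetra_RowMatrix.h"
    #include "Epetra_Vector.h"

    int solve(Epetra_RowMatrix& A, Epetra_Vector& x, Epetra_Vector& b) {
      Epetra_LinearProblem Problem(&A, &x, &b);
      AztecOO Solver(Problem);
      Solver.SetAztecOption(AZ_solver, AZ_gmres);
      Solver.SetAztecOption(AZ_precond, AZ_dom_decomp);
      Solver.SetAztecOption(AZ_subdomain_solve, AZ_ilu);
      return Solver.Iterate(500, 1.0e-8);  // max iterations, tolerance
    }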

Trilinos 4.0 tutorial

Sala, Marzio S.; Heroux, Michael A.; Day, David M.

The Trilinos Project is an effort to facilitate the design, development, integration and ongoing support of mathematical software libraries. The goal of the Trilinos Project is to develop parallel solver algorithms and libraries within an object-oriented software framework for the solution of large-scale, complex multiphysics engineering and scientific applications. The emphasis is on developing robust, scalable algorithms in a software framework, using abstract interfaces for flexible interoperability of components while providing a full-featured set of concrete classes that implement all the abstract interfaces. This document introduces the use of Trilinos, version 4.0. The presented material includes, among other topics, the definition of distributed matrices and vectors with Epetra, the iterative solution of linear systems with AztecOO, incomplete factorizations with IFPACK, multilevel and domain decomposition preconditioners with ML, direct solution of linear systems with Amesos, and iterative solution of nonlinear systems with NOX. The tutorial is a self-contained introduction, intended to help computational scientists effectively apply the appropriate Trilinos package to their applications. Basic examples are presented that are fit to be imitated. This document is a companion to the Trilinos User's Guide [20] and Trilinos Development Guides [21,22]. Please note that the documentation included in each of the Trilinos packages is of fundamental importance.
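
In the spirit of the tutorial's opening examples, here is a condensed sketch of assembling a distributed tridiagonal matrix with Epetra; it uses the public Epetra interface, but see the tutorial itself for complete, commented programs.

    // Build the 1-D Laplacian (tridiagonal) as a distributed Epetra_CrsMatrix.
    #include "Epetra_MpiComm.h"
    #include "Epetra_Map.h"
    #include "Epetra_CrsMatrix.h"

    Epetra_CrsMatrix* tridiag(int NumGlobalRows, const Epetra_MpiComm& Comm) {
      Epetra_Map Map(NumGlobalRows, 0, Comm);  // linear row distribution
      Epetra_CrsMatrix* A = new Epetra_CrsMatrix(Copy, Map, 3);
      double vals[3] = {-1.0, 2.0, -1.0};
      for (int i = 0; i < Map.NumMyElements(); ++i) {
        const int row = Map.GID(i);
        int cols[3] = {row - 1, row, row + 1};
        if (row == 0)
          A->InsertGlobalValues(row, 2, &vals[1], &cols[1]);  // first row
        else if (row == NumGlobalRows - 1)
          A->InsertGlobalValues(row, 2, vals, cols);          // last row
        else
          A->InsertGlobalValues(row, 3, vals, cols);
      }
      A->FillComplete();  // assemble and build the communication pattern
      return A;
    }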

LDRD report: parallel repartitioning for optimal solver performance

Devine, Karen D.; Boman, Erik G.; Heaphy, Robert T.; Hendrickson, Bruce A.; Heroux, Michael A.

We have developed infrastructure, utilities and partitioning methods to improve data partitioning in linear solvers and preconditioners. Our efforts included incorporation of data repartitioning capabilities from the Zoltan toolkit into the Trilinos solver framework, (allowing dynamic repartitioning of Trilinos matrices); implementation of efficient distributed data directories and unstructured communication utilities in Zoltan and Trilinos; development of a new multi-constraint geometric partitioning algorithm (which can generate one decomposition that is good with respect to multiple criteria); and research into hypergraph partitioning algorithms (which provide up to 56% reduction of communication volume compared to graph partitioning for a number of emerging applications). This report includes descriptions of the infrastructure and algorithms developed, along with results demonstrating the effectiveness of our approaches.

Trilinos 3.1 tutorial

Sala, Marzio S.; Heroux, Michael A.

This document introduces the use of Trilinos, version 3.1. Trilinos has been written to support, in a rigorous manner, the solver needs of the engineering and scientific applications at Sandia National Laboratories. The aim of this manuscript is to present the basic features of some of the Trilinos packages. The presented material includes the definition of distributed matrices and vectors with Epetra, the iterative solution of linear systems with AztecOO, incomplete factorizations with IFPACK, multilevel methods with ML, direct solution of linear systems with Amesos, and iterative solution of nonlinear systems with NOX. With the help of several examples, some of the most important classes and methods are detailed for the inexperienced user. For the most part, each example is commented throughout the text. Other comments can be found in the source of each example. This document is a companion to the Trilinos User's Guide and Trilinos Development Guides. Also, the documentation included in each of the Trilinos packages is of fundamental importance.

Epetra developers coding guidelines

Heroux, Michael A.

Epetra is a package of classes for the construction and use of serial and distributed parallel linear algebra objects. It is one of the base packages in Trilinos. This document describes guidelines for Epetra coding style. The issues discussed here go beyond correct C++ syntax to address issues that make code more readable and self-consistent. The guidelines presented here are intended to aid current and future development of Epetra specifically. They reflect design decisions that were made in the early development stages of Epetra. Some of the guidelines are contrary to more commonly used conventions, but we choose to continue these practices for the purposes of self-consistency. These guidelines are intended to be complementary to policies established in the Trilinos Developers Guide.

An overview of Trilinos

Heroux, Michael A.; Kolda, Tamara G.; Long, Kevin R.; Hoekstra, Robert J.; Pawlowski, Roger P.; Phipps, Eric T.; Salinger, Andrew G.; Williams, Alan B.; Hu, Jonathan J.; Lehoucq, Richard B.; Thornquist, Heidi K.; Tuminaro, Raymond S.; Willenbring, James M.; Bartlett, Roscoe B.; Howle, Victoria E.

The Trilinos Project is an effort to facilitate the design, development, integration and ongoing support of mathematical software libraries. In particular, our goal is to develop parallel solver algorithms and libraries within an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific applications. Our emphasis is on developing robust, scalable algorithms in a software framework, using abstract interfaces for flexible interoperability of components while providing a full-featured set of concrete classes that implement all abstract interfaces. Trilinos uses a two-level software structure designed around collections of packages. A Trilinos package is an integral unit usually developed by a small team of experts in a particular algorithms area such as algebraic preconditioners, nonlinear solvers, etc. Packages exist underneath the Trilinos top level, which provides a common look-and-feel, including configuration, documentation, licensing, and bug-tracking. Trilinos packages are primarily written in C++, but provide some C and Fortran user interface support. We provide an open architecture that allows easy integration with other solver packages and we deliver our software to the outside community via the Gnu Lesser General Public License (LGPL). This report provides an overview of Trilinos, discussing the objectives, history, current development and future plans of the project.

Trilinos users guide

Heroux, Michael A.; Willenbring, James M.

The Trilinos Project is an effort to facilitate the design, development, integration and ongoing support of mathematical software libraries. A new software capability is introduced into Trilinos as a package. A Trilinos package is an integral unit usually developed by a small team of experts in a particular algorithms area such as algebraic preconditioners, nonlinear solvers, etc. The Trilinos Users Guide is a resource for new and existing Trilinos users. Topics covered include how to configure and build Trilinos, what is required to integrate an existing package into Trilinos and examples of how those requirements can be met, as well as what tools and services are available to Trilinos packages. Also discussed are some common practices that are followed by many Trilinos package developers. Finally, a snapshot of current Trilinos packages and their interoperability status is provided, along with a list of supported computer platforms.

Solving complex-valued linear systems via equivalent real formulations

SIAM Journal on Scientific Computing

Day, David M.; Heroux, Michael A.

Most algorithms used in preconditioned iterative methods are generally applicable to complex valued linear systems, with real valued linear systems simply being a special case. However, most iterative solver packages available today focus exclusively on real valued systems, or deal with complex valued systems as an afterthought. One obvious approach to addressing this problem is to recast the complex problem into one of several equivalent real forms and then use a real valued solver to solve the related system. However, well-known theoretical results showing unfavorable spectral properties for the equivalent real forms have diminished enthusiasm for this approach. At the same time, experience has shown that there are situations where using an equivalent real form can be very effective. In this paper, the authors explore this approach, giving both theoretical and experimental evidence that an equivalent real form can be useful for a number of practical situations. Furthermore, they show that by making good use of some of the advanced features of modern solver packages, they can easily generate equivalent real form preconditioners that are computationally efficient and mathematically identical to their complex counterparts. Using their techniques, they are able to solve very ill-conditioned complex valued linear systems for a variety of large scale applications. More importantly, they shed more light on the effectiveness of equivalent real forms and more clearly delineate how and when they should be used.
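
For context, the most frequently used equivalent real form (often labeled K1 in the literature) comes from separating real and imaginary parts; writing the complex system as (A + iB)(x + iy) = c + id gives:

    \[
      \begin{pmatrix} A & -B \\ B & A \end{pmatrix}
      \begin{pmatrix} x \\ y \end{pmatrix}
      =
      \begin{pmatrix} c \\ d \end{pmatrix},
    \]
    % a real system of twice the dimension, to which real-valued solvers and
    % preconditioners apply directly; the paper analyzes several such forms
    % and their spectral properties.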
