Publications

Results 101–125 of 196

Search results

Jump to search filters

Cooperative application/OS DRAM fault recovery

Hoemmen, Mark F.; Ferreira, Kurt; Heroux, Michael A.; Brightwell, Ronald B.

Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these resilience challenges to more effectively utilize future systems. In this paper, we describe work on a cross-layer application/OS framework to handle uncorrected memory errors. We illustrate the use of this framework through its integration with a new fault-tolerant iterative solver within the Trilinos library, and present initial convergence results.

More Details

MiniGhost : a miniapp for exploring boundary exchange strategies using stencil computations in scientific parallel computing

Barrett, Richard F.; Vaughan, Courtenay T.; Heroux, Michael A.

A broad range of scientific computation involves the use of difference stencils. In a parallel computing environment, this computation is typically implemented by decomposing the spacial domain, inducing a 'halo exchange' of process-owned boundary data. This approach adheres to the Bulk Synchronous Parallel (BSP) model. Because commonly available architectures provide strong inter-node bandwidth relative to latency costs, many codes 'bulk up' these messages by aggregating data into a message as a means of reducing the number of messages. A renewed focus on non-traditional architectures and architecture features provides new opportunities for exploring alternatives to this programming approach. In this report we describe miniGhost, a 'miniapp' designed for exploration of the capabilities of current as well as emerging and future architectures within the context of these sorts of applications. MiniGhost joins the suite of miniapps developed as part of the Mantevo project.

More Details

TriBITS lifecycle model. Version 1.0, a lean/agile software lifecycle model for research-based computational science and engineering and applied mathematical software

Willenbring, James M.; Heroux, Michael A.

Software lifecycles are becoming an increasingly important issue for computational science and engineering (CSE) software. The process by which a piece of CSE software begins life as a set of research requirements and then matures into a trusted high-quality capability is both commonplace and extremely challenging. Although an implicit lifecycle is obviously being used in any effort, the challenges of this process - respecting the competing needs of research vs. production - cannot be overstated. Here we describe a proposal for a well-defined software lifecycle process based on modern Lean/Agile software engineering principles. What we propose is appropriate for many CSE software projects that are initially heavily focused on research but also are expected to eventually produce usable high-quality capabilities. The model is related to TriBITS, a build, integration and testing system, which serves as a strong foundation for this lifecycle model, and aspects of this lifecycle model are ingrained in the TriBITS system. Here, we advocate three to four phases or maturity levels that address the appropriate handling of many issues associated with the transition from research to production software. The goals of this lifecycle model are to better communicate maturity levels with customers and to help to identify and promote Software Engineering (SE) practices that will help to improve productivity and produce better software. An important collection of software in this domain is Trilinos, which is used as the motivation and the initial target for this lifecycle model. However, many other related and similar CSE (and non-CSE) software projects can also make good use of this lifecycle model, especially those that use the TriBITS system. Indeed this lifecycle process, if followed, will enable large-scale sustainable integration of many complex CSE software efforts across several institutions.

More Details

A hybrid-hybrid solver for manycore platforms

SC'11 - Proceedings of the 2011 High Performance Computing Networking, Storage and Analysis Companion, Co-located with SC'11

Rajamanickam, Sivasankaran R.; Boman, Erik G.; Heroux, Michael A.

With the increasing levels of parallelism in a compute node, it is important to exploit multiple levels of parallelism even within a single compute node. We present ShyLU (pro- nounced\Shy-loo"for Scalable Hybrid LU), a\hybrid-hybrid" solver for general sparse linear systems that is hybrid in two ways: First, it combines direct and iterative methods. The iterative method is based on approximate Schur com- plements. Second, the solver uses two levels of parallelism via hybrid programming (MPI+threads). Our solver is use- ful both in shared-memory environments and on large par- allel computers with distributed memory (as a subdomain solver). We compare the robustness of ShyLU against other algebraic preconditioners. ShyLU scales well up to 192 cores for a given problem size. We compare at MPI performance of ShyLU against a hybrid implementation. We conclude that on present multicore nodes at MPI is better. However, for future manycore machines (48 or more cores) hybrid/ hi- erarchical algorithms and implementations are important for sustained performance. Copyright is held by the author/owner(s).

More Details
Results 101–125 of 196
Results 101–125 of 196