Publications

Results 1–25 of 36

ExaLearn – GenTen Tensor Software ECP Milestone

Kolla, Hemanth K.; Phipps, Eric T.; Wolf, Michael W.

The objective of this milestone was to finish integrating the GenTen tensor software with the combustion application Pele using the Ascent in situ analysis software, in partnership with the ALPINE and Pele teams, and to demonstrate the use of tensor analysis as part of a combustion simulation.

TYPE Other Report YEAR 2022

OSTI DOI

Split Bregman optimizer for online generalized CP tensor decomposition

Gilman, Kyle G.; Phipps, Eric T.; Kolla, Hemanth K.

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

In-Situ Machine Learning for Intelligent Data Capture on Exascale Platforms

Davis IV, Warren L.; Shead, Timothy M.; Kolla, Hemanth K.; Reed, Kevin R.; Kegelmeyer, William P.; Popoola, Gabriel A.

Abstract not provided.

TYPE Presentation YEAR 2020

OSTI

The Potential of Integrated Machine Learning Algorithms for Tropical Cyclone Detection in Advanced Climate Modeling

Davis IV, Warren L.; Shead, Timothy M.; Kolla, Hemanth K.; Popoola, Gabriel A.; Kegelmeyer, William P.; Konduri, Aditya K.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

A Framework for In-Situ Anomaly Detection in HPC Environments

Shead, Timothy M.; Kolla, Hemanth K.; Konduri, Aditya K.; Popoola, Gabriel A.; Davis, Warren L.; Dunlavy, Daniel D.; Reed, Kevin R.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

In-Situ Machine Learning for Intelligent Data Capture on Exascale Platforms

Davis, Warren L.; Shead, Timothy M.; Kolla, Hemanth K.; Kegelmeyer, William P.; Popoola, Gabriel A.; Reed, Kevin R.

Abstract not provided.

TYPE Presentation YEAR 2019

OSTI

A Framework for In-Situ Anomaly Detection in HPC Environments

Shead, Timothy M.; Dunlavy, Daniel D.; Kolla, Hemanth K.; Konduri, Aditya K.; Popoola, Gabriel A.; Davis, Warren L.; Kegelmeyer, William P.; Reed, Kevin R.; Ling, Julia L.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

SNL ATDM: In-situ Compression with ParaView/TuckerMPI

Kolla, Hemanth K.; Oldfield, Ron A.; Otahal, Thomas J.; Baker, Gavin M.; Mauldin, Jeffrey A.; Kolda, Tamara G.; Moreland, Kenneth D.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

In-Situ Machine Learning for Intelligent Data Capture in HPC Simulations

Davis, Warren L.; Dunlavy, Daniel D.; Kegelmeyer, William P.; Kolla, Hemanth K.; Konduri, Aditya K.; Shead, Timothy M.; Reed, Kevin R.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Event Detection in Multi-Variate Scientific Simulations Using Feature Anomaly Metrics

Konduri, Aditya K.; Kolla, Hemanth K.; Ling, Julia L.; Kegelmeyer, William P.; Dunlavy, Daniel D.; Shead, Timothy M.; Davis, Warren L.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Higher Order Joint Moment Tensor Decomposition

Kolla, Hemanth K.; Konduri, Aditya K.; Rai, Prashant R.; Kolda, Tamara G.; Davis, Warren L.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Particle in Cell Algorithms and Codes Toward the Next Generation Architectures

Markosyan, Aram H.; Bettencourt, Matthew T.; Bennett, Janine C.; Lifflander, Jonathan; Hollman, David S.; Wilke, Jeremiah J.; Kolla, Hemanth K.; Moore, Christopher H.; Clay, Robert L.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Scalable Failure Masking for Stencil Computations using Ghost Region Expansion and Cell to Rank Remapping

SIAM Journal on Scientific Computing

Gamell, Marc G.; Teranishi, Keita T.; Kolla, Hemanth K.; Mayo, Jackson M.; Heroux, Michael A.; Chen, Jacqueline H.; Parashar, Manish P.

In order to achieve exascale systems, application resilience needs to be addressed. Some programming models, such as task-DAG (directed acyclic graphs) architectures, currently embed resilience features whereas traditional SPMD (single program, multiple data) and message-passing models do not. Since a large part of the community's code base follows the latter models, it is still required to take advantage of application characteristics to minimize the overheads of fault tolerance. To that end, this paper explores how recovering from hard process/node failures in a local manner is a natural approach for certain applications to obtain resilience at lower costs in faulty environments. In particular, this paper targets enabling online, semitransparent local recovery for stencil computations on current leadership-class systems as well as presents programming support and scalable runtime mechanisms. Also described and demonstrated in this paper is the effect of failure masking, which allows the effective reduction of impact on total time to solution due to multiple failures. Furthermore, we discuss, implement, and evaluate ghost region expansion and cell-to-rank remapping to increase the probability of failure masking. To conclude, this paper shows the integration of all aforementioned mechanisms with the S3D combustion simulation through an experimental demonstration (using the Titan system) of the ability to tolerate high failure rates (i.e., node failures every five seconds) with low overhead while sustaining performance at large scales. In addition, this demonstration also displays the failure masking probability increase resulting from the combination of both ghost region expansion and cell-to-rank remapping.

TYPE Journal Article YEAR 2017

OSTI DOI

Modeling and simulating multiple failure masking enabled by local recovery for stencil-based applications at extreme scales

IEEE Transactions on Parallel and Distributed Systems

Gamell, Marc; Teranishi, Keita T.; Mayo, Jackson M.; Kolla, Hemanth K.; Heroux, Michael A.; Chen, Jacqueline H.; Parashar, Manish

Obtaining multi-process hard failure resilience at the application level is a key challenge that must be overcome before the promise of exascale can be fully realized. Previous work has shown that online global recovery can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. If online recovery is performed in a local manner, further scalability is enabled, not only due to the intrinsically lower costs of recovering locally, but also due to derived effects when using some application types. In this paper we model one such effect, namely multiple failure masking, that manifests when running stencil parallel computations in an environment where failures are recovered locally. First, the delay propagation shape of one or multiple failures recovered locally is modeled to enable several analyses of the probability of different levels of failure masking under certain stencil application behaviors. Our results indicate that failure masking is an extremely desirable effect at scale, whose manifestation becomes more evident and beneficial as the machine size or the failure rate increases.

TYPE Journal Article YEAR 2017

Scopus OSTI DOI

Exploring DARMA Abstraction Layer for PIC and DSMC Kernels on Next Generation Platforms

Markosyan, Aram H.; Bettencourt, Matthew T.; Bennett, Janine C.; Lifflander, Jonathan; Hollman, David S.; Wilke, Jeremiah J.; Kolla, Hemanth K.; Moore, Christopher H.; Clay, Robert L.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Embedding Python for In-Situ Analysis

Shead, Timothy M.; Konduri, Aditya K.; Kolla, Hemanth K.; Dunlavy, Daniel D.; Kegelmeyer, William P.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Exploring DARMA Abstraction Layer for PIC and DSMC Kernels on Next Generation Platforms

Markosyan, Aram H.; Bettencourt, Matthew T.; Bennett, Janine C.; Lifflander, Jonathan; Hollman, David S.; Wilke, Jeremiah J.; Kolla, Hemanth K.; Moore, Christopher H.; Clay, Robert L.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Tackling UQ in DARMA, a Programming Model for Task-Based Execution at Extreme-Scale

Rizzi, Francesco N.; Phipps, Eric T.; Hollman, David S.; Lifflander, Jonathan; Wilke, Jeremiah J.; Markosyan, Aram H.; Kolla, Hemanth K.; Slattengren, Nicole S.; Teranishi, Keita T.; Stewart, James R.; Clay, Robert L.; Bennett, Janine C.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Evaluating the Charm++ Runtime's Ability to Cope with Performance Heterogeneity

Gamell, Marc G.; Teranishi, Keita T.; Knight, Samuel K.; Sjaardema, Gregory D.; Kolla, Hemanth K.; Wilke, Jason W.; Slattengren, Nicole S.; Ferreira, Kurt B.; Bennett, Janine C.; Jain, Nikhil J.; Kale, Laxmikant K.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Using Feature Importance Metrics to Detect Events of Interest in Scientific Computing Applications

Ling, Julia L.; Kegelmeyer, William P.; Konduri, Aditya K.; Kolla, Hemanth K.; Reed, Kevin R.; Shead, Timothy M.; Davis, Warren L.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI DOI

Local recovery and failure masking for stencil-based applications at extreme scales

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Gamell, Marc; Teranishi, Keita T.; Heroux, Michael A.; Mayo, Jackson M.; Kolla, Hemanth K.; Chen, Jacqueline H.; Parashar, Manish

Application resilience is a key challenge that has to be addressed to realize the exascale vision. Online recovery, even when it involves all processes, can dramatically reduce the overhead of failures as compared to the more traditional approach where the job is terminated and restarted from the last checkpoint. In this paper we explore how local recovery can be used for certain classes of applications to further reduce overheads due to resilience. Specifically, we develop programming support and scalable runtime mechanisms to enable online and transparent local recovery for stencil-based parallel applications on current leadership-class systems. We also show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution. We integrate these mechanisms with the S3D combustion simulation, and experimentally demonstrate (using the Titan Cray-XK7 system at ORNL) the ability to tolerate high failure rates (i.e., node failures every 5 seconds) with low overhead while sustaining performance, at scales up to 262,144 cores.

TYPE Conference Poster YEAR 2015

Scopus OSTI DOI

ASC ATDM Level 2 Milestone #5325: Asynchronous Many-Task Runtime System Analysis and Assessment for Next Generation Platforms

Baker, Gavin M.; Bettencourt, Matthew T.; Bova, S.W.; Franko, Ken F.; Gamell, Marc G.; Grant, Ryan E.; Hammond, Simon D.; Hollman, David S.; Knight, Samuel K.; Kolla, Hemanth K.; Lin, Paul L.; Olivier, Stephen O.; Sjaardema, Gregory D.; Slattengren, Nicole L.; Teranishi, Keita T.; Wilke, Jeremiah J.; Bennett, Janine C.; Clay, Robert L.; Kale, Laxmikant K.; Jain, Nikhil J.; Mikida, Eric M.; Aiken, Alex A.; Bauer, Michael B.; Lee, Wonchan L.; Slaughter, Elliott S.; Treichler, Sean T.; Berzins, Martin B.; Harman, Todd H.; Humphreys, Alan H.; Schmidt, John S.; Sunderland, Dan S.; McCormick, Pat M.; Gutierrez, Samuel G.; Schulz, Martin S.; Gamblin, Todd G.; Bremer, Peer-Timo B.

Abstract not provided.

TYPE Presentation YEAR 2015

OSTI

ASC ATDM Level 2 Milestone #5325: Asynchronous Many-Task Runtime System Analysis and Assessment for Next Generation Platforms

Baker, Gavin M.; Bettencourt, Matthew T.; Bova, S.W.; Franko, Ken F.; Gamell, Marc G.; Grant, Ryan E.; Hammond, Simon D.; Hollman, David S.; Knight, Samuel K.; Kolla, Hemanth K.; Lin, Paul L.; Olivier, Stephen O.; Sjaardema, Gregory D.; Slattengren, Nicole L.; Teranishi, Keita T.; Wilke, Jeremiah J.; Bennett, Janine C.; Clay, Robert L.; Kale, Laxmikant K.; Jain, Nikhil J.; Mikida, Eric M.; Aiken, Alex A.; Bauer, Michael B.; Lee, Wonchan L.; Slaughter, Elliott S.; Treichler, Sean T.; Berzins, Martin B.; Harman, Todd H.; Humphreys, Alan H.; Schmidt, John S.; Sunderland, Dan S.; McCormick, Pat M.; Gutierrez, Samuel G.; Schulz, Martin S.; Gamblin, Todd G.; Bremer, Peer-Timo B.

This report provides in-depth information and analysis to help create a technical road map for developing next-generation programming models and runtime systems that support Advanced Simulation and Computing (ASC) workload requirements. The focus herein is on asynchronous many-task (AMT) model and runtime systems, which are of great interest in the context of "exascale" computing, as they hold the promise to address key issues associated with future extreme-scale computer architectures. This report includes a thorough qualitative and quantitative examination of three best-of-class AMT runtime systems: Charm++, Legion, and Uintah, all of which are in use as part of the ASC Predictive Science Academic Alliance Program II (PSAAP-II) Centers. The studies focus on each of the runtimes' programmability, performance, and mutability. Through the experiments and analysis presented, several overarching findings emerge. From a performance perspective, AMT runtimes show tremendous potential for addressing extreme-scale challenges. Empirical studies show an AMT runtime can mitigate performance heterogeneity inherent to the machine itself and that Message Passing Interface (MPI) and AMT runtimes perform comparably under balanced conditions. From a programmability and mutability perspective, however, none of the runtimes in this study are currently ready for use in developing production-ready Sandia ASC applications. The report concludes by recommending a co-design path forward, wherein application, programming model, and runtime system developers work together to define requirements and solutions. Such a requirements-driven co-design approach benefits the community as a whole, with widespread community engagement mitigating risk for both application developers and high-performance computing runtime system developers.

TYPE SAND Report YEAR 2015

OSTI DOI

Asynchronous Many-Task Programming Models for Next Generation Platforms

Wilke, Jeremiah J.; Bettencourt, Matthew T.; Bova, S.W.; Franko, Ken F.; Gamell, Marc G.; Grant, Ryan E.; Hammond, Simon D.; Hollman, David S.; Knight, Samuel K.; Kolla, Hemanth K.; Lin, Paul L.; Olivier, Stephen L.; Sjaardema, Gregory D.; Slattengren, Nicole S.; Teranishi, Keita T.; Bennett, Janine C.; Clay, Robert L.

Abstract not provided.

TYPE Presentation YEAR 2015

OSTI

Local Recovery and Failure Masking for Stencil-based Applications at Extreme Scales

Gamell Balmana, Marc G.; Teranishi, Keita T.; Heroux, Michael A.; Mayo, Jackson M.; Kolla, Hemanth K.; Chen, Jacqueline H.; Parashar, Manish P.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI
