Center for Computing Research (CCR)

Baker, Gavin M.; Bettencourt, Matthew T.; Bova, S.W.; Franko, Ken F.; Gamell, Marc G.; Grant, Ryan E.; Hammond, Simon D.; Hollman, David S.; Knight, Samuel K.; Kolla, Hemanth K.; Lin, Paul L.; Olivier, Stephen O.; Sjaardema, Gregory D.; Slattengren, Nicole L.; Teranishi, Keita T.; Wilke, Jeremiah J.; Bennett, Janine C.; Clay, Robert L.; Kale, Laxmikant K.; Jain, Nikhil J.; Mikida, Eric M.; Aiken, Alex A.; Bauer, Michael B.; Lee, Wonchan L.; Slaughter, Elliott S.; Treichler, Sean T.; Berzins, Martin B.; Harman, Todd H.; Humphreys, Alan H.; Schmidt, John S.; Sunderland, Dan S.; McCormick, Pat M.; Gutierrez, Samuel G.; Schulz, Martin S.; Gamblin, Todd G.; Bremer, Peer-Timo B.

Abstract not provided.

TYPE Presentation YEAR 2015

ASC ATDM Level 2 Milestone #5325: Asynchronous Many-Task Runtime System Analysis and Assessment for Next Generation Platforms

Baker, Gavin M.; Bettencourt, Matthew T.; Bova, S.W.; Franko, Ken F.; Gamell, Marc G.; Grant, Ryan E.; Hammond, Simon D.; Hollman, David S.; Knight, Samuel K.; Kolla, Hemanth K.; Lin, Paul L.; Olivier, Stephen O.; Sjaardema, Gregory D.; Slattengren, Nicole L.; Teranishi, Keita T.; Wilke, Jeremiah J.; Bennett, Janine C.; Clay, Robert L.; Kale, Laxmikant K.; Jain, Nikhil J.; Mikida, Eric M.; Aiken, Alex A.; Bauer, Michael B.; Lee, Wonchan L.; Slaughter, Elliott S.; Treichler, Sean T.; Berzins, Martin B.; Harman, Todd H.; Humphreys, Alan H.; Schmidt, John S.; Sunderland, Dan S.; McCormick, Pat M.; Gutierrez, Samuel G.; Schulz, Martin S.; Gamblin, Todd G.; Bremer, Peer-Timo B.

This report provides in-depth information and analysis to help create a technical road map for developing next-generation programming models and runtime systems that support Advanced Simulation and Computing (ASC) workload requirements. The focus herein is on asynchronous many-task (AMT) programming models and runtime systems, which are of great interest in the context of exascale computing, as they hold the promise to address key issues associated with future extreme-scale computer architectures. This report includes a thorough qualitative and quantitative examination of three best-of-class AMT runtime systems (Charm++, Legion, and Uintah), all of which are in use as part of the ASC Predictive Science Academic Alliance Program II (PSAAP-II) Centers. The studies focus on each runtime's programmability, performance, and mutability. Through the experiments and analysis presented, several overarching findings emerge. From a performance perspective, AMT runtimes show tremendous potential for addressing extreme-scale challenges. Empirical studies show that an AMT runtime can mitigate performance heterogeneity inherent to the machine itself and that Message Passing Interface (MPI) and AMT runtimes perform comparably under balanced conditions. From a programmability and mutability perspective, however, none of the runtimes in this study is currently ready for use in developing production-ready Sandia ASC applications. The report concludes by recommending a co-design path forward, wherein application, programming model, and runtime system developers work together to define requirements and solutions. Such a requirements-driven co-design approach benefits the community as a whole, with widespread community engagement mitigating risk for both application developers and high-performance computing runtime system developers.

TYPE SAND Report YEAR 2015
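The finding that an AMT runtime can absorb machine-level performance heterogeneity, while matching MPI under balanced conditions, is commonly attributed to over-decomposition combined with dynamic scheduling. The toy model below is a minimal sketch under stated assumptions, not a reproduction of the report's experiments: all tasks have equal cost, per-worker speeds are chosen by hand, and communication is ignored. The helpers `static_makespan` (one fixed share of tasks per worker, as in a typical bulk-synchronous decomposition) and `dynamic_makespan` (each task handed to whichever worker frees up first) are hypothetical and written only for illustration.

```python
import heapq
from typing import List


def static_makespan(num_tasks: int, speeds: List[float], cost: float = 1.0) -> float:
    """Fixed decomposition: each worker owns one share of the tasks up front."""
    base, extra = divmod(num_tasks, len(speeds))
    return max((base + (1 if i < extra else 0)) * cost / s for i, s in enumerate(speeds))


def dynamic_makespan(num_tasks: int, speeds: List[float], cost: float = 1.0) -> float:
    """Over-decomposition: each task goes to whichever worker becomes free first."""
    free_at = [(0.0, i) for i in range(len(speeds))]  # (time worker is free, worker id)
    heapq.heapify(free_at)
    for _ in range(num_tasks):
        t, i = heapq.heappop(free_at)
        heapq.heappush(free_at, (t + cost / speeds[i], i))
    return max(t for t, _ in free_at)


if __name__ == "__main__":
    balanced = [1.0, 1.0, 1.0, 1.0]
    skewed = [1.0, 1.0, 1.0, 0.5]  # hypothetical: one worker runs at half speed
    for speeds in (balanced, skewed):
        print(speeds, static_makespan(64, speeds), dynamic_makespan(64, speeds))
```

With equal speeds both strategies finish at the same time, loosely mirroring the "comparable under balanced conditions" finding; with one slow worker the fixed decomposition waits on the straggler, while the dynamic schedule shifts work to the faster workers. Production AMT runtimes such as Charm++ pursue this with mechanisms like migratable objects and load balancers, which this toy model does not attempt to capture.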

Asynchronous Many-Task Programming Models for Next Generation Platforms

Wilke, Jeremiah J.; Bettencourt, Matthew T.; Bova, S.W.; Franko, Ken F.; Gamell, Marc G.; Grant, Ryan E.; Hammond, Simon D.; Hollman, David S.; Knight, Samuel K.; Kolla, Hemanth K.; Lin, Paul L.; Olivier, Stephen L.; Sjaardema, Gregory D.; Slattengren, Nicole S.; Teranishi, Keita T.; Bennett, Janine C.; Clay, Robert L.

Abstract not provided.

TYPE Presentation YEAR 2015

Asynchronous Many-Task Programming Models for Next Generation Platforms

Bennett, Janine C.; Wilke, Jeremiah J.; Slattengren, Nicole S.; Teranishi, Keita T.; Lin, Paul L.; Sjaardema, Gregory D.; Kolla, Hemanth K.

Abstract not provided.

TYPE Presentation YEAR 2015

Asynchronous Many-Task Programming Models for Next Generation Platforms

Bennett, Janine C.; Wilke, Jeremiah J.; Slattengren, Nicole S.; Teranishi, Keita T.; Lin, Paul L.; Sjaardema, Gregory D.; Kolla, Hemanth K.; Hollman, David S.; Knight, Samuel K.; Franko, Ken F.; Clay, Robert L.

Abstract not provided.

TYPE Presentation YEAR 2015

DHARMA: Distributed asyncHronous Adaptive Resilient Management of Applications

Bennett, Janine C.; Wilke, Jeremiah J.; Slattengren, Nicole S.; Teranishi, Keita T.; Franko, Kenneth J.; Sjaardema, Gregory D.; Lin, Paul L.; Kolla, Hemanth K.

Abstract not provided.

TYPE Conference Poster YEAR 2015

Embedding Python for In-Situ Analysis

Shead, Timothy M.; Konduri, Aditya K.; Kolla, Hemanth K.; Dunlavy, Daniel D.; Kegelmeyer, William P.

Abstract not provided.

TYPE Conference Poster YEAR 2017

Evaluating the Charm++ Runtime's Ability to Cope with Performance Heterogeneity

Gamell, Marc G.; Teranishi, Keita T.; Knight, Samuel K.; Sjaardema, Gregory D.; Kolla, Hemanth K.; Wilke, Jason W.; Slattengren, Nicole S.; Ferreira, Kurt B.; Bennett, Janine C.; Jain, Nikhil J.; Kale, Laxmikant K.

Abstract not provided.

TYPE Conference Poster YEAR 2017

Evaluation of Asynchronous Multitask Programming Models using Mini-Applications

Franko, Kenneth J.; Sjaardema, Gregory D.; Bennett, Janine C.; Kolla, Hemanth K.; Lin, Paul L.; Teranishi, Keita T.; Wilke, Jeremiah J.

Abstract not provided.

TYPE Presentation YEAR 2015

Event Detection in Multi-Variate Scientific Simulations Using Feature Anomaly Metrics

Konduri, Aditya K.; Kolla, Hemanth K.; Ling, Julia L.; Kegelmeyer, William P.; Dunlavy, Daniel D.; Shead, Timothy M.; Davis, Warren L.

Abstract not provided.

TYPE Conference Poster YEAR 2018

ExaLearn – GenTen Tensor Software ECP Milestone

Kolla, Hemanth K.; Phipps, Eric T.; Wolf, Michael W.

The objective of this milestone was to finish integrating the GenTen tensor software with the Pele combustion application using the Ascent in situ analysis software, in partnership with the ALPINE and Pele teams, and to demonstrate the use of tensor analysis as part of a combustion simulation.

TYPE Other Report YEAR 2022

Exploring DARMA Abstraction Layer for PIC and DSMC Kernels on Next Generation Platforms

Markosyan, Aram H.; Bettencourt, Matthew T.; Bennett, Janine C.; Lifflander, Jonathan; Hollman, David S.; Wilke, Jeremiah J.; Kolla, Hemanth K.; Moore, Christopher H.; Clay, Robert L.

Abstract not provided.

TYPE Conference Poster YEAR 2017

Exploring failure recovery for stencil-based applications at extreme scales

HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing

Gamell, Marc; Teranishi, Keita T.; Heroux, Michael A.; Mayo, Jackson M.; Kolla, Hemanth K.; Chen, Jacqueline H.; Parashar, Manish

Application resilience is a key challenge that must be addressed in order to realize the exascale vision. Previous work has shown that online recovery, even when done in a global manner (i.e., involving all processes), can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. In this paper we suggest going one step further and explore how local recovery can be used for certain classes of applications to reduce the overheads due to failures. Specifically, we study the feasibility of local recovery for stencil-based parallel applications, and we show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution.

TYPE Conference Poster YEAR 2015

Exploring Failure Recovery for Stencil-based Applications at Extreme Scales

Gamell Balmana, Marc G.; Teranishi, Keita T.; Heroux, Michael A.; Mayo, Jackson M.; Kolla, Hemanth K.; Chen, Jacqueline H.; Parashar, Manish P.

Abstract not provided.

TYPE Conference Poster YEAR 2015

Failure Masking and Local Recovery for Stencil-based Applications at Extreme Scales

Gamell, Marc G.; Teranishi, Keita T.; Heroux, Michael A.; Mayo, Jackson M.; Kolla, Hemanth K.; Chen, Jacqueline H.; Parashar, Manish P.

Abstract not provided.

TYPE Conference Poster YEAR 2015

Higher Order Joint Moment Tensor Decomposition

Kolla, Hemanth K.; Konduri, Aditya K.; Rai, Prashant R.; Kolda, Tamara G.; Davis, Warren L.

Abstract not provided.

TYPE Conference Poster YEAR 2018

In-Situ Machine Learning for Intelligent Data Capture in HPC Simulations

Davis, Warren L.; Dunlavy, Daniel D.; Kegelmeyer, William P.; Kolla, Hemanth K.; Konduri, Aditya K.; Shead, Timothy M.; Reed, Kevin R.

Abstract not provided.

TYPE Conference Poster YEAR 2018

In-Situ Machine Learning for Intelligent Data Capture on Exascale Platforms

Davis, Warren L.; Shead, Timothy M.; Kolla, Hemanth K.; Kegelmeyer, William P.; Popoola, Gabriel A.; Reed, Kevin R.

Abstract not provided.

TYPE Presentation YEAR 2019

In-Situ Machine Learning for Intelligent Data Capture on Exascale Platforms

Davis IV, Warren L.; Shead, Timothy M.; Kolla, Hemanth K.; Reed, Kevin R.; Kegelmeyer, William P.; Popoola, Gabriel A.

Abstract not provided.

TYPE Presentation YEAR 2020

Lessons Learned from Porting the MiniAero Application to Charm++

Hollman, David S.; Bennett, Janine C.; Wilke, Jeremiah J.; Kolla, Hemanth K.; Lin, Paul L.; Slattengren, Nicole S.; Teranishi, Keita T.; Franko, Ken F.; Jain, Nikhil J.; Mikida, Eric M.

Abstract not provided.

TYPE Conference Poster YEAR 2015

Local recovery and failure masking for stencil-based applications at extreme scales

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Gamell, Marc; Teranishi, Keita T.; Heroux, Michael A.; Mayo, Jackson M.; Kolla, Hemanth K.; Chen, Jacqueline H.; Parashar, Manish

Application resilience is a key challenge that has to be addressed to realize the exascale vision. Online recovery, even when it involves all processes, can dramatically reduce the overhead of failures as compared to the more traditional approach where the job is terminated and restarted from the last checkpoint. In this paper we explore how local recovery can be used for certain classes of applications to further reduce overheads due to resilience. Specifically, we develop programming support and scalable runtime mechanisms to enable online and transparent local recovery for stencil-based parallel applications on current leadership-class systems. We also show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution. We integrate these mechanisms with the S3D combustion simulation and experimentally demonstrate (using the Titan Cray XK7 system at ORNL) the ability to tolerate high failure rates (i.e., node failures every 5 seconds) with low overhead while sustaining performance, at scales up to 262,144 cores.

TYPE Conference Poster YEAR 2015
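The local-recovery idea described in the stencil-resilience abstracts above can be illustrated with a toy 1D stencil. This is a minimal sketch under our own assumptions, not the mechanism the authors implemented in S3D: each block keeps an in-memory checkpoint every `ckpt_every` steps, every block's edge values are logged at each step (a simple form of message logging), and when one block fails, only that block restores its checkpoint and replays the missed steps using its neighbors' logged halos, while the surviving blocks never roll back. The names `step_block`, `halos`, and `run` are hypothetical.

```python
from typing import List, Optional, Tuple


def step_block(u: List[float], left: float, right: float) -> List[float]:
    """One explicit 3-point stencil update of a block, given its two halo values."""
    p = [left] + u + [right]
    return [0.25 * p[i - 1] + 0.5 * p[i] + 0.25 * p[i + 1] for i in range(1, len(p) - 1)]


def halos(blocks: List[List[float]], b: int) -> Tuple[float, float]:
    """Halo values for block b; the domain boundary is held at zero."""
    left = blocks[b - 1][-1] if b > 0 else 0.0
    right = blocks[b + 1][0] if b < len(blocks) - 1 else 0.0
    return left, right


def run(initial: List[float], num_blocks: int, steps: int, ckpt_every: int,
        fail: Optional[Tuple[int, int]] = None) -> List[float]:
    """Advance the stencil; fail=(block, step) loses that block's state at that step."""
    size = len(initial) // num_blocks  # assumes num_blocks divides len(initial)
    blocks = [list(initial[b * size:(b + 1) * size]) for b in range(num_blocks)]
    ckpt = {b: (0, list(blocks[b])) for b in range(num_blocks)}  # per-block local checkpoint
    log: List[List[Tuple[float, float]]] = []  # log[t][b] = (first, last) cell of block b at step t

    for t in range(steps):
        if fail is not None and t == fail[1]:
            f = fail[0]
            blocks[f] = [float("nan")] * size  # simulate losing the failed block's state
            c, saved = ckpt[f]
            u = list(saved)
            # Local recovery: only block f replays steps c..t-1, using its neighbors'
            # logged edge values; the surviving blocks do not roll back.
            for r in range(c, t):
                left = log[r][f - 1][1] if f > 0 else 0.0
                right = log[r][f + 1][0] if f < num_blocks - 1 else 0.0
                u = step_block(u, left, right)
            blocks[f] = u

        log.append([(blk[0], blk[-1]) for blk in blocks])
        blocks = [step_block(blocks[b], *halos(blocks, b)) for b in range(num_blocks)]

        if (t + 1) % ckpt_every == 0:
            ckpt = {b: (t + 1, list(blocks[b])) for b in range(num_blocks)}

    return [x for blk in blocks for x in blk]


if __name__ == "__main__":
    init = [float(i % 7) for i in range(24)]
    clean = run(init, num_blocks=4, steps=20, ckpt_every=5)
    recovered = run(init, num_blocks=4, steps=20, ckpt_every=5, fail=(2, 13))
    print(clean == recovered)  # prints True
```

Because the replay performs the same floating-point operations in the same order, the recovered run matches the failure-free run exactly. A practical implementation would also discard log entries older than the oldest retained checkpoint to bound memory use.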
