Publications

Understanding Memory Failures on a Petascale Arm System

The 31st International Symposium on High-Performance Parallel and Distributed Computing

Kurt Brian Ferreira, Scott Larson Nicoll Levy, Joshua David Hemmert, Kevin Pedretti

Conference Paper – 2022

SNL ATDM Software Ecosystem Operating Systems and On-Node Runtime

2022 Exascale Computing Project Annual Meeting (Virtual)

Stephen Lecler Olivier, Ronald B. Brightwell, Matthew Dosanjh, Kurt Brian Ferreira, Scott Larson Nicoll Levy, Kevin Pedretti, Andrew J Younge

Display or Poster (non-conference) – 2022

Characterizing Failures in HPC Using Benford's Law

The SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP22)

Kurt Brian Ferreira, Scott Larson Nicoll Levy

Conference Presentation – 2022

Characterizing Per-node Memory Failures Using Benford's Law

FTXS 2021 Workshop on Fault Tolerance for HPC at eXtreme Scale, held in conjunction with SC21

Kurt Brian Ferreira, Scott Larson Nicoll Levy

Abstract – 2021

A Benchmark to Understand Communication Performance in Hybrid MPI and GPU Applications

ExaMPI21 Workshop on Exascale MPI

Keira Haskins, Patrick Bridges, Kurt Brian Ferreira, Scott Larson Nicoll Levy

Conference Paper – 2021

A Benchmark to Understand Communication Performance in Hybrid MPI and GPU Applications

ExaMPI21 Workshop on Exascale MPI

Keira Haskins, Patrick Bridges, Kurt Brian Ferreira, Scott Larson Nicoll Levy

Conference Paper – 2021

Characterizing Memory Failures Using Benford's Law

14th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids

Kurt Brian Ferreira, Scott Larson Nicoll Levy

Conference Paper – 2021

Characterizing Per-node Memory Failures Using Benford's Law

Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS 2021)

Kurt Brian Ferreira, Scott Larson Nicoll Levy

Conference Paper – 2021

Evaluating MPI Resource Usage Summary Statistics

Journal of Parallel Computing

Kurt Brian Ferreira, Scott Larson Nicoll Levy

https://www.osti.gov/search/identifier:1822241

Journal Article – 2021

Understanding the Effects of DRAM Correctable Error Logging at Scale

IEEE Cluster Conference

Kurt Brian Ferreira, Scott Larson Nicoll Levy, Victor G. Kuhns, Nathan DeBardeleben, Sean Blanchard

Conference Paper – 2021

An Initial Examination of the Effect of Container Resource Constraints on Application Perturbation

Workshop on Resource Arbitration for Dynamic Runtimes (RADR)

Scott Larson Nicoll Levy, Kurt Brian Ferreira

https://www.osti.gov/search/identifier:1869756

Conference Presentation – 2021

SNL ATDM Software Ecosystem Operating Systems and On-Node Runtime

2021 Exascale Computing Project Annual Meeting (Virtual)

Stephen Lecler Olivier, Ronald B. Brightwell, Kurt Brian Ferreira, Ryan Eric Grant, Scott Larson Nicoll Levy, Kevin Pedretti, Andrew J Younge

https://www.osti.gov/search/identifier:1861479

Display or Poster (non-conference) – 2021

Examining the Impact of Approximate Coordination on Checkpoint/Restart

https://ckpt-symposium.lbl.gov/home

Kurt Brian Ferreira, Scott Larson Nicoll Levy

Abstract – 2020

Evaluating MPI Message Size Summary Statistics

EuroMPI/USA '20

Scott Larson Nicoll Levy, Kurt Brian Ferreira

https://www.osti.gov/search/identifier:1825984

Conference Proceeding – 2020

ALAMO: Autonomous Lightweight Allocation, Management and Optimization

Smoky Mountains Computational Sciences and Engineering Conference

Ronald B. Brightwell, Kurt Brian Ferreira, Ryan Eric Grant, Scott Larson Nicoll Levy, Gerald Fredrick Lofstead, Stephen Lecler Olivier, Kevin Pedretti, Andrew J Younge, Ann C. Gentile, Bradley Keith Brandt

https://www.osti.gov/search/identifier:1818044

Conference Paper – 2020

Evaluating Tradeoffs Between MPI Message Matching Offload Hardware Capacity and Performance

EuroMPI'19 26th European MPI Users' Group Meeting

Scott Larson Nicoll Levy, Kurt Brian Ferreira

https://www.osti.gov/search/identifier:1641378

Conference Paper – 2019

Space-Efficient Reed-Solomon Encoding to Detect and Correct Pointer Corruption

International European Conference on Parallel and Distributed Computing

Scott Larson Nicoll Levy, Kurt Brian Ferreira

https://www.osti.gov/search/identifier:1641289

Conference Paper – 2019

Using Simulation to Examine the Effect of MPI Message Matching Costs on Application Performance

Parallel Computing: Systems & Applications

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Whit Schonbein, Ryan Eric Grant, Matthew Dosanjh

https://www.osti.gov/search/identifier:1502976

Journal Article – 2019

Lessons learned from memory errors observed over the lifetime of Cielo

SIAM Conference on Computational Science and Engineering (CSE19)

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Taniya Siddiqua, Nathan DeBardeleben, Vilas Sridharan, Elisabeth Baseman

https://www.osti.gov/search/identifier:1639464

Conference Paper – 2019

Hardware MPI Message Matching: Insights into MPI Matching Behavior to Inform Design

Concurrency and Computation: Practice and Experience

Kurt Brian Ferreira, Ryan Eric Grant, Michael J. Levenhagen, Scott Larson Nicoll Levy, Taylor Groves

https://www.osti.gov/search/identifier:1501630

Journal Article – 2019

SNL ATDM Software Ecosystem

2019 Exascale Computing Project Annual Meeting

Stephen Lecler Olivier, Ronald B. Brightwell, Kevin Pedretti, Andrew J Younge, Noah Evans, Scott Larson Nicoll Levy, Kurt Brian Ferreira, Ryan Eric Grant

https://www.osti.gov/search/identifier:1583026

Display or Poster (non-conference) – 2019

Checkpointing Strategies for Shared High-Performance Computing Platforms

International Journal of Networking and Computing

Kurt Brian Ferreira

https://www.osti.gov/search/identifier:1492861

Journal Article – 2018

Lessons Learned from Errors Observed over the Lifetime of Cielo

SC18

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Nathan DeBardeleben, Taniya Siddiqua, Vilas Sridharan, Elisabeth Baseman

https://www.osti.gov/search/identifier:1582542

Conference Paper – 2018

Physics-Informed Machine Learning for DRAM Error Modeling

The 31st IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology

Elisabeth Baseman (LANL), Nathan DeBardeleben (LANL), Sean Blanchard (LANL), Juston Moore (LANL), Olena Tkachenko (NM Consortium), Kurt Brian Ferreira, Taniya Siddiqua (AMD), Vilas Sridharan (AMD)

https://www.osti.gov/search/identifier:1806747

Conference Paper – 2018

Using Simulation to Examine the Effect of MPI Message Matching Costs on Application Performance

EuroMPI 2018

Scott Larson Nicoll Levy, Kurt Brian Ferreira

https://www.osti.gov/search/identifier:1569677

Conference Paper – 2018