Publications
Publication | Type | Year |
---|---|---|
Understanding Memory Failures on a Petascale Arm System. The 31st International Symposium on High-Performance Parallel and Distributed Computing | Conference Paper | 2022 |
SNL ATDM Software Ecosystem Operating Systems and On-Node Runtime. 2022 Exascale Computing Project Annual Meeting (Virtual) | Display or Poster (non-conference) | 2022 |
Characterizing Failures in HPC Using Benford's Law. The SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP22) | Conference Presentation | 2022 |
Characterizing Per-node Memory Failures Using Benford's Law. FTXS 2021 Workshop on Fault Tolerance for HPC at eXtreme Scale, held in conjunction with SC21 | Abstract | 2021 |
A Benchmark to Understand Communication Performance in Hybrid MPI and GPU Applications. ExaMPI21 Workshop on Exascale MPI | Conference Paper | 2021 |
Characterizing Memory Failures Using Benford's Law. 14th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids | Conference Paper | 2021 |
Characterizing Per-node Memory Failures Using Benford's Law. Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS 2021) | Conference Paper | 2021 |
Evaluating MPI Resource Usage Summary Statistics. Journal of Parallel Computing | Journal Article | 2021 |
Understanding the Effects of DRAM Correctable Error Logging at Scale. IEEE Cluster Conference | Conference Paper | 2021 |
An Initial Examination of the Effect of Container Resource Constraints on Application Perturbation. Workshop on Resource Arbitration for Dynamic Runtimes (RADR) | Conference Presentation | 2021 |
SNL ATDM Software Ecosystem Operating Systems and On-Node Runtime. 2021 Exascale Computing Project Annual Meeting (Virtual) | Display or Poster (non-conference) | 2021 |
Examining the Impact of Approximate Coordination on Checkpoint/Restart. https://ckpt-symposium.lbl.gov/home | Abstract | 2020 |
Evaluating MPI Message Size Summary Statistics. EuroMPI/USA '20 | Conference Proceeding | 2020 |
ALAMO: Autonomous Lightweight Allocation, Management and Optimization. Smoky Mountains Computational Sciences and Engineering Conference | Conference Paper | 2020 |
Evaluating Tradeoffs Between MPI Message Matching Offload Hardware Capacity and Performance. EuroMPI'19 26th European MPI Users' Group Meeting | Conference Paper | 2019 |
Space-Efficient Reed-Solomon Encoding to Detect and Correct Pointer Corruption. International European Conference on Parallel and Distributed Computing | Conference Paper | 2019 |
Using Simulation to Examine the Effect of MPI Message Matching Costs on Application Performance. Parallel Computing: Systems & Applications | Journal Article | 2019 |
Lessons learned from memory errors observed over the lifetime of Cielo. SIAM Conference on Computational Science and Engineering (CSE19) | Conference Paper | 2019 |
Hardware MPI Message Matching: Insights into MPI Matching Behavior to Inform Design. Concurrency and Computation: Practice and Experience | Journal Article | 2019 |
SNL ATDM Software Ecosystem. 2019 Exascale Computing Project Annual Meeting | Display or Poster (non-conference) | 2019 |
Checkpointing Strategies for Shared High-Performance Computing Platforms. International Journal of Networking and Computing | Journal Article | 2018 |
Lessons Learned from Errors Observed over the Lifetime of Cielo. SC18 | Conference Paper | 2018 |
Physics-Informed Machine Learning for DRAM Error Modeling. The 31st IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology | Conference Paper | 2018 |
Using Simulation to Examine the Effect of MPI Message Matching Costs on Application Performance. EuroMPI 2018 | Conference Paper | 2018 |