Kurt Brian Ferreira

Scalable System Software

kbferre@sandia.gov

(505) 844-0433

Sandia National Laboratories, New Mexico
P.O. Box 5800
Albuquerque, NM 87185-1319

Biography

Principal Member of Technical Staff 
My area of expertise is system software and resilience/fault-tolerance methods for large-scale, massively parallel, distributed-memory, scientific computing systems. I have designed and developed a number of innovative, high-performance, and resilient implementations of low-level system software for several HPC platforms including the Cray Red Storm (XT3) machine at Sandia National Laboratories. My research interests include the design and construction of operating systems for massively parallel processing machines and innovative application and system-level fault-tolerance mechanisms for HPC.

Education

I received a BS in mathematics and a BS in computer science from New Mexico Tech in 2000, followed by an MS (2008) and a PhD (2011) in computer science from the University of New Mexico.

Publications

Kurt Brian Ferreira, Scott Larson Nicoll Levy, Joshua David Hemmert, Kevin Pedretti, (2022). Understanding Memory Failures on a Petascale Arm System The 31st International Symposium on High-Performance Parallel and Distributed Computing Document ID: 1527788

Stephen Lecler Olivier, Ronald B. Brightwell, Matthew Dosanjh, Kurt Brian Ferreira, Scott Larson Nicoll Levy, Kevin Pedretti, Andrew J Younge, (2022). SNL ATDM Software Ecosystem Operating Systems and On-Node Runtime 2022 Exascale Computing Project Annual Meeting (Virtual) Document ID: 1505231

Kurt Brian Ferreira, Scott Larson Nicoll Levy, (2022). Characterizing Failures in HPC Using Benford’s Law The SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP22) Document ID: 1471261

Kurt Brian Ferreira, Scott Larson Nicoll Levy, (2021). Characterizing Per-node Memory Failures Using Benford’s Law FTXS 2021 Workshop on Fault Tolerance for HPC at eXtreme Scale held in conjunction with SC21 Document ID: 1381184

Keira Haskins, Patrick Bridges, Kurt Brian Ferreira, Scott Larson Nicoll Levy, (2021). A Benchmark to Understand Communication Performance in Hybrid MPI and GPU Applications ExaMPI21 Workshop on Exascale MPI Document ID: 1370401

Keira Haskins, Patrick Bridges, Kurt Brian Ferreira, Scott Larson Nicoll Levy, (2021). A Benchmark to Understand Communication Performance in Hybrid MPI and GPU Applications ExaMPI21 Workshop on Exascale MPI Document ID: 1380992

Kurt Brian Ferreira, Scott Larson Nicoll Levy, (2021). Characterizing Memory Failures Using Benford’s Law 14th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids Document ID: 1357464

Kurt Brian Ferreira, Scott Larson Nicoll Levy, (2021). Characterizing Per-node Memory Failures Using Benford’s Law Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS 2021) Document ID: 1356401

Kurt Brian Ferreira, Scott Larson Nicoll Levy, (2021). Evaluating MPI Resource Usage Summary Statistics Journal of Parallel Computing https://www.osti.gov/search/identifier:1822241 Document ID: 1344897

Kurt Brian Ferreira, Scott Larson Nicoll Levy, Victor G. Kuhns, Nathan DeBardeleben, Sean Blanchard, (2021). Understanding the Effects of DRAM Correctable Error Logging at Scale IEEE Cluster Conference Document ID: 1343103

Scott Larson Nicoll Levy, Kurt Brian Ferreira, (2021). An Initial Examination of the Effect of Container Resource Constraints on Application Perturbation Workshop on Resource Arbitration for Dynamic Runtimes (RADR) https://www.osti.gov/search/identifier:1869756 Document ID: 1307404

Stephen Lecler Olivier, Ronald B. Brightwell, Kurt Brian Ferreira, Ryan Eric Grant, Scott Larson Nicoll Levy, Kevin Pedretti, Andrew J Younge, (2021). SNL ATDM Software Ecosystem Operating Systems and On-Node Runtime 2021 Exascale Computing Project Annual Meeting (Virtual) https://www.osti.gov/search/identifier:1861479 Document ID: 1293055

Kurt Brian Ferreira, Scott Larson Nicoll Levy, (2020). Examining the Impact of Approximate Coordination on Checkpoint/Restart https://ckpt-symposium.lbl.gov/home Document ID: 1254795

Scott Larson Nicoll Levy, Kurt Brian Ferreira, (2020). Evaluating MPI Message Size Summary Statistics EuroMPI/USA ’20 https://www.osti.gov/search/identifier:1825984 Document ID: 1209370

Ronald B. Brightwell, Kurt Brian Ferreira, Ryan Eric Grant, Scott Larson Nicoll Levy, Gerald Fredrick Lofstead, Stephen Lecler Olivier, Kevin Pedretti, Andrew J Younge, Ann C. Gentile, Bradley Keith Brandt, (2020). ALAMO: Autonomous Lightweight Allocation, Management and Optimization Smoky Mountains Computational Sciences and Engineering Conference https://www.osti.gov/search/identifier:1818044 Document ID: 1195366

Scott Larson Nicoll Levy, Kurt Brian Ferreira, (2019). Evaluating Tradeoffs Between MPI Message Matching Offload Hardware Capacity and Performance EuroMPI’19 26th European MPI Users’ Group Meeting https://www.osti.gov/search/identifier:1641378 Document ID: 996487

Scott Larson Nicoll Levy, Kurt Brian Ferreira, (2019). Space-Efficient Reed-Solomon Encoding to Detect and Correct Pointer Corruption International European Conference on Parallel and Distributed Computing https://www.osti.gov/search/identifier:1641289 Document ID: 985494

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Whit Schonbein, Ryan Eric Grant, Matthew Dosanjh, (2019). Using Simulation to Examine the Effect of MPI Message Matching Costs on Application Performance Parallel Computing: Systems & Applications https://www.osti.gov/search/identifier:1502976 Document ID: 937350

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Taniya Siddiqua, Nathan DeBardeleben, Vilas Sridharan, Elisabeth Baseman, (2019). Lessons learned from memory errors observed over the lifetime of Cielo SIAM Conference on Computational Science and Engineering (CSE19) https://www.osti.gov/search/identifier:1639464 Document ID: 935561

Kurt Brian Ferreira, Ryan Eric Grant, Michael J. Levenhagen, Scott Larson Nicoll Levy, Taylor Groves, (2019). Hardware MPI Message Matching: Insights into MPI Matching Behavior to Inform Design Concurrency and Computation: Practice and Experience https://www.osti.gov/search/identifier:1501630 Document ID: 913436

Stephen Lecler Olivier, Ronald B. Brightwell, Kevin Pedretti, Andrew J Younge, Noah Evans, Scott Larson Nicoll Levy, Kurt Brian Ferreira, Ryan Eric Grant, (2019). SNL ATDM Software Ecosystem 2019 Exascale Computing Project Annual Meeting https://www.osti.gov/search/identifier:1583026 Document ID: 902074

Kurt Brian Ferreira, (2018). Checkpointing Strategies for Shared High-Performance Computing Platforms International Journal of Networking and Computing https://www.osti.gov/search/identifier:1492861 Document ID: 889138

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Nathan DeBardeleben, Taniya Siddiqua, Vilas Sridharan, Elisabeth Baseman, (2018). Lessons Learned from Errors Observed over the Lifetime of Cielo SC18 https://www.osti.gov/search/identifier:1582542 Document ID: 853852

Elisabeth Baseman (LANL), Nathan DeBardeleben (LANL), Sean Blanchard (LANL), Juston Moore (LANL), Olena Tkachenko (NM Consortium), Kurt Brian Ferreira, Taniya Siddiqua (AMD), Vilas Sridharan (AMD), (2018). Physics-Informed Machine Learning for DRAM Error Modeling The 31st IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology https://www.osti.gov/search/identifier:1806747 Document ID: 841601

Scott Larson Nicoll Levy, Kurt Brian Ferreira, (2018). Using Simulation to Examine the Effect of MPI Message Matching Costs on Application Performance EuroMPI 2018 https://www.osti.gov/search/identifier:1569677 Document ID: 830451

Scott Larson Nicoll Levy, Kevin Pedretti, Kurt Brian Ferreira, (2018). Open Science on Trinity’s Knights Landing Partition: An Analysis of User Job Data The 14th International Workshop on Scheduling and Resource Management for Parallel and Distributed Systems (SRMPDS 2018) https://www.osti.gov/search/identifier:1529450 Document ID: 809168

Kurt Brian Ferreira, Scott Larson Nicoll Levy, Kevin Pedretti, Ryan Eric Grant, (2018). Characterizing MPI Matching via Trace-based Simulation Parallel Computing https://www.osti.gov/search/identifier:1457519 Document ID: 809042

Kurt Brian Ferreira, Scott Larson Nicoll Levy, Kevin Pedretti, Ryan Eric Grant, (2018). Characterizing MPI Matching via Trace-based Simulation Parallel Computing https://www.osti.gov/search/identifier:1444084 Document ID: 807378

Elisabeth (LANL) Baseman, Nathan (LANL) DeBardeleben, Sean (LANL) Blanchard, Juston (LANL) Moore, Olena (NM Consortium) Tkachenko, Vilas (AMD) Sridharan, Kurt Brian Ferreira, Taniya (AMD) Siddiqua, (2018). Physics-Informed Machine Learning for DRAM Error Modeling The 31st IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems https://www.osti.gov/search/identifier:1515630 Document ID: 797312

Thomas (UTK) Herault, Yves (UTK) Robert, Aurelien (UTK) Bouteiller, Dorian (Emory) Arnold, Kurt Brian Ferreira, George (UTK) Bosilca, Jack (UTK) Dongarra, (2018). Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms 20th Workshop on Advances in Parallel and Distributed Computational Models @ IPDPS 2018 https://www.osti.gov/search/identifier:1572261 Document ID: 784724

Kurt Brian Ferreira, Ryan Eric Grant, Michael J. Levenhagen, Scott Larson Nicoll Levy, Taylor Groves, (2017). Hardware MPI Message Matching: Insights into MPI Matching Behavior to Inform Design ExaMPI2017 – Workshop on Exascale MPI 2017 https://www.osti.gov/search/identifier:1511803 Document ID: 726260

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Patrick Widener, (2017). The Unexpected Virtue of Almost: Exploiting MPI Collective Operations to Approximately Coordinate Checkpoints ExaMPI2017 – Workshop on Exascale MPI 2017 https://www.osti.gov/search/identifier:1482473 Document ID: 726227

Thomas Herault (UTK), Yves Robert (UTK), Aurelien Bouteiller (UTK), Dorian Arnold (Emory), Kurt Brian Ferreira, George Bosilca (UTK), Jack Dongarra (UTK), (2017). Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms IEEE International Parallel & Distributed Processing Symposium (IPDPS) 2018 https://www.osti.gov/search/identifier:1480217 Document ID: 724839

Rebecca Kreitinger, Scott Larson Nicoll Levy, Kurt Brian Ferreira, Patrick Widener, (2017). Spacehog: Evaluating the costs of dedicating resources to in situ analysis SC17 The International Conference for High Performance Computing, Networking, Storage and Analysis https://www.osti.gov/search/identifier:1478158 Document ID: 703829

Rebecca Kreitinger, Scott Larson Nicoll Levy, Kurt Brian Ferreira, Patrick Widener, (2017). Spacehog: Evaluating the costs of dedicating resources to in situ analysis SC17 The International Conference for High Performance Computing, Networking, Storage and Analysis https://www.osti.gov/search/identifier:1573776 Document ID: 703831

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Patrick G Bridges, (2017). Evaluating the Viability of Using Compression to Mitigate Silent Corruption of Read-Mostly Application Data 2017 IEEE International Conference on Cluster Computing (CLUSTER) https://www.osti.gov/search/identifier:1463961 Document ID: 659342

Kurt Brian Ferreira, Scott Larson Nicoll Levy, Kevin Pedretti, Ryan Eric Grant, (2017). Characterizing MPI Matching via Trace-based Simulation EuroMPI/USA 2017 https://www.osti.gov/search/identifier:1462518 Document ID: 638253

Taniya (AMD) Siddiqua, Vilas (AMD) Sridharan, Steven E. (AMD) Raasch, Nathan (LANL) DeBardeleben, Kurt Brian Ferreira, Scott Larson Nicoll Levy, Elisabeth (LANL) Baseman, Qiang (LANL) Guan, (2017). Lifetime Memory Reliability Data from the Field IEEE Int. Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems https://www.osti.gov/search/identifier:1506882 Document ID: 637678

Patrick Widener, Kurt Brian Ferreira, Scott Larson Nicoll Levy, (2017). It’s not the heat, it’s the humidity: scheduling resilience activity at scale 23rd International European Conference On Parallel And Distributed Computing https://www.osti.gov/search/identifier:1367189 Document ID: 624407

Marc (Rutgers) Gammel, Keita Teranishi, Samuel Knight, Gregory D. Sjaardema, Hemanth Kolla, Jason Wilke, Nicole Slattengren, Kurt Brian Ferreira, Janine Camille Bennett, Nikhil (UIUC) Jain, Laxmikant (UIUC) Kale, (2017). Evaluating the Charm++ Runtimes Ability to Cope with Performance Heterogeneity International Workshop on Runtime and Operating Systems for Supercomputers held in conjunction with HPDC 2017 https://www.osti.gov/search/identifier:1456562 Document ID: 612825

Elisabeth (LANL) Baseman, Nathan (LANL) DeBardeleben, Kurt Brian Ferreira, Vilas (AMD) Sridharan, Taniya (AMD) Siddiqua, Olena (LANL) Tkachenko, (2017). Automating DRAM Fault Mitigation By Learning From Experience DSN 2017 The 47th IEEE/IFIP International Conference on Dependable Systems and Networks https://www.osti.gov/search/identifier:1456560 Document ID: 610540

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Patrick G Bridges, (2016). Improving Application Resilience to Memory Errors with Lightweight Compression The International Conference for High Performance Computing, Networking, Storage and Analysis https://www.osti.gov/search/identifier:1410251 Document ID: 554663

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Patrick Widener, Patrick G Bridges, Oscar H. Mondragon, (2016). How I Learned to Stop Worrying and Love In Situ Analytics: Leveraging Latent Synchronization in MPI Collective Algorithms The Message Passing Interface (MPI) Users and Developers Conference https://www.osti.gov/search/identifier:1394099 Document ID: 530057

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Patrick G Bridges, (2016). Improving Application Resilience to Memory Errors with Lightweight Compression The International Conference for High Performance Computing, Networking, Storage and Analysis (SC16) https://www.osti.gov/search/identifier:1372148 Document ID: 476343

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Patrick Widener, Patrick G Bridges, Oscar H. Mondragon, (2016). Understanding Performance Interference in Next-Generation HPC Systems The International Conference for High Performance Computing, Networking, Storage and Analysis https://www.osti.gov/search/identifier:1372149 Document ID: 476344

Scott Larson Nicoll Levy, Kurt Brian Ferreira, (2016). An Examination of the Impact of Failure Distribution on Coordinated Checkpoint/Restart Fault Tolerance for HPC at Extreme Scale (FTXS) Workshop https://www.osti.gov/search/identifier:1368866 Document ID: 463972

Patrick Widener, Kurt Brian Ferreira, Scott Larson Nicoll Levy, (2016). Horseshoes and Hand Grenades: The Case for Approximate Coordination in Local Checkpointing Protocols 9th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids @ EuroPar 2016 https://www.osti.gov/search/identifier:1368839 Document ID: 463931

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Patrick Widener, Patrick (UNM) Bridges, Oscar (UNM) Mondragon, (2016). How I Learned to Stop Worrying and Love In Situ Analytics: Leveraging latent synchronization in MPI collective algorithms EuroMPI 2016: The Message Passing Interface (MPI) Users and Developers Conference https://www.osti.gov/search/identifier:1364728 Document ID: 453704

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Patrick Widener, Patrick G Bridges, Oscar Mondragon, (2016). Using Simulation to Evaluate the Performance of Resilience Strategies at Scale Meeting to discuss fault tolerance research with Los Alamos Nat’l Lab staff https://www.osti.gov/search/identifier:1428024 Document ID: 443447

Galen Shipman, Patrick McCormick, Kevin Pedretti, Stephen Lecler Olivier, Kurt Brian Ferreira, Ramanan Sankaran, Sean Treichler, Alex Aiken, Michael Bauer, (2016). Analysis of Application Sensitivity to System Performance Variability in a Dynamic Task Based Runtime Runtime Systems for Extreme Scale Programming Models and Architectures (RESPA) https://www.osti.gov/search/identifier:1365384 Document ID: 442655

Elisabeth Baseman, Nathan DeBardeleben, Kurt Brian Ferreira, Scott Larson Nicoll Levy, Steven Raasch, Vilas Sridharan, Taniya Siddiqua, Qiang Guan, (2016). Improving DRAM Fault Characterization Through Machine Learning IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) https://www.osti.gov/search/identifier:1365234 Document ID: 432113

David (Google) Fiala, Frank (NCSU) Mueller, Kurt Brian Ferreira, Christian (ORNL) Engelmann, (2016). Mini-Ckpts: Surviving OS Failures in Persistent Memory ICS 2016 https://www.osti.gov/search/identifier:1530502 Document ID: 430942

Elisabeth Baseman (LANL), Nathan DeBardeleben (LANL), Kurt Brian Ferreira, Scott Levy (UNM), Steven Raasch (AMD), Vilas Sridharan (AMD), Taniya Siddiqua (AMD), Qiang Guan (LANL), (2016). A Machine Learning Approach for Automatic Characterization of Memory Faults CoDA 2016 Conference on Data Analysis https://www.osti.gov/search/identifier:1346523 Document ID: 408989

Kurt Brian Ferreira, (2016). An Examination of the Impact of the Failure Distribution on Coordinated Checkpoint/Restart ACM Symposium on High-Performance Parallel and Distributed Computing https://www.osti.gov/search/identifier:1345094 Document ID: 408677

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Patrick G Bridges, (2016). Similarity Engine: Using Content Similarity to Improve Memory Resilience ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC) https://www.osti.gov/search/identifier:1239385 Document ID: 387058

Oscar H. Mondragon, Patrick G Bridges, Kurt Brian Ferreira, Patrick Widener, Scott Larson Nicoll Levy, (2015). Scheduling In-Situ Analytics in Next-generation Applications 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing https://www.osti.gov/search/identifier:1333466 Document ID: 354815

Kevin Pedretti, Stephen Lecler Olivier, Kurt Brian Ferreira, Galen Shipman (LANL), Wei Shu (UNM), (2015). Early Experiences with Node-Level Power Capping on the Cray XC40 Platform Energy Efficient Supercomputing Workshop https://www.osti.gov/search/identifier:1333245 Document ID: 354745

Kevin Pedretti, Stephen Lecler Olivier, Kurt Brian Ferreira, Galen Shipman, Wei Shu, (2015). Early Experiences with Node-Level Power Capping on the Cray XC40 Platform Workshop on Energy Efficient Supercomputing (E2SC) https://www.osti.gov/search/identifier:1338038 Document ID: 342895

Patrick Widener, Kurt Brian Ferreira, Scott Larson Nicoll Levy, Nathan D. Fabian, (2015). Canaries in a Coal Mine: Using Application-level Checkpoints to Detect Memory Failures 8th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids in connection with Euro-Par 2015 https://www.osti.gov/search/identifier:1256569 Document ID: 286581

Rolf (Intel) Riesen, Barney (ORNL) Maccabe, Balazs (RIKEN) Gerofi, David (INTEL) Lombard, John (PITT) Lange, Kevin Pedretti, Kurt Brian Ferreira, Mike (LANL) Lang, Pardo (INTEL) Keppel, Robert (INTEL) Wisniewski, Ronald B. Brightwell, Todd (INTEL) Inglett, Yoonho (IBM) Park, Yutaka (RIKEN) Ishikawa, (2015). Panel: What is a Lightweight Kernel? International Workshop on Runtime and Operating Systems for Supercomputers ROSS 2015 Held in conjunction with HPDC 2015 https://www.osti.gov/search/identifier:1258200 Document ID: 275870

Kevin Pedretti, Stephen Lecler Olivier, Kurt Brian Ferreira, Galen (LANL) Shipman, Wei (UNM) Shu, (2015). Exploring MPI Application Performance Under Power Capping on the Cray XC40 Platform EuroMPI 2015 https://www.osti.gov/search/identifier:1258232 Document ID: 275657

Kurt Brian Ferreira, (2015). Revisiting Checkpointing for Exascale-Class Systems The Salishan Conference on High-Speed Computing https://www.osti.gov/search/identifier:1251139 Document ID: 264876

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Patrick G Bridges, (2015). Similarity Engine: Using Content Similarity to Improve Memory Resilience International Conference for High Performance Computing, Networking, Storage and Analysis https://www.osti.gov/search/identifier:1530987 Document ID: 264517

Galen (LANL) Shipman, Patrick (LANL) McCormick, Kevin Pedretti, Stephen Lecler Olivier, Kurt Brian Ferreira, Jacqueline H. Chen, Ramanan (ORNL) Sankaran, Sean (Stanford) Treichler, Alex (Stanford) Aiken, Michael (NVIDIA) Bauer, (2015). Dynamic Task Scheduling to Mitigate System Performance Variability The 27th International Conference for High Performance Computing, Networking, Storage and Analysis https://www.osti.gov/search/identifier:1249032 Document ID: 264521

Kurt Brian Ferreira, Alireza (UNM) Goudarzi, Dorian (UNM) Arnold, Guy (Purdue) Feldman, (2015). A Principled Approach to HPC Event Monitoring Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop, held at ACM Symposium on High Performance Distributed Computing (HPDC) https://www.osti.gov/search/identifier:1239260 Document ID: 221665

Patrick Widener, Scott Larson Nicoll Levy, Kurt Brian Ferreira, Torsten Hoefler, (2014). On noise and the performance benefit of nonblocking collectives International Journal of High Performance Computing Applications https://www.osti.gov/search/identifier:1257977 Document ID: 208093

Kurt Brian Ferreira, Scott Larson Nicoll Levy, Patrick Widener, Dorian Arnold, (2014). Using Machine Learning to Optimize Uncoordinated Checkpointing Performance ASCR Machine Learning Workshop https://www.osti.gov/search/identifier:1319751 Document ID: 187255

Kurt Brian Ferreira, (2014). Fault Survivability of Lightweight Operating Systems for exascale https://www.osti.gov/search/identifier:1459775 Document ID: 155587

Vilas Sridharan, Nathan DeBardeleben, Sean Blanchard, Kurt Brian Ferreira, Sudhanva Gurumurthi, John Shalf, (2014). Memory Errors in Modern Systems: The Good, The Bad, and The Ugly 20th International Conference on Architectural Support for Programming Languages and Operating Systems https://www.osti.gov/search/identifier:1497665 Document ID: 143820

Kurt Brian Ferreira, Dorian Arnold, Dewan Ibtesham, (2014). A Study of Checkpoint Compression for High-Performance Computing Systems IJHPCA Special issue Resilience Techniques for High-Performance Computing https://www.osti.gov/search/identifier:1426906 Document ID: 101765

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Patrick Widener, Bryan Topp, Dorian Arnold, Torsten Hoefler, (2014). Using Simulation to Evaluate the Performance of Resilience Strategies and Process Failures https://www.osti.gov/search/identifier:1204092 Document ID: 5331981

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Patrick Bridges, (2013). Predicting the Impact of Failure Avoidance on Checkpoint/Restart in Extreme-Scale Systems Third International Workshop on Extreme Scale Parallel Architectures and Systems https://www.osti.gov/search/identifier:1118703 Document ID: 5330212

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Aidan P. Thompson, Christian Robert Trott, Patrick Bridges, (2013). A Study of the Viability of Exploiting Memory Content Similarity to Improve Resilience to Memory Errors International Journal of High Performance Computing Applications https://www.osti.gov/search/identifier:1111407 Document ID: 5328303

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Patrick Widener, Bryan Topp, Dorian Arnold, Torsten Hoefler, (2013). Using Simulation to Evaluate the Performance of Resilience Strategies at Scale 4th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance https://www.osti.gov/search/identifier:1110585 Document ID: 5327863

Ryan Eric Grant, Kurt Brian Ferreira, Bryan Mills, Rolf Riesen, (2013). Evaluating Energy Savings for Checkpoint/Restart 4th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance https://www.osti.gov/search/identifier:1109412 Document ID: 5327562

Bryan Mills, Kurt Brian Ferreira, Ryan Eric Grant, Taieb Znati, Rami Melhem, (2013). Energy Consumption of Resilience Mechanisms in Large Scale Systems 22nd Euromicro International Conference on Parallel, distributed, and network-based Processing https://www.osti.gov/search/identifier:1143916 Document ID: 5327045

Patrick Widener, Kurt Brian Ferreira, Scott Larson Nicoll Levy, Ronald B. Brightwell, Patrick G. Bridges, Dorian Arnold, (2013). Asking the right questions: benchmarking fault-tolerant extreme-scale systems Workshop on Resiliency in High-Performance Computing https://www.osti.gov/search/identifier:1083655 Document ID: 5323574

Steven J. Lockwood, John P. Vandyke, Kurt Brian Ferreira, James H. Laros, James Tomkins, (2013). Investigating An API for Resilient Exascale Computing https://www.osti.gov/search/identifier:1096503 Document ID: 5322235

Scott Larson Nicoll Levy, Kurt Brian Ferreira, Matthew G. F. Dosanjh, Patrick G. Bridges, (2013). Using Unreliable Virtual Hardware to Inject Errors in Extreme-Scale Systems Fault-Tolerance for HPC at Extreme Scale (FTXS) https://www.osti.gov/search/identifier:1063319 Document ID: 5319019

Bryan Embry Topp, Kurt Brian Ferreira, Rida A. Bazzi, Dorian Arnold, (2012). Addressing Message-log Scalability for Extreme-scale Systems The International Conference for High-Performance Networking, Storage, and Analysis https://www.osti.gov/search/identifier:1060449 Document ID: 5311283

Deepesh K. Kholwadwala, Kurt Brian Ferreira, Michael A. Heroux, Ronald B. Brightwell, Patrick Bridges, (2012). Cooperative Application/OS DRAM Fault Recovery https://www.osti.gov/search/identifier:1044954 Document ID: 5308123

David R. Bronowski, Keren Bergman, David Bunde, Elliot Cooper-Balis, Kurt Brian Ferreira, Karl Scott Hemmert, Brian Barrett, Cassandra Versaggi, Robert Hendry, Bruce Jacob, Hyesoon Kim, Vitus J. Leung, Michael J. Levenhagen, Mitchelle Rasquinha, Rolf Riesen, Paul Rosenfeld, Maria del Carmen Ruiz Varela, Sudhakar Yalamanchili, (2012). Improvements to the Structural Simulation Toolkit SIMUTools 2012 https://www.osti.gov/search/identifier:1117760 Document ID: 5305668

Steven J. Lockwood, Kurt Brian Ferreira, David G. Robinson, Dorian Arnold, Patrick Bridges, Rolf Riesen, (2012). Does Partial Replication Pay Off? Fault-Tolerance for HPC at Extreme Scale (FTXS 2012) https://www.osti.gov/search/identifier:1068346 Document ID: 5305712

David R. Bronowski, Keren Bergman, David Bunde, Elliot Cooper-Balis, Kurt Brian Ferreira, Karl Scott Hemmert, Brian Barrett, Cassandra Versaggi, Robert Hendry, Bruce Jacob, Hyesoon Kim, Vitus J. Leung, Michael J. Levenhagen, Mitchelle Rasquinha, Rolf Riesen, Paul Rosenfeld, Maria del Carmen Ruiz Varela, Sudhakar Yalamanchili, (2012). Improvements to the Structural Simulation Toolkit SIMUTools 2012 https://www.osti.gov/search/identifier:1068327 Document ID: 5305276

Brian Barrett, Richard Frederick Barrett, James M. Brandt, Ronald B. Brightwell, Matthew Leon Curry, Nathan D. Fabian, Kurt Brian Ferreira, Ann C. Gentile, Karl Scott Hemmert, Suzanne M. Kelly, Ruth Ann Klundt, James H. Laros, Vitus J. Leung, Michael J. Levenhagen, Gerald Fredrick Lofstead, Kenneth D. Moreland, Ron A. Oldfield, Kevin Pedretti, Arun F. Rodrigues, David Thompson, Harry Lee Ward, John P. Vandyke, Courtenay T. Vaughan, Kyle Bruce Wheeler, Tom Tucker, (2012). Report of Experiments and Evidence for ASC L2 Milestone 4467 – Demonstration of a Legacy Applications Path to Exascale https://www.osti.gov/search/identifier:1039013 Document ID: 5305233

Brian Barrett, Richard Frederick Barrett, James M. Brandt, Ronald B. Brightwell, Matthew Leon Curry, Nathan D. Fabian, Kurt Brian Ferreira, Ann C. Gentile, Karl Scott Hemmert, Suzanne M. Kelly, Ruth Ann Klundt, James H. Laros, Vitus J. Leung, Michael J. Levenhagen, Gerald Fredrick Lofstead, Kenneth D. Moreland, Ron A. Oldfield, Kevin Pedretti, Arun F. Rodrigues, David Thompson, Harry Lee Ward, John P. Vandyke, Courtenay T. Vaughan, Kyle Bruce Wheeler, Tom Tucker, (2012). Demonstration of a Legacy Applications Path to Exascale – ASC L2 Milestone 4467 Presentation to L2 Milestone Review Panel https://www.osti.gov/search/identifier:1688616 Document ID: 5305236

Kishor Kharbas, David Fiala, Frank Mueller, Christian Engelmann, Kurt Brian Ferreira, (2012). Combining Partial Redundancy and Checkpointing for HPC The 32nd International Conference on Distributed Computing Systems https://www.osti.gov/search/identifier:1069061 Document ID: 5304511

Michael A Butler, Kurt Brian Ferreira, (2011). An Extensible Operating System Design for Large-Scale Parallel Machines HotOS XII https://www.osti.gov/search/identifier:1141294 Document ID: 5268671

Deepesh K. Kholwadwala, Michael A. Heroux, Kurt Brian Ferreira, Patrick G. Bridges, (2011). Fault-tolerant iterative methods via selective reliability Supercomputing 2011 https://www.osti.gov/search/identifier:1111619 Document ID: 5301717

Steven J. Lockwood, David G. Robinson, Kurt Brian Ferreira, Rolf Riesen, (2011). A Model-Based Case for Redundant Computation https://www.osti.gov/search/identifier:1113872 Document ID: 5298365

Deepesh K. Kholwadwala, Kurt Brian Ferreira, Michael A. Heroux, Ronald B. Brightwell, Patrick G. Bridges, Philip Soltero, (2011). Cooperative Application/OS DRAM Fault Recovery 4th Workshop on Resiliency in High Performance Computing @ EuroPar https://www.osti.gov/search/identifier:1107189 Document ID: 5296256

Ron A. Oldfield, Kurt Brian Ferreira, Harry Lee Ward, Matthew Leon Curry, (2011). Addressing Scalable I/O Challenges for Exascale 27th IEEE Symposium on Massive Storage Systems and Technologies https://www.osti.gov/search/identifier:1109270 Document ID: 5295121

David Robinson, Kurt Brian Ferreira, Rolf E. Riesen, (2010). Reliability Modeling of Redundant Computation for HPC Systems Dependable Systems and Networks https://www.osti.gov/search/identifier:1035340 Document ID: 5289905

Kevin Pedretti, Michael J. Levenhagen, Kurt Brian Ferreira, Ronald B. Brightwell, Suzanne M. Kelly, Patrick G Bridges, Trammell Hudson, (2010). LDRD Final Report: A Lightweight Operating System for Multi-core Capability Class Supercomputers https://www.osti.gov/search/identifier:1007323 Document ID: 5286932

Michael A Butler, Kurt Brian Ferreira, (2010). An Extensible Operating System Design for Large-Scale Parallel Machines https://www.osti.gov/search/identifier:984155 Document ID: 5271790

Ronald B. Brightwell, Kurt Brian Ferreira, (2010). Transparent Redundant Computing with MPI EuroMPI 2010 https://www.osti.gov/search/identifier:1011627 Document ID: 5281989

Michael A Butler, Kurt Brian Ferreira, Edgar Leo, Arthur B. Maccabe, (2010). Cache Injection for Parallel Applications Supercomputing 2010 https://www.osti.gov/search/identifier:1000997 Document ID: 5281768

Ron A. Oldfield, Ronald B. Brightwell, Kevin Pedretti, Rolf E. Riesen, Kurt Brian Ferreira, Suzanne M. Kelly, Todd H. Kordenbrock, James H. Laros, (2010). System Software Research for Extreme-Scale Computing Leadership Computing Facility Seminar https://www.osti.gov/search/identifier:1673292 Document ID: 5280913

Michael A Butler, Ronald B. Brightwell, Kurt Brian Ferreira, Patrick G. Bridges, Trammell Hudson, Arthur B. Maccabe, Patrick M. Widener, (2008). Designing and Implementing Lightweight Kernels for Capability Computing Concurrency and Computation: Practice and Experience https://www.osti.gov/search/identifier:1141189 Document ID: 5260331

Ronald B. Brightwell, Kevin Pedretti, Kurt Brian Ferreira, (2008). Instrumentation and Analysis of MPI Queue Times on the SeaStar High-Performance Network 17th International Conference on Computer Communications and Networking https://www.osti.gov/search/identifier:1145976 Document ID: 5259786


Awards & Recognition

2010

Kevin Pedretti, Ron Brightwell, Kurt Ferreira, Suzanne Kelly, Michael Levenhagen, Courtenay Vaughan, Kitten Operating System Virtualization Team, Sandia National Laboratories, March 23, 2010

2009

Ronald Brightwell, Kurt Brian Ferreira, Suzanne M. Kelly, James H. Laros, Kevin Pedretti, James Tomkins, John P. Vandyke, Courtenay T. Vaughan, Robert Ballance, Trammell Hudson, R&D 100 Award, R&D Magazine, One of the 100 Most Technologically Significant New Products of the Year, June 1, 2009