Kurt Brian Ferreira

Scalable System Software

kbferre@sandia.gov

(505) 844-0433

Sandia National Laboratories, New Mexico
P.O. Box 5800
Albuquerque, NM 87185-1319

Biography

Principal Member of Technical Staff 
My area of expertise is system software and resilience/fault-tolerance methods for large-scale, massively parallel, distributed-memory, scientific computing systems. I have designed and developed a number of innovative, high-performance, and resilient implementations of low-level system software for several HPC platforms including the Cray Red Storm (XT3) machine at Sandia National Laboratories. My research interests include the design and construction of operating systems for massively parallel processing machines and innovative application and system-level fault-tolerance mechanisms for HPC.

Education

I received a BS in mathematics and a BS in computer science from New Mexico Tech in 2000, followed by an MS (2008) and a PhD (2011) in computer science from the University of New Mexico.

Publications

Kurt Ferreira, Scott Levy, (2022). Characterizing Memory Failures Using Benford’s Law Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) https://www.osti.gov/servlets/purl/1887503 Publication ID: 75682

Kurt Ferreira, Scott Levy, (2021). Evaluating MPI resource usage summary statistics Parallel Computing https://doi.org/10.1016/j.parco.2021.102825 Publication ID: 75299

Keira Haskins, Patrick Bridges, Kurt Ferreira, Scott Levy, (2021). A Benchmark to Understand Communication Performance in Hybrid MPI and GPU Applications https://www.osti.gov/servlets/purl/1899492 Publication ID: 76415

Keira Haskins, Patrick Bridges, Kurt Ferreira, Scott Levy, (2021). A Benchmark to Understand Communication Performance in Hybrid MPI and GPU Applications https://www.osti.gov/servlets/purl/1899493 Publication ID: 76416

Kurt Ferreira, Scott Levy, (2021). Characterizing Per-node Memory Failures Using Benford’s Law https://www.osti.gov/servlets/purl/1886179 Publication ID: 75504

Scott Levy, Kurt Ferreira, (2021). An Initial Examination of the Effect of Container Resource Constraints on Application Perturbation https://doi.org/10.2172/1869756 Publication ID: 78565

Stephen Olivier, Ronald Brightwell, Kurt Ferreira, Ryan Grant, Scott Levy, Kevin Pedretti, Andrew Younge, (2021). SNL ATDM Software Ecosystem Operating Systems and On-Node Runtime https://www.osti.gov/servlets/purl/1861479 Publication ID: 77902

Kurt Ferreira, Scott Levy, Victor Kuhns, Nathan DeBardeleben, Sean Blanchard, (2021). Understanding the Effects of DRAM Correctable Error Logging at Scale Proceedings – IEEE International Conference on Cluster Computing, ICCC https://doi.org/10.1109/Cluster48925.2021.00060 Publication ID: 79606

Kurt Ferreira, Scott Levy, (2020). Evaluating MPI Message Size Summary Statistics ACM International Conference Proceeding Series https://www.osti.gov/servlets/purl/1825984 Publication ID: 71238

Ronald Brightwell, Kurt Ferreira, Ryan Grant, Scott Levy, Gerald Lofstead, Stephen Olivier, Kevin Pedretti, Andrew Younge, Ann Gentile, (2020). ALAMO: Autonomous Lightweight Allocation Management and Optimization https://www.osti.gov/servlets/purl/1818044 Publication ID: 74680

Kurt Ferreira, Ryan Grant, Michael Levenhagen, Scott Levy, Taylor Groves, (2020). Hardware MPI message matching: Insights into MPI matching behavior to inform design Concurrency and Computation: Practice and Experience https://doi.org/10.1002/cpe.5150 Publication ID: 64546

Scott Levy, Kurt Ferreira, Patrick Widener, (2020). The unexpected virtue of almost: Exploiting MPI collective operations to approximately coordinate checkpoints Concurrency and Computation: Practice and Experience https://doi.org/10.1002/cpe.4890 Publication ID: 54218

Scott Levy, Kurt Ferreira, (2020). Space-Efficient Reed-Solomon Encoding to Detect and Correct Pointer Corruption Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) https://www.osti.gov/servlets/purl/1641289 Publication ID: 69979

Scott Levy, Kurt Ferreira, (2019). Evaluating tradeoffs between MPI message matching offload hardware capacity and performance ACM International Conference Proceeding Series https://doi.org/10.1145/3343211.3343223 Publication ID: 70063

Scott Levy, Kurt Ferreira, Whit Schonbein, Ryan Grant, Matthew Dosanjh, (2019). Using simulation to examine the effect of MPI message matching costs on application performance Parallel Computing https://doi.org/10.1016/j.parco.2019.02.008 Publication ID: 67578

Scott Levy, Kurt Ferreira, Nathan DeBardeleben, Taniya Siddiqua, Vilas Sridharan, Elisabeth Baseman, (2019). Lessons learned from memory errors observed over the lifetime of Cielo Proceedings – International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018 https://doi.org/10.1109/SC.2018.00046 Publication ID: 67575

Elisabeth Baseman, Nathan DeBardeleben, Sean Blanchard, Juston Moore, Olena Tkachenko, Kurt Ferreira, Taniya Siddiqua, Vilas Sridharan, (2019). Physics-Informed Machine Learning for DRAM Error Modeling 2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2018 https://doi.org/10.1109/DFT.2018.8602983 Publication ID: 62156

Kurt Ferreira, (2019). Checkpointing Strategies for Shared High-Performance Computing Platforms International Journal of Networking and Computing https://doi.org/10.15803/ijnc.9.1_28 Publication ID: 60074

Stephen Olivier, Ronald Brightwell, Kevin Pedretti, Andrew Younge, Noah Evans, Scott Levy, Kurt Ferreira, Ryan Grant, (2019). SNL ATDM Software Ecosystem https://www.osti.gov/servlets/purl/1583026 Publication ID: 64200

Scott Levy, Kurt Ferreira, (2018). Using simulation to examine the effect of MPI message matching costs on application performance ACM International Conference Proceeding Series https://doi.org/10.1145/3236367.3236375 Publication ID: 63034

Scott Levy, Kevin Pedretti, Kurt Ferreira, (2018). Open science on Trinity’s Knights Landing partition: An analysis of user job data ACM International Conference Proceeding Series https://doi.org/10.1145/3229710.3229753 Publication ID: 62662

Thomas Herault, Yves Robert, Aurelien Bouteiller, Dorian Arnold, Kurt Ferreira, George Bosilca, Jack Dongarra, (2018). Optimal cooperative checkpointing for shared high-performance computing platforms Proceedings – 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018 https://www.osti.gov/servlets/purl/1480217 Publication ID: 53793

Scott Levy, Kurt Ferreira, Nathan DeBardeleben, Taniya Siddiqua, Vilas Sridharan, Elisabeth Baseman, (2018). Lessons Learned from Errors Observed over the Lifetime of Cielo https://doi.org/10.1109/SC.2018.00046 Publication ID: 63939

Elisabeth Baseman, Nathan DeBardeleben, Sean Blanchard, Juston Moore, Olena Tkachenko, Kurt Ferreira, Taniya Siddiqua, Vilas Sridharan, (2018). Physics-Informed Machine Learning for DRAM Error Modeling https://doi.org/10.1109/DFT.2018.8602983 Publication ID: 63390

Thomas Herault, Yves Robert, Aurelien Bouteiller, Dorian Arnold, Kurt Ferreira, George Bosilca, Jack Dongarra, (2018). Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms https://doi.org/10.1109/IPDPSW.2018.00127 Publication ID: 61598

Kurt Ferreira, Ryan Grant, Michael Levenhagen, Scott Levy, Taylor Groves, (2017). Hardware MPI Message Matching: Insights into MPI Matching Behavior to Inform Design https://doi.org/10.1002/cpe.5150 Publication ID: 54225

Rebecca Kreitinger, Scott Levy, Kurt Ferreira, Patrick Widener, (2017). Spacehog: Evaluating the costs of dedicating resources to in situ analysis https://www.osti.gov/servlets/purl/1478158 Publication ID: 53562

Rebecca Kreitinger, Scott Levy, Kurt Ferreira, Patrick Widener, (2017). Spacehog: Evaluating the costs of dedicating resources to in situ analysis https://www.osti.gov/servlets/purl/1573776 Publication ID: 53563

Kurt Ferreira, Scott Levy, Kevin Pedretti, Ryan Grant, (2017). Characterizing MPI matching via trace-based simulation ACM International Conference Proceeding Series https://www.osti.gov/servlets/purl/1462518 Publication ID: 57396

Scott Levy, Kurt Ferreira, Patrick Bridges, (2017). Evaluating the Viability of Using Compression to Mitigate Silent Corruption of Read-Mostly Application Data Proceedings – IEEE International Conference on Cluster Computing, ICCC https://doi.org/10.1109/CLUSTER.2017.99 Publication ID: 57799

Elisabeth Baseman, Nathan DeBardeleben, Kurt Ferreira, Vilas Sridharan, Taniya Siddiqua, Olena Tkachenko, (2017). Automating DRAM Fault Mitigation by Learning from Experience Proceedings – 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, DSN-W 2017 https://doi.org/10.1109/DSN-W.2017.39 Publication ID: 55872

Taniya Siddiqua, Vilas Sridharan, Steven Raasch, Nathan DeBardeleben, Kurt Ferreira, Scott Levy, Elisabeth Baseman, Qiang Guan, (2017). Lifetime memory reliability data from the field 2017 IEEE Int. Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2017 https://doi.org/10.1109/DFT.2017.8244428 Publication ID: 57295

Patrick Widener, Kurt Ferreira, Scott Levy, (2017). It’s not the heat it’s the humidity: scheduling resilience activity at scale https://www.osti.gov/servlets/purl/1367189 Publication ID: 56360

Marc Gammel, Keita Teranishi, Samuel Knight, Gregory Sjaardema, Hemanth Kolla, Jason Wilke, Nicole Slattengren, Kurt Ferreira, Janine Bennett, Nikhil Jain, Laxmikant Kale, (2017). Evaluating the Charm++ Runtime’s Ability to Cope with Performance Heterogeneity https://www.osti.gov/servlets/purl/1456562 Publication ID: 55874

Patrick Widener, Kurt Ferreira, Scott Levy, (2017). Horseshoes and hand grenades: The case for approximate coordination in local checkpointing protocols Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) https://doi.org/10.1007/978-3-319-58943-5_50 Publication ID: 50229

Scott Levy, Kurt Ferreira, Patrick Widener, Patrick Bridges, Oscar Mondragon, (2016). How I learned to stop worrying and love in situ analytics: Leveraging latent synchronization in MPI collective algorithms ACM International Conference Proceeding Series https://doi.org/10.1145/2966884.2966920 Publication ID: 52299

Elisabeth Baseman, Nathan DeBardeleben, Kurt Ferreira, Scott Levy, Steven Raasch, Vilas Sridharan, Taniya Siddiqua, Qiang Guan, (2016). Improving DRAM Fault Characterization through Machine Learning Proceedings – 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN-W 2016 https://doi.org/10.1109/DSN-W.2016.13 Publication ID: 49553

Oscar Mondragon, Patrick Bridges, Scott Levy, Kurt Ferreira, Patrick Widener, (2016). Scheduling In-Situ Analytics in Next-Generation Applications Proceedings – 2016 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2016 https://www.osti.gov/servlets/purl/1333466 Publication ID: 41676

Scott Levy, Kurt Ferreira, Patrick Bridges, (2016). Improving Application Resilience to Memory Errors with Lightweight Compression International Conference for High Performance Computing, Networking, Storage and Analysis, SC https://doi.org/10.1109/SC.2016.27 Publication ID: 47905

Oscar Mondragon, Patrick Bridges, Scott Levy, Kurt Ferreira, Patrick Widener, (2016). Understanding Performance Interference in Next-Generation HPC Systems International Conference for High Performance Computing, Networking, Storage and Analysis, SC https://www.osti.gov/servlets/purl/1372149 Publication ID: 51068

Scott Levy, Kurt Ferreira, Patrick Bridges, (2016). Improving Application Resilience to Memory Errors with Lightweight Compression https://doi.org/10.1109/SC.2016.27 Publication ID: 51067

David Fiala, Frank Mueller, Kurt Ferreira, Christian Engelmann, (2016). Mini-Ckpts: Surviving OS failures in persistent memory Proceedings of the International Conference on Supercomputing https://doi.org/10.1145/2925426.2926295 Publication ID: 49177

Scott Levy, Kurt Ferreira, (2016). An examination of the impact of failure distribution on coordinated checkpoint/restart FTXS 2016 – Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale https://doi.org/10.1145/2909428.2909430 Publication ID: 50259

Scott Levy, Kurt Ferreira, Patrick Widener, Patrick Bridges, Oscar Mondragon, (2016). Using Simulation to Evaluate the Performance of Resilience Strategies at Scale https://doi.org/10.1007/978-3-319-10214-6_5 Publication ID: 50027

Scott Levy, Kurt Ferreira, Patrick Widener, Patrick Bridges, Oscar Mondragon, (2016). How I Learned to Stop Worrying and Love In Situ Analytics: Leveraging latent synchronization in MPI collective algorithms https://www.osti.gov/servlets/purl/1364728 Publication ID: 50139

Galen Shipman, Patrick McCormick, Kevin Pedretti, Stephen Olivier, Kurt Ferreira, Ramanan Sankaran, Sean Treichler, Alex Aiken, Michael Bauer, (2016). Analysis of Application Sensitivity to System Performance Variability in a Dynamic Task Based Runtime https://www.osti.gov/servlets/purl/1365384 Publication ID: 49758

Kurt Ferreira, (2016). An Examination of the Impact of the Failure Distribution on Coordinated Checkpoint/Restart https://www.osti.gov/servlets/purl/1345094 Publication ID: 48501

Elisabeth Baseman, Nathan DeBardeleben, Kurt Ferreira, Scott Levy, Steven Raasch, Vilas Sridharan, Taniya Siddiqua, Qiang Guan, (2016). A Machine Learning Approach for Automatic Characterization of Memory Faults https://www.osti.gov/servlets/purl/1346523 Publication ID: 48579

Patrick Widener, Scott Levy, Kurt Ferreira, Torsten Hoefler, (2016). On noise and the performance benefit of nonblocking collectives International Journal of High Performance Computing Applications https://doi.org/10.1177/1094342015611952 Publication ID: 39411

Scott Levy, Kurt Ferreira, Patrick Bridges, (2016). Similarity Engine: Using Content Similarity to Improve Memory Resilience https://www.osti.gov/servlets/purl/1239385 Publication ID: 46804

Kevin Pedretti, Stephen Olivier, Kurt Ferreira, Galen Shipman, Wei Shu, (2015). Early experiences with node-level power capping on the Cray XC40 platform Proceedings of E2SC 2015: 3rd International Workshop on Energy Efficient Supercomputing – Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1145/2834800.2834801 Publication ID: 41617

Dewan Ibtesham, Kurt Ferreira, Dorian Arnold, (2015). A checkpoint compression study for high-performance computing systems International Journal of High Performance Computing Applications https://doi.org/10.1177/1094342015570921 Publication ID: 37407

Kevin Pedretti, Stephen Olivier, Kurt Ferreira, Galen Shipman, Wei Shu, (2015). Early Experiences with Node-Level Power Capping on the Cray XC40 Platform https://doi.org/10.1145/2834800.2834801 Publication ID: 46036

Alireza Goudarzi, Dorian Arnold, Darko Stefanovic, Kurt Ferreira, Guy Feldman, (2015). A principled approach to HPC event monitoring FTXS 2015 – Proceedings of the 2015 Workshop on Fault Tolerance for HPC at eXtreme Scale, Part of HPDC 2015 https://www.osti.gov/servlets/purl/1239260 Publication ID: 41943

Rolf Riesen, Barney Maccabe, Balazs Gerofi, David Lombard, John Lange, Kevin Pedretti, Kurt Ferreira, Mike Lang, Pardo Keppel, Robert Wisniewski, Ronald Brightwell, Todd Inglett, Yoonho Park, Yutaka Ishikawa, (2015). Panel: What is a Lightweight Kernel? https://www.osti.gov/servlets/purl/1258200 Publication ID: 43556

Kevin Pedretti, Stephen Olivier, Kurt Ferreira, Galen Shipman, Wei Shu, (2015). Exploring MPI Application Performance Under Power Capping on the Cray XC40 Platform https://www.osti.gov/servlets/purl/1258232 Publication ID: 43466

Scott Levy, Kurt Ferreira, Patrick Bridges, (2015). Similarity Engine: Using Content Similarity to Improve Memory Resilience https://www.osti.gov/servlets/purl/1530987 Publication ID: 43098

Galen Shipman, Patrick McCormick, Kevin Pedretti, Stephen Olivier, Kurt Ferreira, Jacqueline Chen, Ramanan Sankaran, Sean Treichler, Alex Aiken, Michael Bauer, (2015). Dynamic Task Scheduling to Mitigate System Performance Variability https://www.osti.gov/servlets/purl/1249032 Publication ID: 43099

Kurt Ferreira, (2015). Revisiting Checkpointing for Exascale-Class Systems https://www.osti.gov/servlets/purl/1251139 Publication ID: 43249

Vilas Sridharan, Nathan DeBardeleben, Sean Blanchard, Kurt Ferreira, Jon Stearley, John Shalf, Sudhanva Gurumurthi, (2015). Memory errors in modern systems: The good, the bad, and the ugly International Conference on Architectural Support for Programming Languages and Operating Systems – ASPLOS https://doi.org/10.1145/2694344.2694348 Publication ID: 38008

Patrick Widener, Kurt Ferreira, Scott Levy, Nathan Fabian, (2015). Canaries in a coal mine: Using application-level checkpoints to detect memory failures Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) https://www.osti.gov/servlets/purl/1256569 Publication ID: 43835

Kurt Ferreira, Scott Levy, Patrick Widener, Dorian Arnold, (2014). Using Machine Learning to Optimize Uncoordinated Checkpointing Performance https://www.osti.gov/servlets/purl/1319751 Publication ID: 39111

Kurt Ferreira, (2014). Fault Survivability of Lightweight Operating Systems for exascale https://doi.org/10.2172/1459775 Publication ID: 38559

Awards & Recognition

2010

Ron Brightwell, Kurt Ferreira, Suzanne Kelly, Michael Levenhagen, Courtenay Vaughan, Kitten Operating System Virtualization Team, Sandia National Laboratories, March 23, 2010

2009

Ronald Brightwell, Kurt Brian Ferreira, Suzanne M. Kelly, James H. Laros, Kevin Pedretti, James Tomkins, John P. Vandyke, Courtenay T. Vaughan, Robert Ballance, Trammell Hudson, R&D 100 Award, R&D Magazine, One of the 100 Most Technologically Significant New Products of the Year, June 1, 2009