Kurt Brian Ferreira

Scalable System Software

kbferre@sandia.gov

(505) 844-0433

Sandia National Laboratories, New Mexico
P.O. Box 5800
Albuquerque, NM 87185-1319

Biography

Principal Member of Technical Staff 
My area of expertise is system software and resilience/fault-tolerance methods for large-scale, massively parallel, distributed-memory, scientific computing systems. I have designed and developed a number of innovative, high-performance, and resilient implementations of low-level system software for several HPC platforms including the Cray Red Storm (XT3) machine at Sandia National Laboratories. My research interests include the design and construction of operating systems for massively parallel processing machines and innovative application and system-level fault-tolerance mechanisms for HPC.

Education

I received a BS in mathematics and a BS in computer science from New Mexico Tech in 2000, and an MS (2008) and a PhD (2011) in computer science from the University of New Mexico.

Publications

  • Ferreira, K.B., & Levy, S. (2022). Characterizing Memory Failures Using Benford’s Law [Conference Paper]. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85133026898&origin=inward Publication ID: 75682
  • Ferreira, K.B., & Levy, S. (2021). Evaluating MPI resource usage summary statistics. Parallel Computing, 108. https://doi.org/10.1016/j.parco.2021.102825 Publication ID: 75299
  • Haskins, K., Bridges, P., Ferreira, K.B., & Levy, S. (2021). A Benchmark to Understand Communication Performance in Hybrid MPI and GPU Applications [Conference Paper]. https://www.osti.gov/biblio/1899493 Publication ID: 76416
  • Haskins, K., Bridges, P., Ferreira, K.B., & Levy, S. (2021). A Benchmark to Understand Communication Performance in Hybrid MPI and GPU Applications [Conference Paper]. https://www.osti.gov/biblio/1899492 Publication ID: 76415
  • Ferreira, K.B., & Levy, S. (2021). Characterizing Per-node Memory Failures Using Benford’s Law [Conference Paper]. https://www.osti.gov/biblio/1886179 Publication ID: 75504
  • Levy, S., & Ferreira, K.B. (2021). An Initial Examination of the Effect of Container Resource Constraints on Application Perturbation [Conference Presentation]. https://doi.org/10.2172/1869756 Publication ID: 78565
  • Olivier, S.L., Brightwell, R., Ferreira, K.B., Grant, R.E., Levy, S., Pedretti, K., & Younge, A.J. (2021). SNL ATDM Software Ecosystem Operating Systems and On-Node Runtime [Presentation]. https://www.osti.gov/biblio/1861479 Publication ID: 77902
  • Ferreira, K.B., Levy, S., Kuhns, V., DeBardeleben, N., & Blanchard, S. (2021). Understanding the Effects of DRAM Correctable Error Logging at Scale [Conference Paper]. Proceedings – IEEE International Conference on Cluster Computing, ICCC. https://doi.org/10.1109/Cluster48925.2021.00060 Publication ID: 79606
  • Ferreira, K.B., & Levy, S. (2020). Evaluating MPI Message Size Summary Statistics [Conference Proceeding]. ACM International Conference Proceeding Series. https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85093937286&origin=inward Publication ID: 71238
  • Brightwell, R., Ferreira, K.B., Grant, R.E., Levy, S., Lofstead, G., Olivier, S.L., Pedretti, K., Younge, A.J., & Gentile, A. (2020). ALAMO: Autonomous Lightweight Allocation Management and Optimization [Conference Poster]. https://www.osti.gov/biblio/1818044 Publication ID: 74680
  • Levy, S., Ferreira, K.B., & Widener, P. (2020). The unexpected virtue of almost: Exploiting MPI collective operations to approximately coordinate checkpoints [Conference Poster]. Concurrency and Computation: Practice and Experience. https://doi.org/10.1002/cpe.4890 Publication ID: 54218
  • Ferreira, K.B., Grant, R.E., Levenhagen, M., Levy, S., & Groves, T. (2020). Hardware MPI message matching: Insights into MPI matching behavior to inform design. Concurrency and Computation: Practice and Experience, 32(3). https://doi.org/10.1002/cpe.5150 Publication ID: 64546
  • Levy, S., & Ferreira, K.B. (2020). Space-Efficient Reed-Solomon Encoding to Detect and Correct Pointer Corruption [Conference Poster]. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85086273467&origin=inward Publication ID: 69979
  • Levy, S., & Ferreira, K.B. (2019). Evaluating tradeoffs between MPI message matching offload hardware capacity and performance [Conference Poster]. ACM International Conference Proceeding Series. https://doi.org/10.1145/3343211.3343223 Publication ID: 70063
  • Levy, S., Ferreira, K.B., Schonbein, W., Grant, R.E., & Dosanjh, M. (2019). Using simulation to examine the effect of MPI message matching costs on application performance. Parallel Computing, 84, pp. 63-74. https://doi.org/10.1016/j.parco.2019.02.008 Publication ID: 67578
  • Levy, S., Ferreira, K.B., DeBardeleben, N., Siddiqua, T., Sridharan, V., & Baseman, E. (2019). Lessons learned from memory errors observed over the lifetime of Cielo [Conference Poster]. Proceedings – International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018. https://doi.org/10.1109/SC.2018.00046 Publication ID: 67575
  • Baseman, E., DeBardeleben, N., Blanchard, S., Moore, J., Tkachenko, O., Ferreira, K.B., Siddiqua, T., & Sridharan, V. (2019). Physics-Informed Machine Learning for DRAM Error Modeling [Conference Poster]. 2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2018. https://doi.org/10.1109/DFT.2018.8602983 Publication ID: 62156
  • Olivier, S.L., Brightwell, R., Pedretti, K., Younge, A.J., Evans, N., Levy, S., Ferreira, K.B., & Grant, R.E. (2019). SNL ATDM Software Ecosystem [Presentation]. https://www.osti.gov/biblio/1583026 Publication ID: 64200
  • Ferreira, K.B. (2019). Checkpointing Strategies for Shared High-Performance Computing Platforms. International Journal of Networking and Computing, 9(1), pp. 28-52. https://doi.org/10.15803/ijnc.9.1_28 Publication ID: 60074
  • Levy, S., & Ferreira, K.B. (2018). Using simulation to examine the effect of MPI message matching costs on application performance [Conference Poster]. ACM International Conference Proceeding Series. https://doi.org/10.1145/3236367.3236375 Publication ID: 63034
  • Levy, S., Pedretti, K., & Ferreira, K.B. (2018). Open science on Trinity’s Knights Landing partition: An analysis of user job data [Conference Poster]. ACM International Conference Proceeding Series. https://doi.org/10.1145/3229710.3229753 Publication ID: 62662
  • Herault, T., Robert, Y., Bouteiller, A., Arnold, D., Ferreira, K.B., Bosilca, G., & Dongarra, J. (2018). Optimal cooperative checkpointing for shared high-performance computing platforms [Conference Poster]. Proceedings – 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018. https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85052224235&origin=inward Publication ID: 53793
  • Levy, S., Ferreira, K.B., DeBardeleben, N., Siddiqua, T., Sridharan, V., & Baseman, E. (2018). Lessons Learned from Errors Observed over the Lifetime of Cielo [Conference Poster]. https://doi.org/10.1109/SC.2018.00046 Publication ID: 63939
  • Baseman, E., DeBardeleben, N., Blanchard, S., Moore, J., Tkachenko, O., Ferreira, K.B., Siddiqua, T., & Sridharan, V. (2018). Physics-Informed Machine Learning for DRAM Error Modeling [Conference Poster]. https://doi.org/10.1109/DFT.2018.8602983 Publication ID: 63390
  • Herault, T., Robert, Y., Bouteiller, A., Arnold, D., Ferreira, K.B., Bosilca, G., & Dongarra, J. (2018). Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms [Conference Poster]. https://doi.org/10.1109/IPDPSW.2018.00127 Publication ID: 61598
  • Ferreira, K.B., Grant, R.E., Levenhagen, M., Levy, S., & Groves, T. (2017). Hardware MPI Message Matching: Insights into MPI Matching Behavior to Inform Design [Conference Poster]. https://doi.org/10.1002/cpe.5150 Publication ID: 54225
  • Kreitinger, R., Levy, S., Ferreira, K.B., & Widener, P. (2017). Spacehog: Evaluating the costs of dedicating resources to in situ analysis [Conference Poster]. https://www.osti.gov/biblio/1478158 Publication ID: 53562
  • Kreitinger, R., Levy, S., Ferreira, K.B., & Widener, P. (2017). Spacehog: Evaluating the costs of dedicating resources to in situ analysis [Conference Poster]. https://www.osti.gov/biblio/1573776 Publication ID: 53563
  • Ferreira, K.B., Levy, S., Pedretti, K., & Grant, R.E. (2017). Characterizing MPI matching via trace-based simulation [Conference Poster]. ACM International Conference Proceeding Series. https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85048316592&origin=inward Publication ID: 57396
  • Levy, S., Ferreira, K.B., & Bridges, P.G. (2017). Evaluating the Viability of Using Compression to Mitigate Silent Corruption of Read-Mostly Application Data [Conference Poster]. Proceedings – IEEE International Conference on Cluster Computing, ICCC. https://doi.org/10.1109/CLUSTER.2017.99 Publication ID: 57799
  • Baseman, E., DeBardeleben, N., Ferreira, K.B., Sridharan, V., Siddiqua, T., & Tkachenko, O. (2017). Automating DRAM Fault Mitigation by Learning from Experience [Conference Poster]. Proceedings – 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, DSN-W 2017. https://doi.org/10.1109/DSN-W.2017.39 Publication ID: 55872
  • Siddiqua, T., Sridharan, V., Raasch, S.E., DeBardeleben, N., Ferreira, K.B., Levy, S., Baseman, E., & Guan, Q. (2017). Lifetime memory reliability data from the field [Conference Poster]. 2017 IEEE Int. Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2017. https://doi.org/10.1109/DFT.2017.8244428 Publication ID: 57295
  • Widener, P., Ferreira, K.B., & Levy, S. (2017). It’s not the heat it’s the humidity: scheduling resilience activity at scale [Conference Poster]. https://www.osti.gov/biblio/1367189 Publication ID: 56360
  • Gammel, M., Teranishi, K., Knight, S., Sjaardema, G., Kolla, H., Wilke, J., Slattengren, N.L., Ferreira, K.B., Bennett, J., Jain, N., & Kale, L. (2017). Evaluating the Charm++ Runtime’s Ability to Cope with Performance Heterogeneity [Conference Poster]. https://www.osti.gov/biblio/1456562 Publication ID: 55874
  • Widener, P., Ferreira, K.B., & Levy, S. (2017). Horseshoes and hand grenades: The case for approximate coordination in local checkpointing protocols [Conference Poster]. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). https://doi.org/10.1007/978-3-319-58943-5_50 Publication ID: 50229
  • Levy, S., Ferreira, K.B., Widener, P., Bridges, P.G., & Mondragon, O.H. (2016). How I learned to stop worrying and love in situ analytics: Leveraging latent synchronization in MPI collective algorithms [Conference Poster]. ACM International Conference Proceeding Series. https://doi.org/10.1145/2966884.2966920 Publication ID: 52299
  • Baseman, E., DeBardeleben, N., Ferreira, K.B., Levy, S., Raasch, S., Sridharan, V., Siddiqua, T., & Guan, Q. (2016). Improving DRAM Fault Characterization through Machine Learning [Conference Poster]. Proceedings – 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN-W 2016. https://doi.org/10.1109/DSN-W.2016.13 Publication ID: 49553
  • Mondragon, O.H., Bridges, P.G., Levy, S., Ferreira, K.B., & Widener, P. (2016). Scheduling In-Situ Analytics in Next-Generation Applications [Conference Poster]. Proceedings – 2016 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2016. https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=84983471009&origin=inward Publication ID: 41676
  • Levy, S., Ferreira, K.B., & Bridges, P.G. (2016). Improving Application Resilience to Memory Errors with Lightweight Compression [Conference Poster]. International Conference for High Performance Computing, Networking, Storage and Analysis, SC. https://doi.org/10.1109/SC.2016.27 Publication ID: 47905
  • Mondragon, O.H., Bridges, P.G., Levy, S., Ferreira, K.B., & Widener, P. (2016). Understanding Performance Interference in Next-Generation HPC Systems [Conference Poster]. International Conference for High Performance Computing, Networking, Storage and Analysis, SC. https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85017239152&origin=inward Publication ID: 51068
  • Levy, S., Ferreira, K.B., & Bridges, P.G. (2016). Improving Application Resilience to Memory Errors with Lightweight Compression [Conference Poster]. https://doi.org/10.1109/SC.2016.27 Publication ID: 51067
  • Fiala, D., Mueller, F., Ferreira, K.B., & Engelmann, C. (2016). Mini-Ckpts: Surviving OS failures in persistent memory [Conference Poster]. Proceedings of the International Conference on Supercomputing. https://doi.org/10.1145/2925426.2926295 Publication ID: 49177
  • Levy, S., & Ferreira, K.B. (2016). An examination of the impact of failure distribution on coordinated checkpoint/restart [Conference Poster]. FTXS 2016 – Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale. https://doi.org/10.1145/2909428.2909430 Publication ID: 50259
  • Levy, S., Ferreira, K.B., Widener, P., Bridges, P., & Mondragon, O. (2016). How I Learned to Stop Worrying and Love In Situ Analytics: Leveraging latent synchronization in MPI collective algorithms [Conference Poster]. https://www.osti.gov/biblio/1364728 Publication ID: 50139
  • Levy, S., Ferreira, K.B., Widener, P., Bridges, P.G., & Mondragon, O. (2016). Using Simulation to Evaluate the Performance of Resilience Strategies at Scale [Presentation]. https://doi.org/10.1007/978-3-319-10214-6_5 Publication ID: 50027
  • Shipman, G., McCormick, P., Pedretti, K., Olivier, S.L., Ferreira, K.B., Sankaran, R., Treichler, S., Aiken, A., & Bauer, M. (2016). Analysis of Application Sensitivity to System Performance Variability in a Dynamic Task Based Runtime [Conference Poster]. https://www.osti.gov/biblio/1365384 Publication ID: 49758
  • Baseman, E., DeBardeleben, N., Ferreira, K.B., Levy, S., Raasch, S., Sridharan, V., Siddiqua, T., & Guan, Q. (2016). A Machine Learning Approach for Automatic Characterization of Memory Faults [Conference Poster]. https://www.osti.gov/biblio/1346523 Publication ID: 48579
  • Widener, P., Levy, S., Ferreira, K.B., & Hoefler, T. (2016). On noise and the performance benefit of nonblocking collectives. International Journal of High Performance Computing Applications, 30(1), pp. 121-133. https://doi.org/10.1177/1094342015611952 Publication ID: 39411
  • Ferreira, K.B. (2016). An Examination of the Impact of the Failure Distribution on Coordinated Checkpoint/Restart [Conference Poster]. https://www.osti.gov/biblio/1345094 Publication ID: 48501
  • Levy, S., Ferreira, K.B., & Bridges, P.G. (2016). Similarity Engine: Using Content Similarity to Improve Memory Resilience [Conference Poster]. https://www.osti.gov/biblio/1239385 Publication ID: 46804
  • Pedretti, K., Olivier, S.L., Ferreira, K.B., Shipman, G., & Shu, W. (2015). Early experiences with node-level power capping on the Cray XC40 platform [Conference Poster]. Proceedings of E2SC 2015: 3rd International Workshop on Energy Efficient Supercomputing – Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis. https://doi.org/10.1145/2834800.2834801 Publication ID: 41617
  • Ibtesham, D., Ferreira, K.B., & Arnold, D. (2015). A checkpoint compression study for high-performance computing systems. International Journal of High Performance Computing Applications, 29(4), pp. 387-402. https://doi.org/10.1177/1094342015570921 Publication ID: 37407
  • Pedretti, K., Olivier, S.L., Ferreira, K.B., Shipman, G., & Shu, W. (2015). Early Experiences with Node-Level Power Capping on the Cray XC40 Platform [Conference Poster]. https://doi.org/10.1145/2834800.2834801 Publication ID: 46036
  • Goudarzi, A., Arnold, D., Stefanovic, D., Ferreira, K.B., & Feldman, G. (2015). A principled approach to HPC event monitoring [Conference Poster]. FTXS 2015 – Proceedings of the 2015 Workshop on Fault Tolerance for HPC at eXtreme Scale, Part of HPDC 2015. https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=84979736029&origin=inward Publication ID: 41943
  • Riesen, R., Maccabe, B., Gerofi, B., Lombard, D., Lange, J., Pedretti, K., Ferreira, K.B., Lang, M., Keppel, P., Wisniewski, R., Brightwell, R., Inglett, T., Park, Y., & Ishikawa, Y. (2015). Panel: What is a Lightweight Kernel? [Conference Poster]. https://www.osti.gov/biblio/1258200 Publication ID: 43556
  • Pedretti, K., Olivier, S.L., Ferreira, K.B., Shipman, G., & Shu, W. (2015). Exploring MPI Application Performance Under Power Capping on the Cray XC40 Platform [Conference Poster]. https://www.osti.gov/biblio/1258232 Publication ID: 43466
  • Ferreira, K.B. (2015). Revisiting Checkpointing for Exascale-Class Systems [Conference Poster]. https://www.osti.gov/biblio/1251139 Publication ID: 43249
  • Levy, S., Ferreira, K.B., & Bridges, P.G. (2015). Similarity Engine: Using Content Similarity to Improve Memory Resilience [Conference Poster]. https://www.osti.gov/biblio/1530987 Publication ID: 43098
  • Shipman, G., McCormick, P., Pedretti, K., Olivier, S.L., Ferreira, K.B., Chen, J.H., Sankaran, R., Treichler, S., Aiken, A., & Bauer, M. (2015). Dynamic Task Scheduling to Mitigate System Performance Variability [Conference Poster]. https://www.osti.gov/biblio/1249032 Publication ID: 43099
  • Sridharan, V., DeBardeleben, N., Blanchard, S., Ferreira, K.B., Stearley, J., Shalf, J., & Gurumurthi, S. (2015). Memory errors in modern systems: The good, the bad, and the ugly. International Conference on Architectural Support for Programming Languages and Operating Systems – ASPLOS, 2015-January, pp. 297-310. https://doi.org/10.1145/2694344.2694348 Publication ID: 38008
  • Widener, P., Ferreira, K.B., Levy, S., & Fabian, N. (2015). Canaries in a coal mine: Using application-level checkpoints to detect memory failures [Conference Poster]. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=84952050378&origin=inward Publication ID: 43835
  • Ferreira, K.B., Levy, S., Widener, P., & Arnold, D. (2014). Using Machine Learning to Optimize Uncoordinated Checkpointing Performance [Conference Poster]. https://www.osti.gov/biblio/1319751 Publication ID: 39111
  • Ferreira, K.B. (2014). Fault Survivability of Lightweight Operating Systems for exascale. https://doi.org/10.2172/1459775 Publication ID: 38559

Awards & Recognition

2010

Ron Brightwell, Kurt Ferreira, Suzanne Kelly, Michael Levenhagen, Courtenay Vaughan, Kitten Operating System Virtualization Team, Sandia National Laboratories, March 23, 2010

2009

Ronald Brightwell, Kurt Brian Ferreira, Suzanne M. Kelly, James H. Laros, Kevin Pedretti, James Tomkins, John P. Vandyke, Courtenay T. Vaughan, Robert Ballance, Trammell Hudson, R&D 100 Award, R&D Magazine, One of the 100 Most Technologically Significant New Products of the Year, June 1, 2009