2010
Ron Brightwell, Kurt Ferreira, Suzanne Kelly, Michael Levenhagen, Courtenay Vaughan,
Kitten Operating System Virtualization Team, Sandia National Laboratories,
March 23, 2010
Kurt Brian Ferreira
Scalable System Software
(505) 844-0433
Sandia National Laboratories, New Mexico
P.O. Box 5800
Albuquerque, NM 87185-1319
Biography
Principal Member of Technical Staff
My area of expertise is system software and resilience/fault-tolerance methods for large-scale, massively parallel, distributed-memory, scientific computing systems. I have designed and developed a number of innovative, high-performance, and resilient implementations of low-level system software for several HPC platforms including the Cray Red Storm (XT3) machine at Sandia National Laboratories. My research interests include the design and construction of operating systems for massively parallel processing machines and innovative application and system-level fault-tolerance mechanisms for HPC.
Education
I received a BS in mathematics and a BS in computer science from New Mexico Tech in 2000, and an MS (2008) and a PhD (2011) in computer science from the University of New Mexico.
Publications
-
Olivier, S.L., Brightwell, R.B., Dosanjh, M.G.F., Ferreira, K., Levy, S.L.N., Bachman, W.B., & Younge, A.J. (2022). SNL ATDM Software Ecosystem Then and Now: Operating Systems and On-Node Runtime [Presentation]. https://www.osti.gov/biblio/2006330 Publication ID: 122008
-
Ferreira, K., Levy, S.L.N., Hemmert, J., & Bachman, W.B. (2022). Understanding Memory Failures on a Petascale Arm System [Conference Paper]. HPDC 2022 – Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing. https://doi.org/10.1145/3502181.3531465 Publication ID: 112016
-
Olivier, S.L., Brightwell, R.B., Dosanjh, M.G.F., Ferreira, K., Levy, S.L.N., Bachman, W.B., & Younge, A.J. (2022). SNL ATDM Software Ecosystem Operating Systems and On-Node Runtime [Presentation]. https://www.osti.gov/biblio/2002316 Publication ID: 110212
-
Ferreira, K., & Levy, S.L.N. (2022). Characterizing Failures in HPC Using Benford’s Law [Conference Presentation]. https://doi.org/10.2172/2001912 Publication ID: 108664
-
Ferreira, K., & Levy, S.L.N. (2022). Characterizing Memory Failures Using Benford’s Law [Conference Paper]. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85133026898&origin=inward Publication ID: 75682
-
Ferreira, K., & Levy, S.L.N. (2021). Evaluating MPI resource usage summary statistics. Parallel Computing, 108. https://doi.org/10.1016/j.parco.2021.102825 Publication ID: 75299
-
Haskins, K., Bridges, P., Ferreira, K., & Levy, S.L.N. (2021). A Benchmark to Understand Communication Performance in Hybrid MPI and GPU Applications [Conference Paper]. https://www.osti.gov/biblio/1899493 Publication ID: 76416
-
Haskins, K., Bridges, P., Ferreira, K., & Levy, S.L.N. (2021). A Benchmark to Understand Communication Performance in Hybrid MPI and GPU Applications [Conference Paper]. https://www.osti.gov/biblio/1899492 Publication ID: 76415
-
Ferreira, K., & Levy, S.L.N. (2021). Characterizing Per-node Memory Failures Using Benford’s Law [Conference Paper]. https://www.osti.gov/biblio/1886179 Publication ID: 75504
-
Levy, S.L.N., & Ferreira, K. (2021). An Initial Examination of the Effect of Container Resource Constraints on Application Perturbation [Conference Presentation]. https://doi.org/10.2172/1869756 Publication ID: 78565
-
Olivier, S.L., Brightwell, R.B., Ferreira, K., Grant, R., Levy, S.L.N., Bachman, W.B., & Younge, A.J. (2021). SNL ATDM Software Ecosystem Operating Systems and On-Node Runtime [Presentation]. https://www.osti.gov/biblio/1861479 Publication ID: 77902
-
Ferreira, K., Levy, S.L.N., Kuhns, V., Debardeleben, N., & Blanchard, S. (2021). Understanding the Effects of DRAM Correctable Error Logging at Scale [Conference Paper]. Proceedings – IEEE International Conference on Cluster Computing, ICCC. https://doi.org/10.1109/Cluster48925.2021.00060 Publication ID: 79606
-
Brightwell, R.B., Ferreira, K., Grant, R., Levy, S.L.N., Lofstead, G.F., Olivier, S.L., Bachman, W.B., Younge, A.J., Gentile, A.C., & Bachman, W.B. (2021). ALAMO: Autonomous lightweight allocation, management, and optimization [Conference Poster]. Communications in Computer and Information Science. https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85107303666&origin=inward Publication ID: 74680
-
Levy, S.L.N., & Ferreira, K. (2020). Evaluating MPI Message Size Summary Statistics [Conference Proceeding]. https://www.osti.gov/biblio/1825984 Publication ID: 71238
-
Levy, S.L.N., & Ferreira, K. (2020). Space-Efficient Reed-Solomon Encoding to Detect and Correct Pointer Corruption [Conference Poster]. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85086273467&origin=inward Publication ID: 69979
-
Levy, S.L.N., & Ferreira, K. (2019). Evaluating tradeoffs between MPI message matching offload hardware capacity and performance [Conference Poster]. ACM International Conference Proceeding Series. https://doi.org/10.1145/3343211.3343223 Publication ID: 70063
-
Levy, S.L.N., Ferreira, K., Schonbein, W., Grant, R., & Dosanjh, M.G.F. (2019). Using simulation to examine the effect of MPI message matching costs on application performance. Parallel Computing, 84, pp. 63-74. https://doi.org/10.1016/j.parco.2019.02.008 Publication ID: 67578
-
Levy, S.L.N., Ferreira, K., Siddiqua, T., Debardeleben, N., Sridharan, V., & Baseman, E. (2019). Lessons learned from memory errors observed over the lifetime of Cielo [Conference Poster]. https://doi.org/10.1109/SC.2018.00046 Publication ID: 67575
-
Ferreira, K., Grant, R., Levenhagen, M., Levy, S.L.N., & Groves, T. (2019). Hardware MPI message matching: Insights into MPI matching behavior to inform design: Hardware MPI message matching. Concurrency and Computation: Practice and Experience, 32(3). https://doi.org/10.1002/cpe.5150 Publication ID: 64546
-
Olivier, S.L., Brightwell, R.B., Bachman, W.B., Younge, A.J., Evans, N., Levy, S.L.N., Ferreira, K., & Grant, R. (2019). SNL ATDM Software Ecosystem [Presentation]. https://www.osti.gov/biblio/1583026 Publication ID: 64200
-
Ferreira, K. (2019). Checkpointing Strategies for Shared High-Performance Computing Platforms. International Journal of Networking and Computing, 9(1), pp. 28-52. https://doi.org/10.15803/ijnc.9.1_28 Publication ID: 60074
-
Levy, S.L.N., & Ferreira, K. (2018). Using simulation to examine the effect of MPI message matching costs on application performance [Conference Poster]. ACM International Conference Proceeding Series. https://doi.org/10.1145/3236367.3236375 Publication ID: 63034
-
Ferreira, K., Levy, S.L.N., Bachman, W.B., & Grant, R. (2018). Characterizing MPI matching via trace-based simulation [Conference Poster]. Parallel Computing. https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85048343916&origin=inward Publication ID: 57396
-
Levy, S.L.N., Bachman, W.B., & Ferreira, K. (2018). Open science on Trinity’s Knights Landing partition: An analysis of user job data [Conference Poster]. ACM International Conference Proceeding Series. https://doi.org/10.1145/3229710.3229753 Publication ID: 62662
-
Herault, T., Robert, Y., Bouteiller, A., Arnold, D., Ferreira, K., Bosilca, G., & Dongarra, J. (2018). Optimal cooperative checkpointing for shared high-performance computing platforms [Conference Poster]. Proceedings – 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018. https://doi.org/10.1109/IPDPSW.2018.00127 Publication ID: 61598
-
Levy, S.L.N., Ferreira, K., Debardeleben, N., Siddiqua, T., Sridharan, V., & Baseman, E. (2018). Lessons Learned from Errors Observed over the Lifetime of Cielo [Conference Poster]. https://doi.org/10.1109/SC.2018.00046 Publication ID: 63939
-
Baseman, E., Debardeleben, N., Blanchard, S., Moore, J., Tkachenko, O., Ferreira, K., Siddiqua, T., & Sridharan, V. (2018). Physics-Informed Machine Learning for DRAM Error Modeling [Conference Poster]. 2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2018. https://doi.org/10.1109/DFT.2018.8602983 Publication ID: 63390
-
Baseman, E., Debardeleben, N., Blanchard, S., Moore, J., Tkachenko, O., Sridharan, V., Ferreira, K., & Siddiqua, T. (2018). Physics-Informed Machine Learning for DRAM Error Modeling [Conference Poster]. https://doi.org/10.1109/DFT.2018.8602983 Publication ID: 62156
-
Widener, P., Ferreira, K., & Levy, S.L.N. (2018). It’s not the heat, it’s the humidity: Scheduling resilience activity at scale [Conference Poster]. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85042475218&origin=inward Publication ID: 56360
-
Levy, S.L.N., Ferreira, K., & Widener, P. (2017). The Unexpected Virtue of Almost: Exploiting MPI Collective Operations to Approximately Coordinate Checkpoints [Conference Poster]. https://doi.org/10.1002/cpe.4890 Publication ID: 54218
-
Ferreira, K., Grant, R., Levenhagen, M., Levy, S.L.N., & Groves, T. (2017). Hardware MPI Message Matching: Insights into MPI Matching Behavior to Inform Design [Conference Poster]. https://doi.org/10.1002/cpe.5150 Publication ID: 54225
-
Kreitinger, R., Levy, S.L.N., Ferreira, K., & Widener, P. (2017). Spacehog: Evaluating the costs of dedicating resources to in situ analysis [Conference Poster]. https://www.osti.gov/biblio/1478158 Publication ID: 53562
-
Kreitinger, R., Levy, S.L.N., Ferreira, K., & Widener, P. (2017). Spacehog: Evaluating the costs of dedicating resources to in situ analysis [Conference Poster]. https://www.osti.gov/biblio/1573776 Publication ID: 53563
-
Herault, T., Robert, Y., Bouteiller, A., Arnold, D., Ferreira, K., Bosilca, G., & Dongarra, J. (2017). Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms [Conference Poster]. https://www.osti.gov/biblio/1480217 Publication ID: 53793
-
Ferreira, K., Levy, S.L.N., Bachman, W.B., & Grant, R. (2017). Characterizing MPI matching via trace-based simulation. ACM International Conference Proceeding Series, 2017, pp. 1-45. https://doi.org/10.1145/3127024.3127040 Publication ID: 98292
-
Levy, S.L.N., Ferreira, K., & Bridges, P.G. (2017). Evaluating the Viability of Using Compression to Mitigate Silent Corruption of Read-Mostly Application Data [Conference Poster]. Proceedings – IEEE International Conference on Cluster Computing, ICCC. https://doi.org/10.1109/CLUSTER.2017.99 Publication ID: 57799
-
Baseman, E., Debardeleben, N., Ferreira, K., Sridharan, V., Siddiqua, T., & Tkachenko, O. (2017). Automating DRAM Fault Mitigation by Learning from Experience [Conference Poster]. Proceedings – 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, DSN-W 2017. https://doi.org/10.1109/DSN-W.2017.39 Publication ID: 55872
-
Siddiqua, T., Sridharan, V., Raasch, S.E., Debardeleben, N., Ferreira, K., Levy, S.L.N., Baseman, E., & Guan, Q. (2017). Lifetime memory reliability data from the field [Conference Poster]. 2017 IEEE Int. Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2017. https://doi.org/10.1109/DFT.2017.8244428 Publication ID: 57295
-
Gammel, M., Teranishi, K., Knight, S., Sjaardema, G.D., Kolla, H., Wilke, J., Slattengren, N., Ferreira, K., Bennett, J., Jain, N., & Kale, L. (2017). Evaluating the Charm++ Runtimes Ability to Cope with Performance Heterogeneity [Conference Poster]. https://www.osti.gov/biblio/1456562 Publication ID: 55874
-
Widener, P., Ferreira, K., & Levy, S.L.N. (2017). Horseshoes and hand grenades: The case for approximate coordination in local checkpointing protocols [Conference Poster]. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). https://doi.org/10.1007/978-3-319-58943-5_50 Publication ID: 50229
-
Levy, S.L.N., Ferreira, K., & Bridges, P.G. (2016). Improving Application Resilience to Memory Errors with Lightweight Compression [Conference Poster]. https://doi.org/10.1109/SC.2016.27 Publication ID: 47905
-
Levy, S.L.N., Ferreira, K., Widener, P., Bridges, P.G., & Mondragon, O.H. (2016). How I learned to stop worrying and love in situ analytics: Leveraging latent synchronization in MPI collective algorithms [Conference Poster]. ACM International Conference Proceeding Series. https://doi.org/10.1145/2966884.2966920 Publication ID: 52299
-
Baseman, E., Debardeleben, N., Ferreira, K., Levy, S.L.N., Raasch, S., Sridharan, V., Siddiqua, T., & Guan, Q. (2016). Improving DRAM Fault Characterization through Machine Learning [Conference Poster]. Proceedings – 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN-W 2016. https://doi.org/10.1109/DSN-W.2016.13 Publication ID: 49553
-
Levy, S.L.N., Ferreira, K., & Bridges, P.G. (2016). Improving Application Resilience to Memory Errors with Lightweight Compression [Conference Poster]. International Conference for High Performance Computing, Networking, Storage and Analysis, SC. https://doi.org/10.1109/SC.2016.27 Publication ID: 51067
-
Levy, S.L.N., Ferreira, K., Widener, P., Bridges, P.G., & Mondragon, O.H. (2016). Understanding Performance Interference in Next-Generation HPC Systems [Conference Poster]. https://www.osti.gov/biblio/1372149 Publication ID: 51068
-
Fiala, D., Mueller, F., Ferreira, K., & Engelmann, C. (2016). Mini-Ckpts: Surviving OS failures in persistent memory [Conference Poster]. Proceedings of the International Conference on Supercomputing. https://doi.org/10.1145/2925426.2926295 Publication ID: 49177
-
Levy, S.L.N., & Ferreira, K. (2016). An examination of the impact of failure distribution on coordinated checkpoint/restart [Conference Poster]. FTXS 2016 – Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale. https://doi.org/10.1145/2909428.2909430 Publication ID: 50259
-
Levy, S.L.N., Ferreira, K., Widener, P., Bridges, P., & Mondragon, O. (2016). How I Learned to Stop Worrying and Love In Situ Analytics: Leveraging latent synchronization in MPI collective algorithms [Conference Poster]. https://www.osti.gov/biblio/1364728 Publication ID: 50139
-
Levy, S.L.N., Ferreira, K., Widener, P., Bridges, P.G., & Mondragon, O. (2016). Using Simulation to Evaluate the Performance of Resilience Strategies at Scale [Presentation]. https://doi.org/10.1007/978-3-319-10214-6_5 Publication ID: 50027
-
Shipman, G., McCormick, P., Bachman, W.B., Olivier, S.L., Ferreira, K., Sankaran, R., Treichler, S., Aiken, A., & Bauer, M. (2016). Analysis of Application Sensitivity to System Performance Variability in a Dynamic Task Based Runtime [Conference Poster]. https://www.osti.gov/biblio/1365384 Publication ID: 49758
-
Baseman, E., Debardeleben, N., Ferreira, K., Levy, S., Raasch, S., Sridharan, V., Siddiqua, T., & Guan, Q. (2016). A Machine Learning Approach for Automatic Characterization of Memory Faults [Conference Poster]. https://www.osti.gov/biblio/1346523 Publication ID: 48579
-
Ferreira, K. (2016). An Examination of the Impact of the Failure Distribution on Coordinated Checkpoint/Restart [Conference Poster]. https://www.osti.gov/biblio/1345094 Publication ID: 48501
-
Widener, P., Levy, S.L.N., Ferreira, K., & Hoefler, T. (2016). On noise and the performance benefit of nonblocking collectives. International Journal of High Performance Computing Applications, 30(1), pp. 121-133. https://doi.org/10.1177/1094342015611952 Publication ID: 39411
-
Levy, S.L.N., Ferreira, K., & Bridges, P.G. (2016). Similarity Engine: Using Content Similarity to Improve Memory Resilience [Conference Poster]. https://www.osti.gov/biblio/1239385 Publication ID: 46804
-
Bachman, W.B., Olivier, S.L., Ferreira, K., Shipman, G., & Shu, W. (2015). Early experiences with node-level power capping on the Cray XC40 platform [Conference Poster]. Proceedings of E2SC 2015: 3rd International Workshop on Energy Efficient Supercomputing – Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis. https://doi.org/10.1145/2834800.2834801 Publication ID: 46036
-
Mondragon, O.H., Bridges, P.G., Ferreira, K., Widener, P., & Levy, S.L.N. (2015). Scheduling In-Situ Analytics in Next-generation Applications [Conference Poster]. https://www.osti.gov/biblio/1333466 Publication ID: 41676
-
Bachman, W.B., Olivier, S.L., Ferreira, K., Shipman, G., & Shu, W. (2015). Early Experiences with Node-Level Power Capping on the Cray XC40 Platform [PowerPoint] [Conference Poster]. https://doi.org/10.1145/2834800.2834801 Publication ID: 41617
-
Ferreira, K., Arnold, D., & Ibtesham, D. (2015). A checkpoint compression study for high-performance computing systems. International Journal of High Performance Computing Applications, 29(4), pp. 387-402. https://doi.org/10.1177/1094342015570921 Publication ID: 37407
-
Riesen, R., MacCabe, B., Gerofi, B., Lombard, D., Lange, J., Bachman, W.B., Ferreira, K., Lang, M., Keppel, P., Wisniewski, R., Brightwell, R.B., Inglett, T., Park, Y., & Ishikawa, Y. (2015). Panel: What is a Lightweight Kernel? [Conference Poster]. https://www.osti.gov/biblio/1258200 Publication ID: 43556
-
Bachman, W.B., Olivier, S.L., Ferreira, K., Shipman, G., & Shu, W. (2015). Exploring MPI Application Performance Under Power Capping on the Cray XC40 Platform [Conference Poster]. https://www.osti.gov/biblio/1258232 Publication ID: 43466
-
Ferreira, K. (2015). Revisiting Checkpointing for Exascale-Class Systems [Conference Poster]. https://www.osti.gov/biblio/1251139 Publication ID: 43249
-
Shipman, G., McCormick, P., Bachman, W.B., Olivier, S.L., Ferreira, K., Chen, J., Sankaran, R., Treichler, S., Aiken, A., & Bauer, M. (2015). Dynamic Task Scheduling to Mitigate System Performance Variability [Conference Poster]. https://www.osti.gov/biblio/1249032 Publication ID: 43099
-
Levy, S.L.N., Ferreira, K., & Bridges, P.G. (2015). Similarity Engine: Using Content Similarity to Improve Memory Resilience [Conference Poster]. https://www.osti.gov/biblio/1530987 Publication ID: 43098
-
Sridharan, V., Debardeleben, N., Blanchard, S., Ferreira, K., Gurumurthi, S., & Shalf, J. (2015). Memory errors in modern systems: The good, the bad, and the ugly. International Conference on Architectural Support for Programming Languages and Operating Systems – ASPLOS, 2015-January, pp. 297-310. https://doi.org/10.1145/2694344.2694348 Publication ID: 38008
-
Ferreira, K., Goudarzi, A., Arnold, D., & Feldman, G. (2015). A Principled Approach to HPC Event Monitoring [Conference Poster]. https://www.osti.gov/biblio/1239260 Publication ID: 41943
-
Widener, P., Ferreira, K., Levy, S.L.N., & Fabian, N. (2015). Canaries in a coal mine: Using application-level checkpoints to detect memory failures [Conference Poster]. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=84952050378&origin=inward Publication ID: 43835
-
Ferreira, K., Levy, S.L.N., Widener, P., & Arnold, D. (2014). Using Machine Learning to Optimize Uncoordinated Checkpointing Performance [Conference Poster]. https://www.osti.gov/biblio/1319751 Publication ID: 39111
-
Ferreira, K. (2014). Fault Survivability of Lightweight Operating Systems for exascale. https://doi.org/10.2172/1459775 Publication ID: 38559
Awards & Recognition
2009
Ronald Brightwell, Kurt Brian Ferreira, Suzanne M. Kelly, James H. Laros, Kevin Pedretti, James Tomkins, John P. Vandyke, Courtenay T. Vaughan, Robert Ballance, Trammell Hudson,
R&D 100 Award, R&D Magazine, One of the 100 Most Technologically Significant New Products of the Year,
June 1, 2009