Scott Larson Nicoll Levy

Scalable System Software

sllevy@sandia.gov

(505) 844-7292

Sandia National Laboratories, New Mexico
P.O. Box 5800
Albuquerque, NM 87185-1319

Biography

I am a Senior Member of Technical Staff in the Scalable System Software department of the Center for Computing Research (CCR). I research system software for next-generation extreme-scale systems. Specifically, I study the impact of system failures, and other sources of performance interference, on the execution of scientific simulations. I am also investigating application performance in power-constrained environments. I earned my Ph.D. from the University of New Mexico, where I worked with Prof. Patrick Bridges in the Scalable Systems Lab. At Sandia, I work with Kurt Ferreira, Patrick Widener and the 9lives research group on improving the resilience and fault tolerance of large-scale parallel systems.

Education

  • Ph.D., Computer Science, University of New Mexico
  • B.S., Electrical Engineering, Cornell University

Publications

  • Haskins, K., Bridges, P., Ferreira, K.B., & Levy, S. (2021). A Benchmark to Understand Communication Performance in Hybrid MPI and GPU Applications [Conference Paper]. https://www.osti.gov/biblio/1899493 Publication ID: 76416
  • Haskins, K., Bridges, P., Ferreira, K.B., & Levy, S. (2021). A Benchmark to Understand Communication Performance in Hybrid MPI and GPU Applications [Conference Paper]. https://www.osti.gov/biblio/1899492 Publication ID: 76415
  • Ferreira, K.B., & Levy, S. (2021). Characterizing Per-node Memory Failures Using Benford's Law [Conference Paper]. https://www.osti.gov/biblio/1886179 Publication ID: 75504
  • Marts, W., Dosanjh, M., Schonbein, W., Levy, S., Grant, R.E., & Bridges, P. (2021). MiniMod: A Modular Miniapplication Benchmarking Framework for HPC [Conference Paper]. https://doi.org/10.1109/Cluster48925.2021.00028 Publication ID: 79517
  • Levy, S., & Ferreira, K.B. (2021). An Initial Examination of the Effect of Container Resource Constraints on Application Perturbation [Conference Presentation]. https://doi.org/10.2172/1869756 Publication ID: 78565
  • Olivier, S.L., Brightwell, R., Ferreira, K.B., Grant, R.E., Levy, S., Pedretti, K., & Younge, A.J. (2021). SNL ATDM Software Ecosystem Operating Systems and On-Node Runtime [Presentation]. https://www.osti.gov/biblio/1861479 Publication ID: 77902
  • Grant, R.E., Levy, S., & Schonbein, W. (2021). Co-design of System Software for Compute Accelerators and SmartNICs [Conference Paper]. https://www.osti.gov/biblio/1847622 Publication ID: 77227
  • Schonbein, W., Grant, R.E., Levy, S., Dosanjh, M., & Marts, W. (2020). Low-cost MPI Multithreaded Message Matching Benchmarking [Conference Presentation]. https://doi.org/10.2172/1882338 Publication ID: 72015
  • Grant, R.E., Schonbein, W., & Levy, S. (2020). RaDD Runtimes: Radical and Different Distributed Runtimes with SmartNICs [Conference Paper]. https://www.osti.gov/biblio/1825981 Publication ID: 71234
  • Templet Jr., G., Glickman, M., Kordenbrock, T., Levy, S., Lofstead, G., Mauldin, J., Otahal, T., Ulmer, C., Widener, P., & Oldfield, R. (2020). FY20 CSSE L2 Milestone 7186 [Presentation]. https://www.osti.gov/biblio/1820290 Publication ID: 74812
  • Brightwell, R., Ferreira, K.B., Grant, R.E., Levy, S., Lofstead, G., Olivier, S.L., Pedretti, K., Younge, A.J., & Gentile, A. (2020). ALAMO: Autonomous Lightweight Allocation Management and Optimization [Conference Poster]. https://www.osti.gov/biblio/1818044 Publication ID: 74680
  • Dosanjh, M., Grant, R.E., Hjelm, N., Levy, S., & Schonbein, W. (2019). The Upcoming Storm: The Implications of Increasing Core Count on Scalable System Software. https://www.osti.gov/biblio/1762669 Publication ID: 69426
  • Olivier, S.L., Brightwell, R., Pedretti, K., Younge, A.J., Evans, N., Levy, S., Ferreira, K.B., & Grant, R.E. (2019). SNL ATDM Software Ecosystem [Presentation]. https://www.osti.gov/biblio/1583026 Publication ID: 64200
  • Bettencourt, M.T., Kramer, R., Cartwright, K., Phillips, E., Ober, C., Pawlowski, R., Swan, M., Tezaur, I., Phipps, E., Conde, S., Cyr, E., Ulmer, C., Kordenbrock, T., Levy, S., Templet, G., Hu, J., Lin, P., Glusa, C., Siefert, C., & Glass, M. (2018). ASC ATDM Level 2 Milestone #6358: Assess Status of Next Generation Components and Physics Models in EMPIRE. https://doi.org/10.2172/1493832 Publication ID: 58868
  • Levy, S., Ferreira, K.B., DeBardeleben, N., Siddiqua, T., Sridharan, V., & Baseman, E. (2018). Lessons Learned from Errors Observed over the Lifetime of Cielo [Conference Poster]. https://doi.org/10.1109/SC.2018.00046 Publication ID: 63939
  • Ulmer, C., Kordenbrock, T., Lawson, M., Levy, S., Lofstead, G., Mukherjee, S., Sjaardema, G., Templet, G., Ward, L., & Widener, P. (2018). SNL ATDM: I/O and Data Management [Presentation]. https://www.osti.gov/biblio/1806512 Publication ID: 59268
  • Lawson, M., Lofstead, G., Levy, S., Widener, P., Ulmer, C., Mukherjee, S., Templet, G., & Kordenbrock, T. (2017). EMPRESS-Extensible Metadata PRovider for Extreme-scale Scientific Simulations [Conference Poster]. https://www.osti.gov/biblio/1513597 Publication ID: 54449
  • Ferreira, K.B., Grant, R.E., Levenhagen, M., Levy, S., & Groves, T. (2017). Hardware MPI Message Matching: Insights into MPI Matching Behavior to Inform Design [Conference Poster]. https://doi.org/10.1002/cpe.5150 Publication ID: 54225
  • Ulmer, C., Mukherjee, S., Templet, G., Levy, S., Lofstead, G., Widener, P., Kordenbrock, T., & Lawson, M. (2017). Faodail: Enabling In Situ Analytics for Next-Generation Systems [Conference Poster]. https://www.osti.gov/biblio/1482474 Publication ID: 54217
  • Lawson, M., Lofstead, G., Levy, S., Widener, P., Ulmer, C., Mukherjee, S., Templet, G., & Kordenbrock, T. (2017). EMPRESS-Extensible Metadata PRovider for Extreme-scale Scientific Simulations [Conference Poster]. https://www.osti.gov/biblio/1481718 Publication ID: 54054
  • Kreitinger, R., Levy, S., Ferreira, K.B., & Widener, P. (2017). Spacehog: Evaluating the costs of dedicating resources to in situ analysis [Conference Poster]. https://www.osti.gov/biblio/1478158 Publication ID: 53562
  • Kreitinger, R., Levy, S., Ferreira, K.B., & Widener, P. (2017). Spacehog: Evaluating the costs of dedicating resources to in situ analysis [Conference Poster]. https://www.osti.gov/biblio/1573776 Publication ID: 53563
  • Ulmer, C., Oldfield, R., Kordenbrock, T., Levy, S., Lofstead, G., Mukherjee, S., Templet, G., & Widener, P. (2017). ATDM Data Warehouse: Data Management Services for Exascale Computing [Presentation]. https://www.osti.gov/biblio/1466487 Publication ID: 58113
  • Widener, P., Ferreira, K.B., & Levy, S. (2017). It’s not the heat it’s the humidity: scheduling resilience activity at scale [Conference Poster]. https://www.osti.gov/biblio/1367189 Publication ID: 56360
  • Ulmer, C., Kordenbrock, T., Levy, S., Lofstead, G., Mukherjee, S., Sjaardema, G., Templet, G., Widener, P., & Oldfield, R. (2017). ATDM Data Warehouse [Conference Poster]. https://www.osti.gov/biblio/1427407 Publication ID: 53054
  • Levy, S., Ferreira, K.B., & Bridges, P.G. (2016). Improving Application Resilience to Memory Errors with Lightweight Compression [Conference Poster]. https://doi.org/10.1109/SC.2016.27 Publication ID: 51067
  • Levy, S., Ferreira, K.B., Widener, P., Bridges, P., & Mondragon, O. (2016). How I Learned to Stop Worrying and Love In Situ Analytics: Leveraging latent synchronization in MPI collective algorithms [Conference Poster]. https://www.osti.gov/biblio/1364728 Publication ID: 50139
  • Levy, S., Ferreira, K.B., Widener, P., Bridges, P.G., & Mondragon, O. (2016). Using Simulation to Evaluate the Performance of Resilience Strategies at Scale [Presentation]. https://doi.org/10.1007/978-3-319-10214-6_5 Publication ID: 50027
  • Levy, S. (2016). Using Rollback Avoidance to Mitigate Failures in Next-Generation Extreme-Scale Systems. https://www.osti.gov/biblio/1226922 Publication ID: 41675
  • Levy, S., Ferreira, K.B., & Bridges, P.G. (2016). Similarity Engine: Using Content Similarity to Improve Memory Resilience [Conference Poster]. https://www.osti.gov/biblio/1239385 Publication ID: 46804
  • Levy, S., Ferreira, K.B., & Bridges, P.G. (2015). Similarity Engine: Using Content Similarity to Improve Memory Resilience [Conference Poster]. https://www.osti.gov/biblio/1530987 Publication ID: 43098
  • Ferreira, K.B., Levy, S., Widener, P., & Arnold, D. (2014). Using Machine Learning to Optimize Uncoordinated Checkpointing Performance [Conference Poster]. https://www.osti.gov/biblio/1319751 Publication ID: 39111
  • Ferreira, K.B., & Levy, S. (2014). Exploring the effect of noise on the performance benefit of non-blocking MPI_Allreduce [Conference]. https://www.osti.gov/biblio/1145671 Publication ID: 40799
  • Ferreira, K.B., Levy, S., & Widener, P. (2014). Understanding the Effects of Communication and Coordination on Checkpointing at Scale [Conference]. https://doi.org/10.1109/SC.2014.77 Publication ID: 40506
  • Ferreira, K.B., & Levy, S. (2014). Characterizing the Impact of Rollback Avoidance at Extreme-Scale: A Modeling Approach [Conference]. https://www.osti.gov/biblio/1141101 Publication ID: 38842
  • Levy, S., Ferreira, K.B., & Widener, P. (2014). Using simulation to evaluate the performance of resilience strategies and process failures. https://doi.org/10.2172/1204092 Publication ID: 36991
  • Ferreira, K.B., Widener, P., & Levy, S. (2014). Understanding the Effects of Communication on Uncoordinated Checkpointing at Scale [Conference]. https://www.osti.gov/biblio/1140761 Publication ID: 36856
  • Levy, S., & Ferreira, K.B. (2013). Predicting the Impact of Failure Avoidance on Checkpoint/Restart in Extreme-Scale Systems [Conference]. https://www.osti.gov/biblio/1118703 Publication ID: 36607
  • Ferreira, K.B., & Levy, S. (2013). Predicting Coordinated and Uncoordinated Checkpoint/Restart Protocol Performance at Extreme Scales [Conference]. https://www.osti.gov/biblio/1115087 Publication ID: 36309
  • Levy, S., Ferreira, K.B., & Widener, P. (2013). Using Simulation to Evaluate the Performance of Resilience Strategies at Scale [Conference]. https://doi.org/10.1007/978-3-319-10214-6_5 Publication ID: 35680
  • Ferreira, K.B., Levy, S., & Brightwell, R. (2013). A Holistic Approach to Modeling and Simulation for Resilience and Power Configuration [Conference]. https://www.osti.gov/biblio/1111081 Publication ID: 34214
  • Ferreira, K.B., & Levy, S. (2013). A simulation infrastructure for examining the performance of resilience strategies at scale. https://doi.org/10.2172/1088091 Publication ID: 33098
  • Ferreira, K.B., & Levy, S. (2013). A Simulation Infrastructure for Examining the Performance of Resilience Strategies at Scale [Conference]. https://www.osti.gov/biblio/1078709 Publication ID: 33050
  • Ferreira, K.B., Pedretti, K., & Levy, S. (2013). Protect Yourself: Why Your OS Must Protect Against DRAM Failures [Conference]. https://www.osti.gov/biblio/1062878 Publication ID: 31581
  • Levy, S., & Ferreira, K.B. (2013). Using Unreliable Virtual Hardware to Inject Errors in Extreme-Scale Systems [Conference]. https://www.osti.gov/biblio/1063319 Publication ID: 32166
  • Levy, S. (2013). Exploiting Content Similarity to Improve Memory Performance in Large-Scale High-Performance Computing Systems [Conference]. https://www.osti.gov/biblio/1064175 Publication ID: 32294
  • Ferreira, K.B., Thompson, A., Trott, C.R., & Levy, S. (2013). An examination of content similarity within the memory of HPC applications. https://doi.org/10.2172/1088105 Publication ID: 31234