Results 326–350 of 9,998
Skip to search filters

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Aaziz, Omar R.; Allan, Benjamin A.; Brandt, James M.; Cook, Jeanine C.; Devine, Karen D.; Elliott, James E.; Gentile, Ann C.; Hammond, Simon D.; Kelley, Brian M.; Lopatina, Lena L.; Moore, Stan G.; Olivier, Stephen L.; Pedretti, Kevin P.; Poliakoff, David Z.; Pawlowski, Roger P.; Regier, Phillip A.; Schmitz, Mark E.; Schwaller, Benjamin S.; Surjadidjaja, Vanessa S.; Swan, Matthew S.; Tucker, Nick T.; Tucker, Tom T.; Vaughan, Courtenay T.; Walton, Sara P.

Scientific applications run on high-performance computing (HPC) systems are critical for many national security missions within Sandia and the NNSA complex. However, these applications often face performance degradation and even failures that are challenging to diagnose. To provide unprecedented insight into these issues, the HPC Development, HPC Systems, Computational Science, and Plasma Theory & Simulation departments at Sandia crafted and completed their FY21 ASC Level 2 milestone entitled "Integrated System and Application Continuous Performance Monitoring and Analysis Capability." The milestone created a novel integrated HPC system and application monitoring and analysis capability by extending Sandia's Kokkos application portability framework, Lightweight Distributed Metric Service (LDMS) monitoring tool, and scalable storage, analysis, and visualization pipeline. The extensions to Kokkos and LDMS enable collection and storage of application data during run time, as it is generated, with negligible overhead. This data is combined with HPC system data within the extended analysis pipeline to present relevant visualizations of derived system and application metrics that can be viewed at run time or post run. This new capability was evaluated using several week-long, 290-node runs of Sandia's ElectroMagnetic Plasma In Realistic Environments ( EMPIRE ) modeling and design tool and resulted in 1TB of application data and 50TB of system data. EMPIRE developers remarked this capability was incredibly helpful for quickly assessing application health and performance alongside system state. In short, this milestone work built the foundation for expansive HPC system and application data collection, storage, analysis, visualization, and feedback framework that will increase total scientific output of Sandia's HPC users.

More Details

Incentivizing Adoption of Software Quality Practices

Raybourn, Elaine M.; Milewicz, Reed M.; Mundt, Miranda R.

Although many software teams across the laboratories comply with yearly software quality engineering (SQE) assessments, the practice of introducing quality into each phase of the software lifecycle, or the team processes, may vary substantially. Even with the support of a quality engineer, many teams struggle to adapt and right-size software engineering best practices in quality to fit their context, and these activities aren’t framed in a way that motivates teams to take action. In short, software quality is often a “check the box for compliance” activity instead of a cultural practice that both values software quality and knows how to achieve it. In this report, we present the results of our 6600 VISTA Innovation Tournament project, "Incentivizing and Motivating High Confidence and Research Software Teams to Adopt the Practice of Quality." We present our findings and roadmap for future work based on 1) a rapid review of relevant literature, 2) lessons learned from an internal design thinking workshop, and 3) an external Collegeville 2021 workshop. These activities provided an opportunity for team ideation and community engagement/feedback. Based on our findings, we believe a coordinated effort (e.g. strategic communication campaign) aimed at diffusing the innovation of the practice of quality across Sandia National Laboratories could over time effect meaningful organizational change. As such, our roadmap addresses strategies for motivating and incentivizing individuals ranging from early career to seasoned software developers/scientists.

More Details

Physics-Based Optical Neuromorphic Classification

Leonard, Francois L.; Teeter, Corinne M.; Vineyard, Craig M.

Typical approaches to classify scenes from light convert the light field to electrons to perform the computation in the digital electronic domain. This conversion and downstream computational analysis require significant power and time. Diffractive neural networks have recently emerged as unique systems to classify optical fields at lower energy and high speeds. Previous work has shown that a single layer of diffractive metamaterial can achieve high performance on classification tasks. In analogy with electronic neural networks, it is anticipated that multilayer diffractive systems would provide better performance, but the fundamental reasons for the potential improvement have not been established. In this work, we present extensive computational simulations of two - layer diffractive neural networks and show that they can achieve high performance with fewer diffractive features than single layer systems.

More Details

Predictive Data-driven Platform for Subsurface Energy Production

Yoon, Hongkyu Y.; Verzi, Stephen J.; Cauthen, Katherine R.; Musuvathy, Srideep M.; Melander, Darryl J.; Norland, Kyle N.; Morales, Adriana M.; Lee, Jonghyun H.; Sun, Alexander Y.

Subsurface energy activities such as unconventional resource recovery, enhanced geothermal energy systems, and geologic carbon storage require fast and reliable methods to account for complex, multiphysical processes in heterogeneous fractured and porous media. Although reservoir simulation is considered the industry standard for simulating these subsurface systems with injection and/or extraction operations, reservoir simulation requires spatio-temporal “Big Data” into the simulation model, which is typically a major challenge during model development and computational phase. In this work, we developed and applied various deep neural network-based approaches to (1) process multiscale image segmentation, (2) generate ensemble members of drainage networks, flow channels, and porous media using deep convolutional generative adversarial network, (3) construct multiple hybrid neural networks such as convolutional LSTM and convolutional neural network-LSTM to develop fast and accurate reduced order models for shale gas extraction, and (4) physics-informed neural network and deep Q-learning for flow and energy production. We hypothesized that physicsbased machine learning/deep learning can overcome the shortcomings of traditional machine learning methods where data-driven models have faltered beyond the data and physical conditions used for training and validation. We improved and developed novel approaches to demonstrate that physics-based ML can allow us to incorporate physical constraints (e.g., scientific domain knowledge) into ML framework. Outcomes of this project will be readily applicable for many energy and national security problems that are particularly defined by multiscale features and network systems.

More Details

Understanding the Design Space of Sparse/Dense Multiphase Dataflows for Mapping Graph Neural Networks on Spatial Accelerators

Garg, Raveesh G.; Qin, Eric Q.; Martinez, Francisco M.; Guirado, Robert G.; Jain, Akshay J.; Abadal, Sergi A.; Abellan, Jose L.; Acacio, Manuel E.; Alarcon, Eduard A.; Rajamanickam, Sivasankaran R.; Krishna, Tushar K.

Graph Neural Networks (GNNs) have garnered a lot of recent interest because of their success in learning representations from graph-structured data across several critical applications in cloud and HPC. Owing to their unique compute and memory characteristics that come from an interplay between dense and sparse phases of computations, the emergence of reconfigurable dataflow (aka spatial) accelerators offers promise for acceleration by mapping optimized dataflows (i.e., computation order and parallelism) for both phases. The goal of this work is to characterize and understand the design-space of dataflow choices for running GNNs on spatial accelerators in order for the compilers to optimize the dataflow based on the workload. Specifically, we propose a taxonomy to describe all possible choices for mapping the dense and sparse phases of GNNs spatially and temporally over a spatial accelerator, capturing both the intra-phase dataflow and the inter-phase (pipelined) dataflow. Using this taxonomy, we do deep-dives into the cost and benefits of several dataflows and perform case studies on implications of hardware parameters for dataflows and value of flexibility to support pipelined execution.

More Details

Thermal Infrared Detectors: expanding performance limits using ultrafast electron microscopy

Talin, A.A.; Ellis, Scott R.; Bartelt, Norman C.; Leonard, Francois L.; Perez, Christopher P.; Celio, Km C.; Fuller, Elliot J.; Hughart, David R.; Garland, Diana; Marinella, Matthew J.; Michael, Joseph R.; Chandler, D.W.; Young, Steve M.; Smith, Sean M.; Kumar, Suhas K.

This project aimed to identify the performance-limiting mechanisms in mid- to far infrared (IR) sensors by probing photogenerated free carrier dynamics in model detector materials using scanning ultrafast electron microscopy (SUEM). SUEM is a recently developed method based on using ultrafast electron pulses in combination with optical excitations in a pump- probe configuration to examine charge dynamics with high spatial and temporal resolution and without the need for microfabrication. Five material systems were examined using SUEM in this project: polycrystalline lead zirconium titanate (a pyroelectric), polycrystalline vanadium dioxide (a bolometric material), GaAs (near IR), InAs (mid IR), and Si/SiO 2 system as a prototypical system for interface charge dynamics. The report provides detailed results for the Si/SiO 2 and the lead zirconium titanate systems.

More Details

Sphynx: A parallel multi-GPU graph partitioner for distributed-memory systems

Parallel Computing

Acer, Seher A.; Boman, Erik G.; Glusa, Christian A.; Rajamanickam, Sivasankaran R.

Graph partitioning has been an important tool to partition the work among several processors to minimize the communication cost and balance the workload. While accelerator-based supercomputers are emerging to be the standard, the use of graph partitioning becomes even more important as applications are rapidly moving to these architectures. However, there is no distributed-memory-parallel, multi-GPU graph partitioner available for applications. We developed a spectral graph partitioner, Sphynx, using the portable, accelerator-friendly stack of the Trilinos framework. In Sphynx, we allow using different preconditioners and exploit their unique advantages. We use Sphynx to systematically evaluate the various algorithmic choices in spectral partitioning with a focus on the GPU performance. We perform those evaluations on two distinct classes of graphs: regular (such as meshes, matrices from finite element methods) and irregular (such as social networks and web graphs), and show that different settings and preconditioners are needed for these graph classes. The experimental results on the Summit supercomputer show that Sphynx is the fastest alternative on irregular graphs in an application-friendly setting and obtains a partitioning quality close to ParMETIS on regular graphs. When compared to nvGRAPH on a single GPU, Sphynx is faster and obtains better balance and better quality partitions. Sphynx provides a good and robust partitioning method across a wide range of graphs for applications looking for a GPU-based partitioner.

More Details
Results 326–350 of 9,998
Results 326–350 of 9,998