High Security Operations Summer Undergraduate Internship Program
Abstract not provided.
Incipient melting is a phenomenon in aluminum alloys in which solute-rich areas, such as grain boundaries, can melt before the rest of the material. It can degrade mechanical and corrosion properties and is irreversible, so affected material must be scrapped. After indications of incipient melting were detected as the cause of failure in 7075 aluminum alloy (AA7075) parts, a study was launched to determine the threshold temperature for incipient melting. Samples of AA7075 were solution annealed at temperatures ranging from 870 to 1090 °F. A hardness profile was developed to demonstrate the loss of mechanical properties as incipient melting progressed. Additionally, the Zeiss software ZEN core Intellesis was used to quantify more accurately the changes in microstructural properties as AA7075 surpassed the onset of incipient melting. The results of this study were compared with previous AA7075 material that demonstrated incipient melting.
HPDC 2022 - Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing
Novel HPC platforms provide interesting challenges and opportunities. Analysis of these systems can provide a better understanding of both the specific platform being studied and large-scale systems in general. Arm is one such architecture that has been explored in HPC for several years; however, little is known about its viability for supporting large-scale production workloads in terms of system reliability. The Astra system at Sandia National Laboratories was the first public peta-FLOPS Arm-based system on the Top500 and has been successfully running production HPC applications for a couple of years. In this paper, we analyze memory failure data collected from Astra while the system was in production running unclassified applications. This analysis yields several contributions related to both the Arm platform and HPC systems in general. First, we outline the number of components replaced due to reliability issues in standing up this first-of-its-kind, large-scale HPC system. Second, we show how the distributions of correctable DRAM faults and errors differ on Astra, demonstrating that not properly accounting for faults can lead to erroneous conclusions. Additionally, we characterize DRAM faults on the system and show, contrary to existing work, that memory faults are uniformly distributed across CPU socket, DRAM column, bank, and rack region, but are not uniform across node, DIMM rank, DIMM slot on the motherboard, and system rack: some racks, ranks, and DIMM slots experience more faults than others. Similarly, we show the impact of temperature and power on DRAM correctable errors. Finally, we make a detailed comparison of the results presented here with the positional effects found in several previous large-scale reliability studies. The results of this analysis provide valuable guidance to organizations standing up first-in-class platforms in HPC, organizations using Arm in HPC, and the large-scale HPC community in general.