Publications Search

Proxy Applications: Curation and Assessment

Abstract not provided.

TYPE Presentation YEAR 2019

OSTI

Hardware Evaluation Outreach: Application development Challenges Now and for the Exascale Era

Bair, Ray; Cook, Jeanine; Donofrio, David; Kuehn, Jeff; Moore, Shirley

The intent of this document is to help programmers understand the details of contemporary and Exascale hardware system design, and how these designs create opportunities for, and place constraints on, next-generation simulation software design. We attempt to clarify hardware organization and component details for our most current and Exascale systems so that program developers understand how software needs to change to take best advantage of the available performance. For ECP, Exascale success is specifically defined as a 50x improvement over baseline in the aggregate "capability volume" on several KPP axes. Raw floating point performance is only one of these axes; the others include characteristics such as problem size, system memory size, node memory size, power, and efficiency. This multi-axis approach is particularly important to understand in the context of delivered improvements in real applications since, for instance, floating point computation may comprise less than 10% of the actual computational work required. Given the Exascale requirements and the constraints they put on the performance expectations of fundamental system components, the programmer will be forced to rethink several application implementation details in order to achieve exaflop performance on these systems. The remainder of this document presents more detail on Exascale-era system hardware and the specific areas that the programmer should address to extract performance from these systems. We attempt to give the programmer guidance at both high and low levels, providing some abstract suggestions on how to refactor codes given the expected system architectures and some low-level recommendations on how to implement these modifications. We also include a section on training resources that are helpful both for programmers who are just beginning to understand code modifications for contemporary and Exascale systems and for those who have done some refactoring and are now trying to extract maximal application performance from these systems.

TYPE Other Report YEAR 2019

DOI OSTI

Sandia ATDM DevOps and Performance Analysis

Hoekstra, Robert J.; Bartlett, Roscoe; Hammond, Simon; Cook, Jeanine; Dinge, Dennis; Frye, Joseph R.; Hughes, Clayton; Lin, Paul T.; Vaughan, Courtenay T.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Proxy Apps Mysteries Revealed

Richards, David; Cook, Jeanine; Finkel, Hal; Junghans, Christoph; McCorquodale, Peter; Moore, Shirley; Aaziz, Omar R.; Juedeman, Tanner; Vaughan, Courtenay T.; Homerding, Brian; Urma, Thomas; Bhatele, Abhinav; Andrade, Zavier; Pavel, Robert; Ramakrishnaiah, Vinay; Mintz, Tiffany; Watson, Greg

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Exploring and quantifying how communication behaviors in proxies relate to real applications

Proceedings of PMBS 2018: Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis

Aaziz, Omar R.; Cook, Jeanine; Cook, Jonathan; Vaughan, Courtenay T.

Proxy applications, or proxies, are simple applications meant to exercise systems in a way that mimics real applications (their parents). However, characterizing the relationship between the behavior of parent and proxy applications is not an easy task. In prior work [1], we presented a data-driven methodology to characterize the relationship between parent and proxy applications based on collecting runtime data from both and then using data analytics to find their correspondence or divergence. We showed that it worked well for hardware counter data, but our initial attempt using MPI function data was less satisfactory. In this paper, we present an exploratory effort at making an improved quantification of the correspondence of communication behavior for proxies and their respective parent applications. We present experimental evidence of positive results using four proxy applications from the current ECP Proxy Application Suite and their corresponding parent applications (in the ECP application portfolio). Results show that each proxy analyzed is representative of its parent with respect to communication data. In conjunction with our method presented in [1] (correspondence between computation and memory behavior), we get a strong understanding of how well a proxy predicts the comprehensive performance of its parent.

TYPE Conference Poster YEAR 2018

DOI OSTI Scopus

Application Performance Insights via System Monitoring

Brandt, James M.; Gentile, Ann C.; Hammond, Simon; Cook, Jeanine; Allan, Benjamin A.; Tucker, Thomas; Naksinehaboon, Nichamon; Taerat, Narate; Aaziz, Omar R.; Ates, Emre; Tuncer, Ozan; Egele, Manuel; Turk, Ata; Coskun, Ayse; Izadpanah, Ramin; Dechev, Damian

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

On the Use of Vectorization in Production Engineering Workloads

Vaughan, Courtenay T.; Cook, Jeanine; Benner, Robert E.; Dinge, Dennis; Lin, Paul T.; Hughes, Clayton; Hoekstra, Robert J.; Hammond, Simon

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

A Methodology for Characterizing the Correspondence Between Real and Proxy Applications

Cook, Jeanine; Aaziz, Omar R.; Cook, Jonathan; Juedeman, Tanner; Vaughan, Courtenay T.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

On the Use of Vectorization in Production Engineering Workloads

Vaughan, Courtenay T.; Hammond, Simon; Dinge, Dennis; Lin, Paul T.; Hughes, Clayton; Benner, Robert E.; Cook, Jeanine; Pase, Douglas M.; Hoekstra, Robert J.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Runtime HPC System and Application Performance Assessment and Diagnostics

Brandt, James M.; Gentile, Ann C.; Cook, Jonathan; Allan, Benjamin A.; Cook, Jeanine; Aaziz, Omar R.; Tucker, Thomas; Naksinehaboon, Nichamon; Taerat, Narate; Ates, Emre; Tuncer, Ozan; Egele, Manuel; Turk, Ata; Coskun, Ayse

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

PIMS: Memristor-Based Processing-in-Memory-and-Storage

Cook, Jeanine

Continued progress in computing has augmented the quest for higher performance with a new quest for higher energy efficiency. This has led to the re-emergence of Processing-In-Memory (PIM) architectures that offer higher density and performance with some boost in energy efficiency. Past PIM work either integrated a standard CPU with a conventional DRAM to improve the CPU-memory link, or used a bit-level processor with Single Instruction Multiple Data (SIMD) control, but neither matched the energy consumption of the memory to the computation. We originally proposed to develop a new architecture derived from PIM that more effectively addressed energy efficiency for high performance scientific, data analytics, and neuromorphic applications. We also originally planned to implement a von Neumann architecture with arithmetic/logic units (ALUs) that matched the power consumption of an advanced storage array to maximize energy efficiency. Implementing this architecture in storage was our original idea, since by augmenting storage (instead of memory), the system could address both in-memory computation and applications that accessed larger data sets directly from storage, hence Processing-in-Memory-and-Storage (PIMS). However, as our research matured, we discovered several things that changed our original direction, the most important being that a PIM that implements a standard von Neumann-type architecture results in significant energy efficiency improvement, but only about an O(10) performance improvement. In addition to this, the emergence of new memory technologies moved us to proposing a non-von Neumann architecture, called Superstrider, implemented not in storage, but in a new DRAM technology called High Bandwidth Memory (HBM). HBM is a stacked DRAM technology that includes a logic layer where an architecture such as Superstrider could potentially be implemented.

TYPE SAND Report YEAR 2018

DOI OSTI

Continuous Performance Tracking for Kokkos Applications Using LDMS

Brandt, James M.; Hammond, Simon; Tucker, Thomas; Gentile, Ann C.; Cook, Jeanine

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

Enhanced Profiling for Kokkos Applications

Hammond, Simon; Trott, Christian R.; Ibanez, Daniel A.; Edwards, Harold C.; Sunderland, Daniel; Ellingwood, Nathan D.; Brandt, James M.; Gentile, Ann C.; Cook, Jeanine; Hoekstra, Robert J.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Sandia ATDM Performance Execution Tools & Analysis

Hammond, Simon; Vaughan, Courtenay T.; Dinge, Dennis; Lin, Paul T.; Benner, Robert E.; Hughes, Clayton; Trott, Christian R.; Cook, Jeanine; Hoekstra, Robert J.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Memory System Design for Ultra Low Power Computationally Error Resilient Processor Microarchitectures

Srikanth, Sriseshan; Rabbat, Paul; Hein, Eric; Deng, Bobin; Conte, Thomas; DeBenedictis, Erik; Cook, Jeanine; Frank, Michael P.

Abstract not provided.

TYPE Conference Poster YEAR 2017

DOI OSTI

A performance study of the time-varying cache behavior: a study on APEX, Mantevo, NAS, and PARSEC

Journal of Supercomputing

Siddique, Nafiul A.; Grubel, Patricia A.; Badawy, Abdel-Hameed A.; Cook, Jeanine

Cache has long been used to minimize the latency of main memory accesses by storing frequently used data near the processor. Processor performance depends on the underlying cache performance. Therefore, significant research has been done to identify the most crucial metrics of cache performance. Although the majority of research focuses on measuring cache hit rates and data movement as the primary cache performance metrics, cache utilization is also important. We investigate the application's locality using cache utilization metrics. In addition, we present cache utilization and traditional cache performance metrics as the program progresses, providing detailed insights into the dynamic behavior of parallel applications from four benchmark suites running on multiple cores. We explore cache utilization for APEX, Mantevo, NAS, and PARSEC, which are mostly scientific benchmark suites. Our results indicate that 40% of the data bytes in a cache line are accessed at least once before line eviction. Also, on average a byte is accessed two times before the cache line is evicted for these applications. Moreover, we present runtime cache utilization as well as conventional performance metrics, which together give a holistic understanding of cache behavior. To facilitate this research, we build a memory simulator incorporated into the Structural Simulation Toolkit (Rodrigues et al. in SIGMETRICS Perform Eval Rev 38(4):37–42, 2011). Finally, our results suggest that variable cache line size can result in better performance and can also conserve power.
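
As an illustration of the cache line utilization metric described in this abstract (the fraction of a line's bytes accessed at least once before the line is evicted), the following is a minimal sketch. It is not the authors' simulator and does not use the Structural Simulation Toolkit. The function name `line_utilization`, the fully associative LRU policy, the 64-byte line size, and the four-line capacity are all assumptions chosen for illustration.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumed for illustration)
NUM_LINES = 4   # capacity of a tiny fully associative LRU cache (assumed)


def line_utilization(trace, line_size=LINE_SIZE, num_lines=NUM_LINES):
    """Return the fraction of bytes touched in each line when it leaves the cache.

    trace: iterable of (address, size) byte accesses.
    Lines still resident at the end of the trace are flushed and counted too.
    """
    cache = OrderedDict()  # line number -> set of byte offsets touched so far
    utilizations = []
    for addr, size in trace:
        for byte in range(addr, addr + size):
            line, offset = divmod(byte, line_size)
            if line in cache:
                cache.move_to_end(line)  # mark as most recently used
            else:
                if len(cache) >= num_lines:
                    # Evict the least recently used line and record its utilization.
                    _, touched = cache.popitem(last=False)
                    utilizations.append(len(touched) / line_size)
                cache[line] = set()
            cache[line].add(offset)
    # Flush the lines that remain resident at the end of the trace.
    utilizations.extend(len(touched) / line_size for touched in cache.values())
    return utilizations


# Four 8-byte reads with a 16-byte stride touch half of one 64-byte line.
strided = [(0, 8), (16, 8), (32, 8), (48, 8)]
print(line_utilization(strided))
```

A strided access pattern like the one in the usage lines leaves half of each fetched line unused, which is the kind of behavior a utilization metric exposes and a hit-rate metric alone does not.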

TYPE Journal Article YEAR 2017

DOI OSTI