Publications

Results 26–50 of 120
Skip to search filters

Performance analysis for using non-volatile memory DIMMs: Opportunities and challenges

ACM International Conference Proceeding Series

Awad, Amro A.; Hammond, Simon D.; Hughes, Clayton H.; Rodrigues, Arun; Hemmert, Karl S.; Hoekstra, Robert J.

DRAM scalability is becoming more challenging, pushing the focus of the research community towards alternative memory technologies. Many emerging non-volatile memory (NVM) devices are proving themselves to be good candidates to replace DRAM in the coming years. For example, the recently announced 3D-XPoint memory by Intel/Micron promises latencies that are comparable to DRAM, while being non-volatile and much more dense. While emerging NVMs can be fabricated in different form factors, the most promising (from a performance perspective) are NVM-based DIMMs. Unfortunately, there is a shortage of studies that explore the design options for NVM-based DIMMs. Because of the read and write asymmetries in both power consumption and latency, as well as limited write endurance, which often requires wear-leveling techniques, NVMs require a specialized controller. The fact that future on-die memory controllers are expected to handle different memory technologies pushes future hardware towards on-DIMM controllers. In this paper, we propose an architectural model for NVM-based DIMMs with internal controllers, explore their design space, evaluate different optimizations and reach out to several architectural suggestions. Finally, we make our model publicly available and integrate it with a widely used architectural simulator.

More Details

Final Review of FY17 ASC CSSE L2 Milestone #6018 entitled "Analyzing Power Usage Characteristics of Workloads Running on Trinity"

Hoekstra, Robert J.; Hammond, Simon D.; Hemmert, Karl S.; Gentile, Ann C.; Oldfield, Ron A.; Lang, Mike L.; Martin, Steve M.

The presentation documented the technical approach of the team and summary of the results with sufficient detail to demonstrate both the value and the completion of the milestone. A separate SAND report was also generated with more detail to supplement the presentation.

More Details

Unveiling the Interplay between Global Link Arrangements and Network Management Algorithms on Dragonfly Networks

Proceedings - 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017

Kaplan, Fulya; Tuncer, Ozan; Leung, Vitus J.; Hemmert, Karl S.; Coskun, Ayse K.

More Details

Two-level main memory co-design: Multi-threaded algorithmic primitives, analysis, and simulation

Journal of Parallel and Distributed Computing

Bender, Michael A.; Berry, Jonathan W.; Hammond, Simon D.; Hemmert, Karl S.; McCauley, Samuel; Moore, Branden J.; Moseley, Benjamin; Phillips, Cynthia A.; Resnick, David R.; Rodrigues, Arun

A challenge in computer architecture is that processors often cannot be fed data from DRAM as fast as CPUs can consume it. Therefore, many applications are memory-bandwidth bound. With this motivation and the realization that traditional architectures (with all DRAM reachable only via bus) are insufficient to feed groups of modern processing units, vendors have introduced a variety of non-DDR 3D memory technologies (Hybrid Memory Cube (HMC),Wide I/O 2, High Bandwidth Memory (HBM)). These offer higher bandwidth and lower power by stacking DRAM chips on the processor or nearby on a silicon interposer. We will call these solutions “near-memory,” and if user-addressable, “scratchpad.” High-performance systems on the market now offer two levels of main memory: near-memory on package and traditional DRAM further away. In the near term we expect the latencies near-memory and DRAM to be similar. Thus, it is natural to think of near-memory as another module on the DRAM level of the memory hierarchy. Vendors are expected to offer modes in which the near memory is used as cache, but we believe that this will be inefficient. In this paper, we explore the design space for a user-controlled multi-level main memory. Our work identifies situations in which rewriting application kernels can provide significant performance gains when using near-memory. We present algorithms designed for two-level main memory, using divide-and-conquer to partition computations and streaming to exploit data locality. We consider algorithms for the fundamental application of sorting and for the data analysis kernel k-means. Our algorithms asymptotically reduce memory-block transfers under certain architectural parameter settings. We use and extend Sandia National Laboratories’ SST simulation capability to demonstrate the relationship between increased bandwidth and improved algorithmic performance. Memory access counts from simulations corroborate predicted performance improvements for our sorting algorithm. In contrast, the k-means algorithm is generally CPU bound and does not improve when using near-memory except under extreme conditions. These conditions require large instances that rule out SST simulation, but we demonstrate improvements by running on a customized machine with high and low bandwidth memory. These case studies in co-design serve as positive and cautionary templates, respectively, for the major task of optimizing the computational kernels of many fundamental applications for two-level main memory systems.

More Details

The Portals 4.1 Network Programming Interface

Barrett, Brian W.; Brightwell, Ronald B.; Grant, Ryan E.; Hemmert, Karl S.; Pedretti, Kevin P.; Wheeler, Kyle W.; Underwood, Keith; Riesen, Rolf R.; Maccabe, Arthur B.; Hudson, Trammel H.

This report presents a specification for the Portals 4 networ k programming interface. Portals 4 is intended to allow scalable, high-performance network communication betwee n nodes of a parallel computing system. Portals 4 is well suited to massively parallel processing and embedded syste ms. Portals 4 represents an adaption of the data movement layer developed for massively parallel processing platfor ms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 is tar geted to the next generation of machines employing advanced network interface architectures that support enh anced offload capabilities.

More Details

(SAI) stalled, active and idle: Characterizing power and performance of large-scale dragonfly networks

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Groves, Taylor G.; Grant, Ryan E.; Hemmert, Karl S.; Hammond, Simon D.; Levenhagen, Michael J.; Arnold, Dorian C.

Exascale networks are expected to comprise a significant part of the total monetary cost and 10-20% of the power budget allocated to exascale systems. Yet, our understanding of current and emerging workloads on these networks is limited. Left ignored, this knowledge gap likely will translate into missed opportunities for (1) improved application performance and (2) decreased power and monetary costs in next generation systems. This work targets a detailed understanding and analysis of the performance and utilization of the dragonfly network topology. Using the Structural Simulation Toolkit (SST) and a range of relevant workloads on a dragonfly topology of 110,592 nodes, we examine network design tradeoffs amongst execution time, power, bandwidth, and the number of global links. Our simulations report stalled, active and idle time on a per-port level of the fabric, in order to provide a detailed picture of future networks. The results of this work show potential savings of 3-10% of the exascale power budget and provide valuable insights to researchers looking for new opportunities to improve performance and increase power efficiency of next generation HPC systems.

More Details

Abstract Machine Models and Proxy Architectures for Exascale Computing

Ang, James A.; Barrett, Richard F.; Benner, R.E.; Burke, Daniel B.; Chan, Cy P.; Cook, Jeanine C.; Daley, Christopher D.; Donofrio, Dave D.; Hammond, Simon D.; Hemmert, Karl S.; Hoekstra, Robert J.; Ibrahim, Khaled I.; Kelly, Suzanne M.; Le, Hoang L.; Leung, Vitus J.; Michelogiannakis, George M.; Resnick, David R.; Rodrigues, Arun; Shalf, John S.; Stark, Dylan S.; Unat, D.U.; Wright, Nick W.; Voskuilen, Gwendolyn R.

Machine Models and Proxy Architectures for Exascale Computing Version 2.0 Prepared by Sandia National Laboratories Albuquerque, New Mexico 87185 and Livermore, California 94550 Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000. Approved for public release; further dissemination unlimited. Issued by Sandia National Laboratories, operated for the United States Department of Energy by Sandia Corporation. NOTICE: This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government, nor any agency thereof, nor any of their employees, nor any of their contractors, subcontractors, or their employees, make any warranty, express or implied, or assume any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or rep- resent that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government, any agency thereof, or any of their contractors or subcontractors. The views and opinions expressed herein do not necessarily state or reflect those of the United States Government, any agency thereof, or any of their contractors. Printed in the United States of America. This report has been reproduced directly from the best available copy. Available to DOE and DOE contractors from U.S. Department of Energy Office of Scientific and Technical Information P.O. Box 62 Oak Ridge, TN 37831 Telephone: (865) 576-8401 Facsimile: (865) 576-5728 E-Mail: reports@adonis.osti.gov Online ordering: http://www.osti.gov/bridge Available to the public from U.S. Department of Commerce National Technical Information Service 5285 Port Royal Rd Springfield, VA 22161 Telephone: (800) 553-6847 Facsimile: (703) 605-6900 E-Mail: orders@ntis.fedworld.gov Online ordering: http://www.ntis.gov/help/ordermethods.asp?loc=7-4-0#online D E P A R T M E N T O F E N E R G Y * * U N I T E D S T A T E S O F A M E R I C A SAND2016-6049 Unlimited Release Printed Abstract Machine Models and Proxy Architectures for Exascale Computing Version 2.0 J.A. Ang 1 , R.F. Barrett 1 , R.E. Benner 1 , D. Burke 2 , C. Chan 2 , J. Cook 1 , C.S. Daley 2 , D. Donofrio 2 , S.D. Hammond 1 , K.S. Hemmert 1 , R.J. Hoekstra 1 , K. Ibrahim 2 , S.M. Kelly 1 , H. Le, V.J. Leung 1 , G. Michelogiannakis 2 , D.R. Resnick 1 , A.F. Rodrigues 1 , J. Shalf 2 , D. Stark, D. Unat, N.J. Wright 2 , G.R. Voskuilen 1 1 1 Sandia National Laboratories, P.O. Box 5800, Albuquerque, New Mexico 87185-MS 1319 2 Lawrence Berkeley National Laboratory, Berkeley, California Abstract To achieve exascale computing, fundamental hardware architectures must change. The most sig- nificant consequence of this assertion is the impact on the scientific and engineering applications that run on current high performance computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. In order to adapt to exascale architectures, developers must be able to reason about new hardware and deter- mine what programming models and algorithms will provide the best blend of performance and energy efficiency into the future. While many details of the exascale architectures are undefined, an abstract machine model is designed to allow application developers to focus on the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. We use the term proxy architecture to describe a parameterized version of an abstract machine model, with the parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models are formulated to enable discussion between the developers of analytic models and simulators and computer hardware archi- tects. They allow for application performance analysis and hardware optimization opportunities. In this report our goal is to provide the application development community with a set of mod- els that can help software developers prepare for exascale. In addition, through the use of proxy architectures, we can enable a more concrete exploration of how well new and evolving applica- tion codes map onto future architectures. This second version of the document addresses system scale considerations and provides a system-level abstract machine model with proxy architecture information.

More Details

Two-level main memory co-design: Multi-threaded algorithmic primitives, analysis, and simulation

Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2015

Bender, Michael A.; Berry, Jonathan W.; Hammond, Simon D.; Hemmert, Karl S.; McCauley, Samuel; Moore, Branden J.; Moseley, Benjamin; Phillips, Cynthia A.; Resnick, David R.; Rodrigues, Arun

A fundamental challenge for supercomputer architecture is that processors cannot be fed data from DRAM as fast as CPUs can consume it. Therefore, many applications are memory-bandwidth bound. As the number of cores per chip increases, and traditional DDR DRAM speeds stagnate, the problem is only getting worse. A variety of non-DDR 3D memory technologies (Wide I/O 2, HBM) offer higher bandwidth and lower power by stacking DRAM chips on the processor or nearby on a silicon interposer. However, such a packaging scheme cannot contain sufficient memory capacity for a node. It seems likely that future systems will require at least two levels of main memory: high-bandwidth, low-power memory near the processor and low-bandwidth high-capacity memory further away. This near memory will probably not have significantly faster latency than the far memory. This, combined with the large size of the near memory (multiple GB) and power constraints, may make it difficult to treat it as a standard cache. In this paper, we explore some of the design space for a user-controlled multi-level main memory. We present algorithms designed for the heterogeneous bandwidth, using streaming to exploit data locality. We consider algorithms for the fundamental application of sorting. Our algorithms asymptotically reduce memory-block transfers under certain architectural parameter settings. We use and extend Sandia National Laboratories' SST simulation capability to demonstrate the relationship between increased bandwidth and improved algorithmic performance. Memory access counts from simulations corroborate predicted performance. This co-design effort suggests implementing two-level main memory systems may improve memory performance in fundamental applications.

More Details
Results 26–50 of 120
Results 26–50 of 120