HPC Architectures Beyond Exascale
MLIR (Multi-Level Intermediate Representation) is an extensible compiler framework that supports high-level data structures and operation constructs. These higher-level code representations are particularly applicable to the artificial intelligence and machine learning (AI/ML) domain, allowing developers to more easily support upcoming heterogeneous AI/ML accelerators and to develop flexible domain-specific compilers/frameworks with higher-level intermediate representations (IRs) and advanced compiler optimizations. Using MLIR within the LLVM compiler framework is expected to yield significant improvements in the quality of generated machine code, which in turn will result in improved performance and hardware efficiency.
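For illustration only, the short C++ sketch below constructs a trivial module and function with the MLIR API, the kind of higher-level IR entity that later passes would progressively lower toward LLVM IR. It is a minimal sketch assuming a recent LLVM/MLIR release; header paths and names such as mlir::func::FuncOp vary between releases and are assumptions here, not details taken from the work above.

```cpp
// Minimal sketch (assumes a recent LLVM/MLIR release; names vary by version).
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/Builders.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/MLIRContext.h"

int main() {
  mlir::MLIRContext context;
  context.loadDialect<mlir::func::FuncDialect>();   // register the func dialect

  mlir::OpBuilder builder(&context);
  mlir::ModuleOp module = mlir::ModuleOp::create(builder.getUnknownLoc());

  // Declare an empty function @kernel(); optimization and lowering passes
  // would later transform such high-level operations toward machine code.
  auto funcType = builder.getFunctionType(mlir::TypeRange{}, mlir::TypeRange{});
  auto func = mlir::func::FuncOp::create(builder.getUnknownLoc(), "kernel", funcType);
  module.push_back(func);

  module.dump();   // print the textual IR
  return 0;
}
```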
Proceedings - International Symposium on High-Performance Computer Architecture
Non-volatile memories (NVMs) have the characteristics of both traditional storage systems (persistent) and traditional memory systems (byte-addressable). However, they suffer from high write latency and have a limited write endurance. Researchers have proposed hybrid memory systems that combine DRAM and NVM, utilizing the lower latency of the DRAM to hide some of the shortcomings of the NVM, improving the system's performance by caching resident NVM data in the DRAM. However, this can nullify the persistency of the cached pages, raising a trade-off between performance and reliability. In this paper, we propose Stealth-Persist, a novel architectural support feature that allows applications that need persistence to run in the DRAM while maintaining the persistency features provided by the NVM. Stealth-Persist creates the illusion of a persistent memory for the application to use, while utilizing the DRAM for performance optimizations. Our experimental results show that Stealth-Persist improves performance by 42.02% for persistent applications.
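As a purely software-level analogue of that idea (not the hardware mechanism proposed in the paper), the hedged C++ sketch below serves reads and writes from a fast DRAM copy while persisting every update to an NVM-backed copy before it is considered complete; all names, such as PersistentCache, are invented for this sketch.

```cpp
// Simplified illustration: keep a fast working copy in "DRAM" and persist every
// update to "NVM" first, so acknowledged writes survive a crash. This is a toy
// software model, not the Stealth-Persist hardware design.
#include <cstdint>
#include <iostream>
#include <unordered_map>

class PersistentCache {                            // hypothetical name
 public:
  void write(uint64_t addr, uint64_t value) {
    nvm_[addr] = value;    // persist first (stands in for an NVM write + fence)
    dram_[addr] = value;   // then update the fast DRAM copy
  }

  uint64_t read(uint64_t addr) {
    auto it = dram_.find(addr);
    if (it != dram_.end()) return it->second;      // fast path: DRAM hit
    return dram_[addr] = nvm_[addr];               // miss: fill from NVM
  }

 private:
  std::unordered_map<uint64_t, uint64_t> dram_;    // volatile, low latency
  std::unordered_map<uint64_t, uint64_t> nvm_;     // persistent, higher latency
};

int main() {
  PersistentCache mem;
  mem.write(0x1000, 42);
  std::cout << mem.read(0x1000) << "\n";           // served from DRAM; 42 is durable
}
```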
Proceedings - International Symposium on High-Performance Computer Architecture
The exponential growth of data has driven technology providers to develop new protocols, such as cache-coherent interconnects and memory-semantic fabrics, to help users and facilities leverage advances in memory technologies to satisfy these growing memory and storage demands. Using these new protocols, fabric-attached memories (FAM) can be directly attached to a system interconnect and easily integrated with a variety of processing elements (PEs). Moreover, systems that support FAM can be smoothly upgraded and allow multiple PEs to share the FAM memory pools using well-defined protocols. The sharing of FAM between PEs allows efficient data sharing, improves memory utilization, reduces cost by allowing flexible integration of different PEs and memory modules from several vendors, and makes it easier to upgrade the system. One promising use case for FAMs is in high-performance computing (HPC) systems, where the underutilization of memory is a major challenge. However, adopting FAMs in HPC systems brings new challenges. In addition to cost, flexibility, and efficiency, one particular problem that requires rethinking is virtual memory support for security and performance. To address these challenges, this paper presents decoupled access control and address translation (DeACT), a novel virtual memory implementation that supports HPC systems equipped with FAM. Compared to the state-of-the-art two-level translation approach, DeACT achieves a speedup of up to 4.59x (1.8x on average) without compromising security. (Part of this work was done while Vamsee was working under the supervision of Amro Awad at UCF; Amro Awad is now with the ECE Department at NC State.)
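The toy C++ sketch below conveys the general notion of decoupling a coarse access-control check from page-granularity address translation, so the two lookups need not be serialized; it is written only for this summary, uses invented identifiers throughout, and does not reflect DeACT's actual implementation.

```cpp
// Toy model of decoupled access control and translation (not DeACT itself):
// a coarse region-permission check proceeds independently of the
// page-granularity translation instead of being serialized behind it.
#include <cstdint>
#include <iostream>
#include <optional>
#include <unordered_map>

constexpr uint64_t kPageBits = 12;     // 4 KiB pages
constexpr uint64_t kRegionBits = 30;   // 1 GiB permission regions

struct Fam {
  std::unordered_map<uint64_t, bool> region_writable;   // access control
  std::unordered_map<uint64_t, uint64_t> page_table;    // translation

  // Returns the FAM physical address, or nullopt if denied or unmapped.
  std::optional<uint64_t> access(uint64_t va, bool is_write) const {
    // 1) Coarse access-control check, keyed by large region.
    auto perm = region_writable.find(va >> kRegionBits);
    if (perm == region_writable.end() || (is_write && !perm->second))
      return std::nullopt;
    // 2) Fine-grained translation, keyed by page; in hardware the two
    //    lookups could overlap instead of being serialized.
    auto pte = page_table.find(va >> kPageBits);
    if (pte == page_table.end()) return std::nullopt;
    return (pte->second << kPageBits) | (va & ((1ULL << kPageBits) - 1));
  }
};

int main() {
  Fam fam;
  fam.region_writable[0] = true;            // region 0 is writable
  fam.page_table[0x2] = 0x7;                // virtual page 0x2 -> physical frame 0x7
  if (auto pa = fam.access(0x2ABC, /*is_write=*/true))
    std::cout << std::hex << *pa << "\n";   // prints 7abc
}
```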
Programmable accelerators have become commonplace in modern computing systems. Advances in programming models and the availability of unprecedented amounts of data have created a space for massively parallel accelerators capable of maintaining context for thousands of concurrent threads resident on-chip. These threads are grouped and interleaved on a cycle-by-cycle basis among several massively parallel computing cores. One path for the design of future supercomputers relies on an ability to model the performance of these massively parallel cores at scale. The SST framework has been proven to scale up to run simulations containing tens of thousands of nodes. A previous report described the initial integration of the open-source, execution-driven GPU simulator GPGPU-Sim into the SST framework. This report discusses the results of the integration and how to use the new GPU component in SST. It also provides examples of what it can be used to analyze, along with a correlation study showing how closely the execution matches that of an NVIDIA V100 GPU when running kernels and mini-apps.
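As a rough illustration of the cycle-by-cycle interleaving mentioned above (and entirely unrelated to the actual SST or GPGPU-Sim code), the following C++ sketch issues one instruction per cycle from a round-robin set of thread groups (warps).

```cpp
// Toy illustration of cycle-by-cycle interleaving of warps on a parallel core;
// not taken from SST or GPGPU-Sim.
#include <cstdio>
#include <vector>

struct Warp { int id; int remaining_instructions; };

int main() {
  std::vector<Warp> warps = {{0, 3}, {1, 2}, {2, 3}};
  int cycle = 0;
  size_t next = 0;
  // Each cycle a round-robin scheduler issues one instruction from the next
  // ready warp, hiding latency by switching warps every cycle.
  while (!warps.empty()) {
    next %= warps.size();
    Warp &w = warps[next];
    std::printf("cycle %d: issue from warp %d\n", cycle++, w.id);
    if (--w.remaining_instructions == 0)
      warps.erase(warps.begin() + next);   // warp finished; do not advance
    else
      ++next;                              // move on to the next warp
  }
}
```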
ACM International Conference Proceeding Series
With many recent advances in interconnect technologies and memory interfaces, disaggregated memory systems are approaching industrial adoption. For instance, the recent Gen-Z consortium focuses on a new memory-semantic protocol that enables fabric-attached memories (FAM), where memory and other compute units can be directly attached to fabric interconnects. Decoupling memory from compute units becomes a feasible option as data transfer rates increase with the emergence of novel interconnect technologies, such as silicon photonic interconnects. Disaggregated memories not only enable more efficient use of capacity (minimizing under-utilization), but also allow easy integration of evolving technologies. Additionally, they simplify the programming model while allowing efficient sharing of data. However, the latency of accessing data in these fabric-attached disaggregated memories (FAMs) depends on the latency imposed by the fabric interfaces. To reduce memory access latency and improve the performance of FAM systems, in this paper we explore techniques to prefetch data from FAMs to the local memory present in the node (PreFAM). Since the memory access latency of FAMs is high, prefetching a single cache block (64 bytes) from FAM can be inefficient: demand requests to the same FAM locations are likely to be issued before the corresponding prefetch requests complete. Hence, we explore predicting and prefetching FAM blocks at a distance, i.e., prefetching blocks that will be accessed in the future but not immediately. We show that, with prefetching, the performance of FAM architectures increases by 38.84%, while memory access latency improves by 39.6%, with only a 17.65% increase in the number of accesses to the FAM, on average. Further, by prefetching at a distance we show a performance improvement of 72.23%.
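To make the distance idea concrete, the hedged C++ sketch below implements a simple sequential-stream prefetcher that, on a demand access to block B, prefetches block B + D rather than B + 1, giving the slow fabric-attached memory time to return data before it is demanded; it is a simplified stand-in, not the PreFAM predictor, and names such as DistancePrefetcher are invented here.

```cpp
// Hedged sketch of prefetching "at a distance" for a sequential stream;
// simplified, not the PreFAM design.
#include <cstdint>
#include <cstdio>
#include <unordered_set>

class DistancePrefetcher {                          // hypothetical name
 public:
  explicit DistancePrefetcher(uint64_t distance) : distance_(distance) {}

  // Called on every demand access (block granularity, e.g., 64-byte blocks).
  void on_demand_access(uint64_t block) {
    if (block == last_block_ + 1) {                 // sequential stream detected
      uint64_t target = block + distance_;
      if (issued_.insert(target).second)            // avoid duplicate prefetches
        std::printf("prefetch block %llu\n", (unsigned long long)target);
    }
    last_block_ = block;
  }

 private:
  uint64_t distance_;
  uint64_t last_block_ = UINT64_MAX - 1;            // sentinel: no stream yet
  std::unordered_set<uint64_t> issued_;
};

int main() {
  DistancePrefetcher pf(/*distance=*/8);
  for (uint64_t b = 100; b < 110; ++b)              // demand accesses to blocks 100..109
    pf.on_demand_access(b);
  // Issues prefetches for blocks 109..117, each ahead of its demand access.
}
```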
With the dawn of the exascale era, computer scientists and engineers are faced with tremendous challenges across all facets of the HPC system - scalability, performance, reliability, and power consumption. In particular, the power-performance benefit from one processor generation to the next is seeing ever-diminishing returns and will require fundamental changes in the way we approach computation. In fact, it is likely that different applications will require different types of accelerators in order to meet power, performance, and reliability requirements at scale. One potential type of accelerator, a dataflow architecture, diverges from the traditional sequentially executed instruction model into one that reflects the inherent instruction-level parallelism in a program. This work presents the initial steps toward a tool that can extract the control-dataflow graph from an application.
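The following C++ sketch shows, in a greatly simplified form, the kind of dataflow-edge extraction such a tool performs: it scans a straight-line three-address sequence and emits producer-to-consumer (def-use) edges; the instruction format and names are invented for this illustration.

```cpp
// Simplified sketch of extracting dataflow (def-use) edges from straight-line
// three-address code; a toy stand-in for control-dataflow graph extraction.
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

struct Instr { std::string def; std::vector<std::string> uses; };

int main() {
  // t2 = a + b;  t3 = t2 * c;  d = t3 - a;
  std::vector<Instr> program = {
      {"t2", {"a", "b"}}, {"t3", {"t2", "c"}}, {"d", {"t3", "a"}}};

  std::unordered_map<std::string, int> last_def;    // value name -> defining instruction
  for (int i = 0; i < (int)program.size(); ++i) {
    for (const auto &use : program[i].uses) {
      auto it = last_def.find(use);
      if (it != last_def.end())                     // dataflow edge: producer -> consumer
        std::printf("instr %d -> instr %d (via %s)\n", it->second, i, use.c_str());
    }
    last_def[program[i].def] = i;                   // record this definition
  }
}
```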
ACM International Conference Proceeding Series
As demands for memory-intensive applications continue to grow, the memory capacity of each computing node is expected to grow at a similar pace. In high-performance computing (HPC) systems, the memory capacity per compute node is sized for the most demanding application likely to run on the system, and hence the average capacity per node in future HPC systems is expected to grow significantly. However, since HPC systems run many applications with different capacity demands, a large percentage of the overall memory capacity will likely be underutilized; each memory module can be thought of as private memory for its corresponding computing node. Thus, as HPC systems move toward the exascale era, better utilization of memory is strongly desired. Moreover, upgrading the memory system requires significant effort. Fortunately, disaggregated memory systems promise better utilization by defining regions of global memory, typically referred to as memory blades, which can be accessed by all computing nodes in the system, thus achieving much better utilization. Disaggregated memory systems are expected to be built using dense, power-efficient memory technologies. Thus, emerging nonvolatile memories (NVMs) are positioning themselves as the main building blocks for such systems. However, NVMs are slower than DRAM. Therefore, it is expected that each computing node would have a small local memory based on either HBM or DRAM, whereas a large shared NVM memory would be accessible by all nodes. Managing such a system with global and local memory requires a novel hardware/software co-design that initiates page migration between global and local memory to maximize performance while enabling access to huge shared memory. In this paper, we provide support for migrating pages and investigate the memory-management and major system-level aspects that can affect design decisions in disaggregated NVM systems.
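A hedged C++ sketch of one possible counter-based policy for such migration is shown below: NVM-resident pages accumulate access counts and are migrated into the small local memory once they become hot, evicting the coldest local page when capacity is exceeded; the policy, thresholds, and names (e.g., Migrator) are invented for illustration and are not the co-design studied in the paper.

```cpp
// Toy hot-page migration policy between a small local DRAM/HBM and a large
// shared NVM pool; illustrative only, not the paper's hardware/software co-design.
#include <cstdint>
#include <cstdio>
#include <unordered_map>

class Migrator {                                     // hypothetical name
 public:
  Migrator(size_t local_capacity, uint32_t threshold)
      : capacity_(local_capacity), threshold_(threshold) {}

  // Record an access; returns true if the page is (now) in local memory.
  bool access(uint64_t page) {
    if (local_.count(page)) { ++local_[page]; return true; }
    if (++nvm_hits_[page] >= threshold_) migrate(page);
    return local_.count(page) > 0;
  }

 private:
  void migrate(uint64_t page) {
    if (local_.size() >= capacity_) {                // evict the coldest local page
      auto coldest = local_.begin();
      for (auto it = local_.begin(); it != local_.end(); ++it)
        if (it->second < coldest->second) coldest = it;
      std::printf("evict page %llu to NVM\n", (unsigned long long)coldest->first);
      local_.erase(coldest);
    }
    std::printf("migrate page %llu to local memory\n", (unsigned long long)page);
    local_[page] = nvm_hits_[page];
    nvm_hits_.erase(page);
  }

  size_t capacity_;
  uint32_t threshold_;
  std::unordered_map<uint64_t, uint32_t> local_;     // local page -> access count
  std::unordered_map<uint64_t, uint32_t> nvm_hits_;  // NVM page -> access count
};

int main() {
  Migrator m(/*local_capacity=*/2, /*threshold=*/3);
  for (int i = 0; i < 3; ++i) m.access(7);           // page 7 becomes hot and migrates
  std::printf("page 7 is local: %d\n", (int)m.access(7));
}
```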