Publications

Results 26–50 of 69
Skip to search filters

PreFAM: Understanding the Impact of Prefetching in Fabric-Attached Memory Architectures

ACM International Conference Proceeding Series

Kommareddy, Vamsee R.; Kotra, Jagadish; Hughes, Clayton H.; Hammond, Simon D.; Awad, Amro

With many recent advances in interconnect technologies and memory interfaces, disaggregated memory systems are approaching industrial adoption. For instance, the recent Gen-Z consortium focuses on a new memory semantic protocol that enables fabric-attached memories (FAM), where the memory and other compute units can be directly attached to fabric interconnects. Decoupling of memory from compute units becomes a feasible option as the rate of data transfer increases due to the emergence of novel interconnect technologies, such as Silicon Photonic Interconnects. Disaggregated memories not only enable more efficient use of capacity (minimizes under-utilization) they also allow easy integration of evolving technologies. Additionally, they simplify the programming model at the same time allowing efficient sharing of data. However, the latency of accessing the data in these Fabric Attached disaggregated Memories (FAMs) is dependent on the latency imposed by the fabric interfaces. To reduce memory access latency and to improve the performance of FAM systems, in this paper, we explore techniques to prefetch data from FAMs to the local memory present in the node (PreFAM). We realize that since the memory access latency is high in FAMs, prefetching a cache block (64 bytes) from FAM can be inefficient, since the possibility of issuing demand requests before the completion of prefetch requests, to the same FAM locations, is high. Hence, we explore predicting and prefetching FAM blocks at a distance; prefetching blocks which are going to be accessed in future but not immediately. We show that, with prefetching, the performance of FAM architectures increases by 38.84%, while memory access latency is improved by 39.6%, with only 17.65% increase in the number of accesses to the FAM, on average. Further, by prefetching at a distance we show a performance improvement of 72.23%.

More Details

Page migration support for disaggregated non-volatile memories

ACM International Conference Proceeding Series

Kommareddy, Vamsee R.; Hammond, Simon D.; Hughes, Clayton H.; Samih, Ahmad; Awad, Amro

As demands for memory-intensive applications continue to grow, the memory capacity of each computing node is expected to grow at a similar pace. In high-performance computing (HPC) systems, the memory capacity per compute node is decided upon the most demanding application that would likely run on such system, and hence the average capacity per node in future HPC systems is expected to grow significantly. However, since HPC systems run many applications with different capacity demands, a large percentage of the overall memory capacity will likely be underutilized; memory modules can be thought of as private memory for its corresponding computing node. Thus, as HPC systems are moving towards the exascale era, a better utilization of memory is strongly desired. Moreover, upgrading memory system requires significant efforts. Fortunately, disaggregated memory systems promise better utilization by defining regions of global memory, typically referred to as memory blades, which can be accessed by all computing nodes in the system, thus achieving much better utilization. Disaggregated memory systems are expected to be built using dense, power-efficient memory technologies. Thus, emerging nonvolatile memories (NVMs) are placing themselves as the main building blocks for such systems. However, NVMs are slower than DRAM. Therefore, it is expected that each computing node would have a small local memory that is based on either HBM or DRAM, whereas a large shared NVM memory would be accessible by all nodes. Managing such system with global and local memory requires a novel hardware/software co-design to initiate page migration between global and local memory to maximize performance while enabling access to huge shared memory. In this paper we provide support to migrate pages, investigate such memory management aspects and the major system-level aspects that can affect design decisions in disaggregated NVM systems

More Details

Balar: A SST GPU Component for Performance Modeling and Profiling

Hughes, Clayton H.; Hammond, Simon D.; Khairy, Mahmoud K.; Zhang, Mengchi Z.; Green, Roland G.; Rogers, Timothy R.; Hoekstra, Robert J.

Programmable accelerators have become commonplace in modern computing systems. Advances in programming models and the availability of massive amounts of data have created a space for massively parallel accelerators capable of maintaining context for thousands of concurrent threads resident on-chip. These threads are grouped and interleaved on a cycle-by-cycle basis among several massively parallel computing cores. One path for the design of future supercomputers relies on an ability to model the performance of these massively parallel cores at scale. The SST framework has been proven to scale up to run simulations containing tens of thousands of nodes. A previous report described the initial integration of the open-source, execution-driven GPU simulator, GPGPU-Sim, into the SST framework. This report discusses the results of the integration and how to use the new GPU component in SST. It also provides examples of what it can be used to analyze and a correlation study showing how closely the execution matches that of a Nvidia V100 GPU when running kernels and mini-apps.

More Details

Evaluating the Marvell ThunderX2 Server Processor for HPC Workloads

2019 International Conference on High Performance Computing and Simulation, HPCS 2019

Hammond, Simon D.; Hughes, Clayton H.; Levenhagen, Michael J.; Vaughan, Courtenay T.; Younge, Andrew J.; Schwaller, Benjamin S.; Aguilar, Michael J.; Pedretti, Kevin P.; Laros, James H.

The high performance computing industry is undergoing a period of substantial change. Not least because of fabrication and lithographic challenges in the manufacturing of next-generation processors. As such challenges mount, the industry is looking to generate higher performance from additional functionality in the micro-architecture space as well as a greater emphasis on efficiency in the design of networkon-chip resources and memory subsystems. Such variation in design opens opportunities for new entrants in the data center and server markets where varying compute-to-memory ratios can present end users with more efficient node designs for particular workloads. In this paper we compare the recently released Marvell ThunderX2 Arm processor - arguably the first high-performance computing capable Arm design available in the marketplace. We perform a set of micro-benchmarking and mini-application evaluation on the ThunderX2 comparing it with Intel's Haswell and Skylake Xeon server parts commonly used in contemporary HPC designs. Our findings show that no one processor performs the best across all benchmarks, but that the ThunderX2 excels in areas demanding high memory bandwidth due to the provisioning of more memory channels in its design. We conclude that the ThunderX2 is a serious contender in the HPC server segment and has the potential to offer supercomputing sites with a viable high-performance alternative to existing designs from established industry players.

More Details

Investigating Fairness in Disaggregated Non-Volatile Memories

Proceedings of IEEE Computer Society Annual Symposium on VLSI, ISVLSI

Kommareddy, Vamsee R.; Hughes, Clayton H.; Hammond, Simon D.; Awad, Amro

Many applications have growing demands for memory, particularly in the HPC space, making the memory system a potential bottleneck of next-generation computing systems. Sharing the memory system across processor sockets and nodes becomes a compelling argument given that memory technology is scaling at a slower rate than processor technology. Moreover, as many applications rely on shared data, e.g., graph applications and database workloads, having a large number of nodes accessing shared memory allows for efficient use of resources and avoids duplicating huge files, which can be infeasible for large graphs or scientific data. As new memory technologies come on the market, the flexibility of upgrading memory and system updates become major a concern, disaggregated memory systems where memory is shared across different computing nodes, e.g., System-on-Chip (SoC), is expected to become the most common design/architecture on memory-centric systems, e.g., The Machine project from HP Labs. However, due to the nature of such systems, different users and applications compete for the available memory bandwidth, which can lead to severe contention due to memory traffic from different SoCs. In this paper, we discuss the contention problem in disaggregated memory systems and suggest mechanisms to ensure memory fairness and enforce QoS. Our simulation results show that employing our proposed QoS techniques can speed up memory response time by up to 55%.

More Details

SST_GPU: An Execution -Driven CUDA Kernel Scheduler and Streaming-Multiprocessor Compute Model

Khairy, Mahmoud K.; Zhang, Mengchi Z.; Green, Roland G.; Hammond, Simon D.; Hoekstra, Robert J.; Rogers, Timothy R.; Hughes, Clayton H.

Programmable accelerators have become commonplace in modern computing systems. Advances in programming models and the availability of massive amounts of data have created a space for massively parallel acceleration where the context for thousands of concurrent threads are resident on-chip. These threads are grouped and interleaved on a cycle-by-cycle basis among several mas- sively parallel computing cores. The design of future supercomputers relies on an ability to model the performance of these massively parallel cores at scale. To address the need for a scalable, decentralized GPU model that can model large GPUs, chiplet- based GPUs and multi-node GPUs, this report details the first steps in integrating the open-source, execution driven GPGPU-Sim into the SST framework. The first stage of this project, creates two elements: a kernel scheduler SST element accepts work from SST CPU models and schedules it to an SM-collection element that performs cycle-by-cycle timing using SSTs Mem Hierarchy to model a flexible memory system.

More Details
Results 26–50 of 69
Results 26–50 of 69