Publications

Results 51–56 of 56

Summary of multi-core hardware and programming model investigations

Pedretti, Kevin P.; Kelly, Suzanne M.; Levenhagen, Michael J.

This report summarizes our investigations into multi-core processors and programming models for parallel scientific applications. The motivation for this study was to better understand the landscape of multi-core hardware, future trends, and the implications for system software on capability supercomputers. The results of this study are being used as input to the design of a new open-source lightweight kernel operating system targeted at future capability supercomputers built from multi-core processors. A goal of this effort is to create an agile system that can adapt to and efficiently support whatever multi-core hardware and programming models gain acceptance in the community.

The case for semi-permanent cache occupancy

ACM International Conference Proceeding Series

Dosanjh, Matthew D.; Ghazimirsaeed, S.M.; Grant, Ryan E.; Schonbein, William W.; Levenhagen, Michael J.; Bridges, Patrick G.; Afsahi, Ahmad

The performance-critical path of MPI implementations relies on fast receive-side operation, which in turn requires fast list traversal. List traversal performance depends on data locality: whether the data currently resides in a close-to-core cache because of its temporal locality, or whether its spatial locality allows for predictable prefetching. In this paper, we explore the effects of data locality on the MPI matching problem by examining both forms of locality. First, we explore spatial locality: by combining multiple entries into a single linked-list element, we can control and modify this form of locality. Second, we explore temporal locality using a new technique called “hot caching”, in which a dedicated thread periodically accesses certain data to increase its temporal locality. We show that increasing data locality can improve MPI performance on a variety of architectures by up to 4x for micro-benchmarks and up to 2x for an application.
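
The spatial-locality idea in this abstract, packing several match entries into one linked-list element, can be illustrated with a minimal C sketch. This is not the authors' implementation: the type and function names (match_entry, match_node, match_list_post, match_list_match), the ENTRIES_PER_NODE value, and the ANY_SOURCE wildcard are all hypothetical, and removed entries are only marked unused rather than compacted.

```c
#include <stddef.h>
#include <stdlib.h>

#define ENTRIES_PER_NODE 4   /* hypothetical; tune so a node spans few cache lines */
#define ANY_SOURCE (-1)      /* hypothetical wildcard, standing in for MPI_ANY_SOURCE */

/* One posted receive: the fields compared during matching. */
typedef struct {
    int source;
    int tag;
    int in_use;              /* 0 once the entry has been matched */
} match_entry;

/* A list element holding several entries contiguously in memory. */
typedef struct match_node {
    match_entry entries[ENTRIES_PER_NODE];
    int count;               /* slots filled so far, including matched ones */
    struct match_node *next;
} match_node;

typedef struct {
    match_node *head;
    match_node *tail;
} match_list;

/* Append a posted receive. Returns 0 on success, -1 if allocation fails. */
int match_list_post(match_list *list, int source, int tag)
{
    match_node *node = list->tail;
    if (node == NULL || node->count == ENTRIES_PER_NODE) {
        node = calloc(1, sizeof *node);
        if (node == NULL)
            return -1;
        if (list->tail != NULL)
            list->tail->next = node;
        else
            list->head = node;
        list->tail = node;
    }
    match_entry *e = &node->entries[node->count++];
    e->source = source;
    e->tag = tag;
    e->in_use = 1;
    return 0;
}

/* Match an incoming (source, tag) against the oldest compatible entry.
 * Returns 1 and retires the entry on a match, 0 otherwise. */
int match_list_match(match_list *list, int source, int tag)
{
    for (match_node *n = list->head; n != NULL; n = n->next) {
        /* The inner loop walks adjacent memory, so one pointer chase
         * serves up to ENTRIES_PER_NODE comparisons. */
        for (int i = 0; i < n->count; i++) {
            match_entry *e = &n->entries[i];
            if (e->in_use && e->tag == tag &&
                (e->source == source || e->source == ANY_SOURCE)) {
                e->in_use = 0;
                return 1;
            }
        }
    }
    return 0;
}

/* Release all nodes and reset the list to empty. */
void match_list_free(match_list *list)
{
    match_node *n = list->head;
    while (n != NULL) {
        match_node *next = n->next;
        free(n);
        n = next;
    }
    list->head = NULL;
    list->tail = NULL;
}
```

Compared with one entry per node, this layout follows the next pointer once per ENTRIES_PER_NODE entries, which is the kind of control over spatial locality the abstract describes. The hot-caching thread for temporal locality is not shown here.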
