The case for semi-permanent cache occupancy
The performance critical path for MPI implementations relies on fast receive side operation, which in turn requires fast list traversal. The performance of list traversal is dependent on data-locality; whether the data is currently contained in a close-to-core cache due to its temporal locality or if its spacial locality allows for predictable pre-fetching. In this paper, we explore the effects of data locality on the MPI matching problem by examining both forms of locality. First, we explore spacial locality, by combining multiple entries into a single linked list element, we can control and modify this form of locality. Secondly, we explore temporal locality by utilizing a new technique called “hot caching”, a process that creates a thread to periodically access certain data, increasing its temporal locality. In this paper, we show that by increasing data locality, we can improve MPI performance on a variety of architectures up to 4x for micro-benchmarks and up to 2x for an application.