Publications

Results 1–50 of 119
Skip to search filters

A model for resource-aware load balancing on heterogeneous clusters

Proposed for publication in the IEEE Transactions on Parallel and Distributed Systems.

Devine, Karen D.

We address the problem of partitioning and dynamic load balancing on clusters with heterogeneous hardware resources. We propose DRUM, a model that encapsulates hardware resources and their interconnection topology. DRUM provides monitoring facilities for dynamic evaluation of communication, memory, and processing capabilities. Heterogeneity is quantified by merging the information from the monitors to produce a scalar number called 'power.' This power allows DRUM to be used easily by existing load-balancing procedures such as those in the Zoltan Toolkit while placing minimal burden on application programmers. We demonstrate the use of DRUM to guide load balancing in the adaptive solution of a Laplace equation on a heterogeneous cluster. We observed a significant reduction in execution time compared to traditional methods.

More Details

A parallel graph algorithm for detecting mesh singularities in distributed memory ice sheet simulations

ACM International Conference Proceeding Series

Bogle, Ian; Devine, Karen D.; Perego, Mauro P.; Rajamanickam, Sivasankaran R.; Slota, George M.

We present a new, distributed-memory parallel algorithm for detection of degenerate mesh features that can cause singularities in ice sheet mesh simulations. Identifying and removing mesh features such as disconnected components (icebergs) or hinge vertices (peninsulas of ice detached from the land) can significantly improve the convergence of iterative solvers. Because the ice sheet evolves during the course of a simulation, it is important that the detection algorithm can run in situ with the simulation - - running in parallel and taking a negligible amount of computation time - - so that degenerate features (e.g., calving icebergs) can be detected as they develop. We present a distributed memory, BFS-based label-propagation approach to degenerate feature detection that is efficient enough to be called at each step of an ice sheet simulation, while correctly identifying all degenerate features of an ice sheet mesh. Our method finds all degenerate features in a mesh with 13 million vertices in 0.0561 seconds on 1536 cores in the MPAS Albany Land Ice (MALI) model. Compared to the previously used serial pre-processing approach, we observe a 46,000x speedup for our algorithm, and provide additional capability to do dynamic detection of degenerate features in the simulation.

More Details

Demonstrating improved application performance using dynamic monitoring and task mapping

2014 IEEE International Conference on Cluster Computing, CLUSTER 2014

Brandt, James M.; Devine, Karen D.; Gentile, Ann C.; Pedretti, Kevin P.

This work demonstrates the integration of monitoring, analysis, and feedback to perform application-to-resource mapping that adapts to both static architecture features and dynamic resource state. In particular, we present a framework for mapping MPI tasks to compute resources based on run-time analysis of system-wide network data, architecture-specific routing algorithms, and application communication patterns. We address several challenges. Within each node, we collect local utilization data. We consolidate that information to form a global view of system performance, accounting for system-wide factors including competing applications. We provide an interface for applications to query the global information. Then we exploit the system information to change the mapping of tasks to nodes so that system bottlenecks are avoided. We demonstrate the benefit of this monitoring and feedback by remapping MPI tasks based on route-length, bandwidth, and credit-stalls metrics for a parallel sparse matrix-vector multiplication kernel. In the best case, remapping based on dynamic network information in a congested environment recovered 48.9% of the time lost to congestion, reducing matrix-vector multiplication time by 7.8%. Our experiments focus on the Cray XE/XK platform, but the integration concepts are generally applicable to any platform for which applicable metrics and route knowledge can be obtained.

More Details

Design of dynamic load-balancing tools for parallel applications

Devine, Karen D.; Hendrickson, Bruce A.; Boman, Erik G.; Vaughan, Courtenay T.

The design of general-purpose dynamic load-balancing tools for parallel applications is more challenging than the design of static partitioning tools. Both algorithmic and software engineering issues arise. The authors have addressed many of these issues in the design of the Zoltan dynamic load-balancing library. Zoltan has an object-oriented interface that makes it easy to use and provides separation between the application and the load-balancing algorithms. It contains a suite of dynamic load-balancing algorithms, including both geometric and graph-based algorithms. Its design makes it valuable both as a partitioning tool for a variety of applications and as a research test-bed for new algorithmic development. In this paper, the authors describe Zoltan's design and demonstrate its use in an unstructured-mesh finite element application.

More Details

Distributed Memory Graph Coloring Algorithms for Multiple GPUs

Proceedings of IA3 2020: 10th Workshop on Irregular Applications: Architectures and Algorithms, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Bogle, Ian; Boman, Erik G.; Devine, Karen D.; Rajamanickam, Sivasankaran R.; Slota, George M.

Graph coloring is often used in parallelizing scientific computations that run in distributed and multi-GPU environments; it identifies sets of independent data that can be updated in parallel. Many algorithms exist for graph coloring on a single GPU or in distributed memory, but hybrid MPI+GPU algorithms have been unexplored until this work, to the best of our knowledge. We present several MPI+GPU coloring approaches that use implementations of the distributed coloring algorithms of Gebremedhin et al. and the shared-memory algorithms of Deveci et al. The on-node parallel coloring uses implementations in KokkosKernels, which provide parallelization for both multicore CPUs and GPUs. We further extend our approaches to solve for distance-2 coloring, giving the first known distributed and multi-GPU algorithm for this problem. In addition, we propose novel methods to reduce communication in distributed graph coloring. Our experiments show that our approaches operate efficiently on inputs too large to fit on a single GPU and scale up to graphs with 76.7 billion edges running on 128 GPUs.

More Details

Distributed micro-releases of bioterror pathogens : threat characterizations and epidemiology from uncertain patient observables

Adams, Brian M.; Devine, Karen D.; Najm, H.N.; Marzouk, Youssef M.

Terrorist attacks using an aerosolized pathogen preparation have gained credibility as a national security concern since the anthrax attacks of 2001. The ability to characterize the parameters of such attacks, i.e., to estimate the number of people infected, the time of infection, the average dose received, and the rate of disease spread in contemporary American society (for contagious diseases), is important when planning a medical response. For non-contagious diseases, we address the characterization problem by formulating a Bayesian inverse problem predicated on a short time-series of diagnosed patients exhibiting symptoms. To keep the approach relevant for response planning, we limit ourselves to 3.5 days of data. In computational tests performed for anthrax, we usually find these observation windows sufficient, especially if the outbreak model employed in the inverse problem is accurate. For contagious diseases, we formulated a Bayesian inversion technique to infer both pathogenic transmissibility and the social network from outbreak observations, ensuring that the two determinants of spreading are identified separately. We tested this technique on data collected from a 1967 smallpox epidemic in Abakaliki, Nigeria. We inferred, probabilistically, different transmissibilities in the structured Abakaliki population, the social network, and the chain of transmission. Finally, we developed an individual-based epidemic model to realistically simulate the spread of a rare (or eradicated) disease in a modern society. This model incorporates the mixing patterns observed in an (American) urban setting and accepts, as model input, pathogenic transmissibilities estimated from historical outbreaks that may have occurred in socio-economic environments with little resemblance to contemporary society. Techniques were also developed to simulate disease spread on static and sampled network reductions of the dynamic social networks originally in the individual-based model, yielding faster, though approximate, network-based epidemic models. These reduced-order models are useful in scenario analysis for medical response planning, as well as in computationally intensive inverse problems.

More Details

Exploiting geometric partitioning in task mapping for parallel computers

Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS

Deveci, Mehmet; Rajamanickam, Sivasankaran R.; Leung, Vitus J.; Pedretti, Kevin P.; Olivier, Stephen L.; Bunde, David P.; Catalyurek, Umit V.; Devine, Karen D.

We present a new method for mapping applications' MPI tasks to cores of a parallel computer such that communication and execution time are reduced. We consider the case of sparse node allocation within a parallel machine, where the nodes assigned to a job are not necessarily located within a contiguous block nor within close proximity to each other in the network. The goal is to assign tasks to cores so that interdependent tasks are performed by 'nearby' cores, thus lowering the distance messages must travel, the amount of congestion in the network, and the overall cost of communication. Our new method applies a geometric partitioning algorithm to both the tasks and the processors, and assigns task parts to the corresponding processor parts. We show that, for the structured finite difference mini-app Mini Ghost, our mapping method reduced execution time 34% on average on 65,536 cores of a Cray XE6. In a molecular dynamics mini-app, Mini MD, our mapping method reduced communication time by 26% on average on 6144 cores. We also compare our mapping with graph-based mappings from the LibTopoMap library and show that our mappings reduced the communication time on average by 15% in MiniGhost and 10% in MiniMD. © 2014 IEEE.

More Details

Geometric mapping of tasks to processors on parallel computers with mesh or torus networks

IEEE Transactions on Parallel and Distributed Systems

Deveci, Mehmet; Devine, Karen D.; Pedretti, Kevin P.; Taylor, Mark A.; Rajamanickam, Sivasankaran R.; Çatalyurek, Umit V.

We present a new method for reducing parallel applications’ communication time by mapping their MPI tasks to processors in a way that lowers the distance messages travel and the amount of congestion in the network. Assuming geometric proximity among the tasks is a good approximation of their communication interdependence, we use a geometric partitioning algorithm to order both the tasks and the processors, assigning task parts to the corresponding processor parts. In this way, interdependent tasks are assigned to “nearby” cores in the network. We also present a number of algorithmic optimizations that exploit specific features of the network or application to further improve the quality of the mapping. We specifically address the case of sparse node allocation, where the nodes assigned to a job are not necessarily located in a contiguous block nor within close proximity to each other in the network. However, our methods generalize to contiguous allocations as well, and results are shown for both contiguous and non-contiguous allocations. We show that, for the structured finite difference mini-application MiniGhost, our mapping methods reduced communication time up to 75 percent relative to MiniGhost’s default mapping on 128K cores of a Cray XK7 with sparse allocation. For the atmospheric modeling code E3SM/HOMME, our methods reduced communication time up to 31% on 16K cores of an IBM BlueGene/Q with contiguous allocation.

More Details
Results 1–50 of 119
Results 1–50 of 119