The performance of a large-scale, production-quality science and engineering application (‘app’) is often dominated by a small subset of the code. Even within that subset, computational and data access patterns are often repeated, so that an even smaller portion can represent the performance-impacting features. If application developers, parallel computing experts, and computer architects can together identify this representative subset and then develop a small mini-application (‘miniapp’) that can capture these primary performance characteristics, then this miniapp can be used to both improve the performance of the app as well as provide a tool for co-design for the high-performance computing community. However, a critical question is whether a miniapp can effectively capture key performance behavior of an app. This study provides a comparison of an implicit finite element semiconductor device modeling app on unstructured meshes with an implicit finite element miniapp on unstructured meshes. The goal is to assess whether the miniapp is predictive of the performance of the app. Finally, single compute node performance will be compared, as well as scaling up to 16,000 cores. Results indicate that the miniapp can be reasonably predictive of the performance characteristics of the app for a single iteration of the solver on a single compute node.
The energy concerns of many-core processors are increasing with the number of cores. We provide a new method that reduces energy consumption of an application on many-core processors by identifying unique segments to apply dynamic voltage and frequency scaling (DVFS). Our method, phase-based voltage and frequency scaling (PVFS), hinges on the identification of phases, i.e., Segments of code with unique performance and power attributes, using hidden Markov Models. In particular, we demonstrate the use of this method to target hardware components on many-core processors such as Network-on-Chip (NoC). PVFS uses these phases to construct a static power schedule that uses DVFS to reduce energy with minimal performance penalty. This general scheme can be used with a variety of performance and power metrics to match the needs of the system and application. More importantly, the flexibility in the general scheme allows for targeting of the unique hardware components of future many-core processors. We provide an in-depth analysis of PVFS applied to five threaded benchmark applications, and demonstrate the advantage of using PVFS for 4 to 32 cores in a single socket. Empirical results of PVFS show a reduction of up to 10.1% of total energy while only impacting total time by at most 2.7% across all core counts. Furthermore, PVFS outperforms standard coarse-grain time-driven DVFS, while scaling better in terms of energy savings with increasing core counts.
The divergence in the computer architecture landscape has resulted in different architectures being considered mainstream at the same time. For application and algorithm developers, a dilemma arises when one must focus on using underlying architectural features to extract the best performance on each of these architectures, while writing portable code at the same time. We focus on this problem with graph analytics as our target application domain. In this paper, we present an abstraction-based methodology for performance-portable graph algorithm design on manicure architectures. We demonstrate our approach by systematically optimizing algorithms for the problems of breadth-first search, color propagation, and strongly connected components. We use Kokkos, a manicure library and programming model, for prototyping our algorithms. Our portable implementation of the strongly connected components algorithm on the NVIDIA Tesla K40M is up to 3.25× faster than a state-of-the-art parallel CPU implementation on a dual-socket Sandy Bridge compute node.
We present a new distributed model for graph computations motivated by limited information sharing. Two or more independent entities have collected large social graphs. They wish to compute the result of running graph algorithms on the entire set of relationships. Because the information is sensitive or economically valuable, they do not wish to simply combine the information in a single location. We consider two models for computing the solution to graph algorithms in this setting: 1) limited-sharing: the two entities can share only a poly logarithmic size subgraph, 2) low-trust: the entities must not reveal any information beyond the query answer, assuming they are all honest but curious. We believe this model captures realistic constraints on cooperating autonomous data centres' have results for both models for s-t connectivity, one of the simplest graph problems that requires global information in the worst case. In the limited-sharing model, our results exploit social network structure. Standard communication complexity gives polynomial lower bounds on s-t connectivity for general graphs. However, if the graph for each data centre has a giant component and these giant components intersect, then we can overcome this lower bound, computing-t connectivity while exchanging O(log 2 n) bits for a constant number of data centers. We can also test the assumption that the giant components overlap using O(log 2 n) bits provided the (unknown) overlap is sufficiently large. The second result is in the low trust model. We give a secure multi-party computation (MPC) algorithm that 1) does not make cryptographic assumptions when there are 3 or more entities, and 2) is efficient, especially when compared to the usual garbled circuit approach. The entities learn only the yes/no answer. No party learns anything about the others' graph, not even node names. This algorithm does not require any special graph structure. This secure MPC result for s-t connectivity is one of the first that involves a few parties computing on large inputs, instead of many parties computing on a few local values.
We describe new capabilities for modeling MPEC problems within the Pyomo modeling software. These capabilities include new modeling components that represent complementar- ity conditions, modeling transformations for re-expressing models with complementarity con- ditions in other forms, and meta-solvers that apply transformations and numeric optimization solvers to optimize MPEC problems. We illustrate the breadth of Pyomo's modeling capabil- ities for MPEC problems, and we describe how Pyomo's meta-solvers can perform local and global optimization of MPEC problems.