Publications Search

Conventional wisdom in the spacecraft domain is that on-orbit computation is expensive, and thus, information is traditionally funneled to the ground as directly as possible. The explosion of information due to larger sensors, the advancements of Moore's law, and other considerations lead us to revisit this practice. In this article, we consider the trade-off between computation, storage, and transmission, viewed as an energy minimization problem.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

Asynchronous Ballistic Reversible Computing

2017 IEEE International Conference on Rebooting Computing, ICRC 2017 - Proceedings

Frank, Michael P.

Most existing concepts for hardware implementation of reversible computing invoke an adiabatic computing paradigm, in which individual degrees of freedom (e.g., node voltages) are synchronously transformed under the influence of externally supplied driving signals. But distributing these "power/clock" signals to all gates within a design while efficiently recovering their energy is difficult. Can we reduce clocking overhead using a ballistic approach, wherein data signals self-propagating between devices drive most state transitions? Traditional concepts of ballistic computing, such as the classic Billiard-Ball Model, typically rely on a precise synchronization of interacting signals, which can fail due to exponential amplification of timing differences when signals interact. In this paper, we develop a general model of Asynchronous Ballistic Reversible Computing (ABRC) that aims to address these problems by eliminating the requirement for precise synchronization between signals. Asynchronous reversible devices in this model are isomorphic to a restricted set of Mealy finite-state machines. We explore ABRC devices having up to 3 bidirectional I/O terminals and up to 2 internal states, identifying a simple pair of such devices that comprises a computationally universal set of primitives. We also briefly discuss how ABRC might be implemented using single flux quanta in superconducting circuits.
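The Mealy-machine view in the abstract above can be illustrated with a small sketch. It assumes, as a simplification not spelled out in the abstract, that a device is a table mapping (input terminal, state) to (output terminal, next state), and that reversibility means this map is one-to-one. The `TOGGLE` and `RESET` tables are hypothetical devices invented for illustration; they are not the universal primitives identified in the paper.

```python
# Hypothetical 2-terminal, 2-state devices, written as Mealy-style tables:
# (input terminal, state) -> (output terminal, next state).

# A pulse crosses to the other terminal and flips the internal state.
TOGGLE = {
    (0, 0): (1, 1),
    (0, 1): (1, 0),
    (1, 0): (0, 1),
    (1, 1): (0, 0),
}

# A pulse on terminal 0 always leaves the device in state 0, erasing the
# previous state, so two different inputs share one output.
RESET = {
    (0, 0): (1, 0),
    (0, 1): (1, 0),
    (1, 0): (0, 1),
    (1, 1): (0, 0),
}


def is_reversible(table):
    """Return True when no two inputs map to the same (terminal, state) output."""
    outputs = list(table.values())
    return len(set(outputs)) == len(outputs)


def invert(table):
    """Build the table that undoes `table`; only defined for reversible tables."""
    if not is_reversible(table):
        raise ValueError("table is not one-to-one, so it has no inverse")
    return {out: inp for inp, out in table.items()}


def run(table, state, pulses):
    """Feed pulses one at a time; return (list of exit terminals, final state)."""
    exits = []
    for terminal in pulses:
        out_terminal, state = table[(terminal, state)]
        exits.append(out_terminal)
    return exits, state


if __name__ == "__main__":
    exits, final = run(TOGGLE, 0, [0, 0, 1])
    print(exits, final)
    # Replaying the exits in reverse order through the inverse table
    # recovers the original pulses and the starting state.
    back, start = run(invert(TOGGLE), final, exits[::-1])
    print(back[::-1], start)
```

Because pulses are processed strictly one at a time, the sketch never depends on the relative arrival times of two signals, which is the property the asynchronous model is after.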

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

A spike-timing neuromorphic architecture

2017 IEEE International Conference on Rebooting Computing, ICRC 2017 - Proceedings

Hill, Aaron; Donaldson, Jonathon W.; Rothganger, Fredrick R.; Vineyard, Craig M.; Follett, David R.; Follett, Pamela L.; Smith, Michael R.; Verzi, Stephen J.; Severa, William M.; Wang, Felix W.; Aimone, James B.; Naegle, John H.; James, Conrad D.

Unlike general-purpose computer architectures that are composed of complex processor cores and sequential computation, the brain is innately parallel and contains highly complex connections between computational units (neurons). Key to the architecture of the brain is a functionality enabled by the combined effect of spiking communication and sparse connectivity with unique variable efficacies and temporal latencies. Utilizing these neuroscience principles, we have developed the Spiking Temporal Processing Unit (STPU) architecture, which is well suited for areas such as pattern recognition and natural language processing. In this paper, we formally describe the STPU, implement the STPU on a field programmable gate array, and show measured performance data.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

Unraveling Network-induced Memory Contention: Deeper Insights with Machine Learning

IEEE Transactions on Parallel and Distributed Systems

Groves, Taylor L.; Grant, Ryan; Gonzales, Aaron; Arnold, Dorian

Remote Direct Memory Access (RDMA) is expected to be an integral communication mechanism for future exascale systems, enabling asynchronous data transfers so that applications may fully utilize CPU resources while simultaneously sharing data amongst remote nodes. We examine Network-induced Memory Contention (NiMC) on InfiniBand networks. We expose the interactions between RDMA, main memory and cache when applications and out-of-band services compete for memory resources. We then explore NiMC's resulting impact on application-level performance. For a range of hardware technologies and HPC workloads, we quantify NiMC and show that NiMC's impact grows with scale, resulting in up to 3X performance degradation at scales as small as 8K processes, even in applications that previously have been shown to be performance resilient in the presence of noise. In addition, this work examines the problem of predicting NiMC's impact on applications by leveraging machine learning and easily accessible performance counters. This approach provides additional insights about the root cause of NiMC and facilitates dynamic selection of potential solutions. Finally, we evaluate three potential techniques to reduce NiMC's impact, namely hardware offloading, core reservation and network throttling.

TYPE Journal Article YEAR 2017

DOI OSTI

Designing vector-friendly compact BLAS and LAPACK kernels

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017

Kim, Kyungjoo; Costa, Timothy B.; Deveci, Mehmet; Bradley, Andrew M.; Hammond, Simon; Guney, Murat E.; Knepper, Sarah; Story, Shane; Rajamanickam, Sivasankaran

Many applications, such as PDE-based simulations and machine learning, apply BLAS/LAPACK routines to large groups of small matrices. While existing batched BLAS APIs provide meaningful speedup for this problem type, a non-canonical data layout enabling cross-matrix vectorization may provide further significant speedup. In this paper, we propose a new compact data layout that interleaves matrices in blocks according to the SIMD vector length. We combine this compact data layout with a new interface to BLAS/LAPACK routines that can be used within a hierarchical parallel application. Our layout provides up to 14x, 45x, and 27x speedup against OpenMP loops around optimized DGEMM, DTRSM and DGETRF kernels, respectively, on the Intel Knights Landing architecture. We discuss the compact batched BLAS/LAPACK implementations in two libraries, KokkosKernels and Intel® Math Kernel Library. We demonstrate the APIs in a line solver for coupled PDEs. Finally, we present detailed performance analysis of our kernels.
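As a rough illustration of the interleaving idea described in this abstract, the sketch below packs a batch of small matrices so that the same entry of `vlen` consecutive matrices sits side by side, then multiplies them with the innermost loop running across those lanes. The function names (`pack_compact`, `compact_gemm`, `unpack_compact`) and the nested-list storage are assumptions made for readability; they are not the KokkosKernels or Intel Math Kernel Library interfaces, and pure Python gives no SIMD speedup.

```python
def pack_compact(mats, vlen):
    """Interleave equally sized matrices in blocks of `vlen`.

    Returns packed[block][i][j][lane]; the last block is zero-padded.
    """
    rows, cols = len(mats[0]), len(mats[0][0])
    nblocks = (len(mats) + vlen - 1) // vlen
    packed = [[[[0.0] * vlen for _ in range(cols)] for _ in range(rows)]
              for _ in range(nblocks)]
    for k, mat in enumerate(mats):
        block, lane = divmod(k, vlen)
        for i in range(rows):
            for j in range(cols):
                packed[block][i][j][lane] = mat[i][j]
    return packed


def compact_gemm(a, b, vlen):
    """Compute C = A * B for every matrix pair in two packed batches."""
    m, inner, n = len(a[0]), len(a[0][0]), len(b[0][0])
    c = [[[[0.0] * vlen for _ in range(n)] for _ in range(m)]
         for _ in range(len(a))]
    for block in range(len(a)):
        for i in range(m):
            for j in range(n):
                c_vec = c[block][i][j]
                for p in range(inner):
                    a_vec = a[block][i][p]
                    b_vec = b[block][p][j]
                    # This lane loop touches contiguous data; it is the part
                    # a compiler could map onto one SIMD instruction.
                    for lane in range(vlen):
                        c_vec[lane] += a_vec[lane] * b_vec[lane]
    return c


def unpack_compact(packed, count):
    """Recover the first `count` matrices from a packed batch."""
    vlen = len(packed[0][0][0])
    mats = []
    for k in range(count):
        block, lane = divmod(k, vlen)
        mats.append([[entry[lane] for entry in row] for row in packed[block]])
    return mats


if __name__ == "__main__":
    batch_a = [[[1, 2], [3, 4]], [[0, 1], [1, 0]], [[2, 0], [0, 2]]]
    batch_b = [[[1, 0], [0, 1]], [[5, 6], [7, 8]], [[1, 1], [1, 1]]]
    vlen = 2  # stand-in for the hardware SIMD width
    packed_c = compact_gemm(pack_compact(batch_a, vlen),
                            pack_compact(batch_b, vlen), vlen)
    for mat in unpack_compact(packed_c, len(batch_a)):
        print(mat)
```

In a compiled implementation the lane dimension would typically match the hardware vector width, so each scalar operation of a small-matrix kernel becomes one vector operation across `vlen` matrices.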

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

OpenMPIR: Implementing OpenMP tasks with Tapir

Proceedings of LLVM-HPC 2017: 4th Workshop on the LLVM Compiler Infrastructure in HPC - Held in conjunction with SC 2017: The International Conference for High Performance Computing, Networking, Storage and Analysis

Stelle, George; Moses, William S.; Olivier, Stephen L.; McCormick, Patrick

Optimizing compilers for task-level parallelism are still in their infancy. This work explores a compiler front end that translates OpenMP tasking semantics to Tapir, an extension to LLVM IR that represents fork-join parallelism. This enables analyses and optimizations that were previously inaccessible to OpenMP codes, as well as the ability to target additional runtimes at code generation. Using a Cilk runtime back end, we compare results to existing OpenMP implementations. Initial performance results for the Barcelona OpenMP task suite show performance improvements over existing implementations.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus