This report documents the completion of milestone STPM12-4, Kokkos Training Bootcamp. The goal of this milestone was to hold a combined tutorial and hackathon bootcamp event for the Kokkos community and prospective users. The Kokkos Bootcamp was held on-site at Oak Ridge National Laboratory from July 24 to July 27, 2018, with over 40 registered participants from 12 institutions, including 7 Kokkos project staff from SNL, LANL, and ORNL. The event consisted of a roughly two-day tutorial session with hands-on exercises, followed by 1.5 days of intensive porting work on codes the participants brought to explore, port, and optimize with Kokkos, assisted by Kokkos project experts.
Here, we demonstrate that gas-kinetic methods incorporating molecular chaos can simulate the sustained turbulence that occurs in wall-bounded turbulent shear flows. The direct simulation Monte Carlo (DSMC) method, a gas-kinetic molecular method that enforces molecular chaos for gas-molecule collisions, is used to simulate the minimal Couette flow at Re = 500. The resulting law of the wall, the average wall shear stress, the average kinetic energy, and the continually regenerating coherent structures all agree closely with corresponding results from direct numerical simulation of the Navier-Stokes equations. These results indicate that enforcing molecular chaos for collisions in gas-kinetic methods does not prevent the development of the molecular-scale long-range correlations required to form hydrodynamic-scale turbulent coherent structures.
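For readers unfamiliar with DSMC, the following minimal Python sketch (illustrative, not the paper's code) shows the collision step in one cell and where molecular chaos enters: collision partners are selected uniformly at random, so the pre-collision velocities of a pair are treated as uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)

def dsmc_collide_cell(v, n_pairs, sigma_vr_max):
    """One DSMC collision step in a single cell.

    Molecular chaos is enforced by picking collision partners
    uniformly at random, so a pair's pre-collision velocities
    are treated as uncorrelated.
    v            : (N, 3) array of molecular velocities in the cell
    n_pairs      : number of candidate pairs to test (from the NTC formula)
    sigma_vr_max : estimated maximum of cross-section times relative speed
    """
    n = len(v)
    for _ in range(n_pairs):
        i, j = rng.choice(n, size=2, replace=False)   # random pair: molecular chaos
        vr = v[i] - v[j]
        vr_mag = np.linalg.norm(vr)
        # accept the candidate pair with probability proportional to relative speed
        if rng.random() * sigma_vr_max < vr_mag:
            # hard-sphere collision: scatter the relative velocity isotropically
            cos_t = 2.0 * rng.random() - 1.0
            sin_t = np.sqrt(1.0 - cos_t**2)
            phi = 2.0 * np.pi * rng.random()
            vr_new = vr_mag * np.array([sin_t * np.cos(phi),
                                        sin_t * np.sin(phi),
                                        cos_t])
            vcm = 0.5 * (v[i] + v[j])                 # center-of-mass velocity is conserved
            v[i] = vcm + 0.5 * vr_new
            v[j] = vcm - 0.5 * vr_new
    return v
```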
The retina plays an important role in animal vision, namely preprocessing visual information before sending it to the brain through the optic nerve. Understanding how the retina does this is of particular relevance for the development and design of neuromorphic sensors, especially those focused on image processing. Our research examines mechanisms of motion processing in the retina. We are specifically interested in the detection of moving targets under challenging conditions: small or low-contrast (dim) targets amidst high quantities of clutter or distractor signals. In this paper we compare a classic motion-sensitive cell model, the Hassenstein-Reichardt model, to a model of the OMS (object motion-sensitive) cell, which relies primarily on change detection, and describe scenarios for which each model is better suited. We also examine mechanisms, inspired by features of retinal circuitry, by which performance may be enhanced. For example, lateral inhibition (mediated by amacrine cells) conveys selectivity for small targets to the W3 ganglion cell; we demonstrate that a similar mechanism can be combined with the aforementioned motion-processing cell models to select small moving targets for further processing.
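To make the comparison concrete, here is a toy one-dimensional Hassenstein-Reichardt correlator (a sketch with illustrative parameters, not the authors' model code): each unit multiplies a delayed photoreceptor signal with the undelayed signal of its neighbor, and opponent subtraction of the two mirror-image correlators yields direction selectivity.

```python
import numpy as np

def hassenstein_reichardt(stimulus, delay):
    """Toy 1-D Hassenstein-Reichardt correlator.

    stimulus : (T, X) array of luminance over time at X photoreceptors
    delay    : temporal delay (in samples) of the delayed arm
    Returns a (T - delay, X - 1) array of direction-selective responses:
    positive for rightward motion, negative for leftward motion.
    """
    delayed = stimulus[:-delay]      # delayed signal
    direct = stimulus[delay:]        # undelayed signal
    # correlate each receptor's delayed output with its neighbor's direct output
    rightward = delayed[:, :-1] * direct[:, 1:]
    leftward = delayed[:, 1:] * direct[:, :-1]
    return rightward - leftward      # opponent subtraction gives direction selectivity

# a bright bar drifting rightward produces a net positive response
T, X = 100, 20
t = np.arange(T)[:, None]
x = np.arange(X)[None, :]
bar = np.exp(-0.5 * ((x - 0.2 * t) / 1.5) ** 2)   # Gaussian bar moving right
print(hassenstein_reichardt(bar, delay=3).mean())  # > 0 for rightward motion
```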
Malware detection and remediation is an ongoing task for computer security and IT professionals. Here, we examine the use of neural algorithms to detect malware using the system calls generated by executables, which mitigates attempts at obfuscation since the program's behavior, rather than its code, is monitored. We examine several deep learning techniques as well as liquid state machines, baselined against a random forest. The experiments examine the effects of concept drift, to understand how well the algorithms generalize to novel malware samples, by testing them on data that was collected after the training data. The results suggest that each of the examined machine learning algorithms is a viable solution for detecting malware, achieving between 90% and 95% class-averaged accuracy (CAA). In real-world scenarios, the performance on an operational network may not match the performance achieved in training: the CAA may be about the same, but the precision and recall over the malware class can change significantly. We structure experiments to highlight these caveats and offer insights into expected performance in operational environments. In addition, we use the induced models to better understand what differentiates malware samples from goodware; this can further serve as a forensics tool, providing directions for investigation and remediation.
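Assuming the usual definition, class-averaged accuracy is the unweighted mean of per-class recalls, which keeps a large goodware majority from masking poor malware recall; a minimal sketch:

```python
import numpy as np

def class_averaged_accuracy(y_true, y_pred):
    """Class-averaged accuracy (CAA): the unweighted mean of per-class
    recalls, so the majority class cannot dominate the score."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return np.mean(recalls)

# example: 90% recall on goodware, 80% on malware -> CAA = 0.85,
# even though plain accuracy would be pulled toward the larger class
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 81 + [1] * 9 + [1] * 8 + [0] * 2)
print(class_averaged_accuracy(y_true, y_pred))  # 0.85
```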
This paper formulates general computation as a feedback-control problem, which allows the agent to autonomously overcome some limitations of standard procedural-language programming: brittleness to errors and to early program termination. Our formulation considers computation to be trajectory generation in the program's variable space. Computing then becomes a sequential decision-making problem, solved with reinforcement learning (RL) and analyzed with Lyapunov stability theory to assess the agent's resilience and progression toward the goal. We do this through a case study on a quintessential computer science problem: array sorting. Evaluations show that our RL sorting agent makes steady progress toward an asymptotically stable goal, is resilient to faulty components, and performs fewer array manipulations than traditional Quicksort and Bubble Sort.
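As a sketch of the formulation (illustrative; the paper's agent learns its policy with RL rather than following this hard-coded greedy rule), the state is the array, an action swaps an adjacent pair, and the number of inversions serves as a Lyapunov-like function that decreases monotonically to zero at the sorted goal:

```python
import random

def inversions(a):
    """Lyapunov-like candidate function: number of out-of-order pairs.
    It is zero exactly at the sorted goal state."""
    return sum(a[i] > a[j] for i in range(len(a)) for j in range(i + 1, len(a)))

def greedy_sort_agent(a, max_steps=10_000):
    """Treat sorting as sequential decision making: in each state,
    pick an adjacent swap (the action) that strictly decreases the
    inversion count, so progress toward the goal is monotone."""
    a = list(a)
    for _ in range(max_steps):
        if inversions(a) == 0:           # asymptotically stable goal reached
            break
        # actions: swap positions (i, i+1); choose one that reduces inversions
        candidates = [i for i in range(len(a) - 1) if a[i] > a[i + 1]]
        i = random.choice(candidates)    # any such swap decreases inversions by 1
        a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(greedy_sort_agent([5, 2, 4, 1, 3]))  # [1, 2, 3, 4, 5]
```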
Proceedings of PMBS 2018: Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis
Proxy applications, or proxies, are simple applications meant to exercise systems in a way that mimics real applications (their parents). However, characterizing the relationship between the behavior of parent and proxy applications is not an easy task. In prior work [1], we presented a data-driven methodology for characterizing this relationship, based on collecting runtime data from both applications and then using data analytics to find their correspondence or divergence. We showed that it worked well for hardware counter data, but our initial attempt using MPI function data was less satisfactory. In this paper, we present an exploratory effort to better quantify the correspondence of communication behavior between proxies and their respective parent applications. We present experimental evidence of positive results using four proxy applications from the current ECP Proxy Application Suite and their corresponding parent applications (in the ECP application portfolio). Results show that each proxy analyzed is representative of its parent with respect to communication data. In conjunction with the method presented in [1] (correspondence between computation and memory behavior), this gives a strong understanding of how well a proxy predicts the comprehensive performance of its parent.
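One simple way to quantify correspondence of communication behavior, shown here for illustration and not necessarily the paper's exact analysis, is to collect per-MPI-function profiles from parent and proxy and compare the normalized profiles, for example with cosine similarity:

```python
import numpy as np

def comm_similarity(parent_profile, proxy_profile):
    """Cosine similarity between normalized per-MPI-function profiles.
    Each profile maps an MPI function name to its share of total calls
    (or bytes); 1.0 means an identical communication mix."""
    keys = sorted(set(parent_profile) | set(proxy_profile))
    p = np.array([parent_profile.get(k, 0.0) for k in keys], dtype=float)
    q = np.array([proxy_profile.get(k, 0.0) for k in keys], dtype=float)
    p /= p.sum()
    q /= q.sum()
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

# hypothetical call-count profiles for a parent and its proxy
parent = {"MPI_Allreduce": 1200, "MPI_Isend": 54000,
          "MPI_Irecv": 54000, "MPI_Waitall": 9000}
proxy = {"MPI_Allreduce": 1100, "MPI_Isend": 50000,
         "MPI_Irecv": 50000, "MPI_Waitall": 8000}
print(comm_similarity(parent, proxy))  # close to 1.0: proxy mimics parent
```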
Proceedings of ScalA 2018: 9th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis
Sparse matrix-matrix multiplication is a critical kernel for several scientific computing applications, especially the setup phase of algebraic multigrid. The MPI+X programming model, which is growing in popularity, requires that such kernels be implemented in a way that exploits on-node parallelism. We present a single-pass OpenMP variant of Gustavson's sparse matrix-matrix multiplication algorithm designed for architectures (e.g., CPU or Intel Xeon Phi) with reasonably large memory and modest thread counts (tens of threads, not thousands). These assumptions allow us to exploit perfect hashing and dynamic memory allocation to achieve performance improvements of up to 2x over third-party kernels for matrices derived from algebraic multigrid setup.
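For reference, a serial Python sketch of Gustavson's row-by-row algorithm with a hash-map accumulator follows; the paper's variant replaces this with a single pass over rows parallelized via OpenMP, using perfect hashing for the per-row accumulator.

```python
def gustavson_spgemm(A, B):
    """Gustavson's sparse matrix-matrix multiply, C = A * B.
    A and B are lists of rows; each row is a dict {col: value}.
    For each row i of A, the rows of B selected by A's nonzeros are
    scaled and accumulated into a hash map, one pass per row of C.
    (An OpenMP variant assigns rows of C to threads, so the row
    accumulators are thread-private.)"""
    C = []
    for row_a in A:
        acc = {}                            # hash-map accumulator for this row of C
        for k, a_ik in row_a.items():       # nonzeros a_ik in row i of A
            for j, b_kj in B[k].items():    # nonzeros b_kj in row k of B
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        C.append(acc)
    return C

# 2x2 example: A = [[1, 2], [0, 3]], B = [[4, 0], [0, 5]]
A = [{0: 1.0, 1: 2.0}, {1: 3.0}]
B = [{0: 4.0}, {1: 5.0}]
print(gustavson_spgemm(A, B))  # [{0: 4.0, 1: 10.0}, {1: 15.0}]
```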
2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2018
Baseman, Elisabeth; Debardeleben, Nathan; Blanchard, Sean; Moore, Juston; Tkachenko, Olena; Ferreira, Kurt B.; Siddiqua, Taniya; Sridharan, Vilas
As the scale of high performance computing facilities approaches the exascale era, gaining a detailed understanding of hardware failures becomes important. In particular, the extreme memory capacity of modern supercomputers means that data corruption errors which were statistically negligible at smaller scales will become more prevalent. In order to understand hardware faults and mitigate their adverse effects on exascale workloads, we must learn from the behavior of current hardware. In this work, we investigate the predictability of DRAM errors using field data from two recently decommissioned supercomputers: Cielo, at Los Alamos National Laboratory, and Hopper, at Lawrence Berkeley National Laboratory. Due to the volume and complexity of the field data, we apply statistical machine learning to predict the probability of DRAM errors at previously unaccessed locations. We compare the predictive performance of six machine learning algorithms and find that a model incorporating physical knowledge of DRAM spatial structure outperforms purely statistical methods. Our findings both support the expected physical behavior of DRAM hardware and provide a mechanism for real-time error prediction. We demonstrate real-world feasibility by training an error model on one supercomputer and effectively predicting errors on another. Our methods demonstrate the importance of spatial locality over temporal locality in DRAM errors, and show that relatively simple statistical models are effective at predicting future errors based on historical data, allowing proactive error mitigation.
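As an illustration of the kind of physically informed model the abstract describes (the field names and scoring rule here are hypothetical), a DRAM location can be encoded by its position in the spatial hierarchy, so that locations sharing a row, column, or bank with past errors score as more error-prone:

```python
from collections import Counter

def spatial_features(loc):
    """Encode a DRAM location by its position in the physical hierarchy.
    `loc` is a dict with illustrative keys: dimm, rank, bank, row, col."""
    return [("bank", loc["dimm"], loc["rank"], loc["bank"]),
            ("row",  loc["dimm"], loc["rank"], loc["bank"], loc["row"]),
            ("col",  loc["dimm"], loc["rank"], loc["bank"], loc["col"])]

def error_score(loc, history):
    """Score an unaccessed location by how many past errors share each
    level of its spatial hierarchy: a crude stand-in for the physically
    informed statistical model described above."""
    counts = Counter(f for past in history for f in spatial_features(past))
    return sum(counts[f] for f in spatial_features(loc))

history = [{"dimm": 0, "rank": 1, "bank": 3, "row": 4096, "col": c}
           for c in (7, 9, 21)]
query = {"dimm": 0, "rank": 1, "bank": 3, "row": 4096, "col": 100}  # same faulty row
print(error_score(query, history))  # high score: row 4096 has a history of errors
```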
Logic-memory integration helps mitigate the von Neumann bottleneck, and this has enabled a new class of architectures that helps accelerate graph analytics and operations on sparse data streams. These utilize merge networks as a key unit of computation. Such networks are highly parallel, and their performance increases with tighter coupling between logic and memory when a bitonic algorithm is used. This paper presents energy-efficient on-chip network architectures for merging key-value pairs using both word-parallel and bit-serial paradigms. The proposed architectures are capable of merging two rows of high bandwidth memory (HBM) worth of data in a manner that is completely overlapped with the reading from and writing back to such a row. Furthermore, their energy consumption is about an order of magnitude lower than that of a naive crossbar-based design.
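For context, the following serial Python sketch shows the bitonic merge that such networks implement; in hardware, all compare-exchanges within a stage are independent and execute in parallel, which is what makes tightly coupled logic-memory implementations attractive.

```python
def bitonic_merge(keys):
    """Merge a bitonic sequence into ascending order.
    len(keys) must be a power of two. Each outer iteration is one stage
    of the merge network; all compare-exchanges within a stage are
    independent and would run concurrently in hardware."""
    keys = list(keys)
    n = len(keys)
    half = n // 2
    while half >= 1:
        for i in range(n):
            j = i + half
            # compare-exchange within each block of size 2*half
            if j < n and (i // half) % 2 == 0 and keys[i] > keys[j]:
                keys[i], keys[j] = keys[j], keys[i]
        half //= 2
    return keys

# two sorted runs form a bitonic sequence when the second is reversed
a, b = [1, 4, 6, 7], [2, 3, 5, 8]
print(bitonic_merge(a + b[::-1]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```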