Final report for Cognitive Computing for Security LDRD 165613. It reports on the development of a hybrid general-purpose/neuromorphic computer architecture, with an emphasis on potential implementation with memristors.
The XVis project brings together the key elements of research to enable scientific discovery at extreme scale. Scientific computing will no longer be purely about how fast computations can be performed. Energy constraints, processor changes, and I/O limitations necessitate significant changes in both the software applications used in scientific computation and the ways in which scientists use them. Components for modeling, simulation, analysis, and visualization must work together in a computational ecosystem, rather than working independently as they have in the past. This project provides the necessary research and infrastructure for scientific discovery in this new computational ecosystem by addressing four interlocking challenges: emerging processor technology, in situ integration, usability, and proxy analysis.
We study a time-parallel approach to solving quadratic optimization problems with linear time-dependent partial differential equation (PDE) constraints. These problems arise in formulations of optimal control, optimal design, and inverse problems governed by parabolic PDE models. They may also arise as subproblems in algorithms for the solution of optimization problems with nonlinear time-dependent PDE constraints, e.g., in sequential quadratic programming methods. We apply a piecewise linear finite element discretization in space to the PDE constraint, followed by the Crank-Nicolson discretization in time. The objective function is discretized using finite elements in space and the trapezoidal rule in time. At this point in the discretization, auxiliary state variables are introduced at each discrete time interval, with the goal of enabling (i) a decoupling in time and (ii) a fixed-point iteration to recover the solution of the discrete optimality system. The fixed-point iterative schemes can be used either as preconditioners for Krylov subspace methods or as smoothers for multigrid-in-time schemes. We present promising numerical results for both use cases.
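To make the discretization concrete, the following is a minimal sketch, not the paper's implementation, of a single Crank-Nicolson step for a semi-discrete parabolic system M u' = -K u + f, together with a generic fixed-point iteration of the kind described above; M, K, the load vectors, and the map G are illustrative placeholders.

    import numpy as np

    def crank_nicolson_step(M, K, u_n, f_n, f_np1, dt):
        # Trapezoidal (Crank-Nicolson) rule for M u' = -K u + f:
        # (M + dt/2 K) u_{n+1} = (M - dt/2 K) u_n + dt/2 (f_n + f_{n+1})
        A = M + 0.5 * dt * K
        b = (M - 0.5 * dt * K) @ u_n + 0.5 * dt * (f_n + f_np1)
        return np.linalg.solve(A, b)

    def fixed_point(G, x0, tol=1e-10, maxit=200):
        # Generic fixed-point iteration x <- G(x); in the paper's setting,
        # G would be the decoupled-in-time update for the optimality system.
        x = x0
        for _ in range(maxit):
            x_new = G(x)
            if np.linalg.norm(x_new - x) <= tol * (1.0 + np.linalg.norm(x)):
                return x_new
            x = x_new
        return x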
In this report we formulate eigenvalue-based methods for model calibration using a PDE-constrained optimization framework. We derive the abstract optimization operators from first principles and implement these methods using Sierra-SD and the Rapid Optimization Library (ROL). To demonstrate this approach, we use experimental measurements and an inverse solution to compute the joint and elastic foam properties of a low-fidelity unit (LFU) model.
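As a hedged toy illustration of eigenvalue-based calibration (not the Sierra-SD/ROL implementation or the LFU model), the sketch below fits a single stiffness scale theta so that the generalized eigenvalues of K(theta) x = lambda M x match synthetic "measured" values; all matrices and the true parameter are made up.

    import numpy as np
    from scipy.linalg import eigh
    from scipy.optimize import minimize_scalar

    M = np.eye(3)                                    # toy mass matrix
    K0 = np.array([[ 2.0, -1.0,  0.0],
                   [-1.0,  2.0, -1.0],
                   [ 0.0, -1.0,  2.0]])              # toy baseline stiffness
    lam_meas = eigh(1.7 * K0, M, eigvals_only=True)  # synthetic "measurements"

    def misfit(theta):
        # Least-squares mismatch between model and measured eigenvalues.
        lam = eigh(theta * K0, M, eigvals_only=True)
        return float(np.sum((lam - lam_meas) ** 2))

    theta_opt = minimize_scalar(misfit, bounds=(0.1, 10.0), method="bounded").x
    print(theta_opt)                                 # recovers ~1.7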
This paper presents an end-to-end design process for compliance-minimization-based topology optimization of cellular structures, through to the realization of a final printed product. Homogenization is used to derive properties representative of these structures through direct numerical simulation of unit-cell models of the underlying periodic structure. The resulting homogenized properties are then used, assuming a uniform distribution of the cellular structure, to compute the final macro-scale structure. A new method is then presented for generating an STL representation of the final optimized part that is suitable for printing on typical industrial machines. Quite fine cellular structures are shown to be possible using this method compared to other approaches that use NURBS-based CAD representations of the geometry. Finally, results are presented that illustrate the fine-scale stresses developed in the final macro-scale optimized part, and suggestions are made as to how to incorporate these features into the overall optimization process.
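For readers unfamiliar with compliance minimization, here is a minimal sketch of the objective and its sensitivity using a SIMP-style density interpolation; this is illustrative only, since the paper interpolates homogenized cellular properties rather than a power law, and Ke_list and f below are hypothetical.

    import numpy as np

    def compliance_and_gradient(rho, Ke_list, f, p=3.0):
        # Compliance c = f^T u with K(rho) u = f and
        # K(rho) = sum_e rho_e^p * Ke (elements share the full DOF space
        # here purely to keep the sketch short).
        K = sum(r**p * Ke for r, Ke in zip(rho, Ke_list))
        u = np.linalg.solve(K, f)
        c = f @ u
        # Compliance is self-adjoint: dc/drho_e = -p rho_e^(p-1) u^T Ke u.
        grad = np.array([-p * r**(p - 1) * (u @ Ke @ u)
                         for r, Ke in zip(rho, Ke_list)])
        return c, grad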
As transistors approach fundamental limits and Moore's law slows down, new devices and architectures are needed to enable continued performance gains. New approaches based on RRAM (resistive random-access memory) or memristor crossbars can enable the processing of large amounts of data [1, 2]. One of the most promising applications for RRAM crossbars is brain-inspired, or neuromorphic, computing [3, 4].
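The appeal of the crossbar is easy to state: with a conductance G[i][j] at each crosspoint and read voltages V[j] applied to the columns, Ohm's and Kirchhoff's laws yield row currents I[i] = sum_j G[i][j] * V[j], i.e., an analog vector-matrix multiply. The sketch below just checks that arithmetic; the values are made up.

    import numpy as np

    # Crosspoint conductances in siemens (rows x columns of the crossbar).
    G = np.array([[1.0e-6, 2.0e-6, 0.5e-6],
                  [0.2e-6, 1.5e-6, 1.0e-6]])
    # Read voltages applied to the columns, in volts.
    V = np.array([0.10, 0.20, 0.05])

    # Each row wire sums its crosspoint currents: I = G @ V.
    I = G @ V
    print(I)   # amperes; one multiply-accumulate per crosspoint, in parallel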
Millivolt switches will not only improve energy efficiency but will also enable a new capability to manage the energy-reliability tradeoff. By effectively exploiting this system-level capability, it may be possible to obtain one or two additional generations of scaling beyond current projections. Millivolt switches will enable further energy scaling, a process that is expected to continue until the technology encounters thermal-noise errors [Theis 10]. If thermal-noise errors can be accommodated at higher levels through a new form of error correction, it may be possible to scale roughly 3× lower in system energy than is currently projected. A general solution to errors would also address long-standing problems with cosmic-ray strikes, weak and aging parts, some cybersecurity vulnerabilities, and so on.
Proceedings of ISAV 2015: 1st International Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization, Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis
We present an architecture for high-performance computers that integrates in situ analysis of hardware and system monitoring data with application-specific data to reduce application runtimes and improve overall platform utilization. Large-scale high-performance computing systems typically use monitoring as a tool unrelated to application execution. Monitoring data flows from sampling points to a centralized off-system machine for storage and post-processing when root-cause analysis is required. Along the way, it may also be used for instantaneous threshold-based error detection. Applications can observe their own state, and possibly the state of the resources allocated to them, but they typically have no insight into the globally shared resource state that may affect their execution. By analyzing performance data in situ rather than off-line, we enable applications to make real-time decisions about their resource utilization. We address the particular case of in situ network congestion analysis and its potential to improve task placement and data partitioning. We present several design and analysis considerations.
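As a toy illustration of the in situ idea (not the architecture's actual interfaces), the sketch below evaluates a simple moving-average congestion indicator where the samples are produced, so an application could react immediately instead of waiting for off-system post-processing; the window and threshold are made up.

    from collections import deque

    class CongestionMonitor:
        """Moving-average check on link-utilization samples in [0, 1]."""
        def __init__(self, window=16, threshold=0.8):
            self.samples = deque(maxlen=window)
            self.threshold = threshold

        def observe(self, utilization):
            # Feed one sample; report True if the recent average looks congested.
            self.samples.append(utilization)
            return sum(self.samples) / len(self.samples) > self.threshold

    mon = CongestionMonitor()
    for u in [0.2, 0.9, 0.95, 0.97, 0.99]:
        if mon.observe(u):
            print("congested: consider re-placing tasks or re-partitioning data")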
Application resilience is a key challenge that has to be addressed to realize the exascale vision. Online recovery, even when it involves all processes, can dramatically reduce the overhead of failures as compared to the more traditional approach where the job is terminated and restarted from the last checkpoint. In this paper we explore how local recovery can be used for certain classes of applications to further reduce overheads due to resilience. Specifically, we develop programming support and scalable runtime mechanisms to enable online and transparent local recovery for stencil-based parallel applications on current leadership-class systems. We also show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution. We integrate these mechanisms with the S3D combustion simulation, and experimentally demonstrate (using the Titan Cray XK7 system at ORNL) the ability to tolerate high failure rates (i.e., node failures every 5 seconds) with low overhead while sustaining performance, at scales up to 262,144 cores.
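A conceptual sketch of local recovery (not the paper's runtime or its S3D integration): each "rank" keeps an in-memory checkpoint of its own subdomain, and a simulated failure rolls back only the failed block while the others proceed; the sweep below is a stand-in for the real stencil with halo exchange.

    import numpy as np

    def sweep(u):
        # Toy 1-D stencil update standing in for the real kernel.
        v = u.copy()
        v[1:-1] = 0.5 * (u[:-2] + u[2:])
        return v

    n_ranks, block = 4, 8
    blocks = [np.random.rand(block) for _ in range(n_ranks)]
    checkpoints = [b.copy() for b in blocks]          # local, in-memory copies

    for step in range(10):
        blocks = [sweep(b) for b in blocks]
        if step % 5 == 0:
            checkpoints = [b.copy() for b in blocks]  # refresh local checkpoints
        if step == 7:
            # Simulated failure of rank 2: restore only that block locally,
            # instead of rolling every rank back to a global checkpoint.
            blocks[2] = checkpoints[2].copy()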
We consider techniques to improve the performance of parallel sparse triangular solution on non-uniform memory architecture (NUMA) multicores by extending earlier coloring and level-set schemes for single-core multiprocessors. We develop STS-k, where k represents a small number of transformations for latency reduction through increased spatial and temporal locality of data accesses. We propose a graph model of data reuse to inform the development of STS-k and to prove that computing an optimal-cost schedule is NP-complete. We observe significant speed-ups with STS-3 on 32-core Intel Westmere-EX and 24-core AMD Magny-Cours processors. Incremental gains solely from the three-level transformations in STS-3 for a fixed ordering correspond to reductions in execution times by factors of 1.4 (Intel) and 1.5 (AMD) for level sets, and 2 (Intel) and 2.2 (AMD) for coloring. On average, execution times are reduced by a factor of 6 (Intel) and 4 (AMD) for STS-3 with coloring compared to a reference implementation using level sets.
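For context, a hedged serial sketch of the level-set schedule that STS-k builds on (this is the baseline idea, not STS-k's locality transformations): rows whose dependencies are all satisfied form a level and could be solved in parallel, and levels are processed in order.

    import numpy as np
    import scipy.sparse as sp

    def level_sets(L):
        # level[i] = 0 if row i has no off-diagonal dependencies,
        # else 1 + max level among its dependencies j < i.
        L = sp.csr_matrix(L)
        level = np.zeros(L.shape[0], dtype=int)
        for i in range(L.shape[0]):
            cols = L.indices[L.indptr[i]:L.indptr[i + 1]]
            level[i] = 1 + max((level[j] for j in cols if j < i), default=-1)
        return level

    def solve_by_levels(L, b):
        # Forward substitution, scheduled level by level; rows within one
        # level are mutually independent and could run concurrently.
        L = sp.csr_matrix(L)
        x = np.zeros(L.shape[0])
        lev = level_sets(L)
        for l in range(lev.max() + 1):
            for i in np.where(lev == l)[0]:
                row = slice(L.indptr[i], L.indptr[i + 1])
                s = sum(v * x[j] for j, v in zip(L.indices[row], L.data[row])
                        if j != i)
                x[i] = (b[i] - s) / L[i, i]
        return x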
Proceedings of E2SC 2015: 3rd International Workshop on Energy Efficient Supercomputing - Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis
Power consumption of extreme-scale supercomputers has become a key performance bottleneck, yet current practices do not leverage power-management opportunities, instead running at maximum power. This is not sustainable. Future systems will need to manage power as a critical resource, directing it to where it has the greatest benefit. Power capping is one mechanism for managing power budgets; however, its behavior is not well understood. This paper presents an empirical evaluation of several key HPC workloads running under a power cap on a Cray XC40 system and compares this technique with p-state control, demonstrating the performance differences of each. The results show that: (1) maximum performance requires ensuring the cap is not reached; (2) performance slowdown under a cap can be attributed to cascading delays, which result in unsynchronized performance variability across nodes; and (3) due to lag in reaction time, considerable time is spent operating above the set cap. This work provides a timely and much-needed comparison of HPC application performance under a power cap and aims to enable users and system administrators to understand how best to optimize application performance on power-constrained HPC systems.
It is challenging to obtain scalable HPC performance on real applications, especially for data science applications with irregular memory access and computation patterns. To drive co-design efforts in architecture, system, and application design, we are developing miniapps representative of data science workloads. These in turn stress the state of the art in Graph BLAS-like Graph Algorithm Building Blocks (GABB). In this work, we outline a Graph BLAS-like, linear algebra based approach to miniTri, one such miniapp. We describe a task-based prototype implementation and give initial scalability results.
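To make the linear-algebra connection concrete, here is the classic matrix formulation of triangle counting that GraphBLAS-style building blocks target (a standard identity, not necessarily miniTri's exact formulation): for an undirected adjacency matrix A, the triangle count is trace(A^3)/6.

    import numpy as np

    # Small undirected example graph: edges 0-1, 0-2, 1-2, 1-3, 2-3.
    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 1],
                  [1, 1, 0, 1],
                  [0, 1, 1, 0]])

    # Each triangle is counted 6 times on the diagonal of A^3
    # (3 vertices x 2 orientations).
    triangles = np.trace(A @ A @ A) // 6
    print(triangles)   # -> 2: {0,1,2} and {1,2,3}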
As high-performance computing systems continue to increase in size and complexity, higher failure rates and increased overheads for checkpoint/restart (CR) protocols have raised concerns about the practical viability of CR protocols for future systems. Previously, compression has proven to be a viable approach for reducing checkpoint data volumes, thereby reducing CR protocol overhead and improving application performance. In this article, we further explore compression-based CR optimization by examining its baseline performance and scaling properties, evaluating whether improved compression algorithms might lead to even better application performance, and comparing checkpoint compression against and alongside other software- and hardware-based optimizations. Our key findings are that: (1) compression is a very viable CR optimization; (2) generic, text-based compression algorithms appear to perform near-optimally for checkpoint data compression, and faster compression algorithms will not lead to better application performance; (3) compression-based optimizations fare well against and alongside other software-based optimizations; and (4) while hardware-based optimizations outperform software-based ones, they are not as cost-effective.
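A minimal sketch of the software-side idea, using the generic compressor zlib as a stand-in for the text-based algorithms the article evaluates; the "checkpoint" is a made-up state array, and real checkpoint data would compress differently.

    import time
    import zlib
    import numpy as np

    # Stand-in for application state captured at a checkpoint.
    state = np.sin(np.linspace(0.0, 20.0, 1_000_000))
    raw = state.tobytes()

    t0 = time.perf_counter()
    compressed = zlib.compress(raw, level=6)
    elapsed = time.perf_counter() - t0
    print(f"ratio {len(raw) / len(compressed):.2f}x in {elapsed:.3f} s")

    # Round-trip to show the checkpoint is recovered exactly.
    restored = np.frombuffer(zlib.decompress(compressed), dtype=state.dtype)
    assert np.array_equal(state, restored)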