Publications Search

Analog hardware accelerators, which perform computation within a dense memory array, have the potential to overcome the major bottlenecks faced by digital hardware for data-heavy workloads such as deep learning. Exploiting the intrinsic computational advantages of memory arrays, however, has proven challenging, principally because of the overhead imposed by the peripheral circuitry and the non-ideal properties of the memory devices that play the role of the synapse. We review the existing implementations of these accelerators for deep supervised learning, organizing our discussion around the different levels of the accelerator design hierarchy, with an emphasis on circuits and architecture. We explore and consolidate the various approaches that have been proposed to address the critical challenges faced by analog accelerators, for both neural network inference and training, and highlight the key design trade-offs underlying these techniques.

TYPE Journal Article YEAR 2020

DOI OSTI Scopus

Adapting in Situ Accelerators for Sparsity with Granular Matrix Reordering

IEEE Computer Architecture Letters

Mikhailenko, Darya; Nakamoto, Yujin; Feinberg, Benjamin; Ipek, Engin

Neural network (NN) inference is an essential part of modern systems and is found at the heart of numerous applications ranging from image recognition to natural language processing. In situ NN accelerators can efficiently perform NN inference using resistive crossbars, which makes them a promising solution to the data movement challenges faced by conventional architectures. Although such accelerators demonstrate significant potential for dense NNs, they often do not benefit from sparse NNs, which contain relatively few non-zero weights. Processing sparse NNs on in situ accelerators wastes energy charging entire crossbars in which most elements are zero. To address this limitation, this letter proposes Granular Matrix Reordering (GMR): a preprocessing technique that enables energy-efficient computation of sparse NNs on in situ accelerators. GMR reorders the rows and columns of sparse weight matrices to maximize crossbar utilization and minimize the total number of crossbars that must be charged. The reordering process does not rely on sparsity patterns and incurs no accuracy loss. Overall, GMR achieves an average 28 percent (and up to 34 percent) reduction in energy consumption over seven pruned NNs across four different pruning methods and network architectures.
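
To make the reordering idea concrete, here is a minimal Python sketch. It is not the GMR algorithm from the letter: the `reorder` heuristic (lexicographically sorting rows, then columns, by their zero/non-zero patterns), the 2×2 tile size, and all helper names are hypothetical. It counts the crossbar-sized tiles that contain at least one non-zero weight and shows that permuting rows and columns can lower that count without changing the matrix-vector product, provided the inputs are permuted to match and the outputs are scattered back.

```python
def occupied_tiles(matrix, tile):
    """Count tile x tile blocks holding at least one non-zero weight.

    Each such block stands in for a crossbar that would have to be charged.
    """
    rows, cols = len(matrix), len(matrix[0])
    count = 0
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            if any(matrix[r][c] != 0
                   for r in range(r0, min(r0 + tile, rows))
                   for c in range(c0, min(c0 + tile, cols))):
                count += 1
    return count


def reorder(matrix):
    """Hypothetical heuristic: group rows, then columns, with identical sparsity patterns.

    Returns the permuted matrix plus the row and column permutations.
    """
    rows, cols = len(matrix), len(matrix[0])
    # Sort rows lexicographically by zero/non-zero pattern (False = non-zero sorts first).
    row_perm = sorted(range(rows), key=lambda r: tuple(matrix[r][c] == 0 for c in range(cols)))
    permuted = [matrix[r] for r in row_perm]
    # Repeat for columns, using the already row-permuted matrix.
    col_perm = sorted(range(cols), key=lambda c: tuple(permuted[r][c] == 0 for r in range(rows)))
    permuted = [[row[c] for c in col_perm] for row in permuted]
    return permuted, row_perm, col_perm


def matvec(matrix, x):
    """Plain dense y = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def reordered_matvec(permuted, row_perm, col_perm, x):
    """Compute the original W x using the reordered matrix."""
    x_perm = [x[c] for c in col_perm]      # permute inputs to match the new column order
    y_perm = matvec(permuted, x_perm)
    y = [0] * len(row_perm)
    for i, r in enumerate(row_perm):       # scatter outputs back to the original row order
        y[r] = y_perm[i]
    return y


W = [[1, 0, 2, 0],
     [0, 3, 0, 4],
     [5, 0, 6, 0],
     [0, 7, 0, 8]]
W_reordered, row_perm, col_perm = reorder(W)
print(occupied_tiles(W, 2), occupied_tiles(W_reordered, 2))  # 4 2
print(reordered_matvec(W_reordered, row_perm, col_perm, [1, 2, 3, 4]) == matvec(W, [1, 2, 3, 4]))  # True
```

On this 4×4 checkerboard example, all four 2×2 tiles are occupied before reordering and only the two diagonal tiles afterwards, while the product is unchanged.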

TYPE Journal Article YEAR 2020

DOI OSTI Scopus

Multiscale Modeling of Single Event-Induced Faults in FinFET-based Processors

Cannon, Matthew J.; Rodrigues, Arun; Black, Dolores A.; Black, Jeffrey D.; Bustamante, Luis; Feinberg, Benjamin; Clark, Larry; Brunhaver, John; Barnaby, Hugh; Agarwal, Sapan; Marinella, Matthew

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

Commutative Data Reordering: A New Technique to Reduce Data Movement Energy on Sparse Inference Workloads

Feinberg, Benjamin; Heyman, Benjamin; Mikhailenko, Darya; Wong, Ryan; Ho, An; Ipek, Engin

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

Commutative Data Reordering: A New Technique to Reduce Data Movement Energy on Sparse Inference Workloads

Proceedings - International Symposium on Computer Architecture

Feinberg, Benjamin; Heyman, Benjamin C.; Mikhailenko, Darya; Wong, Ryan; Ho, An C.; Ipek, Engin

Data movement is a significant and growing consumer of energy in modern systems, from specialized low-power accelerators to GPUs with power budgets in the hundreds of watts. Given the importance of the problem, prior work has proposed designing interconnects on which the energy cost of transmitting a 0 is significantly lower than that of transmitting a 1. With such an interconnect, data movement energy is reduced by encoding the transmitted data such that the number of 1s is minimized. Although promising, these data encoding proposals do not take full advantage of application-level semantics. As an example of a neglected optimization opportunity, consider the case of a dot product computation as part of a neural network inference task. The order in which the neural network weights are fetched and processed does not affect correctness, and can be optimized to further reduce data movement energy. This paper presents commutative data reordering (CDR), a hardware-software approach that leverages the commutative property in linear algebra to strategically select the order in which weight matrix coefficients are fetched from memory. To find a low-energy transmission order, weight ordering is modeled as an instance of one of two well-studied problems, the Traveling Salesman Problem and the Capacitated Vehicle Routing Problem. This reduction makes it possible to leverage the vast body of work on efficient approximation methods to find a good transmission order. CDR exploits the indirection inherent to sparse matrix formats such that no additional metadata is required to specify the selected order. The hardware modifications required to support CDR are minimal, and incur an area penalty of less than 0.01% when implemented on top of a mobile-class GPU.
When applied to 7 neural network inference tasks running on a GPU-based system, CDR reduces average DRAM I/O energy by 53.1% and 22.2% relative to the data bus invert encoding scheme used by LPDDR4 and the recently proposed Base + XOR encoding, respectively. These savings are attained with no changes to the mobile system software and no runtime performance penalty.
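
The reordering idea can be sketched in a few lines of Python under loudly stated assumptions. Each 8-bit weight is assumed to be sent XORed with the previously sent word (a simplified stand-in for an XOR-based bus encoding, not the Base + XOR scheme itself), so the number of 1s on the bus equals the Hamming distance between consecutive words; a greedy nearest-neighbour tour stands in for the TSP/CVRP approximation methods the paper draws on. The weights and all function names are hypothetical. Because each weight travels with its column index, as in a sparse format, the dot product is unchanged by the new order.

```python
def popcount(v):
    """Number of 1 bits in a non-negative integer."""
    return bin(v).count("1")


def xor_bus_cost(words):
    """1s on the bus if each word is sent XORed with the previous one (assumed encoding)."""
    cost, prev = 0, 0
    for w in words:
        cost += popcount(w ^ prev)
        prev = w
    return cost


def greedy_order(entries):
    """Nearest-neighbour TSP heuristic over (column_index, weight_byte) pairs.

    At each step, send the remaining weight whose encoded word has the fewest 1s.
    """
    remaining = list(entries)
    order, prev = [], 0
    while remaining:
        nxt = min(remaining, key=lambda e: popcount(e[1] ^ prev))
        remaining.remove(nxt)
        order.append(nxt)
        prev = nxt[1]
    return order


def sparse_dot(entries, x):
    """Dot product of a sparse row, stored as (column_index, weight) pairs, with a dense x."""
    return sum(w * x[c] for c, w in entries)


row = [(0, 240), (3, 15), (5, 241), (7, 14)]  # hypothetical 8-bit weights
reordered = greedy_order(row)
x = [1, 2, 3, 4, 5, 6, 7, 8]
print(xor_bus_cost([w for _, w in row]), xor_bus_cost([w for _, w in reordered]))  # 27 12
print(sparse_dot(row, x) == sparse_dot(reordered, x))  # True
```

For these four weights, the original order (240, 15, 241, 14) puts 27 ones on the bus under the assumed encoding, and the greedy order (14, 15, 241, 240) puts 12.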

TYPE Conference Poster YEAR 2020

DOI OSTI Scopus

Device-aware inference operations in SONOS nonvolatile memory arrays

IEEE International Reliability Physics Symposium Proceedings

Bennett, Christopher; Xiao, Tianyao P.; Dellana, Ryan; Feinberg, Benjamin; Agarwal, Sapan; Marinella, Matthew; Agrawal, Vineet; Prabhakar, Venkatraman; Ramkumar, Krishnaswamy; Hinh, Long; Saha, Swatilekha; Raghavan, Vijay; Chettuvetty, Ramesh

Non-volatile memory arrays can deploy pre-trained neural network models for edge inference. However, these systems are affected by device-level noise and retention issues. Here, we examine the damage caused by these effects, introduce a mitigation strategy, and demonstrate its use in a fabricated array of SONOS (Silicon-Oxide-Nitride-Oxide-Silicon) devices. On MNIST, Fashion-MNIST, and CIFAR-10 tasks, our approach increases resilience to synaptic noise and drift. We also show that strong performance can be achieved with ADCs of 5-8 bits of precision.
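
As a rough illustration of why ADC precision matters, the sketch below models a crossbar matrix-vector product (each column output sums conductance times input voltage over the rows) followed by a uniform b-bit ADC. Everything here is an assumption for illustration: signed weights are used directly (a physical array would need, for example, differential device pairs), the optional multiplicative Gaussian noise is a generic stand-in for device-level noise, and the ±2 full-scale range and helper names are hypothetical. It does not reproduce the paper's mitigation strategy.

```python
import random


def crossbar_mvm(weights, x, noise_sigma=0.0, rng=None):
    """Analog MVM sketch: output j is the sum over rows i of G[i][j] * V[i].

    weights[i][j] plays the role of a (signed) conductance; noise_sigma adds an
    assumed multiplicative Gaussian perturbation to every device on each read.
    """
    rng = rng or random.Random(0)
    rows, cols = len(weights), len(weights[0])
    out = []
    for j in range(cols):
        total = 0.0
        for i in range(rows):
            g = weights[i][j]
            if noise_sigma:
                g *= 1.0 + rng.gauss(0.0, noise_sigma)
            total += g * x[i]
        out.append(total)
    return out


def adc(value, bits, full_scale):
    """Uniform b-bit ADC over [-full_scale, full_scale]; out-of-range inputs are clipped."""
    levels = 2 ** bits - 1  # number of steps between the lowest and highest code
    step = 2 * full_scale / levels
    clipped = max(-full_scale, min(full_scale, value))
    code = round((clipped + full_scale) / step)
    return code * step - full_scale


W = [[0.5, -0.25], [0.75, 0.1], [-0.3, 0.6]]  # 3 input rows x 2 output columns
x = [1.0, 0.5, -1.0]
ideal = crossbar_mvm(W, x)
noisy = crossbar_mvm(W, x, noise_sigma=0.05, rng=random.Random(1))
for bits in (5, 8):
    print(bits, [adc(y, bits, 2.0) for y in noisy])
```

With a ±2 range, the quantization step is 4/31 (about 0.13) at 5 bits and 4/255 (about 0.016) at 8 bits, so an in-range output is off by at most half a step before device noise is considered.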

TYPE Conference Poster YEAR 2020

OSTI Scopus

Evaluating complexity and resilience trade-offs in emerging memory inference machines

Bennett, Christopher; Dellana, Ryan; Xiao, Tianyao P.; Feinberg, Benjamin; Agarwal, Sapan; Cardwell, Suma G.; Marinella, Matthew; Severa, William M.; Aimone, James B.

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

Energy and Performance Benchmarking of a Domain Wall-Magnetic Tunnel Junction Multibit Adder

IEEE Journal on Exploratory Solid-State Computational Devices and Circuits

Xiao, Tianyao P.; Bennett, Christopher; Hu, Xuan; Feinberg, Benjamin; Jacobs-Gedrim, Robin B.; Agarwal, Sapan; Brunhaver, John S.; Friedman, Joseph S.; Incorvia, Jean A.C.; Marinella, Matthew

The domain-wall (DW)-magnetic tunnel junction (MTJ) device implements universal Boolean logic in a manner that is naturally compact and cascadable. However, an evaluation of the energy efficiency of this emerging technology for standard logic applications is still lacking. In this article, we use a previously developed compact model to construct and benchmark a 32-bit adder built entirely from DW-MTJ devices that communicates with DW-MTJ registers. The results of this large-scale design and simulation indicate that while the energy cost of systems driven by spin-transfer torque (STT) DW motion is significantly higher than previously predicted, the same concept using spin-orbit torque (SOT) switching improves the energy per operation by multiple orders of magnitude, attaining competitive energy values relative to a comparable CMOS subprocessor component. This result clarifies the path toward practical implementations of an all-magnetic processor system.

TYPE Journal Article YEAR 2019

DOI OSTI Scopus
