Publications

14 Results

Accelerating Finite-temperature Kohn-Sham Density Functional Theory with Deep Neural Networks

Ellis, John E.; Cangi, Attila; Modine, N.A.; Stephens, John A.; Thompson, Aidan P.; Rajamanickam, Sivasankaran R.

We present a numerical modeling workflow based on machine learning (ML) which reproduces the total energies produced by Kohn-Sham density functional theory (DFT) at finite electronic temperature to within chemical accuracy at negligible computational cost. Based on deep neural networks, our workflow yields the local density of states (LDOS) for a given atomic configuration. From the LDOS, spatially-resolved, energy-resolved, and integrated quantities can be calculated, including the DFT total free energy, which serves as the Born-Oppenheimer potential energy surface for the atoms. We demonstrate the efficacy of this approach for both solid and liquid metals and compare results between independent and unified machine-learning models for solid and liquid aluminum. Our machine-learning density functional theory framework opens the path toward multiscale materials modeling for matter under ambient and extreme conditions at a computational scale and cost that is unattainable with current algorithms.
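
As a rough illustration of the workflow's structure (every name, shape, and formula below is invented for the example, not taken from the actual code): a surrogate model maps per-grid-point descriptors of an atomic configuration to an LDOS on an energy grid, from which energy-resolved and spatially-resolved quantities follow by simple quadrature.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_ldos(descriptors, W, b):
    """Toy stand-in for the deep neural network: map per-grid-point
    descriptors to a non-negative LDOS over n_energy energy levels."""
    hidden = np.tanh(descriptors @ W)       # one hidden layer
    return np.maximum(hidden + b, 0.0)      # LDOS is non-negative

# invented sizes: 50 spatial grid points, 8 descriptor features, 40 energies
n_grid, n_feat, n_energy = 50, 8, 40
descriptors = rng.normal(size=(n_grid, n_feat))
W = 0.1 * rng.normal(size=(n_feat, n_energy))
b = 0.1 * rng.normal(size=n_energy)

ldos = predict_ldos(descriptors, W, b)      # shape (n_grid, n_energy)

energies = np.linspace(-5.0, 5.0, n_energy)
de = energies[1] - energies[0]
dos = ldos.mean(axis=0)                     # energy-resolved quantity
density = ldos.sum(axis=1) * de             # spatially-resolved quantity

# a band-energy-like scalar from the DOS (schematic only, not real DFT)
band_energy = (dos * energies).sum() * de
print(dos.shape, density.shape)
```

In the real workflow the network is trained against DFT data and the integrated quantities feed the total free energy; this sketch only shows the LDOS-to-integrated-quantities plumbing.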

ALO-NMF: Accelerated Locality-Optimized Non-negative Matrix Factorization

Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Moon, Gordon E.; Ellis, John E.; Sukumaran-Rajam, Aravind; Parthasarathy, Srinivasan; Sadayappan, P.

Non-negative Matrix Factorization (NMF) is a key kernel for unsupervised dimension reduction used in a wide range of applications, including graph mining, recommender systems and natural language processing. Due to the compute-intensive nature of applications that must perform repeated NMF, several parallel implementations have been developed. However, existing parallel NMF algorithms have not addressed data locality optimizations, which are critical for high performance since data movement costs greatly exceed the cost of arithmetic/logic operations on current computer systems. In this paper, we present a novel optimization method for parallel NMF algorithms based on the HALS (Hierarchical Alternating Least Squares) scheme that incorporates algorithmic transformations to enhance data locality. Efficient realizations of the algorithm on multi-core CPUs and GPUs are developed, demonstrating a new Accelerated Locality-Optimized NMF (ALO-NMF) that obtains up to 2.29x lower data movement cost and up to 4.45x speedup over existing state-of-the-art parallel NMF algorithms.
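
For readers unfamiliar with HALS, a plain (unoptimized) reference version of the scheme can be sketched as follows; the paper's contribution is precisely the locality-aware reorganization of these loops, which this sketch does not attempt.

```python
import numpy as np

def hals_nmf(X, k, n_iter=200, eps=1e-10):
    """Reference HALS NMF: X ≈ W @ H with W, H >= 0, updating one
    rank-1 factor at a time with non-negative least-squares steps."""
    rng = np.random.default_rng(0)
    m, n = X.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        XtW, WtW = X.T @ W, W.T @ W      # Gram matrices reused for all rows
        for j in range(k):               # update row j of H
            num = XtW[:, j] - H.T @ WtW[:, j] + WtW[j, j] * H[j]
            H[j] = np.maximum(num / (WtW[j, j] + eps), 0.0)
        XHt, HHt = X @ H.T, H @ H.T
        for j in range(k):               # update column j of W
            num = XHt[:, j] - W @ HHt[:, j] + HHt[j, j] * W[:, j]
            W[:, j] = np.maximum(num / (HHt[j, j] + eps), 0.0)
    return W, H

X = np.abs(np.random.default_rng(1).normal(size=(30, 20)))
W, H = hals_nmf(X, k=5)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.3f}")
```

Note how each inner update touches one column of a small Gram matrix rather than refactoring all of X; the data-movement cost of how those Gram products are tiled and scheduled is what ALO-NMF optimizes.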

miniGAN: A Generative Adversarial Network proxy application WBS 2.2.6.08 ECP-2.1.3 (Q3 FY2020 Milestone Report) (V.1.0)

Ellis, John E.

In order to support the machine learning co-design needs of ECP applications in current and future DOE HPC hardware, we have developed a generative adversarial network (GAN) proxy application, miniGAN, that has been released through the ECP proxy application suite. The proxy application is representative of the needs of ExaLearn's target applications, specifically the CosmoFlow and ExaGAN cosmology applications and the ExaWind energy application. The proxy application also demonstrates the first use of performance-portable kernels within widely-used machine learning frameworks: PyTorch (Facebook) and Horovod (Uber). We provide performance scaling results for workloads similar to ExaGAN and a profile of individual GAN training components.
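
The alternating generator/discriminator training that miniGAN models can be illustrated with a deliberately tiny, framework-free example: a 1-D linear generator and a logistic discriminator with hand-derived gradients. Nothing here is taken from the miniGAN code itself; it only shows the adversarial update structure.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Real data: 1-D samples from N(3, 1). Generator g(z) = a*z + b maps
# noise to samples; discriminator D(x) = sigmoid(w*x + c) scores realness.
a, b = 1.0, 0.0          # generator parameters
w, c = 0.0, 0.0          # discriminator parameters
lr, batch = 0.05, 64

for step in range(2000):
    real = rng.normal(3.0, 1.0, batch)
    z = rng.normal(0.0, 1.0, batch)
    fake = a * z + b

    # discriminator step: ascend log D(real) + log(1 - D(fake))
    d_real, d_fake = sigmoid(w * real + c), sigmoid(w * fake + c)
    w += lr * (np.mean((1 - d_real) * real) - np.mean(d_fake * fake))
    c += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # generator step: ascend log D(fake) (non-saturating loss)
    d_fake = sigmoid(w * fake + c)
    g_common = (1 - d_fake) * w      # d log D / d fake
    a += lr * np.mean(g_common * z)  # chain rule through fake = a*z + b
    b += lr * np.mean(g_common)

fake = a * rng.normal(0.0, 1.0, 10000) + b
print(f"generated mean ≈ {fake.mean():.2f} (target 3.0)")
```

A real GAN proxy replaces these scalars with convolutional networks and lets a framework such as PyTorch handle the differentiation; the alternating ascent structure is the same.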

ECP Report: Update on Proxy Applications and Vendor Interactions

Ang, Jim; Sweeney, Christine; Wolf, Michael W.; Ellis, John E.; Ghosh, Sayan; Kagawa, Ai; Huang, Yunzhi; Rajamanickam, Sivasankaran R.; Ramakrishnaiah, Vinay; Schram, Malachi; Yoo, Shinjae

The ExaLearn miniGAN team (Ellis and Rajamanickam) has released miniGAN, a generative adversarial network (GAN) proxy application, through the ECP proxy application suite. miniGAN is the first machine learning proxy application in the suite (note: the ECP CANDLE project did previously release some benchmarks) and models the performance of training generator and discriminator networks. The GAN's generator and discriminator generate plausible 2D/3D maps and identify fake maps, respectively. miniGAN aims to be a proxy application for related applications in cosmology (CosmoFlow, ExaGAN) and wind energy (ExaWind). miniGAN has been developed so that optimized mathematical kernels (e.g., kernels provided by Kokkos Kernels) can be plugged into the proxy application to explore potential performance improvements. miniGAN has been released as open source software and is available through the ECP proxy application website (https://proxyapps.exascaleproject.org/ecp-proxy-apps-suite/) and on GitHub (https://github.com/SandiaMLMiniApps/miniGAN). As part of this release, a generator is provided to produce a data set (a series of images) that serves as input to the proxy application.

Scalable inference for sparse deep neural networks using kokkos kernels

2019 IEEE High Performance Extreme Computing Conference, HPEC 2019

Ellis, John E.; Rajamanickam, Sivasankaran R.

Over the last decade, hardware advances have led to the feasibility of training and inference for very large deep neural networks. Sparsified deep neural networks (DNNs) can greatly reduce memory costs and increase throughput of standard DNNs, if loss of accuracy can be controlled. The IEEE HPEC Sparse Deep Neural Network Graph Challenge serves as a testbed for algorithmic and implementation advances to maximize computational performance of sparse deep neural networks. We base our sparse DNN inference implementation, KK-SpDNN, on the sparse linear algebra kernels within the Kokkos Kernels library. Using the sparse matrix-matrix multiplication in Kokkos Kernels allows us to reuse a highly optimized kernel. We focus on reducing the single-node and multi-node runtimes for 12 sparse networks. We test KK-SpDNN on Intel Skylake and Knights Landing architectures and see 120-500x improvement in single-node performance over the serial reference implementation. We run in data-parallel mode with MPI to further speed up network inference, ultimately obtaining an edge processing rate of 1.16e+12 edges per second on 20 Skylake nodes. This translates to a 13x speedup on 20 nodes compared to our highly optimized multithreaded implementation on a single Skylake node.
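
The core computational pattern here, a chain of sparse matrix-matrix products with a clamped ReLU between layers, can be sketched with SciPy standing in for the Kokkos Kernels SpGEMM. All sizes, the bias, and the saturation value below are illustrative, not taken from the Graph Challenge data sets.

```python
import numpy as np
from scipy import sparse

def sparse_dnn_inference(Y0, weights, bias, ymax=32.0):
    """Sparse DNN inference as repeated sparse matrix-matrix products
    (the pattern KK-SpDNN maps onto an optimized SpGEMM kernel):
    Y <- min(ReLU(Y @ W + bias), ymax), layer by layer."""
    Y = Y0
    for W in weights:
        Z = (Y @ W).toarray() + bias      # SpMM, then bias add
        Z = np.clip(Z, 0.0, ymax)         # ReLU with a saturation cap
        Y = sparse.csr_matrix(Z)          # re-sparsify for the next layer
    return Y

# toy example: 100 inputs, 64 neurons per layer, 3 sparse layers
Y0 = sparse.random(100, 64, density=0.10, random_state=0, format="csr")
weights = [sparse.random(64, 64, density=0.05, random_state=i, format="csr")
           for i in range(3)]
out = sparse_dnn_inference(Y0, weights, bias=-0.1)
print(out.shape, out.nnz)
```

In the real implementation the intermediate never goes dense; keeping every layer product inside a sparse SpGEMM kernel is exactly where the Kokkos Kernels reuse pays off.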

Understanding the Machine Learning Needs of ECP Applications

Ellis, John E.; Rajamanickam, Sivasankaran R.

In order to support the machine learning co-design needs of ECP applications on current and future hardware, the ExaLearn team at Sandia studied the machine learning use cases in three ECP applications. This report summarizes the needs of those three applications. The Sandia ExaLearn team will develop a proxy application representative of ECP application needs, specifically the ExaSky and EXAALT ECP projects. The proxy application will allow us to demonstrate performance-portable kernels within machine learning codes. Furthermore, current training scalability of machine learning networks in these applications is negatively affected by large batch sizes: training throughput increases with batch size, but network accuracy and generalization worsen. The proxy application will contain hybrid model- and data-parallelism to improve training efficiency while maintaining network accuracy. It will also target optimizing 3D convolutional layers, which are specific to scientific machine learning and have not been as thoroughly explored by industry.
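
The data-parallel half of that hybrid scheme is easy to sketch: each worker computes a gradient on its shard of the global batch, and the averaged gradient (the role MPI_Allreduce or Horovod plays on a real machine) drives a single shared update. The toy problem below is linear regression, chosen only so the sketch stays self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: linear regression, loss = mean((X @ theta - y)**2)
X = rng.normal(size=(256, 4))
theta_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ theta_true + 0.01 * rng.normal(size=256)

def local_grad(Xs, ys, theta):
    """Gradient of the mean-squared loss on one worker's shard."""
    return 2.0 * Xs.T @ (Xs @ theta - ys) / len(ys)

theta = np.zeros(4)
n_workers = 4
for step in range(500):
    # data parallelism: each worker holds one shard of the global batch;
    # the per-shard gradients are averaged into one shared update
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [local_grad(Xs, ys, theta) for Xs, ys in shards]
    theta -= 0.05 * np.mean(grads, axis=0)

print(np.round(theta, 2))
```

Growing n_workers at fixed per-worker shard size is exactly how the global batch inflates, which is the accuracy/generalization tension the report describes; the hybrid model-parallel part (splitting the network itself) is not shown here.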
