Monte Carlo Particle Transport at Sandia National Laboratories 2021
Abstract not provided.
Proceedings of the International Conference on Mathematics and Computational Methods Applied to Nuclear Science and Engineering, M and C 2021
Conditional Point Sampling (CoPS) is a recently developed stochastic media transport algorithm that has demonstrated a high degree of accuracy in 1D and 3D simulations implemented for the CPU in Python. However, it is increasingly important that modern, production-level transport codes like CoPS be adapted for use on next-generation computing architectures. In this project, we describe the creation of a fast and accurate variant of CoPS implemented for the GPU in C++. As an initial test, we performed a code-to-code verification using single-history cohorts, which showed that the GPU implementation matched the original CPU implementation to within statistical uncertainty, while improving the speed by over a factor of 4000. We then tested the GPU implementation for cohorts up to size 64 and compared three variants of CoPS based on how the particle histories are grouped into cohorts: successive, simultaneous, and a successive-simultaneous hybrid. We examined the accuracy-efficiency tradeoff of each variant for 9 different benchmarks, measuring the reflectance and transmittance in a cubic geometry with reflecting boundary conditions on the four faces other than those where reflectance and transmittance are measured. Successive cohorts were found to be far more accurate than simultaneous cohorts for both reflectance (4.3 times) and transmittance (5.9 times), although simultaneous cohorts ran more than twice as fast as successive cohorts, especially for larger cohorts. The hybrid cohorts demonstrated speed and accuracy behavior most similar to that of simultaneous cohorts. Overall, successive cohorts were found to be more suitable for the GPU due to their greater accuracy and reproducibility, although simultaneous and hybrid cohorts present an enticing prospect for future research.
Traditional Monte Carlo particle transport codes are expected to run inefficiently on next-generation architectures as they are memory-intensive and highly divergent. Since electrons and photons also behave differently, the future for coupled electron-photon radiation transport looks even worse. This project describes preliminary efforts to improve the performance of Monte Carlo particle transport codes when using accelerators like the graphics processing unit (GPU). Two key issues are addressed: how to handle memory-intensive tallies, and how to reduce divergence. Tallying on the GPU can be done efficiently by post-processing particle data, or by using a feature called warp shuffle for summing scores in parallel during the simulation. Reducing divergence is possible by using an event-based algorithm for particle tracking instead of the traditional history-based one. Although performance tests presented in this work show that the history-based algorithm generally outperformed the event-based one for simple problems, this outcome will likely change as the complexity of the code increases.
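The distinction between history-based and event-based tracking can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions, not the code the abstract describes: the toy 1D slab physics, the `Particle` and `EscapeCounts` types, the small random number generator, and every function name are hypothetical. The history-based loop follows one particle until it dies; the event-based loop applies one event type to all live particles before moving on, which on a GPU would let every thread in a kernel execute the same event.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Toy particle (assumed physics): starts at x = 0 moving in +x through a slab.
struct Particle {
    double x = 0.0;
    double dir = 1.0;
    std::uint32_t rng = 1;  // per-particle state, so both algorithms see the same stream
    bool alive = true;
};

struct EscapeCounts {
    int transmitted = 0;  // left through x >= width
    int reflected = 0;    // left through x < 0
};

// Minimal linear congruential generator returning a value strictly inside (0, 1).
double next_uniform(std::uint32_t& state) {
    state = 1664525u * state + 1013904223u;
    return (state + 0.5) / 4294967296.0;
}

// Event 1: fly an exponentially distributed distance; tally particles that escape.
void move_event(Particle& p, double width, EscapeCounts& counts) {
    p.x -= p.dir * std::log(next_uniform(p.rng));
    if (p.x >= width) { ++counts.transmitted; p.alive = false; }
    else if (p.x < 0.0) { ++counts.reflected; p.alive = false; }
}

// Event 2: collide; absorb with probability 0.5, otherwise pick a new direction.
void collide_event(Particle& p) {
    if (next_uniform(p.rng) < 0.5) { p.alive = false; return; }
    p.dir = (next_uniform(p.rng) < 0.5) ? -1.0 : 1.0;
}

// Builds n particles with distinct random seeds.
std::vector<Particle> make_bank(int n) {
    assert(n >= 0);
    std::vector<Particle> bank(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) {
        bank[static_cast<std::size_t>(i)].rng = 12345u + 7919u * static_cast<std::uint32_t>(i);
    }
    return bank;
}

// History-based: follow each particle from birth to death before starting the next.
EscapeCounts run_history_based(std::vector<Particle> bank, double width) {
    EscapeCounts counts;
    for (Particle& p : bank) {
        while (p.alive) {
            move_event(p, width, counts);
            if (p.alive) collide_event(p);
        }
    }
    return counts;
}

// Event-based: apply one event type to every live particle, then the next event type.
EscapeCounts run_event_based(std::vector<Particle> bank, double width) {
    EscapeCounts counts;
    bool any_alive = true;
    while (any_alive) {
        for (Particle& p : bank) {
            if (p.alive) move_event(p, width, counts);
        }
        any_alive = false;
        for (Particle& p : bank) {
            if (p.alive) {
                collide_event(p);
                any_alive = any_alive || p.alive;
            }
        }
    }
    return counts;
}
```

Because each particle carries its own random state, calling `run_history_based(make_bank(1000), 2.0)` and `run_event_based(make_bank(1000), 2.0)` produces the same escape counts; only the order in which the work is done differs.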
International Conference on Physics of Reactors, PHYSOR 2018: Reactor Physics Paving the Way Towards More Efficient Systems
Effectively using a graphics processing unit (GPU) for Monte Carlo particle transport is a challenging task due to its memory storage requirements and traditionally divergent algorithms. Most efforts in this area have focused on the entire transport process, choosing to use atomic operations or tally replication for computing tallies. This work isolates the performance of the tallies from the rest of the transport process, and studies the impact of using different approaches for tallying on the GPU. Five implementations of a photon escape tally are compared, using both single and double precision data types. Results show that replicating tallies is clearly the best option overall, if there is enough memory available on the GPU to store them. When insufficient memory becomes an issue, the best method to use depends on the size, data type, and update frequency of the tally. Global atomic updates can be a reasonable option in some cases, especially if they are infrequently used. However, there are two alternatives for general-purpose tallying that were shown to be more effective in most of the scenarios considered. These two alternatives are based on NVIDIA's warp shuffle feature, which allows 32 threads to simultaneously exchange or broadcast data, minimizing the number of atomic operations needed to get the final tally result.
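The way a warp-level exchange cuts down on atomic operations can be illustrated with a CPU emulation. This is only a sketch under stated assumptions and is not one of the five implementations compared above: `warp_reduce_sum`, `score_warp`, and the `Tally` type are hypothetical names, and the 32 lanes are modeled as vector slots. On an NVIDIA GPU each halving step would roughly correspond to one `__shfl_down_sync` call executed by all lanes at once, and the final update to a single `atomicAdd` issued by lane 0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kWarpSize = 32;  // threads per warp on NVIDIA hardware

// Emulates a shuffle-down tree reduction: after log2(32) = 5 steps, lane 0
// holds the sum of all 32 per-lane scores. Taken by value so the caller's
// scores are left untouched.
double warp_reduce_sum(std::vector<double> lanes) {
    assert(lanes.size() == kWarpSize);
    for (std::size_t offset = kWarpSize / 2; offset > 0; offset /= 2) {
        // Snapshot the lanes so every lane reads pre-step values, as it would
        // when all threads in a warp exchange data simultaneously.
        const std::vector<double> before = lanes;
        for (std::size_t lane = 0; lane + offset < kWarpSize; ++lane) {
            lanes[lane] = before[lane] + before[lane + offset];
        }
    }
    return lanes[0];
}

// Hypothetical tally that also counts how many (emulated) atomic updates it received.
struct Tally {
    double value = 0.0;
    int updates = 0;
};

// Scores one warp's worth of contributions with a single tally update
// instead of 32 separate ones.
void score_warp(Tally& tally, const std::vector<double>& scores) {
    tally.value += warp_reduce_sum(scores);
    ++tally.updates;
}
```

In this emulation, scoring 32 contributions costs one tally update rather than 32, which is the reduction in atomic traffic that the abstract attributes to the warp shuffle approach.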