Methods for computing Monte Carlo tallies on the GPU
Effectively using a graphics processing unit (GPU) for Monte Carlo particle transport is a challenging task due to its memory storage requirements and traditionally divergent algorithms. Most efforts in this area have focused on the entire transport process, choosing to use atomic operations or tally replication for computing tallies. This work isolates the performance of the tallies from the rest of the transport process, and studies the impact of using different approaches for tallying on the GPU. Five implementations of a photon escape tally are compared, using both single and double precision data types. Results show that replicating tallies is clearly the best option overall, if there is enough memory available on the GPU to store them. When insufficient memory becomes an issue, the best method to use depends on the size, data type, and update frequency of the tally. Global atomic updates can be a reasonable option in some cases, especially if they are infrequently used. However, there are two alternatives for general-purpose tallying that were shown to be more effective in most of the scenarios considered. These two alternatives are based on NVIDIA's warp shuffle feature, which allows 32 threads to simultaneously exchange or broadcast data, minimizing the number of atomic operations needed to get the final tally result.
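The general idea of combining a warp shuffle with atomic updates can be illustrated with a minimal CUDA sketch. This is not the paper's implementation, and the abstract does not describe the two warp-shuffle variants in detail. The kernel name `tallyWarpShuffle`, the single-bin tally, and the `scores` array are hypothetical, chosen only for illustration. The sketch assumes a one-dimensional block whose size is a multiple of 32 and uses single precision, since `atomicAdd` on `double` requires compute capability 6.0 or higher.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread contributes one score to a single tally bin.
__global__ void tallyWarpShuffle(const float* scores, int n, float* tally)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Out-of-range threads contribute zero but still take part in the shuffle,
    // so all 32 lanes named in the mask execute it.
    float v = (i < n) ? scores[i] : 0.0f;

    // Tree reduction across the 32 lanes of the warp using register exchange.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);

    // Lane 0 now holds the warp's partial sum: one global atomic per warp
    // instead of one per thread.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(tally, v);
}

int main()
{
    const int n = 1 << 20;
    float* scores = nullptr;
    float* tally = nullptr;
    cudaMallocManaged(&scores, n * sizeof(float));
    cudaMallocManaged(&tally, sizeof(float));

    for (int i = 0; i < n; ++i) scores[i] = 1.0f;  // illustrative data only
    *tally = 0.0f;

    const int threads = 256;  // assumed to be a multiple of the warp size (32)
    const int blocks = (n + threads - 1) / threads;
    tallyWarpShuffle<<<blocks, threads>>>(scores, n, tally);
    cudaDeviceSynchronize();

    printf("tally = %.1f, number of scores = %d\n", *tally, n);

    const bool ok = (*tally == static_cast<float>(n));
    cudaFree(scores);
    cudaFree(tally);
    return ok ? 0 : 1;
}
```

In this sketch the number of global atomic operations drops by a factor of 32 compared with one `atomicAdd` per thread. A tally with many bins would additionally need to handle threads in the same warp scoring to different bins, which this single-bin example does not attempt.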