Transient Solid Dynamics Simulations on the Sandia/Intel Teraflop Computer

Stephen W. Attaway
Sandia National Laboratories
P.O. Box 5800, MS 0437
Albuquerque, NM 87185-0437
swattaw@sandia.gov

Edward J. Barragy
Intel Corporation / DP2-420
2800 Center Drive N.
Dupont, WA 98327
Edward_J_Barray@ccm.jf.intel.com

Kevin H. Brown
Sandia National Laboratories
P.O. Box 5800, MS 0443
Albuquerque, NM 87185-0443
khbrown@sandia.gov

David R. Gardner
Sandia National Laboratories
P.O. Box 5800, MS 1111
Albuquerque, NM 87185-1111
drgardn@sandia.gov
http://www.cs.sandia.gov/~drgardn

Bruce A. Hendrickson
Sandia National Laboratories
P.O. Box 5800, MS 1111
Albuquerque, NM 87185-1111
bahendr@sandia.gov
http://www.cs.sandia.gov/~bahendr

Steven J. Plimpton
Sandia National Laboratories
P.O. Box 5800, MS 1111
Albuquerque, NM 87185-1111
sjplimp@sandia.gov
http://www.cs.sandia.gov/~sjplimp

Courtenay T. Vaughan
Sandia National Laboratories
P.O. Box 5800, MS 1111
Albuquerque, NM 87185-1111
ctvaugh@sandia.gov

Abstract:
Transient solid dynamics simulations are among the most widely used engineering calculations. Industrial applications include vehicle crashworthiness studies, metal forging, and powder compaction prior to sintering. These calculations are also critical to defense applications including safety studies and weapons simulations. The practical importance of these calculations and their computational intensiveness make them natural candidates for parallelization. This has proved to be difficult, and existing implementations fail to scale to more than a few dozen processors. In this paper we describe our parallelization of PRONTO, Sandia's transient solid dynamics code, via a novel algorithmic approach that utilizes multiple decompositions for different key segments of the computations, including the material contact calculation. This latter calculation is notoriously difficult to perform well in parallel, because it involves dynamically changing geometry, global searches for elements in contact, and unstructured communications among the compute nodes. Our approach scales to at least 3600 compute nodes of the Sandia/Intel Teraflop computer (the largest set of nodes to which we have had access to date) on problems involving millions of finite elements. On this machine we can simulate models using more than ten million elements in a few tenths of a second per timestep, and solve problems more than 3000 times faster than a single processor of a Cray Jedi.

Keywords:
scalable, parallel, finite elements, transient solid dynamics

Introduction

Transient dynamics simulations are among the most widely used engineering calculations. Crash simulation, a prototypical transient dynamics calculation, consumes more time on Cray vector supercomputers than any other industrial application[7]. Other industrial applications include simulations of metal forging, powder compaction prior to sintering, and other processes involving high stresses and strains. These calculations are also critical to defense applications, including safety studies and weapons simulations. A number of commercial and government solid dynamics codes have been developed, including DYNA, PamCrash, and ABAQUS. Sandia also has a long history of research and code development in this area, headlined by the PRONTO code suite. PRONTO is similar in scope to the commercial codes, but also includes smoothed particle hydrodynamics (SPH), which allows for simulations with very high strains (e.g., explosions) or coupled fluid/structure interaction problems. A discussion of some PRONTO simulations can be found in Section 7. The practical importance of transient dynamics simulations, combined with their computational intensiveness, would seem to make them natural candidates for parallelization. Unfortunately, this has proved to be quite difficult. For reasons discussed below, existing parallel implementations fail to scale to more than a few dozen processors. These disappointing results have convinced leaders in the solid dynamics community that parallel computing cannot yet make a significant impact in this field[2].

In Section 2 we describe the functionality and structure of PRONTO. In Section 3 we explain why transient dynamics simulations have been difficult to parallelize. Our parallelization strategy is sketched in Section 4 and some further performance enhancements are described in Section 5. The performance of the code on some scalable problems is discussed in Section 6. A discussion of applications enabled by parallel PRONTO follows in Section 7. Conclusions are drawn in Section 8.

What is PRONTO?

PRONTO is a three-dimensional, transient solid dynamics code which is used for analyzing large deformations of nonlinear materials subjected to high rates of strain[3]. Developed over the past 10 years, PRONTO is a production-level code used by over 50 organizations inside and outside Sandia. Input to the code includes an unstructured grid consisting of an arbitrary mixture of hexahedral elements, shell elements, rigid bodies and smoothed particles. PRONTO implements a Lagrangian finite-element method with explicit time integration and adaptive timestep control to integrate the equations of motion. The finite-element formulation uses eight-node, uniform strain hexahedral elements and four-node quadrilateral uniform strain shell elements. Either the Flanagan-Belytschko hourglass control scheme or an assumed-strain hourglass control scheme can be used to control element distortions. PRONTO contains a variety of complex, nonlinear material models, including elastic-plastic materials with various types of strain hardening. A critical feature of the code is a robust algorithm for detecting when one material surface contacts another, for example in an automobile collision when the bumper buckles into the radiator. Correctly identifying surfaces in contact requires sophisticated algorithms for searching the global set of finite elements. In a complex simulation, the cost of contact detection alone can be more than 50% of the run time on a sequential machine. A PRONTO timestep has the following structure.

1. Perform finite element analysis to compute forces on elements.
2. Compute forces between smoothed particles.
3. Predict new locations of particles and grid elements.
4. Search for contacts between mesh elements, or between elements and particles.
5. Correct the locations by pushing back objects in contact.

Stages (1), (2) and (4) dominate the sequential run time. The contact search in stage (4) typically consumes 30-60% of the time, so a great deal of effort has been expended over the years to make this computation fast[4]. The result of this effort was to replace most of the floating point operations in PRONTO's contact search with a faster approach based on sorting and searching integer lists.
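
The flavor of this integer sort-and-search strategy can be illustrated with a simple bucket scheme, in which surface nodes are binned into a uniform grid of cells and each contact face examines only the nodes in nearby cells. The Python sketch below is a generic illustration under assumed data layouts (arrays of node coordinates and face centers); it is not PRONTO's actual routine.

    import numpy as np
    from collections import defaultdict

    def candidate_contacts(node_xyz, face_centers, cell_size):
        """For each contact face, list the surface nodes found in its cell and
        the 26 neighboring cells of a uniform grid.  Only these candidates are
        passed on to the detailed (more expensive) geometric contact check."""
        lo = node_xyz.min(axis=0)
        node_cells = np.floor((node_xyz - lo) / cell_size).astype(int)
        face_cells = np.floor((face_centers - lo) / cell_size).astype(int)

        buckets = defaultdict(list)              # integer cell -> node indices
        for i, cell in enumerate(map(tuple, node_cells)):
            buckets[cell].append(i)

        candidates = []
        for cx, cy, cz in face_cells:
            near = []
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    for dz in (-1, 0, 1):
                        near.extend(buckets.get((cx + dx, cy + dy, cz + dz), ()))
            candidates.append(near)
        return candidates

Once the candidate lists are formed, the per-pair geometric checks are local and cheap compared to an all-pairs search.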

Why is Parallelization Difficult?

Parallelizing transient dynamics codes is challenging for several reasons. For PRONTO there is the obvious complexity of starting with a fully featured production code. All its functionality must be parallelized in a scalable way. Even more daunting is the inherent difficulty of parallelizing several key kernel operations which operate on different data sets. The first task is to parallelize the finite element (FE) portion of the code. This is conceptually straightforward: partition the elements among processors in a way that balances computation while minimizing communication[5]. But parallelizing contact detection (which is performed on only the surface mesh - not the volumetric FE mesh) is much harder. To our knowledge, no previous attempts at parallelizing contact detection have scaled to more than a few dozen processors[8,9]. Since, in principle, on a given timestep any surface can contact any other, contact detection requires some kind of global search. As the geometry of the simulation evolves, this requires dynamic load balancing and irregular communication. Problems which exhibit any global, dynamic or irregular behavior are challenging to parallelize; contact detection exhibits all three. Parallelizing smoothed particle hydrodynamics (SPH) is also a challenging problem. Particles with time-dependent radii interact if they are geometrically near each other, and their density can vary greatly as the calculation proceeds, posing a load-balancing problem. Computing the physics of the SPH interactions also requires several stages of inter-processor communication within a timestep. The key difficulty in making a code like PRONTO perform well on a large parallel machine is that all of these computational kernels must be parallelized efficiently within the same timestep. And each of the kernels operates on a different data set (volumetric mesh, surface mesh, particles) whose spatial density is dynamically changing.

Our Parallel Implementation

We only sketch our parallelization strategy here. More details can be found in some of the references[1,6,10]. Most previous attempts to parallelize transient dynamics codes have relied upon a single decomposition of the mesh for both finite elements and contact detection. But these operations demand very different decomposition properties. The finite element analysis performs optimally only if each processor has the same number of elements and interprocessor boundaries are minimized. This decomposition can be generated once and used throughout the calculation. In contrast, contact detection and SPH depend upon geometric proximity, so a geometric decomposition is most appropriate. As the elements and particles evolve, the decompositions should change dynamically. The key idea behind our parallelization strategy is that we construct and maintain different decompositions for the different portions of the calculation. We choose appropriate decompositions to optimize performance of each phase: a static, graph-based decomposition generated by Chaco[5] for the finite element analysis, and dynamic, geometric decompositions for contact detection and SPH. For the latter we use recursive coordinate bisection (RCB), which has a number of attractive properties for this application. The advantage of this approach is that we can achieve high performance in all phases of the calculation. The downside is that we need to communicate considerable information between the different decompositions, which is expensive in both time and memory. But by carefully implementing the communication routines we can limit the run time cost, and solid dynamics calculations are not generally memory-bound. As our results will indicate, the advantages of multiple decompositions greatly outweigh the costs.
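
To make the geometric decomposition concrete, the following is a minimal serial sketch of recursive coordinate bisection: the point set is recursively split along its longest coordinate axis, in proportion to the number of parts assigned to each side, until the desired number of pieces remains. In parallel PRONTO the cuts are computed collectively across compute nodes; the function names and the exact splitting rule below are illustrative assumptions, not the production implementation.

    import numpy as np

    def rcb(points, ids, nparts):
        """Recursive coordinate bisection (serial sketch): split the point set
        along its longest axis until nparts pieces remain."""
        if nparts == 1:
            return [ids]
        axis = np.argmax(points.max(axis=0) - points.min(axis=0))  # longest extent
        order = np.argsort(points[:, axis])
        split = len(order) * (nparts // 2) // nparts  # balance uneven part counts
        left, right = order[:split], order[split:]
        return (rcb(points[left], ids[left], nparts // 2) +
                rcb(points[right], ids[right], nparts - nparts // 2))

    # Example: assign 10,000 random particles to 8 geometric subdomains.
    pts = np.random.rand(10000, 3)
    parts = rcb(pts, np.arange(len(pts)), 8)

Because each cut is a simple coordinate plane, finding which subdomain owns a given point later is a fast walk down the cut tree, which is one reason such geometric decompositions are attractive for proximity-based searches.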

A timestep of parallel PRONTO has the following structure.
1. Perform finite element analysis to compute forces on elements.
2. Update the RCB decomposition of smoothed particles.
3. Compute forces between smoothed particles.
4. Predict new locations of particles and grid points.
5. Ship data to previous decomposition of the contact problem.
6. Update the RCB decomposition of the contact problem.
7. Search for contacts between mesh elements, or between elements and particles.
8. Communicate contact results back to finite element and SPH decompositions.
9. Correct the locations by pushing back objects in contact.

Our parallelization of PRONTO required about 15,000 lines of new code. In addition, much of the original PRONTO code was restructured for the parallel version to improve data locality on cache-based architectures.
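
Steps (5), (6), and (8) of the parallel timestep above amount to moving element and surface data between the static finite element decomposition and the dynamic geometric one. The mpi4py sketch below shows only the basic communication pattern; the slab-based owner_of() routine is a deliberately simplified stand-in for the real RCB ownership query, and all of the names are illustrative assumptions rather than PRONTO's interfaces.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    def owner_of(xyz, xlo, xhi):
        """Stand-in for the RCB ownership query: a simple 1-D slab partition
        along x, purely for illustration."""
        frac = (xyz[0] - xlo) / max(xhi - xlo, 1e-30)
        return min(int(frac * size), size - 1)

    def ship_to_contact_decomposition(my_surface_nodes, xlo, xhi):
        """Send each locally owned surface node to the rank that owns its
        region of space in the contact decomposition, using one irregular
        all-to-all exchange."""
        outgoing = [[] for _ in range(size)]
        for node_id, xyz in my_surface_nodes:      # (id, (x, y, z)) pairs
            outgoing[owner_of(xyz, xlo, xhi)].append((node_id, xyz))
        incoming = comm.alltoall(outgoing)         # one list from each rank
        return [item for sublist in incoming for item in sublist]

A reverse exchange of the same form can return the contact results of step (8) to their owners in the finite element and SPH decompositions.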

Maximizing Performance

The goal of both serial and parallel PRONTO is to enable very large problems to run as quickly as possible. The dominant steps in the above outline are stages (1) and (7). (In this and the next section we focus on mesh-only problems, though SPH computations can also be time consuming.) The fastest way to perform the global searches inherent in stage (7) is to do virtually no flops at all, but rather to use integer-based sort and search operations. Our calculations were performed on the 3600-node Sandia Teraflop computer. Each node of this machine has 128 Mbytes of memory and two 200 MHz Pentium Pro processors, each of which runs at 200 Mflops peak. We specially coded the kernel operations of the finite element computation to use the second processor for computation wherever possible. In practice the speedup thus obtained is limited by memory bandwidth, since the two processors share the same memory bus. We also reorganized some data structures to improve cache locality. These efforts improved the performance of the finite element computation from 40 Mflops per node to over 120 Mflops per node. For the contact computation, our algorithm already ensures load balance of the basic sort and search operations. We further optimized by altering the basic algorithm to avoid a global search on most of the timesteps. To accomplish this we occasionally perform a full search which stores all pairs of nearby surfaces. On subsequent timesteps we need only scan this list instead of searching the processor's entire domain. When the geometry has evolved enough that the lists could miss possible contacts, a new global search is triggered. This method requires extra memory for storing the lists, but it halved the overall contact computation time.
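
This reuse of stored contact pairs is analogous to a neighbor list with a safety margin: the candidate list is built with some extra capture distance, and a new global search is triggered only once the accumulated motion could defeat that margin. The sketch below captures the triggering logic only; the skin distance, class layout, and global_search callback are assumptions for illustration, not PRONTO's internals.

    import numpy as np

    class ContactPairCache:
        """Reuse a stored list of nearby surface pairs between global searches,
        rebuilding only after surfaces have moved far enough that the cached
        list might miss a new contact."""

        def __init__(self, skin):
            self.skin = skin      # extra capture distance padded onto the search
            self.ref_xyz = None   # surface positions at the last global search
            self.pairs = None     # cached candidate contact pairs

        def candidates(self, xyz, global_search):
            moved = 0.0 if self.ref_xyz is None else \
                np.max(np.linalg.norm(xyz - self.ref_xyz, axis=1))
            # If any surface node has moved more than half the skin since the
            # last full search, a new contact could slip past the cached list.
            if self.pairs is None or 2.0 * moved > self.skin:
                self.pairs = global_search(xyz, self.skin)  # expensive, infrequent
                self.ref_xyz = xyz.copy()
            return self.pairs                               # cheap, most timesteps

With a full search needed only occasionally (the container-crush runs in the next section triggered one roughly every five timesteps), the amortized cost of contact detection drops substantially.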

Performance

Depending on the physical problem being modeled, parallel PRONTO can run as a pure finite element computation without contacts, as finite elements with contacts, as pure SPH particles (no finite elements), or as coupled finite elements and SPH particles with contacts. Here we focus on the performance of the first two cases. In all of the performance numbers we present, we timed the outermost timestepping loop of PRONTO to determine CPU time per timestep. Problem setup time, which is constant regardless of the number of timesteps simulated, was not included since it is insignificant in production-scale runs. We counted floating-point operations using hardware counters on the Pentium Pro chips. This hardware counts floating point divides, adds, and multiplies as one flop each.

To test the performance of a pure finite element run of parallel PRONTO, we modeled a steel bar with hexahedral elements vibrating due to an oscillatory stress while being pinned at the ends. This simple problem was selected since it is easy to scale to different sizes. Strains induced between adjacent elements and the material's equation of state are modeled in the FE computation, but the bar does not bend enough to create contacts. We observed nearly 100% parallel efficiency in running this problem when we scaled the problem size (number of mesh elements) linearly with the number of processors. As mentioned above, the FE computational kernels run at about 120 Mflops/node. Interprocessor communication is only a few percent of the total run time. Other lower flop-rate overhead within the timestep (boundary conditions and time integration) takes about half the CPU time, regardless of the number of processors. Scaling the problem to the full Teraflop machine, we ran a 14.04 million element version of the beam problem on 3600 nodes at 224.9 Gflops (62.5 Mflops/node), requiring 0.166 CPU seconds per timestep.

Our second benchmark is more interesting as it is prototypical of the problems for which PRONTO was designed. We simulated the crush of an idealized steel shipping container by an inclined wall, as shown in Fig. 1. As with the first benchmark, this computation is easily scaled due to its regular geometry. However, this calculation is considerably more complex. The crumpling of the folded surfaces is a stringent test of the contact algorithm's accuracy and performance. A symmetry plane was used so that only half the container was actually simulated. An elastic-plastic material model was used for the steel in both the can and wall. Within the contact algorithm, global searches were conducted about every five timesteps.

  
Figure 1: Crushing of idealized shipping container.

Parallel timings are shown in Fig. 2 for a set of small scaled simulations with 1875 elements/node. Every time the number of processors P is doubled, the mesh is refined in one of the three dimensions so that the number of mesh elements N also doubles. Thus the leftmost data points are for a 3750 element simulation running on 2 processors. The rightmost data points are for a 6.57 million element simulation on 3504 processors.

  
Figure 2: Scaled speedup for small container-crush problem.

The topmost curve is the total CPU time per timestep averaged over a 100 microsecond (problem time) run. On the small problems this is a few hundred timesteps; on the large problems it is several thousand, since the timestep size must shrink as the mesh is refined. The lowest curve is the portion of time spent in the FE computation. Contact detection is the time between the lowest and middle curves. Overhead is the time between the top two curves. We again see excellent scalability to very large N and P. Perfect scalability would be a horizontal line on this plot. The FE computation scales essentially perfectly. The contact detection time varies from one problem size to the next due to variations in surface-to-volume ratios of mesh elements as refinement is done in different dimensions, but is also roughly horizontal. The overhead time is also a constant portion of the total run time (i.e., scalable) as P increases until the P=2048 and P=3504 data points. The reason for the non-scalability here is that the overhead timing includes the cost to push back contacts that are detected. This normally small computation becomes somewhat unbalanced in this problem on very large numbers of processors. The overall flop performance of parallel PRONTO on this problem is 76.05 Gflops on 3504 nodes of the Teraflop machine. Essentially all the flops are computed within the FE computation (lowest curve), which again runs at about 120 Mflops/node. The majority of the remaining CPU time is spent in the integer-based contact searches and sorts (no flops).

A set of larger simulations of the container crush was also performed, where each run used a mesh with about 3800 elements/node. These timings are shown in Fig. 3. As before, the upper curve is total CPU time per timestep. PRONTO again evidences excellent scalability, since all of the timing curves are roughly horizontal. The largest problem (rightmost data points) is a simulation of 13.8 million mesh elements on 3600 nodes of the Teraflop machine. It runs at a sustained rate of 120.4 Gflops, or 33.4 Mflops/node.

  
Figure 3: Scaled speedup for large container-crush problem.

Applications

Parallel PRONTO has been used to perform a range of calculations which were previously impractical or impossible. Here we briefly sketch three representative applications.

Application I: Airplane Crash Fuel Dispersal

In an airplane crash, fires fed by ruptured fuel tanks are a great threat to survivors and to hazardous cargo. The danger posed by such a fire depends critically on the dispersal pattern of the fuel. Parallel PRONTO is ideally suited for simulating these kinds of incidents since it can combine structural analysis for the plane with smoothed particle hydrodynamics for the fuel. Fig. 4 shows a simulation of an airplane wing striking a vertical pole. In the image on the left, the purple dots are SPH particles representing the resulting fuel cloud. The image on the right shows the damage to the wing itself. Note that the collision tears the wing. This particular example illustrates how PRONTO allows the surface to be adaptively redefined as portions of the model experience failure. If the strain in a given element becomes too large, failure is simulated by deleting the element. Allowing elements to be adaptively deleted requires the parallel contact algorithm to be capable of tracking and updating the changing contact surface as the problem progresses. This calculation was run on 128 nodes of the Teraflop computer using about 110,000 hexahedral and shell elements to model the structures and about 130,000 SPH elements to model the fuel. More detailed versions of this problem are being developed which will include the entire airplane and a soil model for impact; the current limitation lies in the tools used to build the computational mesh. These calculations are being performed by John Pott at Sandia.

  
Figure 4: Simulation of wing hitting vertical pole.

Application II: Shipping Container Integrity

A problem of great interest to the DOE is the integrity of shipping containers for transporting weapons and hazardous waste. Specifically, will the containers function properly in the event of a vehicular collision? An image of such a simulation of interest is depicted in Fig. 5, where the container is about to be crushed between two steel walls. This simulation involves more than 1.3 million elements, and includes both hexahedral and shell elements. The large number of elements is necessary to resolve critical small-scale structural details of the container. Studies of this model with parallel PRONTO are ongoing. This work is being performed by Jeff Gruda at Sandia.

  
Figure 5: Simulation of shipping container crushed between steel walls.

Application III: Constitutive Models of Foams

Foams of various types are widely used to distribute impact forces or to absorb energy in collisions. The macroscopic properties of foams depend upon their fine-scale structure in a complex manner that is not well understood. Better constitutive models of foam properties can be obtained through simulations of small-scale behavior. Unfortunately, very large simulations are necessary to be able to compare computations to experiments. Until the parallelization of PRONTO, such simulations were impossible. This example illustrates how parallel PRONTO has enabled qualitatively new and different engineering studies.

Fig. 6 depicts a simulation of an open-cell foam, with cells about 1 mm in diameter. A linear elastic material model was used, but the buckling and folding of the struts generates complex nonlinear behavior. The foam is being crushed from above by a fast moving plate.

  
Figure 6: Simulation of partially crushed, open-cell foam.

As the picture reveals, there is some crush near the impacting plate, but much more at the opposing boundary. This is due to the reflection of stress waves off the bottom plate. This behavior is consistent with experimental observations.

Each of the foam struts was modeled with multiple hexahedral elements, totaling more than 900,000. While beam finite elements could have been used, the complex deformation patterns associated with large crush would be difficult for beam elements to capture. In particular, the beam-on-beam contact would be very hard to detect. By using hexahedral elements, we are able to model very complex contact conditions. The drawback to using hexahedral elements, aside from the number of elements required, is that a very small timestep is required to properly integrate the motion. The problem was run on 512 nodes of the Teraflop computer and required 8.8 hours of CPU time. Over 650,000 timesteps were used to integrate the motion in this problem. The complexity of the model and the physics can be appreciated in the close-up view shown in Fig. 7. The many finite elements comprising the struts are clearly visible, as are the complicated folding and contact patterns. The red regions are those with the highest stresses. This study is being conducted by Mike Neilsen and Stephen Attaway at Sandia.

  
Figure 7: Close-up view of partially crushed, open-cell foam.

Conclusions

We have successfully parallelized a large-scale production solid dynamics code with a novel algorithmic approach that utilizes multiple decompositions for different key segments of the computations. On our 3600-node Teraflop computer, parallel PRONTO runs complex finite element (FE) simulations with global contact searches at rates of up to 120 Gflops. The finite element kernel can run contact-free FE simulations at a rate of 225 Gflops. While these flop rates may not seem impressive when compared to other kinds of simulations or the peak rate of the Pentium Pro chips, some context may be useful. First, simulating a model of more than ten million elements in a few tenths of a second per timestep is unprecedented for solid dynamics simulations, especially when full global contact searches are required. The key reason is our new algorithm for efficiently parallelizing the contact detection stage. To our knowledge, scalability of this computation had never before been demonstrated on more than 64 processors. This has enabled parallel PRONTO to become the only solid dynamics code we are aware of that can run effectively on thousands of processors. More importantly, our parallel performance compares very favorably to the original serial PRONTO code, which is optimized for vector supercomputers. On the container crush problem, a Teraflop node (two Pentium Pro processors) is as fast as a single processor of the Cray Jedi. This means that on 3600 nodes of the Teraflop machine we can now run simulations with tens of millions of elements over 3000 times faster than we could on the Jedi! This is enabling transient dynamics simulations of unprecedented scale and fidelity. Not only can previous applications be run with vastly improved resolution and speed, but qualitatively new and different analyses have been made possible.

Acknowledgments

We are indebted to Gary Hennigan at Sandia for creating the NEMESIS toolkit which has been of great utility to us in pre-processing our multi-million element meshes for the Teraflop machine. Ben Cole of Intel has provided timely hardware and software support to our project. Tim Preston of Sandia helped us maintain a quality production-level code. Running our suites of test problems would not have been possible without the helpful scripts provided by Christi Forsythe and Mike Brayer of Sandia. This work was performed at Sandia National Laboratories. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy (DOE) under contract DE-AC04-94AL85000. We gratefully acknowledge funding from the DOE's ASCI program and access to the Sandia/Intel Teraflop computer. We also received funding from the Joint DoD/DOE Munitions Technology Development Program for development of the contact algorithm.

References

1. Stephen W. Attaway, Bruce A. Hendrickson, Steven J. Plimpton, David R. Gardner, Courtenay T. Vaughan, Martin W. Heinstein, James S. Peery. Parallel Contact Detection Algorithm for Transient Solid Dynamics Simulations Using PRONTO3D. Proc. International Mech. Eng. Congress & Exposition '96, November 1996.

2. Ted Belytschko. Northwestern University, personal communication.

3. L. M. Taylor and D. P. Flanagan. PRONTO3D: A Three-Dimensional Transient Solid Dynamics Program. Tech. Rep. SAND87-1912, Sandia National Labs, Albuquerque, NM, March 1989.

4. M. W. Heinstein, S. W. Attaway, J. W. Swegle and F. J. Mello. A General-Purpose Contact Detection Algorithm for Nonlinear Structural Analysis Codes. Tech. Rep. SAND92-2141, Sandia National Labs, Albuquerque, NM, May 1993.

5. Bruce Hendrickson and Robert Leland. The Chaco User's Guide: Version 2.0. Tech. Rep. SAND94-2692, Sandia National Labs, Albuquerque, NM, June 1995.

6. Bruce Hendrickson, Steve Plimpton, Steve Attaway, Courtenay Vaughan, David Gardner. A New Parallel Algorithm for Contact Detection in Finite Element Methods. Proc. High Performance Computing '96, April 1996.

7. Mike Heroux. Cray Research, personal communication.

8. C. G. Hoover, A. J. DeGroot, J. D. Maltby, R. D. Procassini. ParaDyn: DYNA3D for Massively Parallel Computers. Presentation at the Tri-Laboratory Engineering Conference on Computational Modeling, October 1995.

9. J. G. Malone and N. L. Johnson. A Parallel Finite Element Contact/Impact Algorithm for Nonlinear Explicit Transient Analysis: Part II - Parallel Implementation. Intl. J. Num. Methods Eng. 37 (1994), pp. 591-603.

10. Steve Plimpton, Steve Attaway, Bruce Hendrickson, Jeff Swegle, Courtenay Vaughan, David Gardner. Transient Dynamics Simulations: Parallel Algorithms for Contact Detection and Smoothed Particle Hydrodynamics. Proc. Supercomputing'96, November 1996.