Sandia National Laboratories

P.O. Box 5800, MS 0437

Albuquerque, NM 87185-0437

`swattaw@sandia.gov`

Intel Corporation / DP2-420

2800 Center Drive N.

DuPont, WA 98327

`Edward_J_Barray@ccm.jf.intel.com`

Sandia National Laboratories

P.O. Box 5800, MS 0443

Albuquerque, NM 87185-0443

`khbrown@sandia.gov`

Sandia National Laboratories

P.O. Box 5800, MS 1111

Albuquerque, NM 87185-1111

`drgardn@sandia.gov`

`http://www.cs.sandia.gov/~drgardn`

Sandia National Laboratories

P.O. Box 5800, MS 1111

Albuquerque, NM 87185-1111

`bahendr@sandia.gov`

`http://www.cs.sandia.gov/~bahendr`

Sandia National Laboratories

P.O. Box 5800, MS 1111

Albuquerque, NM 87185-1111

`sjplimp@sandia.gov`

`http://www.cs.sandia.gov/~sjplimp`

Sandia National Laboratories

P.O. Box 5800, MS 1111

Albuquerque, NM 87185-1111

`ctvaugh@sandia.gov`

**Abstract:**
Transient solid dynamics simulations are among the most widely used
engineering calculations. Industrial applications include vehicle
crashworthiness studies, metal forging, and powder compaction prior to
sintering. These calculations are also critical to defense
applications including safety studies and weapons simulations. The
practical importance of these calculations and their computational
intensiveness make them natural candidates for parallelization. This
has proved to be difficult, and existing implementations fail to scale
to more than a few dozen processors. In this paper we describe our
parallelization of PRONTO, Sandia's transient solid dynamics code, via
a novel algorithmic approach that utilizes multiple decompositions for
different key segments of the computations, including the material
contact calculation. This latter calculation is notoriously difficult
to perform well in parallel, because it involves dynamically changing
geometry, global searches for elements in contact, and unstructured
communications among the compute nodes. Our approach scales to at
least 3600 compute nodes of the Sandia/Intel Teraflop computer (the
largest set of nodes to which we have had access to date) on problems
involving millions of finite elements. On this machine we can
simulate models using more than ten million elements in a few tenths
of a second per timestep, and solve problems more than 3000 times
faster than a single processor Cray Jedi.
**Keywords:** scalable, parallel, finite elements, transient solid dynamics

In Section 2 we describe the functionality and structure of PRONTO. In Section 3 we explain why transient dynamics simulations have been difficult to parallelize. Our parallelization strategy is sketched in Section 4 and some further performance enhancements are described in Section 5. The performance of the code on some scalable problems is discussed in Section 6. A discussion of applications enabled by parallel PRONTO follows in Section 7. Conclusions are drawn in Section 8.

1. Perform finite element analysis to compute forces on
elements.

2. Compute forces between smoothed
particles.

3. Predict new locations of particles and grid
elements.

4. Search for contacts between mesh elements, or
between elements and particles.

5. Correct the locations by
pushing back objects in contact.

Stages (1), (2) and (4) dominate the sequential run time. The contact search in stage (4) typically consumes 30-60% of the time, so a great deal of effort has been expended over the years to make this computation fast [4]. The result of this effort was the replacement in PRONTO of floating point operations with a faster approach involving sorting and searching in integer lists.
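The flavor of such an integer-based search can be conveyed with a simple bucket scheme: hash each element's bounding box to integer grid cells and test only elements sharing a cell, rather than comparing all pairs with floating point operations. This is a hedged sketch of the general idea only, not the algorithm of [4]; the function name, the uniform cell size, and the axis-aligned boxes are illustrative assumptions.

```python
# Sketch of an integer bucket-based proximity search. Instead of an
# O(n^2) sweep of floating point box-overlap tests, each bounding box
# is mapped to the integer grid cells it touches; only elements that
# land in a common cell become candidate contact pairs.
# Illustrative only -- see [4] for the actual PRONTO algorithm.
from collections import defaultdict
from itertools import combinations

def candidate_contacts(boxes, cell):
    """boxes: list of ((xmin, ymin, zmin), (xmax, ymax, zmax)).
    cell: edge length of the uniform background grid.
    Returns the set of index pairs whose boxes may overlap."""
    grid = defaultdict(list)
    for i, (lo, hi) in enumerate(boxes):
        # Integer cell range covered by this box along each axis
        # (floor division handles negative coordinates correctly).
        ranges = [range(int(lo[d] // cell), int(hi[d] // cell) + 1)
                  for d in range(3)]
        for x in ranges[0]:
            for y in ranges[1]:
                for z in ranges[2]:
                    grid[(x, y, z)].append(i)
    pairs = set()
    for members in grid.values():
        for a, b in combinations(members, 2):
            pairs.add((min(a, b), max(a, b)))
    return pairs
```

The returned pairs are only candidates; a contact algorithm would still run an exact geometric check on each, but that check now touches a small neighborhood rather than every element pair.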

A timestep of parallel PRONTO has the following structure.

1. Perform finite element analysis to compute forces on elements.

2. Update the RCB decomposition of smoothed particles.

3. Compute forces between smoothed particles.

4. Predict new locations of particles and grid points.

5. Ship data to previous decomposition of the contact problem.

6. Update the RCB decomposition of the contact problem.

7. Search for contacts between mesh elements, or between elements and
particles.

8. Communicate contact results back to finite element and SPH decompositions.

9. Correct the locations by pushing back objects in contact.
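Steps 2 and 6 above rebalance work with recursive coordinate bisection (RCB). The geometric idea can be sketched serially, assuming points are given as (x, y, z) tuples; this illustrates only the decomposition geometry, not PRONTO's parallel, incrementally updated implementation.

```python
# Minimal serial sketch of recursive coordinate bisection (RCB).
# The point set is recursively cut at the median of its widest axis,
# yielding spatially compact, equal-sized groups -- the property that
# makes RCB attractive for contact search and SPH neighbor finding.
# Illustrative only: PRONTO's version runs in parallel and updates
# the cuts cheaply as objects move between timesteps.

def rcb(points, nparts):
    """Split `points` (a list of (x, y, z) tuples) into `nparts`
    spatially compact, balanced groups."""
    if nparts == 1:
        return [points]
    # Cut along the axis where the current point set is widest.
    axis = max(range(3), key=lambda d: max(p[d] for p in points)
                                     - min(p[d] for p in points))
    pts = sorted(points, key=lambda p: p[axis])
    # Split the point count in proportion to the subtree sizes so
    # that non-power-of-two nparts still balance.
    left_parts = nparts // 2
    cut = len(pts) * left_parts // nparts
    return rcb(pts[:cut], left_parts) + rcb(pts[cut:], nparts - left_parts)
```

Because each cut is a coordinate plane, the resulting subdomains are boxes, so a processor's contact or SPH neighbors are found among a few geometrically adjacent subdomains rather than across the whole machine.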

Our parallelization of PRONTO required about 15,000 lines of new code. In addition, much of the original PRONTO code was restructured for the parallel version to improve data locality on cache-based architectures.

To test the performance of a pure finite element run of parallel PRONTO, we modeled a steel bar with hexahedral elements vibrating due to an oscillatory stress while being pinned at the ends. This simple problem was selected since it is easy to scale to different sizes. Strains induced between adjacent elements and the material's equation of state are modeled in the FE computation, but the bar does not bend enough to create contacts. We observed nearly 100% parallel efficiency in running this problem if we scaled the problem size (number of mesh elements) linearly with the number of processors. As mentioned above, the FE computational kernels run at about 120 Mflops/node. Interprocessor communication is only a few percent of the total run time. Other lower flop-rate overhead within the timestep (boundary conditions and time integration) takes about one half the CPU time regardless of the number of processors. Scaling the problem to the full Teraflop machine, we ran a 14.04 million element version of the beam problem on 3600 nodes at 224.9 Gflops (62.5 Mflops/node), requiring 0.166 CPU secs/timestep.
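The quoted full-machine numbers are internally consistent; a quick back-of-the-envelope check (ours, not from the original run logs):

```python
# Consistency check of the quoted full-machine beam run:
# 224.9 Gflops aggregate on 3600 nodes should reproduce the quoted
# 62.5 Mflops/node, and the aggregate rate times the time per step
# gives a rough per-element cost per timestep.
total_gflops = 224.9
nodes = 3600
elements = 14.04e6
secs_per_step = 0.166

mflops_per_node = total_gflops * 1e3 / nodes
flops_per_element_step = total_gflops * 1e9 * secs_per_step / elements

assert abs(mflops_per_node - 62.5) < 0.1      # matches the quoted rate
assert 2000 < flops_per_element_step < 3000   # a few thousand flops/element/step
```

The gap between the 120 Mflops/node kernel rate and the 62.5 Mflops/node aggregate reflects the roughly half of the CPU time spent in low-flop-rate overhead noted above.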

Our second benchmark is more interesting as it is prototypical of the problems for which PRONTO was designed. We simulated the crush of an idealized steel shipping container by an inclined wall, as shown in Fig. 1. As with the first benchmark, this computation is easily scaled due to its regular geometry. However, this calculation is considerably more complex. The crumpling of the folded surfaces is a stringent test of the contact algorithm's accuracy and performance. A symmetry plane was used so that only half the container was actually simulated. An elastic-plastic material model was used for the steel in both the can and wall. Within the contact algorithm, global searches were conducted about every five timesteps.

**Figure 1: Crushing of idealized shipping container.**

Parallel timings are shown in Fig. 2 for a set of small scaled simulations with 1875 elements/node. Every time the number of processors P is doubled, the mesh is refined in one of the three dimensions so that the number of mesh elements N also doubles. Thus the leftmost data points are for a 3750 element simulation running on 2 processors. The rightmost data points are for a 6.57 million element simulation on 3504 processors.
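The data-point sizes follow directly from the fixed ratio of 1875 elements/node, assuming exact scaling:

```python
# Scaled-speedup sizing: with a fixed 1875 elements per node, the
# mesh size at each data point is simply N = 1875 * P.
elements_per_node = 1875
assert elements_per_node * 2 == 3750          # leftmost point, 2 processors
assert elements_per_node * 3504 == 6_570_000  # rightmost point, 6.57 million
```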

**Figure 2: Scaled speedup for small container-crush problem.**

The topmost curve is the total CPU time per timestep averaged over a 100 microsecond (problem time) run. On the small problems this is a few hundred timesteps; on the large problems it is several thousand, since the timestep size must shrink as the mesh is refined. The lowest curve is the portion of time spent in the FE computation. Contact detection is the time between the lowest and middle curves. Overhead is the time between the top two curves. We again see excellent scalability to very large N and P. Perfect scalability would be a horizontal line on this plot. The FE computation scales essentially perfectly. The contact detection time varies from one problem size to the next due to variations in surface-to-volume ratios of mesh elements as refinement is done in different dimensions, but is also roughly horizontal. The overhead time is also a constant portion of the total run time (i.e. scalable) as P increases until the P=2048 and P=3504 data points. The reason for the non-scalability here is that the overhead timing includes the cost to push back contacts that are detected. This normally small computation becomes somewhat unbalanced in this problem on very large numbers of processors. The overall flop performance of parallel PRONTO on this problem is 76.05 Gflops on 3504 nodes of the Teraflop machine. Essentially all the flops are computed within the FE computation (lowest curve), which again runs at about 120 Mflops/node. The majority of the remaining CPU time is spent in the integer-based contact searches and sorts (no flops).

A set of larger simulations of the container crush was also performed where each run used a mesh with about 3800 elements/node. These timings are shown in Fig. 3. As before, the upper curve is total CPU time per timestep. PRONTO again exhibits excellent scalability, since all of the timing curves are roughly horizontal. The largest problem (rightmost data points) is a simulation of 13.8 million mesh elements on 3600 nodes of the Teraflop machine. It runs at a sustained rate of 120.4 Gflops or 33.4 Mflops/node.

**Figure 3: Scaled speedup for large container-crush problem.**

**Figure 4: Simulation of wing hitting vertical pole.**

**Figure 5: Simulation of shipping container crushed between steel walls.**

Fig. 6 depicts a simulation of an open-cell foam, with cells about 1mm in diameter. A linear elastic material model was used, but the complex buckling and folding generates complex nonlinear behavior. The foam is being crushed from above by a fast moving plate.

**Figure 6: Simulation of partially crushed, open-cell foam.**

As the picture reveals, there is some crush near the impacting plate, but much more on the opposing boundary. This is due to the reflection of stress waves off of the bottom plate. This behavior is consistent with experimental observations.

Each of the foam struts was modeled with multiple hexahedral elements,
totaling more than 900,000. While one could use *beam* finite
elements, the complex deformation patterns associated with large crush
could cause the beam elements difficulty. In particular, the
beam-on-beam contact would be very hard to detect. By using hexahedral
elements, we are able to model very complex contact conditions. The
drawback to using hexahedral elements, aside from the number of
elements required, is that a very small timestep is required to
properly integrate the motion. The problem was run on 512 nodes of
the Teraflop computer and required 8.8 hours of CPU time. Over 650,000
timesteps were used to integrate the motion in this problem. The
complexity of the model and the physics can be appreciated in the
close-up view shown in Fig. 7. The large number of finite elements
comprising the struts is clearly visible, as are the complicated
folding and contact patterns. The red regions are those
with the highest stresses. This study is being conducted by Mike
Neilsen and Stephen Attaway at Sandia.

**Figure 7: Close-up view of partially crushed, open-cell foam.**

2. Ted Belytschko. Northwestern University, personal communication.

3. L. M. Taylor and D. P. Flanagan. **PRONTO3D: A Three-Dimensional Transient Solid Dynamics Program**. Tech. Rep. SAND87-1912, Sandia National Labs, Albuquerque, NM, March 1989.

4. M. W. Heinstein, S. W. Attaway, J. W. Swegle and F. J. Mello. **A General-Purpose Contact Detection Algorithm for Nonlinear Structural Analysis Codes**. Tech. Rep. SAND92-2141, Sandia National Labs, Albuquerque, NM, May 1993.

5. Bruce Hendrickson and Robert Leland. **The Chaco User's Guide: Version 2.0**. Tech. Rep. SAND94-2692, Sandia National Labs, Albuquerque, NM, June 1995.

6. Bruce Hendrickson, Steve Plimpton, Steve Attaway, Courtenay Vaughan and David Gardner. **A New Parallel Algorithm for Contact Detection in Finite Element Methods**. Proc. High Performance Computing '96, April 1996.

7. Mike Heroux. Cray Research, personal communication.

8. C. G. Hoover, A. J. DeGroot, J. D. Maltby and R. D. Procassini. **ParaDyn: DYNA3D for Massively Parallel Computers**. Presentation at the Tri-Laboratory Engineering Conference on Computational Modeling, October 1995.

9. J. G. Malone and N. L. Johnson. **A Parallel Finite Element Contact/Impact Algorithm for Nonlinear Explicit Transient Analysis: Part II - Parallel Implementation**. Intl. J. Num. Methods Eng. 37 (1994), pp. 591-603.

10. Steve Plimpton, Steve Attaway, Bruce Hendrickson, Jeff Swegle, Courtenay Vaughan and David Gardner. **Transient Dynamics Simulations: Parallel Algorithms for Contact Detection and Smoothed Particle Hydrodynamics**. Proc. Supercomputing '96, November 1996.