4. Running and Troubleshooting

4.1. Launching an MPMD Job

The command used to launch an MPMD run differs from the one used for a typical MPI program. A typical MPI program is launched as a single parallel job with something like

$ mpirun -np 10 some_program

which launches 10 ranks of some_program on 10 CPU cores. With an MPMD run you launch two different codes in a single command, and the two codes can communicate with each other. An example MPMD launch command to run PMR and Fuego together would look like

$ mpirun -np 10 fuego -i fuego.i : -np 10 pmr -i pmr.i

For some applications, the launch command should include an MPI color, which is used for the MPMD communication. For example, to launch PMR with Aria, you would use

$ mpirun -np 10 aria -i aria.i --mpmd_master --mpi_color 9999 : -np 10 pmr -i pmr.i

The mpmd_master flag indicates that Aria controls the execution of the simulation (it is the “leader”) while the other application “follows” it.

Note that the order of the two apps does not matter in an MPMD call. The mpi_color argument gives the cores executing Aria an id that distinguishes them from the cores executing the coupled application. The color can be any integer, but it must be unique across the coupled apps.

Fuego is always the “leader” in MPMD execution mode and uses a hard-coded color, so these flags are not required when coupling to Fuego.

There is no requirement that the two codes use the same number of cores, so depending on the mesh and computational costs you may choose a different allocation for each code. For example, if your PMR solve is very expensive, you may allocate more cores to it than to the fluid code:

$ mpirun -np 10 fuego -i fuego.i : -np 100 pmr -i pmr.i

4.1.1. HPC Execution

Special care must be taken when submitting MPMD jobs on HPC systems or in any queued environment. By default, the two MPMD codes cannot share cores, so to launch a case where each code is given 100 cores on an HPC you would need to request an allocation of 200 cores. This is wasteful, however, since PMR would not be using its 100 cores while the fluid code runs, and the fluid code would not be using its 100 cores while PMR runs. To get around this, you must enable oversubscription so that the two codes can share resources. To request an allocation of 100 cores and use all of them for both codes, add additional MPI flags:

$ mpiexec --oversubscribe \
    --bind-to core:overload-allowed -np 100 fuego -i fuego.i : \
    --bind-to core:overload-allowed -np 100 pmr -i pmr.i

Keep in mind that the specific command to use can be platform-dependent. A more complete example submission script on an HPC may look like

#!/bin/bash

#SBATCH --nodes=10
#SBATCH --time=48:00:00
#SBATCH --account=PUT_YOUR_WCID_HERE
#SBATCH --job-name=pmr
#SBATCH --partition=batch

nodes=$SLURM_JOB_NUM_NODES
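# Number of cores per node on the target machine; each code below is launched
# with nodes * cores total ranks (10 * 36 = 360 for this allocation).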
cores=36

module load sierra
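# Open MPI setting that permits launching more ranks than allocated slots
# (oversubscription), so the two codes can share cores.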
export OMPI_MCA_rmaps_base_oversubscribe=1
mpiexec --oversubscribe                                        \
  --bind-to core:overload-allowed                              \
  --npernode $cores --n $(($cores*$nodes)) fuego -i fuego.i :  \
  --bind-to core:overload-allowed                              \
  --npernode $cores --n $(($cores*$nodes)) pmr -i pmr.i
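
If the script above is saved to a file, for example submit_pmr.sh (the file name here is just a placeholder), it can be submitted to the queue with the standard SLURM command

$ sbatch submit_pmr.sh

Adjust the node count, wall time, and the cores variable to match the target machine and the size of the coupled problem.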

Contact sierra-help@sandia.gov if you need more help or encounter issues running MPMD jobs.

4.2. Mesh Requirements

PMR automatically applies a single radiative boundary condition to all exposed faces in the mesh and solves the RTE on all elements. This means that you do not need to provide sidesets in the mesh (if present, they will be ignored). Internal sidesets can be defined for post-processing and will not affect the RTE solver domain or the transferred boundaries.

4.3. Troubleshooting

For help using PMR, visit the TF Teams channel, or submit a help request at the CompSim Help Portal or by emailing sierra-help@sandia.gov. Be sure to include relevant information like log files, input files, the version you are running, and what commands you used.