6.1. Running Fuego

6.1.1. Loading the Sierra Module

Once you have finished setting up an input file, you are ready to run Fuego. From a CEE or HPC UNIX environment at Sandia, you can load the sierra module to access the latest release of Fuego.

$ module load sierra

This will load the current release of Sierra. To load other versions of Sierra, you can use one of the following modules:

  • module load sierra/x.x - Load version x.x (e.g. module load sierra/5.10)

  • module load sierra/sprint - Load the latest sprint release (every 3-week release)

  • module load sierra/daily - Load the daily build of Sierra.

Warning

Using the sierra/daily module exposes you to potential bugs and instabilities since it is actively developed. If the nightly Sierra build process fails, the sierra/daily executable may not exist, or may be much older than expected.

To see a list of the available Sierra versions (and other useful modules available like apps/anaconda3 or apps/matlab) you can use

$ module avail

6.1.2. Running Fuego Locally

To run a job on a non-queued system (e.g., a CEE blade or a CEE compute machine) you can call launch or mpirun. For example, to run a job on 4 processors, you would use

$ module load sierra
$ launch -n 4 fuego -i demo.i

The launch command is usually equivalent to mpirun for local execution, but it also sets the MPI flags required on some HPC systems.

$ module load sierra
$ mpirun -np 4 fuego -i demo.i

6.1.3. Using the Sierra Script

The sierra script included in the sierra modules provides additional functionality for launching Sierra jobs. Its use is similar to the launch and mpirun commands.

$ sierra --np 4 fuego -i demo.i

By default, the sierra script will perform extra steps that are not necessary for running Fuego. These include:

  • Reading the input file to find the mesh file and running decomp on it beforehand.

  • Reading the input file to find the output files and running epu on them after the simulation is done.

Fuego automatically decomposes your mesh, which is usually faster than running decomp manually. Likewise, most visualization tools can read decomposed output files, so running epu to combine them into a single file is usually unnecessary. To use the sierra script without invoking these steps, add the --run option.

$ sierra --run --np 4 fuego -i demo.i
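If you do need a pre-decomposed mesh or a single recombined output file (for example, for archiving), the underlying SEACAS tools can be invoked directly. The commands below are a sketch; the flags and file names are illustrative, so check decomp --help and epu --help in your installation.

```shell
# Decompose mesh.g into 4 pieces ahead of time (sketch; verify flags locally)
decomp -p 4 mesh.g

# Recombine per-rank results (demo.e.4.0 ... demo.e.4.3) into a single demo.e
epu -auto demo.e.4.0
```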

6.1.4. Running Fuego on an HPC

The HPC systems at Sandia use slurm to schedule jobs. To run a job on an HPC, you submit it to slurm along with some additional information (listed below); slurm places the job in the queue and runs it when resources become available.

  • The number of compute nodes to run on.

  • The number of cores to use per node.

  • The wall-clock duration of the job (slurm will kill the job if it’s not done by this time limit).

  • A WCID for the job based on the project funding it. The WCID is used for tracking purposes and also determines the job priority. Use the WC Tool web site to check your WCIDs or get a new one.

  • Which queue to submit to (most HPCs have “batch”, “short”, and “long” - “batch” is the standard).

Refer to the HPC homepage for details about queue limits, core counts per node, and other useful HPC information for the machine you intend to run on.

If you are using SAW to submit your jobs, it can handle collecting the required information and submitting the job to the queue. If you do not use SAW, you can submit your jobs manually using either a batch script or the sierra script.

For simple job submissions you can use the sierra script directly after logging in to the HPC you want to run on. For example, to run Fuego on 360 processors for up to 24 hours you would log in to the HPC and run the following command.

$ sierra --run --np 360 fuego -i demo.i --account WCID --queue-name batch --time-limit 24:00:00

For more complicated submissions, you may need to prepare a batch script to perform any custom pre-processing steps you need. There are example submission scripts for different platforms in /projects/samples at Sandia. An example script to run Fuego is shown below.

#!/bin/bash
#SBATCH --nodes=10                    # Number of nodes
#SBATCH --ntasks-per-node=36          # Number of cores per node
#SBATCH --time=24:00:00               # Wall clock time (HH:MM:SS)
#SBATCH --account=PUT_YOUR_WCID_HERE  # WC ID
#SBATCH --job-name=test               # Name of job
#SBATCH --partition=batch             # partition/queue name: short or batch

nodes=$SLURM_JOB_NUM_NODES
cores=36

# do any pre-processing steps you need

mpiexec --bind-to core --npernode $cores --n $(($cores*$nodes)) fuego -i demo.i

If you saved the above script as run_fuego then you would submit it to the queue using

$ sbatch run_fuego

You can check on the status of your queued jobs using squeue -u myusername. To see an estimate of when they will start, use squeue -u myusername --start.

6.1.5. Running Fuego on Hops

The Hops (SRN) cluster comprises 64 nodes, each with 4x 80 GB NVIDIA H100 GPUs. Like traditional HPC clusters such as Attaway, Eclipse, and Amber, Hops uses the slurm scheduler; however, each Hops node has significantly more memory bandwidth and compute available. The following are general recommendations for running on Hops, along with example batch submission scripts for both regular MPI jobs and MPMD jobs.

Recommendations:

  • Nodes are powerful but limited in number, so try to fully utilize them. As a starting point, target around 15M-20M DOFs per linear system per compute node. For Fuego, DOFs are always at the mesh nodes, so this is equivalent to targeting 15M-20M mesh nodes per compute node.

Note

Excluding the cost of the first step, node for node you should see order-of-magnitude speedups of 10-20x relative to Eclipse (CTS-1) and 2.5-5x relative to Amber (CTS-2). If you do not see such speedups, read below.

Warning

Fuego is designed to run with 1 MPI rank per GPU, or, in MPMD mode with PMR, with at most 2 MPI ranks per GPU. In regular MPI submissions, do not run more than 1 task per GPU, as performance will degrade due to context switching. Use nvidia-smi to verify that tasks are split properly across GPUs.

  • Configure models on host platforms (e.g., CEE or Amber), as some of the initialization occurs on the host when using Hops. Since Hops runs with only four MPI ranks per node, this initialization cost can be significant compared to host platforms that run many more ranks per node (e.g., Amber runs 112 ranks per node).

  • Fuego is only partially converted to the GPU, meaning some algorithms may still run on the host. To check whether any algorithms are running on the host, pass --afgoout debug to Fuego to write a list of host algorithms to the log file, then search the log with grep "Creating host" fuego.log. Depending on the algorithms listed, they may severely degrade overall performance.

Note

As an example, consider a volume host algorithm running on 1 node of Hops vs. 1 node of Amber. Excluding the added cost of syncing fields (and potentially the LHS/RHS) to host and back to device, a rough estimate of the slowdown is 112/4, since we run 1 MPI rank per GPU on Hops but 112 ranks per node on Amber. If, on the other hand, it is a surface algorithm running on a small portion of your domain (and it does not modify the LHS), you may be able to get away with running the host algorithm on Hops. One might ask whether the legacy host algorithms could be threaded to fully utilize the remaining CPU cores; the answer is that threading a host algorithm requires the same Kokkos conversion that would be required to run it on the GPU, in which case it is more beneficial to simply run it on the GPU after conversion.

Warning

All particle physics runs on the host at the moment. We do not have a timeline on when this feature will be converted to run on GPUs.

  • Check your preconditioners. If you specify the SGS preconditioner, it will be used as requested, regardless of the performance penalty; SGS is not GPU friendly. For diagonally dominant scalar transport problems, we recommend starting with 1-3 sweeps of the Jacobi preconditioner. If you observe linear convergence issues, switch to the SGS2 preconditioner, which is also GPU friendly.

  • If you suspect host algorithms are hot spots in your GPU runs, email us and we will prioritize converting them. If no host algorithms are reported and you still do not see order-of-magnitude speedups relative to Eclipse, email us at sierra-help@sandia.gov so we can diagnose your issue.
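As a back-of-the-envelope sizing sketch based on the 15M-20M mesh nodes per compute node guidance above (the 60M-node mesh size is a made-up example, not from this document):

```shell
# Estimate the Hops node count for a mesh, using the recommended
# 15M-20M mesh nodes per compute node. The mesh size is hypothetical.
mesh_nodes=60000000
per_node_lo=15000000   # conservative packing
per_node_hi=20000000   # aggressive packing

# Ceiling division: nodes needed at each packing level
nodes_max=$(( (mesh_nodes + per_node_lo - 1) / per_node_lo ))
nodes_min=$(( (mesh_nodes + per_node_hi - 1) / per_node_hi ))

echo "Request between $nodes_min and $nodes_max Hops nodes"
# prints: Request between 3 and 4 Hops nodes
```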

To submit a regular MPI job that fully utilizes the four GPUs on each node, you can submit the following script using sbatch

#!/bin/bash
#SBATCH --account=fyXXXXX  # WC ID
#SBATCH --job-name=test
#SBATCH --partition=batch
#SBATCH --qos=normal

# at the moment only the daily build is installed; a point release is pending
module load sierra/daily

num_nodes=$SLURM_JOB_NUM_NODES
task_per_node=4
ntasks=$((task_per_node * num_nodes))

launch -n $ntasks fuego -i input.i

# alternatively
# srun -n $ntasks fuego -i input.i

The following sbatch command will submit to two nodes, running 4 tasks per node split across the available GPUs, for a total of 8 tasks (MPI ranks), and reserving 60 minutes.

$ sbatch -N 2 -t 60 submit_hops.sh

As mentioned earlier, for regular MPI jobs each task should bind to exactly one GPU; if multiple tasks are bound to a GPU, a performance penalty may be incurred due to context switching. To verify that the tasks are distributed correctly, ssh to a compute node where your job is running and use nvidia-smi

$ squeue -u [username]
$ ssh hops[node]
$ nvidia-smi

For a given node in this example, the nvidia-smi output should show 4 tasks running, each on its own GPU, along with the memory consumed per GPU.

Similarly, to submit an MPMD job you can also use the launch wrapper

#!/bin/bash
#SBATCH --account=fyXXXXX  # WC ID
#SBATCH --job-name=test
#SBATCH --partition=batch
#SBATCH --qos=normal

module load sierra/daily

num_nodes=$SLURM_JOB_NUM_NODES
fuego_task_per_node=4
pmr_task_per_node=4

ntasks_fuego=$((fuego_task_per_node * num_nodes))
ntasks_pmr=$((pmr_task_per_node * num_nodes))

launch -n $ntasks_fuego fuego -i fuego_input.i : -n $ntasks_pmr pmr -i pmr_input.i

The following sbatch command will split the tasks in round-robin fashion across the available GPUs.

$ sbatch -N 2 --distribution=cyclic -t 60 submit_mpmd_hops.sh

Using nvidia-smi, you will see that each node has 4 Fuego tasks and 4 PMR tasks, with one task of each per GPU, for a total of 16 tasks across the 8 GPUs. Fuego and PMR alternate their work, so it is fine to overload each GPU in this case, though each GPU will consume more memory.

Note

The launch wrapper automatically outputs a temporary MPMD config file and passes that file to srun i.e.,

srun -n $total_tasks --multi-prog mpmd_config_file

where the MPMD config file specifies tasks per app i.e.,

0-7 fuego -i fuego_input.i
8-15 pmr -i pmr_input.i

With this MPMD config file, if --distribution=cyclic is not passed to srun, the 8 Fuego tasks will land on the first node and the 8 PMR tasks on the second node. This is not ideal, as each GPU must then frequently context switch between different Fuego tasks (node 1) or different PMR tasks (node 2).
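The block vs. cyclic layouts can be sketched with a few lines of shell arithmetic. This is purely illustrative: it mirrors slurm's task placement for 16 ranks on 2 nodes with 8 slots each, where ranks 0-7 are Fuego and ranks 8-15 are PMR.

```shell
# Illustrative mapping of 16 MPMD ranks onto 2 nodes of 8 slots each
# under slurm's block vs cyclic task distributions.
nodes=2
slots=8
for rank in $(seq 0 15); do
  block_node=$(( rank / slots ))   # block: ranks 0-7 on node 0, 8-15 on node 1
  cyclic_node=$(( rank % nodes ))  # cyclic: ranks round-robin across nodes
  echo "rank $rank -> block node $block_node, cyclic node $cyclic_node"
done
```

Under the cyclic layout, Fuego ranks 0-7 and PMR ranks 8-15 each split evenly across the two nodes, so every node ends up with 4 Fuego and 4 PMR ranks, matching the nvidia-smi picture described above.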

Note

PMR is fully converted to run on the GPU, aside from minor initialization costs that remain on the host.