6.5. Troubleshooting

An Aria job can fail for a number of reasons, including invalid syntax in the input file, an ill-posed or non-physical problem setup, insufficient mesh resolution, or computing hardware issues. When this happens, look for clues in the log file, the stdout stream (or slurm file), and any results output files (Exodus or Heartbeat). The following sections address some of the common failure modes you might encounter.

6.5.1. Invalid Input Syntax

If you use something in the input file that is not valid Aria syntax, you will get an error in the log file. The error is reported at the point in the input file where the invalid syntax occurs, along with suggested alternative syntax. For example, suppose we use the wrong syntax for a constant thermal conductivity (tk instead of value or k for the constant):

BEGIN ARIA MATERIAL Kryptonite
  Thermal Conductivity = constant tk=1.0

The first sign that something is wrong appears in the stdout stream (or the slurm file if launched on an HPC system), but it does not identify the specific error.

$ aria -i demo.i
exception on all processors:
Execution terminated due to errors

*** SIERRA ABORT on P0 ***
*** check the log file for more information ***

At the end of the log file we would see:

There was 1 error encountered during parse
There were no warnings encountered during parse
Execution terminated due to errors

SIERRA execution failed during parse with the following exception:
Execution terminated due to errors

By itself this does not identify the error either, but because it reports a parse error, we should search for the keyword “error” in the section of the log file that echoes the input file. There we find:

  BEGIN SIERRA myJob
    BEGIN ARIA MATERIAL Kryptonite
      Thermal Conductivity = constant tk=1.0
demo.i:7: Error: No matching command line found for 'Thermal Conductivity = constant tk=1.0'
demo.i:7: The following command lines with matching keywords are defined:
demo.i:7:   Thermal Conductivity [ {Of|Species|Subindex} <speciesname: string> ] = Constant {K|Value} = <k: real>
demo.i:7:
      heat conduction     = basic

The error message tells us which line was invalid and shows the syntax for the best match. In this case the model we used (constant) is valid, so only the syntax for that model is printed. If we had used an invalid model name (e.g. constnt), the list of possibilities would include all the valid models for thermal conductivity (the list below is truncated):

  BEGIN ARIA MATERIAL Kryptonite
    Thermal Conductivity = constnt k=1.0
demo.i:7: Error: No matching command line found for 'Thermal Conductivity = constnt k=1.0'
demo.i:7: The following command lines with matching keywords are defined:
demo.i:7:   Thermal Conductivity [ {Of|Species|Subindex} <speciesname: string> ] = T_Exponent K_Ref = <k_ref:
demo.i:7:      real>  T_Ref = <t_ref: real>  N = <n: real>
demo.i:7:   Thermal Conductivity [ {Of|Species|Subindex} <speciesname: string> ] = Volume_Average
demo.i:7:   Thermal Conductivity [ {Of|Species|Subindex} <speciesname: string> ] = Summed Contributions =
demo.i:7:      <contributions: string>
...
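
Searching a long log file by hand can be tedious. The following is a minimal Python sketch, not part of Aria, that pulls out diagnostics of the form shown above. The regular expression is inferred from these excerpts rather than from a format specification, and the function name find_diagnostics is hypothetical.

```python
import re

# Diagnostic lines look like "demo.i:7: Error: ..." in the excerpts above.
# This pattern is inferred from those excerpts (an assumption), not from a spec.
DIAGNOSTIC = re.compile(r"^(?P<file>[^:\s]+):(?P<line>\d+):\s?(?P<text>.*)$")


def find_diagnostics(log_lines, severity="Error"):
    """Collect (file, line_number, message_lines) for each diagnostic of one severity."""
    found = []
    current = None
    for raw in log_lines:
        match = DIAGNOSTIC.match(raw.rstrip("\n"))
        if match is None:
            current = None  # an echoed input line ends the current diagnostic
            continue
        text = match.group("text")
        if text.startswith(("Error:", "Warning:")):
            # A new diagnostic starts; keep it only if it has the requested severity.
            current = None
            if text.startswith(severity + ":"):
                current = (match.group("file"), int(match.group("line")), [text])
                found.append(current)
        elif current is not None:
            current[2].append(text)  # continuation line of the same diagnostic
    return found


# Lines copied from the log excerpt above.
sample = [
    "    BEGIN ARIA MATERIAL Kryptonite",
    "      Thermal Conductivity = constant tk=1.0",
    "demo.i:7: Error: No matching command line found for 'Thermal Conductivity = constant tk=1.0'",
    "demo.i:7: The following command lines with matching keywords are defined:",
    "demo.i:7:",
    "      heat conduction     = basic",
]

for file_name, line_number, message in find_diagnostics(sample):
    print(f"{file_name}, line {line_number}: {message[0]}")
```

For a real run, pass an open log file in place of the sample list. Calling the function with severity="Warning" collects warnings in the same way.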

6.5.2. Deprecated Syntax

Most Aria problems use stable syntax that does not change from one release to the next. Sometimes, however, features need to be removed for maintainability and sustainability of the codebase. This is done using a deprecation cycle that allows time to migrate input file syntax.

Aria uses a 6 to 12 month deprecation cycle (or longer for larger feature deprecations). When a deprecation is introduced, it immediately results in a warning, which shows up both in the log file and in the stderr/slurm stream. For example, if you used a deprecated post-processor in Aria version 5.10, you would see the following warning in the stderr/slurm output:

WARNING: Deprecated feature removed in Version 5.11 detected.
The 'POST PROCESS FLUX' solution options command is deprecated in favor of the new postprocessing syntax. Replace
  BEGIN SOLUTION OPTIONS
    post process FLUX heat_conduction on fluid_interface as HeatFlux
  END SOLUTION OPTIONS
with
  postprocess expression_flux of expression heat_conduction on fluid_interface as HeatFlux
for a direct replacement. See Sections 21.5 of the user manual for more information.
Please email sierra-help@sandia.gov if you have issues converting post process commands.

The same warning would also show up in the log file:

demo.i:82: Warning: Deprecated feature removed in Version 5.11 detected.
demo.i:82: The 'POST PROCESS FLUX' solution options command is deprecated in favor of the new postprocessing
demo.i:82:   syntax. Replace
demo.i:82:   BEGIN SOLUTION OPTIONS
demo.i:82:     post process FLUX heat_conduction on fluid_interface as HeatFlux
demo.i:82:   END SOLUTION OPTIONS
demo.i:82: with
demo.i:82:   postprocess expression_flux of expression heat_conduction on fluid_interface as HeatFlux
demo.i:82: for a direct replacement. See Sections 21.5 of the user manual for more information.
demo.i:82: Please email sierra-help@sandia.gov if you have issues converting post process commands.
demo.i:82: in line command POST PROCESS FLUX

Deprecation version numbers are always odd-numbered (non-release). When the 5.10 version is released, the version of sierra/daily is set to 5.11. In this example, using the deprecated post-processor produces only a warning in 5.10, but it is an error in sierra/daily during the 5.11 release cycle and in the next release (5.12). Pay attention to these warnings and update your inputs promptly to avoid errors in future releases.
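
The rule in this example can be written as a small version comparison. The Python sketch below is illustrative only: the function names are hypothetical, and it assumes versions are written as "major.minor" strings as in the warning text.

```python
def parse_version(text):
    """Split a 'major.minor' version string into a tuple of integers."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def deprecation_status(running, removed_in):
    """Return 'warning' before the removal version and 'error' from it onward."""
    # Compare as integer tuples so that 5.8 sorts before 5.11;
    # a plain string comparison would order them incorrectly.
    if parse_version(running) >= parse_version(removed_in):
        return "error"
    return "warning"


# The feature in the example above is flagged as removed in version 5.11.
for version in ("5.10", "5.11", "5.12"):
    print(version, deprecation_status(version, "5.11"))
```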

6.5.3. Non-Physical Temperatures

The conservation of energy equation in its continuous form (and in the absence of source terms) satisfies a maximum principle. That is, for a steady-state problem the maximum and minimum temperatures in the domain must occur on the boundary of the domain, and for a transient problem they must occur either on the boundary or in the initial condition. However, the discretized form of the energy equation does not necessarily satisfy a discrete version of the maximum principle (DMP). In particular, for the standard Galerkin finite element method used in Aria, there are conditions on the mesh quality that must be met for the diffusion operator to satisfy a DMP.

The article “Don’t suppress the wiggles—They’re telling you something!” provides useful background on why oscillations occur in finite element solutions and what they mean for your solution.

In general, if you encounter non-physical high or low temperatures the troubleshooting steps you should start with are:

  1. Examine the mesh in the region of spurious temperature to ensure you have enough resolution in the direction of heat flow. Pay particular attention to thin regions and make sure they are more than 1 element thick. Also note that there can be issues on edge/corner elements that have multiple different boundary conditions applied to them, particularly with tetrahedral elements.

  2. Try using first order time integration, lumped_mass, and lumped_flux boundary conditions (although this will increase discretization error).

  3. Enable the flux limiter option if using a tetrahedral mesh (described below).

For linear tetrahedral elements the DMP condition is that there are no obtuse dihedral angles between faces of elements in the mesh [44]. Anisotropic thermal conductivity can also contribute to this effect [45]. Additionally, the consistent mass term can also lead to DMP violations in transient problems [46]. In practice this can lead to non-physical temperatures appearing in the solutions generated by Aria, most commonly when large heat fluxes are applied to an initially cold domain with a low thermal conductivity.
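
The dihedral angle condition for linear tetrahedra can be checked on a single element with a few lines of vector algebra. The Python sketch below is not an Aria or mesh-tool utility. The function names and the two example elements are made up for illustration. It computes the six interior dihedral angles from the outward face normals.

```python
import math
from itertools import combinations


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def outward_normal(vertices, opposite):
    """Outward (unnormalized) normal of the face opposite vertex index `opposite`."""
    a, b, c = (v for i, v in enumerate(vertices) if i != opposite)
    n = _cross(_sub(b, a), _sub(c, a))
    # Flip the normal if it points toward the opposite vertex (into the element).
    if _dot(n, _sub(vertices[opposite], a)) > 0.0:
        n = tuple(-x for x in n)
    return n


def dihedral_angles(vertices):
    """Return the six interior dihedral angles of a tetrahedron, in degrees."""
    normals = [outward_normal(vertices, i) for i in range(4)]
    angles = []
    for ni, nj in combinations(normals, 2):
        cos_between = _dot(ni, nj) / math.sqrt(_dot(ni, ni) * _dot(nj, nj))
        cos_between = max(-1.0, min(1.0, cos_between))  # guard against round-off
        # The interior dihedral angle is the supplement of the angle between outward normals.
        angles.append(180.0 - math.degrees(math.acos(cos_between)))
    return angles


def has_obtuse_dihedral(vertices, tol=1e-9):
    """True if any dihedral angle exceeds 90 degrees (beyond a small tolerance)."""
    return max(dihedral_angles(vertices)) > 90.0 + tol


# Hypothetical elements: a right-angled corner tetrahedron and a flat sliver.
corner = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
sliver = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.3, 0.3, 0.05)]

print(has_obtuse_dihedral(corner))  # False: the largest angle is 90 degrees
print(has_obtuse_dihedral(sliver))  # True: the low apex creates obtuse angles
```

Flat sliver elements of this kind are the ones to look for when inspecting a tetrahedral mesh near a region of spurious temperature.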

If non-physical temperature solutions are observed, Aria supports an option to apply a nonlinear flux correction based on the work of Kuzmin et al. that restores a DMP on arbitrary meshes with arbitrary material properties [46, 47]. This option may be activated by using the APPLY FLUX LIMITER STABILIZATION line command. For transient problems it is also essential to use the LUMPED_MASS form of the time derivative term. At present, this option is only intended for use with linear elements, not second order 10 node tetrahedral elements or 27 node hexahedral elements. Source terms also do not have any limiters applied to them at present, so they may still cause non-physical temperatures in some cases.

Additionally, applying the stabilization has several downsides, so we recommend enabling it only if problematic non-physical temperatures are observed without it. Solutions with the stabilization activated are more diffusive and have higher error than unstabilized solutions (though they do converge at the same order with mesh refinement). The stabilization operator is also nonlinear and can adversely affect the convergence of the Newton-Raphson iteration within each time step. Combined with the cost of calculating the stabilization terms, this can have a noticeable impact on simulation runtime.

6.5.4. Repeated Time Step Failure

If your problem fails a time step, Aria will decrease the step size and attempt the solve(s) again. If this keeps happening, Aria will eventually fail when the time step reaches the allowable minimum. Some strategies to try if you encounter this type of failure are:

  1. Ensure that your problem is mathematically well-posed and that the linear solves are completing with a sufficiently low final residual (it should be lower than your nonlinear residual target).

  2. Increase the number of nonlinear iterations if the problem was converging but not reaching your target residual before hitting the iteration limit.

  3. Change the nonlinear solution strategy from Newton to Line_search to enable an adaptive under-relaxation that attempts to ensure that each nonlinear step always results in a reduction in the nonlinear residual. This will increase the cost of each nonlinear iteration.
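
To get a feel for how quickly repeated failures exhaust the available step sizes, the sketch below counts how many consecutive cuts fit between the current step and the minimum. The reduction factor of 0.5, the step sizes, and the function name are illustrative assumptions. The text above does not state the factor Aria uses, so check your own input settings and log file for the actual values.

```python
def cuts_before_minimum(dt, dt_min, factor=0.5):
    """Count how many times dt can be multiplied by `factor` without dropping below dt_min."""
    if not 0.0 < factor < 1.0:
        raise ValueError("factor must be between 0 and 1")
    cuts = 0
    while dt * factor >= dt_min:
        dt *= factor
        cuts += 1
    return cuts


# Hypothetical values: a 1.0 s step, a 1e-3 s minimum, and halving on each failure.
print(cuts_before_minimum(1.0, 1.0e-3))
```

Under these assumptions only a handful of consecutive failures separate a normal step from the minimum, which is why a persistently failing solve ends the run quickly.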

6.5.5. Linear Solver Failure

If the linear solver fails to solve your linear system, an error code will be reported in the log file. The Solver Error Codes section outlines the meaning of these error codes. In general, if you see error codes you should re-examine your choice of linear solver or check whether the problem as posed is non-singular. The Solver Selection Guidelines section has suggestions for how to incrementally adjust the solver settings for harder-to-solve problems.

6.5.6. Out of Memory

When running memory intensive computations, such as calculating view factors, on memory limited HPC platforms, your simulation will be killed if the compute node runs out of memory. This can also happen if you run a conventional problem with too many elements per core. Unfortunately there is no hard guideline for the maximum problem size, since it depends on what options are used in the simulation and varies between HPC platforms. Without enclosure radiation, however, you should be able to run several million elements per core before running into memory issues.

When a simulation on the HPC runs out of memory, the user who launched it will typically receive an automated email from the HPC system notifying them that their job ran out of memory, although this notification may not be sent immediately. If your jobs are running out of memory you generally have a few options:

  1. Run the job on a larger number of nodes or cores

  2. Use fewer cores per node than the total on an HPC platform (e.g. request twice as many nodes but use half as many cores per node)

  3. Run the job on a different system that has more memory per core available
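
The effect of the second option can be seen with simple arithmetic. The Python sketch below uses a hypothetical node with 128 GB of memory and 32 cores; these numbers and the function names are assumptions for illustration, so substitute the values for your own platform.

```python
def memory_per_rank_gb(node_memory_gb, ranks_per_node):
    """Memory available to each MPI rank if a node's memory is shared evenly."""
    return node_memory_gb / ranks_per_node


def nodes_needed(total_ranks, ranks_per_node):
    """Number of nodes required to place all ranks (ceiling division)."""
    return -(-total_ranks // ranks_per_node)


NODE_MEMORY_GB = 128   # hypothetical node memory
CORES_PER_NODE = 32    # hypothetical core count per node
TOTAL_RANKS = 256      # hypothetical job size

for ranks_per_node in (CORES_PER_NODE, CORES_PER_NODE // 2):
    print(
        f"{ranks_per_node} ranks/node: "
        f"{nodes_needed(TOTAL_RANKS, ranks_per_node)} nodes, "
        f"{memory_per_rank_gb(NODE_MEMORY_GB, ranks_per_node):.1f} GB per rank"
    )
```

Halving the ranks per node doubles the memory available to each rank while keeping the same total number of ranks, at the cost of requesting twice as many nodes.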

Since most memory-limited simulations in Aria are due to view factor calculations for enclosure radiation, the Chaparral log file also includes estimates of memory use to help you size your job submission appropriately.