Janus error messages


The instructions below will assist users in deciphering the error messages output when a program abends.

First, you can run your application in the debugger, which should tell you where your program is abending. This page is helpful whether or not the debugger is available, particularly when the debugger for some reason can't show you where the program is abending. Also, with this utility you don't need to recompile with '-g'; you can do 'post-mortem' analysis.

This page has two parts. The first covers the easiest abend to fix: too little stack specified on the yod command.

The second part describes a utility I have, mktrace, which takes the addresses from the abend message, disassembles the program and locates the instructions at those addresses. It also displays info that will help you relate the assembler instructions to the source code. Note that you don't have to recompile and rerun your program to use mktrace.

Part 1: looking at an abend message


Note that this assumes you are already logged into janus.sandia.gov.

The things to note are the lines 'error_code' and 'VECTOR'.

janus ~/tst 255 > yod -sz 1 test3c
----- DEBUG: PCB, CONTEXT, STACK TRACE ---------------------

PROCESSOR [ 0]
log_nid  =    0     phys_nid  =  946    host_id =  33665   host_pid  =   40
group_id =   23     num_procs =    1    log_pid =      0   local_pid =    2
base_node_index =    0   last_node_index =    0

text_base  = 0x00020000   text_len  = 0x00021000
data_base  = 0x00820000   data_len  = 0x0001b000
stack_base = 0x7fc00000   stack_len = 0x00043000
heap_base  = 0x00c00000   heap_len  = 0x0780b000   comm_len  = 0x00040000

ss  = 0x0000001f     es  = 0x0000001f   ds  = 0x0000001f
edi = 0x00000000    esi  = 0x00000000   ebp = 0x7fc0f6e0   esp = 0x7fbde990
ebx = 0x00002710    edx  = 0x00000000   ecx = 0x0000000c   eax = 0x00000014
cs  = 0x00000017    eflg = 0x00010206   prev_sp = 0x7fbde990
error_code = 6

VECTOR #[14][PAGE FAULT]  fault_address = 0x7fbde99c
Stack Trace:  ------------------------------
[ 0][0x00020284] restart_address
[ 1][0x00020199]
[ 2][0x0002136f]
[ 3][0x00020120]
------------------------------------------------------------
Application process STOPPED  -----------------


PROCESSOR [ 0] indicates that the abend occurred on the first of the two processors on the board. 'log_pid = 0' indicates that node 0 abended (out of 'num_procs = 1' total nodes in the yod process).

Interrupt Vector 14 is a page fault exception. It is described in Vol 3 of the Pentium Pro Developers manual, Section 5.12, page 5-39. For a page fault, error_code = 6 (binary 110) means a user-mode write to a page that is not present.

In the example above, esp and fault_address both lie just below stack_base, outside the region from stack_base to stack_base+stack_len, which indicates that the stack is not large enough. Use the 'size your_program_name' command (see below) to find what size stack you need and include it on the 'yod' command line. The output of the size command is a lower bound on how much stack space you need. The 'dec' column is the total of the first 3 columns. I just round it up to the nearest megabyte.

This error message also seems to occur for other memory errors such as out-of-bounds array references, malloc'ing too much memory, etc. For these cases, -stack won't help you. See part 2 below for info on finding where your program is abending.
janus ~/tst 258 > size test3c
text    data    bss     dec     hex     filename
132257  28164   79464   239885  3a90d   test3c
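
The rounding described above can be sketched in a few lines of Python. This is only an illustration and not part of any Janus tool: the function name stack_megabytes is made up for this example, and it assumes that '1M' on the yod command line means 2**20 bytes. Remember that the result is a lower bound, so you may still need more.

```python
# Sketch: turn the output of `size` into a -stack value for yod.
MEGABYTE = 1024 * 1024  # assumption: "1M" on the yod command line means 2**20 bytes


def stack_megabytes(size_output: str) -> int:
    """Return text+data+bss from `size` output, rounded up to whole megabytes."""
    header, values = size_output.strip().splitlines()[:2]
    fields = dict(zip(header.split(), values.split()))
    total = int(fields["text"]) + int(fields["data"]) + int(fields["bss"])
    # The 'dec' column should be the total of the first three columns.
    assert total == int(fields["dec"])
    return -(-total // MEGABYTE)  # ceiling division


# The `size test3c` output shown above.
SIZE_OUTPUT = """\
text    data    bss     dec     hex     filename
132257  28164   79464   239885  3a90d   test3c
"""

if __name__ == "__main__":
    print(f"yod -sz 1 -stack {stack_megabytes(SIZE_OUTPUT)}M test3c")
```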

janus ~/tst 256 > yod -sz 1 -stack 1M test3c
heapsize=908525536
heapsize1=908525536
 x(1957)=0.001986, mflops=0.000000
 x(8719)=0.000359, mflops=0.000000
 x(8760)=0.000360, mflops=0.000000
STOP

janus ~/tst 257 >
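
A quick way to see where the addresses in an abend message fall is to compare them against the base/len pairs printed in the same message. The sketch below does that for the Part 1 message. The helper names (parse_hex_fields, locate) are hypothetical, and the interpretation is an assumption on my part: since the x86 stack grows toward lower addresses, an esp sitting below stack_base is consistent with the stack having outgrown its allocation.

```python
import re


def parse_hex_fields(abend_text: str) -> dict:
    """Collect every 'name = 0x...' pair found in an abend message."""
    pairs = re.findall(r"(\w+)\s*=\s*(0x[0-9a-fA-F]+)", abend_text)
    return {name: int(value, 16) for name, value in pairs}


def locate(fields: dict, address: int) -> str:
    """Name the region (text, data, stack or heap) that holds address, if any."""
    for region in ("text", "data", "stack", "heap"):
        base = fields[f"{region}_base"]
        if base <= address < base + fields[f"{region}_len"]:
            return region
    return "outside all regions"


# Relevant lines copied from the Part 1 abend message.
ABEND = """\
text_base  = 0x00020000   text_len  = 0x00021000
data_base  = 0x00820000   data_len  = 0x0001b000
stack_base = 0x7fc00000   stack_len = 0x00043000
heap_base  = 0x00c00000   heap_len  = 0x0780b000   comm_len  = 0x00040000
edi = 0x00000000    esi  = 0x00000000   ebp = 0x7fc0f6e0   esp = 0x7fbde990
VECTOR #[14][PAGE FAULT]  fault_address = 0x7fbde99c
"""

if __name__ == "__main__":
    fields = parse_hex_fields(ABEND)
    for name in ("ebp", "esp", "fault_address"):
        print(f"{name:14s} 0x{fields[name]:08x}  {locate(fields, fields[name])}")
    if fields["esp"] < fields["stack_base"]:
        print("esp is", fields["stack_base"] - fields["esp"], "bytes below stack_base")
```

The same check applied to the Part 2 message below shows a fault_address that is in none of the regions, while esp is still inside the stack. That is the kind of case where -stack won't help.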

Part 2: using the mktrace utility


I have a routine which reads the abend error message and finds the corresponding instructions in the program. Below is an example of using this utility.

Subroutine mysum contains an out-of-bounds memory reference: z(-20000) = 0.d0

janus ~/trace 131 > cat test.f
       real*8 z(100)
       print *,'hello you'
       x = sin(0.d0) - 1.d0
       y = 3.d0/x
c      z(-20000) = 0.d0
c      x = sqrt(z(-20000))
       call mysum(x,y)
       print *,'y=',y,x
       call exit
       end

       subroutine mysum(x,y)
       real*8 z(100)
       print *,'hello you,1'
       x = sin(0.d0) - 1.d0
       y = 3.d0/x
       z(-20000) = 0.d0
c      x = sqrt(z(-20000))
       print *,'y=',y,x
       return
       end
Compiling this into executable testf and yod'ing it onto a compute node gives:
janus ~/trace 127 > yod -sz 1 testf
 hello you
 hello you,1
----- DEBUG: PCB, CONTEXT, STACK TRACE ---------------------

PROCESSOR [ 0]
log_nid  =    0     phys_nid  =  899    host_id =  33665   host_pid  =    1
group_id =   20     num_procs =    1    log_pid =      0   local_pid =    2
base_node_index =    0   last_node_index =    0

text_base  = 0x00020000   text_len  = 0x0002c000
data_base  = 0x00820000   data_len  = 0x0001d000
stack_base = 0x7fc00000   stack_len = 0x00043000
heap_base  = 0x00c00000   heap_len  = 0x077fe000   comm_len  = 0x00040000

ss  = 0x0000001f     es  = 0x0000001f   ds  = 0x0000001f
edi = 0x00000000    esi  = 0x00000000   ebp = 0x7fc40420   esp = 0x7fc40408
ebx = 0x00828a20    edx  = 0x00000000   ecx = 0x00000009   eax = 0x00000000
cs  = 0x00000017    eflg = 0x00010246   prev_sp = 0x7fc40408
error_code = 6

VECTOR #[14][PAGE FAULT]  fault_address = 0x00801920
Stack Trace:  ------------------------------
[ 0][0x000203cb] restart_address
[ 1][0x0002024d]
[ 2][0x0002016a]
[ 3][0x0002150f]
[ 4][0x00020120]
------------------------------------------------------------
Application process STOPPED  -----------------

These error messages give me no clue as to where the program abended.

The mktrace utility will disassemble the program and find the instruction at each of the Stack Trace addresses. This may or may not be helpful, but it is much better than nothing and costs you nothing.

To use mktrace, copy the error message above (everything between the lines

 ----- DEBUG: PCB, CONTEXT, STACK TRACE --------------------- 
and
 Application process STOPPED  ----------------- 
) into a file named 'abend'. You can run mktrace from janus or sasn100.
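
If you captured your program's output in a file instead of copying it from the screen, the copy step can be scripted. The sketch below is only an illustration: the function name extract_abend and the idea of a captured log file are assumptions, and it keeps the two marker lines themselves, which may or may not matter to mktrace.

```python
import sys

START = "----- DEBUG: PCB, CONTEXT, STACK TRACE"
END = "Application process STOPPED"


def extract_abend(log_text: str) -> str:
    """Return the lines from the DEBUG header through the STOPPED line."""
    kept, inside = [], False
    for line in log_text.splitlines():
        if START in line:
            inside = True
        if inside:
            kept.append(line)
            if END in line:
                break
    return "\n".join(kept) + "\n" if kept else ""


# Usage (hypothetical script and log names): python extract_abend.py captured_output.txt
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as log:
        message = extract_abend(log.read())
    # 'abend' is the default file name that mktrace looks for.
    with open("abend", "w") as out:
        out.write(message)
```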

The syntax of mktrace is:

janus ~/trace 129 > mktrace your_abending_program_name [name_of_abend_msg_file [num_of_lines_of_assembly_to_dump]]
Note: mktrace is in /usr/community/bin. The default for name_of_abend_msg_file is 'abend'. The default for num_of_lines_of_assembly_to_dump is 20. If you want to specify the num_of_lines_of_assembly_to_dump, you must specify the name_of_abend_msg_file.
Then type 
janus ~/trace 129 > mktrace testf

This will generate a lot of info, which is shown below. The last part of the info is a call tree summary by stack trace address. I show it first.

 frame stack summary 
     Note that for all stack_trace addresses except the
      first one, the address points to the NEXT line that would have
      been executed.
   stack_trace        instr_addr instr_label  instruction                in_procedure
        000203cb,       000203cb <.EN2_291+9c> movl   %eax,0x801920,      
        0002024d,       00020248 <.EN1_308+ad> call   00020320 ,  
        0002016a,       00020165 <.B92>        call   00020190 ,   
        0002150f,       0002150a <.B2143+38>   call   00020130 ,
        00020120,       0002011b               call   00020b50 ,
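
The note in the summary header (every stack_trace address except the first points at the NEXT instruction) can be checked by hand. In this example every caller used a direct near call, which on a 32-bit x86 is 5 bytes long (opcode E8 plus a 4-byte displacement), so the calling instruction sits 5 bytes before the return address. The sketch below shows only that arithmetic. It is not how mktrace finds instr_addr, and it would give wrong answers for indirect calls, which have other lengths.

```python
CALL_REL32_LENGTH = 5  # opcode E8 plus a 4-byte displacement (direct near call)


def call_site(stack_trace_address: int, is_first_frame: bool) -> int:
    """Map a stack trace address to the instruction it belongs to.

    The first frame is the faulting instruction itself; the others are
    return addresses, assumed here to follow a 5-byte direct call.
    """
    if is_first_frame:
        return stack_trace_address
    return stack_trace_address - CALL_REL32_LENGTH


# Stack trace addresses from the testf abend message.
STACK_TRACE = [0x000203CB, 0x0002024D, 0x0002016A, 0x0002150F, 0x00020120]

if __name__ == "__main__":
    for index, address in enumerate(STACK_TRACE):
        print(f"{address:08x}, {call_site(address, index == 0):08x}")
```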
Read the frame stack summary from the bottom up to get the call order: cstart calls main, main calls MAIN_, and MAIN_ calls mysum_. Note that pgi f77 puts an '_' at the end of all your subroutine names. So now you at least know that the program bombed in subroutine mysum. Now we will look at the full output:
janus ~/trace 129 > mktrace testf
input trace address=000203cb
input trace address=0002024d
input trace address=0002016a
input trace address=0002150f
input trace address=00020120
This shows you the stack trace addresses found in the file 'abend'. You must copy the abend message from the screen and put it into a file named 'abend'.
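
Pulling the stack trace addresses out of the abend file might look like the sketch below. This is not mktrace's code, which is not shown on this page. The function name trace_addresses is made up for the example, and it assumes the '[ n][0x...]' line layout seen in the messages above.

```python
import re

# A stack trace line looks like "[ 0][0x000203cb] restart_address".
TRACE_LINE = re.compile(r"^\[\s*(\d+)\]\[0x([0-9a-fA-F]+)\]")


def trace_addresses(abend_text: str) -> list:
    """Return the stack trace addresses in the order they appear."""
    addresses = []
    for line in abend_text.splitlines():
        match = TRACE_LINE.match(line.strip())
        if match:
            addresses.append(int(match.group(2), 16))
    return addresses


# Lines copied from the testf abend message.
ABEND = """\
VECTOR #[14][PAGE FAULT]  fault_address = 0x00801920
Stack Trace:  ------------------------------
[ 0][0x000203cb] restart_address
[ 1][0x0002024d]
[ 2][0x0002016a]
[ 3][0x0002150f]
[ 4][0x00020120]
"""

if __name__ == "__main__":
    for address in trace_addresses(ABEND):
        print(f"input trace address={address:08x}")
```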
Show up to 20 calls above tr_addr=000203cb
      call   00020190 
      call   0003a460 
      call   0003a1f0 
      call   000395b0 
      call   00039780 
      call   00039c20 
      call   0004b4e0 <__mth_i_dsin>
      call   00020320 
      call   0003a1f0 
      call   000395b0 
      call   00039780 
      call   00039700 
      call   00039700 
      call   00039c20 
      call   0003a460 
      call   0003a1f0 
      call   000395b0 
      call   00039780 
      call   00039c20 
      call   0004b4e0 <__mth_i_dsin>
One of the main problems with assembler is that you can't really tell where you are in the program. We know we are somewhere in the subroutine mysum, but where? We also know the address of the assembler instruction that caused the abend. The '20 calls above' listing shown above gives the (up to) 20 'call' instructions preceding the abending instruction. This is not a call tree; it is just a display of the call instructions found in the disassembled program at addresses that precede the current stack trace address. The purpose is to give you more context as to where you might be in your subroutine. In the above example, we see that the __mth_i_dsin routine was called before the abending instruction. If we look back at the source code above, we see that there is only one call to 'sin' in mysum, so that tells us that we are below that call. The next part displayed is the 20 instructions leading up to and including the abending instruction.
tr_addr=000203cb    last_big_label=
     00020372 <.EN2_291+43> movl   $0x0,0xc(%esp,1)
     0002037a <.EN2_291+4b> movl   $0x1,0x8(%esp,1)
     00020382 <.EN2_291+53> movl   $0xf,0x4(%esp,1)
     0002038a <.EN2_291+5b> movl   $0x820092,(%esp,1)
     00020391 <.EN2_291+62> call   00039780 
     00020396 <.EN2_291+67> call   00039c20 
     0002039b <.EN2_291+6c> movl   0x82008c,%edx
     000203a1 <.EN2_291+72> movl   0x820088,%eax
     000203a6 <.EN2_291+77> movl   %edx,0x4(%esp,1)
     000203aa <.EN2_291+7b> movl   %eax,(%esp,1)
     000203ad <.EN2_291+7e> call   0004b4e0 <__mth_i_dsin>
     000203b2 <.EN2_291+83> fsubl  0x820080
     000203b8 <.EN2_291+89> fstps  (%ebx)
     000203ba <.EN2_291+8b> flds   (%ebx)
     000203bc <.EN2_291+8d> fdivrl 0x820078
     000203c2 <.EN2_291+93> movl   0xc(%ebp),%eax
     000203c5 <.EN2_291+96> fstps  (%eax)
     000203c7 <.EN2_291+98> xorl   %eax,%eax
     000203c9 <.EN2_291+9a> xorl   %edx,%edx
     000203cb <.EN2_291+9c> movl   %eax,0x801920
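
The two listings printed for each trace address (the calls above it and the instructions leading up to it) amount to two filters over the disassembly. The sketch below illustrates the idea on a few lines taken from the listing above. It is my own illustration rather than mktrace's code: the names (Instruction, parse, calls_above, instructions_up_to) are hypothetical, and it assumes the disassembly is already in address order and uses the line layout shown on this page.

```python
import re
from typing import List, NamedTuple


class Instruction(NamedTuple):
    address: int
    mnemonic: str
    text: str  # the original disassembly line, stripped


# Matches lines such as "00020391 <.EN2_291+62> call   00039780".
LINE = re.compile(r"^\s*([0-9a-fA-F]{8})\s+(?:<[^>]+>\s+)?(\w+)")


def parse(disassembly: str) -> List[Instruction]:
    """Turn disassembly text into a list of instructions, skipping other lines."""
    instructions = []
    for line in disassembly.splitlines():
        match = LINE.match(line)
        if match:
            instructions.append(
                Instruction(int(match.group(1), 16), match.group(2), line.strip()))
    return instructions


def calls_above(instructions, tr_addr, limit=20):
    """Last `limit` call instructions located before tr_addr (not a call tree)."""
    calls = [i for i in instructions if i.mnemonic == "call" and i.address < tr_addr]
    return calls[-limit:]


def instructions_up_to(instructions, tr_addr, limit=20):
    """Last `limit` instructions up to and including the one at tr_addr."""
    window = [i for i in instructions if i.address <= tr_addr]
    return window[-limit:]


DISASSEMBLY = """\
     00020391 <.EN2_291+62> call   00039780
     00020396 <.EN2_291+67> call   00039c20
     0002039b <.EN2_291+6c> movl   0x82008c,%edx
     000203ad <.EN2_291+7e> call   0004b4e0 <__mth_i_dsin>
     000203b2 <.EN2_291+83> fsubl  0x820080
     000203c9 <.EN2_291+9a> xorl   %edx,%edx
     000203cb <.EN2_291+9c> movl   %eax,0x801920
"""

if __name__ == "__main__":
    program = parse(DISASSEMBLY)
    for call in calls_above(program, 0x000203CB):
        print("call above:", call.text)
    for instruction in instructions_up_to(program, 0x000203CB, limit=3):
        print("context:   ", instruction.text)
```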
OK, so in the disassembly for tr_addr=000203cb there are calls to 'fio_ld*' routines (this is the print statement) and to 'sin'. Then it looks like something is subtracted, then something is divided, then registers eax and edx are cleared. Lastly, the abending instruction (tr_addr=000203cb) tries to move register eax to address 0x801920, the fault_address reported in the abend message. This address is outside our memory bounds. Not bad for the subroutine:
       print *,'hello you,1'
       x = sin(0.d0) - 1.d0
       y = 3.d0/x
       z(-20000) = 0.d0
Now we do the same thing for each address in the stack trace from the file abend. The next stack trace address is 0002024d. All the hard work has really been done now, however. The rest of the stack trace info is helpful for seeing which routine was calling mysum. Typically, a subroutine might get called from many other subroutines, so the call tree is important. For each address, we show (up to) 20 calls that were found at addresses above the current trace address, and the 20 instructions above the trace address.
Show up to 20 calls above tr_addr=0002024d
      call   00020b50 
      call   00020190 
      call   0003a460 
      call   0003a1f0 
      call   000395b0 
      call   00039780 
      call   00039c20 
      call   0004b4e0 <__mth_i_dsin>
      call   00020320 
tr_addr=0002024d    last_big_label=
     000201db <.EN1_308+40> movl   $0x0,0xc(%esp,1)
     000201e3 <.EN1_308+48> movl   $0x1,0x8(%esp,1)
     000201eb <.EN1_308+50> movl   $0xf,0x4(%esp,1)
     000201f3 <.EN1_308+58> movl   $0x820062,(%esp,1)
     000201fa <.EN1_308+5f> call   00039780 
     000201ff <.EN1_308+64> call   00039c20 
     00020204 <.EN1_308+69> movl   0x82005c,%edx
     0002020a <.EN1_308+6f> movl   0x820058,%eax
     0002020f <.EN1_308+74> movl   %edx,0x4(%esp,1)
     00020213 <.EN1_308+78> movl   %eax,(%esp,1)
     00020216 <.EN1_308+7b> call   0004b4e0 <__mth_i_dsin>
     0002021b <.EN1_308+80> fsubl  0x820050
     00020221 <.EN1_308+86> fstps  0x828a20
     00020227 <.EN1_308+8c> flds   0x828a20
     0002022d <.EN1_308+92> fdivrl 0x820048
     00020233 <.EN1_308+98> fstps  0x828a24
     00020239 <.EN1_308+9e> movl   $0x828a24,0x4(%esp,1)
     00020241 <.EN1_308+a6> movl   $0x828a20,(%esp,1)
     00020248 <.EN1_308+ad> call   00020320 
     0002024d <.EN1_308+b2> movl   $0x6,0x8(%esp,1)
Show up to 20 calls above tr_addr=0002016a
      call   00020b50 
      call   00020190 
tr_addr=0002016a    last_big_label=
     00020120  addb   %al,(%eax)
     00020122  addb   %cl,%al
     00020124  leal   0x0(%esi),%esi
     0002012a  leal   0x0(%esi),%esi
     00020130  subl   $0xc,%esp
     00020133  movl   %ebp,0x8(%esp,1)
     00020137  leal   0x8(%esp,1),%ebp
     0002013b  movl   0x8(%ebp),%eax
     0002013e  movl   %eax,0x82d5bc
     00020143  movl   0xc(%ebp),%eax
     00020146  movl   %eax,0x82d5b8
     0002014b  movl   0x82354c,%eax
     00020150  testl  %eax,%eax
     00020152  je     00020165 <.B92>
     00020154  xorl   %eax,%eax
     00020156  movl   %eax,0x823550
     0002015b  movl   %eax,0x82354c
     00020160  movl   %eax,0x823548
     00020165 <.B92> call   00020190 
     0002016a <.B92+5> xorl   %eax,%eax
Show up to 20 calls above tr_addr=0002150f
      call   000218f0 
      call   00021b70 
      call   00024f50 
      call   00024f50 
      call   00047c70 
      call   000472c0 
      call   0002fbf0 
      call   00024f50 
      call   00047c70 
      call   000472c0 
      call   0002aef0 
      call   00024f50 
      call   000472c0 
      call   00025eb0 
      call   000204f0 <_bcast_start_main>
      call   00045010 
      call   00031810 
      call   0004b510 <_init>
      call   00047430 
      call   00020130 
tr_addr=0002150f    last_big_label=
     000214b3 <.B2119+2f> xorl   %edx,%edx
     000214b5 <.B2119+31> movw   0x12(%eax),%dx
     000214b9 <.B2119+35> cmpl   $0x1,%edx
     000214bc <.B2119+38> je     000214d2 <.B2143>
     000214be <.B2119+3a> movl   $0x21640,0x4(%esp,1)
     000214c6 <.B2119+42> movl   $0x17,(%esp,1)
     000214cd <.B2119+49> call   00031810 
     000214d2 <.B2143> call   0004b510 <_init>
     000214d7 <.B2143+5> movl   $0x4b520,(%esp,1)
     000214de <.B2143+c> call   00047430 
     000214e3 <.B2143+11> movl   0xfffffd4c(%ebp),%eax
     000214e9 <.B2143+17> movl   %eax,0xc(%esp,1)
     000214ed <.B2143+1b> movl   0xfffffd40(%ebp),%eax
     000214f3 <.B2143+21> movl   %eax,0x8(%esp,1)
     000214f7 <.B2143+25> movl   0xfffffd44(%ebp),%eax
     000214fd <.B2143+2b> movl   %eax,0x4(%esp,1)
     00021501 <.B2143+2f> movl   0xfffffd50(%ebp),%eax
     00021507 <.B2143+35> movl   %eax,(%esp,1)
     0002150a <.B2143+38> call   00020130 
     0002150f <.B2143+3d> movl   %eax,%edi
Show up to 20 calls above tr_addr=00020120
      call   00020b50 
tr_addr=00020120    last_big_label=
     000200f8  nop
     000200f9  nop
     000200fa  nop
     000200fb  nop
     000200fc  nop
     000200fd  nop
     000200fe  nop
     000200ff  nop
     00020100  nop
     00020101  movl   0x8(%esp,1),%eax
     00020105  movl   %eax,0x837634
     0002010a  movl   $0x0,%eax
     0002010f  movl   %eax,%ebx
     00020111  movl   %eax,%ecx
     00020113  movl   %eax,%edx
     00020115  movl   %eax,%esi
     00020117  movl   %eax,%edi
     00020119  movl   %eax,%ebp
     0002011b  call   00020b50 
     00020120  addb   %al,(%eax)

 frame stack summary 
     Note that for all stack_trace addresses except the
      first one, the address points to the NEXT line to be executed.
   stack_trace        instr_addr instr_label  instruction                in_procedure
        000203cb,       000203cb <.EN2_291+9c> movl   %eax,0x801920,      
        0002024d,       00020248 <.EN1_308+ad> call   00020320 ,  
        0002016a,       00020165 <.B92>        call   00020190 ,   
        0002150f,       0002150a <.B2143+38>   call   00020130 ,
        00020120,       0002011b               call   00020b50 ,

You can generate an assembler listing of your program with
cif77 -c -Manno -S -Mkeepasm test.f -o test.s
This will intermix the source code with assembly in test.s.
Email if you have any questions.

Note: This facility is provided on an "as is" basis. Eventually the debugger will do this (only much better). This utility allows 'post-mortem' analysis, however: you can do it after the abend, without having to recompile your application with debug switches, rerun it, and hope that it still abends.

Updated 12/3/2002 by Gerry Quinlan