You can first run your application in the debugger, which should tell you where your program is abending. This page is helpful whether or not the debugger is available, particularly when for some reason the debugger isn't able to show you where the program is abending. Also, with the utility described here, you don't need to recompile with '-g'. You can do 'post-mortem' analysis.
This page has two parts. The first covers the easiest abend to fix: too little stack specified on the yod command.
The second part describes a utility I have, mktrace, which takes the addresses from the abend message, disassembles the program, and locates the instructions those addresses refer to. It also displays info that will help you relate the assembler instructions to the source code. Note that you don't have to recompile and rerun your program to use mktrace.
Note that this assumes you are already logged into janus.sandia.gov.
The things to note are the lines 'error_code' and 'VECTOR'.
janus ~/tst 255 > yod -sz 1 test3c
----- DEBUG: PCB, CONTEXT, STACK TRACE ---------------------
PROCESSOR [ 0]
log_nid = 0 phys_nid = 946
host_id = 33665 host_pid = 40
group_id = 23 num_procs = 1
log_pid = 0 local_pid = 2
base_node_index = 0 last_node_index = 0
text_base = 0x00020000 text_len = 0x00021000
data_base = 0x00820000 data_len = 0x0001b000
stack_base = 0x7fc00000 stack_len = 0x00043000
heap_base = 0x00c00000 heap_len = 0x0780b000
comm_len = 0x00040000
ss = 0x0000001f es = 0x0000001f ds = 0x0000001f
edi = 0x00000000 esi = 0x00000000 ebp = 0x7fc0f6e0
esp = 0x7fbde990 ebx = 0x00002710 edx = 0x00000000
ecx = 0x0000000c eax = 0x00000014 cs = 0x00000017
eflg = 0x00010206 prev_sp = 0x7fbde990
error_code = 6
VECTOR #[14][PAGE FAULT]
fault_address = 0x7fbde99c
Stack Trace: ------------------------------
[ 0][0x00020284] restart_address
[ 1][0x00020199]
[ 2][0x0002136f]
[ 3][0x00020120]
------------------------------------------------------------
Application process STOPPED -----------------
PROCESSOR [ 0] indicates that it was the first (of two) processors on the board that abended. The 'log_pid = 0' indicates that it was node 0 that abended (out of 'num_procs = 1' total nodes in the yod process).
Interrupt Vector 14 is a page fault exception, described in Vol 3 of the Pentium Pro Developers manual, Section 5.12, page 5-39.
So for the abend error_code above, we have:
error_code 6 = (P flag=0)*1 + (W/R flag=1)*2 + (U/S flag=1)*4 + (RSVD flag=0)*8
The page was not present (P flag=0). I was attempting to write to it (W/R=1). The program was in user mode (U/S flag=1), not doing a system call. The exception was not caused by the program modifying some system reserved bits (RSVD=0).
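The same decoding can be done mechanically. Here is a minimal Python sketch of the arithmetic above; the function and table names are mine, and only the low four bits of the error code are looked at.

```python
# Low four bits of the x86 page-fault error code, as used in the sum above.
PAGE_FAULT_FLAGS = (
    ("P", 1),     # 0 = page not present, 1 = protection violation
    ("W/R", 2),   # 0 = read access, 1 = write access
    ("U/S", 4),   # 0 = supervisor mode, 1 = user mode
    ("RSVD", 8),  # 1 = reserved bits were set in a paging structure
)

def decode_page_fault(error_code):
    """Return a dict mapping each flag name to 0 or 1."""
    return {name: 1 if error_code & mask else 0
            for name, mask in PAGE_FAULT_FLAGS}

print(decode_page_fault(6))
# {'P': 0, 'W/R': 1, 'U/S': 1, 'RSVD': 0}
```

For error_code = 6 this gives the same reading as above: a user-mode write to a page that was not present.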
Looking at the address that caused the fault, fault_address = 0x7fbde99c, and at this line from the abend message:
stack_base = 0x7fc00000 stack_len = 0x00043000
The stack has the range 0x7fc00000 to 0x7fc43000 and the fault address is outside this range. In fact, my test case did something like 'z(-20000) = 0.0', so the fault_address should be less than the stack_base, which it is.
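If you want to check this without doing hex arithmetic by hand, something like the following Python sketch would do. The field names are the ones printed in the abend message; the function names are made up for illustration, and the text fragment holds only the lines needed here.

```python
import re

# Only the abend message lines needed for the stack range check.
ABEND_TEXT = """
stack_base = 0x7fc00000 stack_len = 0x00043000
fault_address = 0x7fbde99c
"""

def parse_hex_fields(text):
    """Collect every 'name = 0x...' pair in the text into a dict of ints."""
    pairs = re.findall(r"(\w+)\s*=\s*(0x[0-9a-fA-F]+)", text)
    return {name: int(value, 16) for name, value in pairs}

def in_stack(fields):
    """True if fault_address lies in [stack_base, stack_base + stack_len)."""
    base = fields["stack_base"]
    return base <= fields["fault_address"] < base + fields["stack_len"]

fields = parse_hex_fields(ABEND_TEXT)
print(hex(fields["stack_base"] + fields["stack_len"]))      # 0x7fc43000
print(in_stack(fields))                                     # False
print(hex(fields["stack_base"] - fields["fault_address"]))  # 0x21664
```

The last line shows how far below stack_base the faulting address is.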
The fix is to give the program more stack on the yod command, here with '-stack 1M':
janus ~/tst 258 > size test3c
text data bss dec hex filename
132257 28164 79464 239885 3a90d test3c
janus ~/tst 256 > yod -sz 1 -stack 1M test3c
heapsize=908525536 heapsize1=908525536
x(1957)=0.001986, mflops=0.000000
x(8719)=0.000359, mflops=0.000000
x(8760)=0.000360, mflops=0.000000
STOP
janus ~/tst 257 >
I have a routine which reads the abend error message and finds the corresponding instruction in the program. Below is an example of using this utility.
Subroutine mysum contains an out-of-bounds memory reference: z(-20000) = 0.d0
janus ~/trace 131 > cat test.f
      real*8 z(100)
      print *,'hello you'
      x = sin(0.d0) - 1.d0
      y = 3.d0/x
c     z(-20000) = 0.d0
c     x = sqrt(z(-20000))
      call mysum(x,y)
      print *,'y=',y,x
      call exit
      end
      subroutine mysum(x,y)
      real*8 z(100)
      print *,'hello you,1'
      x = sin(0.d0) - 1.d0
      y = 3.d0/x
      z(-20000) = 0.d0
c     x = sqrt(z(-20000))
      print *,'y=',y,x
      return
      end
Compiling this into executable testf and yod'ing it onto a compute node gives:
janus ~/trace 127 > yod -sz 1 testf
hello you
hello you,1
----- DEBUG: PCB, CONTEXT, STACK TRACE ---------------------
PROCESSOR [ 0]
log_nid = 0 phys_nid = 899
host_id = 33665 host_pid = 1
group_id = 20 num_procs = 1
log_pid = 0 local_pid = 2
base_node_index = 0 last_node_index = 0
text_base = 0x00020000 text_len = 0x0002c000
data_base = 0x00820000 data_len = 0x0001d000
stack_base = 0x7fc00000 stack_len = 0x00043000
heap_base = 0x00c00000 heap_len = 0x077fe000
comm_len = 0x00040000
ss = 0x0000001f es = 0x0000001f ds = 0x0000001f
edi = 0x00000000 esi = 0x00000000 ebp = 0x7fc40420
esp = 0x7fc40408 ebx = 0x00828a20 edx = 0x00000000
ecx = 0x00000009 eax = 0x00000000 cs = 0x00000017
eflg = 0x00010246 prev_sp = 0x7fc40408
error_code = 6
VECTOR #[14][PAGE FAULT]
fault_address = 0x00801920
Stack Trace: ------------------------------
[ 0][0x000203cb] restart_address
[ 1][0x0002024d]
[ 2][0x0002016a]
[ 3][0x0002150f]
[ 4][0x00020120]
------------------------------------------------------------
Application process STOPPED -----------------
These error messages give me no clue as to where the program abended.
The mktrace utility will disassemble the program and find the location of the instructions (all the Stack Trace addresses). This may or may not be helpful but it is much better than nothing and costs you zero.
To use mktrace, copy the error message above, everything between the lines
----- DEBUG: PCB, CONTEXT, STACK TRACE ---------------------
and
Application process STOPPED -----------------
into a file named 'abend'. You can run mktrace from janus or sasn100.
The syntax of mktrace is:
janus ~/trace 129 > mktrace your_abending_program_name [name_of_abend_msg_file [num_of_lines_of_assembly_to_dump]]
Note: mktrace is in /usr/community/bin.
The default for name_of_abend_msg_file is 'abend'. The default for
num_of_lines_of_assembly_to_dump is 20. If you want to specify the
num_of_lines_of_assembly_to_dump, you must specify the name_of_abend_msg_file.
Then type:
janus ~/trace 129 > mktrace testf
This will generate a lot of info, which is shown below. The last part of the info is a call tree summary by stack trace address; I show it first.
frame stack summary
Note that for all stack_trace addresses except the
first one, the address points to the NEXT line that would have
been executed.
stack_trace instr_addr instr_label instruction in_procedure
000203cb, 000203cb <.EN2_291+9c> movl %eax,0x801920,
0002024d, 00020248 <.EN1_308+ad> call 00020320 ,
0002016a, 00020165 <.B92> call 00020190 ,
0002150f, 0002150a <.B2143+38> call 00020130 ,
00020120, 0002011b call 00020b50 ,
You have to read this from the bottom up to get the call order.
cstart calls main, main calls MAIN_, and MAIN_ calls mysum_. Note that
pgi f77 puts an '_' at the end of all your subroutine names. So now
you at least know that the program bombed in subroutine mysum.
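The rule noted in the summary (frame 0 is the faulting instruction itself, every other frame is a return address) can be sketched in Python as below. This is only an illustration, not mktrace's actual code: the DISASM table is a hand-copied subset of the disassembly shown on this page, and the function name is made up.

```python
import bisect

# (address, instruction) pairs copied from the disassembly on this page,
# sorted by address.
DISASM = [
    (0x0002011b, "call 00020b50"),
    (0x00020120, "addb %al,(%eax)"),
    (0x00020165, "call 00020190"),
    (0x0002016a, "xorl %eax,%eax"),
    (0x00020248, "call 00020320"),
    (0x0002024d, "movl $0x6,0x8(%esp,1)"),
    (0x000203c9, "xorl %edx,%edx"),
    (0x000203cb, "movl %eax,0x801920"),
    (0x0002150a, "call 00020130"),
    (0x0002150f, "movl %eax,%edi"),
]
ADDRS = [addr for addr, _ in DISASM]

def blamed_instruction(trace_addr, is_fault_frame):
    """Frame 0 holds the faulting instruction; any other frame holds a
    return address, so step back to the instruction just before it."""
    if is_fault_frame:
        i = bisect.bisect_right(ADDRS, trace_addr) - 1  # last addr <= trace_addr
    else:
        i = bisect.bisect_left(ADDRS, trace_addr) - 1   # last addr < trace_addr
    if i < 0:
        raise ValueError("address is below the disassembly table")
    return DISASM[i]

TRACE = [0x000203cb, 0x0002024d, 0x0002016a, 0x0002150f, 0x00020120]
for frame, addr in enumerate(TRACE):
    instr_addr, text = blamed_instruction(addr, frame == 0)
    print("%08x, %08x %s" % (addr, instr_addr, text))
```

With this table, each return address maps back to the 'call' instruction listed next to it in the summary.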
Now we will look at the full output:
janus ~/trace 129 > mktrace testf
input trace address=000203cb
input trace address=0002024d
input trace address=0002016a
input trace address=0002150f
input trace address=00020120
This shows you the stack trace addresses found in the file 'abend'. You must copy the abend message from the screen and put it into a file named 'abend'.
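To give an idea of what this first step involves, here is a small Python sketch that pulls the stack trace addresses out of abend message text. It is a guess at the general approach, not the way mktrace itself is written; the function name and the regular expression are mine.

```python
import re

# The Stack Trace part of the abend message shown earlier.
ABEND_TEXT = """\
Stack Trace: ------------------------------
[ 0][0x000203cb] restart_address
[ 1][0x0002024d]
[ 2][0x0002016a]
[ 3][0x0002150f]
[ 4][0x00020120]
"""

def trace_addresses(text):
    """Return the stack trace addresses, in frame order, as 8-digit hex."""
    frames = re.findall(r"\[\s*(\d+)\]\[0x([0-9a-fA-F]+)\]", text)
    frames.sort(key=lambda frame: int(frame[0]))
    return [addr.lower().zfill(8) for _, addr in frames]

for addr in trace_addresses(ABEND_TEXT):
    print("input trace address=" + addr)
```

The pattern only matches a bracketed frame number followed directly by a bracketed 0x address, so lines such as 'PROCESSOR [ 0]' and 'VECTOR #[14][PAGE FAULT]' are skipped.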
Show up to 20 calls above tr_addr=000203cb
call 00020190
call 0003a460
call 0003a1f0
call 000395b0
call 00039780
call 00039c20
call 0004b4e0 <__mth_i_dsin>
call 00020320
call 0003a1f0
call 000395b0
call 00039780
call 00039700
call 00039700
call 00039c20
call 0003a460
call 0003a1f0
call 000395b0
call 00039780
call 00039c20
call 0004b4e0 <__mth_i_dsin>
One of the main problems with assembler is that you can't really tell
where you are in the program. We know we are somewhere in the subroutine
mysum, but where? We also know the address of the assembler instruction
that caused the abend. The 20 assembler instructions before the abending
instruction are shown below. The '20 calls above' list shown above gives
the (up to) 20 'call' statements preceding the abend instruction.
This is not a call tree; it is just a display of call instructions
found in the disassembled program whose addresses precede the current
stack trace address. The purpose is to try to give you more context
as to where you might be in your subroutine. In the above example,
we see that a __mth_i_dsin routine was called before the abending instruction.
If we look back at the source code above, we see that there is only
one call to 'sin', so that tells us that we are below that call.
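The 'calls above' list amounts to a simple filter over the disassembly. A rough Python sketch, assuming the disassembly is already available as (address, instruction) pairs sorted by address; the function name and the sample table are for illustration only.

```python
def calls_before(disasm, trace_addr, limit=20):
    """Return up to `limit` call instructions located below trace_addr,
    keeping the ones nearest to trace_addr, in address order."""
    calls = [(addr, text) for addr, text in disasm
             if addr < trace_addr and text.startswith("call ")]
    return calls[-limit:]

# A few instructions copied from the mysum disassembly on this page.
SAMPLE = [
    (0x00020391, "call 00039780"),
    (0x00020396, "call 00039c20"),
    (0x0002039b, "movl 0x82008c,%edx"),
    (0x000203ad, "call 0004b4e0 <__mth_i_dsin>"),
    (0x000203b2, "fsubl 0x820080"),
    (0x000203cb, "movl %eax,0x801920"),
]

for addr, text in calls_before(SAMPLE, 0x000203cb, limit=2):
    print("%08x %s" % (addr, text))
# 00020396 call 00039c20
# 000203ad call 0004b4e0 <__mth_i_dsin>
```

Because the filter works purely on addresses, it lists calls from earlier routines too, which is why the result is context and not a call tree.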
The next part displayed is the 20 instructions preceding the
abending instruction.
tr_addr=000203cb last_big_label=
00020372 <.EN2_291+43> movl $0x0,0xc(%esp,1)
0002037a <.EN2_291+4b> movl $0x1,0x8(%esp,1)
00020382 <.EN2_291+53> movl $0xf,0x4(%esp,1)
0002038a <.EN2_291+5b> movl $0x820092,(%esp,1)
00020391 <.EN2_291+62> call 00039780
00020396 <.EN2_291+67> call 00039c20
0002039b <.EN2_291+6c> movl 0x82008c,%edx
000203a1 <.EN2_291+72> movl 0x820088,%eax
000203a6 <.EN2_291+77> movl %edx,0x4(%esp,1)
000203aa <.EN2_291+7b> movl %eax,(%esp,1)
000203ad <.EN2_291+7e> call 0004b4e0 <__mth_i_dsin>
000203b2 <.EN2_291+83> fsubl 0x820080
000203b8 <.EN2_291+89> fstps (%ebx)
000203ba <.EN2_291+8b> flds (%ebx)
000203bc <.EN2_291+8d> fdivrl 0x820078
000203c2 <.EN2_291+93> movl 0xc(%ebp),%eax
000203c5 <.EN2_291+96> fstps (%eax)
000203c7 <.EN2_291+98> xorl %eax,%eax
000203c9 <.EN2_291+9a> xorl %edx,%edx
000203cb <.EN2_291+9c> movl %eax,0x801920
OK, so there are calls to 'fio_ld*' routines (this is the print statement) and to 'sin'. It looks like something is subtracted, then something is divided, then registers eax and edx are cleared. Lastly, the abending instruction (tr_addr=000203cb) tries to move register eax to some big address. This address is out of our memory bounds. Not bad for the subroutine:
print *,'hello you,1'
x = sin(0.d0) - 1.d0
y = 3.d0/x
z(-20000) = 0.d0
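As a cross-check that the faulting store to 0x801920 really is the z(-20000) assignment, you can work the array arithmetic backwards. The Python sketch below assumes 8-byte real*8 elements stored contiguously starting at z(1); that z actually starts at the implied address is an assumption, not something the abend message states. The data_base and data_len values are from the abend message above.

```python
ELEMENT_BYTES = 8          # real*8
LOWER_BOUND = 1            # default Fortran lower bound for z(100)
FAULT_ADDRESS = 0x00801920
DATA_BASE, DATA_LEN = 0x00820000, 0x0001d000

def byte_offset(index):
    """Offset of z(index) from z(1), in bytes."""
    return (index - LOWER_BOUND) * ELEMENT_BYTES

def in_data(addr):
    """True if addr lies in [data_base, data_base + data_len)."""
    return DATA_BASE <= addr < DATA_BASE + DATA_LEN

offset = byte_offset(-20000)
implied_base = FAULT_ADDRESS - offset   # where z(1) would have to be

print(offset)                   # -160008
print(hex(implied_base))        # 0x828a28
print(in_data(implied_base))    # True
print(in_data(FAULT_ADDRESS))   # False
```

So an array starting inside the data area, indexed at -20000, lands 160008 bytes lower, below data_base, which is consistent with the page fault.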
Now we do the same thing for each address in the stack trace from the file
abend. The next stack trace address is 0002024d. All the hard work has really
been done now, however. The rest of the stack trace info is helpful for
seeing which routine was calling mysum. Typically, a subroutine might get
called from many other subroutines, so the call tree is important.
For each address, we show (up to) 20 calls that were found at
addresses above the current trace address, and the 20 instructions
above the trace_address.
Show up to 20 calls above tr_addr=0002024d
call 00020b50
call 00020190
call 0003a460
call 0003a1f0
call 000395b0
call 00039780
call 00039c20
call 0004b4e0 <__mth_i_dsin>
call 00020320
tr_addr=0002024d last_big_label=
000201db <.EN1_308+40> movl $0x0,0xc(%esp,1)
000201e3 <.EN1_308+48> movl $0x1,0x8(%esp,1)
000201eb <.EN1_308+50> movl $0xf,0x4(%esp,1)
000201f3 <.EN1_308+58> movl $0x820062,(%esp,1)
000201fa <.EN1_308+5f> call 00039780
000201ff <.EN1_308+64> call 00039c20
00020204 <.EN1_308+69> movl 0x82005c,%edx
0002020a <.EN1_308+6f> movl 0x820058,%eax
0002020f <.EN1_308+74> movl %edx,0x4(%esp,1)
00020213 <.EN1_308+78> movl %eax,(%esp,1)
00020216 <.EN1_308+7b> call 0004b4e0 <__mth_i_dsin>
0002021b <.EN1_308+80> fsubl 0x820050
00020221 <.EN1_308+86> fstps 0x828a20
00020227 <.EN1_308+8c> flds 0x828a20
0002022d <.EN1_308+92> fdivrl 0x820048
00020233 <.EN1_308+98> fstps 0x828a24
00020239 <.EN1_308+9e> movl $0x828a24,0x4(%esp,1)
00020241 <.EN1_308+a6> movl $0x828a20,(%esp,1)
00020248 <.EN1_308+ad> call 00020320
0002024d <.EN1_308+b2> movl $0x6,0x8(%esp,1)
Show up to 20 calls above tr_addr=0002016a
call 00020b50
call 00020190
tr_addr=0002016a last_big_label=
00020120 addb %al,(%eax)
00020122 addb %cl,%al
00020124 leal 0x0(%esi),%esi
0002012a leal 0x0(%esi),%esi
00020130 subl $0xc,%esp
00020133 movl %ebp,0x8(%esp,1)
00020137 leal 0x8(%esp,1),%ebp
0002013b movl 0x8(%ebp),%eax
0002013e movl %eax,0x82d5bc
00020143 movl 0xc(%ebp),%eax
00020146 movl %eax,0x82d5b8
0002014b movl 0x82354c,%eax
00020150 testl %eax,%eax
00020152 je 00020165 <.B92>
00020154 xorl %eax,%eax
00020156 movl %eax,0x823550
0002015b movl %eax,0x82354c
00020160 movl %eax,0x823548
00020165 <.B92> call 00020190
0002016a <.B92+5> xorl %eax,%eax
Show up to 20 calls above tr_addr=0002150f
call 000218f0
call 00021b70
call 00024f50
call 00024f50
call 00047c70
call 000472c0
call 0002fbf0
call 00024f50
call 00047c70
call 000472c0
call 0002aef0
call 00024f50
call 000472c0
call 00025eb0
call 000204f0 <_bcast_start_main>
call 00045010
call 00031810
call 0004b510 <_init>
call 00047430
call 00020130
tr_addr=0002150f last_big_label=
000214b3 <.B2119+2f> xorl %edx,%edx
000214b5 <.B2119+31> movw 0x12(%eax),%dx
000214b9 <.B2119+35> cmpl $0x1,%edx
000214bc <.B2119+38> je 000214d2 <.B2143>
000214be <.B2119+3a> movl $0x21640,0x4(%esp,1)
000214c6 <.B2119+42> movl $0x17,(%esp,1)
000214cd <.B2119+49> call 00031810
000214d2 <.B2143> call 0004b510 <_init>
000214d7 <.B2143+5> movl $0x4b520,(%esp,1)
000214de <.B2143+c> call 00047430
000214e3 <.B2143+11> movl 0xfffffd4c(%ebp),%eax
000214e9 <.B2143+17> movl %eax,0xc(%esp,1)
000214ed <.B2143+1b> movl 0xfffffd40(%ebp),%eax
000214f3 <.B2143+21> movl %eax,0x8(%esp,1)
000214f7 <.B2143+25> movl 0xfffffd44(%ebp),%eax
000214fd <.B2143+2b> movl %eax,0x4(%esp,1)
00021501 <.B2143+2f> movl 0xfffffd50(%ebp),%eax
00021507 <.B2143+35> movl %eax,(%esp,1)
0002150a <.B2143+38> call 00020130
0002150f <.B2143+3d> movl %eax,%edi
Show up to 20 calls above tr_addr=00020120
call 00020b50
tr_addr=00020120 last_big_label=
000200f8 nop
000200f9 nop
000200fa nop
000200fb nop
000200fc nop
000200fd nop
000200fe nop
000200ff nop
00020100 nop
00020101 movl 0x8(%esp,1),%eax
00020105 movl %eax,0x837634
0002010a movl $0x0,%eax
0002010f movl %eax,%ebx
00020111 movl %eax,%ecx
00020113 movl %eax,%edx
00020115 movl %eax,%esi
00020117 movl %eax,%edi
00020119 movl %eax,%ebp
0002011b call 00020b50
00020120 addb %al,(%eax)
frame stack summary
Note that for all stack_trace addresses except the
first one, the address points to the NEXT line to be
executed.
stack_trace instr_addr instr_label instruction in_procedure
000203cb, 000203cb <.EN2_291+9c> movl %eax,0x801920,
0002024d, 00020248 <.EN1_308+ad> call 00020320 ,
0002016a, 00020165 <.B92> call 00020190 ,
0002150f, 0002150a <.B2143+38> call 00020130 ,
00020120, 0002011b call 00020b50 ,
You can generate an assembler listing of your program with
cif77 -c -Manno -S -Mkeepasm test.f -o test.s
This will intermix the source code with assembly in test.s
Email if you have any questions. Note: This facility is provided on an "as is" basis.
Eventually the debugger will do this (only much better). This
utility does allow 'post-mortem' analysis, however: you can do it
after the abend, without having to recompile your application with
debug switches, rerun it, and hope that it still abends.
Updated 12/3/2002 by Gerry Quinlan