The genperf routines

All of these routines are designed to make it simple to access the hw counters. They handle the overflow of the counters correctly.
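The overflow handling mentioned above can be pictured with a small sketch. It is not the genperf implementation: the 40-bit counter width, the counter_state struct, and the function names on_overflow and full_count are all assumptions made for illustration. The idea is that a handler counts each wraparound, and the full 64 bit value is rebuilt from that count plus the raw register.

```c
#include <stdint.h>

/* Assumed counter width in bits (hypothetical; check the CPU manual). */
#define COUNTER_BITS 40

/* Hypothetical bookkeeping for one hardware counter. */
typedef struct {
    uint64_t overflows;   /* number of wraparounds seen so far */
} counter_state;

/* Would be called from an overflow handler each time the counter wraps. */
static void on_overflow(counter_state *st)
{
    st->overflows++;
}

/* Combine the raw hardware value with the recorded wraparounds. */
static uint64_t full_count(const counter_state *st, uint64_t raw)
{
    const uint64_t mask = (UINT64_C(1) << COUNTER_BITS) - 1;
    return (st->overflows << COUNTER_BITS) + (raw & mask);
}
```

With no wraparound recorded, full_count returns the raw value unchanged. After one call to on_overflow it returns the raw value plus 2^40.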

The perf* routines are a higher-level way to access the hw counters, layered on top of genperf. They are for the user who just wants to do:

for(each_hw_event_in_PP_EVENTS_DESC) 
{
do_my_sub();
}
print_a_report_on_each_hw_event.


The perf* routines basically do this. The user doesn't have to know anything about hex event ids, event descriptions, etc. The user can do:
perfinit();
for(i=0; i<=PP_EVENTS_MAX; i+=2)
{
perfsel();
perfi();
do_my_sub();
perff();
}
perfrep();
perfinit Sets up the perf structures

perfrst Resets the perf structures

perfsel Select the next pair of events to monitor

perfi Zero and start the hw counters

perff Stop the counters and accumulate the change

perfrep print a report on all the values collected

Another set of routines monitors the flops on all the nodes in an application. They are about as easy to use as possible. They also monitor the flops on the 2nd processor (but you have to be using 'cop' to access the 2nd processor). These are global operations, so every node must call the routines. The interface is simple:

beginflopmon();
do_your_work();
endflopmon();
printflopmon();
The fortran interface:
call beginflopmon
call do_your_work
call endflopmon
call printflopmon
Since the routines don't take any arguments and don't return anything, I'll omit the synopsis section.

The following genperf routines take no arguments and return no value. They print the results to stdout.

Here is an example using mflops in fortran:

      implicit none
      include 'fgenperf.h'
      integer*4 nmax,imax
      real*8 mega
      parameter (imax=1024*124,mega=1024.0*1024.0)
      real*8 a(imax),b(imax),x,y
      integer*4 i,j ,k,l,m
      real*8 dclock,begtime,endtime
      external dclock
      integer*8 iflop
      integer*8 ilng,jlng,ilng0,jlng0
      integer*8 ilng2,jlng2, ilng3,jlng3

      nmax=imax
      x=0.0
      do i=1,nmax
         x=x+1.1
         a(i) = x
         b(i) = 1.0
      end do

      x=1.10
c     initialize y and iflop (both are used below without being set)
      y=0.0
      iflop=0
      call genbeginmflops()
      begtime=dclock()
 330  continue
      do i=1,1000,1
         x = x + a(i)
      end do
      endtime=dclock()
      if(endtime-begtime.lt.5.0)then
         x = 0.0
         iflop=iflop+i
         goto 330
      endif
 350  continue
      call genprintmflops()
      call genendmflops()
      print *,'Mflop/s=',iflop/(endtime-begtime)/1.0d6
      print *,'flop=',i,iflop,x+y
       print *,' '
genbeginmflops Start counting mflops

genendmflops End counting mflops

genprintmflops Print the mflops

genrebegindcachehit Restart the L1 cache hit calculation

genbegindcachehit Start the L1 cache hit estimate

genprintdcachehit Print the L1 cache hit estimate

genbeginl2hit Start the L2 cache hit estimate

genendl2hit End the L2 cache hit estimate

genprintl2hit Print the L2 cache hit estimate

genbeginmemspeed Start the memory speed estimate

genendmemspeed End the memory speed estimate

genprintmemspeed Print the memory speed estimate

genbeginbranchpred Start the branch prediction estimate

genendbranchpred End the branch prediction estimate

genprintbranchpred Print the branch prediction estimate

The following routines are lower level utility routines to use the hw counters.

Here is a fortran example which uses most of the routines:


      implicit none
      include 'fgenperf.h'
      integer*4 nmax,imax
      real*8 mega
      parameter (imax=1024*124,mega=1024.0*1024.0)
      real*8 a(imax),b(imax),x,y
      integer*4 i,j ,k,l,m
      real*8 dclock,begtime,endtime
      external dclock
      integer*8 ilng,jlng,ilng0,jlng0
      integer*8 ilng2,jlng2, ilng3,jlng3

      nmax=imax
      x=0.0
      do i=1,nmax
         x=x+1.1
         a(i) = x
         b(i) = 1.0
      end do
c     loop over PP_EVENTS_MAX by 2.
c     k is index into the PP_EVENTS_LIST table in cgenperf.h
c     Should start at 0.
      do k = 0,PP_EVENTS_MAX,2
       x=0.00
       do i=1,nmax
         x=x+1.00
         a(i) = x
       end do
       do i=1,nmax
         a(i) = b(i)
       end do
c      monitor counter value k and k+1.
c      monitor PP_EVENT_ID[k] and PP_EVENT_ID[k+1]
       call gensetsimple(k)
c      get initial values of counters , arg '0' is currently ignored.
       call gengetperf(0,ilng2,jlng2)
       do i=2,nmax
         a(i) = b(i)*1.1d0
       end do
c      get ending values of counters
       call gengetperf(0,ilng3,jlng3)
c      stop the counters
       call genstopperf(0)
c      print a description of event k
       call genprintperfbyind(k)
c      print the change in counter 0
       write(6,108)ilng3-ilng2
       call genprintperfbyind(k+1)
c      print the change in counter 1
       write(6,108)jlng3-jlng2
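      enddo
c     format for the 64 bit counter differences (assumed, not in the
c     original example)
 108  format(1x,i20)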
gengetevent returns the i_th element from the PP_EVENTS_LIST

genstopperf stop the perfmon counters

genstartperf starts the perfmon counter on cpu iproc

gensetperf selects which counters to monitor and more

gengetdescbyhex returns a pointer to a description of event

gengethexbydesc returns the event id given the event description

gengetperf gets the two 64 bit counter values

genlprintperfbyhex prints a long description of 'event id' to stdout

genprintperfbyhex prints a short description of 'event id' to stdout

genprintperfbyind prints the i_th description of PP_EVENTS_DESC

gensetsimple starts the hw counters on events k and k+1 of PP_EVENTS_LIST

gengetevent

Description

gengetevent returns the i'th event in the PP_EVENTS_LIST if the argument i is in the right range. This helps users loop over all the events without having to know each event value.

Parameters

For gengetevent_: const int *which_event, ptr to an offset into PP_EVENTS_LIST
For gengetevent: const int which_event, the offset into PP_EVENTS_LIST

Returns

PP_EVENTS_LIST[which_event] if which_event is valid. Otherwise returns PP_EVENTS_LIST[0].

Synopsis

int gengetevent_(const int *which_event)
int gengetevent(const int which_event)
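The range check and the two calling conventions can be sketched as follows. This is an illustration only: the table EVENTS_LIST and its values are made up, and get_event_sketch is a hypothetical name, not the library routine. The trailing-underscore variant takes a pointer because f77 passes arguments by reference.

```c
#include <stddef.h>

/* Hypothetical stand-in for PP_EVENTS_LIST; the values are made up. */
static const int EVENTS_LIST[] = { 0x10, 0x11, 0x12, 0x13 };
#define EVENTS_COUNT ((int)(sizeof EVENTS_LIST / sizeof EVENTS_LIST[0]))

/* C-style entry point: the index is passed by value. */
static int get_event_sketch(int which_event)
{
    /* An out-of-range index falls back to element 0, as described above. */
    if (which_event < 0 || which_event >= EVENTS_COUNT)
        return EVENTS_LIST[0];
    return EVENTS_LIST[which_event];
}

/* Fortran-style entry point: the index is passed by reference. */
static int get_event_sketch_(const int *which_event)
{
    return get_event_sketch(*which_event);
}
```

A caller can therefore loop over indices without knowing any event ids, and a bad index yields the first event rather than an out-of-bounds read.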

genstopperf

Description

reset the overflow handler to the default and stop the perfmon counters

Parameters

For the f77 genstopperf_: an int pointer
For the c genstopperf: an int

Returns

N/A

Synopsis

void genstopperf(const int iproc)
void genstopperf_(const int *iproc)

genstartperf

Description

genstartperf sets the handler, initializes the counters, initializes the overflow buckets, and starts the counters.

Parameters

integer processor # for genstartperf, pointer to int for genstartperf_

Returns

N/A

Synopsis

void genstartperf(const int iproc)
void genstartperf_(const int *iproc)

gensetperf

Description

gensetperf hides the perfmon dirty work. gensetperf will start the perfmon counters on cpu iproc, monitoring events ievent0 and ievent1, with the masks specified by the user (cmask0 and cmask1). The routine checks that ievent0 is valid on register 0 and ievent1 is valid on register 1.

Parameters

iproc, the processor to start the hw counters on
ievent0 and ievent1, the event ids to monitor. See PP_EVENTS_LIST for the complete list
cmask0, the mask for event0
cmask1, the mask for event1
An appropriate mask will be supplied if the user supplies a mask <= 0.

Returns

N/A

Synopsis

void gensetperf_(const int *iproc, const int *ievent0, const int *ievent1, const int *cmask0, const int *cmask1)
void gensetperf(const int iproc, const int ievent0, const int ievent1, const int cmask0, const int cmask1)

gengetdescbyhex

Description

gengetdescbyhex (gen_get_desc_by_hex) returns the address of a string (from PP_EVENTS_DESC) describing event id ihex (where ihex is an event in the list PP_EVENTS_LIST).

Parameters

gengetdescbyhex_ takes an int pointer to the event id
gengetdescbyhex takes the event id as an int

Returns

the address of the string describing event ihex if ihex is one of the events in the list PP_EVENTS_LIST. Otherwise, returns a pointer to PP_EVENTS_DESC[0].

Synopsis

char * gengetdescbyhex_(const int *event_id)
char * gengetdescbyhex(const int event_id)

gengethexbydesc

Description

returns the event id from PP_EVENTS_LIST corresponding to the event's string descriptor in PP_EVENTS_DESC.

Parameters

a pointer to a char string (the event's descriptor)

Returns

the event id from PP_EVENTS_LIST if a match is found on the input string. otherwise returns 0.

Synopsis

int gengethexbydesc_(const char *str)
int gengethexbydesc(const char *str)
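The two lookups (id to description and description to id) amount to searches over parallel tables. The sketch below assumes that layout; the tables EV_ID and EV_DESC, their contents, and the function names are hypothetical stand-ins for PP_EVENTS_LIST and PP_EVENTS_DESC. It keeps the fallbacks described above: an unknown id yields the first description, and an unknown description yields 0.

```c
#include <string.h>

/* Hypothetical parallel tables; ids and names are made up. */
static const int   EV_ID[]   = { 0x10, 0x43, 0xC0 };
static const char *EV_DESC[] = { "event_a", "event_b", "event_c" };
#define EV_COUNT 3

/* Description for an event id, or the first description if the id is unknown. */
static const char *desc_by_hex(int id)
{
    int i;
    for (i = 0; i < EV_COUNT; i++)
        if (EV_ID[i] == id)
            return EV_DESC[i];
    return EV_DESC[0];
}

/* Event id for an exact description match, or 0 if there is no match. */
static int hex_by_desc(const char *str)
{
    int i;
    for (i = 0; i < EV_COUNT; i++)
        if (strcmp(EV_DESC[i], str) == 0)
            return EV_ID[i];
    return 0;
}
```

Note the asymmetry in the failure cases: a caller of the description lookup cannot tell a fallback from a real hit on element 0, while a caller of the id lookup can test for 0.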

gengetperf

Description

Reads the perfmon counters off the iproc cpu using getppcounter and puts the values into the two long long arguments. After the counters are read, the overflow is added in. This routine should be fast if iproc==cpu_number.

Parameters

const int *iproc, the cpu number from which we want the values. This is used by getppcounter (if that rtn works).
long long *cntr_event0, *cntr_event1, pointers to the long long ints that hold the counter values.

Returns

N/A

Synopsis

void gengetperf_(const int *iproc, long long *cntr_event0, long long *cntr_event1)
void gengetperf(const int iproc, long long *cntr_event0, long long *cntr_event1)

genlprintperfbyhex

Description

This is like genprintperfbyhex except that it prints the long description of an event (from PP_EVENTS_LDESC) given a valid input event id (valid meaning that ihex appears in PP_EVENTS_LIST). Prints to stdout. Breaks the description into lines of 60 characters max and tries to break on a ' '.

Parameters

For genlprintperfbyhex: const int ihex, the event id
For genlprintperfbyhex_: const int *ihex, ptr to the event id

Returns

N/A

Synopsis

void genlprintperfbyhex_(const int *ihex)
void genlprintperfbyhex(const int ihex)
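The 60 character line breaking can be sketched as below. This is one possible way to do it, not the library's code: wrap_text is a hypothetical helper that writes into a caller-supplied buffer instead of stdout, so the result can be inspected. It breaks at the last space that fits and falls back to a mid-word break when a line has no space.

```c
#include <stddef.h>
#include <string.h>

/* Copy text into out, inserting '\n' so that no line exceeds width chars.
   Returns the number of lines written, or 0 if width is 0 or out is too small. */
static int wrap_text(const char *text, size_t width, char *out, size_t outsz)
{
    size_t len = strlen(text), pos = 0, o = 0;
    int lines = 0;

    if (width == 0)
        return 0;
    while (pos < len) {
        size_t take = len - pos;   /* chars to emit on this line */
        size_t skip = 0;           /* 1 if a space is consumed at the break */
        if (take > width) {
            take = width;
            if (text[pos + take] == ' ') {
                skip = 1;          /* the window ends exactly at a space */
            } else {
                size_t k = take;
                while (k > 0 && text[pos + k - 1] != ' ')
                    k--;
                if (k > 0) {       /* last space in the window is at k-1 */
                    take = k - 1;
                    skip = 1;
                }
            }
        }
        if (o + take + 2 > outsz)  /* room for the line, '\n' and '\0' */
            return 0;
        memcpy(out + o, text + pos, take);
        o += take;
        out[o++] = '\n';
        pos += take + skip;
        lines++;
    }
    if (o >= outsz)
        return 0;
    out[o] = '\0';
    return lines;
}
```

Calling wrap_text(desc, 60, buf, sizeof buf) and printing buf would give output in the spirit of the description above.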

genprintperfbyhex

Description

prints the short description (from PP_EVENTS_DESC) corresponding to the input event id. Prints to stdout. Note that no '\n' is tacked on when printing. The app has to add that.

Parameters

For genprintperfbyhex: const int ihex, the event id
For genprintperfbyhex_: const int *ihex, ptr to the event id

Returns

N/A

Synopsis

void genprintperfbyhex_(const int *ihex)
void genprintperfbyhex(const int ihex)

genprintperfbyind

Description

prints the i'th description in PP_EVENTS_DESC, where i is the input argument. Prints to stdout.

Parameters

For genprintperfbyind: const int iindex, the index into PP_EVENTS_DESC
For genprintperfbyind_: const int *iindex, ptr to the index into PP_EVENTS_DESC

Returns

N/A

Synopsis

void genprintperfbyind_(const int *iindex)
void genprintperfbyind(const int iindex)

perfinit

Description

mallocs/initializes an array to hold all PP_EVENTS_MAX counter values

Parameters

none

Returns

N/A

Synopsis

void perfinit_(void)
void perfinit(void)

perfrst

Description

reset curevt to 0. curevt is the current event we are monitoring

Parameters

none

Returns

N/A

Synopsis

void perfrst_(void)
void perfrst(void)

perfsel

Description

start monitoring hw event = PP_EVENTS_LIST[curevt] on pmc reg0 and hw event = PP_EVENTS_LIST[curevt+1] on pmc reg1.

Parameters

none

Returns

N/A

Synopsis

void perfsel_(void)
void perfsel(void)

perfi

Description

get the current values of hw counters and put reg0 in perfptr[curevt] and reg1 in perfptr[curevt+1]

Parameters

none

Returns

N/A

Synopsis

void perfi_(void)
void perfi(void)

perff

Description

gets the current hw counter values and puts the change (from last call to perfi()) into perfptr.
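The perfi/perff pairing is a snapshot-then-difference pattern. The sketch below illustrates it with simulated counters; sim_reg0, sim_reg1, perf_vals, cur_evt, and the function names are hypothetical stand-ins (perf_vals plays the role of perfptr and cur_evt the role of curevt), and the real routines read the hardware through genperf instead.

```c
#define N_EVENTS 4                       /* hypothetical event count */

static long long sim_reg0, sim_reg1;     /* simulated hardware counters */
static long long perf_vals[N_EVENTS];    /* plays the role of perfptr */
static int cur_evt = 0;                  /* plays the role of curevt */

/* Stand-in for reading the two counters. */
static void read_counters(long long *c0, long long *c1)
{
    *c0 = sim_reg0;
    *c1 = sim_reg1;
}

/* Like perfi: store the starting values for the current event pair. */
static void start_interval(void)
{
    read_counters(&perf_vals[cur_evt], &perf_vals[cur_evt + 1]);
}

/* Like perff: replace the stored starting values with the change. */
static void finish_interval(void)
{
    long long c0, c1;
    read_counters(&c0, &c1);
    perf_vals[cur_evt]     = c0 - perf_vals[cur_evt];
    perf_vals[cur_evt + 1] = c1 - perf_vals[cur_evt + 1];
}
```

Between the two calls the monitored code runs. Afterwards perf_vals holds the counts attributable to that code for both events of the pair.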

Parameters

none

Returns

N/A

Synopsis

void perff_(void)
void perff(void)

perfrep

Description

prints a report based on all the perfptr values. At this point perfptr is full (it holds data for each hw event), so all sorts of things can be calculated: mflops, cache misses, uops retired per cycle.

Parameters

none

Returns

N/A

Synopsis

void perfrep_(void)
void perfrep(void)

gensetsimple

Description

Continuing the theme of making it simple, gensetsimple(int k) starts the hw counters:
reg0 gets PP_EVENTS_LIST[k]
reg1 gets PP_EVENTS_LIST[k+1]
This is for a loop like:
long long ll_arr[PP_EVENTS_MAX+2],lla,llb;
for(i=0; i<= PP_EVENTS_MAX; i+=2) 
{
gensetsimple(i);
gengetperf(0, &ll_arr[i], &ll_arr[i+1]);
do_my_sub();
gengetperf(0, &lla, &llb);
ll_arr[i] = lla - ll_arr[i];
ll_arr[i+1] = llb - ll_arr[i+1];
}
The perf* routines hide all of this counter accumulation.

Parameters

For gensetsimple: int index, the index into PP_EVENTS_LIST
For gensetsimple_: int *index, ptr to the index into PP_EVENTS_LIST

Returns

N/A

Synopsis

void gensetsimple_(const int *k)
void gensetsimple(const int k)

genbeginmflops

genendmflops

genprintmflops

Description

This is an extension of the perfmon begin/end/print mflops routines. The main difference is that the gen* routines go through gensetperf (which sets an overflow handler), so the counters are correct for long runs. Each of the routines called either exits on an error or cannot detect one.

Parameters

None

Returns

None

Synopsis

void genbeginmflops_(void)
void genbeginmflops(void)
void genprintmflops_(void)
void genprintmflops(void)
void genendmflops_(void)
void genendmflops(void)

genrebegindcachehit

Description

Reset the starting values for the dcache hit events. This is useful since the routines in genbegindcachehit pretty much wipe out the L1. So we give a routine to re-initialize the starting value.

Parameters

None

Returns

N/A

Synopsis

void genrebegindcachehit_(void)
void genrebegindcachehit(void)

genbegindcachehit

Description

Starts monitoring events to estimate dcache hit rate.

Parameters

None

Returns

N/A

Synopsis

void genbegindcachehit_(void)
void genbegindcachehit(void)

genprintdcachehit

Description

prints the current estimate of dcache hit rate

Parameters

None

Returns

N/A

Synopsis

void genprintdcachehit_(void)
void genprintdcachehit(void)

genenddcachehit

Synopsis

void genenddcachehit_(void)
void genenddcachehit(void)

genbeginl2hit

genendl2hit

genprintl2hit

Description

Estimates the L2 hit ratio, based on PP_DCU_LINES_IN (lines brought into L1 from L2) and PP_L2_LINES_IN (lines brought into L2 from memory). This should reflect the L2 hit ratio fairly accurately.
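One way to turn those two counts into a ratio is sketched below. The exact formula used by the library is not given here, so this is an assumption: every line brought into L1 is treated as an L2 access, and every line brought into L2 from memory as an L2 miss. The function name l2_hit_ratio is hypothetical.

```c
/* Assumed estimate: hits = L1 fills that did not require an L2 fill. */
static double l2_hit_ratio(long long dcu_lines_in, long long l2_lines_in)
{
    if (dcu_lines_in <= 0)
        return 0.0;   /* no L1 fills: the ratio is undefined, report 0 */
    return 1.0 - (double)l2_lines_in / (double)dcu_lines_in;
}
```

For example, 1000 lines brought into L1 with 250 lines brought into L2 gives an estimated hit ratio of 0.75.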

Parameters

None

Returns

N/A

Synopsis

void genbeginl2hit(void)
void genbeginl2hit_(void)
void genrebeginl2hit(void)
void genrebeginl2hit_(void)
void genprintl2hit(void)
void genprintl2hit_(void)
void genendl2hit(void)
void genendl2hit_(void)

genbeginmemspeed

genendmemspeed

genprintmemspeed

Description

Estimates how many MB/sec we get off the bus. Uses PP_BUS_TRAN_MEM, which is the # of cachelines transferred to/from the bus, so 32 times this is the # of bytes transferred. Gets the starting and ending dclock() time for one measure of MB/sec. Also gets the starting and ending PP_RESOURCE_STALLS, which is the clock cycles the cpu is stalled waiting on something. Usually it is waiting on memory, so PP_RESOURCE_STALLS can serve as a sort of minimum time that the cpu is transferring to/from memory.
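The arithmetic behind both measures can be sketched as follows. The function names, the use of 10^6 bytes per MB, and the clock_hz parameter are assumptions for illustration; only the 32 bytes per cacheline comes from the description above.

```c
#define CACHE_LINE_BYTES 32   /* bytes per transferred cacheline, per the text */

/* MB/sec from the cacheline count and elapsed wall-clock seconds.
   Assumes 1 MB = 10^6 bytes. */
static double mb_per_sec(long long bus_tran_mem, double seconds)
{
    if (seconds <= 0.0)
        return 0.0;
    return (double)bus_tran_mem * CACHE_LINE_BYTES / seconds / 1.0e6;
}

/* Convert stalled clock cycles to seconds for a given clock rate.
   Dividing the bytes moved by this time gives the second estimate. */
static double stall_seconds(long long resource_stalls, double clock_hz)
{
    if (clock_hz <= 0.0)
        return 0.0;
    return (double)resource_stalls / clock_hz;
}
```

As a worked case, 1,000,000 cachelines in 2 seconds is 32,000,000 bytes, or 16 MB/sec. Since the stall time is at most the elapsed time, the stall-based figure is the higher of the two.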

Parameters

None

Returns

N/A

Synopsis

void genbeginmemspeed(void)
void genbeginmemspeed_(void)
void genprintmemspeed(void)
void genprintmemspeed_(void)
void genendmemspeed(void)
void genendmemspeed_(void)

genbeginbranchpred

genendbranchpred

genprintbranchpred

Description

Estimates the % of branches that are predicted correctly. Uses PP_BR_MISS_PRED_TAKEN_RET (total mispredicted branches taken), which is branches taken incorrectly, and compares it to PP_BR_INST_DECODED, which is the total # of branches. This is based on the PP optimization guide, but I don't have the exact reference.
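The percentage can be computed from the two counts as sketched below. This is an assumed form of the calculation, not taken from the library source, and branch_pred_pct is a hypothetical name.

```c
/* Percent of branches predicted correctly, given the mispredicted count
   and the total branch count over the same interval. */
static double branch_pred_pct(long long mispredicted, long long total_branches)
{
    if (total_branches <= 0)
        return 0.0;   /* no branches seen: nothing to report */
    return 100.0 * (1.0 - (double)mispredicted / (double)total_branches);
}
```

For example, 250 mispredicted branches out of 1000 gives 75 percent predicted correctly.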

Parameters

None

Returns

N/A

Synopsis

void genbeginbranchpred(void)
void genbeginbranchpred_(void)
void genprintbranchpred(void)
void genprintbranchpred_(void)
void genendbranchpred(void)
void genendbranchpred_(void)