Below is a list of the hardware counters available on the ASCI Red Intel CPUs. The complete, authoritative list is in the Intel manuals.

Unit

Event Number

Mnemonic Event Name

Unit Mask

Description

Comments

Data Cache Unit (DCU)

43H

PP_DATA_MEM_REFS

00H

All loads from any memory type. All stores to any memory type. Each part of a split is counted separately.

Note: 80 bit floating point accesses are double counted, since they are decomposed into a 16 bit exponent load and a 64 bit mantissa load.
Memory accesses are only counted when they are actually performed. E.g. a load that gets squashed because a previous cache miss is outstanding to the same address, and which finally gets performed, is only counted once.
Does not include I/O accesses, or other non-memory accesses.

 

 

45H

PP_DCU_LINES_IN

00H

Total number of lines that have been allocated in the DCU.

 

 

46H

PP_DCU_M_LINES_IN

00H

Number of Modified state lines that have been allocated in the DCU.

 

 

47H

PP_DCU_M_LINES_OUT

00H

Number of Modified state lines that have been evicted from the DCU. This includes evictions as a result of external snoops, internal intervention or the natural replacement algorithm.

 

 

48H

PP_DCU_MISS_OUTSTANDING

00H

Weighted number of cycles while a DCU miss is outstanding. Incremented by the number of outstanding cache misses at any particular time. Only cacheable read requests are considered; uncacheable requests are excluded. Read-for-ownerships are counted as well as line fills, invalidates and stores.

An access that also misses the L2 is short-changed by 2 cycles (i.e. if the count is N cycles, it should be N+2 cycles). Subsequent loads to the same cache line will not result in any additional counts. The count value is not precise, but still useful.

Instruction Fetch Unit (IFU)

80H

PP_IFU_IFETCH

00H

Number of instruction fetches, both cacheable and non-cacheable, including UC fetches.

Will be incremented by 1 for each cacheable line fetched and by 1 for each uncached instruction fetched.

 

81H

PP_IFU_IFETCH_MISS

00H

Number of instruction fetch misses: all instruction fetches that do not hit the IFU, i.e. that produce memory requests. Includes UC accesses.

 

 

85H

PP_ITLB_MISS

00H

Number of ITLB misses.

 

 

86H

PP_IFU_MEM_STALL

00H

Number of cycles instruction fetch is stalled, for any reason. Includes IFU cache misses, ITLB misses, ITLB faults and other minor stalls.

 

 

87H

PP_ILD_STALL

00H

Number of cycles that the instruction length decoder stage of the processor's pipeline is stalled.

 

L2 Cache

28H

PP_L2_IFETCH

MESI 0FH

Number of L2 instruction fetches. This event indicates that a normal instruction fetch was received by the L2. The count includes only L2 cacheable instruction fetches; it does not include UC instruction fetches. It does not include ITLB miss accesses.

 

 

29H

PP_L2_LD

MESI 0FH

Number of L2 data loads. This event indicates that a normal, unlocked, load memory access was received by the L2. It includes only L2 cacheable memory accesses; it does not include I/O accesses, other non-memory accesses, or memory accesses such as UC/WT memory accesses. It does include L2 cacheable TLB miss memory accesses.

 

 

2AH

PP_L2_ST

MESI 0FH

Number of L2 data stores. This event indicates that a normal, unlocked, store memory access was received by the L2. Specifically, it indicates that the DCU sent a read-for-ownership request to the L2. It also includes Invalid to Modified requests sent by the DCU to the L2. It includes only L2 cacheable store memory accesses; it does not include I/O accesses, other non-memory accesses, or memory accesses like UC/WT stores. It includes TLB miss memory accesses.

 

 

24H

PP_L2_LINES_IN

00H

Number of lines allocated in the L2.

 

 

26H

PP_L2_LINES_OUT

00H

Number of lines removed from the L2 for any reason.

 

 

25H

PP_L2_M_LINES_INM

00H

Number of Modified state lines allocated in the L2.

 

 

27H

PP_L2_M_LINES_OUTM

00H

Number of Modified state lines removed from the L2 for any reason.

 

 

2EH

PP_L2_RQSTS

MESI 0FH

Total number of all L2 requests.

 

 

21H

PP_L2_ADS

00H

Number of L2 address strobes.

 

 

22H

PP_L2_DBUS_BUSY

00H

Number of cycles during which the L2 cache data bus was busy.

 

 

23H

PP_L2_DBUS_BUSY_RD

00H

Number of cycles during which the data bus was busy transferring read data from L2 to the processor.

 

External Bus Logic (EBL) (see Note 2)

62H

PP_BUS_DRDY_CLOCKS

00H (Self) 20H (Any)

Number of clocks during which DRDY# is asserted. Essentially, utilization of the external system data bus.

Unit Mask = 00H counts bus clocks when the processor is driving DRDY. Unit Mask = 20H counts in processor clocks when any agent is driving DRDY.

 

63H

PP_BUS_LOCK_CLOCKS

00H (Self) 20H (Any)

Number of clocks during which LOCK# is asserted on the external system bus.

Always counts in processor clocks.

 

60H

PP_BUS_REQ_OUTSTANDING

00H (Self)

Number of bus requests outstanding. This counter is incremented by the number of cacheable read bus requests outstanding in any given cycle.

Counts only DCU full-line cacheable reads, not RFOs, writes, instruction fetches, or anything else. Counts "waiting for bus to complete" (last data chunk received).

 

65H

PP_BUS_TRAN_BRD

00H (Self) 20H (Any)

Number of bus burst read transactions.

 

 

66H

PP_BUS_TRAN_RFO

00H (Self) 20H (Any)

Number of completed bus read for ownership transactions.

 

 

67H

PP_BUS_TRANS_WB

00H (Self) 20H (Any)

Number of completed bus write back transactions.

 

 

68H

PP_BUS_TRAN_IFETCH

00H (Self) 20H (Any)

Number of completed bus instruction fetch transactions.

 

 

69H

PP_BUS_TRAN_INVAL

00H (Self) 20H (Any)

Number of completed bus invalidate transactions.

 

 

6AH

PP_BUS_TRAN_PWR

00H (Self) 20H (Any)

Number of completed bus partial write transactions.

 

 

6BH

PP_BUS_TRANS_P

00H (Self) 20H (Any)

Number of completed bus partial transactions.

 

 

6CH

PP_BUS_TRANS_IO

00H (Self) 20H (Any)

Number of completed bus I/O transactions.

 

 

6DH

PP_BUS_TRAN_DEF

00H (Self) 20H (Any)

Number of completed bus deferred transactions.

 

 

6EH

PP_BUS_TRAN_BURST

00H (Self) 20H (Any)

Number of completed bus burst transactions.

 

 

70H

PP_BUS_TRAN_ANY

00H (Self) 20H (Any)

Number of all completed bus transactions. Address bus utilization can be calculated knowing the minimum address bus occupancy. Includes special cycles etc.

 

 

6FH

PP_BUS_TRAN_MEM

00H (Self) 20H (Any)

Number of completed memory transactions.

 

 

64H

PP_BUS_DATA_RCV

00H (Self)

Number of bus clock cycles during which this processor is receiving data.

 

 

61H

PP_BUS_BNR_DRV

00H (Self)

Number of bus clock cycles during which this processor is driving the BNR pin.

 

 

7AH

PP_BUS_HIT_DRV

00H (Self)

Number of bus clock cycles during which this processor is driving the HIT pin.

Includes cycles due to snoop stalls.

 

7BH

PP_BUS_HITM_DRV

00H (Self)

Number of bus clock cycles during which this processor is driving the HITM pin.

Includes cycles due to snoop stalls.

 

7EH

PP_BUS_SNOOP_STALL

00H (Self)

Number of clock cycles during which the bus is snoop stalled.

 

Floating Point Unit

C1H

PP_FLOPS

00H

Number of computational floating-point operations retired. Excludes floating point computational operations that cause traps or assists. Includes floating point computational operations executed by the assist handler.
Includes internal sub-operations of complex floating point instructions like transcendentals. Excludes floating point loads and stores.

Counter 0 only

 

10H

PP_FP_COMP_OPS_EXE

00H

Number of computational floating-point operations executed: the number of FADDs, FSUBs, FCOMs, FMULs, integer MULs and IMULs, FDIVs, FPREMs, FSQRTs, integer DIVs and IDIVs. Note that this is the number of operations, not the number of cycles. This event does not distinguish an FADD used in the middle of a transcendental flow from a separate FADD instruction.

Counter 0 only

 

11H

PP_FP_ASSIST

00H

Number of floating-point exception cases handled by microcode.

Counter 1 only. This event includes counts due to speculative execution.

 

12H

PP_MUL

00H

Number of multiplies. Note: includes integer and FP multiplies.

Counter 1 only. This event includes counts due to speculative execution.

 

13H

PP_DIV

00H

Number of divides. Note: includes integer and FP divides.

Counter 1 only. This event includes counts due to speculative execution.

 

14H

PP_CYCLES_DIV_BUSY

00H

Number of cycles that the divider is busy, and cannot accept new divides. Note: includes integer and FP divides, FPREM, FPSQRT, etc.

Counter 0 only. This event includes counts due to speculative execution.

Memory Ordering

03H

PP_LD_BLOCKS

00H

Number of store buffer blocks. Includes counts caused by preceding stores whose addresses are unknown, preceding stores whose addresses are known to conflict but whose data is unknown, and preceding stores that conflict with the load but incompletely overlap it.

 

 

04H

PP_SB_DRAINS

00H

Number of store buffer drain cycles. Incremented during every cycle the store buffer is draining. Draining is caused by serializing operations like CPUID, synchronizing operations like XCHG, interrupt acknowledgment, and other conditions such as cache flushing.

 

 

05H

PP_MISALIGN_MEM_REF

00H

Number of misaligned data memory references. Incremented by 1 every cycle during which either the PPro load or store pipeline dispatches a misaligned uop. Counting is performed if it is the first half or second half, or if it is blocked, squashed or misses.
Note: in this context, misaligned means crossing a 64-bit boundary.

MISALIGN_MEM_REF is only an approximation of the true number of misaligned memory references. The value returned is roughly proportional to the number of misaligned memory accesses, i.e. the size of the problem.

Instruction Decoding and Retirement

C0H

PP_INST_RETIRED

00H

Total number of instructions retired.

 

 

C2H

PP_UOPS_RETIRED

00H

Total number of UOPs retired.

 

 

D0H

PP_INST_DECODER

00H

Total number of instructions decoded.

 

Interrupts

C8H

PP_HW_INT_RX

00H

Total number of hardware interrupts received.

 

 

C6H

PP_CYCLES_INT_MASKED

00H

Total number of processor cycles for which interrupts are disabled.

 

 

C7H

PP_CYCLES_INT_PENDING_AND_MASKD

00H

Total number of processor cycles for which interrupts are disabled and interrupts are pending.

 

Branches

C4H

PP_BR_INST_RETIRED

00H

Total number of branch instructions retired.

 

 

C5H

PP_BR_MISS_PRED_RETIRED

00H

Total number of branch mispredictions that get to the point of retirement. Includes not taken conditional branches.

 

 

C9H

PP_BR_TAKEN_RETIRED

00H

Total number of taken branches retired.

 

 

CAH

PP_BR_MISS_PRED_TAKEN_RET

00H

Total number of taken but mispredicted branches that get to the point of retirement. Includes conditional branches only when taken.

 

 

E0H

PP_BR_INST_DECODED

00H

Total number of branch instructions decoded.

 

 

E2H

PP_BTB_MISSES

00H

Total number of branches for which the BTB did not produce a prediction.

 

 

E4H

PP_BR_BOGUS

00H

Total number of branch predictions that are generated but are not actually branches.

 

 

E6H

PP_BACLEARS

00H

Total number of times BACLEAR is asserted. This is the number of times that a static branch prediction was made, where the branch decoder decided to make a branch prediction because the BTB did not.

 

Stalls

A2H

PP_RESOURCE_STALLS

00H

Incremented by one during every cycle that there is a resource related stall. Includes register renaming buffer entries (ROB entries) and memory buffer entries (LB and SB entries). Does not include stalls due to bus queue full, too many cache misses, etc. In addition to resource related stalls, this event counts some other events.
Would have liked "pure" event counters for each, but we were denied this to save hardware.
Includes stalls arising during branch misprediction recovery, e.g. if retirement of the mispredicted branch is delayed (ROstall), and stalls arising while the store buffer is draining from synchronizing operations (MOSBdrain).

 

 

D2H

PP_PARTIAL_RAT_STALLS

00H

Number of cycles or events for partial stalls. Note: includes flag partial stalls.

 

Segment Register Loads

06H

PP_SEGMENT_REG_LOADS

00H

Number of segment register loads.

 

Clocks

79H

PP_CPU_CLK_UNHALTED

00H

Number of cycles during which the processor is not halted.
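
Raw counts from the table are often easier to interpret as ratios, for example by normalizing against PP_CPU_CLK_UNHALTED. The following is a minimal sketch in Python, assuming the counts have already been collected into a dictionary keyed by mnemonic. The `ratio` and `derived_metrics` helpers, the metric names and the sample values are hypothetical; they are not part of any counter library and the numbers are not measurements.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 if the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def derived_metrics(counts):
    """Combine raw event counts (keyed by mnemonic) into a few ratios."""
    cycles = counts["PP_CPU_CLK_UNHALTED"]
    instructions = counts["PP_INST_RETIRED"]
    return {
        # Instructions retired per unhalted processor cycle.
        "instructions_per_cycle": ratio(instructions, cycles),
        # Average number of UOPs each retired instruction decomposes into.
        "uops_per_instruction": ratio(counts["PP_UOPS_RETIRED"], instructions),
        # Fraction of retired branches that were mispredicted.
        "branch_mispredict_ratio": ratio(
            counts["PP_BR_MISS_PRED_RETIRED"], counts["PP_BR_INST_RETIRED"]
        ),
        # PP_DCU_MISS_OUTSTANDING is weighted by the number of outstanding
        # misses, so dividing by cycles gives the average number of DCU
        # misses outstanding per cycle.
        "avg_outstanding_dcu_misses": ratio(
            counts["PP_DCU_MISS_OUTSTANDING"], cycles
        ),
    }


# Made-up sample values, for illustration only.
sample_counts = {
    "PP_CPU_CLK_UNHALTED": 1_000_000,
    "PP_INST_RETIRED": 500_000,
    "PP_UOPS_RETIRED": 750_000,
    "PP_BR_INST_RETIRED": 100_000,
    "PP_BR_MISS_PRED_RETIRED": 5_000,
    "PP_DCU_MISS_OUTSTANDING": 250_000,
}

if __name__ == "__main__":
    for name, value in derived_metrics(sample_counts).items():
        print(f"{name}: {value:.3f}")
```

Because several events are restricted to a single counter ("Counter 0 only" or "Counter 1 only"), the counts feeding such ratios may have to come from separate runs of the same code.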

 

Notes

1. Several L2 cache events, where noted, can be further qualified using the Unit Mask (UMSK) field in the PerfEvtSel0 and PerfEvtSel1 registers. The lower 4 bits of the Unit Mask field are used in conjunction with L2 events to indicate the cache state or cache states involved. The Pentium Pro processor identifies cache states using the "MESI" protocol and consequently each bit in the Unit Mask field represents one of the four states: UMSK[3] = M (8H) state, UMSK[2] = E (4H) state, UMSK[1] = S (2H) state, and UMSK[0] = I (1H) state. UMSK[3:0] = "MESI" (FH) should be used to collect data for all states; UMSK = 0H, for the applicable events, will result in nothing being counted.

2. All of the external bus logic (EBL) events, except where noted, can be further qualified using the Unit Mask (UMSK) field in the PerfEvtSel0 and PerfEvtSel1 registers. Bit 5 of the UMSK field is used in conjunction with the EBL events to indicate whether the processor should count transactions that are self generated (UMSK[5] = 0) or transactions that result from any processor on the bus (UMSK[5] = 1).
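
The unit mask usage described in the two notes above can be sketched as follows. This is an illustration only: the `perf_evt_sel` helper and the constant names are hypothetical, and the field positions (event select in bits 7:0, unit mask in bits 15:8, USR in bit 16, OS in bit 17, EN in bit 22) are not given in this document; they are an assumption based on the PerfEvtSel layout in the Intel manuals and should be checked there. The code only computes the register value and does not write any register.

```python
# MESI unit mask bits for the L2 events (Note 1).
MESI_M = 0x08  # UMSK[3]: Modified
MESI_E = 0x04  # UMSK[2]: Exclusive
MESI_S = 0x02  # UMSK[1]: Shared
MESI_I = 0x01  # UMSK[0]: Invalid
MESI_ALL = MESI_M | MESI_E | MESI_S | MESI_I  # 0FH, all states

# Self/Any unit mask bit for the EBL events (Note 2).
EBL_SELF = 0x00  # UMSK[5] = 0: transactions generated by this processor
EBL_ANY = 0x20   # UMSK[5] = 1: transactions from any processor on the bus

# Assumed PerfEvtSel field positions (see the Intel manuals).
UNIT_MASK_SHIFT = 8
USR_BIT = 1 << 16     # count while in user mode
OS_BIT = 1 << 17      # count while in kernel mode
ENABLE_BIT = 1 << 22  # enable counting


def perf_evt_sel(event, unit_mask=0x00, user=True, kernel=False, enable=True):
    """Build a PerfEvtSel value from an event number and a unit mask."""
    if not 0 <= event <= 0xFF:
        raise ValueError("event number must fit in 8 bits")
    if not 0 <= unit_mask <= 0xFF:
        raise ValueError("unit mask must fit in 8 bits")
    value = event | (unit_mask << UNIT_MASK_SHIFT)
    if user:
        value |= USR_BIT
    if kernel:
        value |= OS_BIT
    if enable:
        value |= ENABLE_BIT
    return value


if __name__ == "__main__":
    # PP_L2_LD (29H) for lines in any MESI state.
    print(hex(perf_evt_sel(0x29, MESI_ALL)))
    # PP_L2_LD (29H) for lines in the Modified state only.
    print(hex(perf_evt_sel(0x29, MESI_M)))
    # PP_BUS_TRAN_MEM (6FH) for transactions from any processor.
    print(hex(perf_evt_sel(0x6F, EBL_ANY)))
```

Keeping the MESI bits as separate constants makes it easy to count a subset of states, for example `MESI_M | MESI_E`, instead of only the all-states mask.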

Acknowledgement and Disclaimer