Exploring P4 Trace Cache Features Ed Carpenter Marsha Robinson Jana Wooten
Problem Statement Explore characteristics of the P4 Trace Cache using microbenchmarks and performance counters related to branching and Trace Cache
Approach Determine characteristics of the Pentium 4 processor that will help us evaluate the P4’s trace cache. Using a performance monitoring tool (Intel’s VTune Performance Analyzer), measure the data we need and analyze it to find limitations of the trace cache.
Some P4 Characteristics Like most high performance processors, the P4 has special on-chip hardware for performance monitoring. This hardware typically includes: event detectors and counters; qualification of event detection and counting by privilege mode and event characteristics; support for event-based sampling.
P4 characteristics cont. Common problems faced by modern processors: a small number of counters; inability to distinguish between speculative and nonspeculative events; imprecise event-based sampling. With 42 million transistors (compared to 28 million in the P3), the P4 has overcome these problems: 48 event detectors and 18 event counters; instruction tagging to enable counting of nonspeculative performance events; support for both imprecise event-based sampling (IEBS) and precise event-based sampling (PEBS).
Trace Cache Special instruction cache for capturing long dynamic instruction sequences. Each line stores a snapshot, or trace, of the dynamic instruction stream. The P4 executes from the trace cache when there is an L1 cache hit (which is over 90% of the time).
Characteristics of Trace Cache Stores instructions after they have already been decoded into μops (“micro-ops”), which are RISC-style instructions. Cache line size: 6 μops. Trace cache size: 12K μops. Branch prediction hardware is used, so the trace cache knows about any branch and can fetch the instructions that follow it. Conditional branches can cause problems: a wrong prediction is not detected until the branch condition is checked in ALU0.
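As a rough worked example of the two size figures on this slide, here is a minimal sketch in C. It assumes that "12K" means 12 × 1024 μops and that a trace always occupies whole lines; neither is stated on the slide, and the function names are hypothetical.

```c
#include <assert.h>

/* Figures from the slide. "12K" is assumed here to mean 12 * 1024 uops. */
#define TC_LINE_UOPS   6u
#define TC_TOTAL_UOPS  (12u * 1024u)

/* Number of cache lines implied by the two figures above. */
unsigned tc_line_count(void)
{
    return TC_TOTAL_UOPS / TC_LINE_UOPS;
}

/* Lines needed for a trace of n_uops, assuming traces occupy whole lines. */
unsigned tc_lines_for_trace(unsigned n_uops, unsigned line_uops)
{
    assert(line_uops > 0);
    return n_uops / line_uops + (n_uops % line_uops != 0);
}

/* Nonzero if a single trace of n_uops fits in this simple capacity model. */
int tc_trace_fits(unsigned n_uops)
{
    return tc_lines_for_trace(n_uops, TC_LINE_UOPS) <= tc_line_count();
}
```

Under these assumptions the cache has 2048 lines, and a 5000-μop trace would need 834 of them.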
Entering the Execution Pipeline - Pentium 4's Trace Cache (source: Tom’s Hardware Guide)
Advantages of Trace Cache More efficient use of limited cache space. Trace cache lines contain both branch instructions and the code after the branch instruction. No extra latency for branches. Does not require a TLB check.
“Execute Mode” (when needed code is in L1 cache) The P4’s Critical Execution Path
Execute Mode vs. Trace Segment Build Mode Execute mode: the trace cache feeds stored traces to the execution logic to be executed; the trace cache normally runs in this mode. Trace segment build mode: used when there is an L1 cache miss; the front end fetches x86 code from the L2 cache, translates it into μops, builds a “trace segment” with it, and loads that segment into the trace cache to be executed.
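The switch between the two modes can be sketched with a toy model in C. This is only an illustration under loud assumptions: the four-slot, direct-mapped organization, the struct layout, and all names are hypothetical and say nothing about how the real P4 trace cache is organized. The L2 fetch, decode, and segment build steps are elided.

```c
#include <assert.h>
#include <string.h>

#define TOY_SLOTS 4  /* hypothetical size, chosen only for illustration */

typedef enum { TC_EXECUTE_MODE, TC_BUILD_MODE } tc_mode;

typedef struct {
    unsigned long tag[TOY_SLOTS];   /* start address of the cached trace */
    int valid[TOY_SLOTS];
    unsigned long deliver_count;    /* fetches served in execute mode */
    unsigned long build_count;      /* fetches that required build mode */
} toy_tc;

void toy_tc_init(toy_tc *tc)
{
    assert(tc != NULL);
    memset(tc, 0, sizeof *tc);
}

/* Look up a trace by start address and report which mode serves it. */
tc_mode toy_tc_fetch(toy_tc *tc, unsigned long addr)
{
    unsigned slot = (unsigned)(addr % TOY_SLOTS);
    assert(tc != NULL);
    if (tc->valid[slot] && tc->tag[slot] == addr) {
        tc->deliver_count++;        /* hit: deliver the stored trace */
        return TC_EXECUTE_MODE;
    }
    /* Miss: fetch from L2, decode, build a segment, load it (all elided). */
    tc->valid[slot] = 1;
    tc->tag[slot] = addr;
    tc->build_count++;
    return TC_BUILD_MODE;
}
```

In this model the first fetch of an address lands in build mode and a repeated fetch of the same address lands in execute mode, unless another address mapping to the same slot evicted it in between.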
Branch Prediction X86 code with a branch in it: the trace cache builds a trace from the instructions up to and including the branch instruction, then picks the direction it thinks the program will take, and continues to build the trace along that speculative path.
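The trace-building steps on this slide can be sketched as a small C routine. It is a simplified illustration, not the P4's actual mechanism: the instruction record `toy_insn`, the per-branch `predict_taken` flag, and the function name are all hypothetical, and the trace is recorded as a list of instruction indices rather than μops.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int is_branch;       /* nonzero for a branch instruction */
    size_t target;       /* index of the taken target (branches only) */
    int predict_taken;   /* the predictor's guess for this branch */
} toy_insn;

/* Record instruction indices into trace[], following predicted directions.
 * Stops at the end of the program or after max_len entries.
 * Returns the number of entries written. */
size_t build_trace(const toy_insn *prog, size_t n_prog, size_t start,
                   size_t *trace, size_t max_len)
{
    size_t pc = start, len = 0;
    assert(prog != NULL && trace != NULL);
    while (pc < n_prog && len < max_len) {
        trace[len++] = pc;              /* the branch itself is included */
        if (prog[pc].is_branch && prog[pc].predict_taken)
            pc = prog[pc].target;       /* continue along the predicted path */
        else
            pc++;                       /* fall through */
    }
    return len;
}
```

For a five-instruction program whose instruction 1 is a branch to instruction 3, a predicted-taken build starting at 0 records 0, 1, 3, 4, while a predicted-not-taken build records 0, 1, 2, 3, 4.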
Microcode ROM Used by the P4 to process longer instructions. Allows the regular hardware decoder to concentrate on decoding the smaller, faster instructions. Stores a sequence of μops for each long instruction encountered. Inserts a tag into the trace segment that points to the section of the microcode ROM where the μop sequence is held. The trace cache gives control to the microcode ROM when a tag is encountered, until the proper sequence of μops is produced. The execution engine does not care whether instructions come from the trace cache or the microcode ROM.
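The tag mechanism described on this slide can be illustrated with a toy expansion routine in C. The entry layout, the integer μop ids, and every name below are hypothetical; the sketch only shows the idea that a tag stands in for a ROM-held sequence and that the consumer sees a single uniform μop stream.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ENTRY_UOP, ENTRY_ROM_TAG } entry_kind;

typedef struct {
    entry_kind kind;
    int value;            /* uop id, or index into the ROM table for a tag */
} trace_entry;

typedef struct {
    const int *uops;      /* uop sequence stored for one long instruction */
    size_t len;
} rom_sequence;

/* Expand a trace into the uop stream the execution engine would see.
 * Returns the number of uops written to out (at most out_cap). */
size_t deliver_uops(const trace_entry *trace, size_t n_trace,
                    const rom_sequence *rom, size_t n_rom,
                    int *out, size_t out_cap)
{
    size_t written = 0;
    assert(trace != NULL && out != NULL);
    for (size_t i = 0; i < n_trace; i++) {
        if (trace[i].kind == ENTRY_UOP) {
            if (written < out_cap)
                out[written++] = trace[i].value;
        } else {
            /* Tag: the ROM takes over until its whole sequence is produced. */
            size_t idx = (size_t)trace[i].value;
            assert(rom != NULL && trace[i].value >= 0 && idx < n_rom);
            for (size_t j = 0; j < rom[idx].len && written < out_cap; j++)
                out[written++] = rom[idx].uops[j];
        }
    }
    return written;
}
```

The output array carries no marker of where each μop came from, which mirrors the slide's last point.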
VTune Experiment
for (i = 0; i < 1M; i++)    /* 1M: slide shorthand for the iteration count */
    _asm {
        mov eax, 10
        mov eax, 20
    }
VTune Experiment
for (i = 0; i < 1M; i++)    /* 1M: slide shorthand for the iteration count */
    _asm {
        mov eax, 10
        …
        mov eax, 4990
    }
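Writing a loop body this long by hand is impractical, so one way to produce it is a small generator. The sketch below is an assumption about how such a benchmark could be generated, not the authors' tooling: it assumes the immediates advance in steps of 10 (the slide shows only 10 and 4990 around an ellipsis), takes the iteration count from the caller, and uses hypothetical function names.

```c
#include <assert.h>
#include <stdio.h>

/* Number of mov lines for immediates first, first+step, ..., up to last. */
unsigned mov_count(unsigned first, unsigned last, unsigned step)
{
    assert(step > 0);
    return (last < first) ? 0 : (last - first) / step + 1;
}

/* Write the benchmark loop to out; returns the number of mov lines written.
 * The step between immediates is an assumption, not taken from the slide. */
unsigned emit_benchmark(FILE *out, unsigned long iterations,
                        unsigned first, unsigned last, unsigned step)
{
    unsigned n = mov_count(first, last, step);
    assert(out != NULL);
    fprintf(out, "for (i = 0; i < %lu; i++)\n    _asm {\n", iterations);
    for (unsigned k = 0; k < n; k++)
        fprintf(out, "        mov eax, %u\n", first + k * step);
    fprintf(out, "    }\n");
    return n;
}
```

A call such as `emit_benchmark(stdout, 1000000UL, 10, 4990, 10)` would write a loop with 499 mov lines under the step-of-10 assumption; the value 1000000 is likewise only one reading of "1M".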
VTune Results
Code             Trace Cache Misses   Trace Cache Delivery Mode
mov eax, …       …                    …,605,634
mov eax, 4990    2,356                173,879,264
mov eax, 5000    3,945                174,448,595
VTune Results cont. Columns: Distance, Run #, Spec microcode uops, Spec TC-built uops, Spec TC-delivered uops, TC Build Mode, TC Deliver Mode, TC Misses, uops Decoded, uops Retired. [Table values were garbled in extraction and are not recoverable.]
VTune Results for P4m
Columns: Distance, Run #, Spec uops retired, Spec TC-built uops, Spec TC-delivered uops, TC Build Mode, TC Deliver Mode, TC Misses, uops Retired. [The first four columns were lost in extraction; … marks lost digits.]
Spec TC-delivered uops   TC Build Mode   TC Deliver Mode   TC Misses   uops Retired
…,600,752                0               53,391,182        4,248       158,219,…
…,005,262                383,183         55,624,016        2,957       157,856,…
…,680,678                0               55,166,319        7,248       158,195,…
…,421,964                389,101         55,592,768        5,192       157,215,…
…,760,210                0               48,314,…          …           …,856,…
…,808,330                342,955         48,707,795        9,054       138,242,…
…,527,055                360,100         50,786,…          …           …,032,684
Sources:
M. Milenkovic, A. Milenkovic, J. Kulick, “Demystifying Intel Branch Predictors,” Proceedings of the Workshop on Duplicating, Deconstructing, and Debunking (held in conjunction with the 29th ISCA), Anchorage, Alaska, May 2002.
E. Rotenberg, S. Bennett, J. E. Smith, “A Trace Cache Microarchitecture and Evaluation,” IEEE Transactions on Computers, Vol. 48, No. 2, February 1999.