Microbenchmarks and Mechanisms for Reverse Engineering of Branch Predictor Structures
Vladimir Uzelac and Aleksandar Milenković
LaCASA Laboratory, Electrical and Computer Engineering Department
The University of Alabama in Huntsville
{uzelacv | milenka}@ece.uah.edu

2 Outline
- Motivation and Goals
- Reverse Engineering Flow
- Predictors Details Deconstruction
  - Target Predictors
    - Branch Target Buffer
    - Indirect Branch Target Buffer
  - Outcome Predictors
    - Loop Predictor
    - Global/Bimodal Predictors
- Conclusion

3 Motivation
- If we know the branch predictor organization, we could ...
  - Implement predictor-aware compiler optimizations
    - Code alignment to avoid BTB conflicts in critical code sections
    - Code splitting to replace long correlations with shorter ones
    - Camino environment [PLDI '05]
  - Have a “golden standard” for academic research
  - Design tools for rapid BP design space exploration and verification
- But details are rarely publicly disclosed
  - In spite of hints in software optimization manuals
=> Develop microbenchmarks and mechanisms for reverse engineering of modern branch predictor units

4 Goals
- Microbenchmarks and mechanisms developed to reverse engineer Pentium M's branch predictor, including
  - Target predictor
    - BTB and IBTB
  - Outcome predictor
    - Loop predictor
    - Global outcome predictor
    - Bimodal predictor
- Branch predictor parameters
  - Organization and size of all branch predictor structures
  - Indexing, allocation, update, and replacement policies
  - Interdependencies between these structures
- Validation of our effort through a functional PIN model

6 Reverse Engineering Flow
- Goal: determine a specific branch predictor parameter (e.g., BTB size)
- Design μbenchmark(s) to stress the parameter
  - Influenced by the type of observable events
- Build expectations for the relevant event(s) based on back-of-the-envelope analysis
- Execute the μbenchmarks and collect events (VTune)
- Compare expectations with actual results
- Retire findings or modify the μbenchmark
- Verify findings using a functional PIN model

8 Branch Target Buffer (BTB)
- Background: the BTB is a cache structure
  - Instructions are fetched in 16-byte blocks (Intel)
  - A line can hold multiple branches
  - The BTB can have multiple hits (same tags)
    => Offset field in each entry
    => Offset algorithm selects the target among the several offered
- Try to find:
  - Number of BTB entries (N_BTB)
  - Number of sets (N_SETS)
  - Number of ways (N_WAYS)
  - Index and Tag bits
  - Offset bits and presence of an offset algorithm
  - Bogus branches handling
  - Replacement policy

9 Core BTB Test
- Use B taken branches at distance D from each other
- The code is executed many times to amplify the effects on the performance counters
- Control how these branches are presented to the BTB
  - To cope with different allocation policies
  - Here, we execute each branch twice consecutively
- The misprediction rate (MPR) as a function of B and D is sufficient to determine the BTB parameters

10 BTB Capacity Tests
- Try to fill the whole BTB using very small distances between branches
- Example: 4-way BTB with 512 entries, BTB index = IP[10:4]
  - N_BTB branches can fit for three distances
    - Branches fill sets consecutively
  - For larger D, MPR = f(B, D)
    - Branches jump over sets
  - For very small D, there are more branches in the line than ways
- Mispredictions occur for any D if B > N_BTB
- MPR = f(B, D, BTB parameters) can be mathematically formalized
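
The capacity test can be explored with a small software model. The sketch below is only an illustration under stated assumptions: it uses the slide's example geometry (4 ways, 128 sets, index = IP[10:4]), true LRU replacement, the full IP as the tag, and it counts every BTB miss as a misprediction. The function name `btb_miss_rate` and its parameters are hypothetical and are not taken from the presentation.

```python
from collections import OrderedDict


def btb_miss_rate(num_branches, distance, n_sets=128, n_ways=4,
                  index_lsb=4, rounds=20):
    """Miss rate of a toy LRU BTB for B branches placed D bytes apart.

    Assumptions (not from the slides): true LRU replacement, the full IP
    acts as the tag, and round 0 only warms up the structure.
    """
    sets = [OrderedDict() for _ in range(n_sets)]
    misses = executions = 0
    for rnd in range(rounds):
        counted = rnd > 0                      # skip the warm-up round
        for b in range(num_branches):
            ip = b * distance
            ways = sets[(ip >> index_lsb) % n_sets]
            for _ in range(2):                 # each branch runs twice in a row
                hit = ip in ways
                if hit:
                    ways.move_to_end(ip)       # mark as most recently used
                else:
                    if len(ways) == n_ways:
                        ways.popitem(last=False)   # evict the LRU entry
                    ways[ip] = True
                if counted:
                    executions += 1
                    misses += not hit
    return misses / executions
```

In this model, B = 512 branches fit without misses for D = 4, 8, and 16, which matches the "three distances" of the example. For D = 2 (eight branches per 16-byte line) and D = 32 (every other set skipped), half of the executions miss, and B = 513 produces misses even at D = 16.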

11 BTB Set Tests
- Try to fill one BTB set while varying the distance D
- When D > N_SETS, all branches collide in one set
  - MPR is a function of B only (only 4 branches can fit)
  - Helps find N_WAYS and the Index MSB
- When D > N_SETS, change the distance D' between the last two branches to find the Index LSB
  - The D' for which the MPR disappears determines the Index LSB
- When D exceeds the Tag MSB distance, false hits occur
  - Only two branches produce mispredictions
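
The link between the colliding distance and the Index MSB can be sketched as follows. This is a minimal illustration that assumes the example geometry from the capacity test (index = IP[10:4]); the helper names are hypothetical. On real hardware the "all branches collide" condition is inferred from the MPR, not computed from known index bits as done here.

```python
def set_index(ip, index_lsb=4, index_msb=10):
    """Extract the set index IP[index_msb:index_lsb]."""
    width = index_msb - index_lsb + 1
    return (ip >> index_lsb) & ((1 << width) - 1)


def sets_touched(num_branches, distance, index_lsb=4, index_msb=10):
    """Number of distinct sets used by branches at 0, D, 2D, ..."""
    return len({set_index(b * distance, index_lsb, index_msb)
                for b in range(num_branches)})


def first_single_set_distance(num_branches=8, index_lsb=4, index_msb=10):
    """Smallest power-of-two D (at least one line) that maps every branch
    to the same set."""
    d = 1 << index_lsb
    while sets_touched(num_branches, d, index_lsb, index_msb) > 1:
        d *= 2
    return d
```

For the example geometry the first fully colliding distance is 2048 = 2^11, one bit position above the Index MSB (bit 10). Once all branches share a set, only the number of branches relative to the number of ways matters.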

12 BTB Findings
- Number of BTB entries: 2048
- Number of sets: 512
- Number of ways: 4
- Index = IP[12:4], Tag = IP[21:13], Offset = IP[3:0]
- Offset algorithm: on multiple hits, selects the target with the lowest offset that is not smaller than the current IP
- Bogus branches handling: evict the whole set
- Replacement policy: tree-based pseudo-LRU
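
The reported field layout and offset rule can be written down directly. The sketch below only restates the findings above in code; the function names are hypothetical, and returning `None` when no entry qualifies is an assumption about how the no-match case is represented.

```python
def btb_fields(ip):
    """Split an instruction pointer into the reported BTB fields."""
    return {
        "offset": ip & 0xF,            # IP[3:0]
        "index": (ip >> 4) & 0x1FF,    # IP[12:4], selects one of 512 sets
        "tag": (ip >> 13) & 0x1FF,     # IP[21:13]
    }


def select_hit(fetch_ip, hit_offsets):
    """Offset rule: among the entries that hit, take the lowest offset that
    is not smaller than the offset of the current IP (None if none)."""
    current = fetch_ip & 0xF
    candidates = [off for off in hit_offsets if off >= current]
    return min(candidates) if candidates else None
```

Because no field uses bits above IP[21], two branches whose addresses differ only in higher bits look identical to a structure with this layout, which is consistent with the false hits seen in the set tests.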

14 Indirect Branch Target Buffer (IBTB)
- Background: target predictor indexed by program-path information
- Try to find:
  1. Which branch parts affect the PIR during update?
  2. How is the PIR updated?
  3. Which branch IP bits affect the hash access function?
  4. What is the hash access function?
  5. What are the Index and Tag fields?
  6. What is the IBTB organization?

15 Path Information Register: Background
- The PIR is a (shift) register updated with program branches
- Different ways to incorporate a newly occurring branch:
  - Shift and Add (add to the lowest PIR bits)
  - Shift and Add with interleave (better indexing)
  - Shift and XOR
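
Two of these update schemes can be sketched in a few lines. This is an illustration under assumptions: "Shift and Add" is read here as shifting and placing the new branch bits in the lowest PIR positions, the interleaved variant is omitted, and all names, widths, and shift amounts are hypothetical parameters, not values from the slides.

```python
def shift_and_add(pir, branch_bits, n_new, width):
    """Shift left by n_new and place n_new branch bits in the lowest bits."""
    mask = (1 << width) - 1
    return ((pir << n_new) | (branch_bits & ((1 << n_new) - 1))) & mask


def shift_and_xor(pir, branch_bits, shift, width):
    """Shift left by `shift`, then XOR the branch bits into the register."""
    mask = (1 << width) - 1
    return ((pir << shift) ^ branch_bits) & mask


def history_depth(update, width, **kw):
    """Number of later updates after which a one-bit difference caused by
    the oldest branch has been shifted out of the register."""
    a, b = 0, 1                      # two histories differing in bit 0
    depth = 0
    while a != b:
        a = update(a, 0, width=width, **kw)
        b = update(b, 0, width=width, **kw)
        depth += 1
    return depth
```

The `history_depth` helper mirrors the idea of inserting more and more intervening branches until an older difference no longer influences the register: a 15-bit register shifted by two bits per update forgets a branch after 8 updates.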

16 PIR Organization Test
- The PIR is the same prior to both Target1 and Target2
- Branches are at a large distance from each other (> 2^q)
- P1.SB1 and P2.SB1 differ in one bit, k = log2(D)
  - If bit k affects the PIR, there are no collisions, and vice versa
- H block: H branches that affect the PIR
  - For large H, P1.SB1 and P2.SB1 are shifted out of the PIR
- Analysis of MPR = f(H, D) gives the following answers
  - PIR history depth
  - Which branch address/target bits affect the PIR
  - PIR update mechanism details (XOR or Add ...)
- P1.SB1 and P2.SB1 are replaced with different types of branches
  - Both address and target bits are tested in this way

17 IBTB Access Hash Function Test
- Find which PIR and branch IP bits are XORed in the iBTB access hash function
  - Previously we found XOR
- Reuse the previous test
  - The difference at bit k of P1.SB1 and P2.SB2 keeps the targets from colliding
- Use two Spies at distance D_IP = 2^l
  - If bits l and k are XORed in the hash function, the difference in the PIR values is cancelled out

18 IBTB Organization Test
- Employ N indirect branch targets to fill the iBTB in different ways
  - By using N different PIR values
- SB1...SBN create N different PIRs for each iSpy target
  - SB1...SBN are at distance D = 2^k from each other
- MPR = f(D, N) is sufficient to find the IBTB organization
  - Similar to the BTB

19 IBTB Predictor Findings
1. Which branch parts affect the PIR during update?
   => 15 IP bits from the conditional branch IP
   => Combined 15 bits from the indirect branch target and IP
2. How is the PIR updated?
   => Shifted left by two bits prior to the update (XOR)
3. Which branch IP bits affect the hash access function?
   => 15 bits, IP[18:4]
4. What is the hash access function?
   => XOR
5. What are the Index and Tag fields?
   => Index = HASH[13:6], Tag = IP[14,5:0]
6. What is the IBTB organization?
   => A direct-mapped cache with 256 entries
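
The update and index computation reported above can be sketched as follows. This is a hedged illustration: the 15-bit PIR width is an assumption (chosen to match the 15 hashed IP bits), the exact selection of the 15 branch bits is left to the caller, the tag is not modeled, and the function names are hypothetical.

```python
PIR_BITS = 15                       # assumption: PIR as wide as IP[18:4]
PIR_MASK = (1 << PIR_BITS) - 1


def update_pir(pir, branch_bits):
    """Shift the PIR left by two bits, then XOR in 15 branch-derived bits."""
    return ((pir << 2) ^ (branch_bits & PIR_MASK)) & PIR_MASK


def ibtb_hash(pir, ip):
    """XOR the PIR with IP[18:4]."""
    return (pir ^ (ip >> 4)) & PIR_MASK


def ibtb_index(pir, ip):
    """HASH[13:6] selects one of the 256 direct-mapped entries."""
    return (ibtb_hash(pir, ip) >> 6) & 0xFF
```

The sketch also shows the cancellation that the hash function test relies on: flipping PIR bit 8 alone changes the index, but flipping PIR bit 8 together with IP bit 12 (which lands on hash bit 8) leaves the index unchanged.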

21 Loop Predictor
- What do we know?
  - Each entry has two counters
    - Counter MAX_VAL stores the loop branch maximum count value
    - Counter CURR_VAL stores the loop branch current iteration
- Assumptions:
  - The loop BP is an IP-indexed cache
- Try to find:
  - Counters' length
  - Size and organization of the loop branch predictor buffer (Loop BPB)
  - Allocation policy (when a branch becomes a candidate for a loop branch)
  - Training policy: how a new loop branch's MAX_VAL is set

22 Loop Counters Size Test
- Test: the "spy" loop (LSpy) has loop modulo L
  - Mispredictions occur if L exceeds the range of the MAX_VAL counter
- Results: maximum predictable L is 64 (6-bit counters)
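
The counter-size test can be mimicked with a toy loop predictor entry. The sketch below is a simplified model under assumptions that are not from the slides: the loop branch is taken L-1 times and then not taken, the entry is trained at the end of a loop instance if the taken count fits in the counter, an untrained entry always predicts taken, and the first instance is excluded as warm-up. The function name is hypothetical.

```python
def mispredictions_per_instance(trip_count, counter_bits=6, instances=20):
    """Average mispredictions per loop instance after the first one."""
    limit = (1 << counter_bits) - 1     # largest value a counter can hold
    max_val = None                      # MAX_VAL, unknown until trained
    total = 0
    for instance in range(instances):
        curr_val = 0                    # CURR_VAL, taken outcomes so far
        for iteration in range(trip_count):
            taken = iteration < trip_count - 1
            predicted = True if max_val is None else curr_val < max_val
            if instance > 0 and predicted != taken:
                total += 1
            if taken:
                curr_val += 1
        if curr_val <= limit:           # train only if the count fits
            max_val = curr_val
    return total / (instances - 1)
```

With 6-bit counters this model predicts loops up to L = 64 perfectly and mispredicts the exit of every instance once L reaches 65, which is the kind of step in the MPR that the test looks for.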

23 Loop BPB Capacity and Set Tests
- Similar to the BTB Capacity/Set tests
- Employ B loops at distance D from each other
- MPR is a function of B, D, and the Loop BPB parameters, as for the BTB

24 Loop BPB Capacity and Set Tests
- Counters' length: 6 bits
- Size and organization of the loop branch predictor buffer
  - Two-way cache with 128 entries
  - Index = IP[9:4], Tag = IP[15:10]
- Allocation policy: branch allocated on the first opposite outcome
- Training policy: MAX_VAL is set during the 2nd loop iteration

26 Global and Bimodal Predictor
- What do we know?
  - All branches are predicted dynamically
  - At least one predictor is not tagged
- Assumptions:
  - Cascade organization
    - The Bimodal predictor is not tagged
    - The Global predictor can correct the Bimodal
  - The Global predictor is path indexed (BHR register)
- Try to find:
  - Organization of the Global predictor
  - Indexing into the Global predictor (BHR and hashing function details)
  - Bimodal predictor details
    - Size only (not tagged)
    - Indexing bits (IP indexed)

27 BHR Organization Test
- Similar to the PIR Organization test
  - The iSpy with two targets is replaced with a conditional branch (cSpy) with two outcomes
- MPR = f(D, H) is sufficient to find the BHR organization
- Results:
  - The BHR is affected in the same way as the PIR
  - The BHR and PIR are the same register

28 Global Predictor Organization Test
- Similar to the IBTB Organization test
- N different paths to cSpyN (always not taken)
  - PIR values depend on distance D
  - cSpyN is allocated to up to N different entries
- As with the IBTB, MPR = f(D, N) is sufficient to determine the predictor organization
- Eliminate correct predictions from the Bimodal predictor:
  - The distance of cSpyT from cSpyN is large, so both target the same Bimodal entry
  - Path occurrence pattern: T*PT, PN1, T*PT, PN2, ..., T*PT, PNN, ...
- Eliminate correct predictions from the Loop predictor if needed

29 Bimodal Predictor Organization Test
- Reuse the previous test
  - Create contention in the Global predictor
- Change the distance between cSpyT and cSpyN to try predicting the branches with the Bimodal predictor
  - D_G = 2^k
  - No contention in the Bimodal predictor if bit k is used in the Bimodal index

30 Global and Bimodal Predictor Findings
- Global:
  - 4-way cache structure with 2048 entries
  - Accessed with a hash function: PIR XORed with the conditional branch IP
  - 9 bits used as the index, 6 bits as the tag
- Bimodal:
  - A table with 4096 bimodal counters
  - Indexed with IP[11:0]
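
The Bimodal index finding can be illustrated with an untagged table and two branches that have opposite outcomes. The sketch below models only the Bimodal table; the Global predictor and the cascade are left out. The 2-bit saturating counters, the weakly-taken initial state, the base address, and the function name are all assumptions, not details from the slides.

```python
def bimodal_mispredictions(distance, rounds=100, index_bits=12):
    """Mispredictions for an always-taken and an always-not-taken branch
    placed `distance` bytes apart, using an untagged table indexed by
    IP[index_bits-1:0]."""
    mask = (1 << index_bits) - 1
    table = [2] * (1 << index_bits)        # 2 = weakly taken (assumption)
    ip_taken = 0x400000                    # hypothetical base address
    ip_not_taken = ip_taken + distance
    mispredicted = 0
    for _ in range(rounds):
        for ip, taken in ((ip_taken, True), (ip_not_taken, False)):
            i = ip & mask
            predicted = table[i] >= 2
            mispredicted += predicted != taken
            table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
    return mispredicted
```

At a distance of 2^11 the two branches use different counters and only the first not-taken execution is mispredicted. At 2^12, the first bit outside IP[11:0], they share a counter and the not-taken branch is mispredicted in every round.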

32 Limitations and Verification
- Generalization of the reverse engineering flow is difficult
  - Different branch prediction organizations
- Implementation of microbenchmarks is a challenging task
  - Balance the observability of certain parameters against the isolation of different parameters that share the same event
  - Some knowledge of the targeted predictor is needed
    - E.g., prediction in cache lines (AMD K8)
  - Tests must cover a large design space
- Verification
  - Using the PIN model: achieved more than 95% accuracy

33 Conclusion
- Microbenchmarks and mechanisms for reverse engineering of path- or IP-indexed predictor structures
- Demonstrated on the Pentium M
  - BTB, IBTB, Loop, Global/Bimodal
