IA-64 Architecture Innovations
John Crawford, Architect & Intel Fellow, Intel Corporation
Jerry Huck, Manager & Lead Architect, Hewlett Packard Co.

Presentation transcript:

Agenda
Architecture Principles
Predication & Speculation
Branch Architecture
Software Pipelining

Traditional Architectures: Limited Parallelism
[Figure: original source code goes through the compiler to sequential machine code; the hardware re-parallelizes the code for its multiple functional units. Available execution units are used inefficiently.]
Today’s Processors often 60% Idle

IA-64 Architecture: Explicit Parallelism
[Figure: original source code goes through the IA-64 compiler, which views a wider scope, to parallel machine code; the hardware's multiple functional units execute it with more efficient use of execution resources.]
Increases Parallel Execution

IA-64 Principles
Explicitly parallel:
  – Instruction level parallelism (ILP) in machine code
  – Compiler schedules across a wider scope
Enhanced ILP:
  – Predication, Speculation, Software pipelining, ...
Fully compatible:
  – Across all IA-64 family members
  – IA-32 in hardware and PA-RISC through instruction mapping
  – Inherently scalable
Massively resourced:
  – Many registers
  – Many functional units

Predication
[Figure: in traditional architectures a cmp feeds a branch to separate then and else blocks; in IA-64 the cmp sets predicates p1 and p2, which guard the then and else code.]
Removes branches, converts to predicated execution
  – Executes multiple paths simultaneously
Increases performance by exposing parallelism and reducing critical path
  – Better utilization of wider machines
  – Reduces mispredicted branches

Predication Review
Two kinds of normal compares
  – Regular
  – Unconditional (nested IF’s)
Regular: p3 is set just once
    (p1) p3 = ...
    (p2) p3 = ...
    (p3) ...
Unconditional: p3 and p4 are AND’ed with p2
    p1,p2 <- ...
    (p2) p3,p4 <- cmp.unc ...
    (p3) ...    executes under p2&p3
    (p4) ...    executes under p2&p4
Opportunity for Even More Parallelism

Introducing Parallel Compares
Three new types of compares:
  – AND: both target predicates set FALSE if compare is false
  – OR: both target predicates set TRUE if compare is true
  – ANDOR: if true, sets one TRUE, sets other FALSE
[Figure: dependence graphs over compares A, B, C, D, before and after parallel compares.]
Reduces Critical Path

Eight Queens Example: Unconditional Compares
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
    R1=&b[j]    R3=&a[i+j]    R5=&c[i-j+7]
    ld R2=[R1]    ld.s R4=[R3]    ld.s R6=[R5]
    P1,P2 <- cmp.unc(R2==true)
    (P1) chk.s R4
    (P1) P3,P4 <- cmp.unc(R4==true)
    (P3) chk.s R6
    (P3) P5,P6 <- cmp.unc(R6==true)
    (P5) br then
    else
[Figure: 8 queens control flow, with predicates P1 to P6 labeling the edges toward Then and Else.]

Eight Queens Example: Parallel Compares
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
    R1=&b[j]    R3=&a[i+j]    R5=&c[i-j+7]
    p1 <- true
    ld R2=[R1]    ld R4=[R3]    ld R6=[R5]
    p1,p2 <- cmp.and(R2==true)
    p1,p2 <- cmp.and(R4==true)
    p1,p2 <- cmp.and(R6==true)
    (p1) br then
    else
[Figure: 8 queens control flow. The chain of tests on P1 to P6 collapses to a single test: P1=true goes to Then, P1=false goes to Else.]
Reduced from 7 cycles to 5

Five Predicate Compare Types
(qp) p1,p2 <- cmp.relation
  – if(qp) {p1 = relation; p2 = !relation};
(qp) p1,p2 <- cmp.relation.unc
  – p1 = qp&relation; p2 = qp&!relation;
(qp) p1,p2 <- cmp.relation.and
  – if(qp & (relation==FALSE)) { p1=0; p2=0; }
(qp) p1,p2 <- cmp.relation.or
  – if(qp & (relation==TRUE)) { p1=1; p2=1; }
(qp) p1,p2 <- cmp.relation.or.andcm
  – if(qp & (relation==TRUE)) { p1=1; p2=0; }
Tbit (Test Bit) Also Sets Predicates
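The five update rules above can be modeled in a few lines of Python. This is only a sketch of the semantics as written on the slide, not of the hardware; the names `predicated_compare` and `all_true` are hypothetical and do not come from the presentation.

```python
def predicated_compare(kind, qp, relation, p1, p2):
    """Return the new (p1, p2) for one compare, following the slide's rules.

    kind     -- "regular", "unc", "and", "or", or "or.andcm"
    qp       -- value of the qualifying predicate
    relation -- boolean result of the comparison itself
    p1, p2   -- current values of the two target predicates
    """
    if kind == "regular":
        return (relation, not relation) if qp else (p1, p2)
    if kind == "unc":
        # Written even when qp is false, which is what nested IFs need.
        return (qp and relation, qp and not relation)
    if kind == "and":
        return (False, False) if (qp and not relation) else (p1, p2)
    if kind == "or":
        return (True, True) if (qp and relation) else (p1, p2)
    if kind == "or.andcm":
        return (True, False) if (qp and relation) else (p1, p2)
    raise ValueError(f"unknown compare kind: {kind}")


def all_true(conditions):
    """Model 'p1 <- true' followed by one cmp.and per condition."""
    p1, p2 = True, True
    for relation in conditions:
        p1, p2 = predicated_compare("and", True, relation, p1, p2)
    return p1
```

Because each `and` compare can only clear the targets, the result does not depend on the order in which the compares are applied, which is what lets the three Eight Queens compares issue together.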

Predication Benefits
Reduces branches and mispredict penalties
  – 50% fewer branches and 37% faster code*
Parallel compares further reduce critical paths
Greatly improves code with hard to predict branches
  – Large server apps: capacity limited
  – Sorting, data mining: large database apps
  – Data compression
Traditional architectures’ “bolt-on” approach can’t efficiently approximate predication
  – Cmove: 39% more instructions, 23% slower performance*
  – Instructions must all be speculative
* Source: S. Mahlke, 1995

Speculation Review
Memory latency is a major performance bottleneck in today’s systems
  – CPU to memory gap increasing
[Figure: in traditional architectures the branch is a barrier: instr 1, instr 2, ..., br, load, use. In IA-64 the sequence is ld.s, instr 1, instr 2, br, chk.s, use.]
Allows elevation of load, even above a branch

Hoisting Uses
The uses of speculative data can also be executed speculatively
  – distinguishes speculation from simple prefetch
[Figure: IA-64 sequence ld.s, instr 1, instr 2, br, chk.s, use.]
Enables Further Parallelism

Introducing the NaT (“Not a Thing”)
NaT is the GR’s 65th bit that indicates:
  – whether or not an exception has occurred
  – branch to fixup code required
NaT set during ld.s, checked by chk.s
[Figure: IA-64 sequence ld.s, instr 1, instr 2, br, chk.s, use. The ld.s is labeled Exception Detection, the intervening instructions Propagate, and the chk.s Exception Delivery.]

Propagation
All computation instructions propagate NaTs to reduce number of checks
Cmp propagates “false” when writing predicates
RISC architectures require more instructions for equivalent integrity
  – e.g., non faulting load
    ld8.s r3 = (r9)
    ld8.s r4 = (r10)
    add r6 = r3, r4
    ld8.s r5 = (r6)
    p1,p2 = cmp(...)
    chk.s r5
    sub r7 = r5,r2
Allows single chk on result
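The propagation rule can be illustrated with a toy register model in Python. Everything here is an assumption made for the sketch: memory is a dictionary, a load from an address that is not in the dictionary stands in for a deferred fault, and the helper names (`Reg`, `ld_s`, `chk_s`, `speculative_chain`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    """A general register value plus its NaT bit."""
    value: int = 0
    nat: bool = False


def ld_s(mem, addr):
    """Speculative load: defer the fault by returning a NaT instead of trapping."""
    if addr.nat or addr.value not in mem:
        return Reg(0, True)
    return Reg(mem[addr.value])


def add(a, b):
    return Reg(a.value + b.value, a.nat or b.nat)  # NaT propagates


def sub(a, b):
    return Reg(a.value - b.value, a.nat or b.nat)  # NaT propagates


def chk_s(r):
    """Return True when control must branch to recovery code."""
    return r.nat


def speculative_chain(mem, r9, r10, r2):
    """Mirror the slide's sequence; one check on r5 covers every load before it."""
    r3 = ld_s(mem, r9)
    r4 = ld_s(mem, r10)
    r6 = add(r3, r4)
    r5 = ld_s(mem, r6)
    r7 = sub(r5, r2)
    return r7, chk_s(r5)
```

With `mem = {100: 8, 200: 16, 24: 7}`, the call `speculative_chain(mem, Reg(100), Reg(200), Reg(2))` completes without needing recovery, while replacing the second address with one that is absent from `mem` makes the single check on `r5` report that recovery is required.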

Exception Deferral: More Than Skin Deep
Deferral allows the efficient delay of costly exceptions
OS controlled deferral by hardware of:
  – Page faults
  – Protection violations
  – …
NaTs enable deferral with recovery
Efficiently support structured exception handling in C/C++
[Figure: the home block contains ld.s, instr 1, instr 2, uses, br, chk.s. The chk.s can branch to recovery code, which repeats the ld and its uses and then executes br home.]
Complete Solution for Exception Management

Control Speculation Summary
All loads have a speculative form that sets the NaT bit when deferring exceptions
Computational instructions propagate NaTs
OS controls deferral of faults, which is supported directly in HW: “no-fault speculation”
  – Minimizes overhead of data that is not used
Chk is more effective than a non-faulting load

Store Barrier
Traditional architectures limited by the Store Barrier
[Figure: traditional sequence instr 1, instr 2, ..., Store(*), Load(*), use. The store is a barrier that the load cannot be moved above.]

Introducing Data Speculation
Compiler can issue a load prior to a preceding, possibly-conflicting store
[Figure: traditional architectures: instr 1, instr 2, ..., st8, ld8, use, with the st8 as a barrier. IA-64: ld8.a, instr 1, instr 2, st8, ld.c, use.]
Unique feature to IA-64

Data Speculation
Uses can be hoisted
[Figure: the sequence ld8.a, instr 1, instr 2, st8, ld.c, use becomes ld8.a, instr 1, use, instr 2, st8, chk.a. The chk.a can branch to recovery code, which repeats the ld8 and its uses and then executes br home.]
Synergy with control speculation yields greater performance

Advanced Load Address Table - ALAT
ld.a inserts entries.
Conflicting stores remove entries
  – Also: ld.c.clr, chk.a.clr
Presence of entry indicates success
  – chk.a branches when no entry is found
[Figure: a table of (reg #, address) entries. ld.a reg# = ... inserts an entry, st removes a conflicting one, and chk.a reg# looks one up.]
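A minimal Python sketch of the three ALAT behaviors listed above follows. It assumes an unbounded table and treats a conflict as an exact address match; the class and method names are hypothetical and chosen only to echo the instruction mnemonics.

```python
class ALAT:
    """Toy model of the Advanced Load Address Table."""

    def __init__(self):
        self.entries = {}  # target register number -> address of the advanced load

    def ld_a(self, reg, addr):
        """Advanced load: record which address the register was loaded from."""
        self.entries[reg] = addr

    def store(self, addr):
        """A store removes every entry whose address it conflicts with."""
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def chk_a(self, reg):
        """Return True when recovery is needed, i.e. no entry is found."""
        return reg not in self.entries
```

For example, after `alat.ld_a(4, 0x1000)` a store to `0x2000` leaves the entry in place and `alat.chk_a(4)` returns `False`; a later store to `0x1000` removes it, so the check then returns `True` and recovery code would run.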

Architectural Support for Data Speculation
Instructions
  – ld.a - advanced loads
  – ld.c - check loads
  – chk.a - advance load checks
Speculative Advanced loads - ld.sa - is an advanced load with deferral
ALAT - HW structure containing outstanding advanced loads

Speculation Benefits
Reduces impact of memory latency
  – Study demonstrates performance improvement of 79% when combined with predication*
Greatest improvement to code with many cache accesses
  – Large databases
  – Operating systems
Scheduling flexibility enables new levels of performance headroom
* August et al., 1998

Branch Instruction
[Figure: a 128-bit bundle (bits 0 to 127) holding a template and 41-bit instruction slots; a branch instruction carries a qualifying predicate (QP) and a 21-bit IP offset.]
Two basic branch formats
  – Relative: IP := IP + Offset21
  – Indirect: IP := BR[I]
  – 8 branch registers for efficient branch execution
  – Call/Return linking through branch registers
Loop branches with 64-bit loopcount register (LC)
  – Enables perfect branch prediction of counted loops
  – Traditional architectures always mispredict last iteration
  – Incurs misprediction stall costing many cycles

Branch Predicates
    (p1) BR #label_A;    Conditional branches
    (p0) BR #label_A;    Unconditional branches (p0 is “always true”)
[Figure: blocks A and B, with edges labeled P1=true and P1=false.]
Compiler directed static prediction augments dynamic prediction
  – Better predict highly correlated branches (always/never taken)
  – Frees space in H/W predictor
  – Can give hint for dynamic predictor

Queens Loop: Parallel Compares & Compare-branch
    R1=&b[j]    R3=&a[i+j]    R5=&c[i-j+7]
    p1 <- true
    ld R2=[R1]    ld R4=[R3]    ld R6=[R5]
    p1,p2 <- cmp.and(R2==true)
    p1,p2 <- cmp.and(R4==true)
    p1,p2 <- cmp.and(R6==true)
    (p1) br then
    else
Compare & Branch in Same Cycle
From 5 Cycles Down to 4

Multi-way Branch
Multiway branches: more than 1 branch in a single cycle
Allows n-way branching
w/o Speculation:
    ld8 r6 = (ra)
    (p1) br exit1
    ld8 r7 = (rb)
    (p3) br exit2
    ld8 r8 = (rc)
    (p5) br exit3
Hoisting Loads:
    ld8.s r6 = (ra)
    ld8.s r7 = (rb)
    ld8.s r8 = (rc)
    chk r6, rec0
    (p1) br exit1
    chk r7, rec1
    (p3) br exit2
    chk r8, rec2
    (p5) br exit3
IA-64:
    ld8.s r6 = (ra)
    ld8.s r7 = (rb)
    ld8.s r8 = (rc)
    {
    chk r6, rec0
    (p2) chk r7, rec1
    (p4) chk r8, rec2
    }{
    (p1) br exit1
    (p3) br exit2
    (p5) br exit3
    }
[Figure: control flow labeled with predicates P1 to P6; the slide contrasts 3 branch cycles with 1 branch cycle for the multiway version.]
Supports Aggressive Speculation

Software Pipelining
Overlapping execution of different loop iterations
[Figure: sequential loop execution vs. software-pipelined loop execution.]
More iterations in same amount of time

Software Pipelining
IA-64 features that make this possible
  – Full Predication
  – Special branch handling features
  – Register rotation: removes loop copy overhead
  – Predicate rotation: removes prologue & epilogue
Traditional architectures use loop unrolling
  – High overhead: extra code for loop body, prologue, and epilogue
Especially Useful for Integer Code With Small Number of Loop Iterations

Basic Loop Example
    for (i=0; i<n; i++) { *b++ = *a++; } /* MemCopy */
Basic Copy Loop (3 ops):
    // setup ra/rb/lc, check n!=0
    .label loop
    { ld8 r35 = [ra],8 }
    { st8 [rb] = r35,8
      br.cloop #loop }
[Figure: execution (cycles): ld 1; st 1, br.cloop; ld 2; st 2, br.cloop; ld 3; st 3, br.cloop; ld 4; st 4, br.cloop.]
Simple Non-overlapping iterations
  – 2 cycles per iteration
  – 3 operations in loop body

Loop Support: Unrolling
Unrolled Copy Loop (10 ops):
    // Test for loop count 0,1
    ld8 r34 = [ra],8
    .label loop
    ld8 r35 = [ra],8
    st8 [rb] = r34,8
    br.cle #e-exit
    ld8 r34 = [ra],8
    st8 [rb] = r35,8
    br.cloop #loop
    st8 [rb] = r34,8
    br #thru
    .label e-exit
    st8 [rb] = r35,8
    .label thru
[Figure: execution cycles divided into prologue, main loop, and epilogue, showing ld 1 to ld 4, st 1 to st 4, br.cle, and br.cloop.]
Overlapped iterations
  – 1 cycle per word
  – 1.6X performance improvement
  – 3.3X code expansion
Incurs Code Expansion Penalties

Software Register Renaming (Traditional Architecture)
[Figure: animation over registers R32, R33, R34, R35. The unrolled copy loop alternates between r34 and r35 in successive steps:]
    step 1: ld 1 r34
    step 2: st 1 r34, ld 2 r35
    step 3: st 2 r35, ld 3 r34
    step 4: st 3 r34, ld 4 r35
    step 5: st 4 r35

Introducing Rotating Registers
GR, FR can rotate
Separate Rotating Register Base for each: GRs, FRs
Loop branches decrement all register rotating bases (RRB)
Instructions contain a “virtual” register number
  – RRB + virtual register number = physical register number
[Figure: animation copying the words “Palm”, “Springs”, “is”, “Sunny” through the rotating registers. Every step loads into R34 and stores from R35 while RRB steps from 0 down to -3; the register names shift by one per loop branch and wrap from 32 to 127:]
    ld 1 R34
    ld 2 R34, st 1 R35
    ld 3 R34, st 2 R35
    ld 4 R34, st 3 R35
    st 4 R35
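The mapping rule “RRB + virtual register number = physical register number” can be written as a one-line function. The sketch below assumes that the rotating general registers are r32 through r127 (96 registers), which matches the wrap from 32 to 127 visible in the animation; the constant and function names are hypothetical.

```python
GR_ROT_BASE = 32   # first rotating general register (assumption for this sketch)
GR_ROT_SIZE = 96   # r32..r127 rotate (assumption for this sketch)


def physical_gr(virtual, rrb):
    """Map a virtual register number to a physical one for a given RRB."""
    if virtual < GR_ROT_BASE:
        return virtual  # static registers are not renamed
    return GR_ROT_BASE + (virtual - GR_ROT_BASE + rrb) % GR_ROT_SIZE
```

With RRB = 0 a load into R34 writes physical register 34. After one loop branch RRB is -1, so `physical_gr(35, -1)` is also 34: the store that names R35 reads the value the previous iteration loaded, with no copy instruction. The wrap shows up as `physical_gr(32, -1) == 127`.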

Loop Support: Rotating Registers
Software Pipelined Copy Loop (5 ops):
    // setup ra/rb/lc/ec, check n > 2
    { ld8 r35 = [ra],8 }
    .label loop
    { ld8 r34 = [ra],8
      st8 [rb] = r35,8
      br.ctop #loop }
    { st8 [rb] = r35,8 }
[Figure: execution cycles divided into prologue, main loop, and epilogue, showing ld 1 to ld 4, st 1 to st 4, and br.ctop.]
Modulo Scheduled Iterations
  – 1 cycle per word
  – 1.6X performance improvement
  – additional upside for higher latency conditions
  – 1.7X code expansion

Introducing Rotating Predicate Registers
PR16-63 can rotate, with separate Rotating Register Base
Loop branches decrement all register rotating bases (RRB)
Instructions contain a “virtual” predicate register number
  – RRB + virtual register number = physical register number
Code:
    (p16) ld R34
    (p17) st R35
[Figure: animation of the loop state at initialization and after each loop branch:]
    Step            LC   EC   RRB   Operations
    Initialize      3    2    0     (p16) ld 1 R34
    Branch 1        2    2    -1    (p17) st 1 R35, (p16) ld 2 R34
    Branch 2        1    2    -2    (p17) st 2 R35, (p16) ld 3 R34
    Branch 3        0    2    -3    (p17) st 3 R35, (p16) ld 4 R34
    Branch 4        0    1    -4    (p17) st 4 R35; (p16) ld R34 does not execute
    Fall Through    0    0    -5    loop exits

Loop Support: Rotating Predicates
Software Pipelined Copy Loop (3 ops):
    // setup ra/rb/lc/ec, check n > 1
    .label loop
    { (p16) ld8 r34 = [ra],8
      (p17) st8 [rb] = r35,8
      br.ctop #loop }
[Figure: execution cycles, main loop only: ld 1; ld 2, st 1, br.ctop; ld 3, st 2, br.ctop; ld 4, st 3, br.ctop; st 4, br.ctop.]
Software Pipelined MemCopy
  – 1 cycle per word
  – 1.6X performance improvement
  – no code expansion
Efficient Loop, Efficient Code Size
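To see how LC, EC, and the rotating predicates drive this three-operation loop, here is a small Python simulation. It is a sketch under several assumptions that go beyond the slides: the rotating regions are taken to be r32 to r127 and p16 to p63, the loop branch writes p63 before decrementing RRB, setup uses LC = n - 1 and EC = 2, and one shared RRB stands in for the separate bases because they move in lockstep here. The function name `pipelined_copy` is hypothetical.

```python
def pipelined_copy(src):
    """Copy src with a two-stage software pipeline; return (dst, loop_trips)."""
    n = len(src)
    if n == 0:
        return [], 0

    GR_BASE, GR_SIZE = 32, 96   # rotating general registers (assumed)
    PR_BASE, PR_SIZE = 16, 48   # rotating predicate registers p16..p63
    gr = [None] * GR_SIZE
    pr = [False] * PR_SIZE
    rrb = 0

    def g(virtual):             # physical index of a virtual general register
        return (virtual - GR_BASE + rrb) % GR_SIZE

    def p(virtual):             # physical index of a virtual predicate register
        return (virtual - PR_BASE + rrb) % PR_SIZE

    lc, ec = n - 1, 2           # setup: loop count and epilogue count
    pr[p(16)] = True            # stage-1 predicate on, stage-2 predicate off
    dst, ra, trips = [], 0, 0

    while True:
        trips += 1
        do_ld, do_st = pr[p(16)], pr[p(17)]
        if do_st:               # (p17) st8 [rb] = r35,8
            dst.append(gr[g(35)])
        if do_ld:               # (p16) ld8 r34 = [ra],8
            gr[g(34)] = src[ra]
            ra += 1

        # br.ctop: choose the next stage-1 predicate, then rotate.
        if lc > 0:
            lc -= 1
            next_p16, taken = True, True
        elif ec > 1:
            ec -= 1
            next_p16, taken = False, True
        else:
            ec -= 1
            next_p16, taken = False, False
        pr[p(63)] = next_p16    # p63 becomes p16 after rotation
        rrb -= 1
        if not taken:
            return dst, trips
```

Calling `pipelined_copy(["Palm", "Springs", "is", "Sunny"])` returns the four words in order after five trips through the loop body, matching the LC/EC sequence in the animation: three trips while LC counts down from 3, one more that still loads, and a final epilogue trip in which only the store executes.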

Software Pipelining Benefits
Loop pipelining maximizes performance; minimizes overhead
  – Avoids code expansion of unrolling and code explosion of prologue and epilogue
  – Smaller code means fewer cache misses
  – Greater performance improvements in higher latency conditions
Reduced overhead allows S/W pipelining of small loops with unknown trip counts
  – Typical of integer scalar codes

Reviewing What’s New:
Parallel compares
Tbit
NaT bits
Deferral
Hoisting uses
Propagation
Branch instructions
Static prediction
Advanced loads
ALAT
Loop branches
LC register
EC register
Multiway branch
Branch registers
Register rotation
Predicate rotation
RRB

Summary
Speculation reduces memory latency impact
  – IA-64 removes recovery from critical path
  – Benefits applications with poor cache locality: server applications, OS
Predication removes branches
  – Parallel compares increase parallelism
  – Benefits complex control flow: large databases
S/W pipelining support with minimal overhead enables broad usage
  – Performance for small integer loops with unknown trip counts as well as monster FP loops