IA-64 Architecture Innovations
John Crawford, Architect & Intel Fellow, Intel Corporation
Jerry Huck, Manager & Lead Architect, Hewlett Packard Co.

1 ® IA-64 Architecture Innovations
John Crawford, Architect & Intel Fellow, Intel Corporation
Jerry Huck, Manager & Lead Architect, Hewlett Packard Co.

2 ® Agenda
Architecture Principles
Predication & Speculation
Branch Architecture
Software Pipelining

3 ® Traditional Architectures: Limited Parallelism
[Figure: original source code is compiled to sequential machine code; the hardware then re-parallelizes that code across multiple functional units. The available execution units are used inefficiently.]
Today’s processors are often 60% idle

4 ® IA-64 Architecture: Explicit Parallelism
[Figure: original source code is compiled by an IA-64 compiler that views a wider scope and emits parallel machine code for hardware with multiple functional units.]
Increases parallel execution
More efficient use of execution resources

5 ® IA-64 Principles
Explicitly parallel:
–Instruction level parallelism (ILP) in machine code
–Compiler schedules across a wider scope
Enhanced ILP:
–Predication, Speculation, Software pipelining,...
Fully compatible:
–Across all IA-64 family members
–IA-32 in hardware and PA-RISC through instruction mapping
–Inherently scalable
Massively resourced:
–Many registers
–Many functional units

6 ® Predication
[Figure: traditional architectures use a cmp followed by a branch to the then or else block; IA-64 uses a cmp that sets predicates p1 and p2, which guard the then and else paths.]
Removes branches, converts to predicated execution
–Executes multiple paths simultaneously
Increases performance by exposing parallelism and reducing critical path
–Better utilization of wider machines
–Reduces mispredicted branches
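As an illustration of the branch-to-predicate conversion described on this slide, here is a minimal Python sketch. The function names and the absolute-difference example are assumptions chosen for illustration, not taken from the slides; the conditional selects only model "commit if the qualifying predicate is true".

```python
def branchy(a, b):
    # Traditional form: one compare, then a branch selects a path.
    if a > b:
        return a - b
    else:
        return b - a


def predicated(a, b):
    # cmp: one compare writes two complementary predicates.
    p1 = a > b
    p2 = not p1
    # Both paths are computed; no branch is needed.
    then_value = a - b
    else_value = b - a
    # Each result commits only if its qualifying predicate is true.
    r = 0
    r = then_value if p1 else r   # (p1) r = a - b
    r = else_value if p2 else r   # (p2) r = b - a
    return r
```

Both functions return the same value for every input; the predicated form simply has no control-flow decision for a branch predictor to get wrong.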

7 ® Predication Review
Two kinds of normal compares
–Regular
–Unconditional (nested IF’s)
Regular: p3 is set just once
(p1) p3= ...
(p2) p3= ...
(p3) ...
Unconditional: p3 and p4 are AND’ed with p2
p1,p2 <- ...
(p2) p3,p4 <- cmp.unc ...
(p3) ... [p2&p3]
(p4) ... [p2&p4]
Opportunity for Even More Parallelism

8 ® Introducing Parallel Compares
Three new types of compares:
–AND: both target predicates set FALSE if compare is false
–OR: both target predicates set TRUE if compare is true
–ANDOR: if true, sets one TRUE, sets other FALSE
[Figure: compares A, B, C, D shown as a serial chain and as a parallel arrangement.]
Reduces Critical Path

9 ® Eight Queens Example: Unconditional Compares
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
R1=&b[j]  R3=&a[i+j]  R5=&c[i-j+7]
ld R2=[R1]  ld.s R4=[R3]  ld.s R6=[R5]
P1,P2 <- cmp.unc(R2==true)
(P1) chk.s R4
(P1) P3,P4 <- cmp.unc(R4==true)
(P3) chk.s R6
(P3) P5,P6 <- cmp.unc(R6==true)
(P5) br then
else
Cycles: 1 2 4 5 6 7
[Figure: 8 queens control flow between Then and Else, with edges labelled P1, P2, P3, P4, P5, P6.]

10 ® Eight Queens Example: Parallel Compares
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
R1=&b[j]  R3=&a[i+j]  R5=&c[i-j+7]
p1 <- true
ld R2=[R1]  ld R4=[R3]  ld R6=[R5]
p1,p2 <- cmp.and(R2==true)
p1,p2 <- cmp.and(R4==true)
p1,p2 <- cmp.and(R6==true)
(p1) br then
else
Cycles: 1 2 4 5
[Figure: 8 queens control flow between Then and Else; the edges labelled P1 to P6 of the unconditional version collapse to a single predicate P1.]

11 ® Eight Queens Example: Parallel Compares
if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))
R1=&b[j]  R3=&a[i+j]  R5=&c[i-j+7]
p1 <- true
ld R2=[R1]  ld R4=[R3]  ld R6=[R5]
p1,p2 <- cmp.and(R2==true)
p1,p2 <- cmp.and(R4==true)
p1,p2 <- cmp.and(R6==true)
(p1) br then
else
Cycles: 1 2 4 5
[Figure: 8 queens control flow; P1=true leads to Then, P1=false leads to Else.]
Reduced from 7 cycles to 5

12 ® Five Predicate Compare Types
(qp) p1,p2 <- cmp.relation
–if(qp) {p1 = relation; p2 = !relation};
(qp) p1,p2 <- cmp.relation.unc
–p1 = qp&relation; p2 = qp&!relation;
(qp) p1,p2 <- cmp.relation.and
–if(qp & (relation==FALSE)) { p1=0; p2=0; }
(qp) p1,p2 <- cmp.relation.or
–if(qp & (relation==TRUE)) { p1=1; p2=1; }
(qp) p1,p2 <- cmp.relation.or.andcm
–if(qp & (relation==TRUE)) { p1=1; p2=0; }
Tbit (Test Bit) Also Sets Predicates
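The five update rules above can be written out as a small Python sketch. The function `cmp_pred`, its string type names, and `queens_test` are hypothetical names for illustration; the initial value of p2 in `queens_test` is an assumption, since the slides only show `p1 <- true`.

```python
def cmp_pred(ctype, qp, relation, p1, p2):
    """Return the new (p1, p2) after a compare of the given type.

    qp is the qualifying predicate, relation the compare result,
    p1/p2 the current values of the two target predicates.
    """
    if ctype == "normal":
        return (relation, not relation) if qp else (p1, p2)
    if ctype == "unc":
        # Always writes both targets, even when qp is false.
        return (qp and relation, qp and not relation)
    if ctype == "and":
        return (False, False) if (qp and not relation) else (p1, p2)
    if ctype == "or":
        return (True, True) if (qp and relation) else (p1, p2)
    if ctype == "or.andcm":
        return (True, False) if (qp and relation) else (p1, p2)
    raise ValueError(f"unknown compare type: {ctype}")


def queens_test(b_j, a_ij, c_ij7):
    # p1 <- true, then three cmp.and compares with no dependence
    # on each other, so they could all issue in the same cycle.
    p1, p2 = True, True  # p2's initial value is an assumption
    for loaded in (b_j, a_ij, c_ij7):
        p1, p2 = cmp_pred("and", True, bool(loaded), p1, p2)
    return p1  # (p1) br then
```

Because each `cmp.and` can only clear the targets, the order in which the three compares are applied does not matter, which is what allows them to be scheduled in parallel.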

13 ® Predication Benefits
Reduces branches and mispredict penalties
–50% fewer branches and 37% faster code*
Parallel compares further reduce critical paths
Greatly improves code with hard to predict branches
–Large server apps: capacity limited
–Sorting, data mining: large database apps
–Data compression
Traditional architectures’ “bolt-on” approach can’t efficiently approximate predication
–Cmove: 39% more instructions, 23% slower performance*
–Instructions must all be speculative
* Source: S. Mahlke, 1995

14 ® Speculation Review
Memory latency is a major performance bottleneck in today’s systems
–CPU to memory gap increasing
[Figure: traditional architectures: instr 1; instr 2; ...; br (barrier); Load; use. IA-64: ld.s; instr 1; instr 2; br; chk.s; use.]
Allows elevation of load, even above a branch

15 ® Hoisting Uses
The uses of speculative data can also be executed speculatively
–distinguishes speculation from simple prefetch
[Figure: IA-64 sequence ld.s; instr 1; instr 2; br; chk.s; use.]
Enables Further Parallelism

16 ® Introducing the NaT (“Not a Thing”)
NaT is the GR’s 65th bit that indicates:
–whether or not an exception has occurred
–branch to fixup code required
NaT set during ld.s, checked by chk.s
[Figure: IA-64 sequence ld.s; instr 1; instr 2; br; chk.s; use. Exception detection happens at ld.s, the exception is propagated, and exception delivery happens at chk.s.]

17 ® Propagation
All computation instructions propagate NaTs to reduce number of checks
Cmp propagates “false” when writing predicates
RISC architectures require more instructions for equivalent integrity
–e.g., non faulting load
ld8.s r3 = (r9)
ld8.s r4 = (r10)
add r6 = r3, r4
ld8.s r5 = (r6)
p1,p2 = cmp(...)
sub r7 = r5,r2
chk.s r5
Allows single chk on result
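The propagation chain on this slide can be modelled with a short Python sketch. The `Reg` class, the dictionary used as memory, and the rule that a missing address stands for a deferred fault are all assumptions made for illustration; they are not the hardware definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    value: int = 0
    nat: bool = False  # models the extra NaT bit of a general register


def ld_s(memory, addr):
    # Speculative load: a fault is deferred by returning a NaT.
    # A NaT address, or an address absent from `memory`, counts as a fault here.
    if addr.nat or addr.value not in memory:
        return Reg(0, True)
    return Reg(memory[addr.value])


def add(a, b):
    # Computation propagates NaT from either source.
    return Reg(0, True) if (a.nat or b.nat) else Reg(a.value + b.value)


def sub(a, b):
    return Reg(0, True) if (a.nat or b.nat) else Reg(a.value - b.value)


def chk_s(reg):
    # True means "branch to recovery code".
    return reg.nat


def chain(memory, r9, r10, r2):
    # Mirrors the slide's sequence; only r5 is checked.
    r3 = ld_s(memory, r9)
    r4 = ld_s(memory, r10)
    r6 = add(r3, r4)
    r5 = ld_s(memory, r6)
    r7 = sub(r5, r2)
    return r7, chk_s(r5)
```

For example, with `memory = {100: 1, 200: 2, 3: 40}` the chain completes normally, while removing address 200 makes `r4` a NaT that flows through `r6` and `r5`, so the single check at the end catches a fault from any of the three loads.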

18 ® Exception Deferral: More Than Skin Deep
Deferral allows the efficient delay of costly exceptions
OS controlled deferral by hardware of:
–Page faults
–Protection violations
–…
NaTs enable deferral with recovery
Efficiently support structured exception handling in C/C++
[Figure: home block: ld.s; instr 1; instr 2; uses; br; chk.s. Recovery code: ld; uses; br home.]
Complete Solution for Exception Management

19 ® Control Speculation Summary
All loads have a speculative form that sets the NaT bit when deferring exceptions
Computational instructions propagate NaTs
OS controls deferral of faults but supported directly in HW - “no-fault speculation”
–Minimizes overhead of data that is not used
Chk more effective than non-faulting load

20 ® Store Barrier
Traditional architectures limited by the Store Barrier
[Figure: traditional architectures: instr 1; instr 2; ...; Store (*); barrier; Load (*); use.]

21 ® Introducing Data Speculation
Compiler can issue a load prior to a preceding, possibly-conflicting store
Unique feature to IA-64
[Figure: traditional architectures: instr 1; instr 2; ...; st8; barrier; ld8; use. IA-64: ld8.a; instr 1; instr 2; st8; ld.c; use.]

22 ® Data Speculation
Uses can be hoisted
Synergy with control speculation yields greater performance
[Figure: load hoisted: ld8.a; instr 1; instr 2; st8; ld.c; use. Load and use hoisted: ld8.a; instr 1; use; instr 2; st8; chk.a. Recovery code: ld8; uses; br home.]

23 ® Advanced Load Address Table - ALAT
ld.a inserts entries.
Conflicting stores remove entries
–Also: ld.c.clr, chk.a.clr,
Presence of entry indicates success
–chk.a branches when no entry is found
[Figure: the ALAT as a table of (reg #, Address) entries, accessed by ld.a reg# = ..., st, and chk.a reg#.]
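A minimal Python sketch of the ALAT behaviour listed on this slide follows. The `ALAT` class and its method names are hypothetical, and conflicts are detected by exact address match only, which is a simplifying assumption (the slide does not say how the hardware compares addresses).

```python
class ALAT:
    """Toy model: one entry per target register, keyed by register number."""

    def __init__(self):
        self.entries = {}  # register number -> address of the advanced load

    def ld_a(self, reg, addr, memory):
        # Advanced load: perform the load early and record an entry.
        self.entries[reg] = addr
        return memory[addr]

    def store(self, addr, value, memory):
        # A store removes every entry whose address conflicts with it.
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def chk_a(self, reg):
        # True means "no entry found": branch to recovery code.
        return reg not in self.entries

    def ld_c(self, reg, addr, memory, speculative_value):
        # Check load: keep the early value if the entry survived,
        # otherwise redo the load.
        if self.entries.get(reg) == addr:
            return speculative_value
        return memory[addr]
```

A load hoisted above a store to a different address keeps its entry, so `chk_a` reports success and the early value stands; a store to the same address removes the entry and forces a reload or recovery.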

24 ® Architectural Support for Data Speculation
Instructions
–ld.a - advanced loads
–ld.c - check loads
–chk.a - advanced load checks
Speculative Advanced loads - ld.sa - is an advanced load with deferral
ALAT - HW structure containing outstanding advanced loads

25 ® Speculation Benefits
Reduces impact of memory latency
–Study demonstrates performance improvement of 79% when combined with predication*
Greatest improvement to code with many cache accesses
–Large databases
–Operating systems
Scheduling flexibility enables new levels of performance headroom
* Source: August et al., 1998

26 ® Agenda
Architecture Principles
Predication & Speculation
Branch Architecture
Software Pipelining

27 ® Branch Instruction
[Figure: a 128-bit bundle (bits 0 to 127) with Template, Instruction 0 and Instruction 1 fields shown; a 41-bit branch instruction with QP and 21-bit IP-Offset fields.]
Two basic branch formats
–Relative: IP := IP + Offset21
–Indirect: IP := BR[I]
–8 branch registers for efficient branch execution
–Call/Return linking through branch registers
Loop branches with 64-bit loopcount register (LC)
–Enables perfect branch prediction of counted loops
–Traditional architectures always mispredict last iteration
–Incurs misprediction stall costing many cycles
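The bundle picture can be made concrete with a small Python sketch. The slide itself only shows a 128-bit bundle, a template, and 41-bit instructions; the 5-bit template at the low end, the three slots, and the 6-bit qualifying predicate in the low bits of each instruction follow the published IA-64 encoding and should be read as assumptions relative to this deck. The function names are hypothetical.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3  # 5 + 3 * 41 = 128


def pack_bundle(template, slots):
    # Place the template in the low bits and the slots above it.
    bundle = template & ((1 << TEMPLATE_BITS) - 1)
    for i, slot in enumerate(slots):
        bundle |= (slot & ((1 << SLOT_BITS) - 1)) << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    if not 0 <= bundle < (1 << 128):
        raise ValueError("a bundle is a 128-bit value")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


def qualifying_predicate(instruction):
    # Low 6 bits select one of 64 predicate registers.
    return instruction & 0x3F
```

Packing and then unpacking returns the original template and slots, which is a quick way to confirm that the field widths add up to 128 bits.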

28 ® Branch Predicates
(p1) BR #label_A; Conditional branches
(p0) BR #label_A; Unconditional branches (p0 is “always true”)
[Figure: the conditional branch goes to A when P1=true and to B when P1=false; the unconditional branch always goes to A.]
Compiler directed static prediction augments dynamic prediction
–Better predict highly correlated branches (always/never taken)
–Frees space in H/W predictor
–Can give hint for dynamic predictor

29 ® Queens Loop: Parallel Compares & Compare-branch
R1=&b[j]  R3=&a[i+j]  R5=&c[i-j+7]
p1 <- true
ld R2=[R1]  ld R4=[R3]  ld R6=[R5]
p1,p2 <- cmp.and(R2==true)
p1,p2 <- cmp.and(R4==true)
p1,p2 <- cmp.and(R6==true)
(p1) br then
else
Cycles: 1 2 4
Compare & Branch in Same Cycle
From 5 Cycles Down to 4

30 ® Multi-way Branch
Multiway branches: more than 1 branch in a single cycle
Allows n-way branching
w/o Speculation:
ld8 r6 = (ra)
(p1) br exit1
ld8 r7 = (rb)
(p3) br exit2
ld8 r8 = (rc)
(p5) br exit3
Hoisting Loads:
ld8.s r6 = (ra)
ld8.s r7 = (rb)
ld8.s r8 = (rc)
chk r6, rec0
(p1) br exit1
chk r7, rec1
(p3) br exit2
chk r8, rec2
(p5) br exit3
IA-64:
ld8.s r6 = (ra)
ld8.s r7 = (rb)
ld8.s r8 = (rc)
{ chk r6, rec0
(p2) chk r7, rec1
(p4) chk r8, rec2 }
{ (p1) br exit1
(p3) br exit2
(p5) br exit3 }
[Figure: control flow with edges labelled P1 to P6; the sequential versions take 3 branch cycles, the multiway version takes 1 branch cycle.]
Supports Aggressive Speculation

31 ® Software Pipelining
Overlapping execution of different loop iterations
[Figure: sequential loop execution vs. software pipelined loop execution.]
More iterations in same amount of time

32 ® Software Pipelining
IA-64 features that make this possible
–Full Predication
–Special branch handling features
–Register rotation: removes loop copy overhead
–Predicate rotation: removes prologue & epilogue
Traditional architectures use loop unrolling
–High overhead: extra code for loop body, prologue, and epilogue
Especially Useful for Integer Code With Small Number of Loop Iterations

33 ® Basic Loop Example
for (i=0; i<n; i++) { *b++ = *a++; } /* MemCopy */
Basic Copy Loop (3 ops):
// setup ra/rb/lc, check n!=0
.label loop
{ ld8 r35 = [ra],8 }
{ st8 [rb],8 = r35
br.cloop #loop }
Execution (cycles 1 to 8): ld 1 | st 1, br.cloop | ld 2 | st 2, br.cloop | ld 3 | st 3, br.cloop | ld 4 | st 4, br.cloop
Simple Non-overlapping iterations
–2 cycles per iteration
–3 operations in loop body

34 ® Loop Support: Unrolling
Unrolled Copy Loop (10 ops):
// Test for loop count 0,1
ld8 r34 = [ra],8
.label loop
ld8 r35 = [ra],8
st8 [rb],8 = r34
br.cle #e-exit
ld8 r34 = [ra],8
st8 [rb],8 = r35
br.cloop #loop
st8 [rb],8 = r34
br #thru
.label e-exit
st8 [rb],8 = r35
.label thru
[Figure: execution cycles 1 to 5 divided into prologue, main loop and epilogue; operations shown are ld 1, st 1, ld 2, st 2, ld 3, st 3, ld 4, st 4, br.cle and br.cloop.]
Overlapped iterations
–1 cycle per word
–1.6X performance improvement
–3.3X code expansion
Incurs Code Expansion Penalties
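The 1.6X and 3.3X figures can be reproduced from the cycle and operation counts shown on the slides (8 cycles versus 5 cycles for four words, 10 ops versus 3 ops). The sketch below assumes the general formulas 2n and n + 1 cycles for n words, which match the four-word diagrams but are otherwise an extrapolation; the function names are hypothetical.

```python
def cycles_non_overlapped(n_words):
    # Basic loop: ld in one cycle, st + br.cloop in the next.
    return 2 * n_words


def cycles_overlapped(n_words):
    # Overlapped loop: one load up front, then one word per cycle.
    return n_words + 1


def speedup(n_words):
    return cycles_non_overlapped(n_words) / cycles_overlapped(n_words)


def code_expansion(ops, baseline_ops=3):
    # Ratio of static operation counts against the 3-op basic loop.
    return ops / baseline_ops
```

For four words this gives 8 / 5 = 1.6, and the unrolled loop's 10 operations against the basic loop's 3 give roughly 3.3; under the same assumption the speedup approaches 2 as the word count grows.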

35 ® Software Register Renaming
Traditional Architecture: registers ... R32 R33 R34 R35 ...
ld 1 r34

36 ® Software Register Renaming
Traditional Architecture: registers ... R32 R33 R34 R35 ...
st 1 r34, ld 2 r35

37 ® Software Register Renaming
Traditional Architecture: registers ... R32 R33 R34 R35 ...
st 2 r35, ld 3 r34

38 ® Software Register Renaming
Traditional Architecture: registers ... R32 R33 R34 R35 ...
ld 4 r35, st 3 r34

39 ® Software Register Renaming
Traditional Architecture: registers ... R32 R33 R34 R35 ...
st 4 r35

40 ® Introducing Rotating Registers
GR 32-127, FR 32-127 can rotate
Separate Rotating Register Base for each: GRs, FRs
Loop branches decrement all register rotating bases (RRB)
Instructions contain a “virtual” register number
–RRB + virtual register number = physical register number.
[Animation frame: memory holds the words “Palm”, “Springs”, “is”, “Sunny”; RRB=0; ld 1 R35 loads “Palm”.]

41 ® Introducing Rotating Registers (IA-64)
[Animation frame: RRB=0; ld 2 R34 loads “Springs”; st 1 R35 stores “Palm”.]

42 ® Introducing Rotating Registers (IA-64)
[Animation frame: RRB=-1; ld 3 R34 loads “is”; st 2 R35 stores “Springs”.]

43 ® Introducing Rotating Registers (IA-64)
[Animation frame: RRB=-2; ld 4 R34 loads “Sunny”; st 3 R35 stores “is”.]

44 ® Introducing Rotating Registers (IA-64)
[Animation frame: RRB=-3; st 4 R35 stores “Sunny”.]
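The rule "RRB + virtual register number = physical register number" needs a wrap-around within the rotating region, which the animation frames show when register 127 appears next to 32. A minimal Python sketch, assuming the whole GR 32-127 range rotates as the slide states (the constant and function names are hypothetical):

```python
ROT_BASE = 32   # first rotating general register
ROT_SIZE = 96   # GR 32-127


def physical_register(virtual, rrb):
    if virtual < ROT_BASE:
        return virtual  # registers below the rotating region are static
    # Add RRB, wrapping within the rotating region.
    return ROT_BASE + (virtual - ROT_BASE + rrb) % ROT_SIZE
```

With RRB=0 a load into R34 lands in physical register 34; after one loop branch (RRB=-1) the name R35 maps to that same physical register, so the next iteration's store finds the value without any copy instruction.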

45 ® Loop Support: Rotating Registers
Software Pipelined Copy Loop (5 ops):
// setup ra/rb/lc/ec, check n > 2
{ ld8 r35 = [ra],8 }
.label loop
{ ld8 r34 = [ra],8
st8 [rb] = r35,8
br.ctop #loop }
{ st8 [rb] = r35,8 }
[Figure: execution cycles 1 to 5 divided into prologue, main loop and epilogue; operations shown are ld 1, st 1, ld 2, st 2, ld 3, st 3, ld 4, st 4 and br.ctop.]
Modulo Scheduled Iterations
–1 cycle per word
–1.6X performance improvement
–additional upside for higher latency conditions
–1.7X code expansion

46 ® Introducing Rotating Predicate Registers
PR16-63 can rotate, with separate Rotating Register Base
Loop branches decrement all register rotating base (RRB)
Instructions contain a “virtual” predicate register number
–RRB + virtual register number = physical register number.
Code: (p16) ld R34; (p17) st R35
[Animation frame, Initialize: RRB=0, LC=3, EC=2; all rotating predicates start at 0, then p16 is set to 1. Executes (p16) ld 1 R34.]

47 ® Introducing Rotating Predicate Registers
Code: (p16) ld R34; (p17) st R35
[Animation frame, Branch 1: LC=2, EC=2, RRB=-1. Executes (p17) st 1 R35 and (p16) ld 2 R34.]

48 ® Introducing Rotating Predicate Registers
Code: (p16) ld R34; (p17) st R35
[Animation frame, Branch 2: LC=1, EC=2, RRB=-2. Executes (p17) st 2 R35 and (p16) ld 3 R34.]

49 ® Introducing Rotating Predicate Registers
Code: (p16) ld R34; (p17) st R35
[Animation frame, Branch 3: LC=0, EC=2, RRB=-3. Executes (p17) st 3 R35 and (p16) ld 4 R34.]

50 ® Introducing Rotating Predicate Registers
Code: (p16) ld R34; (p17) st R35
[Animation frame, Branch 4: LC=0, EC=1, RRB=-4. p16 is now 0, so (p16) ld R34 is nullified; (p17) st 4 R35 executes.]

51 ® Introducing Rotating Predicate Registers
Code: (p16) ld R34; (p17) st R35
[Animation frame, Fall Through: LC=0, EC=0, RRB=-5. The loop branch is not taken and execution falls through.]

52 ® Loop Support: Rotating Predicates
Software Pipelined Copy Loop (3 ops):
// setup ra/rb/lc/ec, check n > 1
.label loop
{ (p16) ld8 r34 = [ra],8
(p17) st8 [rb] = r35,8
br.ctop #loop }
Execution (cycles 1 to 5, each ending in br.ctop): ld 1 (st nullified) | ld 2, st 1 | ld 3, st 2 | ld 4, st 3 | st 4 (ld nullified)
Software Pipelined MemCopy
–1 cycle per word
–1.6X performance improvement
–no code expansion
Efficient Loop, Efficient Code Size
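The interaction of LC, EC, rotating registers and rotating predicates in this three-operation loop can be traced with a short Python simulation. This is a sketch under stated assumptions: the function name is hypothetical, a single `rrb` counter stands in for the separate GR and PR bases (they are decremented together by loop branches), LC is set to n - 1 and EC to 2 as in the four-word animation, and the loop-branch rule is inferred from the animation frames rather than quoted from an instruction reference.

```python
def pipelined_copy(src):
    """Simulate: (p16) ld r34 ; (p17) st r35 ; br.ctop, one cycle per pass."""
    n = len(src)
    if n == 0:
        return [], [], (0, 0, 0)
    gr = {}               # physical general registers actually written
    pr = [False] * 64     # physical predicate registers
    rrb = 0
    lc, ec = n - 1, 2     # 4 words -> LC=3, EC=2, as in the slides

    def g(v):             # virtual -> physical general register
        return 32 + (v - 32 + rrb) % 96

    def p(v):             # virtual -> physical predicate register
        return 16 + (v - 16 + rrb) % 48

    pr[p(16)] = True      # initialize: enable the load stage
    dst, trace, ra = [], [], 0
    while True:
        do_ld, do_st = pr[p(16)], pr[p(17)]
        ops = []
        if do_st:
            dst.append(gr[g(35)])       # (p17) st r35
        if do_ld:
            gr[g(34)] = src[ra]         # (p16) ld r34
            ra += 1
            ops.append("ld")
        if do_st:
            ops.append("st")
        trace.append(ops)
        # Loop branch: count down LC first, then drain with EC.
        if lc > 0:
            lc, next_p16, taken = lc - 1, True, True
        elif ec > 1:
            ec, next_p16, taken = ec - 1, False, True
        else:
            ec, next_p16, taken = 0, False, False
        rrb -= 1                        # rotate registers and predicates
        pr[p(16)] = next_p16            # new value seen as p16 next cycle
        if not taken:
            break
    return dst, trace, (lc, ec, rrb)
```

Running it on `["Palm", "Springs", "is", "Sunny"]` yields the copied words in five cycles, with the first cycle doing only a load and the last only a store, and ends with LC=0, EC=0, RRB=-5 as in the fall-through frame. The prologue and epilogue come from the predicates alone, which is why the loop body needs no extra code.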

53 ® Software Pipelining Benefits
Loop pipelining maximizes performance; minimizes overhead
–Avoids code expansion of unrolling and code explosion of prologue and epilogue
–Smaller code means fewer cache misses
–Greater performance improvements in higher latency conditions
Reduced overhead allows S/W pipelining of small loops with unknown trip counts
–Typical of integer scalar codes

54 ® Reviewing What’s New:
Parallel compares
Tbit
NaT bits
Deferral
Hoisting uses
Propagation
Branch instructions
Static prediction
Advanced loads
ALAT
Loop branches
LC register
EC register
Multiway branch
Branch registers
Register rotation
Predicate rotation
RRB

55 ® Summary
Speculation reduces memory latency impact
–IA-64 removes recovery from critical path
–Benefits applications with poor cache locality: server applications, OS
Predication removes branches
–Parallel compares increase parallelism
–Benefits complex control flow: large databases
S/W pipelining support with minimal overhead enables broad usage
–Performance for small integer loops with unknown trip counts as well as monster FP loops

