IA-64 Architecture Muammer YÜZÜGÜLDÜ CMPE /12/2004
IA-64 Architecture Agenda
- Architecture Basics
- Predication
- Speculation
- Software Pipelining
Architectural Basics
- A full 64-bit address space
- Large directly accessible register files
- Enough instruction bits to communicate information from the compiler to the hardware
- The ability to express arbitrarily large amounts of ILP
Register Resources
- 128 64-bit general registers
- 128 82-bit floating-point registers
- Space for up to 128 64-bit special-purpose application registers
- 8 64-bit branch registers for function call linkage and return
- 64 one-bit predicate registers that hold the results of conditional expression evaluation
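Instruction Encoding
- Three 41-bit instructions are packed into each 128-bit bundle
- Per instruction: 14 bits for the opcode, 7 bits for each of three register fields (enough to address 128 registers), and 6 bits for the qualifying predicate
- Per bundle: 5 bits for the template, which helps decode and route the instructions and indicates the location of stops that mark the end of groups of instructions that can execute in parallel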
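The field widths above can be checked with a little arithmetic: 14 + 3 × 7 + 6 = 41 bits per instruction, and 3 × 41 + 5 = 128 bits per bundle. The following is a minimal sketch of splitting a bundle into its template and three instruction slots. It assumes the template occupies the low 5 bits with the slots following in order; the function names `split_bundle` and `pack_bundle` are hypothetical helpers for illustration, not part of any IA-64 toolchain.

```python
# Sketch: split a 128-bit IA-64 bundle into template and instruction slots.
# Assumed layout: template in bits 0-4, then three consecutive 41-bit slots.
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3


def split_bundle(bundle):
    """Return (template, [slot0, slot1, slot2]) for a 128-bit integer."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = []
    for i in range(SLOTS_PER_BUNDLE):
        shift = TEMPLATE_BITS + i * SLOT_BITS
        slots.append((bundle >> shift) & ((1 << SLOT_BITS) - 1))
    return template, slots


def pack_bundle(template, slots):
    """Inverse of split_bundle; used here only to build example values."""
    value = template & ((1 << TEMPLATE_BITS) - 1)
    for i, slot in enumerate(slots):
        shift = TEMPLATE_BITS + i * SLOT_BITS
        value |= (slot & ((1 << SLOT_BITS) - 1)) << shift
    return value
```

Packing a template and three slot values with `pack_bundle` and passing the result to `split_bundle` returns the same fields, which is a quick way to confirm that the widths add up to exactly 128 bits.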
IA-64 Virtual Memory Model
- Can map 16 million Gbytes of virtual space
- The upper three bits (63:61) of a virtual address index into eight region registers that contain 24-bit region identifiers (RIDs)
- The 24-bit RID is concatenated with the virtual page number (VPN) to form a unique lookup into the translation lookaside buffer (TLB)
IA-64 Virtual Memory Model
- The TLB lookup generates two main items: the physical page number and access privileges
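The lookup described above can be sketched in a few lines. This is a simplified model, not the hardware algorithm: it assumes the region is selected by the top three address bits (63:61), a single fixed 4 KiB page size (IA-64 supports several), and a TLB modeled as a plain dictionary. The names `tlb_key` and `translate` are hypothetical.

```python
# Sketch: form the (RID, VPN) TLB key from a 64-bit virtual address.
REGION_SHIFT = 61          # assumed: bits 63:61 select one of 8 region registers
PAGE_SHIFT = 12            # assumed: 4 KiB pages for this example only
RID_MASK = (1 << 24) - 1   # region identifiers are 24 bits wide


def tlb_key(vaddr, region_registers):
    """Return the (rid, vpn) pair used to look up the TLB."""
    region = (vaddr >> REGION_SHIFT) & 0x7
    rid = region_registers[region] & RID_MASK
    vpn = (vaddr & ((1 << REGION_SHIFT) - 1)) >> PAGE_SHIFT
    return rid, vpn


def translate(vaddr, region_registers, tlb):
    """Return (physical_address, access_rights), or None on a TLB miss."""
    entry = tlb.get(tlb_key(vaddr, region_registers))
    if entry is None:
        return None
    ppn, rights = entry
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    return (ppn << PAGE_SHIFT) | offset, rights
```

Because the RID is part of the key, two address spaces that use the same virtual page number but different region identifiers map to different TLB entries.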
Predication
- Removes branches, converts to predicated execution
- Executes multiple paths simultaneously
- Increases performance by exposing parallelism and reducing critical path
- Better utilization of wider machines
- Reduces mispredicted branches
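To make the idea concrete, here is a small Python model of if-conversion. It is only an illustration of the principle, not IA-64 assembly: a compare writes a predicate and its complement, both arms of the original branch are then executed as straight-line code, and each result is committed only when its guarding predicate is true. The register and predicate names and the `run_predicated` helper are hypothetical.

```python
# Sketch: if-conversion of "r3 = (r1 > r2) ? r1 - r2 : r2 - r1".
import operator


def run_predicated(program, regs, preds):
    """Run straight-line (predicate, dest, op, src1, src2) instructions."""
    for qp, dest, op, src1, src2 in program:
        value = op(regs[src1], regs[src2])  # every instruction is evaluated
        if preds[qp]:                       # but only committed if qp is true
            regs[dest] = value
    return regs


def abs_diff_predicated(a, b):
    regs = {"r1": a, "r2": b, "r3": 0}
    preds = {"p1": False, "p2": False}
    # Models a compare that sets p1 and its complement p2 in one step.
    preds["p1"] = regs["r1"] > regs["r2"]
    preds["p2"] = not preds["p1"]
    program = [
        ("p1", "r3", operator.sub, "r1", "r2"),  # "then" arm
        ("p2", "r3", operator.sub, "r2", "r1"),  # "else" arm
    ]
    run_predicated(program, regs, preds)
    return regs["r3"]
```

Neither arm depends on the other, so a wide machine could issue both in the same cycle, and there is no branch left to mispredict.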
Predication Benefits
- Reduces branches and mispredict penalties
  - 50% fewer branches and 37% faster code*
  - Parallel compares further reduce critical paths
- Greatly improves code with hard-to-predict branches
  - Large server apps: capacity limited
  - Sorting, data mining: large database apps
  - Data compression
- Traditional architectures’ “bolt-on” approach can’t efficiently approximate predication
  - Cmove: 39% more instructions, 23% slower performance*
  - Instructions must all be speculative
Data Speculation
- The compiler can issue a load prior to a preceding, possibly conflicting store
Architectural Support for Data Speculation
- Instructions
  - ld.a: advanced loads
  - ld.c: check loads
  - chk.a: advanced load checks
- Speculative advanced loads: ld.sa is an advanced load with deferral
- ALAT (Advanced Load Address Table): hardware structure containing outstanding advanced loads
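The interaction between advanced loads, stores, and check loads can be modeled roughly as follows. This is a deliberately simplified sketch: the table is unbounded, entries are keyed by target register, conflicts are detected by exact address match, and a check never removes an entry. Real hardware differs on all of these points. The class `Machine` and its method names are hypothetical.

```python
# Sketch: a toy model of ld.a / store / ld.c using an ALAT-like table.
class Machine:
    def __init__(self, memory):
        self.memory = dict(memory)
        self.regs = {}
        self.alat = {}  # target register -> address of outstanding advanced load

    def ld_a(self, reg, addr):
        """Advanced load: load early and record the access in the table."""
        self.regs[reg] = self.memory[addr]
        self.alat[reg] = addr

    def store(self, addr, value):
        """Store: write memory and invalidate entries for the same address."""
        self.memory[addr] = value
        for reg in [r for r, a in self.alat.items() if a == addr]:
            del self.alat[reg]

    def ld_c(self, reg, addr):
        """Check load: reload only if the entry was invalidated.

        Returns True when a reload was needed.
        """
        if reg in self.alat:
            return False
        self.regs[reg] = self.memory[addr]
        return True
```

When the intervening store touches a different address, the check finds the entry and the early-loaded value stands. When the store hits the same address, the entry is gone and the check reloads the fresh value.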
Speculation Benefits
- Reduces impact of memory latency
  - Study demonstrates performance improvement of 79% when combined with predication*
- Greatest improvement to code with many cache accesses
  - Large databases
  - Operating systems
- Scheduling flexibility enables new levels of performance headroom
Speculation Drawbacks
- Drawbacks even if speculation is correct
  - Registers used speculatively must be kept alive until the check (increases register pressure)
  - For each speculation, recovery code is needed, which increases code size
- Drawbacks if speculation is incorrect
  - Recovery code has to be executed, costing additional cycles
  - Recovery code may not be in cache, causing a loading delay
Software Pipelining
- Overlapping execution of different loop iterations
- More iterations in the same amount of time
Software Pipelining
- IA-64 features that make this possible
  - Full predication
  - Special branch handling features
  - Register rotation: removes loop copy overhead
  - Predicate rotation: removes prologue and epilogue
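The following Python model illustrates how rotation lets a single kernel serve as prologue, steady state, and epilogue. It pipelines `out[i] = src[i] + 1` in three stages (load, add, store). It is a sketch under stated assumptions, not IA-64 code: the loop-count and epilogue-count handling is only loosely modeled on a counted loop branch, and the function name and the lists `preds` and `rot` are hypothetical stand-ins for rotating predicates and registers.

```python
# Sketch: a three-stage software pipeline with rotating registers/predicates.
def pipelined_increment(src):
    """Return (out, kernel_runs) where out[i] == src[i] + 1."""
    n = len(src)
    out = [None] * n
    if n == 0:
        return out, 0
    preds = [True, False, False]  # one stage predicate per pipeline stage
    rot = [None] * 4              # rotating registers shared by the stages
    lc = n - 1                    # loop count: iterations still to be started
    ec = 3                        # epilogue count: stages still to be drained
    load_i = store_i = 0
    kernel_runs = 0
    while True:
        kernel_runs += 1
        # Kernel body: the same three guarded instructions every time.
        if preds[0]:              # stage 0: load
            rot[0] = src[load_i]
            load_i += 1
        if preds[1]:              # stage 1: add, reads last iteration's load
            rot[2] = rot[1] + 1
        if preds[2]:              # stage 2: store, reads last iteration's add
            out[store_i] = rot[3]
            store_i += 1
        # Loop branch: decide the next stage-0 predicate, then rotate.
        if lc > 0:
            lc -= 1
            next_p0 = True        # start another source iteration
        elif ec > 1:
            ec -= 1
            next_p0 = False       # draining: no new iterations start
        else:
            break
        preds = [next_p0] + preds[:-1]
        rot = [None] + rot[:-1]   # a value written at index k is read at k + 1
    return out, kernel_runs
```

For `n` elements the kernel runs `n + 2` times. The first two runs act as the prologue and the last two as the epilogue, purely because some stage predicates are false; no separate prologue or epilogue code and no register-to-register copies are written out.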
Software Pipelining Benefits
- Loop pipelining maximizes performance and minimizes overhead
  - Avoids the code expansion of unrolling and the code explosion of prologue and epilogue
  - Smaller code means fewer cache misses
  - Greater performance improvements in higher latency conditions
- Reduced overhead allows software pipelining of small loops with unknown trip counts
  - Typical of integer scalar codes
Performance
- Backward compatible with previous instruction sets (PA-RISC, IA-32) through emulation, although emulated code performs poorly
- IA-64 code (EPIC instruction set) will run on any member of the Itanium family
- To get optimum performance, code must be recompiled with processor-specific information (different numbers of functional units, pipeline changes)
- Itanium 2 is about twice as fast as the original Itanium
Target Market
- High-end servers
- Database machines
- Development shops
- NOT suitable for home PCs
Questions ?