Compiler Optimizations for Modern Hardware Architectures - Part II Bob Wall CS 550 (Fall 2003) Class Presentation Compiling for the Intel® Itanium® – A.

Slides:



Advertisements
Similar presentations
1/1/ / faculty of Electrical Engineering eindhoven university of technology Speeding it up Part 3: Out-Of-Order and SuperScalar execution dr.ir. A.C. Verschueren.
Advertisements

Computer Organization and Architecture
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.
POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? ILP: VLIW Architectures Marco D. Santambrogio:
CS252 Graduate Computer Architecture Spring 2014 Lecture 9: VLIW Architectures Krste Asanovic
1 Advanced Computer Architecture Limits to ILP Lecture 3.
Microprocessors VLIW Very Long Instruction Word Computing April 18th, 2002.
Computer Organization and Architecture
Chapter 14 Superscalar Processors. What is Superscalar? “Common” instructions (arithmetic, load/store, conditional branch) can be executed independently.
Chapter XI Reduced Instruction Set Computing (RISC) CS 147 Li-Chuan Fang.
Chapter 15 IA-64 Architecture No HW, Concentrate on understanding these slides Next Monday we will talk about: Microprogramming of Computer Control units.
ELEC Fall 05 1 Very- Long Instruction Word (VLIW) Computer Architecture Fan Wang Department of Electrical and Computer Engineering Auburn.
Prof. John Nestor ECE Department Lafayette College Easton, Pennsylvania Computer Organization Pipelined Processor Design 3.
Microprocessors Introduction to ia64 Architecture Jan 31st, 2002 General Principles.
EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.
RISC. Rational Behind RISC Few of the complex instructions were used –data movement – 45% –ALU ops – 25% –branching – 30% Cheaper memory VLSI technology.
NYU DARPA DIS kick-off September 24, Comparing IA-64 and HPL-PD NYU.
Chapter 15 IA-64 Architecture. Reflection on Superscalar Machines Superscaler Machine: A Superscalar machine employs multiple independent pipelines to.
Chapter 21 IA-64 Architecture (Think Intel Itanium)
IA-64 Architecture (Think Intel Itanium) also known as (EPIC – Extremely Parallel Instruction Computing) a new kind of superscalar computer HW 5 - Due.
Chapter 14 Instruction Level Parallelism and Superscalar Processors
Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)
Reduced Instruction Set Computers (RISC) Computer Organization and Architecture.
IA-64 ISA A Summary JinLin Yang Phil Varner Shuoqi Li.
 Arun Hariharan (N.M.S.U). MOTIVATION  Need for high speed computing and Architecture More complex compilers (JAVA) Large Database Systems Distributed.
The Arrival of the 64bit CPUs - Itanium1 นายชนินท์วงษ์ใหญ่รหัส นายสุนัยสุขเอนกรหัส
Is Out-Of-Order Out Of Date ? IA-64’s parallel architecture will improve processor performance William S. Worley Jr., HP Labs Jerry Huck, IA-64 Architecture.
Anshul Kumar, CSE IITD CS718 : VLIW - Software Driven ILP Example Architectures 6th Apr, 2006.
RISC By Ryan Aldana. Agenda Brief Overview of RISC and CISC Features of RISC Instruction Pipeline Register Windowing and renaming Data Conflicts Branch.
Spring 2003CSE P5481 VLIW Processors VLIW (“very long instruction word”) processors instructions are scheduled by the compiler a fixed number of operations.
Introducing The IA-64 Architecture - Kalyan Gopavarapu - Kalyan Gopavarapu.
IA-64 Architecture RISC designed to cooperate with the compiler in order to achieve as much ILP as possible 128 GPRs, 128 FPRs 64 predicate registers of.
Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.
CS 211: Computer Architecture Lecture 6 Module 2 Exploiting Instruction Level Parallelism with Software Approaches Instructor: Morris Lancaster.
Ted Pedersen – CS 3011 – Chapter 10 1 A brief history of computer architectures CISC – complex instruction set computing –Intel x86, VAX –Evolved from.
ECEG-3202 Computer Architecture and Organization Chapter 7 Reduced Instruction Set Computers.
Next Generation ISA Itanium / IA-64. Operating Environments IA-32 Protected Mode/Real Mode/Virtual Mode - if supported by the OS IA-64 Instruction Set.
VLIW CSE 471 Autumn 021 A (naïve) Primer on VLIW – EPIC with slides borrowed/edited from an Intel-HP presentation VLIW direct descendant of horizontal.
1 Lecture 12: Advanced Static ILP Topics: parallel loops, software speculation (Sections )
Unit II Intel IA-64 and Itanium Processor By N.R.Rejin Paul Lecturer/VIT/CSE CS2354 Advanced Computer Architecture.
IA64 Complier Optimizations Alex Bobrek Jonathan Bradbury.
IA-64 Architecture Muammer YÜZÜGÜLDÜ CMPE /12/2004.
CS 352H: Computer Systems Architecture
William Stallings Computer Organization and Architecture 8th Edition
William Stallings Computer Organization and Architecture 8th Edition
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue
COSC3330 Computer Architecture
CS203 – Advanced Computer Architecture
Chapter 14 Instruction Level Parallelism and Superscalar Processors
Henk Corporaal TUEindhoven 2009
The EPIC-VLIW Approach
Yingmin Li Ting Yan Qi Zhao
Lecture: Static ILP Topics: predication, speculation (Sections C.5, 3.2)
Henk Corporaal TUEindhoven 2011
Control unit extension for data hazards
Sampoorani, Sivakumar and Joshua
Instruction Level Parallelism (ILP)
CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue
VLIW direct descendant of horizontal microprogramming
Chapter 12 Pipelining and RISC
Control unit extension for data hazards
Control unit extension for data hazards
Presentation transcript:

Compiler Optimizations for Modern Hardware Architectures - Part II Bob Wall CS 550 (Fall 2003) Class Presentation Compiling for the Intel® Itanium® – A Processor of EPIC Proportions

Acknowledgements As part of the presentation, I refer to slides from a presentation by Dr. Yong-fong Lee of Intel, "An Overview of IA-64 Architectural Features and Compiler Optimization“, presented at the 14th International Conference on Parallel and Distributed Computing Systems (PDCS-2001), Aug. 8-10, 2001, Richardson, TX. A link to the PDF file is on my presentation site – and here too:

Explicitly Parallel Instruction-set Computing – a new hardware architecture paradigm that is the successor to VLIW (Very Long Instruction Word). EPIC features: –Parallel instruction encoding –Flexible instruction grouping –Large register file –Predicated instruction set –Control and data speculation –Compiler control of memory hierarchy What is EPIC?

History A timeline of architectures: Hennessy and Patterson were early RISC proponents. VLIW (very long instruction word) architectures, –Alan Charlesworth (Floating Point Systems) –Josh Fisher (Yale / Multiflow – Multiflow Trace machines) –Bob Rau (TRW / Cydrome – Cydra-5 machine) Hard-wired discrete logic Microprocessor (advent of VLSI) CISC (microcode) RISC (microcode bad!) VLIW Superpipelined, Superscalar EPIC

History (cont.) Superscalar processors –Tilak Agerwala and John Cocke of IBM coined term –First microprocessor (Intel i960CA) and workstation (IBM RS/6000) introduced in –Intel Pentium in Foundations of EPIC - HP –HP FAST (Fine-grained Architecture and Software Technologies) research project in 1989 included Bob Rau; this work later developed into the PlayDoh architecture. – In 1990 Bill Worley started the PA-Wide Word project (PA-WW). Josh Fisher, also hired by HP, made contributions to these projects. –HP teamed with Intel in 1994 – announced EPIC, IA-64 architecture in –First implementation was Merced / Itanium, announced in 1999.

IA-64 / Itanium IA-64 is not a complete architecture – it is more a set of high-level requirements for an architecture and instruction set, co-developed by HP and Intel. Itanium is Intel’s implementation of IA-64. HP announced an IA-64 successor to their PA-RISC line, but apparently abandoned it in favor of joint development of the Itanium 2. Architectural features: –No complex out-of-order logic in processor – leave it for compiler –Provide large register file, multiple execution units (again, compiler has visibility to explicitly manage) –Add hardware support for advanced ILP optimization: predication, control and data speculation, register rotation, multi-way branches, parallel compares, visible memory hierarchy –Load / store architecture – no direct-to-memory instructions

Hardware Resources General (user-accessible) registers – bit general registers; each has 1-bit NaT flag attached (GR0 hardwired to 0) – bit floating-point registers –64 1-bit predicate registers (p0 hardwired to 0) –8 64-bit branch registers Special application registers –Current Frame Marker (CFM), Loop Counter (LC), Epilog Counter (EC) Advanced Load Address Table (ALAT) Register Stack Engine (RSE)

IA-64 Instruction Formats Instruction Bundle – three instructions per 128-bit “word” Template indicates how each instruction is used –Instructions are divided into types: –Predefined templates Instr 2 – 41 bitsInstr 1 – 41 bitsInstr 0 – 41 bitstemplate – 5 bits F (float) L+X (long immed.) M (memory) I (integer) B (branch) Branch – MIB, MMB, MFB, MBB, BBB All come with or without stop at end Regular – MII, MLX, MMI, MFI, MMF Stop – MI_I, M_MI (treated as separate groups)

Instruction Groups Sequence of instructions with no register dependencies (can all be executed in parallel) –Write after read dependencies are allowed –Memory operations still require sequential execution Stops delimit groups, so a group of parallel instructions can span bundles (allows new hardware implementation to add functional units – it would still be able to execute code compiled for a smaller number of units). –Itanium 2 can issue up to six instructions simultaneously

The IA-64 includes 64 special-purpose registers that can be used to predicate most of the normal instructions. For example, this allows the following conversion: Most special compare / test instructions write the result and its complement into two different predicate registers Predication converts control dependencies into data dependencies, allowing more scheduling options Predication if (a > b) then c = c + a else c = c + b cmp.lt p1, p2 = a, b (p1) c = c + a (p2) c = c + b

Predication (cont.) The removal of branches has the following effects: –Reduces branch mis-predicts and pipeline bubbles –Improves instruction fetch efficiency –Better utilizes wide-issue instruction efficiency The down-side –Instruction cache “polluted” with non-executed code –Execution units used to execute instructions with false predicates –Need to determine when it will be beneficial – determine hard-to- predict branches

Parallel Compares Used in conjunction with predication – allows multiple checks to be executed in a single instruction group Add.and and.or suffixes to comparison instructions Example: –If ( A & B & C ){ S }; –cmp.eq p1 = r0,r0 ;; // initialize p1=1 cmp.ne.and p1 = rA,0 cmp.ne.and p1 = rB,0 cmp.ne.and p1 = rC,0 (p1) S

Multi-way Branches Also used in conjunction with predication – allows multiple branch targets within a single instruction group Example: –(p1) br.cond target_1 (p2) br.cond target_2 (p3) br.call b1 Takes the branch associated with the first true condition, or falls through if no predicate set

Control Speculation The IA-64 instruction set allows speculative loads – load a register with a value from memory, but do not generate any exceptions. Exceptions can be deferred; when the program is ready to use the register, it executes an instruction that checks - if an exception was generated, it branches to compiler-generated patch-up code to re-execute the load and jump back to the point where execution should continue. Allows “hoisting” of loads above branches – provides more options for concurrent instruction scheduling. In conjunction with parallel instruction execution, masks memory latency.

Control Speculation (cont.) Hardware implementation –Add extra NaT bit to each general register, special NaTVal value for floating point register –Add speculative load (ld.s) instruction (sets NaT if exception occurs), and speculation check (chk.s) instruction to determine if exception occurred earlier (conditional branch if NaT). –NaT propagates through operations The down-side –Increased code size –Extended live range for register –Might unnecessarily execute ld.s –Don’t speculatively move load out of unlikely block

Data Speculation The IA-64 instruction set also allows advanced loads – load a register from memory, but remember that it might contain a value that is subsequently overwritten by a store. The program can later check to see if the overwrite took place – if so, it can reload the value and continue. Allows “hoisting” of loads above stores – also provides more opportunities for parallelism and masking of memory latency.

Data Speculation (cont.) Hardware implementation –Add advance load (ld.a) instruction – perform the load, but store the physical address and target register in the ALAT. (Also combined ld.sa.) –Add check load (ld.c) instruction – check whether the specified address / register are in the ALAT. If not, branch to compiler- generated patch-up code to re-execute the load and jump back. –ALAT is an associative lookup table. Note that it works on physical addresses, rather than virtual ones. The down-side –Increased code size –Extended live range for register –Need to avoid for likely conflicts

Software Pipelining Similar to hardware pipelining – overlap executions of different pieces of a loop. Requires multiple execution units. Problem with loop unrolling / pipelining – requires a lot of registers. Hardware support: –Rotating predicate registers to embed the prolog and epilog of the software pipeline within the main loop – minimizes code expansion. –Rotating registers can be used to avoid register renaming within unrolled loops, further minimizing code expansion –Loop Count (LC) and Epilog Count (EC) registers, special loop branch instruction to rotate registers

Software Pipelining (cont.) Modulo scheduling algorithm –Divide loop into prolog (filling pipeline), kernel (steady state), and epilog (flush pipeline) Issues –Loops with multiple exits (either need to convert to single exit or generate additional epilogs) –Complicates control and data speculation

The Register Stack Portion of process address space set aside as backing store for registers – spill to this area. Processor supports a register stack frame similar to normal memory stack frame. –First 32 registers are static. Remaining 96 form the register stack. –Current Frame Marker (CFM) and Top of Stack (TOS) registers used to manipulate stack frames –Called routine calls alloc instruction to indicate how many registers it requires. This allocates the next frame and updates the registers. –Code uses virtual register names (i.e. always uses register 32 as the first register in its frame) – processor maps to physical register

The Register Stack (cont.) –Register frame divided into fixed region, rotating region (if needed), and output region (for passing parameters to other functions) –The incoming / outgoing registers allow up to 8 parameters – can eliminate any memory access for procedure calls. When register stack overflows (wraps around all 96 registers), the Register Stack Engine (RSE) automatically stalls the processor and spills registers to the backing store. The RSE also automatically fills registers from backing store when room is available. Flexibility in IA-64 spec to make RSE more aggressive – can asynchronously spill / fill when memory bandwidth available.

Memory Hierarchy Control Compiler can insert prefetch hints (lfetch) in instruction stream. Load, store, and prefetch instructions can include post- increment – will also cause cache load of line containing post-incremented address

What’s a Compiler to Do? Many new architectural features are based on improving opportunities for ILP. Obviously, there are high expectations for the compiler to perform good instruction scheduling. Scheduler should probably drive many other optimizations. Tradeoff between ILP and code expansion. The choice of making the compiler do a lot of the work is probably a good one – it has much more context available to make decisions. However, it lacks the knowledge of the dynamic run-time environment.

Conclusions x86 architecture – NOT COOL!!! IA-64 architecture – WAY COOL!!! Could fuel a new round of “compiler wars”? Future directions (some speculation on my part): –Out of order execution in hardware –Possibly introduce some dynamic optimization capability – base speculation on runtime profile (i.e. remove speculation if inside a rarely executed block)

And Finally … Q: How many Intel architects does it take to change a lightbulb ? A: None, they have a predicating compiler that eliminates lightbulb dependencies. If the dependencies are not entirely eliminated, they have four levels of prediction to determine if you need to replace the lightbulb. Q: How many AMD architects does it take to change a lightbulb? A: I have no idea, but they'll do anything to stick twice as many lightbulbs into the same socket and ensure that it glows brighter than the bulb Intel replaced. Q: How many Apple PowerPC architects does it take to change a lightbulb? A: Our Power G4 processor at 733MHz, with the phenomenally powerful velocity engine(TM) technology, has the power to burn. It'll burn CDs, DVDs, and probably light up your room as well. You don't need a lightbulb. Oh, by the way, it'll burn Pentiums too, so you don't need Intel to replace your lightbulb. (Borrowed from