OOE vs. EPIC Emily Evans Prashant Nagaraddi Lin Gu

Objective
Our objective is to evaluate the claims and counterclaims about OOE and EPIC made in:
– “Is Out-of-Order Out of Date?” by William S. Worley and Jerry Huck
– “A Critical Look at IA-64” by Martin Hopkins

Outline
Analysis of ILP
Analysis of Code Size
Analysis of Hardware Complexity
Analysis of Compiler Complexity
Analysis of Power Consumption
Comparison Methodology
Conclusion

What is EPIC?
“One of our goals for EPIC was to retain VLIW's philosophy of statically constructing the POE [plan of execution], but to augment it with features, akin to those in a superscalar processor, that would permit it to better cope with these dynamic factors. The EPIC philosophy has the following key aspects to it.”
– “Providing the ability to design the desired POE at compile-time.”
– “Providing features that permit the compiler to ‘play’ the statistics.”
– “Providing the ability to communicate the POE to the hardware.”
*From EPIC: An Architecture for Instruction-Level Parallel Processors by Michael S. Schlansker and B. Ramakrishna Rau.

Analysis of ILP
MH: Hardware provides good ILP because it dynamically adjusts the instruction schedule based on the actual execution path and cache misses, using:
– Large reorder buffers
– Register renaming
– Branch prediction
– Alias detection
WW & JH: The compiler can exploit ILP more effectively using:
– Massive resources: a large register set, more functional units
– Predication
– Speculation
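The predication point above can be illustrated with a small sketch. This is illustrative C, not IA-64 code, and `max_branchy`/`max_predicated` are invented names: real EPIC hardware uses predicate registers and predicated instructions rather than arithmetic selection, but the structure is the same — both arms are evaluated and a predicate selects the result, removing the branch an OOE core would otherwise have to predict.

```c
#include <assert.h>

/* Branchy version: an out-of-order core must predict this branch. */
static int max_branchy(int a, int b) {
    if (a > b)
        return a;
    return b;
}

/* If-converted version: both candidate values are computed and a
   predicate selects between them, so there is no branch to predict.
   The arithmetic below only mimics what predicated instructions do. */
static int max_predicated(int a, int b) {
    int p = (a > b);              /* the "predicate" */
    return p * a + (1 - p) * b;   /* both arms evaluated, one selected */
}
```

Note the trade-off the two articles argue about: the predicated form does strictly more work per call, but that work is branch-free and schedulable at compile time.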

Analysis of ILP (cont.)
Our observation:
– From the H&P book: the SPECint benchmarks show that the Alpha and Pentium 4 considerably outperform the Itanium, while the SPECfp benchmarks show that the Itanium slightly outperforms the Alpha and Pentium 4.
– These results are not an absolute measurement of the performance of OOE and EPIC. Different implementations of the architectures may perform differently, and as EPIC compilers improve over time, these performance figures will change.

Analysis of Code Size
MH: Code size for IA-64 could be as much as 4 times that of x86 to perform the same work.
WW & JH: Code size will be larger, but the instruction stream will contain fewer branches. Also, there are mechanisms to deliver instructions to the processor efficiently.

Analysis of Code Size (cont.)
Our observation:
– Both sides agree that code size increases overall; however, they disagree on the extent to which it affects performance.
– EPIC code size will expand dramatically in some cases.
– EPIC code size can also be smaller than OOE code size in some cases.
– We expect that a mature optimizing compiler will be able to deliver code of reasonable size; after all, code size does not necessarily translate linearly into performance loss.

Analysis of Hardware Complexity
MH: To support features for greater ILP, EPIC hardware will be quite complex.
– Predication requires more functional units
– NaT bits to allow deferring exceptions
– ALAT to allow loads to move ahead of stores
WW & JH: IA-64 makes the hardware less complex because it is not responsible for detecting and scheduling the parallelism.
– No reorder buffer, register renaming, etc.
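The NaT mechanism mentioned above can be sketched as a rough conceptual model in plain C (this is an assumption-laden software analogy, not hardware behavior; `SpecReg`, `spec_load`, and `check_or_recover` are invented names): a control-speculative load that would fault records a deferred exception in a NaT bit instead of trapping, and a later check — the role IA-64's chk.s instruction plays — decides whether recovery code must run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical software model of a register carrying a NaT
   ("Not a Thing") bit alongside its value. */
typedef struct {
    long value;
    int  nat;   /* 1 = the speculative load would have faulted */
} SpecReg;

/* Speculative load: instead of trapping on a bad address,
   defer the exception by setting the NaT bit. */
static SpecReg spec_load(const long *addr) {
    SpecReg r = {0, 0};
    if (addr == NULL)
        r.nat = 1;          /* exception deferred, not raised */
    else
        r.value = *addr;
    return r;
}

/* Check at the point of use: if NaT is set, run recovery
   (here simplified to returning a fallback value). */
static long check_or_recover(SpecReg r, long fallback) {
    return r.nat ? fallback : r.value;
}
```

This also illustrates MH's complexity argument: every register file entry, spill path, and functional unit must now carry and propagate the extra bit.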

Analysis of Hardware Complexity (cont.)
Our observation:
– Is an EPIC processor more complex than an OOE processor? Example: the Alpha 21264 has two fewer pipeline stages (but more stages do not necessarily mean more complexity).
– As the H&P book notes, good techniques from the ‘enemy camp’ are often borrowed. EPIC processors are expected to be simple; however, to support better ILP, they will also add hardware support, which makes them more complex than expected.

Analysis of Compiler Complexity
MH: It is very difficult to write a good EPIC compiler. Profiling is also a burden:
– Not welcomed by programmers
– Hard to obtain and maintain a representative test suite
– A formidable task for large programs
WW & JH: OOE compilers are difficult to write as well.
– OOE processors still need good compilers to ensure performance gains.
– OOE compiler writers must understand the limitations of the hardware and figure out how to work around them.
– Code profiling is only “slightly” more important for EPIC processors.

Analysis of Compiler Complexity (cont.)
Our observation:
– An optimizing compiler can help performance for both OOE and EPIC processors.
– Profiling, which is a non-trivial task, adds complexity to the compiler.
– An EPIC compiler has much more responsibility than an OOE compiler, so it is likely to be more complex.
– The EPIC philosophy aims to trade compiler complexity for hardware simplicity. Whether this is a critical disadvantage must be considered in the context of overall system complexity and performance.

Analysis of Power Consumption
MH: Massive resources consume lots of power.
– “Thus, IA-64 gambles that, in the future, power will not be the critical limitation, …”
WW & JH: They left this issue out, perhaps because they do not think it is a big problem.

Analysis of Power Consumption (cont.)
Our observation:
– The use of massive resources is likely to consume more power.
– Whether this becomes a problem depends on the intended application area of the EPIC technology. For servers and high-end workstations, power consumption is not as important; for embedded systems, it is likely a very critical issue.
– For EPIC to truly be a ‘general-purpose’ technology, power consumption must be kept under control.

Comparison Methodology
MH: Accumulates “facts” supporting a skeptical view of EPIC.
– Example: EPIC stalls where OOE proceeds
WW & JH: Accumulate “facts” supporting an optimistic view of EPIC.
– Example: Dynamic translation
Architecture design is a balance of CPI, clock frequency, instruction count, application constraints, and cost. There are always cases and countercases for every solution; they need to be considered in an integrated context.

Comparison Methodology (cont.)
EPIC stalls when OOE proceeds:
– This will happen in some cases.
– But we must determine how much this case actually hurts performance. A cache miss is not the common case, and speculation makes it even less common. On a cache miss, OOE is also not expected to proceed far ahead.
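The stall argument can be made concrete with a toy cycle-count model (all numbers below are invented for illustration; this is a back-of-the-envelope sketch, not a pipeline simulation): if the compiler speculatively hoists a load above a branch, the miss latency overlaps the independent work instead of adding to it, which is why speculation makes the "EPIC stalls" case less damaging.

```c
#include <assert.h>

/* Made-up latencies for illustration only. */
enum { MISS_LATENCY = 100, WORK_CYCLES = 80 };

/* Without speculation: the load issues only after the branch
   resolves, so the full miss latency is added after the
   independent work completes. */
static int cycles_no_speculation(void) {
    return WORK_CYCLES + MISS_LATENCY;
}

/* With a speculatively hoisted load: the miss overlaps the
   independent work, and only the uncovered remainder of the
   latency is exposed to the pipeline. */
static int cycles_with_speculation(void) {
    int uncovered = MISS_LATENCY - WORK_CYCLES;
    return WORK_CYCLES + (uncovered > 0 ? uncovered : 0);
}
```

Under these assumed numbers the hoisted schedule hides 80 of the 100 miss cycles; the point is the overlap, not the specific figures.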

Comparison Methodology (cont.)
Dynamic translation:
– It rarely yields much performance gain on code that is already highly optimized.
– Dynamo example:

Conclusion
– Both sets of authors make claims about the EPIC architecture without providing quantitative evidence.
– Quantitative evidence is necessary to conclude that one architecture is superior to another.
– EPIC is a useful effort in the exploration of higher ILP. When evaluating it, we need to separate the value of the architectural approach from any single implementation of it. The idea behind EPIC is good, but more time, effort, and calm calculation are needed to know whether it works.