Instruction-Level Parallelism for Low-Power Embedded Processors January 23, 2001 Presented By Anup Gangwar.

Presentation transcript:


Embedded Systems Group, IIT Delhi

Slide 2: Introduction
• Need for high-performance, low-power processors
• Synergistic hardware-compiler design for EPIC- or VLIW-like architectures
• A new variable instruction length scheme
• Full predication support in hardware

Slide 3: Outline
• Instruction-Level Parallelism
• Power Consumption in VLSI Circuits
• A Look at Available Mobile and DSP Processors
• High-Level Evaluation of a Low-Power VLIW Processor
• The DEVIL Low-Power Processor
• A Step Towards Predicated Execution
• Conclusion

Slide 4: ILP: Concepts and Limitations
• Data Dependences
  - Flow dependence, or RAW (read after write)
  - Anti dependence, or WAR (write after read)
  - Output dependence, or WAW (write after write)
• Reduction of critical path
• Control Dependences
• Resource Conflicts
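
The three data dependences listed on Slide 4 can be illustrated with a small sketch. It assumes a hypothetical three-address instruction format (one destination register, a tuple of source registers); the `Instr` class and `dependences` function are illustrative names, not part of the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical three-address instruction: dest = op(srcs)."""
    dest: str
    srcs: tuple


def dependences(first: Instr, second: Instr) -> set:
    """Classify the data dependences of `second` on the earlier `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")  # flow: second reads what first wrote
    if second.dest in first.srcs:
        found.add("WAR")  # anti: second overwrites what first read
    if first.dest == second.dest:
        found.add("WAW")  # output: both write the same register
    return found


i1 = Instr("r1", ("r2", "r3"))  # r1 = r2 + r3
i2 = Instr("r4", ("r1", "r5"))  # r4 = r1 * r5
i3 = Instr("r2", ("r6", "r7"))  # r2 = r6 - r7
i4 = Instr("r1", ("r8", "r9"))  # r1 = r8 + r9

print(dependences(i1, i2))  # flow dependence through r1
print(dependences(i1, i3))  # anti dependence through r2
print(dependences(i1, i4))  # output dependence through r1
```

Only the flow (RAW) dependence reflects a true transfer of data; WAR and WAW arise from register reuse and can in principle be removed by renaming.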


Slide 6: Achieving ILP: Pipelining
• Control dependencies affect pipelined execution
• Data dependencies affect pipelined execution
• Resource conflicts affect pipelined execution

Slide 7: Achieving ILP: Superscalar Architectures
• In-order issue with in-order completion
• In-order issue with out-of-order completion
• Out-of-order issue with out-of-order completion


Slide 11: Achieving ILP: VLIW Processors
• Lower circuit overhead than superscalar processors
• Limited number of resources
• Explicit insertion of NOPs increases code size
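
The code-size cost of explicit NOPs can be made concrete with a minimal sketch. It assumes a fixed-width VLIW word in which every unused issue slot must hold a NOP; the schedule, the issue width of 4, and the `pack_vliw` helper are hypothetical and chosen only for illustration.

```python
def pack_vliw(schedule, width):
    """Pad each cycle's operation list with NOPs up to the issue width."""
    words = []
    for ops in schedule:
        if len(ops) > width:
            raise ValueError("more operations than issue slots")
        words.append(list(ops) + ["nop"] * (width - len(ops)))
    return words


# Hypothetical compiler schedule: operations issued in each cycle.
schedule = [["add", "mul", "ld"], ["sub"], ["st", "add"]]
WIDTH = 4  # assumed number of issue slots per instruction word

words = pack_vliw(schedule, WIDTH)
useful = sum(len(ops) for ops in schedule)
slots = len(words) * WIDTH
nop_fraction = (slots - useful) / slots

for word in words:
    print(word)
print(f"{useful} useful operations in {slots} slots, "
      f"NOP fraction = {nop_fraction:.2f}")
```

When the available ILP is lower than the issue width, a growing share of the encoded program is padding, which is the motivation for the NOP elimination discussed later in the talk.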


Slide 13: Extracting ILP: Basic Block Scheduling

Slide 14: Extracting ILP: Superblock Scheduling

Slide 15: Extracting ILP: Predicated Execution
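
The body of the predicated-execution slide is not present in the transcript. As general background, the sketch below shows the usual idea of if-conversion: a branch is replaced by a compare that sets a predicate and its complement, both arms are issued, and only the operation whose predicate is true commits its result. The function names and the register dictionary are assumptions made for this illustration and do not describe the presenter's hardware.

```python
def branching(a, b):
    """Original form: the assignment that runs depends on a branch."""
    if a > b:
        x = a - b
    else:
        x = b - a
    return x


def predicated(a, b):
    """If-converted form: no branch, every operation carries a predicate."""
    p = a > b   # compare sets predicate p
    q = not p   # and its complement q
    regs = {"x": None}
    # Both operations are issued; a false predicate nullifies the write.
    for pred, dest, value in ((p, "x", a - b), (q, "x", b - a)):
        if pred:  # models the hardware squashing a false-predicate operation
            regs[dest] = value
    return regs["x"]


print(branching(7, 3), predicated(7, 3))
print(branching(2, 9), predicated(2, 9))
```

The predicated form turns a control dependence into a data dependence on `p` and `q`, so the two arms can be scheduled in the same instruction word at the cost of issuing operations whose results are discarded.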

Slide 16: Power Consumption in CMOS Circuits: Parallelism for Energy Efficiency
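
The body of this slide did not survive in the transcript. As background, the usual first-order argument for why parallelism can save energy in CMOS is sketched below. It is the standard textbook model, not necessarily the derivation shown on the slide, and the symbols (activity factor α, switched capacitance C, supply voltage V_dd, clock frequency f, number of parallel units N) are assumptions introduced here.

```latex
% Dynamic (switching) power of a CMOS circuit, first-order model:
P_{\text{dyn}} = \alpha \, C \, V_{dd}^{2} \, f

% N parallel units, each clocked at f/N, keep the same throughput.
% The slower clock tolerates a lower supply voltage V'_{dd} < V_{dd}:
P_{\text{par}} = N \cdot \alpha \, C \, {V'_{dd}}^{2} \, \frac{f}{N}
               = \alpha \, C \, {V'_{dd}}^{2} \, f

% Ratio, ignoring the overhead of the extra hardware and routing:
\frac{P_{\text{par}}}{P_{\text{dyn}}} = \left( \frac{V'_{dd}}{V_{dd}} \right)^{2}
```

Under these assumptions the saving comes entirely from the quadratic dependence on supply voltage; the duplicated hardware adds capacitance and leakage that the model above leaves out.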


Slide 18: Available Mobile and VLIW Processors
• The ARM Family
  - The ARM7 Generation
  - The StrongARM
  - The ARM Thumb Option
  - The ARM Piccolo Option
  - The ARM9 and ARM10

Slide 19: Available Mobile and VLIW Processors
• The Motorola M-Core
• The LSI TinyRISC
• The Hitachi SuperH Family
• VLIW Processors
  - The Motorola-Lucent Star*Core
  - The Philips TriMedia
  - The HP/Intel IA-64

Slide 20: High-Level Evaluation of a Low-Power VLIW Processor
• Energy consumption distribution

Slide 21: High-Level Evaluation of a Low-Power VLIW Processor
• NOP elimination in VLIW processors

Slide 22: High-Level Evaluation of a Low-Power VLIW Processor
• Speed-up comparison

Slide 23: High-Level Evaluation of a Low-Power VLIW Processor
• Energy comparison

Slide 24: High-Level Evaluation of a Low-Power VLIW Processor
• Energy-delay product comparison

Slide 25: The DEVIL Low-Power Processor
• Complexity in VLIW Architectures
  - Hardware duplication
• FUs and number of registers as well as ports
• Number of FUs versus type of FU
• Number of FUs versus available ILP

Slide 26: The DEVIL Low-Power Processor
Code Memory

Slide 27: The DEVIL Low-Power Processor

Slide 28: The DEVIL Low-Power Processor
• Instruction fetch mechanism

Slide 29: The DEVIL Low-Power Processor
• Branch prediction mechanism

Slide 30: The DEVIL Low-Power Processor
• Performance with and without superscalar optimizations

Slide 31: The DEVIL Low-Power Processor
• Effect of superscalar optimization on code size

Slide 32: The DEVIL Low-Power Processor
• Effect of NOP elimination on code size

Slide 33: The DEVIL Low-Power Processor
• Effect of NOP elimination on the number of accesses to code memory

Slide 34: The DEVIL Low-Power Processor
• Effect of instruction fetch mechanism on code size

Slide 35: The DEVIL Low-Power Processor
• Code size comparison with existing mobile processors

Slide 36: A Step Towards Predicated Execution
• Compiler techniques for reducing predicate code size
  - Reduction of the number of control instructions
  - Predicate promotion and instruction merging
  - Instruction reduction for advanced code generation

Slide 37: A Step Towards Predicated Execution: Reduction of the Number of Control Instructions

Slide 38: A Step Towards Predicated Execution: Predicate Promotion and Instruction Merging

Slide 39: A Step Towards Predicated Execution
• Introducing predication support into the processor
  - Effect of full predication on code size
  - Predication code size and execution characteristics
  - Prefix-based predication

Slide 40: A Step Towards Predicated Execution
• Relative number of predicated instructions

Slide 41: A Step Towards Predicated Execution
• Code expansion considering predication

Slide 42: A Step Towards Predicated Execution
• Code reductions due to predicated execution

Slide 43: Conclusions
• A synergistic hardware-compiler approach for low-power processors
• A new VLIW architecture that reduces the increase in code size
• A prefix-based predicated execution architecture framework