2015-05-26. Ryota Shioya, Masahiro Goshima, and Hideki Ando. MICRO 47. Presented by Kihyuk Sung.

For Single-Thread Performance in Mobile
– Out-of-order superscalar processors consume much more energy than in-order processors.
– Dynamic instruction scheduling
    Issue Queue
    Reorder Buffer
    Load/Store Queue
Proposal: Front-end Execution Architecture (FXA)
– In-Order Execution Unit (IXU)
– Out-of-Order Execution Unit (OXU)
– The IXU and the OXU are placed in series.

The IXU functions as a filter for the OXU.
In-Order Execution Unit (IXU)
– Checks whether each instruction's operands are ready. An operand is either:
    read from the Physical Register File (PRF), or
    bypassed from the Functional Units (FUs) in the IXU.
– Depending on whether an instruction is ready, it is processed as follows:
    A ready instruction is executed and is not dispatched to the Issue Queue (IQ).
    A not-ready instruction goes through the IXU as a NOP and is then dispatched to the IQ. (No stall.)
– The instruction is committed as in a conventional superscalar processor (Reorder Buffer).
Out-of-Order Execution Unit (OXU)
– Executes instructions in the same way as a conventional superscalar processor.
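The filtering behaviour described above can be sketched as a toy model. This is a minimal sketch, not the simulator used in the paper: it assumes a single pass in program order, single-cycle operations, unlimited FUs, and no pipeline stages. The names `Instr` and `ixu_filter` and all register labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str     # label used only for reporting
    dest: str     # destination register
    srcs: tuple   # source registers


def ixu_filter(program, ready_regs):
    """Split a program into (executed in IXU, dispatched to IQ).

    Assumption: an operand is "ready" if it is already in the PRF or was
    produced by an earlier instruction that the IXU executed (bypass).
    """
    ready = set(ready_regs)
    executed_in_ixu, dispatched_to_iq = [], []
    for ins in program:
        if all(src in ready for src in ins.srcs):
            # Ready: execute here; the result is visible to later instructions.
            executed_in_ixu.append(ins.name)
            ready.add(ins.dest)
        else:
            # Not ready: pass through as a NOP and go to the IQ without stalling.
            # The destination stays unavailable until the OXU produces it.
            dispatched_to_iq.append(ins.name)
            ready.discard(ins.dest)
    return executed_in_ixu, dispatched_to_iq


program = [
    Instr("I0", "r1", ("r0",)),        # r0 is in the PRF
    Instr("I1", "r3", ("r5", "r1")),   # r5 is still pending (hypothetical load miss)
    Instr("I2", "r4", ("r3",)),        # depends on I1
    Instr("I3", "r6", ("r1", "r0")),   # independent of I1 and I2
]
ixu, iq = ixu_filter(program, ready_regs={"r0"})
print("IXU:", ixu)
print("IQ: ", iq)
```

In this example only I1 and I2 reach the IQ; I3 is executed in the IXU even though it follows two not-ready instructions, because not-ready instructions do not stall the in-order unit.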

The IXU cannot execute I3
– Because of a long, consecutive chain of dependent instructions.
In general, dependent instructions are rarely placed in a long, consecutive chain.
-> The IXU can execute many instructions.
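The effect of chain length can be illustrated with a small staged model. This is a sketch under stated assumptions, not the paper's pipeline: all instructions of a group enter the IXU in the same cycle, every FU has one-cycle latency, the number of FUs per stage is unlimited, and a bypassed result can be consumed one stage after it is produced. The function name `ixu_stage_of` and the example chain are hypothetical.

```python
def ixu_stage_of(program, num_stages, pending=()):
    """Map each instruction name to the IXU stage that executes it, or None.

    program: list of (name, dest, srcs) tuples in program order.
    pending: registers whose values are not yet available.
    None means the instruction leaves the IXU unexecuted and goes to the IQ.
    """
    produced_at = {}          # register -> IXU stage that produces it
    pending = set(pending)
    stages = {}
    for name, dest, srcs in program:
        if any(src in pending for src in srcs):
            stage = None
        else:
            # Operands from the PRF count as stage 0; a bypassed operand can be
            # consumed one stage after the stage that produced it.
            stage = 1 + max((produced_at.get(src, 0) for src in srcs), default=0)
            if stage > num_stages:
                stage = None  # the dependence chain is longer than the IXU
        stages[name] = stage
        if stage is None:
            pending.add(dest)
            produced_at.pop(dest, None)
        else:
            pending.discard(dest)
            produced_at[dest] = stage
    return stages


chain = [
    ("I1", "r1", ("r0",)),   # reads r0 from the PRF
    ("I2", "r2", ("r1",)),   # depends on I1
    ("I3", "r3", ("r2",)),   # depends on I2
    ("I4", "r4", ("r0",)),   # independent
]
print(ixu_stage_of(chain, num_stages=2))
print(ixu_stage_of(chain, num_stages=3))
```

Under these assumptions, a two-stage IXU cannot execute the third instruction of a three-deep chain, while a three-stage IXU can; the independent instruction is executed in the first stage either way.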

Branch
– The IXU can execute branch instructions and handle mispredictions.
Floating Point
– The IXU cannot execute FP operations.
– Long latency -> the pipeline would be prolonged.
Load/Store
– Use the Load/Store Queue (LSQ).

Bypassing between IXU and OXU
– IXU -> OXU is not necessary.
    Order
– OXU -> IXU is omitted.
    Performance degradation is not significant.

Optimization of IXU
– The latency of the bypass network increases with the number of FUs.
    Decrease the number of FUs in the later stages: [3, 1, 1].
– Partially omit operand bypassing in the IXU.
    Omitted: bypassing between FUs that are more than two stages apart.

Instructions Executed in IXU
– Instructions that are already ready when they enter the IXU
    Very few (5.5%)
– Instructions that newly become ready in the IXU
– 35% (1 stage) to 54% (3 stages, FU [3, 1, 1])
Performance Improvement
– Effects of FUs in IXU
    Stages: 4 (conventional superscalar processor) to 7 (FXA)
    FUs: 4 (4-issue OoO superscalar) to 7 (5 in IXU, 2 in OXU)
– Variable branch misprediction penalty
    IXU / OXU

The number of FUs is increased.
– IXU and OXU
– Static energy consumption: increased.
– Dynamic energy consumption: increased or equal.
PRF
– The IXU and the OXU access the PRF simultaneously.
The number of Issue Queue accesses is decreased.
– Because of the IXU
– Energy consumption is reduced by 86%.

Evaluated IPC using an in-house cycle-accurate processor simulator.
– Ran SPEC CPU
– Compiled using gcc with -O3
Evaluated energy consumption and chip area using the McPAT simulator (parameters: Table 2).

BIG
– Out-of-order superscalar (ARM Cortex-A57 big core)
– Baseline
HALF
– Issue width and IQ capacity are half those of BIG
LITTLE
– In-order processor (ARM Cortex-A53 LITTLE core)
HALF+FX
– HALF with IXU (3 stages, FU [3, 1, 1])
BIG+FX
– BIG with IXU (3 stages, FU [3, 1, 1])

Maximum: 67%, geometric mean: 5.7%
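Results of this kind are summarized as a maximum and a geometric mean over benchmarks. As an illustration of how such a summary is computed, here is a minimal sketch; the per-benchmark values are placeholders chosen for the example and are not results from the paper, and the helper name `geometric_mean` is hypothetical.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive ratios (e.g. per-benchmark speedups)."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Placeholder per-benchmark performance relative to a baseline (1.0 = equal).
# These numbers are illustrative only, not measurements.
relative_performance = [1.02, 0.99, 1.10, 1.67, 1.01]

gmean = geometric_mean(relative_performance)
print(f"maximum improvement: {max(relative_performance) - 1:.1%}")
print(f"geometric-mean improvement: {gmean - 1:.1%}")
```

The geometric mean is the usual choice for averaging ratios because a single large outlier moves it far less than it would move an arithmetic mean.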

Geometric mean: 7.4%

Geometric mean: 4.5%

Proposed FXA, which has two execution units, the IXU and the OXU.
– 5.7% higher performance
– 17% lower energy consumption
– 25% higher performance/energy ratio
