MICROPROCESSOR ARCHITECTURE (ECE519) Prof. Lynn Choi Presented by Sam

Outline  Main idea  Motivation  Challenges  Core design: a method for representing inter-instruction data dependences (SSR), the Forwardflow Dataflow Queue (DQ), and the Forwardflow architecture  Related Work  Conclusion & problems

Motivation and Challenges Consider this vision: microarchitects hope to improve applications' overall efficiency by focusing on thread-level parallelism (TLP) rather than instruction-level parallelism (ILP) within a single thread.

Challenges Two fundamental problems: Amdahl's Law, and whether all cores can simultaneously operate at full speed.  Parallel speedup is limited by the parallel fraction f: Speedup = 1 / ((1 - f) + f/N)  i.e., only ~10x speedup at N = 512, f = 90%: adding more cores speeds the program up only a little.
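To make the ~10x figure concrete, here is a minimal Python sketch of Amdahl's Law (the function name is illustrative, not from the slides):

```python
# Amdahl's Law: speedup of a program with parallel fraction f on N cores.
def amdahl_speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

# The slide's example: f = 90%, N = 512 cores.
print(round(amdahl_speedup(0.90, 512), 2))  # ~9.83, i.e. only about 10x
```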

Challenges The second fundamental problem is whether all cores can simultaneously operate at full speed, given the physical limits on power delivery and heat dissipation. Simultaneously Active Fraction (SAF): "the fraction of the entire chip's resources that can be active simultaneously." In the long term, to maintain fixed power and area budgets as technology scales, the fraction of active transistors must decrease with each technology generation.

Motivation For single-thread performance:  Exploit ILP. For multiple threads:  Save power  Exploit TLP.

Motivation A scalable core is a processor capable of operating in several different configurations, each offering a different power/performance point.  CMPs will need scalable cores  Scale UP for performance: use more resources (e.g., cores, caches, hardware accelerators) for more performance, allowing single-threaded applications to aggressively exploit ILP and MLP to the limits of available power.  Scale DOWN

Motivation  CMPs will need scalable cores  Scale UP  Scale DOWN for energy conservation: exploit TLP with many small cores; when power is constrained, scalable cores can scale down to conserve per-core energy.

Scalable Cores A CMP equipped with scalable cores: scaled up to run a few threads quickly (left), and scaled down to run many threads in parallel (right). Scalable cores have the potential to adapt their behavior to best match their current workload and operating conditions.

Ideas  Idea: Forwardflow  A new scalable core microarchitecture  Uses pointers and distributes values  Scales to large instruction window sizes with a full-window scheduler  Scales dynamically via a variable-sized instruction window

Forwardflow architecture Problem: in a scalable core, resource allocation changes over time. Designers of scalable cores should therefore avoid structures that are difficult to scale, like centralized register files and bypassing networks. This work focuses on scaling the window size.

Serialized Successor Representation (SSR) SSR is a method for representing inter-instruction data dependences. Instead of maintaining value names, SSR describes values' relationships to the operands of other instructions. Instructions in SSR are represented as three-operand tuples: SOURCE1 (S1), SOURCE2 (S2), and DESTINATION (D). Each operand consists of a value and a successor pointer; operand pointers are used to represent data dependences.

Serialized Successor Representation (SSR) The pointer field of the producing instruction's D-operand designates the first successor operand, usually the S1- or S2-operand of a later instruction. If a second successor exists, the pointer field of the first successor operand designates the location of the second successor operand. The locations of subsequent operands are encoded in linked-list fashion, relying on the pointer at successor i to designate the location of successor i+1.
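As a rough illustration (not the paper's hardware design), the tuples and pointer chains described above could be modeled in software as follows; the class and function names are hypothetical:

```python
# Hypothetical software model of SSR: each operand holds a value and a single
# successor pointer; additional successors are reached by chasing pointers
# in linked-list fashion (successor i points to successor i+1).
class Operand:
    def __init__(self, value=None):
        self.value = value                 # operand value, if known
        self.ready = value is not None
        self.successor = None              # next successor operand, or None (NULL pointer)

class Instruction:
    def __init__(self, opcode):
        self.opcode = opcode
        self.s1 = Operand()                # SOURCE1
        self.s2 = Operand()                # SOURCE2
        self.d = Operand()                 # DESTINATION

def append_successor(chain_tail, new_operand):
    """Add a new consumer to a value's pointer chain and return the new tail.
    chain_tail is the last operand currently on the chain (initially the
    producer's D-operand)."""
    chain_tail.successor = new_operand
    return new_operand
```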

Serialized Successor Representation (SSR) Dependences form distributed chains of pointers, each terminated by a NULL pointer. Pros: no register renaming is required; locating the successors of any dynamic value never requires a search or broadcast operation; and the structure can be built from simple SRAMs.

Forwardflow – Dataflow Queue (DQ) Instructions, values, and data dependences reside in a distributed Dataflow Queue (DQ). The DQ comprises independent banks and pipelines, which can be activated or deactivated by system software to scale a core's execution resources.

Forwardflow architecture

Fetch Read instructions from the L1-I cache, predict branches, and pass them on to the Decode stage. Fetch proceeds no differently than in other high-performance microarchitectures.

Decode Determine to which pointer chains, if any, each instruction belongs. Decode does this using the Register Consumer Table (RCT), which resembles a traditional rename table and is implemented as an SRAM-based table. The RCT also identifies registers last written by a committed instruction. Decode detects and handles potential data dependences, analogous to traditional renaming.

Dispatch Dispatch inserts instructions into the Dataflow Queue (DQ); instructions issue when their operands become available.
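Continuing the SSR sketch above, decode and dispatch might be modeled as follows; ARF, RCT, and dispatch are illustrative names, and the real hardware uses SRAM tables rather than Python dictionaries:

```python
# Hypothetical model of decode + dispatch over the SSR classes above.
ARF = {}   # architectural register file: register name -> committed value
RCT = {}   # Register Consumer Table: register -> (tail operand of chain, value committed?)

def dispatch(insn, srcs, dst):
    """srcs: source register names (or None for immediates already in the DQ);
    dst: destination register name (or None)."""
    for operand, reg in zip((insn.s1, insn.s2), srcs):
        if reg is None:
            continue
        tail, committed = RCT.get(reg, (None, True))
        if committed:
            # Value was produced by a committed instruction: read it from the ARF.
            operand.value, operand.ready = ARF.get(reg), True
        else:
            # Value is still in flight: append this operand to the register's chain.
            RCT[reg] = (append_successor(tail, operand), False)
    if dst is not None:
        # This instruction is now the last writer of dst; its D-operand heads the chain.
        RCT[dst] = (insn.d, False)
```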

Dispatched/Executing The ld instruction is ready to issue because both source operands are available in the ARF. Decode updates the RCT to indicate that the ld produces R3. Dispatch reads the ARF to obtain R1's value, writes both operands into the DQ, and issues the ld.

When the add is decoded, it consults the RCT and finds that R3's previous use was as the ld's destination field. Dispatch updates the pointer from the ld's destination to the add's first source operand. The add's immediate operand (55) is written into the DQ at dispatch.

The mult's decode consults the RCT and discovers that both of its operands, R3 and R4, are not yet available and were last referenced by the add's source 1 operand and the add's destination operand, respectively. Dispatch of the mult therefore checks for available results in both the add's source 1 value array and its destination value array, and appends the mult to R3's and R4's pointer chains.

The sub appends itself to the R3 pointer chain and writes its dispatch-time-ready operand (66) into the DQ.

Wakeup, Selection, and Issue Upon completion of the ld, the memory value (99) is written into the DQ, and the ld's destination pointer is followed to the first successor.

Wakeup, Selection, and Issue The add's metadata and source 2 value are read and, coupled with the arriving value of 99, the add can now be issued. The update hardware reads the add's source 1 pointer, discovering the mult as the next successor.

Wakeup, Selection, and Issue The mult's metadata, other source operand, and next pointer field are read. Its source 1 operand is unavailable, so the mult will issue at a later time.

Wakeup, Selection, and Issue Finally, following the mult’s source 2 pointer to the sub delivers 99 to the sub’s first operand, enabling the sub to issue.
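The wakeup walk in the example above can be summarized with a final sketch in the same hypothetical model: when a result arrives, the update logic chases the destination's pointer chain, delivering the value to each successor and issuing any instruction whose operands are now all ready (owner_of and issue are illustrative stand-ins for DQ metadata and the selection logic):

```python
# Hypothetical wakeup walk for the SSR model above.
def issue(insn):
    print("issuing", insn.opcode)          # stand-in for selection/issue logic

def deliver_result(producer, value, owner_of):
    """owner_of maps each Operand back to its Instruction."""
    producer.d.value, producer.d.ready = value, True
    succ = producer.d.successor
    while succ is not None:                # walk the pointer chain, one successor per step
        succ.value, succ.ready = value, True
        consumer = owner_of[succ]
        if consumer.s1.ready and consumer.s2.ready:
            issue(consumer)                # e.g., the add issues when 99 arrives
        succ = succ.successor
```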

Methodology: target machine On each tile resides a single core, a private L1-I cache (32KB), a private write-through write-invalidate L1-D cache (32KB), a private L2 cache (1MB) which manages coherency in the L1-D via inclusion, and one bank of a shared L3 cache. It is assumed that cores and private caches can be powered off without affecting the shared L3; the L3 operates in its own voltage domain.

Related Work  Scalable Schedulers  Direct Instruction Wakeup [Ramirez04]: the scheduler has a pointer to the first successor; a secondary table holds a matrix of successors.  Hybrid Wakeup [Huang02]: the scheduler has a pointer to the first successor; each entry has a broadcast bit for multiple successors.  Half Price [Kim02]: slice the scheduler in half; the second operand is often unneeded.

Related Work  Dataflow & Distributed Machines  Tagged-Token [Arvind90]: values (tokens) flow to successors.  TRIPS [Sankaralingam03]: discrete execution tiles (X, RF, $, etc.) and an EDGE ISA.  Clustered Designs [e.g., Palacharla97]: independent execution queues.

Conclusion and Problems Conclusion:  Forwardflow allows the system to trade off power and performance. Problems:  What happens if the number of DQ banks is larger (>8) or smaller (<8)?  We have no idea how software must change to accommodate concurrency.