Distributed Microarchitectural Protocols in the TRIPS Prototype Processor
Sankaralingam et al.
Presented by Cynthia Sturton, CS 258, 3/3/08

Tera-op, Reliable, Intelligently adaptive Processing System (TRIPS)
Trillions of operations on a single chip by 2012!
Distributed microarchitecture
– Heterogeneous tiles
– Uniprocessor
– Distributed control
– Dynamic execution
ASIC prototype chip
– 170M transistors, 130nm process
– Two 16-wide-issue processor cores
– 1MB distributed Non-Uniform Cache Access (NUCA) cache

Why Tiled and Distributed?
Issue width of superscalar cores is constrained by
– On-chip wire delay
– Power constraints
– Growing design complexity
Use tiles to simplify design
– Enables larger processors
– But introduces multi-cycle communication delay across the processor
Hence, use a distributed control system

TRIPS Processor Core
Explicit Data Graph Execution (EDGE) ISA
– Compiler-generated TRIPS blocks
5 types of tiles
7 micronetworks
– 1 data network, 1 instruction network
– 5 control networks
Few global signals
– Clock
– Reset tree
– Interrupt

EDGE Instruction Set Architecture
TRIPS block
– A compiler-generated dataflow graph
Direct intra-block communication
– Instructions send results directly to their dependent consumers, without going through a shared register file
Block-atomic execution
– Up to 128 instructions per TRIPS block
– Each block is fetched, executed, and committed as a unit
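To make direct communication concrete, below is a toy sketch in Python (invented node names and textual form; the real ISA packs consumer slot numbers into instruction bits): each instruction lists the consumers of its result instead of naming a destination register.

```python
# Toy sketch of EDGE direct target encoding (invented textual form; the
# real ISA encodes consumer slots in instruction bits, not strings).
# Each instruction names the consumers of its result, so values flow
# producer-to-consumer without touching a shared register file.
block = [
    ("N0", "read R4", ["N2.left"]),          # register read feeds N2
    ("N1", "read R5", ["N2.right"]),
    ("N2", "add",     ["N3.left", "W0"]),    # result fans out: N3 and a write
    ("N3", "branch",  []),                   # the block's single branch
]
for name, op, targets in block:
    print(f"{name}: {op:8s} -> {', '.join(targets) or '(block exit)'}")
```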

TRIPS Block
Blocks of instructions built by the compiler
– One 128-byte header chunk
– One to four 128-byte body chunks
– All possible paths emit the same number of outputs (stores, register writes, and exactly one branch)
Header chunk
– Maximum of 32 register reads and 32 register writes
Body chunks
– 32 instructions each
– Maximum of 32 loads and stores per block
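A small checker sketch makes the fixed-format limits above explicit (constant and function names are mine; in reality the compiler and hardware enforce these structurally):

```python
# Sketch of the TRIPS block-format limits listed above (constant and
# function names are mine; the hardware enforces these structurally).
MAX_BODY_CHUNKS = 4
INSNS_PER_CHUNK = 32          # each 128-byte body chunk holds 32 insns
MAX_REG_READS = 32
MAX_REG_WRITES = 32
MAX_MEM_OPS = 32              # loads + stores per block

def block_is_legal(num_body_chunks, reg_reads, reg_writes, mem_ops, branches):
    """True if a compiler-built block fits the fixed TRIPS block format."""
    return (1 <= num_body_chunks <= MAX_BODY_CHUNKS
            and reg_reads <= MAX_REG_READS
            and reg_writes <= MAX_REG_WRITES
            and mem_ops <= MAX_MEM_OPS
            and branches == 1)    # every path emits exactly one branch

assert block_is_legal(num_body_chunks=4, reg_reads=10, reg_writes=3,
                      mem_ops=12, branches=1)
```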

Processor Core Tiles
Global Control Tile (GT): 1
Execution Tiles (ET): 16
Register Tiles (RT): 4
– 128 registers per tile
– 2 read ports, 1 write port
Data Tiles (DT): 4
– Each holds one 2-way, 8KB L1 D-cache bank
Instruction Tiles (IT): 5
– Each holds one 2-way, 16KB bank of the L1 I-cache
Secondary memory system
– 1MB, Non-Uniform Cache Access (NUCA), 16 tiles, with Miss Status Holding Registers (MSHRs)
– Configurable as L2 cache or scratch-pad memory via On-Chip Network (OCN) commands
– Private port between memory and each IT/DT pair

Processor Core Micronetworks
Operand Network (OPN)
– Connects all tiles except the Instruction Tiles
Global Dispatch Network (GDN)
– Instruction dispatch
Global Control Network (GCN)
– Committing and flushing blocks
Global Status Network (GSN)
– Information about block completion
Global Refill Network (GRN)
– I-cache miss refills
Data Status Network (DSN)
– Store completion information
External Store Network (ESN)
– Store completion information to the L2 cache or memory

TRIPS Block Diagram
– Composable at design time
– 16-wide out-of-order issue
– 64KB L1 I-cache
– 32KB L1 D-cache
– 4 SMT threads
– 8 TRIPS blocks in flight

Distributed Protocols – Block Fetch
– The GT sends instruction indices to the ITs via the Global Dispatch Network (GDN)
– Each IT takes 8 cycles to send its 32 instructions to its row of ETs and RTs (via the GDN); 128 instructions total for the block
– Instructions enter the read/write queues at the RTs and the reservation stations at the ETs
– Steady state: 16 instructions dispatched per cycle, 1 instruction per ET per cycle
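The steady-state rate follows directly from these numbers; a quick arithmetic sketch (variable names are mine):

```python
# Quick arithmetic from the slide's numbers (variable names are mine;
# the GDN itself is a hardware network, not software).
INSNS_PER_BLOCK = 128
INSNS_PER_IT = 32                      # each IT feeds one row of ETs/RTs
DISPATCH_CYCLES = 8                    # an IT streams its 32 insns in 8 cycles

per_it_rate = INSNS_PER_IT / DISPATCH_CYCLES        # 4 insns/cycle per IT
rows = INSNS_PER_BLOCK // INSNS_PER_IT              # 4 rows carry the block
steady_state = per_it_rate * rows                   # 16 insns/cycle total
print(f"steady state: {steady_state:.0f} instructions/cycle")
assert steady_state == 16
```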

Block Fetch – I-cache Miss
– The GT maintains tags and status bits for the I-cache lines
– On an I-cache miss, the GT transmits the refill block's address to every IT (via the Global Refill Network)
– Each IT independently processes the refill of its two 64-byte cache chunks
– The ITs signal refill completion to the GT (via the GSN)
– Once all refill-completion signals have arrived, the GT may issue the dispatch for that block
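A sketch of the GT-side bookkeeping this implies (class and method names are invented): dispatch is gated on the set of ITs that have not yet reported refill completion.

```python
# Sketch of the GT-side refill tracking implied above: dispatch for a
# refilled block is held until all five ITs report done on the GSN.
# Class/method names are mine, not the hardware's.
NUM_ITS = 5

class GlobalTile:
    def __init__(self):
        self.pending = {}                    # block address -> ITs not done

    def start_refill(self, addr):
        self.pending[addr] = set(range(NUM_ITS))
        # ...broadcast addr to every IT on the Global Refill Network...

    def refill_complete(self, addr, it_id):  # an IT's completion signal (GSN)
        self.pending[addr].discard(it_id)
        if not self.pending[addr]:           # all ITs have refilled
            del self.pending[addr]
            print(f"block {addr:#x}: refill done, dispatch may issue")

gt = GlobalTile()
gt.start_refill(0x4000)
for it in range(NUM_ITS):
    gt.refill_complete(0x4000, it)
```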

Distributed Protocols – Execution
– An RT reads registers as directed by the block's read instructions
– The RT forwards each value to its consumer ETs via the OPN
– Each ET selects and executes enabled instructions (those whose operands have all arrived)
– The ET forwards results (via the OPN) to other ETs or to the DTs
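A minimal dataflow-firing sketch for one ET reservation station (illustrative only; the structure and names are my own): an instruction becomes enabled once all of its operands have arrived over the OPN.

```python
# Minimal dataflow-firing sketch for one ET reservation station
# (illustrative; the real selection logic is hardware, not Python).
class ReservationStation:
    def __init__(self, opcode, num_operands, targets):
        self.opcode = opcode
        self.num_operands = num_operands
        self.operands = {}                 # slot -> value, filled via OPN
        self.targets = targets             # consumer slots (EDGE encoding)

    def deliver(self, slot, value):
        """An operand arrives on the OPN; returns True when enabled."""
        self.operands[slot] = value
        return len(self.operands) == self.num_operands

    def fire(self):
        ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
        result = ops[self.opcode](self.operands[0], self.operands[1])
        return [(t, result) for t in self.targets]   # route to consumers

rs = ReservationStation("add", 2, targets=["N7.left", "W0"])
rs.deliver(0, 40)
if rs.deliver(1, 2):                       # second operand enables it
    print(rs.fire())                       # [('N7.left', 42), ('W0', 42)]
```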

Distributed Protocols – Block/Pipeline Flush
– On a branch misprediction, the GT initiates a flush wave on the GCN
– The wave tells all ETs, DTs, and RTs which block(s) to flush
– The wave propagates at one hop per cycle
– The GT may issue a new dispatch command immediately: the new command travels behind the flush wave on the same network, so it can never overtake the flush
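The no-overtaking guarantee can be illustrated with a toy hop-per-cycle model (my own, not the real GCN logic): anything injected after the flush stays one hop behind it at every tile.

```python
# Toy model of the GCN ordering guarantee: commands move one hop per
# cycle down a chain of tiles, so a dispatch injected after a flush can
# never overtake it. Illustrative only; not the real GCN implementation.
def simulate(injections, num_tiles=4, cycles=6):
    stages = [None] * num_tiles                # one command per hop
    for cycle in range(cycles):
        for i in reversed(range(1, num_tiles)):
            stages[i] = stages[i - 1]          # advance the wavefront
        stages[0] = injections.pop(0) if injections else None
        for tile, cmd in enumerate(stages):
            if cmd:
                print(f"cycle {cycle}: tile {tile} receives {cmd}")

simulate([("flush", "block 3"), ("dispatch", "block 5")])
```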

Distributed Protocols – Block Commit
Block completion
– The block has produced all of its outputs: 1 branch, ≤32 register writes, ≤32 stores
– The DTs use the DSN to track completed-store information
– The DTs and RTs notify the GT via the GSN
Block commit
– The GT broadcasts a commit command on the GCN to the RTs and DTs
Commit acknowledgment
– The DTs and RTs acknowledge to the GT via the GSN
– The GT then deallocates the block
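The three phases map naturally onto a small per-block state machine on the GT side; the sketch below uses invented state and method names, not the hardware's.

```python
# Sketch of the commit handshake as a GT-side state machine per block
# frame (state/method names invented; not the actual hardware design).
from enum import Enum, auto

class BlockState(Enum):
    EXECUTING = auto()
    COMPLETED = auto()       # all outputs produced (reported on the GSN)
    COMMITTING = auto()      # commit broadcast in flight on the GCN
    DEALLOCATED = auto()     # frame free for a new block

class BlockFrame:
    def __init__(self, tiles):
        self.state = BlockState.EXECUTING
        self.awaiting = set(tiles)           # RTs/DTs yet to report

    def completion_report(self, tile):       # phase 1, over the GSN
        self.awaiting.discard(tile)
        if not self.awaiting:
            self.state = BlockState.COMPLETED

    def broadcast_commit(self, tiles):       # phase 2, GT -> RTs/DTs on GCN
        self.state = BlockState.COMMITTING
        self.awaiting = set(tiles)

    def commit_ack(self, tile):              # phase 3, acks on the GSN
        self.awaiting.discard(tile)
        if not self.awaiting:
            self.state = BlockState.DEALLOCATED

tiles = ["RT0", "RT1", "DT0", "DT1"]
frame = BlockFrame(tiles)
for t in tiles:
    frame.completion_report(t)
frame.broadcast_commit(tiles)
for t in tiles:
    frame.commit_ack(t)
print(frame.state)                           # BlockState.DEALLOCATED
```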

Prototype Evaluation – Area
Area expense
– Operand Network (OPN): 12%
– On-Chip Network (OCN): 14%
– Load/Store Queues (LSQs) in the DTs: 13%
– Control protocol area overhead is light

Prototype Evaluation – Latency
Cycle-level simulator (tsim-proc)
Benchmark suite
– Microbenchmarks (dct8x8, sha, matrix, vadd), signal-processing library kernels, a subset of the EEMBC suite, and SPEC benchmarks
Components of critical-path latency
– Operand routing is the largest contributor: hop latencies 34%, contention another 25%, operand replication and fan-out up to 12%
– Control latencies overlap with useful execution
– The data networks need optimization

Prototype Evaluation – Comparison
Compared to a 267 MHz Alpha processor
– Speedups range from 0.6x to over 8x
– Serial benchmarks see performance degrade