Presentation transcript:

UT-Austin CART 1
Mechanisms for Streaming Architectures
Stephen W. Keckler
Computer Architecture and Technology Laboratory
Department of Computer Sciences, The University of Texas at Austin
August 22, 2003

UT-Austin CART 2
Diversity in Streaming Programs
High variability in attributes of:
– Computation/memory BW ratio
– Control structure
– Memory access: data streams, constants
Examples:
– DSP/multimedia: subword parallelism, predictable control flow
– Scientific: vectorizable, with regular or irregular data access
– Communication (packet processing, encryption, etc.): regular control flow in some kernels, irregular and data-dependent control flow in others
– Graphics: regular control/data (rasterization), irregular control/data (shading)

UT-Austin CART 3
Challenges for Streaming Architectures
Supporting regular and irregular control structures
Different types of storage demands:
– Stream/vector register files (regular access)
– Scalar constants
– Indexed tables
Scaling of kernel processors:
– Large bypass networks, instruction broadcasting
– Amenable to pipelining, but pipelining decreases agility
Traditional streaming architectures target the regular case, illustrated by the RGB-to-YIQ kernel below; the "vertex skinning" kernel shows the irregular, data-dependent case.

RGB to YIQ conversion:
  for (i = 0; i < n; i++) {
    read(r[i], g[i], b[i]);
    Y = K1*r + K2*g + K3*b;
    I = K4*r + K5*g + K6*b;
    Q = K7*r + K8*g + K9*b;
    store(Y[i], I[i], Q[i]);
  }

"Vertex skinning":
  for (i = 0; i < n; i++) {
    read(D[i]);
    for (j = 0; j < m[i]; j++)
      D = func(D, table[j]);
    store(D[i]);
  }
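
For concreteness, here is a minimal, runnable C version of the two loop patterns above. The YIQ weights are standard NTSC-like coefficients, and the skinning-style kernel uses made-up table data and a stand-in for func(), since the slide gives no concrete values.

  /* Minimal C sketch of the two kernel patterns above. */
  #include <stdio.h>

  #define N 4

  /* Regular kernel: fixed trip count, streaming reads/writes, scalar constants. */
  static void rgb_to_yiq(const float r[], const float g[], const float b[],
                         float Y[], float I[], float Q[], int n)
  {
      for (int i = 0; i < n; i++) {
          Y[i] = 0.299f * r[i] + 0.587f * g[i] + 0.114f * b[i];
          I[i] = 0.596f * r[i] - 0.274f * g[i] - 0.322f * b[i];
          Q[i] = 0.211f * r[i] - 0.523f * g[i] + 0.312f * b[i];
      }
  }

  /* Irregular kernel: the inner trip count m[i] is data dependent and each
   * iteration indexes a lookup table, as in the vertex-skinning example. */
  static void skinning_like(float D[], const int m[], const float table[], int n)
  {
      for (int i = 0; i < n; i++)
          for (int j = 0; j < m[i]; j++)
              D[i] = 0.5f * (D[i] + table[j]);   /* stand-in for func(D, table[j]) */
  }

  int main(void)
  {
      float r[N] = {1, 0, 0, 1}, g[N] = {0, 1, 0, 1}, b[N] = {0, 0, 1, 1};
      float Y[N], I[N], Q[N];
      float D[N] = {1, 2, 3, 4};
      int   m[N] = {1, 3, 0, 2};              /* data-dependent trip counts */
      float table[4] = {0.1f, 0.2f, 0.3f, 0.4f};

      rgb_to_yiq(r, g, b, Y, I, Q, N);
      skinning_like(D, m, table, N);
      for (int i = 0; i < N; i++)
          printf("Y=%.3f I=%.3f Q=%.3f D=%.3f\n", Y[i], I[i], Q[i], D[i]);
      return 0;
  }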

UT-Austin CART 4
DLP Control Characteristics
[Table: for each benchmark, kernel size (# instructions), record size (bytes), ILP, compute/memory ratio, and inner-loop iteration count; the numeric entries were not preserved in this transcript. Benchmarks: Convert, DCT, High pass filter, FFT, LU, MD, Blowfish, Rijndael, Vertex_simple_light, Fragment_simple_light, Vertex_reflection, Fragment_reflection, and Vertex_skinning (whose inner-loop count is variable), drawn from the multimedia, network, graphics, and scientific domains.]

UT-Austin CART 5
DLP Data Characteristics
[Table: for each benchmark, read record size (bytes), write record size (bytes), number of scalar constants, lookup table size, and number of table accesses; the numeric entries were not preserved in this transcript. The benchmarks are the same multimedia, network, graphics, and scientific programs as on the previous slide.]

UT-Austin CART 6
Desirable Attributes of a Stream Processor
Performs well on DLP programs with different attributes:
– Synchronous core to minimize synchronization overheads on traditional vector/stream applications
– MIMD-like capabilities for applications with irregular control
– Support for different types of data structures
Partitioned and scalable microarchitecture:
– Dataflow instruction execution
– Limit/eliminate global broadcast of instructions/data
Decoupled processor core:
– From the memory system, to enable memory fetch parallelism
– From other processor cores, to enable kernel pipelining
Efficient instruction distribution and reuse:
– Exploit spatial/temporal locality

UT-Austin CART 7
Dataflow Execution in TRIPS Core
[Figure: TRIPS core grid with instruction/register banks (Bank 0-3) and move units; the RGB-to-YIQ kernel mapped as a dataflow graph of LD, SMUL (constants K1-3, K4-6, K7-9), RADD, SHOR, and ST operations across the ALUs.]
SPDI: Static Placement, Dynamic Issue
– Instructions execute in dataflow order
Instructions stream in from the left, data from the right
ALU chaining
Pipeline instruction distribution and execution
Static unrolling for latency tolerance
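
A small, hypothetical C sketch of the firing rule behind SPDI (not TRIPS code): a node is placed at a fixed slot but issues only once all of its operands have arrived, so the three multiplies of the YIQ example can complete in any order before the add fires.

  /* Hypothetical sketch of static placement, dynamic issue (SPDI): an
   * operation is assigned a fixed slot up front but fires only when all of
   * its operands have arrived, so issue order follows dataflow rather than
   * program order. Illustration only; this is not the TRIPS ISA. */
  #include <stdio.h>

  typedef struct {
      const char *name;
      int   nsrc;      /* operands required before the node may fire */
      int   arrived;   /* operands received so far */
      float ops[3];
  } Node;

  /* Deliver one operand to a node; the node fires when the last one arrives. */
  static float deliver(Node *n, float v)
  {
      n->ops[n->arrived++] = v;
      if (n->arrived < n->nsrc)
          return 0.0f;                      /* still waiting for operands */
      float acc = n->ops[0];
      for (int i = 1; i < n->nsrc; i++)
          acc += n->ops[i];
      printf("fired %s\n", n->name);
      return acc;
  }

  int main(void)
  {
      /* Dataflow graph for Y = K1*r + K2*g + K3*b, with NTSC-like weights. */
      float r = 0.5f, g = 0.25f, b = 0.125f;
      Node add = {"ADD", 3, 0, {0}};

      /* The three multiplies may complete in any order; ADD fires only after
       * all three products have been delivered to it. */
      deliver(&add, 0.114f * b);
      deliver(&add, 0.299f * r);
      float Y = deliver(&add, 0.587f * g);
      printf("Y = %.3f\n", Y);
      return 0;
  }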

UT-Austin CART 8
Instruction Fetch Efficiency
Spatial loop unrolling:
– Copy the same instruction sequence (LD ... ST) across multiple execution units
Mapping reuse (revitalize):
– Reset previously mapped instructions
– Re-execution without refetch
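
A toy sketch of the revitalize idea under a simplified model of my own: the kernel is fetched and mapped once, and between records only the mapped instructions' state is reset, so re-execution costs no additional fetches. The counters just make the fetch savings visible; the function names are illustrative.

  /* Toy model of mapping reuse: the kernel's instructions are fetched and
   * placed once; between records only their execution state is reset
   * ("revitalize"), so re-execution needs no refetch. Simplified
   * illustration, not the TRIPS implementation. */
  #include <stdio.h>

  #define KERNEL_INSTRS 8
  #define RECORDS       1000

  static int fetched, executed;

  static void map_kernel(void)        { fetched  += KERNEL_INSTRS; } /* fetch and place once */
  static void run_mapped_kernel(void) { executed += KERNEL_INSTRS; } /* execute in place */
  static void revitalize(void)        { /* reset mapped instructions; no fetch */ }

  int main(void)
  {
      map_kernel();                    /* one-time instruction distribution */
      for (int i = 0; i < RECORDS; i++) {
          run_mapped_kernel();
          revitalize();                /* ready the same mapping for the next record */
      }
      printf("instructions fetched: %d, executed: %d\n", fetched, executed);
      return 0;
  }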

UT-Austin CART 9
Memory Accesses
[Figure: TRIPS core grid with L1 and L2 cache banks; a subset of the L2 banks is configured as a stream register file (SRF), accessed with LMW over streaming channels.]
Irregular memory accesses:
– Map to the hardware cache hierarchy
Regular data accesses:
– Subset of L2 cache banks configured as a Stream Register File (SRF)
– High-bandwidth data channels to the SRF, reduced address bandwidth
– DMA engines transfer between the SRF and DRAM or other SRFs
Constants saved in reservation stations with their corresponding instructions
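
A short C sketch contrasting the two access classes, using assumed names (srf_fill() stands in for a DMA engine; none of these interfaces are real): the regular stream is filled into an SRF-like buffer in one bulk transfer and read sequentially, while the lookup table is accessed through ordinary indexed loads of the kind a cache hierarchy would service.

  /* Sketch of the two access classes above: a regular stream is moved in one
   * bulk transfer into an SRF-like buffer (memcpy stands in for a DMA engine,
   * few addresses generated) and then read sequentially, while irregular
   * table lookups use per-element indexed loads served by the caches.
   * srf_fill() and all array names are illustrative, not a real API. */
  #include <stdio.h>
  #include <string.h>

  #define N 8

  static float dram_stream[N]  = {0, 1, 2, 3, 4, 5, 6, 7};
  static float lookup_table[4] = {10, 20, 30, 40};
  static float srf[N];                            /* stream register file stand-in */

  static void srf_fill(float *dst, const float *src, int n)
  {
      memcpy(dst, src, n * sizeof *dst);          /* models one bulk DMA transfer */
  }

  int main(void)
  {
      int   idx[N] = {3, 0, 2, 2, 1, 3, 0, 1};    /* data-dependent indices */
      float out[N];

      srf_fill(srf, dram_stream, N);              /* regular: bulk, sequential */
      for (int i = 0; i < N; i++)
          out[i] = srf[i] + lookup_table[idx[i]]; /* irregular: indexed lookup */

      for (int i = 0; i < N; i++)
          printf("%.1f ", out[i]);
      printf("\n");
      return 0;
  }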

UT-Austin CART 10
MIMD Extensions
Extensions to ALU nodes:
– Local control (PC)
  - Independent loops
  - Conditionals
– Treat frame space as local instruction and data memory
Tightly coupled MIMD:
– Central sequencer
  - Distributes kernel code to ALUs
  - Determines when to proceed to the next kernel
– Inter-ALU communication under investigation
Issue: limited reservation station storage
– Add more local instruction storage
– Aggregate multiple ALUs into a larger logical ALU
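
To make the benefit of a local PC concrete, here is a minimal sketch under assumed lane counts: each lane runs its own data-dependent inner loop to completion, whereas a lockstep SIMD array would have to issue the worst-case trip count on every lane and squash the extra work.

  /* Illustration of per-node local control: each "ALU" has its own trip
   * count and, conceptually, its own PC, so each lane finishes its inner
   * loop independently rather than padding to the worst case in lockstep.
   * Simplified model, not the actual TRIPS MIMD extension. */
  #include <stdio.h>

  #define LANES 4

  int main(void)
  {
      int   trip[LANES] = {1, 5, 2, 8};   /* data-dependent inner-loop counts */
      float D[LANES]    = {1.0f, 1.0f, 1.0f, 1.0f};
      int   useful = 0, worst = 0;

      for (int lane = 0; lane < LANES; lane++)
          if (trip[lane] > worst)
              worst = trip[lane];

      for (int lane = 0; lane < LANES; lane++) {
          /* Each lane advances its own control independently. */
          for (int j = 0; j < trip[lane]; j++) {
              D[lane] *= 0.5f;
              useful++;
          }
      }
      printf("useful iterations: %d; lockstep SIMD would issue %d per lane (%d total)\n",
             useful, worst, worst * LANES);
      return 0;
  }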

UT-Austin CART 11
Performance Comparisons

UT-Austin CART 12
TRIPS Chip
4 cores (with streaming support)
L2 cache and SRF memory banks
Pipelining across kernels mapped to different cores
Extend to a system through off-chip channels

UT-Austin CART 13
Summary
Streaming applications: irregular and regular
– Irregular control and data access
– Irregularity increasing (particularly in the graphics domain)
Limitations of many stream/vector kernel processors:
– Execution model demands regular control flow
  - Clever extensions such as conditional streams
– Poor support for irregular table access
  - Scatter/gather for sparse/irregular vectors
Hybridization can extend the range of applications:
– Efficient caching support
– Flexible instruction execution
  - Decouple execution engines and memory
  - Single instruction stream for regular control flow and loop-carried dependencies
  - Tightly coupled MIMD for irregular control flow
  - Less synchronization means better scalability