EXPLORING THE TRADEOFFS BETWEEN PROGRAMMABILITY AND EFFICIENCY IN DATA-PARALLEL ACCELERATORS YUNSUP LEE, RIMAS AVIZIENIS, ALEX BISHARA, RICHARD XIA, DEREK LOCKHART, CHRISTOPHER BATTEN, AND KRSTE ASANOVIC ISCA ’11

MOTIVATION
- Need for data parallelism in a wide variety of applications
- Trend of offloading DLP tasks to accelerators
- Accelerators must be capable of handling a wider variety of DLP
- Need for programmability, which trades off against implementation efficiency

IN THIS PAPER…
- Study of five architectural patterns for DLP
- A parametrized, synthesizable RTL implementation of each pattern
- Evaluation of VLSI layouts of the microarchitectural patterns
- Evaluation of area, energy & performance

FOR REGULAR & IRREGULAR DLP…
- MIMD
- Vector-SIMD
- Subword-SIMD
- SIMT
- VT

MIMD
- Multiple cores, scalar or multithreaded
- A single host thread (HT) interacts with the OS and manages the accelerator
- Several microthreads (uTs)
- Easy to program, but more area & lower efficiency
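To make the mapping concrete, here is a minimal C++ sketch of the MIMD pattern, with std::thread standing in for hardware microthreads; the function name, the interleaved partitioning, and the thread count are illustrative assumptions, not the paper's code.

    #include <thread>
    #include <vector>

    // MIMD mapping of a data-parallel loop: the host thread (HT) spawns the
    // microthreads (uTs) and waits for them; each uT runs ordinary scalar code.
    void mimd_vvadd(const float* a, const float* b, float* c, int n, int nuT) {
      std::vector<std::thread> uTs;
      for (int t = 0; t < nuT; ++t)
        uTs.emplace_back([=] {
          for (int i = t; i < n; i += nuT)   // interleaved partition of the loop
            c[i] = a[i] + b[i];
        });
      for (auto& th : uTs) th.join();        // HT manages and joins the uTs
    }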

VECTOR-SIMD
- Multiple vector processing elements, each managed by a control thread (CT)
- Each CT drives several uTs in lock-step
- Straightforward for regular DLP: the control processor (CP) executes redundant instructions once, the CP & vector issue unit (VIU) amortize control overhead, and the vector memory unit (VMU) moves large blocks of data efficiently
- Complex irregular DLP may need vector flags and flag arithmetic (sketched below)
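As a concrete illustration, here is a plain-C++ sketch of how a CT stripmines a loop and uses vector flags to mask a data-dependent conditional. HWVLEN and the kernel are illustrative assumptions, and each inner loop stands in for a single vector instruction.

    #include <algorithm>

    const int HWVLEN = 32;  // assumed maximum hardware vector length

    // The CT stripmines the loop; the flag array plays the role of a vector
    // flag register masking a data-dependent conditional across the uTs.
    void cond_vvadd(const float* a, const float* b, float* c, int n) {
      bool flag[HWVLEN];
      for (int i = 0; i < n; i += HWVLEN) {
        int vlen = std::min(HWVLEN, n - i);  // set vector length for last strip
        for (int j = 0; j < vlen; ++j)       // "vector compare" into flags
          flag[j] = a[i + j] > 0.0f;
        for (int j = 0; j < vlen; ++j)       // flag-masked "vector add"
          if (flag[j]) c[i + j] = a[i + j] + b[i + j];
      }
    }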

SUBWORD-SIMD
- Vector-like execution on wide scalar datapaths divided into subwords
- Fixed vector length, unlike vector-SIMD, and alignment constraints on memory accesses
- Limited support for irregular DLP
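The pattern is familiar from packed-SIMD extensions such as SSE. A minimal sketch using real SSE intrinsics (my example, not from the paper) shows both the fixed vector length of four floats and the alignment constraint.

    #include <immintrin.h>

    // Four packed single-precision adds per instruction: the vector length (4)
    // is fixed by the ISA, and _mm_load_ps/_mm_store_ps require 16-byte-aligned
    // pointers. Assumes n is a multiple of 4 and the arrays are aligned.
    void vvadd_sse(const float* a, const float* b, float* c, int n) {
      for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);   // aligned 128-bit load
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(c + i, _mm_add_ps(va, vb));
      }
    }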

SIMT
- No CTs: the HT manages the uTs directly
- Each uT executes scalar operations, so instructions are executed redundantly per uT, memory accesses are scalar, and every uT repeats the stripmining calculation (sketched below)
- Data-dependent control flow is simple to map: hardware generates masks to disable inactive uTs
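A minimal C++ sketch of the per-uT work, with names and the kernel chosen for illustration: each microthread redundantly computes its own index and bounds check, and the bounds check is what hardware turns into an execution mask.

    // Per-uT work in the SIMT pattern. Every microthread redundantly computes
    // its own index and bounds check (the per-uT stripmining calculation).
    void simt_uT(int block, int tid, int block_dim,
                 const float* a, const float* b, float* c, int n) {
      int i = block * block_dim + tid;  // redundant index arithmetic in every uT
      if (i < n)                        // becomes a mask disabling inactive uTs
        c[i] = a[i] + b[i];             // one scalar access per uT
    }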

VT
- The CT executes efficient vector memory instructions
- The CT issues vector-fetch instructions that mark the start of a scalar instruction stream for the uTs to execute
- uTs execute in a SIMD manner while converged
- A vector-fetched scalar branch can cause divergence
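A rough sketch of the execution model in plain C++ (illustrative names, not the Maven ISA or library): the CT stripmines the loop and, in hardware, each strip would be vector loads plus one vector-fetch of the scalar body, whose branch is the potential divergence point.

    #include <algorithm>

    const int VLEN = 32;  // assumed hardware vector length

    // Scalar body the CT points at with a vector-fetch; each uT executes it.
    void uT_body(float& c, float a, float b) {
      if (a > 0.0f) c = a + b;  // vector-fetched scalar branch: divergence point
      else          c = a - b;
    }

    // Control-thread loop: in hardware, each strip would be two vector loads,
    // one vector-fetch of uT_body, and one vector store.
    void ct_loop(const float* a, const float* b, float* c, int n) {
      for (int i = 0; i < n; i += VLEN) {
        int vlen = std::min(VLEN, n - i);
        for (int j = 0; j < vlen; ++j)
          uT_body(c[i + j], a[i + j], b[i + j]);
      }
    }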

MICRO-ARCHITECTURAL COMPONENTS
- Library of parametrized, synthesizable RTL components
- Long-latency functional units: integer multiply & divide; single-precision FP add, multiply, divide & sqrt
- Scalar integer core: RISC ISA, optionally multithreaded
- Vector lane: unified 6r3w-port regfile (dynamically reconfigurable), 2 VAUs (int ALU + long-latency FU), VLU, VSU & VGU
- Vector memory unit (VMU): coordinates data movement between memory and the vector regfile
- Vector issue unit (VIU): fetches instructions and handles control flow; its design is what differentiates vector-SIMD from VT
- Blocking & non-blocking caches

TILES
- MIMD tile: scalar integer cores with integer & FP long-latency FUs, 1-8 uTs per core
- Vector tile: single-threaded scalar integer core (CP), VIU, vector lane(s), VMU & shared long-latency FUs
- Multi-core tiles: 4 MIMD cores or 4 single-lane vector cores
- Multi-lane tiles: a single CP with a 4-lane vector unit
- Each tile: shared 64KB 4-bank D$ with request & response arbiters & crossbars; the CP has a 16KB I$ and the VT VIU a 2KB vector I$

OPTIMIZATIONS
- Banked vector regfile: four 2r1w-port banks instead of one 6r3w-port regfile
- Per-bank integer ALUs
- Density-time execution: compresses a vector fragment so only active uTs spend cycles
- Dynamic fragment convergence: merges fragments in the PVFB when their PCs match; two schemes, 1-stack (single stack) & 2-stack (current & future stacks)
- Dynamic memory coalescer (multi-lane VT only): tries to satisfy multiple uT accesses with a single wide (128-bit) memory access (toy check sketched below)
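For the coalescer, a toy version of the basic check might look as follows. This is an assumption about the idea, not the actual Maven logic, which must also handle alignment and lane timing.

    #include <cstdint>

    // Toy coalescing check: the uT accesses of one vector memory operation can
    // be merged into a single 128-bit access only if every active uT address
    // falls in the same 16-byte line.
    bool can_coalesce(const uint64_t* addr, const bool* active, int vlen) {
      bool seen = false;
      uint64_t line = 0;
      for (int j = 0; j < vlen; ++j) {
        if (!active[j]) continue;        // masked-off uTs impose no constraint
        uint64_t l = addr[j] >> 4;       // 16-byte (128-bit) line index
        if (!seen) { line = l; seen = true; }
        else if (l != line) return false;
      }
      return true;
    }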

HARDWARE TOOLFLOW
- Synopsys-based ASIC toolflow for simulation, synthesis & place-and-route
- Power estimation with IC Compiler & PrimeTime on the benchmarks
- Only gate-level energy results are reported
- DesignWare components with register retiming
- CACTI used to model SRAMs and caches

TILE CONFIGS
- 22 tile configurations explored
- Vector length capped at 32, since some structures scale quadratically in area with vector length

MICROBENCHMARKS & APP KERNELS
- Microbenchmarks: vvadd (1000-element FP vector-vector add) & bsearch (1000 binary-search lookups)
- App kernels: viterbi (regular DLP); rsort, kmeans & dither (irregular memory access); physics & strsearch (irregular DLP)
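For reference, scalar versions of the two microbenchmarks, reconstructed from the one-line descriptions above rather than taken from the paper:

    // vvadd: regular DLP, a pure element-wise loop.
    void vvadd(const float* a, const float* b, float* c, int n) {
      for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
    }

    // bsearch: irregular DLP, dominated by data-dependent branches.
    int bsearch_idx(const int* sorted, int n, int key) {
      int lo = 0, hi = n - 1;
      while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (sorted[mid] == key) return mid;
        if (sorted[mid] < key) lo = mid + 1; else hi = mid - 1;
      }
      return -1;  // key not present
    }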

PROGRAMMING METHODOLOGY
- Write code for the HT and explicitly annotate data-parallel tasks for the uTs
- Developed an explicit-DLP programming environment
- Modified GNU C++ library provides a unified ISA for the CT and uTs
- Developed a VT library
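A purely hypothetical sketch of what such an explicit-DLP annotation could look like; parallel_for_uT and its signature are invented for illustration and are not the Maven VT library API.

    #include <functional>

    // Hypothetical explicit-DLP annotation; in a real VT system this would
    // compile to a CT stripmine loop plus a vector-fetch of the lambda body.
    void parallel_for_uT(int n, const std::function<void(int)>& body) {
      for (int i = 0; i < n; ++i) body(i);  // sequential stand-in
    }

    void host_code(const float* a, const float* b, float* c, int n) {
      parallel_for_uT(n, [=](int i) {  // annotated data-parallel task
        c[i] = a[i] + b[i];            // scalar code executed by each uT
      });
    }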

CYCLE TIME & AREA COMPARISON

MICRO-ARCHITECTURAL TRADEOFFS
- Impact of additional physical registers per core
- Impact of regfile banking and per-bank integer ALUs

IMPACT OF DENSITY-TIME EXECUTION, DYNAMIC FRAGMENT CONVERGENCE & MEMORY COALESCING

APPLICATION KERNEL RESULTS
- Adding more uTs to MIMD is ineffective
- Vector-based machines are faster and more energy-efficient than MIMD, though their larger area reduces the relative advantage
- VT is more efficient than vector-SIMD for both multi-core and multi-lane designs
- Multi-lane vector designs are more efficient than multi-core ones
- A single-core, multi-lane VT tile with the 2-stack PVFB, banked regfile & per-bank integer ALUs is a good starting point for Maven

CONCLUSIONS
- Data-parallel accelerators must be capable of handling a broader range of DLP
- Vector-based microarchitectures are more area- and energy-efficient than scalar microarchitectures
- Maven, a VT microarchitecture based on the vector-SIMD pattern, is both more efficient and easier to program

LOOKING AHEAD
- More detailed comparison of VT with SIMT
- Can programming vector-SIMD be made easier?
- Would hybrid machines combining pure MIMD & pure SIMD be more efficient for irregular DLP?

A COUPLE OF THINGS MISSING…
- Effect of the inter-tile interconnect and memory system
- Comparison with similar VT microarchitectures, e.g., the Scale VT processor
- ANYTHING ELSE?

THANK YOU