Enabling Multithreading on CGRAs
Jared Pager (1), Reiley Jeyapaul (1), Aviral Shrivastava (1), Mahdi Hamzeh (1,2), Sarma Vrudhula (2)
(1) Compiler Microarchitecture Lab, (2) VLSI Electronic Design Automation Laboratory, Arizona State University, Tempe, Arizona, USA
CML Web page: aviral.lab.asu.edu

Need for High Performance Computing
Applications that need high-performance computing:
- Weather and geophysical simulation
- Genetic engineering
- Multimedia streaming
(figure: compute demand growing from petaflop toward zettaflop scale)
Need for Power-efficient Performance
Power requirements limit the aggressive scaling trends in processor technology:
- In high-end servers, power consumption doubles every 5 years; the cost of cooling rises at a similar rate
- Servers account for 2.3% of US electrical consumption, about $4 billion in electricity charges (ITRS 2010)
Accelerators can help achieve Power-efficient Performance
- Power-critical computations can be off-loaded to accelerators
- Accelerators perform application-specific operations and achieve high throughput without loss of CPU programmability
Existing examples:
- Hardware accelerator: Intel SSE
- Reconfigurable accelerator: FPGA
- Graphics accelerator: NVIDIA Tesla (Fermi GPU)
CGRA: Power-efficient Accelerator
Distinguishing characteristics:
- Flexible programming
- High performance
- Power-efficient computing
Cons:
- Compiling a program for a CGRA is difficult; not all applications can be compiled
- No standard CGRA architecture
- Requires extensive compiler support for general-purpose computing
(figure: each PE contains an FU and RF, accesses the local instruction and data memories, and exchanges values with its neighbors and memory; PEs communicate through an interconnect network)
Mapping a Kernel onto a CGRA
Example kernel:
  Loop: t1 = (a[i]+b[i])*c[i]
        d[i] = ~t1 & 0xFFFF
Given the kernel's data dependence graph (DDG):
1. Mark source and destination nodes
2. Assume a CGRA architecture
3. Place all nodes on the PE array (spatial mapping)
   1. Place dependent nodes close to their sources
   2. Ensure dependent nodes have interconnects connecting them to their sources
4. Map time slots for each PE execution (temporal scheduling)
   1. Dependent nodes cannot execute before their source nodes
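The placement and scheduling rules above can be expressed as a simple validity check. The sketch below is illustrative, not the actual mapping algorithm: the node names, placement coordinates, and time slots are hypothetical, and the interconnect is assumed to be a 4-connected mesh.

```python
# Illustrative sketch (node names, placement, and schedule are hypothetical):
# checking that a spatial mapping of a kernel DDG onto a PE mesh follows the
# placement and scheduling rules listed above.

# DDG for: t1 = (a[i] + b[i]) * c[i]; d[i] = ~t1 & 0xFFFF
edges = [("add", "mul"), ("mul", "not"), ("not", "and")]

placement = {"add": (0, 0), "mul": (0, 1), "not": (1, 1), "and": (1, 2)}
schedule = {"add": 0, "mul": 1, "not": 2, "and": 3}  # time slot per node

def neighbors(pe):
    """PEs reachable from `pe` over an assumed 4-connected mesh interconnect."""
    x, y = pe
    return {(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)}

def valid_mapping(edges, placement, schedule):
    for src, dst in edges:
        # Rule 3.2: a dependent node needs an interconnect to its source.
        if placement[dst] not in neighbors(placement[src]):
            return False
        # Rule 4.1: a dependent node cannot execute before its source.
        if schedule[dst] <= schedule[src]:
            return False
    return True

print(valid_mapping(edges, placement, schedule))  # True
```

A real mapper searches for a placement and schedule satisfying these checks while minimizing the initiation interval.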
Mapped Kernel Executed on the CGRA
- Each PE executes in a time slot (cycle)
- The entire kernel can be mapped onto the CGRA by unrolling 6 times
- After cycle 6, one iteration of the loop completes execution every cycle: Initiation Interval (II) = 1
- The Initiation Interval is a measure of mapping quality
Traditional Use of CGRAs
- An application is mapped onto the CGRA and system inputs are given to the application
- Power-efficient application execution is achieved
- Generally used for streaming applications
- Examples: ADRES, MorphoSys, KressArray, RSPA, DART
Envisioned Use of CGRAs
- Specific kernels in a thread can be power/performance critical
- Such a kernel can be mapped and scheduled for execution on the CGRA, using the CGRA as a co-processor (accelerator)
- Power-consuming processor execution is saved, better thread performance is realized, and overall throughput is increased
CGRA as an Accelerator
Application with a single thread:
- The entire CGRA is used to schedule each kernel of the thread
- Only a single thread is accelerated at a time
Application with multiple threads:
- The entire CGRA is still used to accelerate each individual kernel
- If multiple threads require simultaneous acceleration, threads must be stalled and their kernels queued to run on the CGRA
- Not all PEs are used in each schedule, and thread stalls create a performance bottleneck
Proposed Solution: Multithreading on the CGRA
Through program compilation and scheduling:
- Schedule an application onto a subset of PEs, not the entire CGRA
- Enable dynamic multithreading without re-compilation
- Facilitate multiple schedules executing simultaneously
This can increase total CGRA utilization, reduce overall power consumption, and increase multithreaded system throughput.
(figure: Threads 1 and 2 — maximum CGRA utilization; Threads 1, 2, and 3 — shrink-to-fit mapping maximizing performance; Threads 2 and 3 — expand to maximize CGRA utilization and performance)
Our Multithreading Technique
1. Static compile-time constraints to enable fast run-time transformations
   - Minimal effect on performance (II)
   - Increases compile time
2. Fast dynamic transformations
   - Take linear time with respect to the kernel II
   - All schedules are treated independently
Features:
- Dynamic multithreading enabled in linear run time
- No additional hardware modifications; the CGRA topology must only provide the supporting PE interconnects
- Works with current mapping algorithms, provided the algorithm allows for custom PE interconnects
Hardware Abstraction: CGRA Paging
- A page is a conceptual group of PEs
- A page has symmetrical connections to each of its neighboring pages
- Page-level interconnects follow a ring topology
- No additional hardware 'feature' is required
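The paging abstraction can be sketched in a few lines. This is an illustrative model only: it assumes a 4x4 CGRA split into four 2x2 pages with row-major page numbering, which may not match the labeling in the slide's figure.

```python
# Hypothetical sketch of CGRA paging: PEs of a 4x4 array grouped into 2x2
# pages, with page-level links forming a ring (each page talks to two
# neighboring pages). Page numbering here is row-major and illustrative.

def page_of(pe, pes_per_row=4, page_dim=2):
    """Map a PE index (row-major, e0..e15) to its page number."""
    row, col = divmod(pe, pes_per_row)
    pages_per_row = pes_per_row // page_dim
    return (row // page_dim) * pages_per_row + (col // page_dim)

def ring_neighbors(page, n_pages=4):
    """Pages a page may exchange data with under the ring topology."""
    return {(page - 1) % n_pages, (page + 1) % n_pages}

print(page_of(0), page_of(5), page_of(3))  # 0 0 1: e0, e5 share a page
print(ring_neighbors(0))                   # {1, 3}
```

The grouping is purely a compiler-side convention: the underlying inter-PE wires are unchanged, which is why no hardware feature is needed.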
Step 1: Compiler Constraints assumed during Initial Mapping
Compile-time assumptions:
- The CGRA is a collection of pages
- Each page can interact with only one topologically neighboring page
- Inter-PE connections within a page are unmodified
In most cases these assumptions do not affect mapping quality, and they may help improve CGRA resource usage:
- A naïve mapping could leave CGRA resources under-used
- Our paging methodology helps improve CGRA resource usage
Step 2: Dynamic Transformation enabling multiple schedules
Example: an application mapped to 3 pages is shrunk to execute on 2 pages.
Transformation procedure:
1. Split the pages
2. Arrange the pages in time order
3. Mirror pages to facilitate shrinking (this preserves inter-node dependencies)
4. Execute the shrunk pages on altered time schedules
Constraint: inter-page dependencies must be maintained.
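The split/arrange/mirror steps can be illustrated with a toy model. This is not the paper's exact algorithm: it only sketches the idea that folding a schedule onto fewer physical pages stretches the II proportionally, and that mirroring alternate folds keeps dependent pages physically close.

```python
# Toy model (not the actual transformation algorithm) of shrinking a
# schedule that spans `logical_pages` pages onto fewer physical pages.
import math

def shrink(logical_pages, physical_pages, ii):
    """Return (physical page, time offset in cycles) per logical page,
    plus the stretched initiation interval of the shrunk schedule."""
    folds = math.ceil(logical_pages / physical_pages)
    new_ii = folds * ii  # fewer pages -> proportionally slower kernel
    mapping = {}
    for lp in range(logical_pages):
        fold, phys = divmod(lp, physical_pages)
        # "Mirror" odd folds so a logical page lands next to (or on) the
        # physical page holding the operations it depends on.
        if fold % 2 == 1:
            phys = physical_pages - 1 - phys
        mapping[lp] = (phys, fold * ii)
    return mapping, new_ii

mapping, new_ii = shrink(logical_pages=3, physical_pages=2, ii=1)
print(mapping, new_ii)  # {0: (0, 0), 1: (1, 0), 2: (1, 1)} 2
```

In the 3-pages-onto-2 example, logical page 2 is mirrored onto physical page 1 one II later, so its dependence on page 1 never has to cross the array.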
Experiment 1: Compiler Constraints are Liberal
- Mapping quality is measured by the Initiation Interval (II); smaller II is better
- The constraints can improve an individual benchmark's performance by, ironically, limiting the compiler's search space
- The constraints can also degrade an individual benchmark's performance for the same reason
- On average, performance is minimally impacted
Experimental Setup
- CGRA configurations: 4x4, 6x6, 8x8
- Page configurations: 2, 4, 8 PEs per page
- Number of threads in the system: 1, 2, 4, 8, 16; each thread has a kernel to be accelerated
Experiments:
- Single-threaded CGRA: when a thread arrives at its kernel, the thread is stalled until the kernel executes; only one thread is serviced at a time
- Multithreaded CGRA: the CGRA accelerates kernels as and when they arrive; no thread is stalled and multiple threads are serviced
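A back-of-envelope model shows why the multithreaded configuration wins when kernels arrive together. All numbers below are illustrative, and the model idealizes shrinking as work-conserving (a kernel shrunk to half the pages takes twice as long), which only approximates real behavior.

```python
# Illustrative comparison (all numbers hypothetical): four threads reach
# their kernels at once on a CGRA with 8 pages.
import math

TOTAL_PAGES = 8
kernels = [(4, 100), (4, 100), (4, 100), (4, 100)]  # (pages used, cycles)

# Single-threaded CGRA: kernels run one at a time, other threads stall.
single = sum(cycles for _, cycles in kernels)

# Multithreaded CGRA (idealized): shrinking is work-conserving, so the
# makespan approaches total work divided by available pages.
work = sum(pages * cycles for pages, cycles in kernels)
multi = math.ceil(work / TOTAL_PAGES)

print(single, multi)  # 400 200
```

Under this model the multithreaded CGRA finishes in half the time simply because the serialized case leaves half the pages idle; the measured results follow this trend.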
Multithreading Improves System Performance
- Number of threads accessing the CGRA: as the number of threads increases, multithreading provides an increasing performance benefit
- CGRA size: as CGRA size increases, multithreading provides better utilization and therefore better performance
- PEs per page: for the set of benchmarks tested, the optimal number of PEs per page is either 2 or 4
Summary
- Power-efficient performance is the need of the future
- CGRAs can be used as accelerators: power-efficient performance can be achieved, but usability is limited by compiling difficulties
- With multithreaded applications, the CGRA needs multithreading capabilities
- We propose a two-step dynamic methodology:
  - Non-restrictive compile-time constraints to schedule an application into pages
  - A dynamic transformation procedure to shrink/expand the resources used by a schedule
- Features: no additional hardware required, improved CGRA resource usage, improved system performance
Future Work
- Using CGRAs as accelerators in systems with inter-thread communication
- Studying the impact of the compiler constraints on compute-intensive and memory-bound benchmark applications
- Possible use of thread-level scheduling to improve overall performance
Thank you!
Measuring CGRA Performance
- A completed mapping is called a schedule; a schedule consists of a prolog, kernel, and epilog
- Mapping quality is measured by the Initiation Interval (II): the number of cycles the kernel portion takes to execute a single iteration of the loop (i.e., if II = 2, an iteration of the original loop completes every two cycles)
- II is limited by CGRA resources (number of PEs, etc.) and by recurrence cycles:
  - If there are only 4 PEs but the DDG contains 9 nodes, II can at best be 3
  - If a loop cannot be unrolled, II can be hurt
Recurrence cycle:
  t1 = A[i] + C[i-1]
  C[i] = B[i] + t1
Unrolling:
  t1a = A[i] + C[i-1]
  C[i] = B[i] + t1a
  t1b = A[i+1] + C[i]
  C[i+1] = B[i+1] + t1b
t1b can never execute any earlier, no matter how many times we unroll.
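These two limits on II can be written as the standard modulo-scheduling lower bounds. The names ResMII and RecMII are standard terminology rather than the slides' own; the sketch reproduces the 9-nodes-on-4-PEs example and the t1/C[i] recurrence above.

```python
# The two lower bounds on the Initiation Interval described above.
import math

def res_mii(n_nodes, n_pes):
    """Resource-constrained minimum II: each PE issues one op per cycle,
    so n_nodes ops need at least ceil(n_nodes / n_pes) cycles per iteration."""
    return math.ceil(n_nodes / n_pes)

def rec_mii(cycle_latency, cycle_distance):
    """Recurrence-constrained minimum II for a dependence cycle whose ops
    take `cycle_latency` cycles in total and span `cycle_distance` loop
    iterations."""
    return math.ceil(cycle_latency / cycle_distance)

print(res_mii(9, 4))  # 3: the 9-node DDG on 4 PEs from the example above
print(rec_mii(2, 1))  # 2: the t1/C[i] recurrence (2 ops, distance 1)
```

The achievable II is at least the maximum of the two bounds, which is why unrolling cannot help the recurrence example: the dependence cycle, not the PE count, is the binding constraint.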
DDG with Recurrence
(figure: the DDG of the recurrence example unrolled across iterations i-2, i-1, i, and i+1)
State-of-the-art Multi-threading on CGRAs
Polymorphic Pipeline Arrays [Park 2009]:
- Enables dynamic scheduling
- A collection of schedules makes up a kernel
- Some schedules can be given more resources than others
Limitations:
- The collection of schedules must be known at compile time
- Schedules are assumed to be 'pipelining' stages of a single kernel
(figure: 8 cores and 4 memory banks running pipelined filter stages)