University of Michigan Electrical Engineering and Computer Science Automatic Synthesis of Customized Local Memories for Multicluster Application Accelerators.

Presentation transcript:

Automatic Synthesis of Customized Local Memories for Multicluster Application Accelerators
Manjunath Kudlur, Kevin Fan, Michael Chu, Scott Mahlke
Advanced Computer Architecture Laboratory, University of Michigan

Motivation
Custom application accelerators (ASICs/ASIPs) require careful data memory system design
– Large volumes of data accessed at high bandwidth
Distributed local memories (scratchpads)
– Achieve high bandwidth through parallel access
– Achieve low latency by placing data near computation
Custom memory design is complex
– Multiple considerations: bandwidth, size requirements, data distribution
– Decentralized datapath: another monkey wrench

Background – Our System
Synthesis of non-programmable accelerators
– System similar to PICO (Program-In Chip-Out)
– Input is a “hot” loop nest expressed in C
Throughput-directed synthesis
– Required throughput expressed as II (initiation interval)
– Innermost loop is modulo scheduled
– Datapath derived directly from the schedule
– FU allocation to meet II
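The link between the target II and FU allocation can be illustrated with a resource lower bound. This is a minimal sketch, not the synthesis system's allocator: it assumes every FU is fully pipelined and issues one operation per cycle, so n operations of one type need at least ceil(n / II) FUs of that type. The function name `min_fus_for_ii` and the operation mix in `loop_ops` are hypothetical.

```python
import math
from collections import Counter


def min_fus_for_ii(ops, ii):
    """Lower bound on FUs per operation type for a target II.

    Assumption: each FU can start one operation per cycle, so one FU
    can serve at most `ii` operations of its type per loop iteration.
    """
    counts = Counter(ops)
    return {kind: math.ceil(n / ii) for kind, n in counts.items()}


# Hypothetical operation mix of one loop body.
loop_ops = ["add", "add", "add", "mul", "mul",
            "load", "load", "load", "load", "store"]

for ii in (1, 2, 4):
    print(ii, min_fus_for_ii(loop_ops, ii))
```

A larger II lets each FU be shared by more operations, which is why a relaxed throughput target leads to a cheaper datapath.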

Background – Multicluster Datapath
FUs divided into clusters
Intercluster communication through global bus
Reduced wire lengths, reduced porting on register file structures
Increased compiler complexity
[Figure: a C program mapped onto Cluster 1 and Cluster 2, each with FUs, register FIFOs, MEM units, and local memories, connected by an interconnection network]

Background – Local Memories
SRAMs connected to MEM units in clusters
– Each data structure is assigned to a single SRAM
– A data structure can be a whole array or part of an array
– Currently only whole arrays are considered
Multiple arrays can be combined in a single SRAM
[Figure: Cluster 1 with FUs, register FIFOs, a MEM unit, and local memories]

Problem Statement and Approach
“Given a set of arrays, their sizes and bitwidths, the corresponding loop nest, the number of clusters and the target II, find an allocation of arrays to SRAMs and allocation of SRAMs to clusters such that overall cost is minimized”
Phase-ordered approach that handles two subproblems separately
– Memory synthesis
– Operation partitioning

Combining Arrays
Combining arrays into a single SRAM reduces hardware cost (row decoders, sense amps)
Issues with combining:
– Consider two arrays X and Y with (Bitwidth, Size) = $(B_1, S_1)$ and $(B_2, S_2)$
– Suppose $A_1$ and $A_2$ are the numbers of static accesses in the loop
– Number of ports = $\lceil (A_1 + A_2) / II \rceil$
– The combined SRAM holding X and Y has bitwidth $\mathrm{MAX}(B_1, B_2)$ and size $S_1 + S_2$
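The arithmetic of combining two arrays can be worked through in a few lines. This is a sketch under stated assumptions: the `Array` record, the helper names, and the example dimensions are hypothetical, and the port count is rounded up to an integer on the assumption that each port serves one access per cycle.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Array:
    bitwidth: int   # bits per entry (B)
    size: int       # number of entries (S)
    accesses: int   # static accesses in the loop (A)


def combine(a, b):
    """Shape of one SRAM holding both arrays: MAX(B1, B2) wide, S1 + S2 deep."""
    return Array(max(a.bitwidth, b.bitwidth), a.size + b.size,
                 a.accesses + b.accesses)


def ports_needed(arr, ii):
    """Ports needed so that all static accesses fit in II cycles."""
    return math.ceil(arr.accesses / ii)


# Hypothetical arrays X and Y.
x = Array(bitwidth=8, size=256, accesses=2)
y = Array(bitwidth=16, size=64, accesses=3)
xy = combine(x, y)

print(xy)                                             # combined shape
print(ports_needed(xy, ii=2))                         # ports at II = 2
print(xy.bitwidth * xy.size,                          # bits when combined
      x.bitwidth * x.size + y.bitwidth * y.size)      # bits when separate
```

With these numbers the combined SRAM stores 5120 bits against 3072 bits for two separate SRAMs, because the narrower array is padded to the wider bitwidth, and at II = 2 it needs 3 ports. Whether sharing decoders and sense amps outweighs that overhead depends on the memory cost model.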

Combining Arrays
Multicluster issues
– Can cause imbalance in operation distribution: all load/store operations for the combined arrays must be assigned to the same cluster
– Can increase intercluster traffic: address calculations and load uses would cause extra intercluster moves
[Figure: dataflow among +, LD, and USE operations with registers R1 and R2, showing an intercluster (IC) move]
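The extra traffic can be counted on a small dataflow graph. The sketch below assumes one intercluster move per produced value per destination cluster; the function `intercluster_moves`, the operation names, and the two placements are hypothetical and only mirror the add, load, use chain mentioned above.

```python
def intercluster_moves(edges, cluster_of):
    """Count intercluster moves for a given operation placement.

    edges: (producer, consumer) pairs of the dataflow graph.
    cluster_of: maps each operation to its cluster.
    Assumption: a value is moved once per destination cluster.
    """
    crossing = {
        (producer, cluster_of[consumer])
        for producer, consumer in edges
        if cluster_of[producer] != cluster_of[consumer]
    }
    return len(crossing)


# Address calculation feeds a load, whose result feeds a use.
edges = [("add", "ld"), ("ld", "use")]

# The load is pinned to cluster 1 (where its SRAM lives);
# the address calculation and the use sit on cluster 2.
split = {"add": 2, "ld": 1, "use": 2}
# Everything on the cluster that owns the SRAM.
together = {"add": 1, "ld": 1, "use": 1}

print(intercluster_moves(edges, split))     # 2
print(intercluster_moves(edges, together))  # 0
```

Pinning the load to the cluster that owns the combined SRAM forces either the moves or the migration of neighbouring operations, which is the imbalance described above.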

Solution 1
Formulate the problem as an integer program
– A binary decision variable X(i,j,k,l) denotes assignment of array ‘i’ to local memory ‘j’ with ‘k’ ports on cluster ‘l’
Constraints ensure that intercluster move bandwidth is not violated
Operation partitioning and modulo scheduling are performed after memory synthesis
[Figure: input arrays A, B, C, D and the target II feed memory synthesis, which places arrays on Cluster 1 and Cluster 2, followed by operation partitioning and modulo scheduling]

Experiments
System implemented in the Trimaran framework
Memory costs obtained from ARTISAN SRAM generator scripts
lp_solve used to solve the integer programs
A set of DSP kernels evaluated
– Loop oriented
– Many arrays accessed in the loops

Results for Solution 1
[Figure: results plotted against target initiation interval (II) for the channel, huffman, LU, and lyapunov benchmarks]

Achieved II in Solution 1
Solution 1 eagerly combines arrays
– Potential increase in intercluster moves due to imbalance in the distribution of LD/ST ops
– Achieved II is poor due to IC moves in recurrence cycles
[Table: best II achieved for the channel, huffman, LU, and lyapunov benchmarks at BW=2, BW=3, BW=4, and BW=5; the only values present are for LU: 5, 3, 3, 2]

Solution 2
Phase-ordered approach
– Two highly intertwined decisions: allocation of local memories and partitioning of operations
Three phases:
– Pre-Partitioning
– Memory Synthesis
– Operation Partitioning

Pre-Partitioning
Performance-oriented operation partitioning
– Memory operations accessing the same arrays are bound to the same cluster
– Consequently, arrays are bound to clusters
[Figure: pre-partitioning distributes arrays A, B, C, D, and E across Cluster 1 and Cluster 2]
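The effect of binding memory operations to clusters by array can be sketched with a simple load-balancing heuristic. This is a stand-in, not the performance-oriented partitioner the slides refer to: it only balances memory operation counts, greedily placing the most heavily accessed array on the least loaded cluster. The function name and the access counts are hypothetical.

```python
def bind_arrays_to_clusters(accesses, num_clusters):
    """Greedy stand-in for pre-partitioning.

    accesses: maps array name -> number of memory operations on it.
    All memory operations on one array go to the same cluster, so
    binding an array adds its whole access count to that cluster.
    """
    load = [0] * num_clusters
    binding = {}
    # Heaviest arrays first; ties broken by name for determinism.
    for name, count in sorted(accesses.items(), key=lambda kv: (-kv[1], kv[0])):
        cluster = min(range(num_clusters), key=lambda c: load[c])
        binding[name] = cluster
        load[cluster] += count
    return binding, load


# Hypothetical memory operation counts for arrays A to E.
accesses = {"A": 4, "B": 3, "C": 2, "D": 2, "E": 1}
binding, load = bind_arrays_to_clusters(accesses, num_clusters=2)
print(binding)
print(load)
```

Once arrays are bound this way, only arrays on the same cluster remain candidates for sharing an SRAM.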

Memory Synthesis
ILP used to optimally combine arrays within clusters
Pre-partitioning effectively disables combining of arrays that would cause operation imbalance
[Figure: memory synthesis combines arrays A, B, C, D, and E within Cluster 1 and Cluster 2]
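For a handful of arrays on one cluster, the optimal grouping can also be found by plain enumeration, which makes the objective easy to see. The sketch below swaps the ILP for an exhaustive search over all groupings and uses a made-up cost model (a fixed per-SRAM overhead plus bitwidth × size × ports); the real costs come from the SRAM generator, and every name and number here is hypothetical.

```python
import math


def partitions(items):
    """Yield every way to split a list into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part                      # first in its own group
        for i in range(len(part)):                  # or joined to a group
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


def sram_cost(group, arrays, ii, overhead=1000):
    """Hypothetical cost of one SRAM holding all arrays in `group`."""
    width = max(arrays[n][0] for n in group)
    depth = sum(arrays[n][1] for n in group)
    ports = math.ceil(sum(arrays[n][2] for n in group) / ii)
    return overhead + width * depth * ports


def total_cost(grouping, arrays, ii):
    return sum(sram_cost(g, arrays, ii) for g in grouping)


def best_grouping(names, arrays, ii):
    return min(partitions(names), key=lambda p: total_cost(p, arrays, ii))


# name -> (bitwidth, size, static accesses); all values hypothetical.
arrays = {"A": (8, 64, 1), "C": (8, 64, 1), "B": (16, 32, 2)}
best = best_grouping(["A", "C", "B"], arrays, ii=2)
print(best, total_cost(best, arrays, ii=2))
```

Under this cost model A and C share an SRAM, since their combined accesses still fit one port, while B stays separate because merging it would widen the memory and add a port. Enumeration grows too quickly to replace an ILP solver beyond a few arrays.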

Results for Solution 2
[Figure: results plotted against target initiation interval (II) for the channel, huffman, LU, and lyapunov benchmarks]

Achieved II for Solution 2
[Table: best II achieved for the channel, huffman, LU, and lyapunov benchmarks at BW=2, BW=3, BW=4, and BW=5, each without pre-partitioning (NONE) and with it (PRE), with a final row of percentages: %, 35%, 33%, 40%]
Cost of synthesized memory is not substantially different
But achieved II is 36% better with pre-partitioning

Conclusion
An approach for synthesizing custom local memories
– ILP-based optimal solution
– Works for clustered datapaths
Pre-partitioning improves achieved throughput with minimal impact on cost
For more information –

Example