University of Michigan Electrical Engineering and Computer Science 1 Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System Kevin Fan,


Presentation transcript:

University of Michigan Electrical Engineering and Computer Science 1 Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System Kevin Fan, Manjunath Kudlur, Hyunchul Park, Scott Mahlke Advanced Computer Architecture Laboratory University of Michigan

2 Introduction
Emerging applications have high performance, cost, energy demands
– H.264, wireless, software radio, signal processing
– Gops required
– 200 mW power budget
Applications dominated by tight loops processing large amounts of streaming data
[Figure: handset features: 3.5G (HSDPA), WiMax, stereo headset, TV out, PC / Mac, memory card, 20 GB HD. Source: ARM 2005]

3 Loop Accelerators
Order-of-magnitude performance and efficiency wins
– Viterbi: 100x speedup vs. ARM9
Automated C → gates solution
Correct by construction
Close designer productivity gap
Achieve short time-to-market

4 Loop Accelerator Template
Parameterized execution resources, storage, connectivity
Hardware realization of modulo scheduled loop

5 Loop Accelerator Design Flow
[Figure: design flow. Inputs: C code (.c) and performance (throughput). Stages: FU Alloc, Modulo Schedule, Build Datapath, Instantiate Arch, Synthesize. Intermediate results: abstract arch, scheduled ops, concrete arch, Verilog (.v) and control signals. Output: loop accelerator.]

6 Modulo Scheduling and Datapath Derivation
Schedule to abstract architecture (FUs)
Determine register and interconnect requirements from schedule
[Figure: source code (r1 = Mem[r2]; r3 = r...), its schedule (LOAD and ADD placed on FU1 and FU2 along a time axis), and the derived datapath (MEM and + units)]

7 Cost Sensitive Scheduling
Different scheduling alternatives not equal
[Figure: two alternative placements of + and LD operations across FU1, FU2, FU3 over time]
Traditional scheduling is hardware unaware
Intelligent scheduling needed to reduce hardware cost

8 Scheduling to Reduce Cost
Hardware cost is function of final schedule
Increased hardware sharing = reduced cost
Reusing hardware is “free”
Traditional metrics (register pressure) not sufficient
No additional cost for longer lifetime

9 Initial Approach: Greedy
Standard iterative modulo scheduler, augmented with hardware cost model
Choose alternative which increases cost the least

while unscheduled ops remain {
    get valid alternatives for op
    for each alternative {
        get hardware cost
    }
    schedule op using min-cost alternative
    update hardware cost model
}

Hardware cost = FU cost + Storage cost + Wire cost
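
The greedy loop on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' scheduler: the cost table `FU_COST`, the `CostModel` class, and the `(name, kind, earliest_time)` op format are all hypothetical, storage and wire cost are left out, and dependences are reduced to a precomputed earliest start time. The one piece taken from standard modulo scheduling is the reservation table, where an op placed at time t occupies its FU in slot t mod II.

```python
from dataclasses import dataclass, field

# Hypothetical cost of giving an FU the ability to execute each op kind.
FU_COST = {"add": 1.0, "mul": 4.0, "mem": 3.0}


@dataclass
class CostModel:
    """Toy hardware cost: each FU pays once for every op kind it executes."""
    kinds_on_fu: dict = field(default_factory=dict)  # fu index -> set of op kinds

    def total(self):
        return sum(FU_COST[k] for kinds in self.kinds_on_fu.values() for k in kinds)

    def delta(self, fu, kind):
        # Reusing hardware that the FU already has costs nothing extra.
        return 0.0 if kind in self.kinds_on_fu.get(fu, set()) else FU_COST[kind]

    def commit(self, fu, kind):
        self.kinds_on_fu.setdefault(fu, set()).add(kind)


def greedy_schedule(ops, num_fus, ii):
    """ops: list of (name, kind, earliest_time). Returns (schedule, total cost)."""
    model = CostModel()
    busy = set()      # modulo reservation table entries: (fu, time % ii)
    schedule = {}
    for name, kind, earliest in ops:
        # Valid alternatives: every FU and II consecutive start times with a free slot.
        alternatives = [
            (fu, t)
            for t in range(earliest, earliest + ii)
            for fu in range(num_fus)
            if (fu, t % ii) not in busy
        ]
        if not alternatives:
            raise RuntimeError(f"no free slot for {name}; a full scheduler would backtrack")
        # Pick the alternative that increases cost the least (ties: earlier time, lower FU).
        fu, t = min(alternatives, key=lambda a: (model.delta(a[0], kind), a[1], a[0]))
        busy.add((fu, t % ii))
        model.commit(fu, kind)
        schedule[name] = (fu, t)
    return schedule, model.total()
```

Called with two load/add pairs, two FUs, and II = 2, this sketch puts a load and an add on each FU for a toy cost of 8, although placing both loads on one FU and both adds on the other would cost 4. Each decision only sees the ops already placed, which is the local-minimum behaviour the next slide reports.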

10 Results – Greedy Scheduling
5% average cost savings
Local scope → local minima
Much more cost savings possible
[Chart: cost broken down into FU, storage, MUX]

11 Optimal Modulo Scheduling
Optimal modulo scheduling extends [Eichenberger ’97]
[Figure: search space for a loop with Op1, Op2, Op3; each op chooses an (FU #, time) pair such as (1,0), (1,1), (2,0), (2,1), (3,0), (3,1)]
$\text{Storage cost} = \sum_i \text{width}_i \times \text{depth}_i$
$\text{FU cost} = \sum_i \text{cost}(\text{FU}_i)$
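
To make the search space and the two cost terms concrete, here is a brute-force sketch. It is not the formulation the slide extends; it simply enumerates every (FU #, time) choice per op. The names `FU_COST`, `REG_WIDTH`, `schedule_cost`, and `optimal_schedule` are hypothetical, and so is the assumption that each dependence edge needs one register whose depth equals the value's lifetime in cycles.

```python
from itertools import product

FU_COST = {"add": 1.0, "mem": 3.0}  # hypothetical cost per op kind an FU supports
REG_WIDTH = 1.0                      # hypothetical register width, arbitrary units


def schedule_cost(ops, edges, placement, num_fus):
    """FU cost + storage cost of one complete schedule.

    placement maps op name -> (fu, time); edges are (src, dst, latency).
    """
    kinds = [set() for _ in range(num_fus)]
    for name, kind in ops:
        kinds[placement[name][0]].add(kind)
    fu_cost = sum(FU_COST[k] for ks in kinds for k in ks)
    # Assumed storage model: one register per edge, depth = value lifetime.
    storage_cost = sum(
        REG_WIDTH * (placement[dst][1] - placement[src][1]) for src, dst, _ in edges
    )
    return fu_cost + storage_cost


def optimal_schedule(ops, edges, num_fus, ii, horizon):
    """Exhaustively try every (FU #, time) pair for every op."""
    names = [name for name, _ in ops]
    slots = list(product(range(num_fus), range(horizon)))
    best_cost, best = None, None
    for choice in product(slots, repeat=len(ops)):
        placement = dict(zip(names, choice))
        # Two ops may not share an FU in the same modulo slot.
        if len({(fu, t % ii) for fu, t in choice}) < len(ops):
            continue
        # Every consumer must start after its producer's latency has elapsed.
        if any(placement[dst][1] < placement[src][1] + lat for src, dst, lat in edges):
            continue
        cost = schedule_cost(ops, edges, placement, num_fus)
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, placement
    return best_cost, best
```

Even this four-op toy visits 6^4 = 1296 combinations for two FUs and a three-cycle horizon, which hints at why exhaustive search does not scale.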

12 Results – Optimal Scheduling
27% average cost savings
[Chart: cost broken down into FU, storage, MUX]

13 Problem Decomposition
Exact solutions are not practical
– (#FU × II × stages)^#ops possible schedules
– 20 lines of C code → 100 hours
– Excessive runtimes even for modest-size loops
Decompose into more manageable sub-problems
– Partitioned scheduling
– Time-space decomposition
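
The size of the search space quoted on this slide is easy to evaluate directly. The function name and the parameter values in the usage line are hypothetical and chosen only to show the growth rate.

```python
def num_candidate_schedules(num_fus, ii, stages, num_ops):
    """Upper bound on schedules: each op picks one of #FU * II * stages positions."""
    return (num_fus * ii * stages) ** num_ops


# Hypothetical loop: 4 FUs, II = 4, 2 stages, 20 operations.
print(f"{num_candidate_schedules(4, 4, 2, 20):.3e}")
```

Because the op count sits in the exponent, adding a single operation multiplies the space by #FU × II × stages, so shrinking the number of ops considered at once is the lever the decomposition methods pull.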

14 Partitioned Scheduling
Partition the operations into small groups
Schedule groups of operations sequentially
– Account for hardware contribution of previously scheduled groups
– Backtrack if infeasible state reached
[Figure: successive groups of operations passed through the optimal modulo scheduler]
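
One way to picture partitioned scheduling is the recursive sketch below. It is an illustration under strong assumptions, not the authors' implementation: the cost table `FU_COST` and both function names are hypothetical, only FU cost is modelled, dependences are ignored, and each op is placed directly into a modulo slot `(fu, slot)` rather than at an absolute time. Each group is solved exhaustively, hardware committed by earlier groups is treated as free to reuse, and a group with no feasible placement makes the caller try its next candidate.

```python
from itertools import product

FU_COST = {"add": 1.0, "mem": 3.0}  # hypothetical cost per op kind an FU supports


def group_candidates(group, num_fus, ii, busy, kinds):
    """All feasible placements of one group, cheapest added hardware first."""
    free = [(fu, s) for fu in range(num_fus) for s in range(ii) if (fu, s) not in busy]
    out = []
    for placement in product(free, repeat=len(group)):
        if len(set(placement)) < len(group):
            continue  # two ops of the group landed in the same slot
        # Hardware that earlier groups already paid for is not charged again.
        new_hw = {(fu, kind) for (_, kind), (fu, _) in zip(group, placement)
                  if kind not in kinds[fu]}
        out.append((sum(FU_COST[kind] for _, kind in new_hw), placement))
    return sorted(out)


def partitioned_schedule(groups, num_fus, ii, busy=frozenset(), kinds=None):
    """groups: list of lists of (name, kind). Returns (cost, schedule) or None."""
    if kinds is None:
        kinds = tuple(frozenset() for _ in range(num_fus))
    if not groups:
        return 0.0, {}
    head, rest = groups[0], groups[1:]
    for added, placement in group_candidates(head, num_fus, ii, busy, kinds):
        new_kinds = list(kinds)
        for (_, kind), (fu, _) in zip(head, placement):
            new_kinds[fu] = new_kinds[fu] | {kind}
        result = partitioned_schedule(
            rest, num_fus, ii, busy | set(placement), tuple(new_kinds)
        )
        if result is not None:
            cost, schedule = result
            schedule.update({name: slot for (name, _), slot in zip(head, placement)})
            return added + cost, schedule
    return None  # infeasible state: the caller backtracks to its next candidate
```

With two FUs and II = 2, grouping the two loads together and the two adds together gives a toy cost of 4, while grouping each load with an add gives 8. How the ops are partitioned therefore matters as much as how each group is scheduled.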

15 Operation Partitioning
Traditional partitioning: minimize edge cuts
– Does not necessarily lead to good cost
Goal: maximize hardware sharing opportunities within a group
[Figure: dataflow graph of +, LD, << and * operations]

16 Results – Partitioned Scheduling
8% average cost savings
With large number of partitions, similar to greedy
[Chart: cost broken down into FU, storage, MUX]

17 Partition Size for Sharp
Improve cost by considering more ops at a time

18 Time-Space Decomposition
[Figure: the two decomposition orders. Time, space: ops are first assigned to time 0, time 1, ..., then to FU1, FU2, FU3. Space, time: ops are first assigned to FU 1, FU 2, FU 3, then to times.]
Reduce scheduling complexity
View all operations together
Optimize for register depth during time assignment, register width and FU cost during space assignment
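
The "space, time" order can be sketched as two independent passes. The example is a simplification under stated assumptions: `FU_COST`, `assign_space`, and `assign_time` are hypothetical names, the space pass minimizes FU cost only (register width is not modelled), and the time pass uses as-soon-as-possible placement as a stand-in for minimizing register depth.

```python
from itertools import product

FU_COST = {"add": 1.0, "mem": 3.0}  # hypothetical cost per op kind an FU supports


def assign_space(ops, num_fus, ii):
    """Space pass: choose an FU per op, minimizing FU cost only."""
    best_cost, best = None, None
    for fus in product(range(num_fus), repeat=len(ops)):
        if any(fus.count(fu) > ii for fu in range(num_fus)):
            continue  # an FU has only II modulo slots per iteration
        hardware = {(fu, kind) for (_, kind), fu in zip(ops, fus)}
        cost = sum(FU_COST[kind] for _, kind in hardware)
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, {name: fu for (name, _), fu in zip(ops, fus)}
    return best_cost, best


def assign_time(ops, edges, fu_of, ii):
    """Time pass: ASAP start times on the fixed FUs; ops in topological order.

    Assumes fu_of places at most II ops on each FU (assign_space guarantees it),
    so the search for a free modulo slot always terminates.
    """
    busy, times = set(), {}
    for name, _ in ops:
        t = max((times[src] + lat for src, dst, lat in edges if dst == name), default=0)
        while (fu_of[name], t % ii) in busy:
            t += 1  # slide later until this FU's modulo slot is free
        busy.add((fu_of[name], t % ii))
        times[name] = t
    return times
```

Each pass searches a much smaller space than the joint (FU #, time) problem, yet both passes still see every operation, which is the "global view" the slide contrasts with greedy and partitioned scheduling.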

19 Results – Time-Space Scheduling
Time, space: 19% average cost savings
Space, time: 20% average cost savings
[Chart: cost broken down into FU, storage, MUX]

20 Real Cost Savings
Viterbi, naïve scheduler, 0.66 mm²
Viterbi, space-time decomposed scheduler, 0.37 mm²
% overall area savings

21 Conclusion
Automated C → loop accelerator synthesis system
Modulo scheduler must be cost aware
Decomposition methods make problem tractable
– 20% average cost savings with space-time decomposition
– Importance of global view of all operations
Individual savings up to 43%
Compile times of 1 minute – 30 minutes

22 Questions?
For more information: