Optimal Partition with Block-Level Parallelization in C-to-RTL Synthesis for Streaming Applications Authors: Shuangchen Li, Yongpan Liu, X.Sharon Hu, Xinyu.

Presentation transcript:

Optimal Partition with Block-Level Parallelization in C-to-RTL Synthesis for Streaming Applications Authors: Shuangchen Li, Yongpan Liu, X. Sharon Hu, Xinyu He, Pei Zhang, and Huazhong Yang 2013/01/23

Outline: Introduction; Overview; MILP-Based Solution; Heuristic Solution; Experimental Evaluation; Conclusions and Future Work

Introduction: Background Application complexity increasing; MPSoCs architecture; hardware design complexity. Well-developed software libraries; low speed, high power. How to rapidly design hardware from existing software algorithms?

Introduction: Background (cont’d) This challenge is not new. However, –Ever-increasing design gap –Progression of EDA tools How to rapidly design hardware from existing software algorithms? C-to-RTL synthesis can help to tackle this challenge [29]

Introduction: Motivation C2RTL tools are promising –A number of C2RTL tools –A lot of success stories: a 3G/4G MIMO wireless system [7], a DVB-SH Turbo Decoder [8], a face detection system [9], …… Figure labels: 10 days; 3x speedup; very complicated design

Introduction: Motivation (cont’d) However, state-of-the-art C2RTL tools suffer from: –Low quality of results (QoR) for large C programs –Limited system-level optimization options Examples: a Reed-Solomon decoder [28]; a JPEG encoder [10]. Comparison table (columns: Flatten approach, Hierarchical approach, speedup; rows: Cycles, Clock): Cycles 42,475,2024,070, x; Clock x. Control and optimization at the system level are needed


Overview: Our work Given –a large C program for a streaming application –system constraints (latency, area, …) Determine –how to partition the code into pipelined blocks –which blocks should be parallelized The objectives –Improve synthesis result quality –Provide more system-level optimization options (Figure labels: Partition; Parallelization)

Overview: Design flow STEP 1: –We use eXCite here STEP 2: –Determine partition and parallelization STEP 3: –Synthesize each block with a C2RTL tool STEP 4: –Construct the complete system

Overview: An example Given a C program: –In the straight-line style Given constraints: –System throughput and area Partition: –Which functions should be synthesized together as one pipeline stage Parallelization: –Which synthesized modules should be parallelized

Overview: Challenges The design space is large: –Partition has a great impact on throughput and area –Parallelization has a great impact on throughput and area –The Pareto-optimal solutions must be found Partition and parallelization must be considered simultaneously: –The constraints apply to the system after both partition and parallelization –If the two are optimized separately, it is not clear how to apply the constraints to each problem individually (Figure: a GSM case)
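The notion of Pareto-optimal design points mentioned on this slide can be illustrated with a short filter. This is a minimal sketch: the function name `pareto_front` and the sample (area, throughput) values are hypothetical and are not taken from the GSM case.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (area, throughput)

def pareto_front(points: List[Point]) -> List[Point]:
    """Keep the points that no other point dominates.

    A point dominates another if it has no larger area and no smaller
    throughput, and is strictly better in at least one of the two.
    """
    front = []
    for a, r in points:
        dominated = any(
            (a2 <= a and r2 >= r) and (a2 < a or r2 > r)
            for a2, r2 in points
        )
        if not dominated:
            front.append((a, r))
    return sorted(set(front))

# Hypothetical design points, for illustration only.
candidates = [(330, 0.5), (340, 0.5), (520, 0.6), (630, 0.5), (220, 0.25)]
front = pareto_front(candidates)
print(front)
```

Every point that survives the filter is a different area/throughput trade-off, which is why a constraint on one metric is needed to pick a single design.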

Overview: Related work A somewhat related line of work is mapping C programs to MPSoCs (software mapping): –Blocks (or tasks) can be assigned to the same processor –The processor area is given
Work | Application | Input | Target | Partition | Parallelization
A. Hagiescu et al., DAC 2009 [11] | Stream | StreamIT | MSoPC | Manually | Heuristic
J. Cong et al., DATE 2012 [12] | Stream | C | FPGA | Manually | ILP
Y. Liu et al., Intech Book [13] | Stream | C | FPGA | Manually | Heuristic
Y. Hara et al., IEICE [14] | General | C | FPGA | ILP | N/A
This work | Stream | C | FPGA | Both MILP and Heuristic (considered simultaneously)

Overview: Our Contribution A novel MILP-based formulation –Finds a partition and parallelization solution with maximum throughput or minimum area while satisfying a given area or throughput constraint, respectively An efficient heuristic algorithm –Overcomes the scalability challenge facing the MILP formulation Validation of the proposed methods –Developing FPGA-based accelerators for seven streaming applications


MILP-Based Solution: Formulation Given function parameters ($Para$) –Area, throughput, … of each function Determine ($x_n$) –Which functions should be clustered to form blocks –Which blocks should be parallelized Objective: –min. Area ($a_{all}(x_n, Para)$) or max. Throughput ($r_{all}(x_n, Para)$) Subject to: –Area constraints ($a_{all} < A_{req}$) –Throughput constraints ($r_{all} > R_{req}$) –Connectivity constraints
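To make the shape of this optimization problem concrete, the sketch below solves a tiny instance of the min-area variant by exhaustive enumeration instead of MILP. The cost model is an assumption made only for illustration, since the transcript does not preserve the expressions for area and throughput. Here a block's throughput is its parallelism degree divided by the summed latency of its functions, its area is the degree times the summed function area plus a fixed per-block `overhead`, and a nonzero entry of `x` closes a block and sets its degree. All names and numbers are hypothetical.

```python
from itertools import product

def evaluate(x, area, latency, overhead):
    """Return (a_all, r_all) for one assignment x under the assumed cost model."""
    total_area, throughput = 0.0, float("inf")
    a_sum = t_sum = 0.0
    for xn, a, t in zip(x, area, latency):
        a_sum += a
        t_sum += t
        if xn != 0:  # a nonzero x_n closes the current block with degree x_n
            total_area += xn * (a_sum + overhead)
            throughput = min(throughput, xn / t_sum)
            a_sum = t_sum = 0.0
    return total_area, throughput

def min_area(area, latency, r_req, max_degree=4, overhead=10.0):
    """Enumerate every x and keep the smallest-area one with r_all > r_req."""
    best = None
    n = len(area)
    for head in product(range(max_degree + 1), repeat=n - 1):
        for last in range(1, max_degree + 1):  # last block must be closed
            x = head + (last,)
            a_all, r_all = evaluate(x, area, latency, overhead)
            if r_all > r_req and (best is None or a_all < best[0]):
                best = (a_all, x)
    return best

# Hypothetical three-function pipeline.
result = min_area(area=[100, 40, 60], latency=[4, 1, 1], r_req=0.4, max_degree=3)
print(result)
```

The enumeration grows as (max_degree + 1)^N, which is the kind of blow-up that motivates a solver-based formulation for larger N.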

MILP-Based Solution: Variable We use $\{x_n\} \in \mathbb{Z}$ to represent partition and parallelization: –Partition: If $x_n = 0$: $F_n$ and $F_{n+1}$ are in the same block –Parallelization: If $x_n \neq 0$: The parallelism degree of the block with $F_n$ is $x_n$ We also use $\{y_{i,j}\} \in$ Binary to represent partition –$y_{i,j} = 1$ means $F_i, F_{i+1}, \ldots, F_j$ are clustered
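The encoding can be read off mechanically. The following minimal sketch decodes a vector of $x_n$ values into blocks and the matching set of index pairs $(i, j)$ with $y_{i,j} = 1$. It assumes that the nonzero $x_n$ sits on the last function of each block and that degrees are positive; the helper names `decode` and `y_pairs` are hypothetical.

```python
from typing import List, Set, Tuple

def decode(x: List[int]) -> List[Tuple[List[int], int]]:
    """Turn x_1..x_N into a list of (function indices, parallelism degree)."""
    if not x or x[-1] == 0:
        raise ValueError("the last x_n must be nonzero so the final block is closed")
    if any(xn < 0 for xn in x):
        raise ValueError("parallelism degrees are assumed to be positive")
    blocks, current = [], []
    for n, xn in enumerate(x, start=1):
        current.append(n)      # F_n joins the block being built
        if xn != 0:            # nonzero x_n closes the block and gives its degree
            blocks.append((current, xn))
            current = []
    return blocks

def y_pairs(x: List[int]) -> Set[Tuple[int, int]]:
    """Index pairs (i, j) for which y_{i,j} = 1, i.e. F_i..F_j form one block."""
    return {(funcs[0], funcs[-1]) for funcs, _ in decode(x)}

blocks = decode([0, 2, 0, 0, 1])
print(blocks)
print(sorted(y_pairs([0, 2, 0, 0, 1])))
```

In this example $F_1, F_2$ form a block with parallelism degree 2, and $F_3, F_4, F_5$ form a second block that is not duplicated.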

MILP-Based Solution: Details To calculate throughput $r_{all}(x_n, Para)$: To calculate area $a_{all}(x_n, Para)$: Connectivity constraints: Equations (1) (2) (3) (4) (5) (6)


Heuristic Solution: Overview Motivation: –MILP is not scalable –Bad feasible regions may incur long running time even when N is small Consider partition and parallelization separately (constructive algorithm): –Parallelization before partition to increase throughput: Incx() –Partition for the given parallelization to reduce area: Clust() –Implement Incx() and Clust() in a backtracking, iterative way

Heuristic Solution: Algorithm Incx(): Parallelization before Partition to increase throughput Clust(): Partition for the given Parallelization to reduce area
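A minimal sketch of what a step like Incx() could look like, assuming that throughput is raised by incrementing the parallelism degree of whichever function is currently the bottleneck. The per-function rate model (degree divided by latency), the `max_degree` cap, and the function name `incx` are assumptions for illustration; the transcript does not show the actual pseudocode.

```python
from typing import List

def incx(latency: List[float], r_req: float, max_degree: int = 16) -> List[int]:
    """Raise per-function parallelism until every function exceeds r_req."""
    degree = [1] * len(latency)

    def rate(n: int) -> float:
        # Assumed model: d copies of a function with latency t sustain d / t.
        return degree[n] / latency[n]

    while True:
        bottleneck = min(range(len(latency)), key=rate)
        if rate(bottleneck) > r_req:
            return degree          # slowest function already meets the target
        if degree[bottleneck] >= max_degree:
            raise ValueError("throughput target unreachable within max_degree")
        degree[bottleneck] += 1    # duplicate only the bottleneck function

# Hypothetical latencies in cycles per item.
degrees = incx(latency=[4, 1, 1], r_req=0.4)
print(degrees)
```

Only the slowest function is duplicated in each iteration, so area grows where it buys throughput.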

Heuristic Solution: Algorithm (cont’d) Incx(), Parallelization before Partition: –Increase the parallelization degree of the bottleneck function Clust(), Partition under the given Parallelization: –Model the blocks and their connections as a graph –Convert the problem to a shortest path problem
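The shortest-path view of clustering can be sketched as follows. Nodes 0..N stand for the boundaries between consecutive functions, and an edge from i to j stands for a candidate block containing $F_{i+1}, \ldots, F_j$ with its area as the weight, so a shortest path from 0 to N is a minimum-area partition. The edge-weight function `example_block_area` and its cost model are hypothetical stand-ins; the slide does not say how the edge weights are computed.

```python
import math
from typing import Callable, List, Optional, Tuple

def clust(n: int, block_area: Callable[[int, int], Optional[float]]
          ) -> Tuple[float, List[Tuple[int, int]]]:
    """Shortest path over a DAG of block boundaries (simple dynamic program)."""
    best = [math.inf] * (n + 1)
    prev: List[Optional[int]] = [None] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):
            w = block_area(i + 1, j)   # None marks an infeasible block
            if w is not None and best[i] + w < best[j]:
                best[j], prev[j] = best[i] + w, i
    if math.isinf(best[n]):
        raise ValueError("no feasible partition")
    blocks, j = [], n
    while j > 0:                       # walk back along the chosen edges
        i = prev[j]
        blocks.append((i + 1, j))
        j = i
    return best[n], blocks[::-1]

# Hypothetical cost model: degree is the smallest integer d with d / latency > r_req.
AREA, LATENCY, R_REQ, OVERHEAD = [100, 40, 60], [4, 1, 1], 0.4, 10

def example_block_area(first: int, last: int) -> Optional[float]:
    t = sum(LATENCY[first - 1:last])
    d = math.floor(R_REQ * t) + 1
    return d * (sum(AREA[first - 1:last]) + OVERHEAD)

total, partition = clust(3, example_block_area)
print(total, partition)
```

Because every edge points forward, a single pass over the boundaries is enough and no general-purpose shortest-path routine is needed.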


Experiments: Setup 7 benchmarks [21]: –ADPCM –JPEG encoder/decoder –AES encryption/decryption –GSM –Filter Groups Environment & flow: –C2RTL: eXCite –Logic synthesis: Quartus II (Cyclone II) –Simulation: ModelSim

Experiments: Validate proposed method Min. area for GSM case Both MILP and Heuristic are effective –Heuristic solutions differ from the MILP results by 2.3% on average

Exp.: Validate proposed method (cont’d) Min. area for 7 benchmarks –Heuristic with a difference of 7.5% on average Solutions are effective with different applications

Experiments: Running time Running time: –The heuristic solutions are worse by 7.2% on average –x shorter time than the MILP


Conclusions and Future work Conclusions: –Our work adopts a hierarchical framework with automatic C-code partition and block-level parallelization –Both an MILP-based solution and a heuristic solution are proposed –Experimental results obtained from seven real applications show that our approaches are effective Future work: –Extend the solution to C programs with feedback –Take power into consideration

References [1]-[27] are listed in the paper. [28] “Comparison of high level design methodologies for algorithmic IPs: Bluespec and C-based synthesis,” Ph.D. dissertation, MIT, 2009. [29] “ITRS roadmap on Design,” 2011 Edition.

THANK YOU!

MILP-Based Solution: Linearization Linearize $x_j y_{i,j}$: $z_{i,j} = x_j y_{i,j}$ Linearize Equation (1): Linearize Equation (2): Linearize Equation (4):
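The transcript does not preserve how the product is linearized. One standard way to replace a product of a bounded integer and a binary variable with linear constraints is shown below as a sketch. It assumes $0 \le x_j \le U$, where $U$ is a hypothetical upper bound on the parallelism degree that does not appear in the slides.

```latex
% Assumption: 0 <= x_j <= U and y_{i,j} is binary; U is a hypothetical bound.
\begin{align}
  z_{i,j} &\le U \, y_{i,j}, \\
  z_{i,j} &\le x_j, \\
  z_{i,j} &\ge x_j - U \, (1 - y_{i,j}), \\
  z_{i,j} &\ge 0 .
\end{align}
```

With $y_{i,j} = 0$ the first and last constraints force $z_{i,j} = 0$. With $y_{i,j} = 1$ the second and third force $z_{i,j} = x_j$. In both cases $z_{i,j} = x_j y_{i,j}$ holds without any nonlinear term.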

Exp.: Validate proposed method (cont’d) Min. area or Max. throughput for GSM Solutions are effective with various settings