ICCAD’01: November, 2001 Instruction Generation for Hybrid Reconfigurable Systems Ryan Kastner, Seda Ogrenci-Memik, Elaheh Bozorgzadeh and Majid Sarrafzadeh.

Presentation transcript:

Instruction Generation for Hybrid Reconfigurable Systems
Ryan Kastner, Seda Ogrenci-Memik, Elaheh Bozorgzadeh and Majid Sarrafzadeh
Embedded and Reconfigurable Systems Group, Computer Science Department, UCLA, Los Angeles, CA 90095

Outline
- Introduction
- Programmability
- Hybrid Reconfigurable Systems
- Strategically Programmable System
- Instruction Generation
- Uses in Hybrid Reconfigurable Systems
- Relation to Template Generation and Matching
- Algorithm for Template Generation and Matching
- Experiments
- Conclusion

Programmability
Future systems need programmability at multiple levels of the computational hierarchy.
Computational hierarchy:
- Gate level: programmability at the bit; the basic unit of computation is a Boolean operation (and, or, xor); communication uses direct wire connections.
- Microarchitecture level: programmability at the byte; the basic unit of computation is an arithmetic operation; communication uses bundles of wires and registers.
- Architecture level: programmability at the instruction (8-128 bits); the basic unit of computation is a functional operation; communication uses a bus and memory.
Hybrid Reconfigurable Systems have programmability at one or more levels.
[Figure: block diagrams of the levels; labels include ADD, MUL, Register, Control, FU, Memory, Register Bank.]

Tradeoffs
- Architecture level. Example platforms: Tensilica, Improv. Types of programmable units: custom instructions, register banks.
- Micro-architecture level. Example platform: Chameleon Systems. Types of programmable units: datapath unit, control unit, RAM.
- Gate level. Example platforms: Xilinx, Altera. Types of programmable units: CLBs, LUTs.
Hybrid Reconfigurable Systems should find a happy medium.
[Figure: flexibility and configuration time compared across the levels; configuration-time labels: thousands of cycles, hundreds of cycles.]

SPS: Strategically Programmable System
- Embed (hard or soft) computational units, called Versatile Programmable Blocks (VPBs), into an FPGA-like fabric
- Combine programmable units from the gate, microarchitecture and architecture levels
- Balance flexibility and configuration time
- Need an automated method of determining the functionality of VPBs
[Figure: FPGA-like fabric with embedded blocks; labels: VPB, Memory.]

Overview of SPS
[Figure: SPS design flow. Input: a set of applications specified in high-level code (C/C++, Fortran, MOC). Steps: compile to a low-level specification; determine VPB functionality. Blocks: SPS Compiler, SPS Architecture Generation, VPB Synthesis, SPS Module Placement, Routing Arch., SPS Architecture.]

VPB Instruction Generation
Given a set of applications, what computation should be implemented on VPBs?
- Want complex, commonly occurring computation patterns
- Look for computational patterns at the instruction level
- Basic operation is add, multiply, shift, etc.
[Figure: a set of applications mapped onto a fabric; labels: VPB, RAM.]

Problem Definition
- Determining VPB functionality requires regularity extraction
- Regularity extraction: find common sub-structures (templates) in one graph or a collection of graphs
- Each application can be specified by a collection of graphs (CDFGs)
- Templates are implemented as VPBs
- Two related sub-problems: template matching and template generation

Template Matching: Formal Definition
Problem 1: Given a directed, labeled graph G(N, A) and a library of templates, each of which is a directed labeled graph T_i(V, E), find every subgraph of G that is isomorphic to any T_i.
[Figure: a directed labeled graph G whose nodes are operations (+, *, &, %, ||), and a library of templates T1 to T6.]
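Problem 1 can be illustrated with a brute-force search. The sketch below is not the authors' method; it is a minimal illustration, assuming graphs are stored as a dictionary of node labels plus a list of directed edges, and all node names and the function `find_matches` are hypothetical. It checks that every template edge is present in the graph (label-preserving monomorphism) and does not verify that no extra edges exist between the matched nodes, so it is slightly weaker than strict subgraph isomorphism. It is only practical for small templates.

```python
from itertools import permutations

def find_matches(g_labels, g_edges, t_labels, t_edges):
    """Return every mapping of template nodes onto graph nodes that
    preserves node labels and maps each template edge to a graph edge."""
    g_edge_set = set(g_edges)
    t_nodes = list(t_labels)
    matches = []
    # Try every injective assignment of graph nodes to template nodes.
    for image in permutations(g_labels, len(t_nodes)):
        mapping = dict(zip(t_nodes, image))
        if any(t_labels[t] != g_labels[mapping[t]] for t in t_nodes):
            continue  # operation labels differ
        if all((mapping[u], mapping[v]) in g_edge_set for u, v in t_edges):
            matches.append(mapping)
    return matches

# Hypothetical graph: two multiplies feed an add, which feeds another add.
g_labels = {"n1": "*", "n2": "*", "n3": "+", "n4": "+"}
g_edges = [("n1", "n3"), ("n2", "n3"), ("n3", "n4")]
# Hypothetical template: a multiply feeding an add.
t_labels = {"a": "*", "b": "+"}
t_edges = [("a", "b")]

matches = find_matches(g_labels, g_edges, t_labels, t_edges)
for m in matches:
    print(m)
```

In this example the two matches overlap at node n3, which is the situation Problem 2 addresses: choosing which overlapping matches to keep.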

Template Matching: Formal Definition
Problem 2: Given an unlimited supply of each template in a set {T_1, …, T_k}, and an overlapping set of subgraphs of the given graph G(N, E), each isomorphic to some member of that set, minimize k as well as Σ x_i, where x_i is the number of templates of type T_i used, such that the number of nodes left uncovered is minimum.
[Figure: the graph G from Problem 1 covered by template instances.]

Template Generation
- Templates may not always be given as input
- An automatic regularity extraction algorithm must develop its own templates
- Generate a set of templates such that the number of templates is minimized and the covering of the graph is maximized

Related Work
Regularity extraction is useful in a wide variety of CAD applications:
- Data path regularity [Chowdhary98], [Callahan99]
- Scheduling [Ly95]
- System partitioning [Rao93]
- Low power design [Mehra96]
- Soft macros: CPR [Cadambi99] for the PipeRench architecture

An Algorithm for Simultaneous Template Generation and Matching
Formal definition:
    Given a labeled digraph G(V, E)
    # C is a set of edge types
    C ← ∅
    while (stop_conditions_not_met(G))
        C ← profile_graph(G)
        cluster_common_edges(G, C)
Informal definition:
1. Find the most common edge type
2. Contract common edges
3. Repeat until a stopping condition is met
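The profiling step can be sketched in a few lines. This is an illustration rather than the authors' implementation: it assumes an edge type is the pair (source operation, destination operation), stores the graph as a label dictionary plus an edge list, and uses the hypothetical helper names `most_common_edge_type` and `should_stop`. The frequency threshold mirrors the stopping condition used in the experiments (most common edge occurs less than x% of the time).

```python
from collections import Counter

def profile_graph(labels, edges):
    """Count edge types, where a type is (source label, target label)."""
    return Counter((labels[u], labels[v]) for u, v in edges)

def most_common_edge_type(labels, edges):
    """Return the most frequent edge type and its share of all edges."""
    counts = profile_graph(labels, edges)
    if not counts:
        return None, 0.0
    edge_type, count = counts.most_common(1)[0]
    return edge_type, count / len(edges)

def should_stop(frequency, threshold):
    """Assumed stopping rule: the best edge type is too rare to cluster."""
    return frequency < threshold

# Hypothetical dataflow graph: three multiply-to-add edges, one add-to-add.
labels = {"n1": "*", "n2": "*", "n3": "+", "n4": "+", "n5": "*"}
edges = [("n1", "n3"), ("n2", "n3"), ("n3", "n4"), ("n5", "n4")]

edge_type, frequency = most_common_edge_type(labels, edges)
print(edge_type, frequency, should_stop(frequency, 0.25))
```

Here the multiply-to-add edge accounts for three of the four edges, so with a 25% threshold the loop would continue and cluster that edge type.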

Explanation of Algorithm
- Profile edges: find the most common edge types
- Edge contraction: merge adjacent nodes and maintain connectivity
- Stopping conditions: a certain number of templates is reached; the graph is sufficiently covered; there is no frequently occurring edge type
[Figure: a graph of + and * nodes with the most common edge type highlighted, shown before and after the edge is contracted.]
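Edge contraction can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the function name `contract_edge` is hypothetical, the merged node keeps the source node's name, and its label becomes a tuple that records the template it represents. Parallel edges created by the merge are collapsed into one, which loses multiplicity that a real dataflow graph might need to keep.

```python
def contract_edge(labels, edges, u, v):
    """Merge node v into node u for the edge (u, v), keeping connectivity.

    Edges that touched v are redirected to u, and the contracted edge
    itself disappears.
    """
    new_labels = {n: lab for n, lab in labels.items() if n != v}
    new_labels[u] = (labels[u], labels[v])  # the new template label
    new_edges = []
    for a, b in edges:
        a2 = u if a == v else a
        b2 = u if b == v else b
        if a2 == b2:
            continue  # the contracted edge would become a self-loop
        if (a2, b2) not in new_edges:
            new_edges.append((a2, b2))
    return new_labels, new_edges

# Hypothetical graph: n1 (*) -> n2 (+) -> n3 (+), and n4 (*) -> n2.
labels = {"n1": "*", "n2": "+", "n3": "+", "n4": "*"}
edges = [("n1", "n2"), ("n2", "n3"), ("n4", "n2")]

new_labels, new_edges = contract_edge(labels, edges, "n1", "n2")
print(new_labels)
print(new_edges)
```

After the contraction, the successor n3 and the second input n4 are both still attached to the merged node, which is what "maintain connectivity" requires.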

Algorithm in Action
[Figure: an example graph with operation nodes (*, >>, %, &, +) and four candidate edges, Edge 1 to Edge 4, shown at iteration 2. Steps: create the conflict graph over Edges 1 to 4; determine the MIS; contract edges 2 and 4; the contracted edges become templates.]
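The conflict-graph step can be sketched as below. The slide does not spell out how conflicts are defined or how the MIS is computed, so this sketch makes two assumptions: two candidate edges conflict when they share a node, and the independent set is chosen by a simple greedy heuristic. Finding a maximum independent set is NP-hard in general, so the greedy choice yields a maximal set that is not guaranteed to be the largest. The function names and node names are hypothetical.

```python
def build_conflict_graph(candidate_edges):
    """Assumed rule: two candidate edges conflict if they share a node."""
    conflicts = {e: set() for e in candidate_edges}
    for i, e in enumerate(candidate_edges):
        for f in candidate_edges[i + 1:]:
            if set(e) & set(f):
                conflicts[e].add(f)
                conflicts[f].add(e)
    return conflicts

def greedy_independent_set(conflicts):
    """Pick a maximal independent set, trying low-conflict edges first."""
    chosen = []
    blocked = set()
    for e in sorted(conflicts, key=lambda e: (len(conflicts[e]), e)):
        if e not in blocked:
            chosen.append(e)
            blocked.add(e)
            blocked |= conflicts[e]  # neighbours can no longer be chosen
    return chosen

# Hypothetical chain of multiplies: m1 -> m2 -> m3 -> m4 -> m5.
# All four edges have the same type, but neighbouring edges share a node.
candidates = [("m1", "m2"), ("m2", "m3"), ("m3", "m4"), ("m4", "m5")]

conflicts = build_conflict_graph(candidates)
chosen = greedy_independent_set(conflicts)
print(chosen)
```

Only the chosen edges are contracted in that iteration, so no node ends up in two template instances at once.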

Algorithm Summary
- The algorithm can be generalized and used in a variety of applications
- It is easily extended to hypergraphs
- Input/output pin restrictions can easily be added
- It performs template generation and matching simultaneously
- We target the algorithm towards VPB generation in SPS

Experimental Setup
[Figure: compilation flow. A set of applications specified in C is processed by SUIF and Machine-SUIF into a control flow graph; a dataflow graph generation pass then produces a control dataflow graph (CDFG).]

Experimental Setup (continued)
[Figure: experiment flow. MediaBench files are compiled to CDFGs; template generation and matching is performed on the CDFGs; statistics are gathered: graph coverage and number of templates.]

Experimental Setup: Benchmarks
Selected files from MediaBench:

Benchmark   C File         Description
mpeg2       motion.c       Motion vector decoding
mpeg2       getblk.c       DCT block decoding
adpcm       adpcm.c        ADPCM to/from 16-bit PCM
epic        convolve.c     2D general image convolution
jpeg        jctrans.c      Transcoding compression
jpeg        jdmerge.c      Color conversion
rasta       fft.c          Fast Fourier Transform
rasta       noise_est.c    Noise estimation functions
gsm         gsm_decode.c   GSM decoding
gsm         gsm_encode.c   GSM encoding

Similarity Across Applications

                 MediaBench file name
Operation   motion   jdmerge   getblk   gsm_dec   jctrans
ADD         50.3%    84.6%     44.5%    29.6%     84.6%
MUL         36.3%    13.8%     24.0%    22.4%     13.8%

Template coverage:
Template    motion   jdmerge   getblk   gsm_dec   jctrans
ADD-ADD     14.5%    9.1%      3.2%     3.6%      9.1%
ADD-MUL     0.0%     0.4%      0.6%     0.0%      0.4%
MUL-ADD     36.3%    13.0%     21.5%    22.4%     13.0%
MUL-MUL     0.0%, 1.3%, 0.0% [remaining cells and column alignment lost in extraction]

Experimental Results
Techniques:
- Simple: restrict templates to two operations
- No restrictions: unlimited number of operations
Stopping condition: the most common edge occurs less than x% of the time (x from 5 to 25)

Summary
- Systems need programmability at multiple levels of the computational hierarchy
- Introduced SPS as a Hybrid Reconfigurable System
- Developed an instruction generation algorithm to determine VPB functionality
- Showed that common templates can be found across a similar set of applications
- An efficient covering is possible using simple templates
- Future work: create methods to uncover more complex templates