Reconfigurable Computing (High-level Acceleration Approaches) Dr. Phillip Jones, Scott Hauck Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA
Projects Ideas: Relevant conferences FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Supercomputing HPCA IPDPS
Initial Project Proposal Slides (5-10 slides) Project team list: Name, Responsibility (who is the project leader) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up and cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea
Projects: Target Timeline Teams Formed and Idea: Mon 10/11 Project idea in PowerPoint 3-5 slides Motivation (why is this interesting, useful) What will be the end result High-level picture of final product Project team list: Name, Responsibility High-level Plan/Proposal: Wed 10/20 PowerPoint 5-10 slides System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Related research papers (if any)
Common Questions
Overview First 15 minutes of Google FPGA lecture How to run Gprof Discuss some high-level approaches for accelerating applications.
What you should learn Start to get a feel for approaches for accelerating applications.
Why use Customized Hardware? Great talk about the benefits of Heterogeneous Computing http://video.google.com/videoplay?docid=-4969729965240981475#
Profiling Applications Finding bottlenecks Profiling tools gprof: http://www.cs.nyu.edu/~argyle/tutorial.html Valgrind
Pipelining How many ns to process 100 input vectors, assuming each 4-LUT has a 1 ns delay? [Figure: input vector <A,B,C,D> driving an unpipelined chain of four 4-LUTs] How many ns to process 100 input vectors if a DFF is placed after each 4-LUT? Assume a 1 ns clock and 1 DFF delay per output. [Figure: the same chain with pipeline DFFs]
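A quick sketch of the two counts, under the slide's assumptions (four LUT levels, 1 ns per LUT, a 1 ns pipelined clock, 100 vectors):

```python
# Throughput comparison for the 4-LUT chain on the slide (assumptions:
# 4 LUT levels, 1 ns delay per LUT, 100 input vectors).

def unpipelined_ns(num_vectors, levels, lut_delay_ns=1):
    # Without pipeline registers the clock must cover the whole
    # combinational path, so each vector takes levels * lut_delay_ns.
    return num_vectors * levels * lut_delay_ns

def pipelined_ns(num_vectors, levels, clock_ns=1):
    # With a DFF after every LUT, a new vector enters each clock;
    # the first result appears after `levels` clocks (fill latency),
    # then one result per clock after that.
    return (levels + (num_vectors - 1)) * clock_ns

print(unpipelined_ns(100, 4))  # 400 ns
print(pipelined_ns(100, 4))    # 103 ns
```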
Pipelining (Systolic Arrays) Dynamic Programming 1.) Start with the base case (lower left corner) 2.) Formula for computing the remaining cells 3.) Final result in upper right corner. [Figure: animation filling a 3x3 matrix cell by cell: bottom row 1 1 1, middle row 1 2 3, top row 1 3 6]
Pipelining (Systolic Arrays) How many ns to process if a CPU can process one cell per clock (1 ns clock)?
Pipelining (Systolic Arrays) How many ns to process if an FPGA can obtain maximum parallelism each clock (1 ns clock)?
Pipelining (Systolic Arrays) What speedup would an FPGA obtain (assuming maximum parallelism) for a 100x100 matrix? (Hint: find a formula for an NxN matrix)
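A sketch of the usual counting argument behind the hint, under the assumption that the CPU fills one cell per 1 ns clock while the FPGA fills an entire anti-diagonal of cells per clock (an NxN matrix has 2N - 1 anti-diagonals):

```python
def cpu_ns(n):
    # One cell per 1 ns clock: N*N cells total.
    return n * n

def fpga_ns(n):
    # All cells on an anti-diagonal are independent, so a systolic
    # array can compute one anti-diagonal per clock: 2N - 1 clocks.
    return 2 * n - 1

n = 100
print(cpu_ns(n))               # 10000 ns
print(fpga_ns(n))              # 199 ns
print(cpu_ns(n) / fpga_ns(n))  # speedup of roughly N/2, about 50x here
```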
Example RNA Model (Dr. James Moscola) [Figure: RNA model for the sequence 'cga': a tree of nodes ROOT0, MATP1, MATL2, END3, each expanded into its internal states (S0, IL1, IR2, MP3, ML4, MR5, D6, IL7, IR8, ML9, D10, IL11, E12)]
Baseline Architecture Pipeline [Figure: the model's states (S0 through E12) arranged as pipeline stages, with the input residues (u g g c g a c a c c c) streaming through a residue pipeline]
Processing Elements [Figure: the processing element for state ML4 combines incoming scores from states IL7, IR8, ML9, and D10 with their transition costs ML4_t(7..10), then applies the emission score ML4_e for the input residue xi to produce ML4,3,3]
Baseline Results for Example Model Comparison to Infernal software: Infernal run on an Intel Xeon 2.8GHz; Baseline architecture run on a Xilinx Virtex-II 4000 (occupied 88% of logic resources, ran at 100 MHz) Input database of 100 Million residues Bulk of time spent on I/O (41.434s)
Expected Speedup on Larger Models

Name    | Num PEs | Pipeline Width | Pipeline Depth | Latency (ns) | HW Processing Time (s) | Total Time w/ measured I/O (s) | Infernal Time (s) | Infernal Time w/QDB (s) | Speedup over Infernal | Speedup w/QDB
RF00001 | 3539545 | 39492 | 195 | 19500 | 1.0000195 | 42.4340195 | 349492 | 128443 | 8236 | 3027
RF00016 | 5484002 | 43256 | 282 | 28200 | 1.0000282 | 42.4340282 | 336000 | 188521 | 7918 | 4443
RF00034 | 3181038 | 38772 | 187 | 18700 | 1.0000187 | 42.4340187 | 314836 |  87520 | 7419 | 2062
RF00041 | 4243415 | 44509 | 206 | 20600 | 1.0000206 | 42.4340206 | 388156 | 118692 | 9147 | 2797
Example |      81 |    26 |   6 |   600 | 1.0000006 | 42.4340006 |   1039 |    868 |   25 |   20

Speedups estimated using a 100 MHz clock for processing a database of 100 Million residues. Speedups range from 500x to over 13,000x; larger models with more parallelism exhibit greater speedups.
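The table's arithmetic can be reproduced with a short sketch, assuming (as the baseline results suggest) that hardware time is the 100 Million residues streamed at 100 MHz plus the pipeline latency, and that the measured 41.434 s of I/O is added on top:

```python
def expected_speedup(latency_ns, infernal_s, residues=100e6,
                     clock_hz=100e6, io_s=41.434):
    # HW processing: one residue per clock, plus pipeline fill latency.
    hw_s = residues / clock_hz + latency_ns * 1e-9
    total_s = hw_s + io_s                 # add the measured I/O time
    return infernal_s / total_s

# RF00001 row: latency 19500 ns, Infernal 349492 s (128443 s with QDB)
print(round(expected_speedup(19500, 349492)))   # ~8236
print(round(expected_speedup(19500, 128443)))   # ~3027
```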
Distributed Memory [Figure: each PE pairs an ALU and cache with several local BRAMs]
Next Class Models of Computation (Design Patterns)
Questions/Comments/Concerns Write down: Main point of lecture One thing that's still not quite clear OR, if everything is clear, give an example of how to apply something from lecture
CPRE 583 Reconfigurable Computing Lecture 11: Fri 10/1/2010 (Design Patterns) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ee.iastate.edu/cpre583/
Initial Project Proposal Slides (5-10 slides) Project team list: Name, Responsibility (who is the project leader) Team size: 3-4 (5 case-by-case) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up and cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea
Weekly Project Updates The current state of your project write-up Even in the early stages of the project you should be able to write a rough draft of the Introduction and Motivation sections The current state of your Final Presentation Your Initial Project proposal presentation (Due Wed 10/20) should make a starting point for your Final presentation What things are working & not working What roadblocks are you running into
Overview Class Project (example from 2008) Common Design Patterns
What you should learn Introduction to common Design Patterns & Compute Models
Outline Design patterns Why are they useful? Examples Compute models
References Reconfigurable Computing (2008) [1] Chapter 5: Compute Models and System Architectures Scott Hauck, Andre DeHon Design Patterns for Reconfigurable Computing [2] Andre DeHon (FCCM 2004) Type Architectures, Shared Memory, and the Corollary of Modest Potential [3] Lawrence Snyder: Annual Review of Computer Science (1986) Design Patterns: Abstraction and Reuse of Object Oriented Design [4] E. Gamma (1992) The Timeless Way of Building [5] C. Alexander (1979)
Design Patterns Design patterns are solutions to recurring problems.
Reconfigurable Hardware Design “Building good reconfigurable designs requires an appreciation of the different costs and opportunities inherent in reconfigurable architectures” [2] “How do we teach programmers and designers to design good reconfigurable applications and systems?” [2] Traditional approach: Read lots of papers for different applications Over time figure out ad-hoc tricks Better approach?: Use design patterns to provide a more systematic way of learning how to design It has been shown in other realms that studying patterns is useful Object-oriented software [4] Building architecture [5]
Common Language Provides a means to organize and structure the solution to a problem Provide a common ground from which to discuss a given design problem Enables the ability to share solutions in a consistent manner (reuse)
Describing a Design Pattern [2] 10 attributes suggested by Gamma (Design Patterns, 1995) Name: Standard name Intent: What problem is being addressed, and how? Motivation: Why use this pattern? Applicability: When can this pattern be used? Participants: What components make up this pattern? Collaborations: How do components interact? Consequences: Trade-offs Implementation: How to implement Known Uses: Real examples of where this pattern has been used Related Patterns: Similar patterns, patterns that can be used in conjunction with this pattern, and when you would choose a similar pattern instead of this pattern.
Example Design Pattern Coarse-grain Time-multiplexing Template Specialization
Coarse-grain Time-Multiplexing [Figure: Configuration 1 computes M1(A) and M2(B) into a Temp buffer; Configuration 2 loads M3, which consumes the Temp data]
Coarse-grain Time-Multiplexing Name: Coarse-grained Time-Multiplexing Intent: Enable a design that is too large to fit on a chip all at once to run as multiple subcomponents Motivation: Method to share limited fixed resources to implement a design that is too large as a whole.
Coarse-grain Time-Multiplexing Applicability (Requirements): Configuration can be done on large time scale No feedback loops in computation Feedback loop only spans the current configuration Feedback loop is very slow Participants: Computational graph Control algorithm Collaborations: Control algorithm manages when sub-graphs are loaded onto the device
Coarse-grain Time-Multiplexing Consequences: Often platforms take millions of cycles to reconfigure Need an app that will run for tens of millions of cycles before needing to reconfigure May need large buffers to store data during a reconfiguration Known Uses: Video processing pipeline [Villasenor] “Video Communications using Rapidly Reconfigurable Hardware”, Transactions on Circuits and Systems for Video Technology 1995 Automatic Target Recognition [Villasenor] “Configurable Computer Solutions for Automatic Target Recognition”, FCCM 1996
Coarse-grain Time-Multiplexing Implementation: Break the design into multiple subgraphs that can be configured onto the platform in sequence Design a controller to orchestrate the configuration sequencing Take steps to minimize configuration time Related patterns: Streaming Data Queues with Back-pressure
Coarse-grain Time-Multiplexing [Figure: Configuration 1 computes M1(A) and M2(B) into a Temp buffer; Configuration 2 loads M3, which consumes the Temp data]
Coarse-grain Time-Multiplexing Assume: 1.) reconfiguration takes 10 thousand clocks 2.) 100 MHz clock 3.) We need to process for 100x the time spent in reconfiguration to get the needed speedup 4.) A and B each produce one byte per clock [Figure: Configuration 1 / Configuration 2 with Temp buffer]
Coarse-grain Time-Multiplexing Assume: 1.) reconfiguration takes 10 thousand clocks 2.) 100 MHz clock 3.) We need to process for 100x the time spent in reconfiguration to get the needed speedup 4.) A and B each produce one byte per clock What constraint does this place on Temp? A 1 MB buffer What if the datapath is changed from 8-bit to 64-bit? An 8 MB buffer; likely need off-chip memory [Figure: Configuration 1 / Configuration 2 with Temp buffer]
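The buffer sizing above follows directly from the assumptions; a minimal sketch of the arithmetic:

```python
reconfig_clocks = 10_000
run_clocks = 100 * reconfig_clocks   # must run 100x the reconfig time:
                                     # 1,000,000 clocks per configuration

def temp_buffer_bytes(bytes_per_clock):
    # Temp must hold everything produced while Configuration 1 runs,
    # since M3 only consumes it after reconfiguration.
    return run_clocks * bytes_per_clock

print(temp_buffer_bytes(1))   # 1_000_000 bytes: ~1 MB for an 8-bit path
print(temp_buffer_bytes(8))   # 8_000_000 bytes: ~8 MB for a 64-bit path
```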
Template Specialization [Figure: four empty LUTs driven by inputs A(1), A(0) producing outputs C(3)..C(0); programming the LUT contents yields a multiply-by-3 circuit (entries 3, 6, 9) or a multiply-by-5 circuit (entries 5, 10, 15)]
Template Specialization Name: Template Specialization Intent: Reduce the size or time needed for a computation. Motivation: Use early-bound data and slowly changing data to reduce circuit size and execution time.
Template Specialization Applicability: When circuit specialization can be adapted quickly Example: Can treat LUTs as small memories that can be written. No interconnect modifications Participants: Template cell: Contains specialization configuration Template filler: Manages what and how a configuration is written to a Template cell Collaborations: Template filler manages Template cell
Template Specialization Consequences: Cannot optimize as much as when a circuit is fully specialized for a given instance Overhead needed to allow the template to implement several specializations Known Uses: Multiply-by-Constant String Matching Implementation: Multiply-by-Constant: Use a LUT as a memory to store the answer Use a controller to update this memory when a different constant should be used.
Template Specialization Related patterns: CONSTRUCTOR EXCEPTION TEMPLATE
Template Specialization [Figure: LUTs programmed as multiply-by-3, with table entries 3, 6, 9] Respecialize: multiply by a constant of 2, supporting inputs 0 - 7 [Figure: the LUTs reprogrammed with entries 2, 4, 6, 8, 10, 12, 14, now using inputs A(2), A(1), A(0)]
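The multiply-by-constant implementation above can be sketched in software: a LUT used as a small memory holds every possible answer, and respecializing the template is just rewriting the table (the function name and 3-bit input width are illustrative assumptions):

```python
# A LUT used as a small memory: rewriting its contents "specializes"
# the same template circuit to a new constant with no rerouting.

def program_mult_lut(constant, input_bits=3):
    # Table holds constant * i for every possible input value.
    return [constant * i for i in range(2 ** input_bits)]

lut = program_mult_lut(3)   # template specialized to multiply-by-3
print(lut[5])               # 15
lut = program_mult_lut(2)   # respecialized to multiply-by-2, inputs 0-7
print(lut[7])               # 14
```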
Catalog of Patterns (Just a start) [2] [2] Identifies 89 patterns Area-Time Tradeoff Basic (Implementation): Coarse-grain Time-Multiplex Parallel (Expression): Dataflow, Data Parallel Parallel (Implementation): SIMD, Communicating FSMs Reducing Area or Time Reuse Hardware (Implementation): Pipelining Specialization (Implementation): Template Communications Layout (Expression/Implementation): Systolic Memory Numbers and Functions
Next Lecture Continue Compute Models
CPRE 583 Reconfigurable Computing Lecture 12: Wed 10/6/2010 (Compute Models) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ee.iastate.edu/cpre583/
Projects: Target Timeline Teams Formed and Idea: Mon 10/11 Project idea in PowerPoint 3-5 slides Motivation (why is this interesting, useful) What will be the end result High-level picture of final product Project team list: Name, Responsibility High-level Plan/Proposal: Fri 10/22 PowerPoint 5-10 slides System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Related research papers (if any)
Projects: Target Timeline Work on projects: 10/22 - 12/8 Weekly update reports More information on updates will be given Presentations: Last Wed/Fri of class Present / Demo what is done at this point 15-20 minutes (depends on number of projects) Final write up and Software/Hardware turned in: Day of final (TBD)
Project Grading Breakdown 50% Final Project Demo 30% Final Project Report 30% of your project report grade will come from your 5-6 project updates (due Fridays by midnight) 20% Final Project Presentation
Common Questions
Overview Compute Models
What you should learn Introduction to Compute Models
Outline Design patterns (previous lecture) Why are they useful? Examples Compute models (Abstraction) System Architectures (Implementation)
References Reconfigurable Computing (2008) [1] Chapter 5: Compute Models and System Architectures Scott Hauck, Andre DeHon Design Patterns for Reconfigurable Computing [2] Andre DeHon (FCCM 2004) Type Architectures, Shared Memory, and the Corollary of Modest Potential [3] Lawrence Snyder: Annual Review of Computer Science (1986)
Building Applications Problem -> Compute Model + Architecture -> Application Questions to answer How to think about composing the application? How will the compute model lead to a naturally efficient architecture? How does the compute model support composition? How to conceptualize parallelism? How to tradeoff area and time? How to reason about correctness? How to adapt to technology trends (e.g. larger/faster chips)? How does compute model provide determinacy? How to avoid deadlocks? What can be computed? How to optimize a design, or validate application properties?
Compute Models Compute Models [1]: High-level models of the flow of computation. Useful for: Capturing parallelism Reasoning about correctness Decomposition Guide designs by providing constraints on what is allowed during a computation Communication links How synchronization is performed How data is transferred
Two High-level Families Data Flow: Single-rate Synchronous Data Flow Synchronous Data Flow Dynamic Streaming Dataflow Dynamic Streaming Dataflow with Peeks Streaming Data Flow with Allocation Sequential Control: Finite Automata (i.e. Finite State Machine) Sequential Controller with Allocation Data Centric Data Parallel
Data Flow Graph of operators that data (tokens) flows through Composition of functions X X +
Data Flow Graph of operators that data (tokens) flows through Composition of functions Captures: Parallelism Dependences Communication X X +
Single-rate Synchronous Data Flow One token rate for the entire graph For example, every operator consumes one token on each input link before producing an output token Same power as a Finite State Machine [Figure: update/copy graph with a rate of 1 on every link]
Synchronous Data Flow Each link can have a different constant token input and output rate Same power as the single-rate version, but for some applications easier to describe Automated ways to detect/determine: Deadlock Buffer sizes [Figure: update/copy graph where some links produce or consume 10 tokens per firing]
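The automated analysis mentioned above starts from the SDF balance equations: for each link, the producer's firings times its rate must equal the consumer's firings times its rate. A minimal sketch for a single link (assuming the standard repetition-vector formulation, not any particular tool):

```python
from math import gcd

def repetitions(prod_rate, cons_rate):
    # Balance equation for one SDF link: q_p * prod_rate == q_c * cons_rate.
    # The smallest integer solution fixes how often each actor fires per
    # iteration; it is the starting point for deadlock and buffer analysis.
    l = prod_rate * cons_rate // gcd(prod_rate, cons_rate)
    return l // prod_rate, l // cons_rate

print(repetitions(10, 1))   # (1, 10): producer fires once, consumer 10 times
print(repetitions(1, 1))    # (1, 1): the single-rate case
print(repetitions(4, 6))    # (3, 2): 3*4 tokens produced == 2*6 consumed
```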
Dynamic Streaming Data Flow Token rates dependent on data Just need to add two structures: Switch, Select [Figure: a Switch routes its input to out0 or out1, and a Select picks in0 or in1, both steered by a control token S]
Dynamic Streaming Data Flow Token rates dependent on data Just need to add two structures: Switch, Select More powerful Difficult to detect deadlocks Still deterministic [Figure: a Switch steering tokens x, y through F0 or F1 and a Select recombining them]
Dynamic Streaming Data Flow with Peeks Allow an operator to fire before all inputs have arrived An example where this is useful is the merge operation Now execution can be nondeterministic: the answer depends on input arrival times [Figure: animation of tokens A and B arriving at a Merge operator in different orders]
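The nondeterminism of merge can be illustrated with a toy simulation (assumed model: each token carries an arrival time, and the merge fires on whichever input arrives first):

```python
def merge(timed_tokens):
    # timed_tokens: (arrival_time, value) pairs on the merge's inputs.
    # The merge "peeks" and fires on whichever token arrives first,
    # so its output order is simply arrival order.
    return [value for _, value in sorted(timed_tokens)]

run1 = merge([(1, "A"), (2, "B")])   # A arrives first
run2 = merge([(2, "A"), (1, "B")])   # same tokens, B arrives first
print(run1)   # ['A', 'B']
print(run2)   # ['B', 'A']  -- same inputs, different answer
```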
Streaming Data Flow with Allocation Removes the need for static links and operators; that is, the Data Flow graph can change over time More power: Turing Complete More difficult to analyze Could be useful for some applications Telecom applications: for example, if a channel carries voice versus data, the resources needed may vary greatly Can take advantage of platforms that allow runtime reconfiguration
Sequential Control Sequence of subroutines Programming languages (C, Java) Hardware control logic (Finite State Machines) Transform global data state
Finite Automata (i.e. Finite State Machine) Can verify state reachability in polynomial time [Figure: three-state FSM S1, S2, S3]
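The polynomial-time claim is easy to see: reachability is just a graph search that visits each state and transition once. A small sketch (the three-state machine is the one pictured; the transition structure is an illustrative assumption):

```python
from collections import deque

def reachable(transitions, start):
    # Breadth-first search touches each state and edge at most once,
    # so reachability is linear in the size of the FSM.
    seen, work = {start}, deque([start])
    while work:
        state = work.popleft()
        for nxt in transitions.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return seen

fsm = {"S1": ["S2"], "S2": ["S3", "S1"], "S3": []}
print(reachable(fsm, "S1"))   # all three states are reachable from S1
```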
Sequential Controller with Allocation Adds ability to allocate memory. Equivalent to adding new states Model becomes Turing Complete S1 S2 S3
Sequential Controller with Allocation Adds ability to allocate memory. Equivalent to adding new states Model becomes Turing Complete S1 S2 S4 S3 SN
Data Parallel Multiple instances of an operation type acting on separate pieces of data. For example: Single Instruction Multiple Data (SIMD) Identical match test on all items in a database Inverting the color of all pixels in an image
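The pixel-inversion example is data parallel in the simplest sense: the same operation applied independently to every item, which is exactly what maps onto SIMD lanes or replicated FPGA processing elements. A one-liner sketch (8-bit grayscale assumed):

```python
def invert(pixels):
    # Identical, independent operation on every data item: each of
    # these subtractions could execute in its own SIMD lane or PE.
    return [255 - p for p in pixels]

print(invert([0, 128, 255]))   # [255, 127, 0]
```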
Data Centric Similar to Data Flow, but the state contained in the objects of the graph is the focus, not the tokens flowing through the graph Network flow example [Figure: Source1, Source2, Source3 feeding a Switch to Dest1 and Dest2; concerns include flow rate and buffer overflow]
Multi-threaded Multi-threaded: a compute model made up of multiple sequential controllers that have communication channels between them Very general, but often too much power and flexibility. No guidance for: Ensuring determinism Dividing an application into threads Avoiding deadlock Synchronizing threads The models discussed can be defined in terms of a Multi-threaded compute model
Multi-threaded (Illustration)
Streaming Data Flow as Multi-threaded Thread: an operator that performs transforms on data as it flows through the graph Thread synchronization: tokens sent between operators
Data Parallel as Multi-threaded Thread: a data item Thread synchronization: data updated with each sequential instruction
Caution with Multithreaded Model Use when a stricter compute model does not give enough expressiveness. Define restrictions to limit the amount of expressive power that can be used Define synchronization policy How to reason about deadlocking
Other Models “A Framework for Comparing Models of computation” [1998] E. Lee, A. Sangiovanni-Vincentelli Transactions on Computer-Aided Design of Integrated Circuits and Systems “Concurrent Models of Computation for Embedded Software”[2005] E. Lee, S. Neuendorffer IEEE Proceedings – Computers and Digital Techniques
Next Lecture System Architectures
User Defined Instruction (MP3) [Figure: FPGA containing a PowerPC plus user-defined instruction hardware; a PC running Display.c connects over Ethernet (UDP/IP); output drives a VGA monitor]
MP3 Notes MUCH less VHDL coding than MP2, but you will be writing most of the VHDL from scratch The focus will be more on learning to read a specification (the PowerPC coprocessor interface protocol) and designing hardware that follows that protocol You will be dealing with some pointer-intensive C code. It's a small amount of C code, but somewhat challenging to get the pointer math right.
CPRE 583 Reconfigurable Computing Lecture 13: Fri 10/8/2010 (System Architectures) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ee.iastate.edu/cpre583/
Common Questions
Overview Common System Architectures Plus/Delta mid-semester feedback
What you should learn Introduction to common System Architectures
Outline Design patterns (previous lecture) Why are they useful? Examples Compute models (Abstraction) System Architectures (Implementation)
References Reconfigurable Computing (2008) [1] Chapter 5: Compute Models and System Architectures Scott Hauck, Andre DeHon
System Architectures Compute Models: Help express the parallelism of an application System Architecture: How to organize application implementation
Efficient Application Implementation Compute model and system architecture should work together Both are a function of The nature of the application Required resources Required performance The nature of the target platform Resources available
Efficient Application Implementation (Image Processing) Platform 1 (Vector Processor) Platform 2 (FPGA)
Efficient Application Implementation (Image Processing) Compute Model System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)
Efficient Application Implementation (Image Processing) Data Flow Compute Model Streaming Data Flow System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)
Efficient Application Implementation (Image Processing) Data Parallel Compute Model Vector System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)
Efficient Application Implementation (Image Processing) Data Flow Compute Model Streaming Data Flow System Architecture [Figure: dataflow graph of two multipliers feeding an adder] Platform 1 (Vector Processor) Platform 2 (FPGA)
Implementing Streaming Dataflow Data presence variable length connections between operators data rates vary between operator implementations data rates varying between operators Datapath sharing not enough spatial resources to host entire graph balanced use of resources (e.g. operators) cyclic dependencies impacting efficiency Interconnect sharing Interconnects are becoming difficult to route Links between operators infrequently used High variability in operator data rates Streaming coprocessor Extreme resource constraints
Data Presence X X +
Data Presence X X data_ready data_ready + data_ready
Data Presence X X FIFO FIFO data_ready data_ready + FIFO data_ready
Data Presence X X stall stall FIFO FIFO data_ready data_ready + FIFO
Data Presence Flow control: a term typically used in networking [Figure: two multipliers and an adder connected through FIFOs, with data_ready and stall signals on each link]
Data Presence Flow control: a term typically used in networking Increases flexibility of how an application can be implemented [Figure: the full graph with FIFOs, data_ready, and stall signals on every link]
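The FIFO-with-flow-control idea above can be sketched as a small software model (the class and signal names mirror the slide's data_ready/stall labels, but the implementation is an illustrative assumption):

```python
class Fifo:
    # A bounded queue between two operators. `stall` is the back-pressure
    # signal the upstream operator must respect; `data_ready` tells the
    # downstream operator it may fire.
    def __init__(self, depth):
        self.depth = depth
        self.items = []

    @property
    def stall(self):          # full: producer must wait
        return len(self.items) >= self.depth

    @property
    def data_ready(self):     # non-empty: consumer may fire
        return len(self.items) > 0

    def push(self, x):
        assert not self.stall, "producer ignored back-pressure"
        self.items.append(x)

    def pop(self):
        assert self.data_ready, "consumer fired with no token"
        return self.items.pop(0)

f = Fifo(depth=2)
f.push(1)
f.push(2)
print(f.stall)        # True: upstream operator must stall
print(f.pop())        # 1: tokens come out in order
print(f.stall)        # False: room again, upstream may resume
```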
Datapath Sharing X X +
Datapath Sharing Platform may only have one multiplier X X +
Datapath Sharing Platform may only have one multiplier X +
Datapath Sharing Platform may only have one multiplier REG X REG +
Datapath Sharing Platform may only have one multiplier REG X FSM REG +
Datapath Sharing Platform may only have one multiplier REG X FSM REG + Important to keep track of where data is coming from!
Interconnect Sharing
Need more efficient use of interconnect
An FSM time-multiplexes shared links between operators
Streaming coprocessor See SCORE chapter 9 of text for an example.
Sequential Control
Typically thought of in the context of sequential programming on a processor (e.g. C, Java)
Key to organizing, synchronizing, and controlling highly parallel operations
Time-multiplexing resources: when a task is too large for the computing fabric
Increasing datapath utilization
Sequential Control
Example: compute A*x^2 + B*x + C with inputs A, B, C, x flowing through shared multiply (X) and add (+) operators
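The A*x^2 + B*x + C example can be sequenced on one multiplier and one adder. A sketch (Python, not HDL; the Horner-style schedule is one possible ordering, not necessarily the slide's):

```python
def quadratic_fsmd(A, B, C, x):
    """Evaluate A*x**2 + B*x + C on one multiplier and one adder,
    sequenced Horner-style: ((A*x + B) * x) + C.
    Each schedule entry is one control step using one operator."""
    schedule = [
        ("MUL", lambda r: A * x),   # r = A*x
        ("ADD", lambda r: r + B),   # r = A*x + B
        ("MUL", lambda r: r * x),   # r = (A*x + B)*x
        ("ADD", lambda r: r + C),   # r = A*x^2 + B*x + C
    ]
    r = None
    for op, step in schedule:
        r = step(r)
    return r

print(quadratic_fsmd(2, 3, 4, 5))  # 2*25 + 3*5 + 4 = 69
```

Four control steps, two operators: sequential control trades time for area.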
Finite State Machine with Datapath (FSMD)
An FSM controls the shared datapath that computes A*x^2 + B*x + C
Sequential Control: Types Finite State Machine with Datapath (FSMD) Very Long Instruction Word (VLIW) data path control Processor Instruction augmentation Phased reconfiguration manager Worker farm
Very Long Instruction Word (VLIW) Datapath Control See 5.2 of text for this architecture
Processor
Instruction Augmentation
Phased Configuration Manager Will see more detail with SCORE architecture from chapter 9 of text.
Worker Farm Chapter 5.2 of text
Bulk Synchronous Parallelism See chapter 5.2 for more detail
Data Parallel Single Program Multiple Data Single Instruction Multiple Data (SIMD) Vector Vector Coprocessor
Data Parallel
Cellular Automata
Multi-threaded
Next Lecture
Questions/Comments/Concerns
Write down:
  Main point of lecture
  One thing that's still not quite clear
  OR, if everything is clear, give an example of how to apply something from lecture
Lecture Notes
Add CSP/Multithread as root of a simple tree
15+5 (late start) minutes of time left
Think of one to two in-class exercises (10 min)
  Dataflow graph optimization algorithm?
  Deadlock detection on a small model?
Give some examples of where a given compute model would map to a given application
  Systolic array (implement), or dataflow compute model
  String matching (FSM) (MISD)
New image for MP3, too dark of a color
CPRE 583 Reconfigurable Computing Lecture 14: Fri 10/13/2010 (Streaming Applications) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ee.iastate.edu/cpre583/
Projects Ideas: Relevant conferences FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS
Initial Project Proposal Slides (5-10 slides) Project team list: Name, Responsibility (who is project leader) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up and cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea
Common Questions
Overview Streaming Applications (Chapters 8 & 9): Simulink, SCORE
What you should learn Two approaches for implementing streaming applications
Data Flow: Quick Review Graph of operators that data (tokens) flows through Composition of functions
Data Flow Graph of operators that data (tokens) flows through Composition of functions Captures: Parallelism Dependences Communication X X +
Streaming Application Examples
Some image processing algorithms: edge detection, image recognition, image compression (JPEG)
Network data processing: string matching (your MP2 assignment)
Sorting??
Sorting
The initial list of items passes through Split stages, the sub-lists are sorted in parallel Sort stages, and a tree of Merge stages combines the sorted sub-lists
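The split/sort/merge pipeline above can be sketched in a few lines of Python (the round-robin split, four lanes, and sample data are illustrative):

```python
import heapq

def split(items, n):
    """Deal items round-robin into n sub-streams."""
    lanes = [[] for _ in range(n)]
    for i, x in enumerate(items):
        lanes[i % n].append(x)
    return lanes

def merge(a, b):
    """Streaming two-way merge: emit the smaller head each step."""
    return list(heapq.merge(a, b))

data = [7, 3, 9, 1, 8, 2, 6, 5]
lanes = [sorted(lane) for lane in split(data, 4)]  # 4 parallel sorters
merged = merge(merge(lanes[0], lanes[1]), merge(lanes[2], lanes[3]))
print(merged)
```

Each Sort stage works on a short list, and the Merge tree only ever compares stream heads — both map naturally onto streaming hardware.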
Example Tools for Streaming Application Design Simulink from MatLab: graphical based SCORE (Stream Computations Organized for Reconfigurable Execution): a programming model
Simulink (MatLab) What is it? A MatLab module that allows building and simulating systems through a GUI
Simulink: Example Model
Simulink: Sub-Module
Simulink: Example Plot
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel
Basic Sobel Algorithm for Edge Detection
Sobel X gradient (responds to vertical edges):
  -1  0  1
  -2  0  2
  -1  0  1
Sobel Y gradient (responds to horizontal edges):
  -1 -2 -1
   0  0  0
   1  2  1
[Animation: the 3x3 kernels slide across a sample image of 50-valued pixels, producing gradient values of +/-50, +/-100, +/-150, and +/-200 at the edges]
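The kernel application the slides step through can be sketched directly (Python; the 4x4 test image and the evaluation point are illustrative):

```python
# Standard 3x3 Sobel kernels (the x gradient responds to vertical edges).
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def convolve3x3(img, k, r, c):
    """Apply kernel k centered at pixel (r, c); the caller avoids borders."""
    return sum(k[i][j] * img[r - 1 + i][c - 1 + j]
               for i in range(3) for j in range(3))

# A vertical edge: left half 0, right half 50.
img = [[0, 0, 50, 50] for _ in range(4)]
gx = convolve3x3(img, GX, 1, 1)  # strong response across the edge
gy = convolve3x3(img, GY, 1, 1)  # no response along the edge
print(gx, gy)
```

In hardware each output pixel needs nine multiply-accumulates over a 3x3 window, which is why the student design streams the image through a shift-register line buffer.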
Top Level
Shifter
Multiplier
Input Image
Output Image
SCORE (Stream Computations Organized for Reconfigurable Execution)
Overview of the SCORE programming approach
Developed by University of California Berkeley and California Institute of Technology
FPL 2000 overview presentation: http://brass.cs.berkeley.edu/documents/score_tutorial.html
Next Lecture Data Parallel
CPRE 583 Reconfigurable Computing Lecture 15: Fri 10/15/2010 (Reconfiguration Management) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ee.iastate.edu/cpre583/
Announcements/Reminders Midterm: Take home portion (40%) given Friday 10/22, due Tue 10/26 (midnight) In class portion (60%) Wed 10/27 Distance students will have in class portion given via a timed WebCT (2 hour) session (take on Wed, Thur or Friday). Start thinking of class projects and forming teams Submit teams and project ideas: Mon 10/11 midnight Project proposal presentations: Fri 10/22 MP3: PowerPC Coprocessor offload (today/tomorrow) Problem 2 of HW 2 (released after MP3 gets released)
Overview Chapter 4: Reconfiguration Management
What you should learn Some basic configuration architectures Key issues when managing the reconfiguration of a system
Reconfiguration Management
Goal: minimize the overhead associated with run-time reconfiguration
Why it is important to address:
  Can take 100's of milliseconds to reconfigure a device
  For high-performance applications this can be a large overhead (i.e. decreases performance)
High-Level Configuration Setups
Externally triggered reconfiguration
[Diagram: the CPU issues a Configuration Request; an FSM-based Config Control (CC) streams Config Data from ROM (bitfile) into the FPGA]
High-Level Configuration Setups
Self-triggered reconfiguration
[Diagram: the FPGA's own FSM-based Config Control (CC) fetches Config Data from ROM (bitfile)]
Configuration Architectures Single-context Multi-context Partially Reconfigurable Relocation & Defragmentation Pipeline Reconfiguration Block Reconfigurable
Single-context FPGA
[Diagram: configuration is shifted in serially over Config clk / Config I/F / Config Data, with a global Config enable gating the configuration DFF chain feeding the logic]
Multi-context FPGA
[Diagram: each cell stores three configuration contexts; a Context switch signal selects which context (1, 2, or 3) drives the logic, while Config Data can load an inactive context in the background]
Partially Reconfigurable
Reduces the amount of configuration data sent to the device, thus decreasing reconfiguration overhead
Needs addressable configuration memory, as opposed to single-context daisy-chain shifting
Example: encryption — change the key, and the logic dependent on the key
PR devices: AT40K, Xilinx Virtex series (and Spartan, but not at run time)
Need to make sure partial configs do not overlap in space/time (typically a config must be placed in a specific location; devices are not as homogeneous as you would think in terms of resources and timing delays)
Partially Reconfigurable
Full reconfig: 10-100's of ms
Partial reconfig: 100's of us - 1's of ms
Typically a partial configuration module maps to a specific physical location
Relocation and Defragmentation
Make configuration architectures support relocatable modules
The text gives a good example of defragmentation (defrag or swap out: 90% decrease in reconfig time compared to full single context)
Placement policies: best fit, first fit, ...
Limiting factors
  Routing/logic is heterogeneous: timing issues, need modified routes
  Special resources needed (e.g. hard multipliers, BRAMs); an easier issue if there are blocks of homogeneity
  Connection to external I/O (fixed IP cores, board restrictions); virtualized I/O (fixed pins with multiple internal I/Fs?)
  2D architectures are more difficult to deal with
Summary of features a PR architecture should have
  Homogeneous logic and routing layout
  Bus-based communication (e.g. network on chip)
  1D organization for relocation
Relocation and Defragmentation
[Animation: modules A, B, C are relocated to pack together, allowing more efficient use of the configuration space]
Pipeline Reconfigurable
Example: PipeRench
Simplifies reconfiguration
Limits what can be implemented
[Table: 4 virtual pipeline stages mapped onto physical PEs over cycles 1-6; one physical stage is reconfigured per cycle as the virtual pipeline wraps around the fabric]
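The wrap-around mapping can be sketched as follows, under the simplifying assumptions that virtual stage v is (re)configured into physical stripe v mod P at cycle v, one stripe per cycle, and the virtual pipeline loops for the next data set (a sketch of the PipeRench idea, not its exact schedule):

```python
def stripe_contents(t, physical, virtual):
    """Which virtual stage each physical stripe holds at cycle t,
    assuming stripe v % physical is rewritten at cycles v, v+virtual, ...
    (one stripe reconfigured per cycle)."""
    contents = [None] * physical
    for v in range(t + 1):
        contents[v % physical] = v % virtual
    return contents

# 4 virtual stages on 3 physical stripes:
for t in range(6):
    print(t, stripe_contents(t, physical=3, virtual=4))
```

With fewer physical stripes than virtual stages, some stripe is always being rewritten — the fabric trades a little throughput for the ability to host an arbitrarily deep virtual pipeline.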
Block Reconfigurable
Swappable Logic Units: an abstraction layer over a general PR architecture (e.g. SCORE)
Managing the Reconfiguration Process Choosing a configuration When to load Where to load Reduce how often one needs to reconfigure, hiding latency
Configuration Grouping
What to pack: pack multiple configs that are related in time into one
Simulated annealing, or clustering based on the application's control flow
Configuration Caching
When to load: LRU or credit-based policies, dealing with variable-sized configs
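An LRU configuration cache that handles variable-sized configs can be sketched as follows (Python; the config names, sizes, and capacity units are illustrative):

```python
from collections import OrderedDict

class ConfigCache:
    """LRU cache for configurations of varying size (in config-memory
    units). Evicts least-recently-used configs until the new one fits."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.configs = OrderedDict()  # name -> size, least recent first

    def load(self, name, size):
        """Return True on a hit (no reconfiguration needed)."""
        if name in self.configs:
            self.configs.move_to_end(name)  # refresh LRU position
            return True
        while self.used + size > self.capacity:
            _, evicted_size = self.configs.popitem(last=False)
            self.used -= evicted_size
        self.configs[name] = size
        self.used += size
        return False

cache = ConfigCache(capacity=10)
hits = [cache.load(n, s) for n, s in
        [("fft", 6), ("fir", 4), ("fft", 6), ("aes", 5), ("fir", 4)]]
print(hits)
```

Note how the variable sizes matter: loading "aes" evicts two smaller configs, so the very next request misses — exactly the kind of effect a credit-based policy tries to account for.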
Configuration Scheduling
Prefetching
  Control flow graph
  Static: compiler-inserted config instructions
  Dynamic: probabilistic approaches, MM (branch prediction)
Constraints: resource, real-time
Mitigation: system status and prediction
  What are the current requests
  Predict which config combination will give the best speedup
Software-based Relocation and Defragmentation
Placing the R/D decision on the host CPU rather than the on-chip config controller
Context Switching Save state, then restart where it left off.
Next Lecture Data Parallel
CPRE 583 Reconfigurable Computing Lecture 16: Fri 10/20/2010 (Data Parallel Architectures) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ee.iastate.edu/cpre583/
Announcements/Reminders Midterm: Take home portion (40%) given Friday 10/29, due Tue 11/2 (midnight) In class portion (60%) Wed 11/3 Distance students will have in class portion given via a timed WebCT (2 hour) session (take on Wed, Thur or Friday). Start thinking of class projects and forming teams Submit teams and project ideas: Mon 10/11 midnight Project proposal presentations: Fri 10/22 MP3: PowerPC Coprocessor offload (today): Problem 2 of HW 2 (released after MP3 gets released)
Projects Ideas: Relevant conferences FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS
Overview
Data Parallel Architectures: chapter 5.2.4 and chapter 10
MP3 Demo/Overview
What you should learn
Data parallel architecture basics
The flexibility reconfigurable hardware adds
Data Parallel Architectures
Next Lecture Project initial presentations.
CPRE 583 Reconfigurable Computing Lecture 17: Fri 10/22/2010 (Initial Project Presentations) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ee.iastate.edu/cpre583/
Overview Present Project Ideas
Projects
Next Lecture Fixed Point Math and Floating Point Math
CPRE 583 Reconfigurable Computing Lecture 18: Fri 10/27/2010 (Floating Point) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ee.iastate.edu/cpre583/
Announcements/Reminders Midterm: Take home portion (40%) given Friday 10/29 (released today by 5pm), due Tue 11/2 (midnight) In class portion (60%) Wed 11/3 Distance students will have in class portion given via a timed WebCT (2 hour) session (take on Wed, Thur or Friday). Problem 2 of HW 2 (released soon)
Overview
Floating Point on FPGAs (Chapters 21.4 and 31)
Why is it viewed as difficult?
Options for mitigating the issues
Floating Point Format (IEEE-754)
Single precision: S (1 bit) | exp (8 bits) | Mantissa (23 bits)
Mantissa = b_-1 b_-2 b_-3 ... b_-23 ; Mantissa value = sum_{i=1..23} b_-i * 2^-i
Floating point value = (-1)^S * 2^(exp-127) * (1.Mantissa)
Example: 0 x"80" 110 x"00000"
  = (-1)^0 * 2^(128-127) * 1.(1/2 + 1/4) = (-1)^0 * 2^1 * 1.75 = 3.5
Double precision: S (1 bit) | exp (11 bits) | Mantissa (52 bits)
Floating point value = (-1)^S * 2^(exp-1023) * (1.Mantissa)
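The decode formula above is easy to check in code. A minimal sketch (normal numbers only; no zero/denormal/Inf/NaN handling):

```python
def decode_single(bits):
    """Decode a 32-bit IEEE-754 single-precision pattern, assuming a
    normal number: value = (-1)^S * 2^(exp-127) * (1 + mantissa/2^23)."""
    s = (bits >> 31) & 1
    exp = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    frac = 1 + mantissa / 2**23          # implicit leading 1
    return (-1) ** s * 2 ** (exp - 127) * frac

# The slide's example: S=0, exp=x"80", mantissa=110 followed by zeros.
bits = (0x80 << 23) | (0b110 << 20)
print(decode_single(bits))  # 2^1 * 1.75 = 3.5
```
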
Fixed Point
Format: whole part b_{W-1} ... b_1 b_0 , fractional part b_-1 b_-2 ... b_-F
Example formats (W.F): 5.5, 10.12, 3.7
Example fixed point, 5.5 format: 01010.01100 = 10 + 1/4 + 1/8 = 10.375
Compare floating point and fixed point:
  Floating point: 0 x"80" "110" x"00000" = 3.5
  10-bit (format 3.7) fixed point for 3.5 = 011.1000000
Fixed Point (Addition) Whole Fractional Operand 1 Whole Fractional Operand 2 + Whole Fractional sum
Fixed Point (Addition), 11-bit 4.7 format
  0011.1110000  Operand 1 = 3.875
+ 0001.1010000  Operand 2 = 1.625
= 0101.1000000  Sum = 5.5
You can use a standard ripple-carry adder!
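The reason a plain adder works: a W.F fixed-point value is just an integer scaled by 2^F. A quick sketch with the slide's 4.7 numbers:

```python
F = 7  # 4.7 format: 4 whole bits, 7 fraction bits

def to_fixed(x):
    return round(x * 2**F)   # store the value as an integer scaled by 2^F

def from_fixed(n):
    return n / 2**F

a, b = to_fixed(3.875), to_fixed(1.625)   # 496 and 208
s = a + b                                 # a plain integer (ripple-carry) add
print(from_fixed(s))                      # 5.5
```

No alignment, no normalization — which is why the fixed-point adder costs roughly one LUT and DFF per bit.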
Floating Point (Addition)
  0 x"80" 111 x"80000"  Operand 1 = 3.875
+ 0 x"7F" 101 x"00000"  Operand 2 = 1.625
Step 1: Common exponent (i.e. align binary points). Make x"7F" -> x"80" (or vice versa?) — making x"7F" -> x"80" loses least significant bits of Operand 2:
  Add the difference x"80" - x"7F" = 1 to x"7F"
  Shift the mantissa of Operand 2 right by the difference; remember the "implicit" 1 of the original mantissa
  Operand 2 becomes 0 x"80" 110 x"80000"
Step 2: Add mantissas. Overflow! You can't just overflow the mantissa into the exponent field. You are actually overflowing the implicit "1" of Operand 1, so you sort of have an implicit "2" (i.e. "10").
Step 3: Deal with the mantissa overflow by normalizing:
  Shift the mantissa right by 1 (shift a "0" in because of the implicit "2")
  Increment the exponent by 1
Result: 0 x"81" 011 x"00000" = 5.5
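The align/add/normalize steps can be sketched on the bit fields directly (a minimal sketch: positive operands only, truncating alignment, no rounding or special values):

```python
def fp_add(e1, m1, e2, m2):
    """Add two positive single-precision values given as (biased exponent,
    23-bit mantissa field): align, add with the implicit 1s restored,
    then normalize if the sum overflows past 2.0."""
    if e1 < e2:                            # ensure operand 1 is the larger
        e1, m1, e2, m2 = e2, m2, e1, m1
    f1 = (1 << 23) | m1                    # restore implicit leading 1
    f2 = ((1 << 23) | m2) >> (e1 - e2)     # align: shift smaller right
    f = f1 + f2
    if f >> 24:                            # mantissa overflowed past 2.0
        f >>= 1                            # normalize: shift right,
        e1 += 1                            #   bump the exponent
    return e1, f & 0x7FFFFF

# 3.875 (exp 0x80, mant 0x780000) + 1.625 (exp 0x7F, mant 0x500000)
e, m = fp_add(0x80, 0x780000, 0x7F, 0x500000)
print(hex(e), hex(m))  # exp 0x81, mantissa 0x300000 -> 5.5
```

Compare this with the fixed-point add: the shifter, the overflow check, and the exponent update are all extra hardware.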
Floating Point (Addition): Other concerns — special values (single precision: S | exp (8 bits) | Mantissa (23 bits))
  Special Value | Sign | Exponent | Mantissa
  Zero          | 0/1  | 0        | 0
  +Infinity     | 0    | MAX      | 0
  -Infinity     | 1    | MAX      | 0
  NaN           | 0/1  | MAX      | Non-zero
  Denormal      | 0/1  | 0        | Non-zero
Floating Point (Addition): High-Level Hardware
[Block diagram: exponent difference and greater-than compare -> mux/SWAP operands -> right-shift the smaller mantissa by the difference -> add/sub -> priority encoder finds the leading 1 -> left shift to normalize (denormal?) and adjust exponent via sub/const -> round -> E, M]
Floating Point Both Xilinx and Altera supply floating point soft-cores (which I believe are IEEE-754 compliant), so don't get too afraid if you need floating point in your class projects. There should also be floating point open cores that are freely available.
Fixed Point vs. Floating Point
Floating point advantages:
  Application designer does not have to think "much" about the math
  Floating point format supports a wide range of numbers (+/- 3x10^38 to +/- 1x10^-38, single precision)
  If IEEE-754 compliant, then it is easier to accelerate existing floating-point-based applications
Floating point disadvantages:
  Ease of use at great hardware expense
    32-bit fixed point add: ~32 DFFs + 32 LUTs
    32-bit single precision floating point add: ~250 DFFs + 250 LUTs — about 10x more resources, thus 1/10 the possible best-case parallelism
  Floating point typically needs a massive pipeline to achieve high clock rates (i.e. high throughput)
  No hard resources such as carry-chains to take advantage of
Fixed Point vs. Floating Point Range example: floating point vs. fixed point advantages — some exceptions with respect to precision
Mitigating Floating Point Disadvantages
Only support a subset of the IEEE-754 standard; could use software to off-load special cases
Modify the floating point format to support a smaller data type (e.g. 18-bit instead of 32-bit)
  Link to Cornell class: http://instruct1.cit.cornell.edu/courses/ece576/FloatingPoint/index.html
Add hardware support in the FPGA for floating point
  Hardcore multipliers: added by companies in the early 2000's
  Altera: hard shared paths for floating point (Stratix-V, 2011)
How to get 1-TFLOP throughput on FPGAs article:
http://www.eetimes.com/design/programmable-logic/4207687/How-to-achieve-1-trillion-floating-point-operations-per-second-in-an-FPGA
Mitigating Fixed Point Disadvantages (21.4) Block Floating Point (mitigating range issue)
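Block floating point keeps integer (fixed-point-style) arithmetic but shares one exponent across a block of values, recovering much of floating point's range. A minimal sketch (the mant_bits width, rounding, and sample values are illustrative):

```python
import math

def block_float(values, mant_bits=8):
    """Encode a block of values with one shared exponent: each value
    becomes an integer mantissa, and the whole block is scaled by
    2**exp. The exponent is chosen so the largest value still fits."""
    largest = max(abs(v) for v in values)
    exp = math.ceil(math.log2(largest)) - (mant_bits - 1)
    mants = [round(v / 2**exp) for v in values]
    return exp, mants

def decode(exp, mants):
    return [m * 2**exp for m in mants]

exp, mants = block_float([1000.0, 250.0, -62.5])
print(exp, mants, decode(exp, mants))
```

The cost shows up as quantization error on the small values in the block — the range issue is mitigated, not eliminated, which is the "some exceptions with respect to precision" caveat.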
CPU/FPGA/GPU reported FLOPs
Next Lecture Mid-term Then on Friday: Evolvable Hardware
Lecture Notes
Altera app notes on computing FLOPs for Stratix-III
Altera older app notes on floating point add/mult
CPRE 583 Reconfigurable Computing Lecture 19: Fri 11/5/2010 (Evolvable Hardware) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ee.iastate.edu/cpre583/
Announcements/Reminders
MP3: Extended due date until Monday midnight
  Those that finish by Friday (11/5) midnight: bonus of +1% per day before the new deadline
  If after Friday midnight: no bonus, but no penalty
  10% deduction after Monday midnight, and an additional -10% each day late
Problem 2 of HW 2 (will now be called HW3): released by Sunday midnight, due Monday 11/22 midnight
Turn in weekly project report (tonight, midnight)
What you should learn
Understand Evolvable Hardware basics
Benefits and drawbacks
Key types/categories
Evolvable Hardware
One of the first papers to compare reconfigurable HW with biological organisms (1993): "Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine", Higuchi
Biological organism => DNA (GATACAAAGATACACCAGATA); a mutation changes a base: GATACA -> GATAGA
Reconfigurable hardware => configuration bitstream (000100111011011011011010101); a mutation flips bits: 0001 1100 -> 0001 0000, changing the configured logic (e.g. the LUT contents feeding a DFF)
Classifying Adaption/Evolution: Phylogeny, Ontogeny, Epigenesis (POE)
Phylogeny: evolution through recombination and mutations; biological reproduction : Genetic Algorithms
Ontogeny: self-replication; multicellular organism's cell division : Cellular Automata
Epigenesis: adaptation triggered by the external environment; immune system development : Artificial Neural Networks
Artificial Evolution
A 30-40 year old concept, but applying it to reconfigurable hardware is newish (1990's)
Evolutionary Algorithms (EAs): Genetic Algorithms, Genetic Programming, Evolution Strategies, Evolutionary Programming
Genetic Algorithms. Genome: a finite string of symbols encoding an individual. Phenotype: the decoding of the genome to realize the individual. Constant-size population. Generic steps: initialize population, decode, evaluate (must define a fitness function), selection, mutation, crossover.
Genetic Algorithms (worked example). Loop: initialize population -> decode -> evaluate -> selection -> crossover -> mutation -> next generation. Example initial population: 0000 1000, 1010 1111, 1100 0000, 1111 0000, 0000 0000, 1111 1111. Decoding and evaluating a generation (against target 0110 1000) assigns fitness scores, e.g. 1010 0011 (.40), 1010 0100 (.70), 1100 0000 (.20), 1111 0000 (.10), 0000 0000 (.10), 1111 0100 (.60). Selection keeps the fittest, here 1010 0011 (.40), 1010 0100 (.70), and 1111 0100 (.60); crossover recombines them and mutation flips bits to form the next generation.
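The loop above can be sketched in a few lines of Python. The 8-bit genome, the bit-match fitness against the target 0110 1000, truncation selection, and the mutation rate are illustrative assumptions, not details from the slides:

```python
import random

TARGET = 0b01101000
GENOME_BITS = 8

def fitness(genome):
    """Fraction of bits that match the target (1.0 = perfect)."""
    matches = GENOME_BITS - bin(genome ^ TARGET).count("1")
    return matches / GENOME_BITS

def select(population, k):
    """Truncation selection: keep the k fittest individuals."""
    return sorted(population, key=fitness, reverse=True)[:k]

def crossover(a, b, point=4):
    """Single-point crossover: high bits from a, low bits from b."""
    high_mask = (~0 << point) & ((1 << GENOME_BITS) - 1)
    return (a & high_mask) | (b & ~high_mask)

def mutate(genome, rate=0.1):
    """Flip each bit independently with the given probability."""
    for i in range(GENOME_BITS):
        if random.random() < rate:
            genome ^= 1 << i
    return genome

def evolve(generations=200, pop_size=6, seed=0):
    random.seed(seed)
    pop = [random.randrange(1 << GENOME_BITS) for _ in range(pop_size)]
    for _ in range(generations):
        parents = select(pop, 3)
        children = [crossover(random.choice(parents), random.choice(parents))
                    for _ in range(pop_size)]
        pop = [mutate(c) for c in children]
    return max(pop, key=fitness)

best = evolve()
print(bin(best), fitness(best))
```

For evolvable hardware, the genome would instead be (part of) a configuration bitstream, and the fitness evaluation would run the configured circuit.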
Evolvable Hardware Platform
Genetic Algorithms: GAs are a type of guided search. Why use a guided search instead of an exhaustive search? Assume 1 billion individuals can be evaluated per second. If an individual's genome is 32 bits in size, how long does an exhaustive search take? Now suppose the genome is an FPGA configuration bitstream 1,000,000 bits in size: how long does an exhaustive search take?
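The arithmetic behind these questions is worth working out, assuming the slide's 10^9 evaluations per second:

```python
import math

# Back-of-envelope arithmetic behind the slide's questions, assuming
# 1 billion evaluations per second.
EVALS_PER_SEC = 10**9

# 32-bit genome: 2**32 candidate bitstreams -> about 4.3 seconds.
seconds_32 = 2**32 / EVALS_PER_SEC

# 1,000,000-bit genome: 2**1_000_000 candidates. Work in log10 to
# avoid overflow; the result is the exponent of the search time in years.
years_exp_1m = 1_000_000 * math.log10(2) - math.log10(EVALS_PER_SEC * 3600 * 24 * 365)
print(f"32-bit: {seconds_32:.1f} s; 1,000,000-bit: ~10^{years_exp_1m:.0f} years")
```

A 32-bit design space is searchable by brute force in seconds; a full-bitstream design space never is, which is exactly why a guided search is needed.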
Evolvable Hardware Taxonomy. Extrinsic evolution (furthest from biology): evolution done in SW, then the result is realized in HW. Intrinsic evolution: HW is used to deploy individuals; results are sent back to SW for fitness calculation. Complete evolution: evolution is done entirely on the target HW device. Open-ended evolution (closest to biology): the evaluation criteria change dynamically.
Evolvable Hardware Applications: prosthetic hand controller chip. Kajitani, "An Evolvable Hardware Chip for Prosthetic Hand Controller", 1999.
Evolvable Hardware Applications Tone Discrimination and Frequency generation Adrian Thompson “Silicon Evolution”, 1996 Xilinx XC6200
Evolvable Hardware Applications Tone Discrimination and Frequency generation Node Functions Node Genotype
Evolvable Hardware Applications Tone Discrimination and Frequency generation Evolved 4KHz oscillator
Evolvable Hardware Issues?
Evolvable Hardware Platforms. Commercial platforms: Xilinx XC6200 (completely multiplexer-based, so random bitstreams could be programmed dynamically without damaging the chip); Xilinx Virtex FPGA. Custom platforms: POEtic cell; Evolvable LSI chip (Higuchi).
Next Lecture: overview of the synthesis process.
Adaptive Thermoregulation for Applications on Reconfigurable Devices Phillip Jones Applied Research Laboratory Washington University Saint Louis, Missouri, USA http://www.arl.wustl.edu/arl/~phjones Iowa State University Seminar April 2008 Funded by NSF Grant ITR 0313203
What are FPGAs? FPGA: Field Programmable Gate Array. A sea of general-purpose logic: an array of CLBs (Configurable Logic Blocks).
FPGA Usage Models. Fast prototyping: experimental ISAs, experimental microarchitectures, System on Chip (SoC), e.g. the Sparc-V8 Leon. Full reconfiguration: parallel applications (image processing, computational biology), remote update. Partial reconfiguration: run-time adaptation, run-time customization (CPU + specialized HW), fault tolerance.
Some FPGA Details. Each CLB contains 4-input Look-Up Tables (LUTs): a LUT stores one output bit Z for every input combination ABCD = 0000 ... 1111, so programming its contents realizes any 4-input function, e.g. a 4-input AND (Z = 1 only for ABCD = 1111), a 4-input OR (Z = 0 only for ABCD = 0000), or a 2:1 mux.
Some FPGA Details. Routing between CLBs is configured through PIPs (Programmable Interconnection Points); a CLB's LUT output can optionally be registered in a DFF before driving the routing fabric.
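A 4-input LUT can be modeled in software as a 16-entry truth table indexed by the inputs. The mux encoding below (D selecting between B and C, with A ignored) is one plausible reading of the slide's don't-care table, not a confirmed detail:

```python
# A 4-input LUT is a 16-entry truth table indexed by the inputs (A,B,C,D);
# its configuration bits determine which function it implements.

def make_lut(truth_table):
    """truth_table[i] is the output bit for inputs ABCD = binary i."""
    assert len(truth_table) == 16
    def lut(a, b, c, d):
        index = (a << 3) | (b << 2) | (c << 1) | d
        return truth_table[index]
    return lut

# 4-input AND: output 1 only for ABCD = 1111.
lut_and = make_lut([0] * 15 + [1])

# 4-input OR: output 0 only for ABCD = 0000.
lut_or = make_lut([0] + [1] * 15)

# 2:1 mux (assumed encoding): Z = B when D = 0, Z = C when D = 1; A ignored.
lut_mux = make_lut([((i >> 2) & 1) if (i & 1) == 0 else ((i >> 1) & 1)
                    for i in range(16)])

print("AND(1,1,1,1) =", lut_and(1, 1, 1, 1))
```

Changing only the 16 configuration bits changes the function, which is the whole programming model of LUT-based fabric.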
Outline Why Thermal Management? Measuring Temperature Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions
Why Thermal Management?
Why Thermal Management? Location? Hot Cold Regulated
Why Thermal Management? Mobile? Hot Cold Regulated
Why Thermal Management? Reconfigurability FPGA Plasma Physics Microcontroller
Why Thermal Management? Exceptional Events
Local Experience Thermally aggressive application Disruption of air flow
Damaged Board (bottom view) Thermally aggressive application Disruption of air flow
Damaged Board (side view) Thermally aggressive application Disruption of air flow
Response to catastrophic thermal events Easy Fix Not Feasible!! Very Inconvenient
Solutions. Over-provision: large heat sinks and fans. Restrict performance: limit operating frequency, limit chip utilization. Use thermal feedback (my approach): dynamic operating frequency, adaptive computation, shutdown device.
Outline Why Thermal Management? Measuring Temperature Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions
Measuring Temperature FPGA
Measuring Temperature FPGA A/D 60 C
Background: Measuring Temperature (FPGA). S. Lopez-Buedo, J. Garrido, and E. Boemo, "Thermal testing on reconfigurable computers," IEEE Design and Test of Computers, vol. 17, pp. 84-91, 2000. An on-chip ring oscillator's period varies with temperature, so measuring its period yields an on-chip thermometer.
Background: Measuring Temperature (FPGA). The ring oscillator's period depends on supply voltage as well as temperature, so voltage variation must be accounted for.
Background: Measuring Temperature (FPGA). "Adaptive Thermoregulation for Applications on Reconfigurable Devices", by Phillip H. Jones, James Moscola, Young H. Cho, and John W. Lockwood; Field Programmable Logic and Applications (FPL'07), Amsterdam, Netherlands.
Background: Measuring Temperature (FPGA). The application (four cores) runs in different modes; the measured incrementer period (on the order of 8,000-8,300 counts) tracks junction temperature (roughly 40-70 C on the plot) and shifts with the cores' operating frequency (high vs. low).
Background: Measuring Temperature (FPGA). A sample controller pauses the four application cores while the thermometer takes a reading; a time-out counter bounds the pause, after which the cores resume at their current (high or low) frequency.
Temperature Benchmark Circuits. Desired properties: scalable (works over a wide range of frequencies; circuit size can easily be increased or decreased); simple to analyze (regular structure); distributes evenly over the chip (helps reduce thermal gradients that may damage the chip); may serve as a standard (further experimentation, repeatability of results). "A Thermal Management and Profiling Method for Reconfigurable Hardware Applications", by Phillip H. Jones, John W. Lockwood, and Young H. Cho; Field Programmable Logic and Applications (FPL'06), Madrid, Spain.
Temperature Benchmark Circuits. Core Block (CB): an array of 48 LUTs and 48 DFFs, each LUT configured as a 4-input AND gate and placed with relative location (RLOC: row, col) constraints. Thermal workload unit: a computation row, an 8-input generator driving an array of 18 core blocks (864 LUTs, 864 DFFs), anchored on the chip with RLOC_ORIGIN and driven at a 100% activation rate.
Example Circuit Layout (Configuration 1x, 9% LUTs and DFFs) RLOC_ORIGIN: Row, Col (27,6) Thermal Workload Unit
Example Circuit Layout (Configuration 4x, 36% LUTs and DFFs)
Observed Temperature vs. Frequency. T ~ P and P ~ F*C*V^2. Steady-state temperatures shown for configurations Cfg1x, Cfg2x, Cfg4x, and Cfg10x.
Observed Temperature vs. Active Area. T ~ P and P ~ F*C*V^2. Steady-state temperatures shown at 10, 25, 50, 100, and 200 MHz; max rated Tj is 85 C.
Projecting Thermal Trajectories. Estimate the steady-state temperature: Tj_ss = Power * θJA + TA, where θJA is the FPGA's thermal resistance (C/W) and the power measured at t = 0 is used. Fitted two-time-constant trajectory: Temperature(t) = 0.5*(71 - 41*e^(-t/20)) + 0.5*(71 - 41*e^(-t/180)). How long until 60 C? This warm-up phase can be exploited for performance.
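Using the fitted trajectory, the time to reach a given temperature can be found numerically. The constants 41, 71, 20 s, and 180 s are from the slide; the bisection search is my addition:

```python
import math

# Two-time-constant thermal trajectory from the slide's fit.
def temperature(t):
    return (0.5 * (71 - 41 * math.exp(-t / 20))
            + 0.5 * (71 - 41 * math.exp(-t / 180)))

# The curve starts at 30 C and rises monotonically toward 71 C, so a
# bisection search finds when any intermediate temperature is reached.
def time_to_reach(target, lo=0.0, hi=2000.0):
    for _ in range(100):
        mid = (lo + hi) / 2
        if temperature(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

t60 = time_to_reach(60.0)
print(f"reaches 60 C after ~{t60:.0f} s")
```

The window before the target temperature is reached (roughly two minutes here) is the warm-up phase the slide suggests exploiting for extra performance.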
Thermal Shutdown Max Tj (70C)
Outline Why Thermal Management? Measuring Temperature Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions
Image Correlation Application Template
Image Correlation Application. Virtex-4 FX100 resource utilization: LUTs 57,461 (68%), DFFs 49,148 (58%), occupied slices 32,868 (77%), Block RAM 44 (11%), max frequency 200 MHz. Heats the FPGA a lot (> 85 C)!
Application Infrastructure. A temperature sample controller feeds a thermoregulation controller (65 C setpoint), which can pause the application and set its mode. "Adaptive Thermoregulation for Applications on Reconfigurable Devices", by Phillip H. Jones, James Moscola, Young H. Cho, and John W. Lockwood; Field Programmable Logic and Applications (FPL'07), Amsterdam, Netherlands.
Application Specific Adaptation. Four image processor cores, each with two feature masks, feed a score output from an image buffer; features are split into high- and low-priority sets. The thermoregulation controller (65 C setpoint) adapts along two axes: frequency (200, 180, 150, 100, 75, 50 MHz) and quality (8 down to 4 features), shedding low-priority features as temperature rises and restoring features and frequency as it falls.
Thermally Adaptive Frequency. Junction temperature Tj oscillates between the thermal budget (72 C, switch to the low frequency) and the low threshold (67 C, switch back to the high frequency). "An Adaptive Frequency Control Method Using Thermal Feedback for Reconfigurable Hardware Applications", by Phillip H. Jones, Young H. Cho, and John W. Lockwood; Field Programmable Technology (FPT'06), Bangkok, Thailand.
Thermally Adaptive Frequency. A related reactive approach: S. Wang, "Reactive Speed Control", ECRTS'06.
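The two-threshold behavior in the plot amounts to a bang-bang (hysteresis) controller. A minimal sketch, with an invented toy thermal model (+0.5 C/step at the high frequency, -0.3 C/step at the low frequency) just to exercise the control law; only the 72 C budget and 67 C threshold come from the slide:

```python
BUDGET_C = 72.0         # thermal budget: fall back to low frequency
LOW_THRESHOLD_C = 67.0  # low threshold: resume high frequency

def control_step(temp_c, high_freq_active):
    """Return whether to run at the high frequency this step."""
    if high_freq_active and temp_c >= BUDGET_C:
        return False          # too hot: drop to the low frequency
    if not high_freq_active and temp_c <= LOW_THRESHOLD_C:
        return True           # cooled off: resume the high frequency
    return high_freq_active   # otherwise keep the current mode (hysteresis)

# Toy simulation of the resulting sawtooth between the two thresholds.
temp, high = 60.0, True
trace = []
for _ in range(200):
    high = control_step(temp, high)
    temp += 0.5 if high else -0.3
    trace.append(temp)

print(f"max {max(trace):.1f} C, steady-state low ~{min(trace[50:]):.1f} C")
```

The gap between the two thresholds prevents rapid oscillation between frequencies near the budget.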
Outline Why Thermal Management? Measuring Temperature Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions
Platform Overview Virtex-4 FPGA Temperature Probe
Thermal Budget Efficiency (65 C thermal budget; bar chart of junction temperature vs. thermal condition). A fixed design must be provisioned for the worst case: 50 MHz, 4 features. The adaptive design scales with conditions, e.g. 50 MHz/4 features, 50/6, 65/8, 106/8, 184/8, up to 200 MHz/8 features, across conditions ranging from 40 C ambient with no fans down to 25 C ambient with two fans; the gap up to the budget is otherwise unused headroom.
Conclusions. Motivated the need for thermal management. Measuring temperature: accounted for application-dependent voltage-variation effects; introduced temperature benchmark circuits. Examined application-specific adaptation for improving performance in dynamic thermal environments.
Outline Why Thermal Management? Measuring Temperature Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions
Thermally Constrained Systems Space Craft Sun Earth
Thermally Constrained Systems
Temperature-Safe Real-time Systems Task scheduling is a concern in many embedded systems Goal: Satisfy thermal constraints without violating real-time constraints
How to manage temperature? Static frequency scaling: slow every task (T1, T2, T3) down, but if the chip is still too hot, deadlines could be missed. Sleep while idle: run at full speed and insert idle periods between tasks. Generalization: idle task insertion.
Idle Task Insertion (more powerful). a. No idle task inserted; tasks scheduled at F_max (100 MHz): (period 30 s, cost 10.0 s, deadline 10.0 s, utilization 33.33%), (120, 30.0, 120, 25.00%), (480, 30.0, 480, 6.25%), (960, 20.0, 960, 2.08%); total utilization 66.66%. Since the first task's deadline equals its cost, frequency cannot be scaled or the task schedule becomes infeasible. b. One idle task inserted (period 60 s, cost 20.0 s, utilization 33.33%); total utilization 99.99%. Idle task insertion has no impact on tasks' costs, leaves higher-priority task response times unaffected, and allows control over the distribution of idle time.
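The utilization arithmetic behind tables a and b checks out exactly; the slide's 66.66% and 99.99% are truncations of 2/3 and 1:

```python
# Utilization arithmetic for the idle-task-insertion example. Each task
# contributes cost/period; the idle task soaks up the remaining slack
# without changing any other task's cost.

def utilization(tasks):
    """tasks: list of (period_s, cost_s) pairs."""
    return sum(cost / period for period, cost in tasks)

# Table a: the four real tasks scheduled at F_max (100 MHz).
base = [(30, 10.0), (120, 30.0), (480, 30.0), (960, 20.0)]
print(f"base: {utilization(base) * 100:.2f}%")            # 2/3 of the CPU

# Table b: add one idle task (period 60 s, cost 20 s -> 33.33%).
with_idle = base + [(60, 20.0)]
print(f"with idle task: {utilization(with_idle) * 100:.2f}%")  # fully loaded
```

With the idle task in place, nearly all previously unstructured slack is scheduled explicitly, so the scheduler controls exactly when the device cools.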
Sleep when idle is insufficient Temperature constraint = 65 C Peak Temperature = 70 C
Idle-task inserted Temperature constraint = 65 C Peak Temperature = 61 C
Idle-Task Insertion + Deadlines. Flow: the system (task set) plus idle tasks goes to the scheduler (e.g. RMS); check deadlines met (yes/no), then temperature met (yes/no). a. The original schedule does not meet the temperature constraint. b. Idle tasks redistribute the device's idle time to reduce peak device temperature.
Related Research. Power management: Yao (FOCS'95), EDF with dynamic frequency scaling; Shin (DAC'99), worst-case execution time. Thermal management: Bansal (FOCS'04), EDF minimizing temperature; Wang (RTSS'06, ECRTS'06), RMS with reactive frequency control (CIA).
Outline Why Thermal Management? Measuring Temperature Thermally Driven Adaptation Experimental Results Conclusions Temperature-Safe Real-time Systems Future Directions
Research Fronts. Near term: exploration of adaptation techniques (advanced FPGA reconfiguration capabilities, other frequency adaptation techniques); integration of temperature into real-time systems. Longer term: cyber-physical systems (NSF initiative).
Questions/Comments?
Temperature per Processing Core (junction temperature, C, vs. number of processing cores, 1-4). Linear fits per scenario: S1: y = 2.21x + 60.1; S2: y = 2.24x + 57.1; S3: y = 2.23x + 52.1; S4: y = 2.07x + 44.2; S5: y = 1.43x + 37.5; S6: y = 1.22x + 34.0.
Temperature Sample Mode
Ring Oscillator Thermometer Characteristics. Thermometer size: ~100 LUTs. Ring oscillator size: 48 LUTs (47 NOT + 1 OR). Oscillation period: ~40 ns. Incrementer cycle period: ~0.16 ms (40 ns * 4096). Temperature resolution: 0.1 C/count, or 0.1 C per 20 ns.
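Converting the thermometer's count to degrees is a linear map at the table's 0.1 C/count resolution. The calibration point below (count 8235 corresponding to 40 C) is an illustrative assumption, not a figure from the slides; a real design would calibrate against a reference sensor:

```python
# Linear count-to-temperature conversion at 0.1 C per count.
RESOLUTION_C_PER_COUNT = 0.1
CAL_COUNT, CAL_TEMP_C = 8235, 40.0  # assumed calibration point

def count_to_temp(count):
    """Map an incrementer count to junction temperature in C."""
    return CAL_TEMP_C + (count - CAL_COUNT) * RESOLUTION_C_PER_COUNT

# The three application-mode counts from the temperature-vs-period plot.
for count in (8235, 8425, 8620):
    print(count, f"-> {count_to_temp(count):.1f} C")
```

Because the oscillator also varies with supply voltage, the calibration (and possibly the slope) must be redone per application operating mode, as the plot of modes A, B, and C suggests.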
Temperature vs. Incrementer Period (measuring temperature while the application is active). Plot spans 10-90 C against incrementer periods of 8100-8700 (20 ns/count); application modes A, B, and C yield counts of 8235, 8425, and 8620 respectively.
Application implementation statistics. Virtex-4 FX100 resource utilization: LUTs 57,461 (68%), DFFs 49,148 (58%), occupied slices 32,868 (77%), Block RAM 44 (11%), max frequency 200 MHz. Image correlation characteristics: image size 320x480 pixels, 8-bit (grey scale) pixel resolution, 1-8 features, processing rate 40.6 frames per second (at 200 MHz).
Application implementation statistics. a.) VirtexE 2000 resource utilization: LUTs 57,461 (68%), DFFs 49,148 (58%), occupied slices 32,868 (15,808), Block RAM 26% (43), max frequency 125 MHz. b.) Image correlation characteristics: image size 640x480 pixels, 8-bit (grey scale) pixel resolution, 1-4 mask patterns, 10 templates (in parallel), processing rate 12.7 images/second (at 125 MHz).
Scenario Descriptions (S1-S6: ambient temperature, # of fans). S1: 40 C (104 F), 0 fans. S2: 35 C (95 F), 0 fans. S3: 30 C (86 F), 0 fans. S4: 25 C (77 F), 0 fans. S5: 25 C (77 F), 1 fan. S6: 25 C (77 F), 2 fans.
High Level Architecture Application Pause Thermal Manager Frequency & Quality Controller Frequency mode Quality Temperature
Periodic Temperature Sampling. Every 50 ms an event counter triggers a sample: the thermal manager pauses the application, the sample mode controller captures a reading from the ring-oscillator-based thermometer (ready/capture handshake), and the frequency & quality controller updates the frequency mode and quality.
Ring Oscillator Based Thermometer (block diagram). The ring oscillator clock (ring_clk) drives a 12-bit incrementer; an edge detect on the incrementer's MSB captures a 14-bit count into a register (with reset), which is presented as the temperature value with a Ready signal through an output mux (sel).
ASIC, GPP, FPGA Comparison Cost Performance Power Flexibility
Frequency Multiplexing Circuit. A clock multiplier (DLLs) produces clk and 4xclk; a 2:1 mux under frequency control selects which one drives the global clock tree (BUFG). The current Virtex-4 platform uses the glitch-free BUFGMUX component.
Thermally Adaptive Frequency High Frequency Thermal Budget = 72 C Junction Temperature, Tj (C) Low Frequency Low Threshold = 67 C Time (s)
Worst Case Thermal Condition. The fixed thermally safe frequency for a 70 C budget is 50 MHz; a 30/120 MHz adaptive-frequency scheme achieves an average of 48.5 MHz, essentially matching it.
Typical Thermal Condition. Against the fixed thermally safe 50 MHz (70 C budget), the 30/120 MHz adaptive scheme achieves 95 MHz.
Best Case Thermal Condition. The 30/120 MHz adaptive scheme achieves 119 MHz versus the fixed thermally safe 50 MHz (70 C budget): a 2.4x performance increase.
CPRE 583 Reconfigurable Computing, Lecture 21: Fri 11/12/2010 (Synthesis). Instructor: Dr. Phillip Jones (phjones@iastate.edu), Reconfigurable Computing Laboratory, Iowa State University, Ames, Iowa, USA. http://class.ee.iastate.edu/cpre583/
Announcements/Reminders. HW3: finishing up (hope to release this evening); will be due Fri 12/17 at midnight. Two lectures left: Fri 12/3 (Synthesis and Map), Wed 12/8 (Place and Route). Two class sessions for project presentations: Fri 12/10 and Wed 12/15 (??). Take-home final given on Wed 12/15, due 12/17 at 5pm.
What you should learn: an intro to synthesis. Reference: Synthesis and Optimization of Digital Circuits, De Micheli, 1994 (chapter 1).
Synthesis (big picture) Synthesis & Optimization Architectural Logic Boolean Function Min Boolean Relation Min State Min Scheduling Sharing Coloring Covering Satisfiability Graph Theory Boolean Algebra
Views of a design: behavioral view vs. structural view, at two levels. Architectural level: behavior such as PC = PC + 1; Fetch(PC); Decode(INST), vs. structure such as Add/Mult units, RAM, and control. Logic level: behavior as a state machine (S1, S2, S3), vs. structure as gates and DFFs.
Levels of Synthesis. Architectural level: translate the architectural behavioral view of a design into a structural (e.g. block-level) view; identify functional resources, schedule their use (control), and interconnect them (data path). Logic level: translate the logic behavioral view of a design into a gate-level structural view.
Example: Diffeq (Forward Euler method). y'' + 3xy' + 3y = 0, where x(0) = 0; y(0) = y; y'(0) = u; for x = 0 to a, dx step size. On clk's rising edge: x1 <= x + dx; u1 <= u - (3 * x * u * dx) - (3 * y * dx); y1 <= y + u * dx; if (x1 < a) then ans_done <= 0; else ans_done <= 1; end if;
Example: Diffeq (Forward Euler method). This behavior maps onto a datapath with a multiplier (*) and an ALU, memory and steering logic, and a control unit.
Example: Diffeq Forward Euler method y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic
Example: Diffeq Forward Euler method y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 read S2 S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic
Example: Diffeq Forward Euler method y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 + S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic
Example: Diffeq Forward Euler method y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 * S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic
Example: Diffeq Forward Euler method y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 * S6 S5 S4 * ALU Control Unit Memory & Steering logic
Example: Diffeq Forward Euler method y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 *, + S6 S5 S4 * ALU Control Unit Memory & Steering logic
Example: Diffeq Forward Euler method y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 * S6 S5 S4 * ALU Control Unit Memory & Steering logic
Example: Diffeq (Forward Euler method)
y'' + 3xy' + 3y = 0, where x(0) = 0; y(0) = y; y'(0) = u; for x = 0 to a, dx step size
On each clk rising edge:
  x1 <= x + dx;
  u1 <= u - (3 * x * u * dx) - (3 * y * dx);
  y1 <= y + u * dx;
  if (x1 < a) then ans_done <= 0; else ans_done <= 1; end if;
  x <= x1; u <= u1; y <= y1;
[Figure: a control unit steps through states S1-S10, scheduling each operation (*, +, comparison, register write) onto a single shared ALU; memory and steering logic move operands between the registers and the ALU.]
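The register updates above can be sanity-checked in software. A minimal sketch of the same Forward Euler iteration (the function name euler_diffeq is my own; each loop pass mirrors the x1/u1/y1 assignments the datapath computes per FSM sweep):

```python
# Hedged sketch: software model of the Forward Euler update for
# y'' + 3xy' + 3y = 0. Variable names x, y, u, dx, a follow the slide.
def euler_diffeq(y0, u0, dx, a):
    x, y, u = 0.0, y0, u0
    while x < a:
        # One pass through the FSM's states S1..S10:
        x1 = x + dx
        u1 = u - (3 * x * u * dx) - (3 * y * dx)
        y1 = y + u * dx
        x, u, y = x1, u1, y1          # register writeback
    return y

# Example: two steps of size 0.25 from y(0)=1, y'(0)=0
print(euler_diffeq(1.0, 0.0, 0.25, 0.5))  # -> 0.8125
```

Shrinking dx trades more ALU cycles for a closer approximation of the true solution, which is exactly the latency/accuracy knob the hardware version exposes.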
Optimization
Combinational metrics: propagation delay, circuit size
Sequential metrics: cycle time, latency, circuit size
Impact of High-level Synthesis on Optimization
y'' + 3xy' + 3y = 0, where x(0) = 0; y(0) = y; y'(0) = u; for x = 0 to a, dx step size
On each clk rising edge:
  x1 <= x + dx;
  u1 <= u - (3 * x * u * dx) - (3 * y * dx);
  y1 <= y + u * dx;
  if (x1 < a) then ans_done <= 0; else ans_done <= 1; end if;
[Figure: the same behavioral description can be synthesized onto one shared ALU with memory and steering logic, or onto several parallel multipliers and ALUs; the high-level synthesis choice trades circuit size against latency.]
Logic-level Synthesis and Optimization
Combinational: two-level optimization, multi-level optimization
Sequential: state-based models, network models
Logic-level Synthesis and Optimization Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models Sum of products A’B’C’D’ + A’B’C’D + A’B’CD’ + A’B’CD + A’BCD’
Logic-level Synthesis and Optimization (two-level)
Sum of products: A'B'C'D' + A'B'C'D + A'B'CD' + A'B'CD + A'BCD'
K-map:
          CD=00  CD=01  CD=11  CD=10
  AB=00     1      1      1      1
  AB=01     0      0      0      1
  AB=11     0      0      0      0
  AB=10     0      0      0      0
Minimized sum of products: A'B' + A'CD'
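The minimized cover can be checked mechanically by exhausting the truth table. A brute-force sketch (the function names original and minimized are my own):

```python
from itertools import product

# Hedged sketch: verify that the minimized cover A'B' + A'CD' is
# logically equivalent to the original five-minterm sum of products.
def original(a, b, c, d):
    return ((not a and not b and not c and not d) or
            (not a and not b and not c and d) or
            (not a and not b and c and not d) or
            (not a and not b and c and d) or
            (not a and b and c and not d))

def minimized(a, b, c, d):
    return (not a and not b) or (not a and c and not d)

# Check all 16 input combinations
assert all(original(*v) == minimized(*v)
           for v in product([False, True], repeat=4))
```

This is the same equivalence the K-map grouping establishes by eye: the AB=00 row collapses to A'B', and the A'CD' prime implicant picks up the remaining minterm A'BCD'.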
Logic-level Synthesis and Optimization (multi-level)
Multi-level, high-level view: factor out shared subexpressions, e.g. with
  A = xy + xw
  B = xw
the cover A'B'CD + A'BC'D' expands to (xy + xw)'(xw)'CD + (xy + xw)'(xw)C'D'
Logic-level Synthesis and Optimization
Combinational: two-level optimization, multi-level optimization
Sequential: state-based models, network models
Introduction to HW3
Next Lecture MAP
CPRE 583 Reconfigurable Computing
Lecture 22: Fri 11/19/2010 (Coregen Overview)
Instructor: Dr. Phillip Jones (phjones@iastate.edu)
Reconfigurable Computing Laboratory, Iowa State University, Ames, Iowa, USA
http://class.ee.iastate.edu/cpre583/
Announcements/Reminders
HW3: released by Saturday midnight; will be due Wed 12/15 midnight.
Turn in weekly project report (tonight midnight).
Midterms still being graded, sorry for the delay: you can stop by my office after 5 pm today to pick up your graded test.
584 Advertisement: Number 1
What you should learn
Basics of using Coregen; in-class demo
Next Lecture Finish up synthesis process, start MAP
CPRE 583 Reconfigurable Computing
Lecture 23: Wed 12/1/2010 (Class Project Work)
Instructor: Dr. Phillip Jones (phjones@iastate.edu)
Reconfigurable Computing Laboratory, Iowa State University, Ames, Iowa, USA
http://class.ee.iastate.edu/cpre583/
Announcements/Reminders
HW3: finishing up (hope to release this evening); will be due Fri 12/17 midnight.
Two lectures left:
  Fri 12/3: Synthesis and Map
  Wed 12/8: Place and Route
Two class sessions for project presentations:
  Fri 12/10
  Wed 12/15 (??)
Take-home final given on Wed 12/15, due 12/17 5 pm
Next Lecture Finish up synthesis process, MAP
CPRE 583 Reconfigurable Computing
Lecture 24: Wed 12/8/2010 (Map, Place & Route)
Instructor: Dr. Phillip Jones (phjones@iastate.edu)
Reconfigurable Computing Laboratory, Iowa State University, Ames, Iowa, USA
http://class.ee.iastate.edu/cpre583/
Announcements/Reminders
HW3: finishing up (hope to release this evening); will be due Fri 12/17 midnight.
Two lectures left:
  Fri 12/3: Synthesis and Map
  Wed 12/8: Place and Route
Two class sessions for project presentations:
  Fri 12/10
  Wed 12/15 (9 - 10:30 am)
Take-home final given on Wed 12/15, due 12/17 5 pm
Applications on FPGA: Low-level
Implement circuit in VHDL (Verilog)
Simulate compiled VHDL
Synthesize VHDL into a device-independent format
Map the device-independent format to device-specific resources
  Check that the device has enough resources for the design
Place resources onto physical device locations
Route (connect) resources together
  Completely routed? Circuit meets specified performance?
Download configuration file (bit-stream) to the FPGA
Applications on FPGA: Low-level Implement Simulate Synthesize Map Place Route Download
(Technology) Map
Translate the device-independent netlist to device-specific resources
Applications on FPGA: Low-level Implement Simulate Synthesize Map Place Route Download
Place
Bind each mapped resource to a physical device location
User-guided layout (Chapter 16, Reconfigurable Computing)
General purpose (Chapter 14, Reconfigurable Computing)
  Simulated annealing
  Partition-based
Structure guided (Chapter 15, Reconfigurable Computing)
  Data-path based
Heuristics are used: there is no efficient means for finding an optimal solution
Place (High-level)
[Figure: a technology-mapped netlist (inputs A, B, C; LUT D; RAM E; DFFs F and G; clk; out) is bound onto the FPGA's physical layout of I/O blocks, LUTs, and BRAMs.]
Place
User-guided layout (Chapter 16, Reconfigurable Computing)
General purpose (Chapter 14, Reconfigurable Computing)
  Simulated annealing
  Partition-based
Structure guided (Chapter 15, Reconfigurable Computing)
  Data-path based
Place (User-Guided)
User provides information about the application's structure to help guide placement
  Can help remove critical paths
  Can greatly reduce the amount of time spent routing
Several methods to guide placement:
  Fixed region
  Floating region
  Exact location
  Relative location
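These guidance methods are typically expressed in the vendor's constraint file. A hedged sketch in Xilinx UCF style (the instance names my_lut, my_core, ff_a, ff_b and the site coordinates are hypothetical; consult the vendor's constraints guide for the exact syntax of your toolchain):

```
# Exact location: pin an instance to a specific slice
INST "my_lut" LOC = "SLICE_X10Y20";

# Region constraint: confine a group of logic to a rectangular area
INST "my_core/*" AREA_GROUP = "AG_core";
AREA_GROUP "AG_core" RANGE = "SLICE_X0Y0:SLICE_X15Y31";

# Relative location: keep two elements at a fixed offset from each other,
# while letting the group as a whole float
INST "ff_a" RLOC = "X0Y0";
INST "ff_b" RLOC = "X0Y1";
```

LOC corresponds to the "exact location" method, AREA_GROUP/RANGE to fixed regions, and RLOC to relative location.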
Place (User-Guided): Examples
Fixed region: [Figure: part of the mapped netlist (LUT D, DFFs F and G) is constrained to a fixed rectangular region of the FPGA, e.g. near the SDRAM interface.]
Place (User-Guided): Examples
Floating region: [Figure: a softcore processor is constrained to a region whose exact position the tools may choose.]
Place (User-Guided): Examples
Exact location: [Figure: LUT D and DFFs F and G are each pinned to specific LUT sites on the device.]
Place (User-Guided): Examples
Relative location: [Figure: LUT D and DFFs F and G keep fixed positions relative to one another, while the group as a whole may be placed anywhere on the device.]
Place
User-guided layout (Chapter 16, Reconfigurable Computing)
General purpose (Chapter 14, Reconfigurable Computing)
  Simulated annealing
  Partition-based
Structure guided (Chapter 15, Reconfigurable Computing)
  Data-path based
Place (General Purpose)
Characteristics: places resources without any knowledge of high-level structure; guided primarily by local connections between resources.
Drawback: does not take explicit advantage of the application's structure.
Advantage: can typically be used to place any arbitrary circuit.
Place (General Purpose)
Preprocess the mapped netlist using clustering: group netlist components that have local connectivity into a single logic block.
Clustering helps reduce the number of objects the placement algorithm has to explicitly place.
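A minimal sketch of this kind of clustering, assuming a toy netlist format that maps each component to the set of nets it touches (the function name cluster and the greedy size-capped strategy are my own illustration, not a production clusterer):

```python
# Hedged sketch: greedy pre-placement clustering. Components that share a
# net are absorbed into an existing cluster until a size cap is reached.
def cluster(netlist, max_size=4):
    clusters = []                       # each cluster: (set of components, set of nets)
    for comp, nets in netlist.items():
        # try to absorb the component into a cluster it already touches
        for comps, cnets in clusters:
            if cnets & nets and len(comps) < max_size:
                comps.add(comp)
                cnets |= nets
                break
        else:
            clusters.append(({comp}, set(nets)))
    return [sorted(c) for c, _ in clusters]

# Example: D-F-G form one connected group; X is isolated
print(cluster({"D": {"n1"}, "F": {"n1", "n2"}, "G": {"n2"}, "X": {"n9"}}))
# -> [['D', 'F', 'G'], ['X']]
```

The placer then places clusters instead of individual components, shrinking the search space the annealer has to explore.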
Place (General Purpose) Placement using simulated annealing Based on the physical process of annealing used to create metal alloys
Place (General Purpose)
Simulated annealing basic algorithm:
  Placement_cur = Initial_Placement;
  T = Initial_Temperature;
  While (not exit criterion 1)
    While (not exit criterion 2)
      Placement_new = Modify_placement(Placement_cur)
      dCost = Cost(Placement_new) - Cost(Placement_cur)
      r = random(0, 1)
      If r < e^(-dCost / T), Then Placement_cur = Placement_new
    End loop
    T = UpdateTemp(T)
  End loop
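The loop above can be made concrete with a toy placement problem: put connected blocks on a small grid while minimizing total wirelength. A hedged sketch (the function names, half-perimeter cost model, swap move, and geometric cooling schedule are my own choices, not the tuned schedules real placers use):

```python
import math, random

# Cost: half-perimeter bounding box of each net, summed over all nets.
def wirelength(place, nets):
    total = 0
    for net in nets:
        xs = [place[b][0] for b in net]
        ys = [place[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def anneal(blocks, nets, grid, T=10.0, cooling=0.95, moves_per_temp=50):
    rng = random.Random(0)               # fixed seed for repeatability
    slots = [(x, y) for x in range(grid) for y in range(grid)]
    place = dict(zip(blocks, slots))     # Placement_cur = Initial_Placement
    cost = wirelength(place, nets)
    while T > 0.01:                      # exit criterion 1: system "frozen"
        for _ in range(moves_per_temp):  # exit criterion 2: moves per temperature
            a, b = rng.sample(blocks, 2)
            place[a], place[b] = place[b], place[a]      # Modify_placement: swap
            new_cost = wirelength(place, nets)
            # Accept improving moves always; accept uphill moves with
            # probability e^(-dCost / T), which shrinks as T cools.
            if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / T):
                cost = new_cost
            else:
                place[a], place[b] = place[b], place[a]  # reject: undo the swap
        T *= cooling                     # UpdateTemp: geometric cooling
    return place, cost

# Example: a 4-block chain A-B-C-D on a 2x2 grid settles to wirelength 3
place, cost = anneal(["A", "B", "C", "D"],
                     [["A", "B"], ["B", "C"], ["C", "D"]], 2)
print(cost)
```

Early on, the high temperature lets the placer accept cost-increasing swaps and escape local minima; as T falls, it behaves greedily, mirroring how slowly cooled metal settles into a low-energy crystal.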
Place (General Purpose)
Simulated annealing: illustration
[Figure: a sequence of proposed swaps moves logic blocks (A, B, D, F, G, X, Z) among LUT and BRAM sites; swaps that lower the cost are kept, while some uphill swaps are accepted early on to escape local minima.]
Place
User-guided layout (Chapter 16, Reconfigurable Computing)
General purpose (Chapter 14, Reconfigurable Computing)
  Simulated annealing
  Partition-based
Structure guided (Chapter 15, Reconfigurable Computing)
  Data-path based
Place (Structure-based)
Leverage the structure of the application.
Algorithms may work well for a given structure, but will likely give unacceptable results for a design with little regular structure.
Structure high-level example
Applications on FPGA: Low-level Implement Simulate Synthesize Map Place Route Download
Route
Connect placed resources together.
Two requirements:
  Design must be completely routed
  Routed design meets timing requirements
Widely used algorithm: PathFinder
  PathFinder (FPGA'95), McMurchie and Ebeling
  Reconfigurable Computing (Chapter 17), Scott Hauck, Andre DeHon (2008)
Route
[Figure: an example circuit routed onto the FPGA's wire and switch fabric.]
Route (PathFinder)
PathFinder: A Negotiation-Based Performance-Driven Router for FPGAs (FPGA'95)
Basic PathFinder algorithm:
  Based closely on Dijkstra's shortest-path algorithm
  Weights are assigned to nodes instead of edges
Route (PathFinder): Example
G = (V, E)
  Vertices V: set of nodes (wires)
  Edges E: set of switches used to connect wires
Cost of using a wire: c_n = (b_n + h_n) * p_n
[Figure: sources S1-S3 connect through wires A, B, C, each with a base cost, to sinks D1-D3.]
Route (PathFinder): Example
Simple node cost: c_n = b_n (obstacle avoidance)
Note: the order in which signals are routed matters.
Route (PathFinder): Example
c_n = b_n * p_n
  p: sharing cost (a function of the number of signals currently sharing the resource)
Congestion avoidance
Route (PathFinder): Example
c_n = (b_n + h_n) * p_n
  h: history of sharing cost from previous iterations
Congestion avoidance
[Figure: over successive iterations, the history term drives signals off chronically congested wires until every signal is routed without sharing.]
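The negotiation loop above can be sketched on a tiny node graph. This is a hedged toy sketch, not the published PathFinder schedule: the node cost follows the slide's c_n = (b_n + h_n) * p_n, but the sharing term p_n is simplified to 1 + (current users), the history increment is a flat +1, and the graph format is my own.

```python
import heapq

def dijkstra(graph, src, dst, cost):
    """Shortest path where cost is charged per node rather than per edge."""
    dist = {src: cost(src)}
    prev = {}
    pq = [(dist[src], src)]
    while pq:
        d, n = heapq.heappop(pq)
        if n == dst:
            break
        if d > dist.get(n, float("inf")):
            continue                     # stale queue entry
        for m in graph[n]:
            nd = d + cost(m)
            if nd < dist.get(m, float("inf")):
                dist[m], prev[m] = nd, n
                heapq.heappush(pq, (nd, m))
    path, n = [dst], dst
    while n != src:                      # walk predecessors back to the source
        n = prev[n]
        path.append(n)
    return path[::-1]

def pathfinder(graph, base, signals, max_iters=10):
    history = {n: 0 for n in graph}      # h_n: accumulated congestion history
    routes = {}
    for _ in range(max_iters):
        use = {n: 0 for n in graph}      # current sharing count per node
        for sig, (src, dst) in signals.items():
            # c_n = (b_n + h_n) * p_n, with p_n modeled as 1 + current sharing
            cost = lambda n: (base[n] + history[n]) * (1 + use[n])
            routes[sig] = dijkstra(graph, src, dst, cost)
            for n in routes[sig]:
                use[n] += 1
        congested = [n for n, u in use.items() if u > 1]
        if not congested:
            return routes                # fully routed, no shared wires
        for n in congested:
            history[n] += 1              # make congested wires costlier next pass
    return routes

# Example: two signals both prefer cheap wire A; negotiation evicts one to B
graph = {"S1": ["A", "B"], "S2": ["A", "B"],
         "A": ["D1", "D2"], "B": ["D1", "D2"], "D1": [], "D2": []}
base = {"S1": 1, "S2": 1, "A": 1, "B": 2, "D1": 1, "D2": 1}
routes = pathfinder(graph, base, {"sig1": ("S1", "D1"), "sig2": ("S2", "D2")})
print(routes)
```

In the first iteration both signals route through the cheap wire A; the history term then raises A's cost until the second signal finds the conflict-free route through B, which is exactly the negotiated-congestion behavior the slide illustrates.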
Applications on FPGA: Low-level Implement Simulate Synthesize Map Place Route Download
Download Convert routed design into a device configuration file (e.g. bitfile for Xilinx devices)
Next Lecture Project presentations
Questions/Comments/Concerns
Write down:
  The main point of the lecture
  One thing that's still not quite clear
  OR, if everything is clear, an example of how to apply something from the lecture
Place (Structure-based)
Leverage the structure of the application.
Algorithms may work well for a given structure, but will likely give unacceptable results for a design with little regular structure.
GLACE, "A Generic Library for Adaptive Computing Environments" (FPL 2001), is an example tool that takes the structure of an application into account.
  FLAME (Flexible API for Module-based Environments)
  JHDL (from BYU)
  Gen (from Lockheed-Martin Advanced Technology Laboratories)
GLACE: High-level
GLACE: Flow
GLACE: Library Modules
GLACE: Data Path and Control Path
GLACE: FLAME low-level
GLACE: Final placement example