Reconfigurable Computing (High-level Acceleration Approaches) Dr. Phillip Jones, Scott Hauck Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA
Projects Ideas: Relevant conferences FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Supercomputing HPCA IPDPS
Initial Project Proposal Slides (5-10 slides) Project team list: Name, Responsibility (who is the project leader) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up and cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea
Projects: Target Timeline Teams Formed and Idea: Mon 10/11 Project idea in PowerPoint 3-5 slides Motivation (why is this interesting, useful) What will be the end result High-level picture of final product Project team list: Name, Responsibility High-level Plan/Proposal: Wed 10/20 PowerPoint 5-10 slides System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Related research papers (if any)
Common Questions
Overview First 15 minutes of Google FPGA lecture How to run Gprof Discuss some high-level approaches for accelerating applications.
What you should learn Start to get a feel for approaches for accelerating applications.
Why use Customized Hardware? Great talk about the benefits of Heterogeneous Computing http://video.google.com/videoplay?docid=-4969729965240981475#
Profiling Applications Finding bottlenecks Profiling tools gprof: http://www.cs.nyu.edu/~argyle/tutorial.html Valgrind
Pipelining How many ns to process 100 input vectors, assuming each 4-LUT has a 1 ns delay? [Figure: input vector <A,B,C,D> driving an unpipelined chain of four 4-LUTs] How many ns to process 100 input vectors if a DFF is placed after each 4-LUT? Assume a 1 ns clock and 1 DFF delay per output. [Figure: the same chain with pipeline DFFs]
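A quick sketch of the two counts, under the slide's assumptions (four LUT levels, 1 ns per LUT, a 1 ns pipelined clock, 100 vectors):

```python
# Throughput comparison for the 4-LUT chain on the slide (assumptions:
# 4 LUT levels, 1 ns delay per LUT, 100 input vectors).

def unpipelined_ns(num_vectors, levels, lut_delay_ns=1):
    # Without pipeline registers the clock must cover the whole
    # combinational path, so each vector takes levels * lut_delay_ns.
    return num_vectors * levels * lut_delay_ns

def pipelined_ns(num_vectors, levels, clock_ns=1):
    # With a DFF after every LUT, a new vector enters each clock;
    # the first result appears after `levels` clocks (fill latency),
    # then one result per clock after that.
    return (levels + (num_vectors - 1)) * clock_ns

print(unpipelined_ns(100, 4))  # 400 ns
print(pipelined_ns(100, 4))    # 103 ns
```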
Pipelining (Systolic Arrays) Dynamic Programming 1.) Start with the base case (lower left corner) 2.) Formula for computing the remaining cells 3.) Final result in upper right corner. [Figure: animation filling a 3x3 matrix cell by cell: bottom row 1 1 1, middle row 1 2 3, top row 1 3 6]
Pipelining (Systolic Arrays) How many ns to process if a CPU can process one cell per clock (1 ns clock)?
Pipelining (Systolic Arrays) How many ns to process if an FPGA can obtain maximum parallelism each clock (1 ns clock)?
Pipelining (Systolic Arrays) What speedup would an FPGA obtain (assuming maximum parallelism) for a 100x100 matrix? (Hint: find a formula for an NxN matrix)
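A sketch of the usual counting argument behind the hint, under the assumption that the CPU fills one cell per 1 ns clock while the FPGA fills an entire anti-diagonal of cells per clock (an NxN matrix has 2N - 1 anti-diagonals):

```python
def cpu_ns(n):
    # One cell per 1 ns clock: N*N cells total.
    return n * n

def fpga_ns(n):
    # All cells on an anti-diagonal are independent, so a systolic
    # array can compute one anti-diagonal per clock: 2N - 1 clocks.
    return 2 * n - 1

n = 100
print(cpu_ns(n))               # 10000 ns
print(fpga_ns(n))              # 199 ns
print(cpu_ns(n) / fpga_ns(n))  # speedup of roughly N/2, about 50x here
```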
Example RNA Model (Dr. James Moscola) [Figure: RNA model for the sequence 'cga': a tree of nodes ROOT0, MATP1, MATL2, END3, each expanded into its internal states (S0, IL1, IR2, MP3, ML4, MR5, D6, IL7, IR8, ML9, D10, IL11, E12)]
Baseline Architecture Pipeline [Figure: the model's states (S0 through E12) arranged as pipeline stages, with the input residues (u g g c g a c a c c c) streaming through a residue pipeline]
Processing Elements [Figure: the processing element for state ML4 combines incoming scores from states IL7, IR8, ML9, and D10 with their transition costs ML4_t(7..10), then applies the emission score ML4_e for the input residue xi to produce ML4,3,3]
Baseline Results for Example Model Comparison to Infernal software: Infernal run on an Intel Xeon 2.8GHz; Baseline architecture run on a Xilinx Virtex-II 4000 (occupied 88% of logic resources, ran at 100 MHz) Input database of 100 Million residues Bulk of time spent on I/O (41.434s)
Expected Speedup on Larger Models

Name    | Num PEs | Pipeline Width | Pipeline Depth | Latency (ns) | HW Processing Time (s) | Total Time w/ measured I/O (s) | Infernal Time (s) | Infernal Time w/QDB (s) | Speedup over Infernal | Speedup w/QDB
RF00001 | 3539545 | 39492 | 195 | 19500 | 1.0000195 | 42.4340195 | 349492 | 128443 | 8236 | 3027
RF00016 | 5484002 | 43256 | 282 | 28200 | 1.0000282 | 42.4340282 | 336000 | 188521 | 7918 | 4443
RF00034 | 3181038 | 38772 | 187 | 18700 | 1.0000187 | 42.4340187 | 314836 |  87520 | 7419 | 2062
RF00041 | 4243415 | 44509 | 206 | 20600 | 1.0000206 | 42.4340206 | 388156 | 118692 | 9147 | 2797
Example |      81 |    26 |   6 |   600 | 1.0000006 | 42.4340006 |   1039 |    868 |   25 |   20

Speedups estimated using a 100 MHz clock for processing a database of 100 Million residues. Speedups range from 500x to over 13,000x; larger models with more parallelism exhibit greater speedups.
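The table's arithmetic can be reproduced with a short sketch, assuming (as the baseline results suggest) that hardware time is the 100 Million residues streamed at 100 MHz plus the pipeline latency, and that the measured 41.434 s of I/O is added on top:

```python
def expected_speedup(latency_ns, infernal_s, residues=100e6,
                     clock_hz=100e6, io_s=41.434):
    # HW processing: one residue per clock, plus pipeline fill latency.
    hw_s = residues / clock_hz + latency_ns * 1e-9
    total_s = hw_s + io_s                 # add the measured I/O time
    return infernal_s / total_s

# RF00001 row: latency 19500 ns, Infernal 349492 s (128443 s with QDB)
print(round(expected_speedup(19500, 349492)))   # ~8236
print(round(expected_speedup(19500, 128443)))   # ~3027
```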
Distributed Memory [Figure: each PE pairs an ALU and cache with several local BRAMs]
Next Class Models of Computation (Design Patterns)
Questions/Comments/Concerns Write down: Main point of lecture One thing that's still not quite clear OR, if everything is clear, give an example of how to apply something from lecture
CPRE 583 Reconfigurable Computing Lecture 11: Fri 10/1/2010 (Design Patterns) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ee.iastate.edu/cpre583/
Initial Project Proposal Slides (5-10 slides) Project team list: Name, Responsibility (who is the project leader) Team size: 3-4 (5 case-by-case) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up and cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea
Weekly Project Updates The current state of your project write-up Even in the early stages of the project you should be able to write a rough draft of the Introduction and Motivation sections The current state of your Final Presentation Your Initial Project proposal presentation (Due Wed 10/20) should make a starting point for your Final presentation What things are working & not working What roadblocks are you running into
Overview Class Project (example from 2008) Common Design Patterns
What you should learn Introduction to common Design Patterns & Compute Models
Outline Design patterns Why are they useful? Examples Compute models
References Reconfigurable Computing (2008) [1] Chapter 5: Compute Models and System Architectures Scott Hauck, Andre DeHon Design Patterns for Reconfigurable Computing [2] Andre DeHon (FCCM 2004) Type Architectures, Shared Memory, and the Corollary of Modest Potential [3] Lawrence Snyder: Annual Review of Computer Science (1986) Design Patterns: Abstraction and Reuse of Object Oriented Design [4] E. Gamma (1992) The Timeless Way of Building [5] C. Alexander (1979)
Design Patterns Design patterns are solutions to recurring problems.
Reconfigurable Hardware Design “Building good reconfigurable designs requires an appreciation of the different costs and opportunities inherent in reconfigurable architectures” [2] “How do we teach programmers and designers to design good reconfigurable applications and systems?” [2] Traditional approach: Read lots of papers for different applications Over time figure out ad-hoc tricks Better approach?: Use design patterns to provide a more systematic way of learning how to design It has been shown in other realms that studying patterns is useful Object-oriented software [4] Building architecture [5]
Common Language Provides a means to organize and structure the solution to a problem Provide a common ground from which to discuss a given design problem Enables the ability to share solutions in a consistent manner (reuse)
Describing a Design Pattern [2] 10 attributes suggested by Gamma (Design Patterns, 1995) Name: Standard name Intent: What problem is being addressed, and how? Motivation: Why use this pattern? Applicability: When can this pattern be used? Participants: What components make up this pattern? Collaborations: How do components interact? Consequences: Trade-offs Implementation: How to implement Known Uses: Real examples of where this pattern has been used Related Patterns: Similar patterns, patterns that can be used in conjunction with this pattern, and when you would choose a similar pattern instead of this pattern.
Example Design Pattern Coarse-grain Time-multiplexing Template Specialization
Coarse-grain Time-Multiplexing [Figure: Configuration 1 computes M1(A) and M2(B) into a Temp buffer; Configuration 2 loads M3, which consumes the Temp data]
Coarse-grain Time-Multiplexing Name: Coarse-grained Time-Multiplexing Intent: Enable a design that is too large to fit on a chip all at once to run as multiple subcomponents Motivation: Method to share limited fixed resources to implement a design that is too large as a whole.
Coarse-grain Time-Multiplexing Applicability (Requirements): Configuration can be done on large time scale No feedback loops in computation Feedback loop only spans the current configuration Feedback loop is very slow Participants: Computational graph Control algorithm Collaborations: Control algorithm manages when sub-graphs are loaded onto the device
Coarse-grain Time-Multiplexing Consequences: Often platforms take millions of cycles to reconfigure Need an app that will run for tens of millions of cycles before needing to reconfigure May need large buffers to store data during a reconfiguration Known Uses: Video processing pipeline [Villasenor] “Video Communications using Rapidly Reconfigurable Hardware”, Transactions on Circuits and Systems for Video Technology 1995 Automatic Target Recognition [Villasenor] “Configurable Computer Solutions for Automatic Target Recognition”, FCCM 1996
Coarse-grain Time-Multiplexing Implementation: Break the design into multiple subgraphs that can be configured onto the platform in sequence Design a controller to orchestrate the configuration sequencing Take steps to minimize configuration time Related patterns: Streaming Data Queues with Back-pressure
Coarse-grain Time-Multiplexing [Figure: Configuration 1 computes M1(A) and M2(B) into a Temp buffer; Configuration 2 loads M3, which consumes the Temp data]
Coarse-grain Time-Multiplexing Assume: 1.) reconfiguration takes 10 thousand clocks 2.) 100 MHz clock 3.) We need to process for 100x the time spent in reconfiguration to get the needed speedup 4.) A and B each produce one byte per clock [Figure: Configuration 1 / Configuration 2 with Temp buffer]
Coarse-grain Time-Multiplexing Assume: 1.) reconfiguration takes 10 thousand clocks 2.) 100 MHz clock 3.) We need to process for 100x the time spent in reconfiguration to get the needed speedup 4.) A and B each produce one byte per clock What constraint does this place on Temp? A 1 MB buffer What if the datapath is changed from 8-bit to 64-bit? An 8 MB buffer; likely need off-chip memory [Figure: Configuration 1 / Configuration 2 with Temp buffer]
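The buffer sizing above follows directly from the assumptions; a minimal sketch of the arithmetic:

```python
reconfig_clocks = 10_000
run_clocks = 100 * reconfig_clocks   # must run 100x the reconfig time:
                                     # 1,000,000 clocks per configuration

def temp_buffer_bytes(bytes_per_clock):
    # Temp must hold everything produced while Configuration 1 runs,
    # since M3 only consumes it after reconfiguration.
    return run_clocks * bytes_per_clock

print(temp_buffer_bytes(1))   # 1_000_000 bytes: ~1 MB for an 8-bit path
print(temp_buffer_bytes(8))   # 8_000_000 bytes: ~8 MB for a 64-bit path
```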
Template Specialization [Figure: four empty LUTs driven by inputs A(1), A(0) producing outputs C(3)..C(0); programming the LUT contents yields a multiply-by-3 circuit (entries 3, 6, 9) or a multiply-by-5 circuit (entries 5, 10, 15)]
Template Specialization Name: Template Specialization Intent: Reduce the size or time needed for a computation. Motivation: Use early-bound data and slowly changing data to reduce circuit size and execution time.
Template Specialization Applicability: When circuit specialization can be adapted quickly Example: Can treat LUTs as small memories that can be written. No interconnect modifications Participants: Template cell: Contains specialization configuration Template filler: Manages what and how a configuration is written to a Template cell Collaborations: Template filler manages Template cell
Template Specialization Consequences: Cannot optimize as much as when a circuit is fully specialized for a given instance Overhead needed to allow the template to implement several specializations Known Uses: Multiply-by-Constant String Matching Implementation: Multiply-by-Constant: Use a LUT as a memory to store the answer Use a controller to update this memory when a different constant should be used.
Template Specialization Related patterns: CONSTRUCTOR EXCEPTION TEMPLATE
Template Specialization [Figure: LUTs programmed as multiply-by-3, with table entries 3, 6, 9] Respecialize: multiply by a constant of 2, supporting inputs 0 - 7 [Figure: the LUTs reprogrammed with entries 2, 4, 6, 8, 10, 12, 14, now using inputs A(2), A(1), A(0)]
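The multiply-by-constant implementation above can be sketched in software: a LUT used as a small memory holds every possible answer, and respecializing the template is just rewriting the table (the function name and 3-bit input width are illustrative assumptions):

```python
# A LUT used as a small memory: rewriting its contents "specializes"
# the same template circuit to a new constant with no rerouting.

def program_mult_lut(constant, input_bits=3):
    # Table holds constant * i for every possible input value.
    return [constant * i for i in range(2 ** input_bits)]

lut = program_mult_lut(3)   # template specialized to multiply-by-3
print(lut[5])               # 15
lut = program_mult_lut(2)   # respecialized to multiply-by-2, inputs 0-7
print(lut[7])               # 14
```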
Catalog of Patterns (Just a start) [2] [2] Identifies 89 patterns Area-Time Tradeoff Basic (Implementation): Coarse-grain Time-Multiplex Parallel (Expression): Dataflow, Data Parallel Parallel (Implementation): SIMD, Communicating FSMs Reducing Area or Time Reuse Hardware (Implementation): Pipelining Specialization (Implementation): Template Communications Layout (Expression/Implementation): Systolic Memory Numbers and Functions
Next Lecture Continue Compute Models
CPRE 583 Reconfigurable Computing Lecture 12: Wed 10/6/2010 (Compute Models) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ee.iastate.edu/cpre583/
Projects: Target Timeline Teams Formed and Idea: Mon 10/11 Project idea in PowerPoint 3-5 slides Motivation (why is this interesting, useful) What will be the end result High-level picture of final product Project team list: Name, Responsibility High-level Plan/Proposal: Fri 10/22 PowerPoint 5-10 slides System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Related research papers (if any)
Projects: Target Timeline Work on projects: 10/22 - 12/8 Weekly update reports More information on updates will be given Presentations: Last Wed/Fri of class Present / Demo what is done at this point 15-20 minutes (depends on number of projects) Final write up and Software/Hardware turned in: Day of final (TBD)
Project Grading Breakdown 50% Final Project Demo 30% Final Project Report 30% of your project report grade will come from your 5-6 project updates (due Fridays by midnight) 20% Final Project Presentation
Common Questions
Overview Compute Models
What you should learn Introduction to Compute Models
Outline Design patterns (previous lecture) Why are they useful? Examples Compute models (Abstraction) System Architectures (Implementation)
References Reconfigurable Computing (2008) [1] Chapter 5: Compute Models and System Architectures Scott Hauck, Andre DeHon Design Patterns for Reconfigurable Computing [2] Andre DeHon (FCCM 2004) Type Architectures, Shared Memory, and the Corollary of Modest Potential [3] Lawrence Snyder: Annual Review of Computer Science (1986)
Building Applications Problem -> Compute Model + Architecture -> Application Questions to answer How to think about composing the application? How will the compute model lead to a naturally efficient architecture? How does the compute model support composition? How to conceptualize parallelism? How to tradeoff area and time? How to reason about correctness? How to adapt to technology trends (e.g. larger/faster chips)? How does compute model provide determinacy? How to avoid deadlocks? What can be computed? How to optimize a design, or validate application properties?
Compute Models Compute Models [1]: High-level models of the flow of computation. Useful for: Capturing parallelism Reasoning about correctness Decomposition Guide designs by providing constraints on what is allowed during a computation Communication links How synchronization is performed How data is transferred
Two High-level Families Data Flow: Single-rate Synchronous Data Flow Synchronous Data Flow Dynamic Streaming Dataflow Dynamic Streaming Dataflow with Peeks Streaming Data Flow with Allocation Sequential Control: Finite Automata (i.e. Finite State Machine) Sequential Controller with Allocation Data Centric Data Parallel
Data Flow Graph of operators that data (tokens) flows through Composition of functions X X +
Data Flow Graph of operators that data (tokens) flows through Composition of functions Captures: Parallelism Dependences Communication X X +
Single-rate Synchronous Data Flow One token rate for the entire graph For example, every operator consumes one token on each input link before producing an output token Same power as a Finite State Machine [Figure: update/copy graph with a rate of 1 on every link]
Synchronous Data Flow Each link can have a different constant token input and output rate Same power as the single-rate version, but for some applications easier to describe Automated ways to detect/determine: Deadlock Buffer sizes [Figure: update/copy graph where some links produce or consume 10 tokens per firing]
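The automated analysis mentioned above starts from the SDF balance equations: for each link, the producer's firings times its rate must equal the consumer's firings times its rate. A minimal sketch for a single link (assuming the standard repetition-vector formulation, not any particular tool):

```python
from math import gcd

def repetitions(prod_rate, cons_rate):
    # Balance equation for one SDF link: q_p * prod_rate == q_c * cons_rate.
    # The smallest integer solution fixes how often each actor fires per
    # iteration; it is the starting point for deadlock and buffer analysis.
    l = prod_rate * cons_rate // gcd(prod_rate, cons_rate)
    return l // prod_rate, l // cons_rate

print(repetitions(10, 1))   # (1, 10): producer fires once, consumer 10 times
print(repetitions(1, 1))    # (1, 1): the single-rate case
print(repetitions(4, 6))    # (3, 2): 3*4 tokens produced == 2*6 consumed
```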
Dynamic Streaming Data Flow Token rates dependent on data Just need to add two structures: Switch, Select [Figure: a Switch routes its input to out0 or out1, and a Select picks in0 or in1, both steered by a control token S]
Dynamic Streaming Data Flow Token rates dependent on data Just need to add two structures: Switch, Select More powerful Difficult to detect deadlocks Still deterministic [Figure: a Switch steering tokens x, y through F0 or F1 and a Select recombining them]
Dynamic Streaming Data Flow with Peeks Allow an operator to fire before all inputs have arrived An example where this is useful is the merge operation Now execution can be nondeterministic: the answer depends on input arrival times [Figure: animation of tokens A and B arriving at a Merge operator in different orders]
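The nondeterminism of merge can be illustrated with a toy simulation (assumed model: each token carries an arrival time, and the merge fires on whichever input arrives first):

```python
def merge(timed_tokens):
    # timed_tokens: (arrival_time, value) pairs on the merge's inputs.
    # The merge "peeks" and fires on whichever token arrives first,
    # so its output order is simply arrival order.
    return [value for _, value in sorted(timed_tokens)]

run1 = merge([(1, "A"), (2, "B")])   # A arrives first
run2 = merge([(2, "A"), (1, "B")])   # same tokens, B arrives first
print(run1)   # ['A', 'B']
print(run2)   # ['B', 'A']  -- same inputs, different answer
```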
Streaming Data Flow with Allocation Removes the need for static links and operators; that is, the Data Flow graph can change over time More power: Turing Complete More difficult to analyze Could be useful for some applications Telecom applications: for example, if a channel carries voice versus data, the resources needed may vary greatly Can take advantage of platforms that allow runtime reconfiguration
Sequential Control Sequence of subroutines Programming languages (C, Java) Hardware control logic (Finite State Machines) Transform global data state
Finite Automata (i.e. Finite State Machine) Can verify state reachability in polynomial time [Figure: three-state FSM S1, S2, S3]
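The polynomial-time claim is easy to see: reachability is just a graph search that visits each state and transition once. A small sketch (the three-state machine is the one pictured; the transition structure is an illustrative assumption):

```python
from collections import deque

def reachable(transitions, start):
    # Breadth-first search touches each state and edge at most once,
    # so reachability is linear in the size of the FSM.
    seen, work = {start}, deque([start])
    while work:
        state = work.popleft()
        for nxt in transitions.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return seen

fsm = {"S1": ["S2"], "S2": ["S3", "S1"], "S3": []}
print(reachable(fsm, "S1"))   # all three states are reachable from S1
```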
Sequential Controller with Allocation Adds ability to allocate memory. Equivalent to adding new states Model becomes Turing Complete S1 S2 S3
Sequential Controller with Allocation Adds ability to allocate memory. Equivalent to adding new states Model becomes Turing Complete S1 S2 S4 S3 SN
Data Parallel Multiple instances of an operation type acting on separate pieces of data. For example: Single Instruction Multiple Data (SIMD) Identical match test on all items in a database Inverting the color of all pixels in an image
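The pixel-inversion example is data parallel in the simplest sense: the same operation applied independently to every item, which is exactly what maps onto SIMD lanes or replicated FPGA processing elements. A one-liner sketch (8-bit grayscale assumed):

```python
def invert(pixels):
    # Identical, independent operation on every data item: each of
    # these subtractions could execute in its own SIMD lane or PE.
    return [255 - p for p in pixels]

print(invert([0, 128, 255]))   # [255, 127, 0]
```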
Data Centric Similar to Data Flow, but the state contained in the objects of the graph is the focus, not the tokens flowing through the graph Network flow example [Figure: Source1, Source2, Source3 feeding a Switch to Dest1 and Dest2; concerns include flow rate and buffer overflow]
Multi-threaded Multi-threaded: a compute model made up of multiple sequential controllers that have communication channels between them Very general, but often too much power and flexibility. No guidance for: Ensuring determinism Dividing an application into threads Avoiding deadlock Synchronizing threads The models discussed can be defined in terms of a Multi-threaded compute model
Multi-threaded (Illustration)
Streaming Data Flow as Multi-threaded Thread: an operator that performs transforms on data as it flows through the graph Thread synchronization: tokens sent between operators
Data Parallel as Multi-threaded Thread: a data item Thread synchronization: data updated with each sequential instruction
Caution with Multithreaded Model Use when a stricter compute model does not give enough expressiveness. Define restrictions to limit the amount of expressive power that can be used Define synchronization policy How to reason about deadlocking
Other Models “A Framework for Comparing Models of computation” [1998] E. Lee, A. Sangiovanni-Vincentelli Transactions on Computer-Aided Design of Integrated Circuits and Systems “Concurrent Models of Computation for Embedded Software”[2005] E. Lee, S. Neuendorffer IEEE Proceedings – Computers and Digital Techniques
Next Lecture System Architectures
User Defined Instruction (MP3) [Figure: FPGA containing a PowerPC plus user-defined instruction hardware; a PC running Display.c connects over Ethernet (UDP/IP); output drives a VGA monitor]
MP3 Notes MUCH less VHDL coding than MP2, but you will be writing most of the VHDL from scratch The focus will be more on learning to read a specification (the PowerPC coprocessor interface protocol) and designing hardware that follows that protocol You will be dealing with some pointer-intensive C code. It's a small amount of C code, but somewhat challenging to get the pointer math right.
CPRE 583 Reconfigurable Computing Lecture 13: Fri 10/8/2010 (System Architectures) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ee.iastate.edu/cpre583/
Common Questions
Overview Common System Architectures Plus/Delta mid-semester feedback
What you should learn Introduction to common System Architectures
Outline Design patterns (previous lecture) Why are they useful? Examples Compute models (Abstraction) System Architectures (Implementation)
References Reconfigurable Computing (2008) [1] Chapter 5: Compute Models and System Architectures Scott Hauck, Andre DeHon
System Architectures Compute Models: Help express the parallelism of an application System Architecture: How to organize application implementation
Efficient Application Implementation Compute model and system architecture should work together Both are a function of The nature of the application Required resources Required performance The nature of the target platform Resources available
Efficient Application Implementation (Image Processing) Platform 1 (Vector Processor) Platform 2 (FPGA)
Efficient Application Implementation (Image Processing) Compute Model System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)
Efficient Application Implementation (Image Processing) Data Flow Compute Model Streaming Data Flow System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)
Efficient Application Implementation (Image Processing) Data Parallel Compute Model Vector System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)
Efficient Application Implementation (Image Processing) Data Flow Compute Model Streaming Data Flow System Architecture [Figure: dataflow graph of two multipliers feeding an adder] Platform 1 (Vector Processor) Platform 2 (FPGA)
Implementing Streaming Dataflow Data presence variable length connections between operators data rates vary between operator implementations data rates varying between operators Datapath sharing not enough spatial resources to host entire graph balanced use of resources (e.g. operators) cyclic dependencies impacting efficiency Interconnect sharing Interconnects are becoming difficult to route Links between operators infrequently used High variability in operator data rates Streaming coprocessor Extreme resource constraints
Data Presence X X +
Data Presence X X data_ready data_ready + data_ready
Data Presence X X FIFO FIFO data_ready data_ready + FIFO data_ready
Data Presence X X stall stall FIFO FIFO data_ready data_ready + FIFO
Data Presence Flow control: a term typically used in networking [Figure: two multipliers and an adder connected through FIFOs, with data_ready and stall signals on each link]
Data Presence Flow control: a term typically used in networking Increases flexibility of how an application can be implemented [Figure: the full graph with FIFOs, data_ready, and stall signals on every link]
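The FIFO-with-flow-control idea above can be sketched as a small software model (the class and signal names mirror the slide's data_ready/stall labels, but the implementation is an illustrative assumption):

```python
class Fifo:
    # A bounded queue between two operators. `stall` is the back-pressure
    # signal the upstream operator must respect; `data_ready` tells the
    # downstream operator it may fire.
    def __init__(self, depth):
        self.depth = depth
        self.items = []

    @property
    def stall(self):          # full: producer must wait
        return len(self.items) >= self.depth

    @property
    def data_ready(self):     # non-empty: consumer may fire
        return len(self.items) > 0

    def push(self, x):
        assert not self.stall, "producer ignored back-pressure"
        self.items.append(x)

    def pop(self):
        assert self.data_ready, "consumer fired with no token"
        return self.items.pop(0)

f = Fifo(depth=2)
f.push(1)
f.push(2)
print(f.stall)        # True: upstream operator must stall
print(f.pop())        # 1: tokens come out in order
print(f.stall)        # False: room again, upstream may resume
```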
Datapath Sharing X X +
Datapath Sharing Platform may only have one multiplier X X +
Datapath Sharing Platform may only have one multiplier X +
Datapath Sharing Platform may only have one multiplier REG X REG +
Datapath Sharing Platform may only have one multiplier REG X FSM REG +
Datapath Sharing Platform may only have one multiplier REG X FSM REG + Important to keep track of where data is coming from!
Interconnect Sharing
Need more efficient use of interconnect
An FSM time-multiplexes shared links between operators
Streaming coprocessor See SCORE chapter 9 of text for an example.
Sequential Control
Typically thought of in the context of sequential programming on a processor (e.g. C, Java)
Key to organizing, synchronizing, and controlling highly parallel operations
Time-multiplexing resources: when a task is too large for the computing fabric
Increasing datapath utilization
Sequential Control
Example: compute A*x^2 + B*x + C with inputs A, B, C, x flowing through shared multiply (X) and add (+) operators
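The A*x^2 + B*x + C example can be sequenced on one multiplier and one adder. A sketch (Python, not HDL; the Horner-style schedule is one possible ordering, not necessarily the slide's):

```python
def quadratic_fsmd(A, B, C, x):
    """Evaluate A*x**2 + B*x + C on one multiplier and one adder,
    sequenced Horner-style: ((A*x + B) * x) + C.
    Each schedule entry is one control step using one operator."""
    schedule = [
        ("MUL", lambda r: A * x),   # r = A*x
        ("ADD", lambda r: r + B),   # r = A*x + B
        ("MUL", lambda r: r * x),   # r = (A*x + B)*x
        ("ADD", lambda r: r + C),   # r = A*x^2 + B*x + C
    ]
    r = None
    for op, step in schedule:
        r = step(r)
    return r

print(quadratic_fsmd(2, 3, 4, 5))  # 2*25 + 3*5 + 4 = 69
```

Four control steps, two operators: sequential control trades time for area.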
Finite State Machine with Datapath (FSMD)
An FSM controls the shared datapath that computes A*x^2 + B*x + C
Sequential Control: Types Finite State Machine with Datapath (FSMD) Very Long Instruction Word (VLIW) data path control Processor Instruction augmentation Phased reconfiguration manager Worker farm
Very Long Instruction Word (VLIW) Datapath Control See 5.2 of text for this architecture
Processor
Instruction Augmentation
Phased Configuration Manager Will see more detail with SCORE architecture from chapter 9 of text.
Worker Farm Chapter 5.2 of text
Bulk Synchronous Parallelism See chapter 5.2 for more detail
Data Parallel Single Program Multiple Data Single Instruction Multiple Data (SIMD) Vector Vector Coprocessor
Data Parallel
Cellular Automata
Multi-threaded
Next Lecture
Questions/Comments/Concerns
Write down:
  Main point of lecture
  One thing that's still not quite clear
  OR, if everything is clear, give an example of how to apply something from lecture
Lecture Notes
Add CSP/Multithread as root of a simple tree
15+5 (late start) minutes of time left
Think of one to two in-class exercises (10 min)
  Dataflow graph optimization algorithm?
  Deadlock detection on a small model?
Give some examples of where a given compute model would map to a given application
  Systolic array (implement), or dataflow compute model
  String matching (FSM) (MISD)
New image for MP3, too dark of a color
CPRE 583 Reconfigurable Computing Lecture 14: Fri 10/13/2010 (Streaming Applications) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ee.iastate.edu/cpre583/
Projects Ideas: Relevant conferences FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS
Initial Project Proposal Slides (5-10 slides) Project team list: Name, Responsibility (who is project leader) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up and cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea
Common Questions
Overview Streaming Applications (Chapters 8 & 9): Simulink, SCORE
What you should learn Two approaches for implementing streaming applications
Data Flow: Quick Review Graph of operators that data (tokens) flows through Composition of functions
Data Flow Graph of operators that data (tokens) flows through Composition of functions Captures: Parallelism Dependences Communication X X +
Streaming Application Examples
Some image processing algorithms: edge detection, image recognition, image compression (JPEG)
Network data processing: string matching (your MP2 assignment)
Sorting??
Sorting
The initial list of items passes through Split stages, the sub-lists are sorted in parallel Sort stages, and a tree of Merge stages combines the sorted sub-lists
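The split/sort/merge pipeline above can be sketched in a few lines of Python (the round-robin split, four lanes, and sample data are illustrative):

```python
import heapq

def split(items, n):
    """Deal items round-robin into n sub-streams."""
    lanes = [[] for _ in range(n)]
    for i, x in enumerate(items):
        lanes[i % n].append(x)
    return lanes

def merge(a, b):
    """Streaming two-way merge: emit the smaller head each step."""
    return list(heapq.merge(a, b))

data = [7, 3, 9, 1, 8, 2, 6, 5]
lanes = [sorted(lane) for lane in split(data, 4)]  # 4 parallel sorters
merged = merge(merge(lanes[0], lanes[1]), merge(lanes[2], lanes[3]))
print(merged)
```

Each Sort stage works on a short list, and the Merge tree only ever compares stream heads — both map naturally onto streaming hardware.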
Example Tools for Streaming Application Design Simulink from MatLab: graphical based SCORE (Stream Computations Organized for Reconfigurable Execution): a programming model
Simulink (MatLab) What is it? A MatLab module that allows building and simulating systems through a GUI
Simulink: Example Model
Simulink: Sub-Module
Simulink: Example Plot
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel
Basic Sobel Algorithm for Edge Detection
Sobel X gradient (responds to vertical edges):
  -1  0  1
  -2  0  2
  -1  0  1
Sobel Y gradient (responds to horizontal edges):
  -1 -2 -1
   0  0  0
   1  2  1
[Animation: the 3x3 kernels slide across a sample image of 50-valued pixels, producing gradient values of +/-50, +/-100, +/-150, and +/-200 at the edges]
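The kernel application the slides step through can be sketched directly (Python; the 4x4 test image and the evaluation point are illustrative):

```python
# Standard 3x3 Sobel kernels (the x gradient responds to vertical edges).
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def convolve3x3(img, k, r, c):
    """Apply kernel k centered at pixel (r, c); the caller avoids borders."""
    return sum(k[i][j] * img[r - 1 + i][c - 1 + j]
               for i in range(3) for j in range(3))

# A vertical edge: left half 0, right half 50.
img = [[0, 0, 50, 50] for _ in range(4)]
gx = convolve3x3(img, GX, 1, 1)  # strong response across the edge
gy = convolve3x3(img, GY, 1, 1)  # no response along the edge
print(gx, gy)
```

In hardware each output pixel needs nine multiply-accumulates over a 3x3 window, which is why the student design streams the image through a shift-register line buffer.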
Top Level
Shifter
Multiplier
Input Image
Output Image
SCORE (Stream Computations Organized for Reconfigurable Execution)
Overview of the SCORE programming approach
Developed by University of California Berkeley and California Institute of Technology
FPL 2000 overview presentation: http://brass.cs.berkeley.edu/documents/score_tutorial.html
Next Lecture Data Parallel
CPRE 583 Reconfigurable Computing Lecture 15: Fri 10/15/2010 (Reconfiguration Management) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ee.iastate.edu/cpre583/
Announcements/Reminders Midterm: Take home portion (40%) given Friday 10/22, due Tue 10/26 (midnight) In class portion (60%) Wed 10/27 Distance students will have in class portion given via a timed WebCT (2 hour) session (take on Wed, Thur or Friday). Start thinking of class projects and forming teams Submit teams and project ideas: Mon 10/11 midnight Project proposal presentations: Fri 10/22 MP3: PowerPC Coprocessor offload (today/tomorrow) Problem 2 of HW 2 (released after MP3 gets released)
Overview Chapter 4: Reconfiguration Management
What you should learn Some basic configuration architectures Key issues when managing the reconfiguration of a system
Reconfiguration Management
Goal: minimize the overhead associated with run-time reconfiguration
Why it is important to address:
  Can take 100's of milliseconds to reconfigure a device
  For high-performance applications this can be a large overhead (i.e. decreases performance)
High-Level Configuration Setups
Externally triggered reconfiguration
[Diagram: the CPU issues a Configuration Request; an FSM-based Config Control (CC) streams Config Data from ROM (bitfile) into the FPGA]
High-Level Configuration Setups
Self-triggered reconfiguration
[Diagram: the FPGA's own FSM-based Config Control (CC) fetches Config Data from ROM (bitfile)]
Configuration Architectures Single-context Multi-context Partially Reconfigurable Relocation & Defragmentation Pipeline Reconfiguration Block Reconfigurable
Single-context FPGA
[Diagram: configuration is shifted in serially over Config clk / Config I/F / Config Data, with a global Config enable gating the configuration DFF chain feeding the logic]
Multi-context FPGA
[Diagram: each cell stores three configuration contexts; a Context switch signal selects which context (1, 2, or 3) drives the logic, while Config Data can load an inactive context in the background]
Partially Reconfigurable
Reduces the amount of configuration data sent to the device, thus decreasing reconfiguration overhead
Needs addressable configuration memory, as opposed to single-context daisy-chain shifting
Example: encryption — change the key, and the logic dependent on the key
PR devices: AT40K, Xilinx Virtex series (and Spartan, but not at run time)
Need to make sure partial configs do not overlap in space/time (typically a config must be placed in a specific location; devices are not as homogeneous as you would think in terms of resources and timing delays)
Partially Reconfigurable
Full reconfig: 10-100's of ms
Partial reconfig: 100's of us - 1's of ms
Typically a partial configuration module maps to a specific physical location
Relocation and Defragmentation
Make configuration architectures support relocatable modules
The text gives a good example of defragmentation (defrag or swap out: 90% decrease in reconfig time compared to full single context)
Placement policies: best fit, first fit, ...
Limiting factors
  Routing/logic is heterogeneous: timing issues, need modified routes
  Special resources needed (e.g. hard multipliers, BRAMs); an easier issue if there are blocks of homogeneity
  Connection to external I/O (fixed IP cores, board restrictions); virtualized I/O (fixed pins with multiple internal I/Fs?)
  2D architectures are more difficult to deal with
Summary of features a PR architecture should have
  Homogeneous logic and routing layout
  Bus-based communication (e.g. network on chip)
  1D organization for relocation
Relocation and Defragmentation
[Animation: modules A, B, C are relocated to pack together, allowing more efficient use of the configuration space]
Pipeline Reconfigurable
Example: PipeRench
Simplifies reconfiguration
Limits what can be implemented
[Table: 4 virtual pipeline stages mapped onto physical PEs over cycles 1-6; one physical stage is reconfigured per cycle as the virtual pipeline wraps around the fabric]
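The wrap-around mapping can be sketched as follows, under the simplifying assumptions that virtual stage v is (re)configured into physical stripe v mod P at cycle v, one stripe per cycle, and the virtual pipeline loops for the next data set (a sketch of the PipeRench idea, not its exact schedule):

```python
def stripe_contents(t, physical, virtual):
    """Which virtual stage each physical stripe holds at cycle t,
    assuming stripe v % physical is rewritten at cycles v, v+virtual, ...
    (one stripe reconfigured per cycle)."""
    contents = [None] * physical
    for v in range(t + 1):
        contents[v % physical] = v % virtual
    return contents

# 4 virtual stages on 3 physical stripes:
for t in range(6):
    print(t, stripe_contents(t, physical=3, virtual=4))
```

With fewer physical stripes than virtual stages, some stripe is always being rewritten — the fabric trades a little throughput for the ability to host an arbitrarily deep virtual pipeline.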
Block Reconfigurable
Swappable Logic Units: an abstraction layer over a general PR architecture (e.g. SCORE)
Managing the Reconfiguration Process Choosing a configuration When to load Where to load Reduce how often one needs to reconfigure, hiding latency
Configuration Grouping
What to pack: pack multiple configs that are related in time into one
Simulated annealing, or clustering based on the application's control flow
Configuration Caching
When to load: LRU or credit-based policies, dealing with variable-sized configs
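An LRU configuration cache that handles variable-sized configs can be sketched as follows (Python; the config names, sizes, and capacity units are illustrative):

```python
from collections import OrderedDict

class ConfigCache:
    """LRU cache for configurations of varying size (in config-memory
    units). Evicts least-recently-used configs until the new one fits."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.configs = OrderedDict()  # name -> size, least recent first

    def load(self, name, size):
        """Return True on a hit (no reconfiguration needed)."""
        if name in self.configs:
            self.configs.move_to_end(name)  # refresh LRU position
            return True
        while self.used + size > self.capacity:
            _, evicted_size = self.configs.popitem(last=False)
            self.used -= evicted_size
        self.configs[name] = size
        self.used += size
        return False

cache = ConfigCache(capacity=10)
hits = [cache.load(n, s) for n, s in
        [("fft", 6), ("fir", 4), ("fft", 6), ("aes", 5), ("fir", 4)]]
print(hits)
```

Note how the variable sizes matter: loading "aes" evicts two smaller configs, so the very next request misses — exactly the kind of effect a credit-based policy tries to account for.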
Configuration Scheduling
Prefetching
  Control flow graph
  Static: compiler-inserted config instructions
  Dynamic: probabilistic approaches, MM (branch prediction)
Constraints: resource, real-time
Mitigation: system status and prediction
  What are the current requests
  Predict which config combination will give the best speedup
Software-based Relocation and Defragmentation
Placing the R/D decision on the host CPU rather than the on-chip config controller
Context Switching Save state, then restart where it left off.
Next Lecture Data Parallel
CPRE 583 Reconfigurable Computing Lecture 16: Fri 10/20/2010 (Data Parallel Architectures) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ee.iastate.edu/cpre583/
Announcements/Reminders Midterm: Take home portion (40%) given Friday 10/29, due Tue 11/2 (midnight) In class portion (60%) Wed 11/3 Distance students will have in class portion given via a timed WebCT (2 hour) session (take on Wed, Thur or Friday). Start thinking of class projects and forming teams Submit teams and project ideas: Mon 10/11 midnight Project proposal presentations: Fri 10/22 MP3: PowerPC Coprocessor offload (today): Problem 2 of HW 2 (released after MP3 gets released)
Projects Ideas: Relevant conferences FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS
Overview
Data Parallel Architectures: chapter 5.2.4 and chapter 10
MP3 Demo/Overview
What you should learn
Data parallel architecture basics
The flexibility reconfigurable hardware adds
Data Parallel Architectures
Next Lecture Project initial presentations.
CPRE 583 Reconfigurable Computing Lecture 17: Fri 10/22/2010 (Initial Project Presentations) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ee.iastate.edu/cpre583/
Overview Present Project Ideas
Projects
Next Lecture Fixed Point Math and Floating Point Math
CPRE 583 Reconfigurable Computing Lecture 18: Fri 10/27/2010 (Floating Point) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ee.iastate.edu/cpre583/
Announcements/Reminders Midterm: Take home portion (40%) given Friday 10/29 (released today by 5pm), due Tue 11/2 (midnight) In class portion (60%) Wed 11/3 Distance students will have in class portion given via a timed WebCT (2 hour) session (take on Wed, Thur or Friday). Problem 2 of HW 2 (released soon)
Overview
Floating Point on FPGAs (Chapters 21.4 and 31)
Why is it viewed as difficult?
Options for mitigating the issues
Floating Point Format (IEEE-754)
Single precision: S (1 bit) | exp (8 bits) | Mantissa (23 bits)
Mantissa = b_-1 b_-2 b_-3 ... b_-23 ; Mantissa value = sum_{i=1..23} b_-i * 2^-i
Floating point value = (-1)^S * 2^(exp-127) * (1.Mantissa)
Example: 0 x"80" 110 x"00000"
  = (-1)^0 * 2^(128-127) * 1.(1/2 + 1/4) = (-1)^0 * 2^1 * 1.75 = 3.5
Double precision: S (1 bit) | exp (11 bits) | Mantissa (52 bits)
Floating point value = (-1)^S * 2^(exp-1023) * (1.Mantissa)
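The decode formula above is easy to check in code. A minimal sketch (normal numbers only; no zero/denormal/Inf/NaN handling):

```python
def decode_single(bits):
    """Decode a 32-bit IEEE-754 single-precision pattern, assuming a
    normal number: value = (-1)^S * 2^(exp-127) * (1 + mantissa/2^23)."""
    s = (bits >> 31) & 1
    exp = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    frac = 1 + mantissa / 2**23          # implicit leading 1
    return (-1) ** s * 2 ** (exp - 127) * frac

# The slide's example: S=0, exp=x"80", mantissa=110 followed by zeros.
bits = (0x80 << 23) | (0b110 << 20)
print(decode_single(bits))  # 2^1 * 1.75 = 3.5
```
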
Fixed Point
Format: whole part b_{W-1} ... b_1 b_0 , fractional part b_-1 b_-2 ... b_-F
Example formats (W.F): 5.5, 10.12, 3.7
Example fixed point, 5.5 format: 01010.01100 = 10 + 1/4 + 1/8 = 10.375
Compare floating point and fixed point:
  Floating point: 0 x"80" "110" x"00000" = 3.5
  10-bit (format 3.7) fixed point for 3.5 = 011.1000000
Fixed Point (Addition) Whole Fractional Operand 1 Whole Fractional Operand 2 + Whole Fractional sum
Fixed Point (Addition), 11-bit 4.7 format
  0011.1110000  Operand 1 = 3.875
+ 0001.1010000  Operand 2 = 1.625
= 0101.1000000  Sum = 5.5
You can use a standard ripple-carry adder!
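The reason a plain adder works: a W.F fixed-point value is just an integer scaled by 2^F. A quick sketch with the slide's 4.7 numbers:

```python
F = 7  # 4.7 format: 4 whole bits, 7 fraction bits

def to_fixed(x):
    return round(x * 2**F)   # store the value as an integer scaled by 2^F

def from_fixed(n):
    return n / 2**F

a, b = to_fixed(3.875), to_fixed(1.625)   # 496 and 208
s = a + b                                 # a plain integer (ripple-carry) add
print(from_fixed(s))                      # 5.5
```

No alignment, no normalization — which is why the fixed-point adder costs roughly one LUT and DFF per bit.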
Floating Point (Addition)
  0 x"80" 111 x"80000"  Operand 1 = 3.875
+ 0 x"7F" 101 x"00000"  Operand 2 = 1.625
Step 1: Common exponent (i.e. align binary points). Make x"7F" -> x"80" (or vice versa?) — making x"7F" -> x"80" loses least significant bits of Operand 2:
  Add the difference x"80" - x"7F" = 1 to x"7F"
  Shift the mantissa of Operand 2 right by the difference; remember the "implicit" 1 of the original mantissa
  Operand 2 becomes 0 x"80" 110 x"80000"
Step 2: Add mantissas. Overflow! You can't just overflow the mantissa into the exponent field. You are actually overflowing the implicit "1" of Operand 1, so you sort of have an implicit "2" (i.e. "10").
Step 3: Deal with the mantissa overflow by normalizing:
  Shift the mantissa right by 1 (shift a "0" in because of the implicit "2")
  Increment the exponent by 1
Result: 0 x"81" 011 x"00000" = 5.5
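The align/add/normalize steps can be sketched on the bit fields directly (a minimal sketch: positive operands only, truncating alignment, no rounding or special values):

```python
def fp_add(e1, m1, e2, m2):
    """Add two positive single-precision values given as (biased exponent,
    23-bit mantissa field): align, add with the implicit 1s restored,
    then normalize if the sum overflows past 2.0."""
    if e1 < e2:                            # ensure operand 1 is the larger
        e1, m1, e2, m2 = e2, m2, e1, m1
    f1 = (1 << 23) | m1                    # restore implicit leading 1
    f2 = ((1 << 23) | m2) >> (e1 - e2)     # align: shift smaller right
    f = f1 + f2
    if f >> 24:                            # mantissa overflowed past 2.0
        f >>= 1                            # normalize: shift right,
        e1 += 1                            #   bump the exponent
    return e1, f & 0x7FFFFF

# 3.875 (exp 0x80, mant 0x780000) + 1.625 (exp 0x7F, mant 0x500000)
e, m = fp_add(0x80, 0x780000, 0x7F, 0x500000)
print(hex(e), hex(m))  # exp 0x81, mantissa 0x300000 -> 5.5
```

Compare this with the fixed-point add: the shifter, the overflow check, and the exponent update are all extra hardware.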
Floating Point (Addition): Other concerns — special values (single precision: S | exp (8 bits) | Mantissa (23 bits))
  Special Value | Sign | Exponent | Mantissa
  Zero          | 0/1  | 0        | 0
  +Infinity     | 0    | MAX      | 0
  -Infinity     | 1    | MAX      | 0
  NaN           | 0/1  | MAX      | Non-zero
  Denormal      | 0/1  | 0        | Non-zero
Floating Point (Addition): High-Level Hardware
[Block diagram: exponent difference and greater-than compare -> mux/SWAP operands -> right-shift the smaller mantissa by the difference -> add/sub -> priority encoder finds the leading 1 -> left shift to normalize (denormal?) and adjust exponent via sub/const -> round -> E, M]
Floating Point Both Xilinx and Altera supply floating point soft-cores (which I believe are IEEE-754 compliant), so don't get too afraid if you need floating point in your class projects. There should also be floating point open cores that are freely available.
Fixed Point vs. Floating Point
Floating point advantages:
  Application designer does not have to think "much" about the math
  Floating point format supports a wide range of numbers (+/- 3x10^38 to +/- 1x10^-38, single precision)
  If IEEE-754 compliant, then it is easier to accelerate existing floating-point-based applications
Floating point disadvantages:
  Ease of use at great hardware expense
    32-bit fixed point add: ~32 DFFs + 32 LUTs
    32-bit single precision floating point add: ~250 DFFs + 250 LUTs — about 10x more resources, thus 1/10 the possible best-case parallelism
  Floating point typically needs a massive pipeline to achieve high clock rates (i.e. high throughput)
  No hard resources such as carry-chains to take advantage of
Fixed Point vs. Floating Point Range example: floating point vs. fixed point advantages — some exceptions with respect to precision
Mitigating Floating Point Disadvantages
Only support a subset of the IEEE-754 standard; could use software to off-load special cases
Modify the floating point format to support a smaller data type (e.g. 18-bit instead of 32-bit)
  Link to Cornell class: http://instruct1.cit.cornell.edu/courses/ece576/FloatingPoint/index.html
Add hardware support in the FPGA for floating point
  Hardcore multipliers: added by companies in the early 2000's
  Altera: hard shared paths for floating point (Stratix-V, 2011)
How to get 1-TFLOP throughput on FPGAs article:
http://www.eetimes.com/design/programmable-logic/4207687/How-to-achieve-1-trillion-floating-point-operations-per-second-in-an-FPGA
Mitigating Fixed Point Disadvantages (21.4) Block Floating Point (mitigating range issue)
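Block floating point keeps integer (fixed-point-style) arithmetic but shares one exponent across a block of values, recovering much of floating point's range. A minimal sketch (the mant_bits width, rounding, and sample values are illustrative):

```python
import math

def block_float(values, mant_bits=8):
    """Encode a block of values with one shared exponent: each value
    becomes an integer mantissa, and the whole block is scaled by
    2**exp. The exponent is chosen so the largest value still fits."""
    largest = max(abs(v) for v in values)
    exp = math.ceil(math.log2(largest)) - (mant_bits - 1)
    mants = [round(v / 2**exp) for v in values]
    return exp, mants

def decode(exp, mants):
    return [m * 2**exp for m in mants]

exp, mants = block_float([1000.0, 250.0, -62.5])
print(exp, mants, decode(exp, mants))
```

The cost shows up as quantization error on the small values in the block — the range issue is mitigated, not eliminated, which is the "some exceptions with respect to precision" caveat.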
CPU/FPGA/GPU reported FLOPs
Next Lecture Mid-term Then on Friday: Evolvable Hardware
Lecture Notes
Altera app notes on computing FLOPs for Stratix-III
Altera older app notes on floating point add/mult
CPRE 583 Reconfigurable Computing Lecture 19: Fri 11/5/2010 (Evolvable Hardware) Instructor: Dr. Phillip Jones (phjones@iastate.edu) Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA http://class.ee.iastate.edu/cpre583/
Announcements/Reminders
MP3: Extended due date until Monday midnight
  Those that finish by Friday (11/5) midnight: bonus of +1% per day before the new deadline
  If after Friday midnight: no bonus, but no penalty
  10% deduction after Monday midnight, and an additional -10% each day late
Problem 2 of HW 2 (will now be called HW3): released by Sunday midnight, due Monday 11/22 midnight
Turn in weekly project report (tonight, midnight)
What you should learn
Understand Evolvable Hardware basics
Benefits and drawbacks
Key types/categories
Evolvable Hardware
One of the first papers to compare reconfigurable HW with biological organisms (1993): "Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine", Higuchi
Biological organism => DNA (GATACAAAGATACACCAGATA); a mutation changes a base: GATACA -> GATAGA
Reconfigurable hardware => configuration bitstream (000100111011011011011010101); a mutation flips bits: 0001 1100 -> 0001 0000, changing the configured logic (e.g. the LUT contents feeding a DFF)
Classifying Adaption/Evolution: Phylogeny, Ontogeny, Epigenesis (POE)
Phylogeny: evolution through recombination and mutations; biological reproduction : Genetic Algorithms
Ontogeny: self-replication; multicellular organism's cell division : Cellular Automata
Epigenesis: adaptation triggered by the external environment; immune system development : Artificial Neural Networks
Artificial Evolution
A 30-40 year old concept, but applying it to reconfigurable hardware is newish (1990's)
Evolutionary Algorithms (EAs): Genetic Algorithms, Genetic Programming, Evolution Strategies, Evolutionary Programming
Genetic Algorithms. Genome: a finite string of symbols encoding an individual. Phenotype: the decoding of the genome to realize the individual. Constant-size population. Generic steps: initialize population, decode, evaluate (must define a fitness function), selection, mutation, crossover.
Genetic Algorithms (worked example). Loop: initialize population -> decode -> evaluate -> selection -> crossover -> mutation -> next generation. Example initial population: 0000 1000, 1010 1111, 1100 0000, 1111 0000, 0000 0000, 1111 1111. Decoding and evaluating a generation (against target 0110 1000) assigns fitness scores, e.g. 1010 0011 (.40), 1010 0100 (.70), 1100 0000 (.20), 1111 0000 (.10), 0000 0000 (.10), 1111 0100 (.60). Selection keeps the fittest, here 1010 0011 (.40), 1010 0100 (.70), and 1111 0100 (.60); crossover recombines them and mutation flips bits to form the next generation.
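The loop above can be sketched in a few lines of Python. The 8-bit genome, the bit-match fitness against the target 0110 1000, truncation selection, and the mutation rate are illustrative assumptions, not details from the slides:

```python
import random

TARGET = 0b01101000
GENOME_BITS = 8

def fitness(genome):
    """Fraction of bits that match the target (1.0 = perfect)."""
    matches = GENOME_BITS - bin(genome ^ TARGET).count("1")
    return matches / GENOME_BITS

def select(population, k):
    """Truncation selection: keep the k fittest individuals."""
    return sorted(population, key=fitness, reverse=True)[:k]

def crossover(a, b, point=4):
    """Single-point crossover: high bits from a, low bits from b."""
    high_mask = (~0 << point) & ((1 << GENOME_BITS) - 1)
    return (a & high_mask) | (b & ~high_mask)

def mutate(genome, rate=0.1):
    """Flip each bit independently with the given probability."""
    for i in range(GENOME_BITS):
        if random.random() < rate:
            genome ^= 1 << i
    return genome

def evolve(generations=200, pop_size=6, seed=0):
    random.seed(seed)
    pop = [random.randrange(1 << GENOME_BITS) for _ in range(pop_size)]
    for _ in range(generations):
        parents = select(pop, 3)
        children = [crossover(random.choice(parents), random.choice(parents))
                    for _ in range(pop_size)]
        pop = [mutate(c) for c in children]
    return max(pop, key=fitness)

best = evolve()
print(bin(best), fitness(best))
```

For evolvable hardware, the genome would instead be (part of) a configuration bitstream, and the fitness evaluation would run the configured circuit.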
Evolvable Hardware Platform
Genetic Algorithms: GAs are a type of guided search. Why use a guided search instead of an exhaustive search? Assume 1 billion individuals can be evaluated per second. If an individual's genome is 32 bits in size, how long does an exhaustive search take? Now suppose the genome is an FPGA configuration bitstream 1,000,000 bits in size: how long does an exhaustive search take?
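The arithmetic behind these questions is worth working out, assuming the slide's 10^9 evaluations per second:

```python
import math

# Back-of-envelope arithmetic behind the slide's questions, assuming
# 1 billion evaluations per second.
EVALS_PER_SEC = 10**9

# 32-bit genome: 2**32 candidate bitstreams -> about 4.3 seconds.
seconds_32 = 2**32 / EVALS_PER_SEC

# 1,000,000-bit genome: 2**1_000_000 candidates. Work in log10 to
# avoid overflow; the result is the exponent of the search time in years.
years_exp_1m = 1_000_000 * math.log10(2) - math.log10(EVALS_PER_SEC * 3600 * 24 * 365)
print(f"32-bit: {seconds_32:.1f} s; 1,000,000-bit: ~10^{years_exp_1m:.0f} years")
```

A 32-bit design space is searchable by brute force in seconds; a full-bitstream design space never is, which is exactly why a guided search is needed.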
Evolvable Hardware Taxonomy. Extrinsic evolution (furthest from biology): evolution done in SW, then the result is realized in HW. Intrinsic evolution: HW is used to deploy individuals; results are sent back to SW for fitness calculation. Complete evolution: evolution is done entirely on the target HW device. Open-ended evolution (closest to biology): the evaluation criteria change dynamically.
Evolvable Hardware Applications: prosthetic hand controller chip. Kajitani, "An Evolvable Hardware Chip for Prosthetic Hand Controller", 1999.
Evolvable Hardware Applications Tone Discrimination and Frequency generation Adrian Thompson “Silicon Evolution”, 1996 Xilinx XC6200
Evolvable Hardware Applications Tone Discrimination and Frequency generation Node Functions Node Genotype
Evolvable Hardware Applications Tone Discrimination and Frequency generation Evolved 4KHz oscillator
Evolvable Hardware Issues?
Evolvable Hardware Platforms. Commercial platforms: Xilinx XC6200 (completely multiplexer-based, so random bitstreams could be programmed dynamically without damaging the chip); Xilinx Virtex FPGA. Custom platforms: POEtic cell; Evolvable LSI chip (Higuchi).
Next Lecture: overview of the synthesis process.
Adaptive Thermoregulation for Applications on Reconfigurable Devices Phillip Jones Applied Research Laboratory Washington University Saint Louis, Missouri, USA http://www.arl.wustl.edu/arl/~phjones Iowa State University Seminar April 2008 Funded by NSF Grant ITR 0313203
What are FPGAs? FPGA: Field Programmable Gate Array. A sea of general-purpose logic: an array of CLBs (Configurable Logic Blocks).
FPGA Usage Models. Fast prototyping: experimental ISAs, experimental microarchitectures, System on Chip (SoC), e.g. the Sparc-V8 Leon. Full reconfiguration: parallel applications (image processing, computational biology), remote update. Partial reconfiguration: run-time adaptation, run-time customization (CPU + specialized HW), fault tolerance.
Some FPGA Details. Each CLB contains 4-input Look-Up Tables (LUTs): a LUT stores one output bit Z for every input combination ABCD = 0000 ... 1111, so programming its contents realizes any 4-input function, e.g. a 4-input AND (Z = 1 only for ABCD = 1111), a 4-input OR (Z = 0 only for ABCD = 0000), or a 2:1 mux.
Some FPGA Details. Routing between CLBs is configured through PIPs (Programmable Interconnection Points); a CLB's LUT output can optionally be registered in a DFF before driving the routing fabric.
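A 4-input LUT can be modeled in software as a 16-entry truth table indexed by the inputs. The mux encoding below (D selecting between B and C, with A ignored) is one plausible reading of the slide's don't-care table, not a confirmed detail:

```python
# A 4-input LUT is a 16-entry truth table indexed by the inputs (A,B,C,D);
# its configuration bits determine which function it implements.

def make_lut(truth_table):
    """truth_table[i] is the output bit for inputs ABCD = binary i."""
    assert len(truth_table) == 16
    def lut(a, b, c, d):
        index = (a << 3) | (b << 2) | (c << 1) | d
        return truth_table[index]
    return lut

# 4-input AND: output 1 only for ABCD = 1111.
lut_and = make_lut([0] * 15 + [1])

# 4-input OR: output 0 only for ABCD = 0000.
lut_or = make_lut([0] + [1] * 15)

# 2:1 mux (assumed encoding): Z = B when D = 0, Z = C when D = 1; A ignored.
lut_mux = make_lut([((i >> 2) & 1) if (i & 1) == 0 else ((i >> 1) & 1)
                    for i in range(16)])

print("AND(1,1,1,1) =", lut_and(1, 1, 1, 1))
```

Changing only the 16 configuration bits changes the function, which is the whole programming model of LUT-based fabric.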
Outline Why Thermal Management? Measuring Temperature Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions
Why Thermal Management?
Why Thermal Management? Location? Hot Cold Regulated
Why Thermal Management? Mobile? Hot Cold Regulated
Why Thermal Management? Reconfigurability FPGA Plasma Physics Microcontroller
Why Thermal Management? Exceptional Events
Local Experience Thermally aggressive application Disruption of air flow
Damaged Board (bottom view) Thermally aggressive application Disruption of air flow
Damaged Board (side view) Thermally aggressive application Disruption of air flow
Response to catastrophic thermal events Easy Fix Not Feasible!! Very Inconvenient
Solutions. Over-provision: large heat sinks and fans. Restrict performance: limit operating frequency, limit chip utilization. Use thermal feedback (my approach): dynamic operating frequency, adaptive computation, shutdown device.
Outline Why Thermal Management? Measuring Temperature Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions
Measuring Temperature FPGA
Measuring Temperature FPGA A/D 60 C
Background: Measuring Temperature (FPGA). S. Lopez-Buedo, J. Garrido, and E. Boemo, "Thermal testing on reconfigurable computers," IEEE Design and Test of Computers, vol. 17, pp. 84-91, 2000. An on-chip ring oscillator's period varies with temperature, so measuring its period yields an on-chip thermometer.
Background: Measuring Temperature (FPGA). The ring oscillator's period depends on supply voltage as well as temperature, so voltage variation must be accounted for.
Background: Measuring Temperature (FPGA). "Adaptive Thermoregulation for Applications on Reconfigurable Devices", by Phillip H. Jones, James Moscola, Young H. Cho, and John W. Lockwood; Field Programmable Logic and Applications (FPL'07), Amsterdam, Netherlands.
Background: Measuring Temperature (FPGA). The application (four cores) runs in different modes; the measured incrementer period (on the order of 8,000-8,300 counts) tracks junction temperature (roughly 40-70 C on the plot) and shifts with the cores' operating frequency (high vs. low).
Background: Measuring Temperature (FPGA). A sample controller pauses the four application cores while the thermometer takes a reading; a time-out counter bounds the pause, after which the cores resume at their current (high or low) frequency.
Temperature Benchmark Circuits. Desired properties: scalable (works over a wide range of frequencies; circuit size can easily be increased or decreased); simple to analyze (regular structure); distributes evenly over the chip (helps reduce thermal gradients that may damage the chip); may serve as a standard (further experimentation, repeatability of results). "A Thermal Management and Profiling Method for Reconfigurable Hardware Applications", by Phillip H. Jones, John W. Lockwood, and Young H. Cho; Field Programmable Logic and Applications (FPL'06), Madrid, Spain.
Temperature Benchmark Circuits. Core Block (CB): an array of 48 LUTs and 48 DFFs, each LUT configured as a 4-input AND gate and placed with relative location (RLOC: row, col) constraints. Thermal workload unit: a computation row, an 8-input generator driving an array of 18 core blocks (864 LUTs, 864 DFFs), anchored on the chip with RLOC_ORIGIN and driven at a 100% activation rate.
Example Circuit Layout (Configuration 1x, 9% LUTs and DFFs) RLOC_ORIGIN: Row, Col (27,6) Thermal Workload Unit
Example Circuit Layout (Configuration 4x, 36% LUTs and DFFs)
Observed Temperature vs. Frequency. T ~ P and P ~ F*C*V^2. Steady-state temperatures shown for configurations Cfg1x, Cfg2x, Cfg4x, and Cfg10x.
Observed Temperature vs. Active Area. T ~ P and P ~ F*C*V^2. Steady-state temperatures shown at 10, 25, 50, 100, and 200 MHz; max rated Tj is 85 C.
Projecting Thermal Trajectories. Estimate the steady-state temperature: Tj_ss = Power * θJA + TA, where θJA is the FPGA's thermal resistance (C/W) and the power measured at t = 0 is used. Fitted two-time-constant trajectory: Temperature(t) = 0.5*(71 - 41*e^(-t/20)) + 0.5*(71 - 41*e^(-t/180)). How long until 60 C? This warm-up phase can be exploited for performance.
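Using the fitted trajectory, the time to reach a given temperature can be found numerically. The constants 41, 71, 20 s, and 180 s are from the slide; the bisection search is my addition:

```python
import math

# Two-time-constant thermal trajectory from the slide's fit.
def temperature(t):
    return (0.5 * (71 - 41 * math.exp(-t / 20))
            + 0.5 * (71 - 41 * math.exp(-t / 180)))

# The curve starts at 30 C and rises monotonically toward 71 C, so a
# bisection search finds when any intermediate temperature is reached.
def time_to_reach(target, lo=0.0, hi=2000.0):
    for _ in range(100):
        mid = (lo + hi) / 2
        if temperature(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

t60 = time_to_reach(60.0)
print(f"reaches 60 C after ~{t60:.0f} s")
```

The window before the target temperature is reached (roughly two minutes here) is the warm-up phase the slide suggests exploiting for extra performance.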
Thermal Shutdown Max Tj (70C)
Outline Why Thermal Management? Measuring Temperature Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions
Image Correlation Application Template
Image Correlation Application. Virtex-4 FX100 resource utilization: LUTs 57,461 (68%), DFFs 49,148 (58%), occupied slices 32,868 (77%), Block RAM 44 (11%), max frequency 200 MHz. Heats the FPGA a lot (> 85 C)!
Application Infrastructure. A temperature sample controller feeds a thermoregulation controller (65 C setpoint), which can pause the application and set its mode. "Adaptive Thermoregulation for Applications on Reconfigurable Devices", by Phillip H. Jones, James Moscola, Young H. Cho, and John W. Lockwood; Field Programmable Logic and Applications (FPL'07), Amsterdam, Netherlands.
Application Specific Adaptation. Four image processor cores, each with two feature masks, feed a score output from an image buffer; features are split into high- and low-priority sets. The thermoregulation controller (65 C setpoint) adapts along two axes: frequency (200, 180, 150, 100, 75, 50 MHz) and quality (8 down to 4 features), shedding low-priority features as temperature rises and restoring features and frequency as it falls.
Thermally Adaptive Frequency. Junction temperature Tj oscillates between the thermal budget (72 C, switch to the low frequency) and the low threshold (67 C, switch back to the high frequency). "An Adaptive Frequency Control Method Using Thermal Feedback for Reconfigurable Hardware Applications", by Phillip H. Jones, Young H. Cho, and John W. Lockwood; Field Programmable Technology (FPT'06), Bangkok, Thailand.
Thermally Adaptive Frequency. A related reactive approach: S. Wang, "Reactive Speed Control", ECRTS'06.
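The two-threshold behavior in the plot amounts to a bang-bang (hysteresis) controller. A minimal sketch, with an invented toy thermal model (+0.5 C/step at the high frequency, -0.3 C/step at the low frequency) just to exercise the control law; only the 72 C budget and 67 C threshold come from the slide:

```python
BUDGET_C = 72.0         # thermal budget: fall back to low frequency
LOW_THRESHOLD_C = 67.0  # low threshold: resume high frequency

def control_step(temp_c, high_freq_active):
    """Return whether to run at the high frequency this step."""
    if high_freq_active and temp_c >= BUDGET_C:
        return False          # too hot: drop to the low frequency
    if not high_freq_active and temp_c <= LOW_THRESHOLD_C:
        return True           # cooled off: resume the high frequency
    return high_freq_active   # otherwise keep the current mode (hysteresis)

# Toy simulation of the resulting sawtooth between the two thresholds.
temp, high = 60.0, True
trace = []
for _ in range(200):
    high = control_step(temp, high)
    temp += 0.5 if high else -0.3
    trace.append(temp)

print(f"max {max(trace):.1f} C, steady-state low ~{min(trace[50:]):.1f} C")
```

The gap between the two thresholds prevents rapid oscillation between frequencies near the budget.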
Outline Why Thermal Management? Measuring Temperature Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions
Platform Overview Virtex-4 FPGA Temperature Probe
Thermal Budget Efficiency (65 C thermal budget; bar chart of junction temperature vs. thermal condition). A fixed design must be provisioned for the worst case: 50 MHz, 4 features. The adaptive design scales with conditions, e.g. 50 MHz/4 features, 50/6, 65/8, 106/8, 184/8, up to 200 MHz/8 features, across conditions ranging from 40 C ambient with no fans down to 25 C ambient with two fans; the gap up to the budget is otherwise unused headroom.
Conclusions. Motivated the need for thermal management. Measuring temperature: accounted for application-dependent voltage-variation effects; introduced temperature benchmark circuits. Examined application-specific adaptation for improving performance in dynamic thermal environments.
Outline Why Thermal Management? Measuring Temperature Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions
Thermally Constrained Systems Space Craft Sun Earth
Thermally Constrained Systems
Temperature-Safe Real-time Systems Task scheduling is a concern in many embedded systems Goal: Satisfy thermal constraints without violating real-time constraints
How to manage temperature? Static frequency scaling: slow every task (T1, T2, T3) down, but if the chip is still too hot, deadlines could be missed. Sleep while idle: run at full speed and insert idle periods between tasks. Generalization: idle task insertion.
Idle Task Insertion (more powerful). a. No idle task inserted; tasks scheduled at F_max (100 MHz): (period 30 s, cost 10.0 s, deadline 10.0 s, utilization 33.33%), (120, 30.0, 120, 25.00%), (480, 30.0, 480, 6.25%), (960, 20.0, 960, 2.08%); total utilization 66.66%. Since the first task's deadline equals its cost, frequency cannot be scaled or the task schedule becomes infeasible. b. One idle task inserted (period 60 s, cost 20.0 s, utilization 33.33%); total utilization 99.99%. Idle task insertion has no impact on tasks' costs, leaves higher-priority task response times unaffected, and allows control over the distribution of idle time.
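The utilization arithmetic behind tables a and b checks out exactly; the slide's 66.66% and 99.99% are truncations of 2/3 and 1:

```python
# Utilization arithmetic for the idle-task-insertion example. Each task
# contributes cost/period; the idle task soaks up the remaining slack
# without changing any other task's cost.

def utilization(tasks):
    """tasks: list of (period_s, cost_s) pairs."""
    return sum(cost / period for period, cost in tasks)

# Table a: the four real tasks scheduled at F_max (100 MHz).
base = [(30, 10.0), (120, 30.0), (480, 30.0), (960, 20.0)]
print(f"base: {utilization(base) * 100:.2f}%")            # 2/3 of the CPU

# Table b: add one idle task (period 60 s, cost 20 s -> 33.33%).
with_idle = base + [(60, 20.0)]
print(f"with idle task: {utilization(with_idle) * 100:.2f}%")  # fully loaded
```

With the idle task in place, nearly all previously unstructured slack is scheduled explicitly, so the scheduler controls exactly when the device cools.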
Sleep when idle is insufficient Temperature constraint = 65 C Peak Temperature = 70 C
Idle-task inserted Temperature constraint = 65 C Peak Temperature = 61 C
Idle-Task Insertion + Deadlines. Flow: the system (task set) plus idle tasks goes to the scheduler (e.g. RMS); check deadlines met (yes/no), then temperature met (yes/no). a. The original schedule does not meet the temperature constraint. b. Idle tasks redistribute the device's idle time to reduce peak device temperature.
Related Research. Power management: Yao (FOCS'95), EDF with dynamic frequency scaling; Shin (DAC'99), worst-case execution time. Thermal management: Bansal (FOCS'04), EDF minimizing temperature; Wang (RTSS'06, ECRTS'06), RMS with reactive frequency control (CIA).
Outline Why Thermal Management? Measuring Temperature Thermally Driven Adaptation Experimental Results Conclusions Temperature-Safe Real-time Systems Future Directions
Research Fronts. Near term: exploration of adaptation techniques (advanced FPGA reconfiguration capabilities, other frequency adaptation techniques); integration of temperature into real-time systems. Longer term: cyber-physical systems (NSF initiative).
Questions/Comments?
Temperature per Processing Core (junction temperature, C, vs. number of processing cores, 1-4). Linear fits per scenario: S1: y = 2.21x + 60.1; S2: y = 2.24x + 57.1; S3: y = 2.23x + 52.1; S4: y = 2.07x + 44.2; S5: y = 1.43x + 37.5; S6: y = 1.22x + 34.0.
Temperature Sample Mode
Ring Oscillator Thermometer Characteristics. Thermometer size: ~100 LUTs. Ring oscillator size: 48 LUTs (47 NOT + 1 OR). Oscillation period: ~40 ns. Incrementer cycle period: ~0.16 ms (40 ns * 4096). Temperature resolution: 0.1 C/count, or 0.1 C per 20 ns.
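Converting the thermometer's count to degrees is a linear map at the table's 0.1 C/count resolution. The calibration point below (count 8235 corresponding to 40 C) is an illustrative assumption, not a figure from the slides; a real design would calibrate against a reference sensor:

```python
# Linear count-to-temperature conversion at 0.1 C per count.
RESOLUTION_C_PER_COUNT = 0.1
CAL_COUNT, CAL_TEMP_C = 8235, 40.0  # assumed calibration point

def count_to_temp(count):
    """Map an incrementer count to junction temperature in C."""
    return CAL_TEMP_C + (count - CAL_COUNT) * RESOLUTION_C_PER_COUNT

# The three application-mode counts from the temperature-vs-period plot.
for count in (8235, 8425, 8620):
    print(count, f"-> {count_to_temp(count):.1f} C")
```

Because the oscillator also varies with supply voltage, the calibration (and possibly the slope) must be redone per application operating mode, as the plot of modes A, B, and C suggests.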
Temperature vs. Incrementer Period (measuring temperature while the application is active). Plot spans 10-90 C against incrementer periods of 8100-8700 (20 ns/count); application modes A, B, and C yield counts of 8235, 8425, and 8620 respectively.
Application implementation statistics. Virtex-4 FX100 resource utilization: LUTs 57,461 (68%), DFFs 49,148 (58%), occupied slices 32,868 (77%), Block RAM 44 (11%), max frequency 200 MHz. Image correlation characteristics: image size 320x480 pixels, 8-bit (grey scale) pixel resolution, 1-8 features, processing rate 40.6 frames per second (at 200 MHz).
Application implementation statistics. a.) VirtexE 2000 resource utilization: LUTs 57,461 (68%), DFFs 49,148 (58%), occupied slices 32,868 (15,808), Block RAM 26% (43), max frequency 125 MHz. b.) Image correlation characteristics: image size 640x480 pixels, 8-bit (grey scale) pixel resolution, 1-4 mask patterns, 10 templates (in parallel), processing rate 12.7 images/second (at 125 MHz).
Scenario Descriptions (S1-S6: ambient temperature, # of fans). S1: 40 C (104 F), 0 fans. S2: 35 C (95 F), 0 fans. S3: 30 C (86 F), 0 fans. S4: 25 C (77 F), 0 fans. S5: 25 C (77 F), 1 fan. S6: 25 C (77 F), 2 fans.
High Level Architecture Application Pause Thermal Manager Frequency & Quality Controller Frequency mode Quality Temperature
Periodic Temperature Sampling. Every 50 ms an event counter triggers a sample: the thermal manager pauses the application, the sample mode controller captures a reading from the ring-oscillator-based thermometer (ready/capture handshake), and the frequency & quality controller updates the frequency mode and quality.
Ring Oscillator Based Thermometer (block diagram). The ring oscillator clock (ring_clk) drives a 12-bit incrementer; an edge detect on the incrementer's MSB captures a 14-bit count into a register (with reset), which is presented as the temperature value with a Ready signal through an output mux (sel).
ASIC, GPP, FPGA Comparison Cost Performance Power Flexibility
Frequency Multiplexing Circuit. A clock multiplier (DLLs) produces clk and 4xclk; a 2:1 mux under frequency control selects which one drives the global clock tree (BUFG). The current Virtex-4 platform uses the glitch-free BUFGMUX component.
Thermally Adaptive Frequency High Frequency Thermal Budget = 72 C Junction Temperature, Tj (C) Low Frequency Low Threshold = 67 C Time (s)
Worst Case Thermal Condition. The fixed thermally safe frequency for a 70 C budget is 50 MHz; a 30/120 MHz adaptive-frequency scheme achieves an average of 48.5 MHz, essentially matching it.
Typical Thermal Condition. Against the fixed thermally safe 50 MHz (70 C budget), the 30/120 MHz adaptive scheme achieves 95 MHz.
Best Case Thermal Condition. The 30/120 MHz adaptive scheme achieves 119 MHz versus the fixed thermally safe 50 MHz (70 C budget): a 2.4x performance increase.
CPRE 583 Reconfigurable Computing, Lecture 21: Fri 11/12/2010 (Synthesis). Instructor: Dr. Phillip Jones (phjones@iastate.edu), Reconfigurable Computing Laboratory, Iowa State University, Ames, Iowa, USA. http://class.ee.iastate.edu/cpre583/
Announcements/Reminders. HW3: finishing up (hope to release this evening); will be due Fri 12/17 at midnight. Two lectures left: Fri 12/3 (Synthesis and Map), Wed 12/8 (Place and Route). Two class sessions for project presentations: Fri 12/10 and Wed 12/15 (??). Take-home final given on Wed 12/15, due 12/17 at 5pm.
What you should learn: an intro to synthesis. Reference: Synthesis and Optimization of Digital Circuits, De Micheli, 1994 (chapter 1).
Synthesis (big picture) Synthesis & Optimization Architectural Logic Boolean Function Min Boolean Relation Min State Min Scheduling Sharing Coloring Covering Satisfiability Graph Theory Boolean Algebra
Views of a design: behavioral view vs. structural view, at two levels. Architectural level: behavior such as PC = PC + 1; Fetch(PC); Decode(INST), vs. structure such as Add/Mult units, RAM, and control. Logic level: behavior as a state machine (S1, S2, S3), vs. structure as gates and DFFs.
Levels of Synthesis. Architectural level: translate the architectural behavioral view of a design into a structural (e.g. block-level) view; identify functional resources, schedule their use (control), and interconnect them (data path). Logic level: translate the logic behavioral view of a design into a gate-level structural view.
Example: Diffeq (Forward Euler method). y'' + 3xy' + 3y = 0, where x(0) = 0; y(0) = y; y'(0) = u; for x = 0 to a, dx step size. On clk's rising edge: x1 <= x + dx; u1 <= u - (3 * x * u * dx) - (3 * y * dx); y1 <= y + u * dx; if (x1 < a) then ans_done <= 0; else ans_done <= 1; end if;
Example: Diffeq (Forward Euler method). This behavior maps onto a datapath with a multiplier (*) and an ALU, memory and steering logic, and a control unit.
Example: Diffeq Forward Euler method y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic
Example: Diffeq Forward Euler method y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 read S2 S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic
Example: Diffeq Forward Euler method y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 + S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic
Example: Diffeq Forward Euler method y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 * S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic
Example: Diffeq Forward Euler method y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 * S6 S5 S4 * ALU Control Unit Memory & Steering logic
Example: Diffeq Forward Euler method y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 *, + S6 S5 S4 * ALU Control Unit Memory & Steering logic
Example: Diffeq Forward Euler method y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 * S6 S5 S4 * ALU Control Unit Memory & Steering logic
Example: Diffeq (Forward Euler method)
y'' + 3xy' + 3y = 0, where x(0) = 0; y(0) = y; y'(0) = u; for x = 0 to a, dx step size
On each clk rising edge:
  x1 <= x + dx;
  u1 <= u - (3 * x * u * dx) - (3 * y * dx);
  y1 <= y + u * dx;
  if (x1 < a) then ans_done <= 0; else ans_done <= 1; end if;
  x <= x1; u <= u1; y <= y1;
[Figure: a control unit steps through states S1-S10, scheduling each operation (*, +, comparison, register write) onto a single shared ALU; memory and steering logic move operands between the registers and the ALU.]
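The register updates above can be sanity-checked in software. A minimal sketch of the same Forward Euler iteration (the function name euler_diffeq is my own; each loop pass mirrors the x1/u1/y1 assignments the datapath computes per FSM sweep):

```python
# Hedged sketch: software model of the Forward Euler update for
# y'' + 3xy' + 3y = 0. Variable names x, y, u, dx, a follow the slide.
def euler_diffeq(y0, u0, dx, a):
    x, y, u = 0.0, y0, u0
    while x < a:
        # One pass through the FSM's states S1..S10:
        x1 = x + dx
        u1 = u - (3 * x * u * dx) - (3 * y * dx)
        y1 = y + u * dx
        x, u, y = x1, u1, y1          # register writeback
    return y

# Example: two steps of size 0.25 from y(0)=1, y'(0)=0
print(euler_diffeq(1.0, 0.0, 0.25, 0.5))  # -> 0.8125
```

Shrinking dx trades more ALU cycles for a closer approximation of the true solution, which is exactly the latency/accuracy knob the hardware version exposes.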
Optimization
Combinational metrics: propagation delay, circuit size
Sequential metrics: cycle time, latency, circuit size
Impact of High-level Synthesis on Optimization
y'' + 3xy' + 3y = 0, where x(0) = 0; y(0) = y; y'(0) = u; for x = 0 to a, dx step size
On each clk rising edge:
  x1 <= x + dx;
  u1 <= u - (3 * x * u * dx) - (3 * y * dx);
  y1 <= y + u * dx;
  if (x1 < a) then ans_done <= 0; else ans_done <= 1; end if;
[Figure: the same behavioral description can be synthesized onto one shared ALU with memory and steering logic, or onto several parallel multipliers and ALUs; the high-level synthesis choice trades circuit size against latency.]
Logic-level Synthesis and Optimization
Combinational: two-level optimization, multi-level optimization
Sequential: state-based models, network models
Logic-level Synthesis and Optimization Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models Sum of products A’B’C’D’ + A’B’C’D + A’B’CD’ + A’B’CD + A’BCD’
Logic-level Synthesis and Optimization (two-level)
Sum of products: A'B'C'D' + A'B'C'D + A'B'CD' + A'B'CD + A'BCD'
K-map:
          CD=00  CD=01  CD=11  CD=10
  AB=00     1      1      1      1
  AB=01     0      0      0      1
  AB=11     0      0      0      0
  AB=10     0      0      0      0
Minimized sum of products: A'B' + A'CD'
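The minimized cover can be checked mechanically by exhausting the truth table. A brute-force sketch (the function names original and minimized are my own):

```python
from itertools import product

# Hedged sketch: verify that the minimized cover A'B' + A'CD' is
# logically equivalent to the original five-minterm sum of products.
def original(a, b, c, d):
    return ((not a and not b and not c and not d) or
            (not a and not b and not c and d) or
            (not a and not b and c and not d) or
            (not a and not b and c and d) or
            (not a and b and c and not d))

def minimized(a, b, c, d):
    return (not a and not b) or (not a and c and not d)

# Check all 16 input combinations
assert all(original(*v) == minimized(*v)
           for v in product([False, True], repeat=4))
```

This is the same equivalence the K-map grouping establishes by eye: the AB=00 row collapses to A'B', and the A'CD' prime implicant picks up the remaining minterm A'BCD'.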
Logic-level Synthesis and Optimization (multi-level)
Multi-level, high-level view: factor out shared subexpressions, e.g. with
  A = xy + xw
  B = xw
the cover A'B'CD + A'BC'D' expands to (xy + xw)'(xw)'CD + (xy + xw)'(xw)C'D'
Logic-level Synthesis and Optimization
Combinational: two-level optimization, multi-level optimization
Sequential: state-based models, network models
Introduction to HW3
Next Lecture MAP
CPRE 583 Reconfigurable Computing
Lecture 22: Fri 11/19/2010 (Coregen Overview)
Instructor: Dr. Phillip Jones (phjones@iastate.edu)
Reconfigurable Computing Laboratory, Iowa State University, Ames, Iowa, USA
http://class.ee.iastate.edu/cpre583/
Announcements/Reminders
HW3: released by Saturday midnight; will be due Wed 12/15 midnight.
Turn in weekly project report (tonight midnight).
Midterms still being graded, sorry for the delay: you can stop by my office after 5 pm today to pick up your graded test.
584 Advertisement: Number 1
What you should learn
Basics of using Coregen; in-class demo
Next Lecture Finish up synthesis process, start MAP
CPRE 583 Reconfigurable Computing
Lecture 23: Wed 12/1/2010 (Class Project Work)
Instructor: Dr. Phillip Jones (phjones@iastate.edu)
Reconfigurable Computing Laboratory, Iowa State University, Ames, Iowa, USA
http://class.ee.iastate.edu/cpre583/
Announcements/Reminders
HW3: finishing up (hope to release this evening); will be due Fri 12/17 midnight.
Two lectures left:
  Fri 12/3: Synthesis and Map
  Wed 12/8: Place and Route
Two class sessions for project presentations:
  Fri 12/10
  Wed 12/15 (??)
Take-home final given on Wed 12/15, due 12/17 5 pm
Next Lecture Finish up synthesis process, MAP
CPRE 583 Reconfigurable Computing
Lecture 24: Wed 12/8/2010 (Map, Place & Route)
Instructor: Dr. Phillip Jones (phjones@iastate.edu)
Reconfigurable Computing Laboratory, Iowa State University, Ames, Iowa, USA
http://class.ee.iastate.edu/cpre583/
Announcements/Reminders
HW3: finishing up (hope to release this evening); will be due Fri 12/17 midnight.
Two lectures left:
  Fri 12/3: Synthesis and Map
  Wed 12/8: Place and Route
Two class sessions for project presentations:
  Fri 12/10
  Wed 12/15 (9 - 10:30 am)
Take-home final given on Wed 12/15, due 12/17 5 pm
Applications on FPGA: Low-level
Implement circuit in VHDL (Verilog)
Simulate compiled VHDL
Synthesize VHDL into a device-independent format
Map the device-independent format to device-specific resources
  Check that the device has enough resources for the design
Place resources onto physical device locations
Route (connect) resources together
  Completely routed? Circuit meets specified performance?
Download configuration file (bit-stream) to the FPGA
Applications on FPGA: Low-level Implement Simulate Synthesize Map Place Route Download
(Technology) Map
Translate the device-independent netlist to device-specific resources
Applications on FPGA: Low-level Implement Simulate Synthesize Map Place Route Download
Place
Bind each mapped resource to a physical device location
User-guided layout (Chapter 16, Reconfigurable Computing)
General purpose (Chapter 14, Reconfigurable Computing)
  Simulated annealing
  Partition-based
Structure guided (Chapter 15, Reconfigurable Computing)
  Data-path based
Heuristics are used: there is no efficient means for finding an optimal solution
Place (High-level)
[Figure: a technology-mapped netlist (inputs A, B, C; LUT D; RAM E; DFFs F and G; clk; out) is bound onto the FPGA's physical layout of I/O blocks, LUTs, and BRAMs.]
Place
User-guided layout (Chapter 16, Reconfigurable Computing)
General purpose (Chapter 14, Reconfigurable Computing)
  Simulated annealing
  Partition-based
Structure guided (Chapter 15, Reconfigurable Computing)
  Data-path based
Place (User-Guided)
User provides information about the application's structure to help guide placement
  Can help remove critical paths
  Can greatly reduce the amount of time spent routing
Several methods to guide placement:
  Fixed region
  Floating region
  Exact location
  Relative location
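These guidance methods are typically expressed in the vendor's constraint file. A hedged sketch in Xilinx UCF style (the instance names my_lut, my_core, ff_a, ff_b and the site coordinates are hypothetical; consult the vendor's constraints guide for the exact syntax of your toolchain):

```
# Exact location: pin an instance to a specific slice
INST "my_lut" LOC = "SLICE_X10Y20";

# Region constraint: confine a group of logic to a rectangular area
INST "my_core/*" AREA_GROUP = "AG_core";
AREA_GROUP "AG_core" RANGE = "SLICE_X0Y0:SLICE_X15Y31";

# Relative location: keep two elements at a fixed offset from each other,
# while letting the group as a whole float
INST "ff_a" RLOC = "X0Y0";
INST "ff_b" RLOC = "X0Y1";
```

LOC corresponds to the "exact location" method, AREA_GROUP/RANGE to fixed regions, and RLOC to relative location.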
Place (User-Guided): Examples
Fixed region: [Figure: part of the mapped netlist (LUT D, DFFs F and G) is constrained to a fixed rectangular region of the FPGA, e.g. near the SDRAM interface.]
Place (User-Guided): Examples
Floating region: [Figure: a softcore processor is constrained to a region whose exact position the tools may choose.]
Place (User-Guided): Examples
Exact location: [Figure: LUT D and DFFs F and G are each pinned to specific LUT sites on the device.]
Place (User-Guided): Examples
Relative location: [Figure: LUT D and DFFs F and G keep fixed positions relative to one another, while the group as a whole may be placed anywhere on the device.]
Place
User-guided layout (Chapter 16, Reconfigurable Computing)
General purpose (Chapter 14, Reconfigurable Computing)
  Simulated annealing
  Partition-based
Structure guided (Chapter 15, Reconfigurable Computing)
  Data-path based
Place (General Purpose)
Characteristics: places resources without any knowledge of high-level structure; guided primarily by local connections between resources.
Drawback: does not take explicit advantage of the application's structure.
Advantage: can typically be used to place any arbitrary circuit.
Place (General Purpose)
Preprocess the mapped netlist using clustering: group netlist components that have local connectivity into a single logic block.
Clustering helps reduce the number of objects the placement algorithm has to explicitly place.
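A minimal sketch of this kind of clustering, assuming a toy netlist format that maps each component to the set of nets it touches (the function name cluster and the greedy size-capped strategy are my own illustration, not a production clusterer):

```python
# Hedged sketch: greedy pre-placement clustering. Components that share a
# net are absorbed into an existing cluster until a size cap is reached.
def cluster(netlist, max_size=4):
    clusters = []                       # each cluster: (set of components, set of nets)
    for comp, nets in netlist.items():
        # try to absorb the component into a cluster it already touches
        for comps, cnets in clusters:
            if cnets & nets and len(comps) < max_size:
                comps.add(comp)
                cnets |= nets
                break
        else:
            clusters.append(({comp}, set(nets)))
    return [sorted(c) for c, _ in clusters]

# Example: D-F-G form one connected group; X is isolated
print(cluster({"D": {"n1"}, "F": {"n1", "n2"}, "G": {"n2"}, "X": {"n9"}}))
# -> [['D', 'F', 'G'], ['X']]
```

The placer then places clusters instead of individual components, shrinking the search space the annealer has to explore.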
Place (General Purpose) Placement using simulated annealing Based on the physical process of annealing used to create metal alloys
Place (General Purpose)
Simulated annealing basic algorithm:
  Placement_cur = Initial_Placement;
  T = Initial_Temperature;
  While (not exit criterion 1)
    While (not exit criterion 2)
      Placement_new = Modify_placement(Placement_cur)
      dCost = Cost(Placement_new) - Cost(Placement_cur)
      r = random(0, 1)
      If r < e^(-dCost / T), Then Placement_cur = Placement_new
    End loop
    T = UpdateTemp(T)
  End loop
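The loop above can be made concrete with a toy placement problem: put connected blocks on a small grid while minimizing total wirelength. A hedged sketch (the function names, half-perimeter cost model, swap move, and geometric cooling schedule are my own choices, not the tuned schedules real placers use):

```python
import math, random

# Cost: half-perimeter bounding box of each net, summed over all nets.
def wirelength(place, nets):
    total = 0
    for net in nets:
        xs = [place[b][0] for b in net]
        ys = [place[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def anneal(blocks, nets, grid, T=10.0, cooling=0.95, moves_per_temp=50):
    rng = random.Random(0)               # fixed seed for repeatability
    slots = [(x, y) for x in range(grid) for y in range(grid)]
    place = dict(zip(blocks, slots))     # Placement_cur = Initial_Placement
    cost = wirelength(place, nets)
    while T > 0.01:                      # exit criterion 1: system "frozen"
        for _ in range(moves_per_temp):  # exit criterion 2: moves per temperature
            a, b = rng.sample(blocks, 2)
            place[a], place[b] = place[b], place[a]      # Modify_placement: swap
            new_cost = wirelength(place, nets)
            # Accept improving moves always; accept uphill moves with
            # probability e^(-dCost / T), which shrinks as T cools.
            if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / T):
                cost = new_cost
            else:
                place[a], place[b] = place[b], place[a]  # reject: undo the swap
        T *= cooling                     # UpdateTemp: geometric cooling
    return place, cost

# Example: a 4-block chain A-B-C-D on a 2x2 grid settles to wirelength 3
place, cost = anneal(["A", "B", "C", "D"],
                     [["A", "B"], ["B", "C"], ["C", "D"]], 2)
print(cost)
```

Early on, the high temperature lets the placer accept cost-increasing swaps and escape local minima; as T falls, it behaves greedily, mirroring how slowly cooled metal settles into a low-energy crystal.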
Place (General Purpose)
Simulated annealing: illustration
[Figure: a sequence of proposed swaps moves logic blocks (A, B, D, F, G, X, Z) among LUT and BRAM sites; swaps that lower the cost are kept, while some uphill swaps are accepted early on to escape local minima.]
Place
User-guided layout (Chapter 16, Reconfigurable Computing)
General purpose (Chapter 14, Reconfigurable Computing)
  Simulated annealing
  Partition-based
Structure guided (Chapter 15, Reconfigurable Computing)
  Data-path based
Place (Structure-based)
Leverage the structure of the application.
Algorithms may work well for a given structure, but will likely give unacceptable results for a design with little regular structure.
Structure high-level example
Applications on FPGA: Low-level Implement Simulate Synthesize Map Place Route Download
Route
Connect placed resources together.
Two requirements:
  Design must be completely routed
  Routed design meets timing requirements
Widely used algorithm: PathFinder
  PathFinder (FPGA'95), McMurchie and Ebeling
  Reconfigurable Computing (Chapter 17), Scott Hauck, Andre DeHon (2008)
Route
[Figure: an example circuit routed onto the FPGA's wire and switch fabric.]
Route (PathFinder)
PathFinder: A Negotiation-Based Performance-Driven Router for FPGAs (FPGA'95)
Basic PathFinder algorithm:
  Based closely on Dijkstra's shortest-path algorithm
  Weights are assigned to nodes instead of edges
Route (PathFinder): Example
G = (V, E)
  Vertices V: set of nodes (wires)
  Edges E: set of switches used to connect wires
Cost of using a wire: c_n = (b_n + h_n) * p_n
[Figure: sources S1-S3 connect through wires A, B, C, each with a base cost, to sinks D1-D3.]
Route (PathFinder): Example
Simple node cost: c_n = b_n (obstacle avoidance)
Note: the order in which signals are routed matters.
Route (PathFinder): Example
c_n = b_n * p_n
  p: sharing cost (a function of the number of signals currently sharing the resource)
Congestion avoidance
Route (PathFinder): Example
c_n = (b_n + h_n) * p_n
  h: history of sharing cost from previous iterations
Congestion avoidance
[Figure: over successive iterations, the history term drives signals off chronically congested wires until every signal is routed without sharing.]
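The negotiation loop above can be sketched on a tiny node graph. This is a hedged toy sketch, not the published PathFinder schedule: the node cost follows the slide's c_n = (b_n + h_n) * p_n, but the sharing term p_n is simplified to 1 + (current users), the history increment is a flat +1, and the graph format is my own.

```python
import heapq

def dijkstra(graph, src, dst, cost):
    """Shortest path where cost is charged per node rather than per edge."""
    dist = {src: cost(src)}
    prev = {}
    pq = [(dist[src], src)]
    while pq:
        d, n = heapq.heappop(pq)
        if n == dst:
            break
        if d > dist.get(n, float("inf")):
            continue                     # stale queue entry
        for m in graph[n]:
            nd = d + cost(m)
            if nd < dist.get(m, float("inf")):
                dist[m], prev[m] = nd, n
                heapq.heappush(pq, (nd, m))
    path, n = [dst], dst
    while n != src:                      # walk predecessors back to the source
        n = prev[n]
        path.append(n)
    return path[::-1]

def pathfinder(graph, base, signals, max_iters=10):
    history = {n: 0 for n in graph}      # h_n: accumulated congestion history
    routes = {}
    for _ in range(max_iters):
        use = {n: 0 for n in graph}      # current sharing count per node
        for sig, (src, dst) in signals.items():
            # c_n = (b_n + h_n) * p_n, with p_n modeled as 1 + current sharing
            cost = lambda n: (base[n] + history[n]) * (1 + use[n])
            routes[sig] = dijkstra(graph, src, dst, cost)
            for n in routes[sig]:
                use[n] += 1
        congested = [n for n, u in use.items() if u > 1]
        if not congested:
            return routes                # fully routed, no shared wires
        for n in congested:
            history[n] += 1              # make congested wires costlier next pass
    return routes

# Example: two signals both prefer cheap wire A; negotiation evicts one to B
graph = {"S1": ["A", "B"], "S2": ["A", "B"],
         "A": ["D1", "D2"], "B": ["D1", "D2"], "D1": [], "D2": []}
base = {"S1": 1, "S2": 1, "A": 1, "B": 2, "D1": 1, "D2": 1}
routes = pathfinder(graph, base, {"sig1": ("S1", "D1"), "sig2": ("S2", "D2")})
print(routes)
```

In the first iteration both signals route through the cheap wire A; the history term then raises A's cost until the second signal finds the conflict-free route through B, which is exactly the negotiated-congestion behavior the slide illustrates.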
Applications on FPGA: Low-level Implement Simulate Synthesize Map Place Route Download
Download Convert routed design into a device configuration file (e.g. bitfile for Xilinx devices)
Next Lecture Project presentations
Questions/Comments/Concerns
Write down:
  The main point of the lecture
  One thing that's still not quite clear
  OR, if everything is clear, an example of how to apply something from the lecture
Place (Structure-based)
Leverage the structure of the application.
Algorithms may work well for a given structure, but will likely give unacceptable results for a design with little regular structure.
GLACE, "A Generic Library for Adaptive Computing Environments" (FPL 2001), is an example tool that takes the structure of an application into account.
  FLAME (Flexible API for Module-based Environments)
  JHDL (from BYU)
  Gen (from Lockheed-Martin Advanced Technology Laboratories)
GLACE: High-level
GLACE: Flow
GLACE: Library Modules
GLACE: Data Path and Control Path
GLACE: FLAME low-level
GLACE: Final placement example