1
Reconfigurable Computing (High-level Acceleration Approaches)
Dr. Phillip Jones, Scott Hauck Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA
2
Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS
3
Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up to cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea
4
Projects: Target Timeline
Teams Formed and Idea: Mon 10/11 Project idea in Power Point 3-5 slides Motivation (why is this interesting, useful) What will be the end result High-level picture of final product Project team list: Name, Responsibility High-level Plan/Proposal: Wed 10/20 Power Point 5-10 slides System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Related research papers (if any)
5
Common Questions
6
Overview First 15 minutes of Google FPGA lecture How to run Gprof
Discuss some high-level approaches for accelerating applications.
7
What you should learn Start to get a feel for approaches for accelerating applications.
8
Why use Customized Hardware?
Great talk about the benefits of Heterogeneous Computing
9
Profiling Applications
Finding bottlenecks. Profiling tools: gprof, Valgrind
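A minimal sketch of the usual gprof flow on a toy C program (the file name, function name, and flags below are hypothetical examples, not from the lecture): compile with -pg, run once to produce gmon.out, then ask gprof for the flat profile to find the hotspot.

```c
/* hotspot.c -- toy program for a gprof walkthrough (hypothetical example)
 *
 * Typical gprof flow:
 *   gcc -pg -O1 hotspot.c -o hotspot
 *   ./hotspot                     # writes gmon.out in the current directory
 *   gprof hotspot gmon.out > profile.txt
 */
#include <stdio.h>

/* Deliberately expensive function so it dominates the flat profile. */
static double busy_work(long n) {
    double acc = 0.0;
    for (long i = 1; i <= n; i++)
        acc += 1.0 / (double)i;   /* harmonic sum */
    return acc;
}

int main(void) {
    double total = 0.0;
    for (int call = 0; call < 200; call++)
        total += busy_work(1000000);
    printf("total = %f\n", total);
    return 0;
}
```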
10
Pipelining: How many ns to process 100 input vectors <A,B,C,D> through a chain of four 4-LUTs, assuming each LUT has a 1 ns delay? How many ns to process the 100 input vectors if a DFF is placed after each 4-LUT and a 1 ns clock is used (1 DFF delay per output)?
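A back-of-the-envelope check of both questions, as a sketch under the slide's assumptions (4 LUT stages, 1 ns per LUT, 1 ns clock once pipelined):

```c
#include <stdio.h>

int main(void) {
    int vectors = 100;
    int stages  = 4;   /* four 4-LUTs in series, 1 ns each */

    /* Unpipelined: each vector must traverse all 4 LUTs before the next starts. */
    int unpipelined_ns = vectors * stages;        /* 100 * 4 = 400 ns */

    /* Pipelined with a DFF after each LUT and a 1 ns clock:
     * first result after 'stages' cycles, then one result per cycle. */
    int pipelined_ns = stages + (vectors - 1);    /* 4 + 99 = 103 ns */

    printf("unpipelined: %d ns, pipelined: %d ns\n", unpipelined_ns, pipelined_ns);
    return 0;
}
```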
11
Pipelining (Systolic Arrays)
Dynamic Programming 1. Start with base case (lower left corner) 2. Formula for computing the remaining cells 3. Final result in upper right corner.
12
Pipelining (Systolic Arrays)
Dynamic Programming Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1
13
Pipelining (Systolic Arrays)
Dynamic Programming Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1 1 1
14
Pipelining (Systolic Arrays)
Dynamic Programming 1 Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1 2 1 1 1
15
Pipelining (Systolic Arrays)
Dynamic Programming 1 3 Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1 2 3 1 1 1
16
Pipelining (Systolic Arrays)
Dynamic Programming 1 3 6 Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1 2 3 1 1 1
17
Pipelining (Systolic Arrays)
Dynamic Programming 1. Start with base case (lower left corner) 2. Formula for computing the remaining cells 3. Final result in upper right corner. How many ns to process if a CPU can process one cell per clock (1 ns clock)?
18
Pipelining (Systolic Arrays)
Dynamic Programming 1. Start with base case (lower left corner) 2. Formula for computing the remaining cells 3. Final result in upper right corner. How many ns to process if the FPGA can obtain maximum parallelism each clock (1 ns clock)?
19
Pipelining (Systolic Arrays)
Dynamic Programming 1. Start with base case (lower left corner) 2. Formula for computing the remaining cells 3. Final result in upper right corner. What speedup would an FPGA obtain (assuming maximum parallelism) for a 100x100 matrix? (Hint: find a formula for an NxN matrix.)
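One way to reason about the hint: a sequential CPU fills all N*N cells one per clock, while a systolic implementation can fill an entire anti-diagonal per clock and so needs 2N-1 clocks. A small sketch of the resulting formula, assuming a 1 ns clock on both sides as in the slides:

```c
#include <stdio.h>

int main(void) {
    long n = 100;                 /* NxN dynamic-programming matrix          */
    long cpu_ns  = n * n;         /* CPU: one cell per 1 ns clock            */
    long fpga_ns = 2 * n - 1;     /* systolic FPGA: one anti-diagonal/clock  */
    printf("CPU: %ld ns, FPGA: %ld ns, speedup ~ %.1fx\n",
           cpu_ns, fpga_ns, (double)cpu_ns / fpga_ns);   /* ~50x for N=100 */
    return 0;
}
```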
20
Dr. James Moscola (Example)
MATL2 D10 ML9 MATP1 IL7 IR8 END3 E12 IL11 ROOT0 MP3 D6 MR5 ML4 S0 IL1 IR2 c g a 1 2 3 ROOT0 MATP1 MATL2 END3 1 2 3 20
21
Example RNA Model 1 2 3 ROOT0 MATP1 MATL2 END3 1 2 3 21 MATL2 MATP1
ML9 MATP1 IL7 IR8 END3 E12 IL11 ROOT0 MP3 D6 MR5 ML4 S0 IL1 IR2 c g a 1 2 3 ROOT0 MATP1 MATL2 END3 1 2 3 21
22
Baseline Architecture Pipeline
END3 MATL2 MATP1 ROOT0 E12 IL11 D10 ML9 IR8 IL7 D6 MR5 ML4 MP3 IR2 IL1 S0 u g g c g a c a c c c residue pipeline 22
23
Processing Elements IL7,3,2 IR8,3,2 ML9,3,2 D10,3,2 ML4 + = + = + = +
1 2 3 .40 -INF .22 .72 .30 .44 1 j IL7,3,2 2 + ML4_t(7) = 3 IR8,3,2 + ML4_t(8) = ML9,3,2 + ML4_t(9) = D10,3,2 + + ML4,3,3 = .22 ML4_t(10) ML4_e(A) ML4_e(C) ML4_e(G) ML4_e(U) input residue, xi 23
24
Baseline Results for Example Model
Comparison to Infernal software Infernal run on Intel Xeon 2.8GHz Baseline architecture run on Xilinx Virtex-II 4000 occupied 88% of logic resources run at 100 MHz Input database of 100 Million residues Bulk of time spent on I/O (41.434s)
25
Expected Speedup on Larger Models
Name Num PEs Pipeline Width Pipeline Depth Latency (ns) HW Processing Time (seconds) Total Time with measured I/O (seconds) Infernal Time (seconds) Infernal Time (QDB) (seconds) Expected Speedup over Infernal Expected Speedup over Infernal (w/QDB) RF00001 39492 195 19500 349492 128443 8236 3027 RF00016 43256 282 28200 336000 188521 7918 4443 RF00034 38772 187 18700 314836 87520 7419 2062 RF00041 44509 206 20600 388156 118692 9147 2797 Example 81 26 6 600 1039 868 25 20 Speedup estimated ... using 100 MHz clock for processing database of 100 Million residues Speedups range from 500x to over 13,000x larger models with more parallelism exhibit greater speedups
26
Distributed Memory ALU Cache BRAM BRAM PE BRAM BRAM
27
Next Class Models of Computation (Design Patterns)
28
Questions/Comments/Concerns
Write down Main point of lecture One thing that’s still not quite clear If everything is clear, then give an example of how to apply something from lecture OR
29
Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 11: Fri 10/1/2010 (Design Patterns) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA
30
Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Team size: 3-4 (5 case-by-case) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up to cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea
31
Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS
32
Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up to cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea
33
Weekly Project Updates
The current state of your project write up Even in the early stages of the project you should be able to write a rough draft of the Introduction and Motivation section The current state of your Final Presentation Your Initial Project proposal presentation (Due Wed 10/20). Should make for a starting point for you Final presentation What things are work & not working What roadblocks are you running into
34
Overview Class Project (example from 2008) Common Design Patterns
35
What you should learn Introduction to common Design Patterns & Compute Models
36
Outline Design patterns Why are they useful? Examples Compute models
37
Outline Design patterns Why are they useful? Examples Compute models
38
References Reconfigurable Computing (2008) [1] Chapter 5: Compute Models and System Architectures Scott Hauck, Andre DeHon Design Patterns for Reconfigurable Computing [2] Andre DeHon (FCCM 2004) Type Architectures, Shared Memory, and the Corollary of Modest Potential [3] Lawrence Snyder: Annual Review of Computer Science (1986) Design Patterns: Abstraction and Reuse of Object Oriented Design [4] E. Gamma (1992) The Timeless Way of Building [5] C. Alexander (1979)
39
Design Patterns Design patterns are solutions to recurring problems.
40
Reconfigurable Hardware Design
“Building good reconfigurable designs requires an appreciation of the different costs and opportunities inherent in reconfigurable architectures” [2] “How do we teach programmers and designers to design good reconfigurable applications and systems?” [2] Traditional approach: Read lots of papers for different applications Over time figure out ad-hoc tricks Better approach?: Use design patterns to provide a more systematic way of learning how to design It has been shown in other realms that studying patterns is useful Object oriented software [4] Building architecture [5]
41
Common Language Provides a means to organize and structure the solution to a problem Provide a common ground from which to discuss a given design problem Enables the ability to share solutions in a consistent manner (reuse)
42
Describing a Design Pattern [2]
10 attributes suggested by Gamma (Design Patterns, 1995) Name: Standard name Intent: What problem is being addressed, and how? Motivation: Why use this pattern? Applicability: When can this pattern be used? Participants: What components make up this pattern? Collaborations: How do the components interact? Consequences: Trade-offs Implementation: How to implement Known Uses: Real examples of where this pattern has been used. Related Patterns: Similar patterns, patterns that can be used in conjunction with this pattern, and when you would choose a similar pattern instead of this pattern.
43
Example Design Pattern
Coarse-grain Time-multiplexing Template Specialization
44
Coarse-grain Time-Multiplexing
B M3 M1 M2 M1 M2 A B M3 Temp M3 Temp Configuration 1 Configuration 2
45
Coarse-grain Time-Multiplexing
Name: Coarse-grained Time-Multiplexing Intent: Enable a design that is too large to fit on a chip all at once to run as multiple subcomponents Motivation: Method to share limited fixed resources to implement a design that is too large as a whole.
46
Coarse-grain Time-Multiplexing
Applicability (Requirements): Configuration can be done on large time scale No feedback loops in computation Feedback loop only spans the current configuration Feedback loop is very slow Participants: Computational graph Control algorithm Collaborations: Control algorithm manages when sub-graphs are loaded onto the device
47
Coarse-grain Time-Multiplexing
Consequences: Often platforms take millions of cycles to reconfigure Need an app that will run for 10's of millions of cycles before needing to reconfigure May need large buffers to store data during a reconfiguration Known Uses: Video processing pipeline [Villasenor] “Video Communications using Rapidly Reconfigurable Hardware”, Transactions on Circuits and Systems for Video Technology 1995 Automatic Target Recognition [Villasenor] “Configurable Computer Solutions for Automatic Target Recognition”, FCCM 1996
48
Coarse-grain Time-Multiplexing
Implementation: Break design into multiple sub graphs that can be configured onto the platform in sequence Design a controller to orchestrate the configuration sequencing Take steps to minimize configuration time Related patterns: Streaming Data Queues with Back-pressure
49
Coarse-grain Time-Multiplexing
B M3 M1 M2 M1 M2 A B M3 Temp M3 Temp Configuration 1 Configuration 2
50
Coarse-grain Time-Multiplexing
Assume: 1.) reconfiguration takes 10 thousand clocks 2.) 100 MHz clock 3.) We need to process for 100x the time spent in reconfiguration to get the needed speedup. 4.) A and B each produce one byte per clock M2 M1 A B M3 M1 M2 M1 M2 A B M3 Temp M3 Temp Configuration 1 Configuration 2
51
Coarse-grain Time-Multiplexing
Assume: 1.) reconfiguration takes 10 thousand clocks 2.) 100 MHz clock 3.) We need to process for 100x the time spent in reconfiguration to get the needed speedup. 4.) A and B each produce one byte per clock. What constraint does this place on Temp? A 1 MB buffer. What if the data path is changed from 8-bit to 64-bit? An 8 MB buffer; likely need off-chip memory. (Configuration 1 / Configuration 2 diagram as before.)
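A quick sanity check of the buffer numbers on this slide, as a sketch under its assumptions (10,000-clock reconfiguration, 100 MHz clock, processing for 100x the reconfiguration time, and a net rate into Temp of 1 byte per clock, which is what matches the 1 MB answer):

```c
#include <stdio.h>

int main(void) {
    long reconfig_clocks = 10000;                  /* per reconfiguration        */
    long process_clocks  = 100 * reconfig_clocks;  /* run 100x the reconfig time */
    long bytes_per_clock = 1;                      /* 8-bit datapath into Temp   */

    printf("8-bit path:  Temp >= %ld bytes (~1 MB)\n",
           process_clocks * bytes_per_clock);

    bytes_per_clock = 8;                           /* 64-bit datapath            */
    printf("64-bit path: Temp >= %ld bytes (~8 MB, likely off-chip memory)\n",
           process_clocks * bytes_per_clock);
    return 0;
}
```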
52
Template Specialization
Empty LUTs A(1) A(0) LUT LUT LUT LUT - - - - C(3) C(2) C(1) C(0) Mult by 3 Mult by 5 A(1) A(1) A(0) A(0) LUT LUT LUT LUT LUT LUT LUT LUT 3 6 9 1 1 1 1 5 10 15 1 1 1 1 C(3) C(2) C(1) C(0) C(3) C(2) C(1) C(0)
53
Template Specialization
Name: Template Specialization Intent: Reduce the size or time needed for a computation. Motivation: Use early-bound data and slowly changing data to reduce circuit size and execution time.
54
Template Specialization
Applicability: When circuit specialization can be adapted quickly Example: Can treat LUTs as small memories that can be written. No interconnect modifications Participants: Template cell: Contains specialization configuration Template filler: Manages what and how a configuration is written to a Template cell Collaborations: Template filler manages Template cell
55
Template Specialization
Consequences: Cannot optimize as much as when a circuit is fully specialized for a given instance. Overhead is needed to allow the template to implement several specializations. Known Uses: Multiply-by-Constant String Matching Implementation: Multiply-by-Constant Use the LUT as a memory to store the answer Use a controller to update this memory when a different constant should be used.
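A small software sketch of the "LUT as memory" idea for multiply-by-constant: the template filler simply rewrites the LUT contents with a precomputed product table whenever the constant changes. The 3-bit input and 6-bit output widths here are illustrative assumptions, not from the slides.

```c
#include <stdio.h>

#define IN_BITS  3
#define OUT_BITS 6
#define ENTRIES  (1 << IN_BITS)

/* Template cell: LUT contents, one output word per input combination. */
static unsigned char lut[ENTRIES];

/* Template filler: specialize the LUT for a new constant multiplier k. */
static void fill_mult_by_const(unsigned k) {
    for (unsigned a = 0; a < ENTRIES; a++)
        lut[a] = (unsigned char)((a * k) & ((1u << OUT_BITS) - 1));
}

int main(void) {
    fill_mult_by_const(3);                 /* "mult by 3" configuration       */
    printf("3 * 5 = %u\n", lut[5]);        /* a lookup replaces the multiply  */
    fill_mult_by_const(5);                 /* re-specialize: "mult by 5"      */
    printf("5 * 5 = %u\n", lut[5]);
    return 0;
}
```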
56
Template Specialization
Related patterns: CONSTRUCTOR EXCEPTION TEMPLATE
57
Template Specialization
Empty LUTs A(1) A(0) LUT LUT LUT LUT - - - - C(3) C(2) C(1) C(0) Mult by 3 Mult by 5 A(1) A(1) A(0) A(0) LUT LUT LUT LUT LUT LUT LUT LUT 3 6 9 1 1 1 1 5 10 15 1 1 1 1 C(3) C(2) C(1) C(0) C(3) C(2) C(1) C(0)
58
Template Specialization
Mult by 3 A(1) A(0) LUT LUT LUT LUT 1 1 1 1 3 6 9 C(3) C(2) C(1) C(0) Multiply by a constant of 2: Support inputs of 0 - 7
59
Template Specialization
Mult by 3 A(1) A(0) LUT LUT LUT LUT 1 1 1 1 3 6 9 C(3) C(2) C(1) C(0)
60
Template Specialization
Mult by 3 A(1) A(0) LUT LUT LUT LUT 1 1 1 1 3 6 9 Mult by 2 A(2) A(1) A(0) LUT LUT LUT LUT 2 4 6 8 10 12 14 1 1 1 C(3) C(2) C(1) C(0)
61
Catalog of Patterns (Just a start) [2]
[2] Identifies 89 patterns Area-Time Tradeoff Basic (implementation): Coarse-grain Time-Multiplex Parallel (Expression): Dataflow, Data Parallel Parallel (Implementation): SIMD, Communicating FSM Reducing Area or Time Reuse Hardware (implementation): Pipelining Specialization (Implementation): Template Communications Layout (Expression/Implementation): Systolic Memory Numbers and Functions
62
Catalog of Patterns (Just a start) [2]
[2] Identifies 89 patterns Area-Time Tradeoff Basic (implementation): Coarse-grain Time-Multiplex Parallel (Expression): Dataflow, Data Parallel Parallel (Implementation): SIMD, Communicating FSM Reducing Area or Time Reuse Hardware (implementation): Pipelining Specialization (Implementation): Template Communications Layout (Expression/Implementation): Systolic Memory Numbers and Functions
63
Next Lecture Continue Compute Models
64
Lecture Notes:
65
Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 12: Wed 10/6/2010 (Compute Models) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA
66
Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS
67
Projects: Target Timeline
Teams Formed and Idea: Mon 10/11 Project idea in Power Point 3-5 slides Motivation (why is this interesting, useful) What will be the end result High-level picture of final product Project team list: Name, Responsibility High-level Plan/Proposal: Fri 10/22 Power Point 5-10 slides System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Related research papers (if any)
68
Projects: Target Timeline
Work on projects: 10/ /8 Weekly update reports More information on updates will be given Presentations: Last Wed/Fri of class Present / Demo what is done at this point 15-20 minutes (depends on number of projects) Final write up and Software/Hardware turned in: Day of final (TBD)
69
Project Grading Breakdown
50% Final Project Demo 30% Final Project Report (30% of your project report grade will come from your 5-6 project updates, due Fridays at midnight) 20% Final Project Presentation
70
Common Questions
71
Common Questions
72
Common Questions
73
Common Questions
74
Overview Compute Models
75
What you should learn Introduction to Compute Models
76
Outline Design patterns (previous lecture)
Why are they useful? Examples Compute models (Abstraction) System Architectures (Implementation)
77
Outline Design patterns (previous lecture)
Why are they useful? Examples Compute models (Abstraction) System Architectures (Implementation)
78
References Reconfigurable Computing (2008) [1]
Chapter 5: Compute Models and System Architectures Scott Hauck, Andre DeHon Design Patterns for Reconfigurable Computing [2] Andre DeHon (FCCM 2004) Type Architectures, Shared Memory, and the Corollary of Modest Potential [3] Lawrence Snyder: Annual Review of Computer Science (1986)
79
Building Applications
Problem -> Compute Model + Architecture -> Application Questions to answer How to think about composing the application? How will the compute model lead to a naturally efficient architecture? How does the compute model support composition? How to conceptualize parallelism? How to tradeoff area and time? How to reason about correctness? How to adapt to technology trends (e.g. larger/faster chips)? How does compute model provide determinacy? How to avoid deadlocks? What can be computed? How to optimize a design, or validate application properties?
80
Compute Models Compute Models [1]: High-level models of the flow of computation. Useful for: Capturing parallelism Reasoning about correctness Decomposition Guide designs by providing constraints on what is allowed during a computation Communication links How synchronization is performed How data is transferred
81
Two High-level Families
Data Flow: Single-rate Synchronous Data Flow Synchronous Data Flow Dynamic Streaming Dataflow Dynamic Streaming Dataflow with Peeks Streaming Data Flow with Allocation Sequential Control: Finite Automata (i.e. Finite State Machine) Sequential Controller with Allocation Data Centric Data Parallel
82
Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +
83
Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +
84
Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +
85
Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +
86
Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +
87
Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +
88
Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +
89
Data Flow Graph of operators that data (tokens) flows through
Composition of functions Captures: Parallelism Dependences Communication X X +
90
Single-rate Synchronous Data Flow
One token rate for the entire graph For example, all operations take one token on a given link before producing an output token Same power as a Finite State Machine 1 1 1 update - 1 1 1 1 1 1 1 1 F copy
91
Synchronous Data Flow - F
Each link can have a different constant token input and output rate Same power as the single-rate version, but for some applications easier to describe Automated ways to detect/determine: Deadlock Buffer sizes 1 10 1 update - 1 1 1 1 10 10 1 1 F copy
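One reason synchronous data flow is analyzable: with constant per-firing token rates, the relative firing counts come from balance equations that can be solved offline, and a buffer bound follows. A tiny sketch for a single link where the upstream actor produces p tokens per firing and the downstream actor consumes c per firing (the rates below are made-up, not read off the slide's graph):

```c
#include <stdio.h>

static long gcd(long a, long b) { return b ? gcd(b, a % b) : a; }

int main(void) {
    long p = 10, c = 1;             /* tokens produced / consumed per firing */
    long g = gcd(p, c);

    /* Balance equation p * q_src = c * q_dst gives the repetition vector.   */
    long q_src = c / g, q_dst = p / g;

    /* Worst-case tokens sitting on the link during one complete iteration.  */
    long buffer = p * q_src;

    printf("fire src %ldx, dst %ldx per iteration; link buffer >= %ld tokens\n",
           q_src, q_dst, buffer);
    return 0;
}
```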
92
Dynamic Streaming Data Flow
Token rates dependent on data Just need to add two structures Switch Select in in0 in1 S S Switch Select out0 out1 out
93
Dynamic Streaming Data Flow
Token rates dependent on data Just need to add two structures Switch, Select More Powerful Difficult to detect Deadlocks Still Deterministic 1 Switch y x x y S F0 F1 x y x y Select
94
Dynamic Streaming Data Flow with Peeks
Allow operator to fire before all inputs have arrived Example where this is useful is the merge operation Now execution can be nondeterministic Answer depends on input arrival times Merge
95
Dynamic Streaming Data Flow with Peeks
Allow operator to fire before all inputs have arrived Example where this is useful is the merge operation Now execution can be nondeterministic Answer depends on input arrival times A Merge
96
Dynamic Streaming Data Flow with Peeks
Allow operator to fire before all inputs have arrived Example where this is useful is the merge operation Now execution can be nondeterministic Answer depends on input arrival times B Merge A
97
Dynamic Streaming Data Flow with Peeks
Allow operator to fire before all inputs have arrived Example where this is useful is the merge operation Now execution can be nondeterministic Answer depends on input arrival times Merge B A
98
Streaming Data Flow with Allocation
Removes the need for static links and operators. That is, the Data Flow graph can change over time More Power: Turing Complete More difficult to analyze Could be useful for some applications Telecom applications. For example, if a channel carries voice versus data the resources needed may vary greatly Can take advantage of platforms that allow runtime reconfiguration
99
Sequential Control Sequence of subroutines
Programming languages (C, Java) Hardware control logic (Finite State Machines) Transform global data state
100
Finite Automata (i.e. Finite State Machine)
Can verify state reachability in polynomial time S1 S2 S3
101
Sequential Controller with Allocation
Adds ability to allocate memory. Equivalent to adding new states Model becomes Turing Complete S1 S2 S3
102
Sequential Controller with Allocation
Adds ability to allocate memory. Equivalent to adding new states Model becomes Turing Complete S1 S2 S4 S3 SN
103
Data Parallel Multiple instances of an operation type acting on separate pieces of data. For example: Single Instruction Multiple Data (SIMD) Identical match test on all items in a database Inverting the color of all pixels in an image
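The pixel-inversion example as a plain loop: every iteration applies the same operation to independent data, which is exactly what lets it map onto SIMD lanes or replicated FPGA processing elements. The image size and 8-bit grayscale layout here are illustrative assumptions.

```c
#include <stdint.h>
#include <stdio.h>

#define W 640
#define H 480

static uint8_t image[H][W];   /* 8-bit grayscale pixels */

int main(void) {
    /* Same operation on every pixel, no cross-iteration dependences:
     * a vectorizing compiler or a SIMD/FPGA fabric can run these in parallel. */
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            image[y][x] = (uint8_t)(255 - image[y][x]);

    printf("inverted %d pixels\n", W * H);
    return 0;
}
```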
104
Data Centric Similar to Data Flow, but the state contained in the objects of the graph is the focus, not the tokens flowing through the graph Network flow example Source1 Dest1 Source2 Switch Dest2 Source3 Flow rate Buffer overflow
105
Multi-threaded Multi-threaded: a compute model made up of multiple sequential controllers that have communication channels between them Very general, but often too much power and flexibility. No guidance for: Ensuring determinism Dividing an application into threads Avoiding deadlock Synchronizing threads The models discussed can be defined in terms of a Multi-threaded compute model
106
Multi-threaded (Illustration)
107
Streaming Data Flow as Multithreaded
Thread: an operator that performs transforms on data as it flows through the graph Thread synchronization: tokens sent between operators
108
Data Parallel as Multithreaded
Thread: a data item Thread synchronization: data updated with each sequential instruction
109
Caution with Multithreaded Model
Use when a stricter compute model does not give enough expressiveness. Define restrictions to limit the amount of expressive power that can be used Define synchronization policy How to reason about deadlocking
110
Other Models “A Framework for Comparing Models of computation” [1998]
E. Lee, A. Sangiovanni-Vincentelli Transactions on Computer-Aided Design of Integrated Circuits and Systems “Concurrent Models of Computation for Embedded Software”[2005] E. Lee, S. Neuendorffer IEEE Proceedings – Computers and Digital Techniques
111
Next Lecture System Architectures
112
User Defined Instruction
MP3 FPGA Power PC PC Display.c Ethernet (UDP/IP) User Defined Instruction VGA Monitor
113
User Defined Instruction
MP3 FPGA Power PC PC Display.c Ethernet (UDP/IP) User Defined Instruction VGA Monitor
114
User Defined Instruction
MP3 FPGA Power PC PC Display.c Ethernet (UDP/IP) User Defined Instruction VGA Monitor
115
MP3 Notes MUCH less VHDL coding than MP2
But you will be writing most of the VHDL from scratch The focus will be more on learning to read a specification (Power PC coprocessor interface protocol), and designing hardware that follows that protocol. You will be dealing with some pointer intensive C-code. It’s a small amount of C code, but somewhat challenging to get the pointer math right.
116
Questions/Comments/Concerns
Write down Main point of lecture One thing that’s still not quite clear If everything is clear, then give an example of how to apply something from lecture OR
117
Lecture Notes kk
118
Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 13: Fri 10/8/2010 (System Architectures) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA
119
Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Team size: 3-4 (5 case-by-case) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up to cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea
120
Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS
121
Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up to cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea
122
Projects: Target Timeline
Work on projects: 10/ /8 Weekly update reports More information on updates will be given Presentations: Last Wed/Fri of class Present / Demo what is done at this point 15-20 minutes (depends on number of projects) Final write up and Software/Hardware turned in: Day of final (TBD)
123
Common Questions
124
Common Questions
125
Common Questions
126
Overview Common System Architectures Plus/Delta mid-semester feedback
127
What you should learn Introduction to common System Architectures
128
Outline Design patterns (previous lecture)
Why are they useful? Examples Compute models (Abstraction) System Architectures (Implementation)
129
Outline Design patterns (previous lecture)
Why are they useful? Examples Compute models (Abstraction) System Architectures (Implementation)
130
References Reconfigurable Computing (2008) [1]
Chapter 5: Compute Models and System Architectures Scott Hauck, Andre DeHon
131
System Architectures Compute Models: Help express the parallelism of an application System Architecture: How to organize application implementation
132
Efficient Application Implementation
Compute model and system architecture should work together Both are a function of The nature of the application Required resources Required performance The nature of the target platform Resources available
133
Efficient Application Implementation
(Image Processing) Platform 1 (Vector Processor) Platform 2 (FPGA)
134
Efficient Application Implementation
(Image Processing) Compute Model System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)
135
Efficient Application Implementation
(Image Processing) Compute Model System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)
136
Efficient Application Implementation
(Image Processing) Data Flow Compute Model Streaming Data Flow System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)
137
Efficient Application Implementation
(Image Processing) Data Flow Compute Model Streaming Data Flow System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)
138
Efficient Application Implementation
(Image Processing) Compute Model System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)
139
Efficient Application Implementation
(Image Processing) Data Parallel Compute Model Vector System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)
140
Efficient Application Implementation
(Image Processing) Data Flow Compute Model Streaming Data Flow System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)
141
Efficient Application Implementation
(Image Processing) Data Flow Compute Model Streaming Data Flow System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)
142
Efficient Application Implementation
(Image Processing) X X Data Flow Compute Model + Streaming Data Flow System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)
143
Implementing Streaming Dataflow
Data presence variable length connections between operators data rates vary between operator implementations data rates varying between operators Datapath sharing not enough spatial resources to host entire graph balanced use of resources (e.g. operators) cyclic dependencies impacting efficiency Interconnect sharing Interconnects are becoming difficult to route Links between operators infrequently used High variability in operator data rates Streaming coprocessor Extreme resource constraints
144
Data Presence X X +
145
Data Presence X X data_ready data_ready + data_ready
146
Data Presence X X FIFO FIFO data_ready data_ready + FIFO data_ready
147
Data Presence X X stall stall FIFO FIFO data_ready data_ready + FIFO
148
Data Presence Flow control: Term typically used in networking X X stall
FIFO FIFO data_ready data_ready + FIFO stall data_ready Flow control: Term typically used in networking
149
Data Presence Flow control: Term typically used in networking
Increase flexibility of how the application can be implemented X X stall stall FIFO FIFO data_ready data_ready + FIFO stall data_ready Flow control: Term typically used in networking
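A software sketch of the data_ready/stall handshake: a small bounded FIFO sits between two operators, the producer checks stall (FIFO full) before pushing and the consumer checks data_ready (FIFO non-empty) before popping. This only illustrates the flow-control behavior, not the hardware implementation; depth and rates are made up.

```c
#include <stdbool.h>
#include <stdio.h>

#define DEPTH 4

typedef struct { int buf[DEPTH]; int head, tail, count; } fifo_t;

static bool stall(const fifo_t *f)      { return f->count == DEPTH; } /* full  */
static bool data_ready(const fifo_t *f) { return f->count > 0;      } /* valid */

static void push(fifo_t *f, int v) { f->buf[f->tail] = v; f->tail = (f->tail + 1) % DEPTH; f->count++; }
static int  pop (fifo_t *f)        { int v = f->buf[f->head]; f->head = (f->head + 1) % DEPTH; f->count--; return v; }

int main(void) {
    fifo_t link = {0};
    int produced = 0, consumed = 0;

    /* Producer runs 2x faster than the consumer; backpressure keeps it honest. */
    for (int cycle = 0; cycle < 20; cycle++) {
        if (!stall(&link)) push(&link, produced++);        /* upstream operator  */
        if (cycle % 2 == 0 && data_ready(&link))
            consumed = pop(&link);                         /* downstream operator */
    }
    printf("produced %d, last consumed %d, in flight %d\n",
           produced, consumed, link.count);
    return 0;
}
```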
150
Implementing Streaming Dataflow
Data presence variable length connections between operators data rates vary between operator implementations data rates varying between operators Datapath sharing not enough spatial resources to host entire graph balanced use of resources (e.g. operators) cyclic dependencies impacting efficiency Interconnect sharing Interconnects are becoming difficult to route Links between operators infrequently used High variability in operator data rates Streaming coprocessor Extreme resource constraints
151
Datapath Sharing X X +
152
Datapath Sharing Platform may only have one multiplier X X +
153
Datapath Sharing Platform may only have one multiplier X +
154
Datapath Sharing Platform may only have one multiplier REG X REG +
155
Datapath Sharing Platform may only have one multiplier REG X FSM REG +
156
Datapath Sharing Platform may only have one multiplier
REG X FSM REG + Important to keep track of where data is coming from!
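A software analogy of sharing one multiplier: a two-state controller routes different operands through the same multiply on each step and stores the product in the register the follow-on add expects, which is why the slide stresses tracking where the data is coming from. The operand values are made up.

```c
#include <stdio.h>

int main(void) {
    int x = 3, y = 5, a = 7, b = 11;
    int reg0 = 0, reg1 = 0;

    /* FSM state 0: multiplier computes x*y; state 1: multiplier computes a*b. */
    for (int state = 0; state < 2; state++) {
        int op0 = (state == 0) ? x : a;     /* input mux selects the operands   */
        int op1 = (state == 0) ? y : b;
        int prod = op0 * op1;               /* the single shared multiplier     */
        if (state == 0) reg0 = prod;        /* output demux: route the result   */
        else            reg1 = prod;        /* to the register it belongs to    */
    }

    printf("x*y + a*b = %d\n", reg0 + reg1);  /* adder consumes both registers  */
    return 0;
}
```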
157
Implementing Streaming Dataflow
Data presence variable length connections between operators data rates vary between operator implementations data rates varying between operators Datapath sharing not enough spatial resources to host entire graph balanced use of resources (e.g. operators) cyclic dependencies impacting efficiency Interconnect sharing Interconnects are becoming difficult to route Links between operators infrequently used High variability in operator data rates Streaming coprocessor Extreme resource constraints
158
Interconnect sharing X X +
159
Interconnect sharing Need more efficient use of interconnect X X +
160
Interconnect sharing Need more efficient use of interconnect X X +
161
Interconnect sharing Need more efficient use of interconnect X X FSM +
162
Implementing Streaming Dataflow
Data presence variable length connections between operators data rates vary between operator implementations data rates varying between operators Datapath sharing not enough spatial resources to host entire graph balanced use of resources (e.g. operators) cyclic dependencies impacting efficiency Interconnect sharing Interconnects are becoming difficult to route Links between operators infrequently used High variability in operator data rates Streaming coprocessor Extreme resource constraints
163
Streaming coprocessor
See SCORE chapter 9 of text for an example.
164
Sequential Control Typically thought of in the context of sequential programming on a processor (e.g. C, Java programming) Key to organizing synchronization and control over highly parallel operations Time multiplexing resources: when a task is too large for the computing fabric Increasing datapath utilization
165
Sequential Control X + A B C
166
Sequential Control X + A B C A*x2 + B*x + C
167
Sequential Control X + A B C C A B X X + A*x2 + B*x + C A*x2 + B*x + C
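The polynomial from these slides written as one primitive operation per step, roughly the schedule a sequential controller would walk a single multiplier and adder through; the coefficient values are made up, and the Horner's-rule variant in the final comment is a standard rewrite, not something the slides claim.

```c
#include <stdio.h>

int main(void) {
    int A = 2, B = 3, C = 4, x = 5;              /* made-up coefficients      */

    /* One operation per control step, as an FSMD with one mult and one add: */
    int t0 = x * x;        /* step 1: x^2                                     */
    int t1 = A * t0;       /* step 2: A*x^2                                   */
    int t2 = B * x;        /* step 3: B*x                                     */
    int t3 = t1 + t2;      /* step 4: A*x^2 + B*x                             */
    int y  = t3 + C;       /* step 5: + C                                     */

    /* Horner's form (A*x + B)*x + C needs only 2 multiplies and 2 adds.      */
    printf("y = %d (Horner check: %d)\n", y, (A * x + B) * x + C);
    return 0;
}
```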
168
Finite State Machine with Datapath (FSMD)
B X X + A*x2 + B*x + C
169
Finite State Machine with Datapath (FSMD)
B X FSM X + A*x2 + B*x + C
170
Sequential Control: Types
Finite State Machine with Datapath (FSMD) Very Long Instruction Word (VLIW) data path control Processor Instruction augmentation Phased reconfiguration manager Worker farm
171
Very Long Instruction Word (VLIW) Datapath Control
See 5.2 of text for this architecture
172
Processor
173
Instruction Augmentation
174
Phased Configuration Manager
Will see more detail with SCORE architecture from chapter 9 of text.
175
Worker Farm Chapter 5.2 of text
176
Bulk Synchronous Parallelism
See chapter 5.2 for more detail
177
Data Parallel Single Program Multiple Data
Single Instruction Multiple Data (SIMD) Vector Vector Coprocessor
178
Data Parallel
179
Data Parallel
180
Data Parallel
181
Data Parallel
182
Cellular Automata
183
Multi-threaded
184
Next Lecture
185
Questions/Comments/Concerns
Write down Main point of lecture One thing that’s still not quite clear If everything is clear, then give an example of how to apply something from lecture OR
186
Lecture Notes Add CSP/Mulithread as root of a simple tree
15+5(late start) minutes of time left Think of one to two in class exercise (10 min) Data Flow graph optimization algorithm? Dead lock detection on a small model? Give some examples of where a given compute model would map to a given application. Systolic array (implement) or Dataflow compute model) String matching (FSM) (MISD) New image for MP3, too dark of a color
187
Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 14: Fri 10/13/2010 (Streaming Applications) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA
188
Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS
189
Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up to cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea
190
Common Questions
191
Common Questions
192
Common Questions
193
Overview Streaming Applications (Chapters 8 & 9) Simulink SCORE
194
What you should learn Two approaches for implementing streaming applications
195
Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +
196
Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +
197
Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +
198
Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +
199
Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +
200
Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +
201
Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +
202
Data Flow Graph of operators that data (tokens) flows through
Composition of functions Captures: Parallelism Dependences Communication X X +
203
Streaming Application Examples
Some image processing algorithms Edge Detection Image Recognition Image Compression (JPEG) Network data processing String Matching (your MP2 assignment) Sorting??
204
Sorting Initial list of items Split Split Split Sort Sort Sort Sort
merge merge merge
205
Example Tools for Streaming Application Design
Simulink from MATLAB: graphically based SCORE (Stream Computations Organized for Reconfigurable Execution): a programming model
206
Simulink (MatLab) What is it?
MatLab module that allows building and simulating systems through a GUI interface
207
Simulink: Example Model
208
Simulink: Example Model
209
Simulink: Sub-Module
210
Simulink: Example Model
211
Simulink: Example Model
212
Simulink: Example Plot
213
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection
214
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 1 2 1 -2 2 -1 1 -1 -2 -1 Sobel X gradient Sobel Y gradient
215
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection Detect Horizontal Edges Detect Vertical Edges -1 1 1 2 1 -2 2 -1 1 -1 -2 -1 Sobel X gradient Sobel Y gradient
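A straightforward software version of the 3x3 Sobel gradients these slides step through (not the student's FPGA implementation, which would keep a sliding window of rows in shift registers and stream pixels through). The toy image, its size, and the simple |Gx|+|Gy| magnitude are illustrative assumptions.

```c
#include <stdio.h>
#include <stdlib.h>

#define W 8
#define H 8

/* Standard Sobel kernels: Gx responds to intensity changes along x, Gy along y. */
static const int gx[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
static const int gy[3][3] = { { 1, 2, 1}, { 0, 0, 0}, {-1,-2,-1} };

int main(void) {
    int img[H][W] = {0}, edge[H][W] = {0};
    for (int y = 0; y < H; y++)                /* toy input: left half 0,  */
        for (int x = W / 2; x < W; x++)        /* right half 50            */
            img[y][x] = 50;

    for (int y = 1; y < H - 1; y++) {
        for (int x = 1; x < W - 1; x++) {
            int sx = 0, sy = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    sx += gx[dy + 1][dx + 1] * img[y + dy][x + dx];
                    sy += gy[dy + 1][dx + 1] * img[y + dy][x + dx];
                }
            edge[y][x] = abs(sx) + abs(sy);    /* cheap gradient magnitude  */
        }
    }
    printf("edge strength at the boundary: %d\n", edge[1][W / 2]);
    return 0;
}
```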
216
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 -2 2 50 -1 1 50 50 50 50 50 50 50 50
217
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -2 2 50 -1 1 50 50 50 50 50 50 50 50
218
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -2 2 50 -1 1 50 50 50 50 50 50 50 50
219
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 -1 1 50 50 50 50 50 50 50 50
220
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 -1 1 50 50 50 50 50 50 50 50
221
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 -1 1 50 50 50 50 50 50 50 50
222
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -1 1 50 50 50 50 50 50 50 50
223
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -1 1 50 50 50 50 50 50 50 50
224
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 50 50 50
225
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 50 50 50
226
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 50 50 50
227
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 50 50 50
228
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 50 50 50
229
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 50 50 50
230
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 50 50 50
231
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 50 50
232
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 50 50
233
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 50 50
234
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 50 50
235
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 50 50
236
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 50
237
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 50
238
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 50
239
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 50
240
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 50
241
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 -50 50
242
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 -50 50 200
243
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 -50 50 200
244
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 -50 50 200 -200
245
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 -50 50 200 -200
246
Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 -50 50 200 -200
247
Top Level
248
Shifter
249
Multiplier
250
Input Image
251
Output Image
252
SCORE Overview of the SCORE programming approach
Stream Computations Organized for Reconfigurable Execution Developed by University of California Berkeley California Institute of Technology FPL 2000 overview presentation
253
Next Lecture Data Parallel
254
Questions/Comments/Concerns
Write down Main point of lecture One thing that’s still not quite clear If everything is clear, then give an example of how to apply something from lecture OR
255
Lecture Notes
256
Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 15: Fri 10/15/2010 (Reconfiguration Management) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA
257
Announcements/Reminders
Midterm: Take home portion (40%) given Friday 10/22, due Tue 10/26 (midnight) In class portion (60%) Wed 10/27 Distance students will have in class portion given via a timed WebCT (2 hour) session (take on Wed, Thur or Friday). Start thinking of class projects and forming teams Submit teams and project ideas: Mon 10/11 midnight Project proposal presentations: Fri 10/22 MP3: PowerPC Coprocessor offload (today/tomorrow) Problem 2 of HW 2 (released after MP3 gets released)
258
Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS
259
Common Questions
260
Common Questions
261
Overview Chapter 4: Reconfiguration Management
262
What you should learn Some basic configuration architectures
Key issues when managing the reconfiguration of a system
263
Reconfiguration Management
Goal: Minimize the overhead associated with run-time reconfiguration. Why is it important to address? It can take 100's of milliseconds to reconfigure a device; for high-performance applications this can be a large overhead (i.e. decreased performance).
264
High Level Configuration Setups
Externally trigger reconfiguration CPU Configuration Request FPGA ROM (bitfile) Config Data FSM Config Control (CC)
265
High Level Configuration Setups
Self trigger reconfiguration FPGA Config Data ROM (bitfile) FSM CC
266
Configuration Architectures
Single-context Multi-context Partially Reconfigurable Relocation & Defragmentation Pipeline Reconfiguration Block Reconfigurable
267
Single-context FPGA Config clk Config I/F Config Data Config enable
OUT IN OUT IN OUT EN EN EN Config enable
268
Multi-context FPGA 1 1 2 2 3 3 Config clk Context switch Config Config
OUT IN OUT IN EN EN Context switch 1 1 Context 1 Enable 2 2 Context 2 Enable 3 3 Context 3 Enable Config Enable Config Enable Config Data Config Data
269
Partially Reconfigurable
Reduce the amount of configuration data to send to the device, thus decreasing reconfiguration overhead. Need addressable configuration memory, as opposed to single-context daisy-chain shifting. Example: Encryption: change the key and the logic dependent on the key. PR devices: AT40K, Xilinx Virtex series (and Spartan, but not at run time). Need to make sure partial configs do not overlap in space/time (typically a config needs to be placed in a specific location; not as homogeneous as you would think in terms of resources and timing delays).
270
Partially Reconfigurable
271
Partially Reconfigurable
Full Reconfig 10-100’s ms
272
Partially Reconfigurable
Partial Reconfig 100’s us - 1’s ms
273
Partially Reconfigurable
Partial Reconfig 100’s us - 1’s ms
274
Partially Reconfigurable
Partial Reconfig 100’s us - 1’s ms
275
Partially Reconfigurable
Partial Reconfig 100’s us - 1’s ms
276
Partially Reconfigurable
Partial Reconfig 100's us - 1's ms Typically partial configuration modules map to a specific physical location
277
Relocation and Defragmentation
Make configuration architectures support relocatable modules. Example of defragmentation in the text (defrag or swap out: 90% decrease in reconfig time compared to full single context). Placement policies: best fit, first fit, … Limiting factors: routing/logic is heterogeneous (timing issues, need modified routes); special resources needed (e.g. hard multipliers, BRAMs); an easier issue if there are blocks of homogeneity; connection to external I/O (fixed IP cores, board restrictions); virtualized I/O (fixed pins with multiple internal interfaces?); 2D architectures are more difficult to deal with. Summary of features a PR architecture should have: homogeneous logic and routing layout; bus-based communication (e.g. network on chip); 1D organization for relocation.
278
Relocation and Defragmentation
B C
279
Relocation and Defragmentation
280
Relocation and Defragmentation
281
Relocation and Defragmentation
282
Relocation and Defragmentation
283
Relocation and Defragmentation
More efficient use of Configuration Space C A
284
Pipeline Reconfigurable
Example: PipeRench. Simplifies reconfiguration; limits what can be implemented. (Figure: virtual pipeline stages mapped onto a smaller set of physical PE stages cycle by cycle.)
285
Block Reconfigurable Swappable Logic Units
Abstraction layer over a general PR architecture: SCORE Config Data
286
Managing the Reconfiguration Process
Choosing a configuration When to load Where to load Reduce how often one needs to reconfigure, hiding latency
287
Configuration Grouping
What to pack: pack multiple configurations that are related in time into one. Simulated annealing, clustering based on application control flow.
288
Configuration Caching
When to load: LRU, credit-based; dealing with variable-sized configs.
289
Configuration Scheduling
Prefetching: Control flow graph. Static: compiler-inserted configuration instructions. Dynamic: probabilistic approaches, MM (branch prediction). Constraints: Resource, Real-time. Mitigation: system status and prediction; what are the current requests; predict which config combination will give the best speedup.
290
Software-based Relocation Defragmentation
Placing the relocation/defragmentation decision on the host CPU, not on the on-chip configuration controller.
291
Context Switching Save state, then start where it left off.
292
Next Lecture Data Parallel
293
Questions/Comments/Concerns
Write down Main point of lecture One thing that’s still not quite clear If everything is clear, then give an example of how to apply something from lecture OR
294
Lecture Notes
295
Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 16: Fri 10/20/2010 (Data Parallel Architectures) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA
296
Announcements/Reminders
Midterm: Take home portion (40%) given Friday 10/29, due Tue 11/2 (midnight) In class portion (60%) Wed 11/3 Distance students will have in class portion given via a timed WebCT (2 hour) session (take on Wed, Thur or Friday). Start thinking of class projects and forming teams Submit teams and project ideas: Mon 10/11 midnight Project proposal presentations: Fri 10/22 MP3: PowerPC Coprocessor offload (today): Problem 2 of HW 2 (released after MP3 gets released)
297
Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS
298
Common Questions
299
Common Questions
300
Overview Data Parallel Architectures: MP3 Demo/Overview
Chapters 5.2.4, and chapter 10 MP3 Demo/Overview
301
What you should learn Data Parallel Architecture basics
Flexibility Reconfigurable Hardware adds
302
Data Parallel Architectures
303
Data Parallel Architectures
304
Data Parallel Architectures
305
Data Parallel Architectures
306
Data Parallel Architectures
307
Data Parallel Architectures
308
Data Parallel Architectures
309
Data Parallel Architectures
310
Data Parallel Architectures
311
Data Parallel Architectures
312
Next Lecture Project initial presentations.
313
Questions/Comments/Concerns
Write down Main point of lecture One thing that’s still not quite clear If everything is clear, then give an example of how to apply something from lecture OR
314
Lecture Notes
315
Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 17: Fri 10/22/2010 (Initial Project Presentations) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA
316
Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS
317
Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up to cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea
318
Common Questions
319
Common Questions
320
Overview Present Project Ideas
321
Projects
322
Next Lecture Fixed Point Math and Floating Point Math
323
Questions/Comments/Concerns
Write down Main point of lecture One thing that’s still not quite clear If everything is clear, then give an example of how to apply something from lecture OR
324
Lecture Notes
325
Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 18: Fri 10/27/2010 (Floating Point) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA
326
Announcements/Reminders
Midterm: Take home portion (40%) given Friday 10/29 (released today by 5pm), due Tue 11/2 (midnight) In class portion (60%) Wed 11/3 Distance students will have in class portion given via a timed WebCT (2 hour) session (take on Wed, Thur or Friday). Problem 2 of HW 2 (released soon)
327
Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS
328
Common Questions
329
Common Questions
330
Overview Floating Point on FPGAs (Chapter 21.4 and 31)
Why is it viewed as difficult?? Options for mitigating issues
331
Floating Point Format (IEEE-754)
Single Precision: S (1 bit), exp (8 bits), Mantissa (23 bits)
Mantissa = b-1 b-2 b-3 ... b-23, with value = sum for i = 1 to 23 of b-i * 2^-i
Floating point value = (-1)^S * 2^(exp-127) * (1.Mantissa)
Example: S = 0, exp = x"80" (128), Mantissa = "11" followed by zeros (x"600000")
= (-1)^0 * 2^(128-127) * 1.(1/2 + 1/4) = (-1)^0 * 2^1 * 1.75 = 3.5
Double Precision: S (1 bit), exp (11 bits), Mantissa (52 bits)
Floating point value = (-1)^S * 2^(exp-1023) * (1.Mantissa)
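As a cross-check of the field layout, a small sketch that pulls the sign, exponent, and mantissa out of a float's bit pattern and rebuilds the value with the formula above. It handles normal values only; the zero, denormal, infinity, and NaN cases from the later slides are deliberately ignored.

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <math.h>

int main(void) {
    float f = 3.5f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);            /* reinterpret the 32 bits      */

    uint32_t sign = bits >> 31;                /* 1 bit                        */
    uint32_t exp  = (bits >> 23) & 0xFF;       /* 8 bits, biased by 127        */
    uint32_t man  = bits & 0x7FFFFF;           /* 23 bits, implicit leading 1  */

    double value = (sign ? -1.0 : 1.0) *
                   ldexp(1.0 + man / 8388608.0, (int)exp - 127); /* 2^23 = 8388608 */

    printf("sign=%u exp=0x%02X mantissa=0x%06X -> %g\n",
           (unsigned)sign, (unsigned)exp, (unsigned)man, value);
    return 0;
}
```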
332
Fixed Point Whole Fractional bW-1 … b1 b0 b-1 b-2 …. b-F
Example formats (W.F): 5.5, 10.12, 3.7
Example fixed point 5.5 format: 00010.01100 = 2 + 1/4 + 1/8 = 2.375
Compare floating point and fixed point:
Floating point: 0 x"80" "110" x"00000" = 3.5
10-bit (format 3.7) fixed point for 3.5 = ?
333
Fixed Point (Addition)
Whole Fractional Operand 1 Whole Fractional Operand 2 + Whole Fractional sum
334
Fixed Point (Addition)
11-bit 4.7 format Operand 1 = 3.875 + Operand 2 = 1.625 sum = 5.5 You can use a standard ripple-carry adder!
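The same 4.7-format addition done in software: each value is stored as an integer count of 1/128 steps, so an ordinary integer add is exactly the ripple-carry add the slide mentions. The helper names and the double-based conversion are only for illustration; overflow checking is omitted.

```c
#include <stdio.h>
#include <stdint.h>

#define FRAC_BITS 7              /* 4.7 format: 4 whole bits, 7 fraction bits */

static int16_t to_fixed(double x)   { return (int16_t)(x * (1 << FRAC_BITS)); }
static double  to_double(int16_t q) { return (double)q / (1 << FRAC_BITS); }

int main(void) {
    int16_t a = to_fixed(3.875);          /* 3.875 * 128 = 496                */
    int16_t b = to_fixed(1.625);          /* 1.625 * 128 = 208                */
    int16_t s = (int16_t)(a + b);         /* plain integer (ripple-carry) add */

    printf("%.3f + %.3f = %.3f (raw %d + %d = %d)\n",
           to_double(a), to_double(b), to_double(s), a, b, s);
    return 0;
}
```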
335
Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 +
336
Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + Common exponent (i.e. align binary point) Make x”80” -> x”7F” or vice versa?
337
Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + Common exponent (i.e. align binary point) Make x”7F”->x”80”, lose least significant bits of Operand 2 - Add the difference of x”80” – x“7F” = 1 to x”7F” - Shift mantissa of Operand 2 by difference to the right. remember “implicit” 1 of the original mantissa 0 x”80” x”80000” Operand 1 = 3.875 0 x”80” x”80000” Operand 2 = 1.625 +
338
Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + Add mantissas 0 x”80” x”80000” Operand 1 = 3.875 0 x”80” x”80000” Operand 2 = 1.625 +
339
Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + Add mantissas 0 x”80” x”80000” Operand 1 = 3.875 0 x”80” x”80000” Operand 2 = 1.625 + Overflow! x”00000”
340
Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + Add mantissas You can’t just overflow mantissa into exponent field You are actually overflowing the implicit “1” of Operand 1, so you sort of have an implicit “2” (i.e. “10”). 0 x”80” x”80000” Operand 1 = 3.875 0 x”80” x”80000” Operand 2 = 1.625 + Overflow! x”00000”
341
Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + Add mantissas Deal with overflow of Mantissa by normalizing. Shift mantissa right by 1 (shift a “0” in because of implicit “2”) Increment exponent by 1 0 x”80” x”80000” Operand 1 = 3.875 0 x”80” x”80000” Operand 2 = 1.625 + 0 x”81” x”00000”
342
Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + 0 x”81” x”00000” = 5.5 Add mantissas Deal with overflow of Mantissa by normalizing. Shift mantissa right by 1 (shift a “0” in because of implicit “2”) Increment exponent by 1 0 x”80” x”80000” Operand 1 = 3.875 0 x”80” x”80000” Operand 2 = 1.625 + 0 x”81” x”00000”
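The walkthrough above can be condensed into a small software sketch. This is a simplified model for positive, normalized single-precision inputs only (no rounding, subtraction, specials, or denormals); the function name fp_add and the test constants are illustrative.

#include <stdio.h>
#include <stdint.h>

/* Simplified single-precision add for positive, normalized inputs:
   1) align exponents (shift the smaller operand's mantissa right),
   2) add mantissas, 3) renormalize if the sum overflows past the implicit 1.
   Assumes a small exponent difference so the right shift stays under 32. */
static uint32_t fp_add(uint32_t a, uint32_t b) {
    uint32_t ea = (a >> 23) & 0xFF, eb = (b >> 23) & 0xFF;
    uint32_t ma = (a & 0x7FFFFF) | 0x800000;   /* restore the implicit leading 1 */
    uint32_t mb = (b & 0x7FFFFF) | 0x800000;

    if (ea < eb) {                              /* make operand a the larger-exponent one */
        uint32_t t = ea; ea = eb; eb = t;
        t = ma; ma = mb; mb = t;
    }
    mb >>= (ea - eb);                           /* align binary points (drops low bits) */

    uint32_t m = ma + mb;                       /* up to 25 bits */
    uint32_t e = ea;
    if (m & 0x1000000) { m >>= 1; e++; }        /* mantissa overflow: shift right, bump exponent */

    return (e << 23) | (m & 0x7FFFFF);
}

int main(void) {
    uint32_t s = fp_add(0x40780000u /* 3.875 */, 0x3FD00000u /* 1.625 */);
    printf("sum bits = 0x%08X\n", s);           /* 0x40B00000 = 5.5 */
    return 0;
}

Running it on 3.875 and 1.625 reproduces the result of the walkthrough: exponent x"81", giving 5.5.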
343
Floating Point (Addition): Other concerns
Special values (single precision):
Zero: sign 0/1, exponent all 0s, mantissa 0
+Infinity: sign 0, exponent MAX (all 1s), mantissa 0
-Infinity: sign 1, exponent MAX, mantissa 0
NaN: sign 0/1, exponent MAX, mantissa non-zero
Denormal: sign 0/1, exponent all 0s, mantissa non-zero
(Single precision fields: S = 1 bit, exp = 8 bits, Mantissa = 23 bits)
344
Floating Point (Addition): High-level Hardware
[Block diagram: floating-point adder datapath. Compare exponents (difference / greater-than), swap and right-shift the smaller operand's mantissa to align, add/subtract the mantissas, use a priority encoder to find the leading 1, left-shift to normalize and adjust the exponent, then round and handle denormals.]
345
Floating Point: Both Xilinx and Altera supply floating-point soft cores (which I believe are IEEE-754 compliant), so don't be too afraid if you need floating point in your class projects. There should also be floating-point open cores that are freely available.
346
Fixed Point vs. Floating Point
Floating Point advantages: The application designer does not have to think "much" about the math. The floating-point format supports a wide range of numbers (+/- 3x10^38 to +/- 1x10^-38, single precision). If IEEE-754 compliant, it is easier to accelerate existing floating-point based applications. Floating Point disadvantages: Ease of use comes at great hardware expense: a 32-bit fixed-point add takes ~32 DFFs + 32 LUTs, while a 32-bit single-precision floating-point add takes ~250 DFFs/LUTs, about 10x more resources, and thus 1/10 the possible best-case parallelism. Floating point typically needs a massive pipeline to achieve high clock rates (i.e. high throughput). There are no hard resources, such as the carry-chain, to take advantage of.
347
Fixed Point vs. Floating Point
Range example: Floating Point vs. Fixed Point advantages: Some exception with respect to precision
348
Mitigating Floating Point Disadvantages
Only support a subset of the IEEE-754 standard; software could be used to off-load special cases. Modify the floating-point format to support a smaller data type (e.g. 18-bit instead of 32-bit). Link to Cornell class: Add hardware support in the FPGA for floating point: hard multipliers (added by companies in the early 2000s); Altera: hard shared paths for floating point (Stratix-V, 2011). How to get 1-TFLOP throughput on FPGAs article: achieve-1-trillion-floating-point-operations-per-second-in-an-FPGA
349
Mitigating Fixed Point Disadvantages (21.4)
Block Floating Point (mitigating range issue)
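A minimal sketch of the block floating point idea, assuming 16-bit mantissas and one shared exponent per block of four values; the block size, bit widths, and helper names are illustrative. Note how the small value 0.02 loses its precision once the block exponent is set by the largest element.

#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* Block floating point: one shared exponent per block of fixed-point values.
   Here each value is a 16-bit mantissa interpreted as m * 2^(exp-14). */
#define N 4

static void encode_block(const double *x, int16_t *m, int *exp_out) {
    double maxv = 0.0;
    for (int i = 0; i < N; i++) if (fabs(x[i]) > maxv) maxv = fabs(x[i]);
    int e = (maxv > 0.0) ? (int)ceil(log2(maxv)) : 0;      /* shared block exponent */
    for (int i = 0; i < N; i++)
        m[i] = (int16_t)lround(x[i] * ldexp(1.0, 14 - e));  /* quantize to 14 fractional bits */
    *exp_out = e;
}

int main(void) {
    double x[N] = { 0.02, -1.5, 800.0, 3.25 };
    int16_t m[N]; int e;
    encode_block(x, m, &e);
    for (int i = 0; i < N; i++)
        printf("%8g stored as %6d, decodes to %g\n", x[i], m[i], m[i] * ldexp(1.0, e - 14));
    return 0;
}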
350
CPU/FPGA/GPU reported FLOPs
Block Floating Point (mitigating range issue)
351
Next Lecture Mid-term Then on Friday: Evolvable Hardware
352
Questions/Comments/Concerns
Write down: the main point of the lecture, and one thing that is still not quite clear; OR, if everything is clear, give an example of how to apply something from the lecture.
353
Lecture Notes Altera App Notes on computing FLOPs for Stratix-III
Altera old app Notes on floating point add/mult
354
Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 19: Fri 11/5/2010 (Evolvable Hardware) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA
355
Announcements/Reminders
MP3: Extended due date until Monday midnight. Those that finish by Friday (11/5) midnight get a bonus of +1% per day before the new deadline. If after Friday midnight, no bonus but no penalty. 10% deduction after Monday midnight, and an additional -10% each day late. Problem 2 of HW 2 (will now be called HW3): released by Sunday midnight, will be due Monday 11/22 midnight. Turn in weekly project report (tonight midnight)
356
Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS
357
What you should learn Understand Evolvable Hardware basics
Benefits and Drawbacks Key types/categories
358
Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream
359
Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream GATACA
360
Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream GATACA GATAGA
361
Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream GATACA GATAGA
362
Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream
363
Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream
364
Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream DFF DFF
365
Classifying Adaption/Evolution
Phylogeny Ontogeny Epigenesis (POE). Phylogeny: evolution through recombination and mutations (biological reproduction : Genetic Algorithms). Ontogeny: self-replication (multicellular organism's cell division : Cellular Automata). Epigenesis: adaptation triggered by the external environment (immune system development : Artificial Neural Networks).
366
Classifying Adaption/Evolution
Phylogeny Ontogeny Epigenesis (POE). Phylogeny: evolution through recombination and mutations (biological reproduction : Genetic Algorithms). Ontogeny: self-replication (multicellular organism's cell division : Cellular Automata). Epigenesis: adaptation triggered by the external environment (immune system development : Artificial Neural Networks). [Diagram: the three POE axes: Phylogeny, Ontogeny, Epigenesis.]
367
Artificial Evolution 30/40 year old concept. But applying to reconfigurable hardware is newish (1990’s) Evolutionary Algorithms (EAs) Genetic Algorithms Genetic Programming Evolution Strategies Evolutionary programming
368
Artificial Evolution 30/40 year old concept. But applying to reconfigurable hardware is newish (1990’s) Evolutionary Algorithms (EAs) Genetic Algorithms Genetic Programming Evolution Strategies Evolutionary programming
369
Artificial Evolution 30/40 year old concept. But applying to reconfigurable hardware is newish (1990’s) Evolutionary Algorithms (EAs) Genetic Algorithms Genetic Programming Evolution Strategies Evolutionary programming
370
Genetic Algorithms. Genome: a finite string of symbols encoding an individual. Phenotype: the decoding of the genome to realize the individual. Constant size population. Generic steps: Initial population, Decode, Evaluate (must define a fitness function), Selection, Mutation, Cross over
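A minimal software sketch of these generic steps, assuming 16-bit genomes, a trivial decode (the genome is the phenotype), tournament selection, one-point crossover, and a toy count-the-ones fitness function; the population size, rates, and names are illustrative placeholders.

#include <stdio.h>
#include <stdlib.h>

#define POP  6
#define GENS 50

static int fitness(unsigned g) {            /* toy fitness: number of 1 bits */
    int f = 0;
    while (g) { f += g & 1; g >>= 1; }
    return f;
}

int main(void) {
    unsigned pop[POP];
    for (int i = 0; i < POP; i++) pop[i] = (unsigned)rand() & 0xFFFFu;  /* initial population */

    for (int gen = 0; gen < GENS; gen++) {
        unsigned next[POP];
        for (int i = 0; i < POP; i++) {
            unsigned a = pop[rand() % POP], b = pop[rand() % POP];      /* selection (tournament) */
            unsigned p1 = fitness(a) >= fitness(b) ? a : b;
            a = pop[rand() % POP]; b = pop[rand() % POP];
            unsigned p2 = fitness(a) >= fitness(b) ? a : b;

            int cut = rand() % 16;                                      /* one-point crossover */
            unsigned child = (p1 & (0xFFFFu << cut)) | (p2 & ~(0xFFFFu << cut));
            if (rand() % 100 < 5) child ^= 1u << (rand() % 16);         /* 5% mutation */
            next[i] = child;
        }
        for (int i = 0; i < POP; i++) pop[i] = next[i];                 /* next generation */
    }

    int best = 0;
    for (int i = 0; i < POP; i++) if (fitness(pop[i]) > best) best = fitness(pop[i]);
    printf("best fitness after %d generations: %d\n", GENS, best);
    return 0;
}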
371
Initialize Population
Genetic Algorithms Initialize Population Evaluate Decode Next Generation Selection Cross Over Mutation
372
Initialize Population
Genetic Algorithms Initialize Population Evaluate Decode ( ) Next Generation Selection Cross Over Mutation
373
Initialize Population
Genetic Algorithms Initialize Population (.40) (.70) (.20) (.10) (.10) (.60) Evaluate Decode ( ) Next Generation Selection Cross Over Mutation
374
Initialize Population
Genetic Algorithms Initialize Population (.40) (.70) (.20) (.10) (.10) (.60) Evaluate Decode ( ) Next Generation Selection Cross Over (.40) (.70) (.60) Mutation
375
Initialize Population
Genetic Algorithms Initialize Population (.40) (.70) (.20) (.10) (.10) (.60) Evaluate Decode ( ) Next Generation Selection Cross Over (.40) (.70) (.60) Mutation
376
Initialize Population
Genetic Algorithms Initialize Population (.40) (.70) (.20) (.10) (.10) (.60) Evaluate Decode ( ) Next Generation Selection Cross Over (.40) (.70) (.60) Mutation
377
Initialize Population
Genetic Algorithms Initialize Population (.40) (.70) (.20) (.10) (.10) (.60) Evaluate Decode ( ) Next Generation Selection Cross Over (.40) (.70) (.60) Mutation
378
Evolvable Hardware Platform
379
Genetic Algorithms GA are a type of guided search
Why use a guided search? Why not just do an exhaustive search?
380
Genetic Algorithms GA are a type of guided search
Why use a guided search? Why not just do an exhaustive search? Assume 1 billion individuals can be evaluated a second. The genome of an individual is 32 bits in size. How long to do an exhaustive search?
381
Genetic Algorithms GA are a type of guided search
Why use a guided search? Why not just do an exhaustive search? Assume 1 billion individuals can be evaluated a second. Now the genome of an individual is an FPGA configuration bitstream, 1,000,000 bits in size. How long to do an exhaustive search?
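For a rough sense of scale: a 32-bit genome has 2^32 ≈ 4.3 x 10^9 possible individuals, so at 10^9 evaluations per second an exhaustive search finishes in roughly 4.3 seconds. A genome the size of an FPGA configuration bitstream (~1,000,000 bits) has 2^1,000,000 ≈ 10^301,030 possible individuals, so an exhaustive search is hopeless at any evaluation rate; hence the guided search.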
382
Evolvable Hardware Taxonomy
Extrinsic Evolution (furthest from biology) Evolution done in SW, then result realized in HW Intrinsic Evolution HW is used to deploy individuals Results are sent back to SW for fitness calculation Complete Evolution Evolution is completely done on target HW device Open-ended Evolution (closest to biology) Evaluation criteria changes dynamically Phylogeny Epigenesis Ontogeny
383
Evolvable Hardware Applications
Prosthetic Hand controller chip Kajitani “An Evolvable Hardware Chip for Prostatic Hand Controller”, 1999
384
Evolvable Hardware Applications
Tone Discrimination and Frequency generation Adrian Thompson “Silicon Evolution”, 1996 Xilinx XC6200
385
Evolvable Hardware Applications
Tone Discrimination and Frequency generation Node Functions Node Genotype
386
Evolvable Hardware Applications
Tone Discrimination and Frequency generation Evolved 4KHz oscillator
387
Evolvable Hardware Issues?
388
Evolvable Hardware Issues?
389
Evolvable Hardware Platforms
Commercial Platforms Xilinx XC6200 Completely multiplex base, thus could program random bitstreams dynamically without damaging chip Xilinx Virtex FPGA Custom Platforms POEtic cell Evolvable LSI chip (Higuchi)
390
Next Lecture Overview the synthesis process
391
Notes Notes
392
Adaptive Thermoregulation for Applications on Reconfigurable Devices
Phillip Jones Applied Research Laboratory Washington University Saint Louis, Missouri, USA Iowa State University Seminar April 2008 Funded by NSF Grant ITR
393
What are FPGAs? FPGA: Field Programmable Gate Array
Sea of general purpose logic gates CLB Configurable Logic Block
394
What are FPGAs? FPGA: Field Programmable Gate Array
Sea of general purpose logic gates CLB CLB CLB Configurable Logic Block CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB
395
What are FPGAs? FPGA: Field Programmable Gate Array
Sea of general purpose logic gates CLB CLB Configurable Logic Block CLB CLB CLB CLB CLB CLB
396
FPGA Usage Models Partial Reconfiguration Fast Prototyping System on
Experimental ISA Experimental Micro Architectures Run-time adaptation Run-time Customization CPU + Specialized HW - Sparc-V8 Leon Partial Reconfiguration Fast Prototyping System on Chip (SoC) Parallel Applications Full Reconfiguration Image Processing Computational Biology Remote Update Fault Tolerance
397
Some FPGA Details CLB CLB CLB CLB
398
Some FPGA Details CLB CLB CLB 4 input Look Up Table 0000 0001 1110
1111 ABCD Z Z A LUT B C D
399
Some FPGA Details CLB CLB CLB Z A LUT B C D ABCD Z 0000 0001 1110 1111
1 A AND Z 4 input Look Up Table B C D
400
Some FPGA Details CLB CLB CLB Z A LUT B C D ABCD Z 0000 0001 1110 1111
1 A OR Z 4 input Look Up Table B C D
401
Some FPGA Details CLB CLB CLB Z A LUT B C D ABCD Z B X000 X001 X110
1 Z 4 input Look Up Table C 2:1 Mux D
402
Some FPGA Details CLB CLB CLB Z A LUT B C D
403
Some FPGA Details CLB CLB PIP Programmable Interconnection Point CLB Z
LUT DFF B C D
404
Some FPGA Details CLB CLB PIP Programmable Interconnection Point CLB Z
LUT DFF B C D
405
Outline Why Thermal Management? Measuring Temperature
Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions
406
Why Thermal Management?
407
Why Thermal Management?
Location? Hot Cold Regulated
408
Why Thermal Management?
Mobile? Hot Cold Regulated
409
Why Thermal Management?
Reconfigurability FPGA Plasma Physics Microcontroller
410
Why Thermal Management?
Exceptional Events
411
Why Thermal Management?
Exceptional Events
412
Local Experience Thermally aggressive application
Disruption of air flow
413
Damaged Board (bottom view)
Thermally aggressive application Disruption of air flow
414
Damaged Board (side view)
Thermally aggressive application Disruption of air flow
415
Response to catastrophic thermal events
Easy Fix Not Feasible!! Very Inconvenient
416
Solutions Over provision Use thermal feedback
Over provision: large heat sinks and fans. Restrict performance: limit the operating frequency, limit the amount of chip utilization. Use thermal feedback (my approach): dynamic operating frequency, adaptive computation, shut down the device.
417
Outline Why Thermal Management? Measuring Temperature
Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions
418
Measuring Temperature
FPGA
419
Measuring Temperature
FPGA A/D 60 C
420
Background: Measuring Temperature
FPGA. S. Lopez-Buedo, J. Garrido, and E. Boemo, "Thermal testing on reconfigurable computers," IEEE Design and Test of Computers, vol. 17, 2000. [Diagram: ring-oscillator period vs. temperature.]
421
Background: Measuring Temperature
FPGA. [Diagram: ring-oscillator period vs. temperature.]
422
Background: Measuring Temperature
FPGA. [Diagram: ring-oscillator period vs. temperature.]
423
Background: Measuring Temperature
FPGA. S. Lopez-Buedo, J. Garrido, and E. Boemo, "Thermal testing on reconfigurable computers," IEEE Design and Test of Computers, vol. 17, 2000. [Diagram: ring-oscillator period vs. temperature and supply voltage.]
424
Background: Measuring Temperature
FPGA. [Diagram: ring-oscillator period vs. temperature and supply voltage.]
425
Background: Measuring Temperature
FPGA. “Adaptive Thermoregulation for Applications on Reconfigurable Devices”, by Phillip H. Jones, James Moscola, Young H. Cho, and John W. Lockwood; Field Programmable Logic and Applications (FPL’07), Amsterdam, Netherlands. [Diagram: ring-oscillator period vs. temperature and supply voltage.]
426
Background: Measuring Temperature
FPGA Mode 1 Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: High
427
Background: Measuring Temperature
FPGA Mode 1 Mode 2 Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: High
428
Background: Measuring Temperature
FPGA Mode 3 Mode 1 Mode 2 Core 1 Core 2 70C Temperature 40C Core 3 Core 4 Period 8,000 8,300 Frequency: Low Frequency: High
429
Background: Measuring Temperature
FPGA Mode 3 Mode 1 Mode 2 Pause Sample Controller Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: High
430
Background: Measuring Temperature
FPGA Mode 3 Mode 1 Mode 2 Pause Time out Counter Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: High
431
Background: Measuring Temperature
FPGA Mode 3 Mode 1 Mode 2 Pause Time out Counter 2 5 3 1 4 5 2 3 1 Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: Low Frequency: High
432
Background: Measuring Temperature
FPGA Mode 3 Mode 1 Mode 2 Pause Time out Counter 3 2 5 1 4 5 3 1 2 3 Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: Low Frequency: High
433
Background: Measuring Temperature
FPGA Mode 2 1 3 Sample Mode Pause Time out Counter 2 1 5 4 3 5 2 3 3 1 Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: High
434
Temperature Benchmark Circuits
Desired Properties: Scalable Work over a wide range of frequencies Can easily increase or decrease circuit size Simple to analyze Regular structure Distributes evenly over chip Help reduce thermal gradients that may cause damage to the chip May serve as standard Further experimentation Repeatability of results “A Thermal Management and Profiling Method for Reconfigurable Hardware Applications”, by Phillip H. Jones, John W. Lockwood, and Young H. Cho; Field Programmable Logic and Applications (FPL’06), Madrid, Spain,
435
Temperature Benchmark Circuits
LUT 00 70 05 75 DFF Core Block (CB): Array of 48 LUTs and 48 DFF
436
Temperature Benchmark Circuits
RLOC: Row, Col 0 , 0 7 , 5 AND 00 70 05 75 DFF Core Block (CB): Array of 48 LUTs and 48 DFF Each LUT configured to be a 4-input AND gate 8 Input Gen Array of 18 core blocks (864 LUTs, 864 DFFs) (1 LUT, 1 DFF) Thermal workload unit: Computation Row CB 0 CB 17 CB 1 CB 16
437
Temperature Benchmark Circuits
RLOC: Row, Col 0 , 0 7 , 5 AND 00 70 05 75 DFF Core Block (CB): Array of 48 LUTs and 48 DFF Each LUT configured to be a 4-input AND gate RLOC_ORIGIN: Row, Col 100% Activation Rate Thermal workload unit: Computation Row 01 Input Gen CB 0 CB 1 CB 16 CB 17 00 1 1 8 8 (1 LUT, 1 DFF) Array of 18 core blocks (864 LUTs, 864 DFFs)
438
Example Circuit Layout (Configuration 1x, 9% LUTs and DFFs)
RLOC_ORIGIN: Row, Col (27,6) Thermal Workload Unit
439
Example Circuit Layout (Configuration 4x, 36% LUTs and DFFs)
440
Observed Temperature vs. Frequency
T ~ P, P ~ F*C*V^2. [Plot: steady-state temperature vs. operating frequency for configurations Cfg1x, Cfg2x, Cfg4x, and Cfg10x.]
441
Observed Temperature vs. Active Area
T ~ P, P ~ F*C*V^2. [Plot: steady-state temperature vs. active area at 10, 25, 50, 100, and 200 MHz; max rated Tj = 85 C.]
442
Projecting Thermal Trajectories
Estimate steady-state temperature (5.4±.5): Tj_ss = Power * θjA + TA, where θjA is the FPGA thermal resistance (ºC/W); use the measured power at t=0. Specific exponential equation: Temperature(t) = ½*(-41*e^(-t/20) + 71) + ½*(-41*e^(-t/180) + 71)
443
Projecting Thermal Trajectories
Estimate steady-state temperature (5.4±.5): Tj_ss = Power * θjA + TA, where θjA is the FPGA thermal resistance (ºC/W); use the measured power at t=0. How long until 60 C? Exploit this phase for performance. Specific exponential equation: Temperature(t) = ½*(-41*e^(-t/20) + 71) + ½*(-41*e^(-t/180) + 71)
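A small C sketch that evaluates the exponential model above and scans for the first time the projected temperature reaches 60 C; the time step and scan range are arbitrary choices for illustration.

#include <stdio.h>
#include <math.h>

/* Evaluate the two-time-constant model from the slide:
   T(t) = 0.5*(-41*e^(-t/20) + 71) + 0.5*(-41*e^(-t/180) + 71). */
static double temp_at(double t) {
    return 0.5 * (-41.0 * exp(-t / 20.0) + 71.0)
         + 0.5 * (-41.0 * exp(-t / 180.0) + 71.0);
}

int main(void) {
    for (double t = 0.0; t < 2000.0; t += 0.1) {
        if (temp_at(t) >= 60.0) {
            printf("projected to reach 60 C at about t = %.1f s\n", t);
            break;
        }
    }
    printf("projected steady state: %.1f C\n", temp_at(1.0e6));
    return 0;
}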
444
Thermal Shutdown Max Tj (70C)
445
Outline Why Thermal Management? Measuring Temperature
Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions
446
Image Correlation Application
Template
447
Image Correlation Application Virtex-4 100FX Resource Utilization
Heats FPGA a lot! (> 85 C)
Virtex-4 100FX resource utilization: Lookup Tables (LUTs) 57,461 (68%); D Flip Flops (DFFs) 49,148 (58%); Occupied Slices 32,868 (77%); Block RAM 44 (11%); Max Frequency 200 MHz
448
Application Infrastructure Temperature Sample Controller
Thermoregulation Controller Pause 65 C Application Mode “Adaptive Thermoregulation for Applications on Reconfigurable Devices”, by Phillip H. Jones, James Moscola, Young H. Cho, and John W. Lockwood; Field Programmable Logic and Applications (FPL’07), Amsterdam, Netherlands
449
Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Image Buffer Mode Image Processor Core 1 Mask 1 2 Image Processor Core 3 Image Processor Core 2 Mask 1 2 Image Processor Core 4 Mask 1 2 Mask 2 Mask 1 Score Out
450
Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Frequency Quality Image Buffer 200 Mode MHz 8 Image Processor Core 1 Mask 1 2 Image Processor Core 3 Mask 1 2 Image Processor Core 2 Mask 1 2 Image Processor Core 4 Mask 1 2 Score Out
451
Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Frequency Quality Image Buffer 200 MHz 8 Image Processor Core 1 Mask 1 2 Image Processor Core 2 Mask 1 2 Image Processor Core 3 Image Processor Core 4 Mask 1 2 Mask 2 Mask 1 High Priority Features Low Priority Features Score Out
452
Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Frequency Quality Image Buffer 200 180 150 100 MHz 8 Image Processor Core 1 Mask 1 2 Image Processor Core 2 Mask 1 2 Image Processor Core 3 Image Processor Core 4 Mask 1 2 Mask 2 Mask 1 High Priority Features Low Priority Features Score Out
453
Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Frequency Quality Image Buffer 100 75 50 MHz MHz 8 Image Processor Core 1 Mask 1 2 Image Processor Core 2 Mask 1 2 Image Processor Core 3 Image Processor Core 4 Mask 1 2 Mask 2 Mask 1 High Priority Features Low Priority Features Score Out
454
Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Frequency Quality Image Buffer 50 MHz 6 4 5 7 8 Image Processor Core 1 Mask 1 2 Image Processor Core 2 Mask 1 2 Image Processor Core 3 Image Processor Core 4 Mask 2 Mask 1 Mask 2 Mask 1 Mask 2 Mask 2 High Priority Features Low Priority Features Score Out
455
Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Frequency Quality Image Buffer 75 100 180 150 50 200 MHz MHz 4 7 8 6 5 Image Processor Core 1 Mask 1 2 Image Processor Core 2 Mask 1 2 Image Processor Core 3 Image Processor Core 4 Mask 1 Mask 2 Mask 1 Mask 2 High Priority Features Low Priority Features Score Out
456
Thermally Adaptive Frequency
High Frequency Thermal Budget = 72 C “An Adaptive Frequency Control Method Using Thermal Feedback for Reconfigurable Hardware Applications”, by Phillip H. Jones, Young H. Cho, and John W. Lockwood; Field Programmable Technology (FPT’06), Bangkok, Thailand Junction Temperature, Tj (C) Low Frequency Low Threshold = 67 C Time (s)
457
Thermally Adaptive Frequency
Thermal Budget = 72 C High Frequency Low Frequency Low Threshold = 67 C Junction Temperature, Tj (C) Time (s)
458
Thermally Adaptive Frequency
Thermal Budget = 72 C High Frequency Low Frequency Low Threshold = 67 C Junction Temperature, Tj (C) S. Wang (“Reactive Speed Control”, ECRTS06) Time (s)
459
Outline Why Thermal Management? Measuring Temperature
Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions
460
Platform Overview Virtex-4 FPGA Temperature Probe
461
Thermal Budget Efficiency
[Bar chart: junction temperature (C) vs. thermal condition (40 C, 35 C, 30 C, 25 C ambient with 0 fans; 25 C with 1 fan; 25 C with 2 fans), comparing an adaptive design (4 features, frequencies ranging from 50 MHz up to 200 MHz) against a fixed 50 MHz design, under a 65 C adaptive thermal budget; the fixed design leaves thermal budget unused.]
462
Conclusions Motivated the need for thermal management
Measuring temperature Application dependent voltage variations effects. Temperature benchmark circuits Examined application specific adaptation for improving performance in dynamic thermal environments
463
Outline Why Thermal Management? Measuring Temperature
Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions
464
Thermally Constrained Systems
Space Craft Sun Earth
465
Thermally Constrained Systems
466
Temperature-Safe Real-time Systems
Task scheduling is a concern in many embedded systems Goal: Satisfy thermal constraints without violating real-time constraints
467
How to manage temperature?
Static frequency scaling Sleep while idle Time T1 T2 T3 T1 T2 T3 Time
468
How to manage temperature?
Static frequency scaling Sleep while idle Time T1 T2 T3 Too hot? Deadlines could be missed T1 T2 T3 Idle Time
469
How to manage temperature?
Static frequency scaling Sleep while idle Time T1 T2 T3 Deadlines could be missed T1 T2 T3 Idle Idle Idle Time Generalization: Idle task insertion
470
Idle Task Insertion More Powerful
Tasks scheduled at F_max (100 MHz):
Task 1: Period 30 s, Cost 10.0 s, Deadline 10.0 s, Utilization 33.33%
Task 2: Period 120 s, Cost 30.0 s, Deadline 120 s, Utilization 25.00%
Task 3: Period 480 s, Cost 30.0 s, Deadline 480 s, Utilization 6.25%
Task 4: Period 960 s, Cost 20.0 s, Deadline 960 s, Utilization 2.08%
Total utilization: 66.66%. (Task 1's deadline equals its cost, so the frequency cannot be scaled or the task schedule becomes infeasible.)
a. No idle task inserted.
Tasks scheduled at F_max (100 MHz), 1 idle task: the same four tasks plus an idle task with Period 60 s and Cost 20.0 s (Utilization 33.33%), giving a total utilization of 99.99%.
b. 1 idle task inserted.
Idle task insertion: no impact on the tasks' cost; higher-priority task response times are unaffected; allows control over the distribution of idle time.
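As a check on the utilizations: 10/30 + 30/120 + 30/480 + 20/960 = 33.33% + 25.00% + 6.25% + 2.08% = 66.66%, and an idle task with cost 20 s and period 60 s adds 20/60 = 33.33%, giving the 99.99% total.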
471
Sleep when idle is insufficient
Temperature constraint = 65 C Peak Temperature = 70 C
472
Idle-task inserted Temperature constraint = 65 C
Peak Temperature = 61 C
473
Idle-Task Insertion + Deadlines Temperature met? Yes No System
(task set) Idle tasks Scheduler (e.g. RMS) + Deadlines met? Temperature Yes No a. Original schedule does not meet temperature constraints b. Use idle tasks to redistribute device idle time in order to reduce peak device temperature
474
Related Research Power Management Thermal Management
EDF, Dynamic Frequency Scaling Yao (FOCS’95) EDF, Minimize Temperature Bansal (FOCS’04) Worst Case Execution Time Shin (DAC’99) RMS, Reactive Frequency, CIA Wang (RTSS’06, ECRTS’06)
475
Outline Why Thermal Management? Measuring Temperature
Thermally Driven Adaptation Experimental Results Conclusions Temperature-Safe Real-time Systems Future Directions
476
Research Fronts Near term Longer term
Exploration of adaptation techniques Advanced FPGA reconfiguration capabilities Other frequency adaptation techniques Integration of temperature into real-time systems Longer term Cyber physical systems (NSF initiative)
477
Questions/Comments? Near term Longer term
Exploration of adaptation techniques Advanced FPGA reconfiguration capabilities Other frequency adaptation techniques Integration of temperature into real-time systems Longer term Cyber physical systems (NSF initiative)
478
Temperature per Processing Core
[Plot: junction temperature (C) vs. number of processing cores (1–4) for scenarios S1–S6; linear fit slopes range from about 2.2 ºC per core for S1–S3 down to 2.07 (S4), 1.43 (S5), and 1.22 (S6).]
479
Temperature Sample Mode
480
Ring Oscillator Thermometer Characteristics
Thermometer size: ~100 LUTs
Ring oscillator size: 48 LUTs (47 NOT + 1 OR)
Oscillation period: ~40 ns
Incrementer cycle period: ~.16 ms (40 ns * 4096)
Temperature resolution: .1ºC per count, or .1ºC per 20 ns
481
Temperature vs. Incrementer Period (Measuring Temperature while Application Active)
[Plot: temperature (C) vs. incrementer period (20 ns/count); application modes A, B, and C correspond to counts of about 8235, 8425, and 8620.]
482
Virtex-4 100FX Resource Utilization
Application implementation statistics.
Virtex-4 100FX resource utilization: Lookup Tables (LUTs) 57,461 (68%); D Flip Flops (DFFs) 49,148 (58%); Occupied Slices 32,868 (77%); Block RAM 44 (11%); Max Frequency 200 MHz.
Image correlation characteristics: Image size 320x480 pixels; Pixel resolution 8-bit (grey scale); # of Features 1 - 8; Image processing rate 40.6 frames per second (at 200 MHz).
483
VirtexE 2000 Resource Utilization Image Correlation Characteristics
Application implementation statistics.
VirtexE 2000 resource utilization: Lookup Tables (LUTs) 57,461 (68%); D Flip Flops (DFFs) 49,148 (58%); Occupied Slices 32,868 (15,808); Block RAM 26% (43); Max Frequency 125 MHz.
Image correlation characteristics: Image size 640x480 pixels; Pixel resolution 8-bit (grey scale); # of Mask Patterns 1 - 4; # of Templates 10 (in parallel); Image processing rate 12.7/second (at 125 MHz). a.) b.)
484
Scenario Descriptions
Scenario, ambient temperature, # of fans (S1 – S6):
S1: 40 C (104 F), 0 fans
S2: 35 C (95 F), 0 fans
S3: 30 C (86 F), 0 fans
S4: 25 C (77 F), 0 fans
S5: 25 C (77 F), 1 fan
S6: 25 C (77 F), 2 fans
485
High Level Architecture
Application Pause Thermal Manager Frequency & Quality Controller Frequency mode Quality Temperature
486
Periodic Temperature Sampling
Application Pause Thermal Manager 50 ms Event Counter Event Ring Oscillator Based Thermometer ready Sample Mode Controller Temperature Frequency & Quality capture Frequency mode Quality
487
Ring Oscillator Based Thermometer
Reset 12-bit incrementer ring_clk MSB Edge Detect 14-bit Clk DFF reset 14 Temperature sel Ready mux
488
ASIC, GPP, FPGA Comparison
Cost Performance Power Flexibility
489
Frequency Multiplexing Circuit
Frequency Control Clk Multiplier (DLLs) clk clk to global clock tree 2:1 MUX 4xclk BUFG Current Virtex-4 platform uses glitch free BUFGMUX component
490
Thermally Adaptive Frequency
High Frequency Thermal Budget = 72 C Junction Temperature, Tj (C) Low Frequency Low Threshold = 67 C Time (s)
491
Thermally Adaptive Frequency
Thermal Budget = 72 C High Frequency Low Frequency Low Threshold = 67 C Junction Temperature, Tj (C) Time (s)
492
Thermally Adaptive Frequency
Thermal Budget = 72 C High Frequency Low Frequency Low Threshold = 67 C Junction Temperature, Tj (C) Time (s)
493
Worst Case Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C Thermally Safe Frequency 50 MHz
494
Worst Case Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C 30/120MHz Adaptive Frequency Thermally Safe Frequency 50 MHz
495
Worst Case Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C 30/120MHz Adaptive Frequency 48.5 MHz Thermally Safe Frequency 50 MHz
496
Typical Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C 30/120MHz Adaptive Frequency 48.5 MHz Thermally Safe Frequency 50 MHz
497
Typical Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C 30/120MHz Adaptive Frequency 95 MHz Adaptive Frequency 48.5 MHz Thermally Safe Frequency 50 MHz
498
Best Case Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C 30/120MHz Adaptive Frequency 95 MHz Thermally Safe Frequency 50 MHz
499
Best Case Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C 30/120MHz Adaptive Frequency 95 MHz Adaptive Frequency 119 MHz Thermally Safe Frequency 50 MHz 2.4x Factor Performance Increase
500
Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 21: Fri 11/12/2010 (Synthesis) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA
501
Announcements/Reminders
HW3: finishing up (hope to release this evening), will be due Fri 12/17 midnight. Two lectures left Fri 12/3: Synthesis and Map Wed 12/8: Place and Route Two class sessions for Project Presentations Fri 12/10 Wed 12/15 (??) Take home final given on Wed 12/15 due 12/17 5pm
502
Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS
503
What you should learn Intro to synthesis
Synthesis and Optimization of Digital Circuits, De Micheli, 1994 (chapter 1)
504
Synthesis (big picture)
Synthesis & Optimization Architectural Logic Boolean Function Min Boolean Relation Min State Min Scheduling Sharing Coloring Covering Satisfiability Graph Theory Boolean Algebra
505
Views of a design Behavioral view Structural view PC = PC +1 Fetch(PC)
Decode(INST) Add Mult Architectural level RAM control S1 S2 Logic level DFF S3
506
Levels of Synthesis Architectural level
Translate the Architectural behavioral view of a design into a structural (e.g. block level) view Logic Translate the logic behavioral view of a design into a gate level structural view Behavioral view Structural view PC = PC +1 Fetch(PC) Decode(INST) Add Mult Architectural level RAM control S2 S1 Logic level DFF S3
507
Levels of Synthesis Architectural level
Translate the Architectural behavioral view of a design in to a structural (e.g. block level) view Logic Translate the logic behavioral view of a design into a gate level structural view ID Func. Resources Schedule use (control) Inter connect (data path) Behavioral view Structural view PC = PC +1 Fetch(PC) Decode(INST) Add Mult Architectural level RAM control S2 S1 Logic level DFF S3
508
Levels of Synthesis Architectural level
Translate the Architectural behavioral view of a design in to a structural (e.g. block level) view Logic Translate the logic behavioral view of a design into a gate level structural view Behavioral view Structural view PC = PC +1 Fetch(PC) Decode(INST) Add Mult Architectural level RAM control S2 S1 Logic level DFF S3
509
Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if
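A software reference for the same forward-Euler loop, useful for checking a hardware implementation against; the initial conditions, step size dx, and end point a below are arbitrary placeholders.

#include <stdio.h>

/* Forward-Euler loop from the slide, with u = y':
   x1 = x + dx;  u1 = u - 3*x*u*dx - 3*y*dx;  y1 = y + u*dx. */
int main(void) {
    double x = 0.0, y = 1.0, u = 0.0;     /* x(0)=0, y(0)=1 (assumed), y'(0)=u=0 (assumed) */
    double dx = 0.001, a = 1.0;

    while (x < a) {
        double x1 = x + dx;
        double u1 = u - (3.0 * x * u * dx) - (3.0 * y * dx);
        double y1 = y + u * dx;
        x = x1; u = u1; y = y1;           /* commit the new state, as on the slide */
    }
    printf("y(%g) = %f, y'(%g) = %f\n", a, y, a, u);
    return 0;
}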
510
Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if * ALU Memory & Steering logic Control Unit
511
Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if * ALU Memory & Steering logic Control Unit
512
Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic
513
Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 read S2 S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic
514
Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 + S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic
515
Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 * S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic
516
Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 * S6 S5 S4 * ALU Control Unit Memory & Steering logic
517
Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 *, + S6 S5 S4 * ALU Control Unit Memory & Steering logic
518
Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 * S6 S5 S4 * ALU Control Unit Memory & Steering logic
519
Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 * S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic
520
Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 +,* S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic
521
Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 + S2 S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic
522
Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 write S2 S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic
523
Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit DFF DFF DFF DFF * ALU Control Unit Memory & Steering logic
524
Optimization Combinational Metrics: propagation delay, circuit size
Sequential Cycle time Latency Circuit size
525
Optimization Combinational Metrics: propagation delay, circuit size
Sequential Cycle time Latency Circuit size
526
Impact of High-level Synthesis on Optimization
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if * ALU Memory & Steering logic Control Unit
527
Impact of High-level Synthesis on Optimization
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if * ALU Memory & Steering logic Control Unit * * * ALU Memory & Steering logic Control Unit
528
Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models
529
Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models
530
Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models Sum of products A’B’C’D’ + A’B’C’D + A’B’CD’ + A’B’CD + A’BCD’
531
Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models K-map CD Sum of products 00 01 10 11 AB A’B’C’D’ + A’B’C’D + A’B’CD’ + A’B’CD + A’BCD’ 00 1 1 1 1 01 1 10 11
532
Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models K-map CD Sum of products 00 01 10 11 AB A’B’C’D’ + A’B’C’D + A’B’CD’ + A’B’CD + A’BCD’ 00 1 1 1 1 01 1 10 11
533
Logic-level Synthesis and Optimization
Combinational: Two-level optimization, Multi-level optimization. Sequential: State-based models, Network models.
K-map (CD columns 00 01 10 11, AB rows 00 01 10 11):
AB=00: 1 1 1 1
AB=01: 0 0 1 0
AB=10: 0 0 0 0
AB=11: 0 0 0 0
Sum of products: A’B’C’D’ + A’B’C’D + A’B’CD’ + A’B’CD + A’BCD’
Sum of products (minimized): A’*B’ + A’*C*D’
534
Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models Multi-level high-level view A’B’C’D’ + A’B’C’D ’ A = xy + xw B = xw
535
Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models Multi-level high-level view A’B’C’D’ + A’B’C’D ’ A = xy + xw B = xw (xy + xw)’ (xw)’CD + (xy + xw)’(xw)C’D’
536
Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models
537
Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models
538
Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models
539
Introduction to HW3
540
Introduction to HW3
541
Introduction to HW3
542
Next Lecture MAP
543
Notes Notes
544
Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 22: Fri 11/19/2010 (Coregen Overview) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA
545
Announcements/Reminders
HW3): released by Saturday midnight, will be due Wed 12/15 midnight. Turn in weekly project report (tonight midnight) Midterms still being graded, sorry for the delay: You can stop by my office after 5pm today to pick up your graded tests 584 Advertisement: Number 1
546
Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS
547
What you should learn Basic of using coregen, in class demo
548
Next Lecture Finish up synthesis process, start MAP
549
Notes Notes
550
Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 22: Fri 12/1/2010 (Class Project Work) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA
551
Announcements/Reminders
HW3: finishing up (hope to release this evening) will be due Fri12/17 midnight. Two lectures left Fri 12/3: Synthesis and Map Wed 12/8: Place and Route Two class sessions for Project Presentations Fri 12/10 Wed 12/15 (??) Take home final given on Wed 12/15 due 12/17 5pm
552
Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS
553
Next Lecture Finish up synthesis process, MAP
554
Notes Notes
555
Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 24: Wed 12/8/2010 (Map, Place & Route) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA
556
Announcements/Reminders
HW3: finishing up (hope to release this evening) will be due Fri12/17 midnight. Two lectures left Fri 12/3: Synthesis and Map Wed 12/8: Place and Route Two class sessions for Project Presentations Fri 12/10 Wed 12/15 (9 – 10:30 am) Take home final given on Wed 12/15 due 12/17 5pm
557
Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS
558
Applications on FPGA: Low-level
Implement circuit in VHDL (Verilog). Simulate compiled VHDL. Synthesize VHDL into a device independent format. Map device independent format to device specific resources. Check that the device has enough resources for the design. Place resources onto physical device locations. Route (connect) resources together. Completely routed? Circuit meets specified performance? Download configuration file (bit-stream) to the FPGA
559
Applications on FPGA: Low-level
Implement Simulate Synthesize Map Place Route Download
560
(Technology) Map Translate device independent net list to device specific resources
561
(Technology) Map Translate device independent net list to device specific resources
562
(Technology) Map Translate device independent net list to device specific resources
563
(Technology) Map Translate device independent net list to device specific resources
564
Applications on FPGA: Low-level
Implement Simulate Synthesize Map Place Route Download
565
Place Bind each mapped resource to a physical device location
User Guided Layout (Chapter 16:Reconfigurable Computing) General Purpose (Chapter 14:Reconfigurable Computing) Simulated Annealing Partition-based Structured Guided (Chapter 15:Reconfigurable Computing) Data Path based Heuristics used No efficient means for finding an optimal solution
566
Place (High-level) Netlist from technology mapping in A in B in C RAM
LUT D DFF F DFF G clk out
567
Place (High-level) Netlist from technology mapping
FPGA physical layout I/O I/O I/O I/O in A in B in C I/O LUT BRAM I/O LUT RAM E DFF F I/O I/O LUT D LUT I/O I/O LUT I/O LUT BRAM I/O DFF G LUT I/O I/O clk LUT I/O LUT I/O out I/O I/O I/O I/O
568
Place (High-level) Netlist from technology mapping
FPGA physical layout clk in C out I/O in A in B in C In A LUT G E I/O D F RAM E In B I/O LUT D DFF F LUT I/O I/O LUT I/O LUT BRAM I/O DFF G LUT I/O I/O clk LUT I/O LUT I/O out I/O I/O I/O I/O
569
Place User Guided Layout (Chapter 16:Reconfigurable Computing
General Purpose (Chapter 14:Reconfigurable Computing) Simulated Annealing Partition-based Structured Guided (Chapter 15:Reconfigurable Computing) Data Path based
570
Place (User-Guided) The user provides information about the application's structure to help guide placement. Can help remove critical paths. Can greatly reduce the amount of time needed for routing. Several methods to guide placement: Fixed region, Floating region, Exact location, Relative location
571
Place (User-Guided): Examples
FPGA LUT D DFF F G Part of Map Netlist Fixed region
572
Place (User-Guided): Examples
FPGA LUT D DFF F G Part of Map Netlist Fixed region SDRAM
573
Place (User-Guided): Examples
FPGA Floating region Softcore Processor
574
Place (User-Guided): Examples
FPGA Exact Location LUT D DFF F G Part of Map Netlist LUT BRAM LUT LUT LUT LUT LUT LUT LUT LUT BRAM LUT LUT LUT LUT LUT LUT LUT
575
Place (User-Guided): Examples
FPGA Exact Location LUT D DFF F G Part of Map Netlist LUT BRAM LUT LUT LUT G LUT D F LUT LUT LUT BRAM LUT LUT LUT LUT LUT LUT LUT
576
Place (User-Guided): Examples
FPGA Relative Location LUT D DFF F G Part of Map Netlist LUT BRAM LUT LUT G D F LUT LUT LUT LUT LUT LUT BRAM LUT LUT LUT LUT LUT LUT LUT
577
Place (User-Guided): Examples
FPGA Relative Location LUT D DFF F G Part of Map Netlist LUT BRAM LUT LUT LUT LUT LUT LUT G D F LUT LUT BRAM LUT LUT LUT LUT LUT LUT LUT
578
Place (User-Guided): Examples
FPGA Relative Location LUT D DFF F G Part of Map Netlist LUT BRAM LUT LUT LUT LUT LUT LUT LUT LUT BRAM LUT LUT LUT G D F LUT LUT LUT LUT
579
Place User Guided Layout (Chapter 16:Reconfigurable Computing
General Purpose (Chapter 14:Reconfigurable Computing) Simulated Annealing Partition-based Structured Guided (Chapter 15:Reconfigurable Computing) Data Path based
580
Place (General Purpose)
Characteristics: Places resources without any knowledge of high level structure Guided primarily by local connections between resources Drawback: Does not take explicit advantage of applications structure Advantage: Typically can be used to place any arbitrary circuit
581
Place (General Purpose)
Preprocess the Map netlist using clustering: group netlist components that have local connectivity into a single logic block. Clustering helps to reduce the number of objects a placement algorithm has to explicitly place.
582
Place (General Purpose)
Placement using simulated annealing Based on the physical process of annealing used to create metal alloys
583
Place (General Purpose)
Simulated annealing basic algorithm:
Placement_cur = Initial_Placement;
T = Initial_Temperature;
While (not exit criteria 1)
  While (not exit criteria 2)
    Placement_new = Modify_placement(Placement_cur)
    ∆Cost = Cost(Placement_new) – Cost(Placement_cur)
    r = random(0,1);
    If r < e^(-∆Cost / T), Then Placement_cur = Placement_new
  End loop
  T = UpdateTemp(T);
End loop
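A minimal C sketch of this loop applied to a toy placement problem (six blocks on a line, cost = total wirelength of a chain of two-pin nets); the netlist, cooling schedule, and move count are illustrative placeholders, not values from any real tool.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NB 6
#define NN 5
static const int net[NN][2] = { {0,1},{1,2},{2,3},{3,4},{4,5} };

static int cost(const int *slot) {            /* slot[b] = position of block b on the line */
    int c = 0;
    for (int n = 0; n < NN; n++)
        c += abs(slot[net[n][0]] - slot[net[n][1]]);
    return c;
}

int main(void) {
    int slot[NB] = { 5, 2, 0, 4, 1, 3 };      /* arbitrary initial placement */
    double T = 10.0;

    while (T > 0.01) {                         /* outer loop: cooling schedule */
        for (int it = 0; it < 100; it++) {     /* inner loop: moves at this temperature */
            int a = rand() % NB, b = rand() % NB;
            int old = cost(slot);
            int t = slot[a]; slot[a] = slot[b]; slot[b] = t;   /* propose a swap */
            int dc = cost(slot) - old;
            double r = (double)rand() / RAND_MAX;
            if (!(dc <= 0 || r < exp(-dc / T))) {              /* same accept rule as above */
                t = slot[a]; slot[a] = slot[b]; slot[b] = t;   /* reject: undo the swap */
            }
        }
        T *= 0.9;                              /* lower the temperature */
    }
    printf("final cost = %d\n", cost(slot));   /* optimum for this toy netlist is 5 */
    return 0;
}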
584
Place (General Purpose)
Simulated annealing: Illustration FPGA LUT BRAM LUT LUT X A LUT LUT Z B BRAM LUT LUT G D F LUT LUT LUT LUT LUT
585
Place (General Purpose)
Simulated annealing: Illustration FPGA LUT BRAM LUT LUT LUT LUT LUT G LUT Z B BRAM X A LUT LUT F LUT LUT D LUT
586
Place (General Purpose)
Simulated annealing: Illustration FPGA LUT BRAM LUT LUT X A LUT LUT Z B BRAM LUT LUT G D F LUT LUT LUT LUT LUT
587
Place (General Purpose)
Simulated annealing: Illustration FPGA LUT BRAM LUT LUT X LUT LUT A Z B BRAM LUT LUT G D F LUT LUT LUT LUT LUT
588
Place (General Purpose)
Simulated annealing: Illustration FPGA LUT BRAM LUT LUT LUT LUT X A Z B BRAM LUT LUT G D F LUT LUT LUT LUT LUT
589
Place User Guided Layout (Chapter 16:Reconfigurable Computing
General Purpose (Chapter 14:Reconfigurable Computing) Simulated Annealing Partition-based Structured Guided (Chapter 15:Reconfigurable Computing) Data Path based
590
Place (Structured-based)
Leverage the structure of the application. Algorithms may work well for a given structure, but will likely give unacceptable results for a design with little regular structure.
591
Structure high-level example
592
Applications on FPGA: Low-level
Implement Simulate Synthesize Map Place Route Download
593
Route Connect placed resources together Two requirements
Design must be completely routed Routed design meets timing requirements Widely used algorithm “PathFinder” PathFinder (FPGA’95) McMurchie and Ebeling Reconfigurable Computing (Chapter 17) Scott Hauch, Andre Dehon (2008)
594
Route: Route FPGA Circuit
595
Route (PathFinder) PathFinder: A Negotiation-Based Performance-Driven Router for FPGAs (FPGA’95). Basic PathFinder algorithm: based closely on Dijkstra’s shortest path, but weights are assigned to nodes instead of edges.
596
Route (PathFinder): Example
G = (V,E). Vertices V: the set of nodes (wires). Edges E: the set of switches used to connect wires. Cost of using a wire: c_n = (b_n + h_n) * p_n. [Example routing graph: sources S1–S3 routing to sinks D1–D3 through shared wires A, B, and C, with base costs labeled on the nodes.]
597
Route (PathFinder): Example
Simple node cost cn = bn Obstacle avoidance Note order matters S1 S2 S3 3 2 1 4 1 3 1 A B C 4 1 1 3 1 3 2 D1 D2 D3
598
Route (PathFinder): Example
cn = b * p p: sharing cost (function of number of signals sharing a resource) Congestion avoidance S1 S2 S3 3 2 1 4 1 3 1 A B C 4 1 1 3 1 3 2 D1 D2 D3
599
Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3
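A toy illustration of the negotiation: two signals both prefer a cheap wire, and the growing present-sharing and history terms eventually push one of them onto the more expensive alternative. The cost and history update functions here are simplified stand-ins, not the exact first-order functions from the PathFinder paper.

#include <stdio.h>

/* Toy negotiated-congestion example for c_n = (b_n + h_n) * p_n.
   b: base cost, h: history cost, users: signals currently on the wire. */
typedef struct { double b, h; int users; } wire_t;

static double cost_if_added(const wire_t *w) {
    double p = 1.0 + (double)w->users;       /* present-sharing penalty if one more signal joins */
    return (w->b + w->h) * p;
}

int main(void) {
    wire_t x = { 1.0, 0.0, 0 };              /* cheap wire */
    wire_t y = { 3.0, 0.0, 0 };              /* expensive alternative */

    for (int iter = 1; iter <= 4; iter++) {
        x.users = 0; y.users = 0;
        for (int sig = 0; sig < 2; sig++) {  /* each signal greedily picks the cheaper wire */
            if (cost_if_added(&x) <= cost_if_added(&y)) x.users++; else y.users++;
        }
        printf("iter %d: wire X used by %d, wire Y used by %d\n", iter, x.users, y.users);
        if (x.users > 1) x.h += 1.0;         /* shared wires accumulate history cost */
        if (y.users > 1) y.h += 1.0;
    }
    return 0;
}

In the first iteration both signals crowd onto wire X; once X's history cost grows, one signal moves to wire Y and the congestion is resolved.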
600
Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3
601
Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3
602
Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3
603
Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3
604
Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3
605
Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3
606
Applications on FPGA: Low-level
Implement Simulate Synthesize Map Place Route Download
607
Download Convert routed design into a device configuration file (e.g. bitfile for Xilinx devices)
608
Next Lecture Project presentations
609
Questions/Comments/Concerns
Write down: the main point of the lecture, and one thing that is still not quite clear; OR, if everything is clear, give an example of how to apply something from the lecture.
610
Place (Structured-based)
Leverage the structure of the application. Algorithms may work well for a given structure, but will likely give unacceptable results for a design with little regular structure. GLACE, "A Generic Library for Adaptive Computing Environments" (FPL 2001), is an example tool that takes the structure of an application into account. FLAME (Flexible API for Module-based Environments). JHDL (from BYU). Gen (from Lockheed-Martin Advanced Technology Laboratories).
611
GLACE: High-level
612
GLACE: Flow
613
GLACE: Library Modules
614
GLACE: Data Path and Control Path
615
GLACE: FLAME low-level
616
GLACE: Final placement example