Presentation is loading. Please wait.

Presentation is loading. Please wait.

FPGA Routing Dr. Philip Brisk Department of Computer Science and Engineering University of California, Riverside CS 223.

Similar presentations


Presentation on theme: "FPGA Routing Dr. Philip Brisk Department of Computer Science and Engineering University of California, Riverside CS 223."— Presentation transcript:

1 FPGA Routing Dr. Philip Brisk Department of Computer Science and Engineering University of California, Riverside CS 223

2 Routing Resource Graph (RRG) www.eecg.toronto.edu/~aling/ece1718/project/fang/route_rr_graph.png 2-LUT in 1 in 2 out wire 1 wire 2 wire 3 wire 4 source sink out wire 4 wire 2 in 2 in 1 wire 1 wire 3 2

3 FPGA Routing Disjoint-path problem (on the RRG); NP-complete Input: – Graph G(V, E) – Set of sources S = {s 1, s 2, …, s m } – Set of sets of sinks T = {T 1, T 2, …, T n }, T i = {t i 1, t i 2, …, t i k } Solution – Finds paths from each source s i to all sinks in T i – Paths emanating from different sinks must be disjoint (cannot shared any vertices or edges) Objective(s) – Minimize delay, wirelength, etc. 3

4 Disjoint and Non-disjoint Paths s1s1 s2s2 t21t21 t11t11 4

5 This route is illegal! s1s1 s2s2 t21t21 t11t11 Two nets are routed on the same FPGA segment (remember, RRG vertices represent wires and CLB I/Os) 5

6 Disjoint and Non-disjoint Paths This route is legal! s1s1 s2s2 t21t21 t11t11 6

7 LUT Input Equivalence (1/3) s1s1 s2s2 2-LUT in 1 in 2 s1s1 s2s2 2-LUT in 1 in 2 LUT inputs are interchangeable 7

8 LUT Input Equivalence (2/3) s1s1 s2s2 2-LUT in 1 in 2 s1s1 s2s2 2-LUT in 1 in 2 s1s1 s2s2 t 2 1 = in 2 t 1 1 = in 1 t 2 1 = in 1 t 1 1 = in 2 s1s1 s2s2 Overly restrictive disjoint-path problem formulation 8

9 LUT Input Equivalence (3/3) s1s1 s2s2 2-LUT in 1 in 2 s1s1 s2s2 2-LUT in 1 in 2 Represent all inputs of a LUT as one RRG sink, t t s1s1 s2s2 9

10 PathFinder: A Negotiation-Based Performance-Driven Router for FPGAs Larry McMurchie and Carl Ebeling International Symposium on FPGAs, 1995

11 Architecture and CAD for Deep Submicron FPGAs (Section 2.2.3, pp. 25-30) Vaughn Betz, Jonathan Rose, Sandy Marquardt Springer, 1999

12 Basic Plan Route nets one-by-one – Each net is routed by a priority-driven breadth-first search – Multiple nets may use the same nodes and edges – Node/edge costs depend on current usage and past usage history Repeat for a fixed number of iterations (say 200) – Success: A disjoint routing solution for for all nets – Failure: No disjoint solution found after a fixed number of iterations

13 Routability Cost Function (per Node) C(n) = p(n) × (b(n) + h(n)) – b(n) : base cost of routing through node n Typically the intrinsic delay of the circuit element corresponding to the node – h(n) : history cost of routing through node n, based on previous router iterations – p(n) : depends on the number of signals presently routed through node n in the current iteration

14 Timing-Routability Cost Function

15 First-Order Congestion s1s1 s3s3 s2s2 A t1t1 t3t3 t2t2 B C 1 1 1 2 3 4 3 1 1 1 2 3 4 3

16 s1s1 s3s3 s2s2 A t1t1 t3t3 t2t2 B C 1 1 1 2 3 4 3 1 1 1 2 3 4 3 All paths route through B Increase the penalty cost of B per iteration due to sharing In a future iteration, it will be cheaper to route signal 1 through A, rather than B In a future iteration, it will be cheaper to route signal 2 through C, rather than B Requires gradual increase in cost of sharing nodes

17 Second-order Congestion s1s1 s3s3 s2s2 A t1t1 t3t3 t2t2 B C 1 2 2 1 1 2 1 2 1 1 Routing Order: 1, 2, 3

18 Second-order Congestion Routing Order: 1, 2, 3 A B C 1 2 2 1 1 2 1 2 1 1 s1s1 s3s3 s2s2 t1t1 t3t3 t2t2

19 Second-order Congestion Routing Order: 1, 2, 3 A B C 1 2 2 1 1 2 1 2 1 1 s1s1 s3s3 s2s2 t1t1 t3t3 t2t2

20 Second-order Congestion Routing Order: 1, 2, 3 A B C 1 2 2 1 1 2 1 2 1 1 Two paths route through C Increasing the history cost through C will push signal 2 onto an alternative path through B s1s1 s3s3 s2s2 t1t1 t3t3 t2t2

21 Second-order Congestion Routing Order: 1, 2, 3 A B C 1 2 2 3 3 2 1 2 3 3 Two paths route through C Increasing the history cost through C will push signal 2 onto an alternative path through B s1s1 s3s3 s2s2 t1t1 t3t3 t2t2

22 Second-order Congestion Routing Order: 1, 2, 3 A B C 1 2 2 3 3 2 1 2 3 3 Two paths route through B Increasing the history cost through B will push signal 1 onto an alternative path through A Increasing the history cost through B will push signal 2 back onto the path through C s1s1 s3s3 s2s2 t1t1 t3t3 t2t2

23 Second-order Congestion Routing Order: 1, 2, 3 A B C 3 4 2 3 3 4 3 2 3 3 Two paths route through B Increasing the history cost through B will push signal 1 onto an alternative path through A Increasing the history cost through B will push signal 2 back onto the path through C s1s1 s3s3 s2s2 t1t1 t3t3 t2t2

24 Second-order Congestion Routing Order: 1, 2, 3 A B C 3 4 2 3 3 4 3 2 3 3 Two paths route through C Increasing the history cost through C will push signal 2 back through B s1s1 s3s3 s2s2 t1t1 t3t3 t2t2

25 Second-order Congestion Routing Order: 1, 2, 3 A B C 3 4 2 5 5 4 3 2 5 5 Two paths route through C Increasing the history cost through C will push signal 2 back through B s1s1 s3s3 s2s2 t1t1 t3t3 t2t2

26 Pseudocode Priority driven BFS Computed from the source to every sink of net(i) Connect the newly found path to the current sink to the partly-computed routing tree RT(i) for net I

27 Timing-Routability Cost Function Represents the estimated criticality of the path from source I to sink j on net(i) net(i) may fan out to multiple sinks with varying criticality, depending on placement Delay of critical path

28 Negotiated A* Routing for FPGAs Russell Tessier Fifth Canadian Workshop on Field Programmable Devices, 1998

29 Basic Idea

30 A* Cost Function Cost at a node n i f i = g i + d i g i = f i-1 + c i Cost of routing from the source to n i Estimated cost of routing from n i to the sink Total cost of previous path Cost of the next candidate node

31 A* vs. BFS A*: f i = g i + d i = (f i-1 + c i ) + d i BFS: f i = g i Hybrid: f i = (1 - α)g i + αd i = (1 - α)(f i-1 + c i ) + αd i

32 Good Ideas, But… Algorithm and implementation assumed disjoint switch block – i.e., disjoint global routing domains Key issue: – Which routing domains (reachable from the source) are explored in what order?

33 A Fast Routability-driven Router for FPGAs Jordan S. Swartz, Vaughn Betz, Jonathan Rose International Symposium on FPGAs, 1998

34 Architecture and CAD for Deep Submicron FPGAs (Section 4.3, pp. 76-80) Vaughn Betz, Jonathan Rose, Sandy Marquardt Springer, 1999

35 Speed Enhancements Directed depth-first search – Similar to A*, but arguably more aggressive – Key Issues: Cost function Sink (target) ordering for multi-terminal nets Route classification – Low-stress – Difficult – Impossible

36 Cost Function Cost(n) = C(n) + Cost prev + α(ΔD) – Cost prev : cost of track segments used to reach the current segment – C(n) : (base) cost of using the track segment Increases more rapidly than the original PathFinder algorithm – (ΔD) : estimated cost of routing from the current track segment to the destination Based on Manhattan (XY) distance) – α : Direction factor BFS: α = 0 Directed search: α > 0

37 Updated Node (Base) Cost Function C(n) = p(n) × (b(n) + h(n)) – b(n) : base cost – h(n) : history cost – p(n) : usage cost C(n) = p(n) × b(n) × h(n) + Bendcost(n, m) – Multiplying b(n) and h(n) eliminates normalization – Penalize global routes that take a lot of turns These routes are less likely to use long wires

38 Penalty Cost Function Not specified in the original PathFinder paper p(n) = 1 + max(0, [occupancy(n) – capacity(n)] × p fac ) occupancy(n) : number of nets using resource n capacity(n) : maximum number of nets that can use resource n (typically 1) p fac : 0.5 for the first iteration; increase by 1.5x to 2x each subsequent iteration

39 History Cost Function Not specified in the original PathFinder paper occupancy(n) : number of nets using resource n capacity(n) : maximum number of nets that can use resource n (typically 1) h fac : constant; any value between 0.2 and 1 works well

40 Target Selection Closest Sink First Uses fewer track segments Furthest Sink First Uses more track segments

41 Net Order Route nets in order of decreasing fan-out High fanout nets tend to span the whole FPGA – Easier to route when there is less congestion do to other nets routed earlier Low fanout nets tend to be more localized – Relatively easy to route, even in the presence of some congestion

42 Binning Only the portions of the net close to the target sink should be expanded – E.g., only expand within Bin 4 in this example Shown to be effective for sinks with more than 50 targets

43 Bounding Box Define a bounding box around source and sinks Restrict the route for each net to no more than 3 channels outside of the bounding box

44 Routing High-fanout Nets

45 Difficulty Prediction Since you can’t know W min first without routing, use an estimate based on a wirelength prediction model

46 Routability Results

47 Routing Times as W increases

48 Routing Time vs. W (clma Benchmark)

49 BFS vs. Directed Search (w/Binning)

50 Difficulty Prediction W estimate took less than 1 second to compute Some prediction mistakes will happen Prediction accuracy was 84%

51 Fuzzy Prediction

52 Architecture and CAD for Deep Submicron FPGAs (Section 2.2.4, pp. 31-34, Section 4.4, pp. 80-89, Section 4.6.4, pp. 99-102 ) Vaughn Betz, Jonathan Rose, Sandy Marquardt Springer, 1999 VPR Timing-Driven Router

53 Example Timing Graph

54 12ns

55 Timing and Slack Process vertices in forward topological order Process vertices in reverse topological order (Required time for any output node is D max ) How much can I slow down this circuit element (node) without increasing the critical path?

56 Elmore Delay of a Source-Sink Path T d,i : intrinsic delay of element i (buffer); 0 otherwise R i : equivalent resistance of element i ( R wire, R buf, R pass ) C(subtree i ) : total downstream capacitance not isolated by buffers

57 Equivalent Circuits

58 Example: M Wire Segments connected by Buffers Linear and Elmore Delay Models agree!

59 Example: Chain of Pass Transistors Linear Quadratic

60 Fanout

61 Introduce Elmore Delay into VPR- PathFinder’s Node Cost Function Parameters that control how slack affects congestion-delay tradeoff

62 … And to Support Directed Search Expected cost to route from a wire segment n to the target sink j – The connection will be completed using wires of the same type as node n Length, connecting switch type, etc. – There is no congestion along the shortest path to the target p(m) and h(m) are both 1 for all nodes m on the path that completes the connection.

63 Why Optimize Elmore Delay?

64 Loops

65 Timing-Driven Router

66 Timing-Driven Routing of High- Fanout Nets Xun Chen, Jianwen Zhu, Minxuan Zhang International Conference on Field Programmable Logic and Applications, 2011

67 Motivation (for Binning, Earlier)

68 Angle-based Pruning Route the trunk first, then each leaf Net Decomposition

69 Results

70 Routing Algorithms for FPGAs with Sparse Intra-Cluster Routing Crossbars Yehdhih Moctar, Guy Lemieux, Philip Brisk International Conference on Field Programmable Logic and Applications, 2012

71 Full Crossbar Intra-cluster Routing 2-LUT BLE in 1 in 2 2-LUT BLE in 1 in 2 CLB inputs Local feedbacks 71

72 RRG Full Crossbar Intra-cluster Routing RRG 72

73 Full Crossbar RRG Don’t model the RRG inside each CLB! 73

74 Sparse Crossbar Intra-cluster Routing (1/6) 2-LUT BLE in 1 in 2 2-LUT BLE in 1 in 2 CLB inputs Local feedbacks 74 RRG must include the intra-cluster routing

75 Sparse Crossbar Intra-cluster Routing RRG 75

76 Sparse Crossbar RRG RRG encompasses the entire FPGA 76

77 Why is this Hard? Key Issues relating to RRG size – May run out of memory If you are not routing in the cloud – Long router runtimes Large search space leads to slow convergence Thrashing Objective – Route without expanding the whole RRG 77

78 Routing Algorithms Extensions to PathFinder routing algorithm – Routes one net at a time via wavefront expansion Baseline – Extend the RRG with intra-cluster routing at each CLB Selective RRG Expansion (SERRGE) – Route nets to CLB inputs – Dynamically expand the RRG to finish the route Partial Pre-Routing (PPR) – Pre-route all nets within CLBs – Complete global routes to CLB inputs 78

79 Baseline Router 2-LUT BLE in 1 in 2 2-LUT BLE in 1 in 2 CLB inputs Local feedbacks Inputs of the same LUT are equivalent! Global RRG 79

80 SERRGE in Action (1/8) Start with a global RRG s t 80

81 SERRGE in Action (2/8) s t Route from s’s CLB to t’s CLB 81

82 SERRGE in Action (3/8) s t Locally expand the RRG with part of t’s CLB 82

83 2-LUT BLE in 1 in 2 2-LUT BLE in 1 in 2 CLB inputs Local feedbacks Maintain one copy of the intra-cluster topology SERRGE in Action (4/8) t 83

84 2-LUT BLE in 1 in 2 2-LUT BLE in 1 in 2 CLB inputs Local feedbacks Expand the fanout of the CLB input being routed SERRGE in Action (5/8) t 84

85 2-LUT BLE in 1 in 2 2-LUT BLE in 1 in 2 CLB inputs Local feedbacks Complete the route if possible; if not, try again SERRGE in Action (6/8) t 85

86 2-LUT BLE in 1 in 2 2-LUT BLE in 1 in 2 CLB inputs Local feedbacks Track used CLB and BLE inputs; remember the route SERRGE in Action (7/8) t 86

87 2-LUT BLE in 1 in 2 2-LUT BLE in 1 in 2 CLB inputs Local feedbacks Deallocate unused RRG resources SERRGE in Action (8/8) t 87

88 SERRGE Summary Global routing uses standard PathFinder Selectively expand portions of the RRG to complete routes from CLB inputs to BLE inputs Storage Requirement – Global RRG (standard PathFinder requirement) – One copy of the intra-cluster routing topology – All intra-cluster routes computed thus far Fairly challenging to implement! 88

89 2-LUT BLE in 1 in 2 2-LUT BLE in 1 in 2 CLB inputs Local feedbacks Process CLBs one at a time PPR in Action (1/4) t1t1 t2t2 89

90 2-LUT BLE in 1 in 2 2-LUT BLE in 1 in 2 CLB inputs Local feedbacks Compute local routes for each BLE PPR in Action (2/4) t1t1 t2t2 90

91 2-LUT BLE in 1 in 2 2-LUT BLE in 1 in 2 CLB inputs Local feedbacks CLB inputs now form equivalence classes. PPR in Action (3/4) t1t1 t2t2 91

92 PPR in Action (4/4) Perform global routing w/equivalence class constraints on CLB inputs s t 92

93 PPR Summary Pre-route each CLB Propagate LUT equivalences to CLB inputs – A baseline router on the full-blown RRG would still need to satisfy BLE input-pin equivalence constraints Route as normal Much easier implementation than SERRGE 93

94 Critical Path Delay p = 40% ns p = 75% ns PPR SERRGE Baseline

95 RRG Size (MB) p = 40% MB p = 75% MB PPR SERRGE Baseline

96 Router Runtime (s) p = 40% Seconds PPR SERRGE Baseline p = 75%


Download ppt "FPGA Routing Dr. Philip Brisk Department of Computer Science and Engineering University of California, Riverside CS 223."

Similar presentations


Ads by Google