ParaDRo: A Parallel Deterministic Router Based on Spatial Partitioning and Scheduling. Chin Hau Hoo, Akash Kumar. There are two parts to the title; I will explain each part individually.
Moore’s law driving FPGA advancement: the first FPGA, the Xilinx XC2064, had 64 LUTs; a modern Xilinx Zynq UltraScale+ device has more than 500,000 LUTs.
What is the problem? In other words, the capacity of FPGAs is growing faster than the ability of design tools to effectively utilize it.
Why? The end of Dennard scaling: increasing the clock speed of a processor is no longer a viable way to reduce execution time (the "power wall").
FPGA design flow and the percentage of execution time spent in each stage: HDL, logic synthesis, technology mapping, packing, placement, routing, bitstream generation. To better understand the design productivity gap, let's look at a typical FPGA design flow; routing accounts for a large share of the execution time, which is why this talk focuses on routing. The data was obtained using the 8 benchmarks listed on the Experimental Setup slide.
Pathfinder. The figure on the left shows an example FPGA architecture: a grid of CLBs with input and output pins. During initialization, Pathfinder builds a graph that models the routing resources of this FPGA. The main loop: while there are nets to route, rip up a net, route it, and update the present congestion costs; once all nets are routed, check for congestion. If any resource is congested, update the history congestion costs and start another routing iteration; otherwise, done. Pathfinder is similar to Dijkstra's algorithm except that the cost is updated dynamically.
Pathfinder example (same flowchart): Net 1 and Net 2 each have a source and a sink on the CLB grid, and both are routed one after another through the shared routing-resource graph, negotiating for the same routing resources across iterations.
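To make the loop concrete, here is a minimal Python sketch of a PathFinder-style negotiated-congestion router. The node fields (base_cost, occupancy, capacity, hist_cost), the pres_fac factor, and the graph.neighbors interface are illustrative assumptions, not ParaDRo's or VPR's actual data structures.

```python
# Minimal PathFinder-style sketch; node fields and pres_fac are assumptions.
import heapq
from itertools import count

def node_cost(node, pres_fac):
    # Present congestion penalizes over-used routing nodes; history congestion
    # (hist_cost, updated between routing iterations) remembers past over-use.
    overuse = max(0, node.occupancy + 1 - node.capacity)
    return node.base_cost * node.hist_cost * (1.0 + overuse * pres_fac)

def route_net(graph, source, sink, pres_fac):
    # Dijkstra-like wavefront expansion; unlike plain Dijkstra's algorithm,
    # the node costs change from iteration to iteration as congestion evolves.
    tie = count()                       # tie-breaker so the heap never compares nodes
    dist = {source: 0.0}
    prev = {}
    frontier = [(0.0, next(tie), source)]
    while frontier:
        d, _, u = heapq.heappop(frontier)
        if u == sink:
            break
        if d > dist.get(u, float("inf")):
            continue                    # stale heap entry
        for v in graph.neighbors(u):
            nd = d + node_cost(v, pres_fac)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(frontier, (nd, next(tie), v))
    # Walk back from the sink to recover the routed path.
    path, n = [sink], sink
    while n != source:
        n = prev[n]
        path.append(n)
    return list(reversed(path))
```

Ripping up a net removes its old path's usage from the graph; rerouting with the updated costs is what lets congested nets negotiate for resources over successive iterations.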
Observations: Pathfinder is largely sequential, because the route of a net depends on the routes taken by the nets routed before it. Is it possible to create a scalable and deterministic parallel FPGA router? There is no free lunch: there is a complex trade-off between speedup, quality of result, and determinism.
Existing deterministic routers (platform, method, speedup):
Gort, 2010 (Coarse): shared memory; net partitioning; moderate speedup
Gort, 2010 (Fine): parallel maze expansion; low speedup
Gort, 2010 (Hybrid): shared + distributed memory; net partitioning + parallel maze expansion
Gort, 2012: distributed memory
Gort, 2012 (Enhanced): net partitioning + conditional net rerouting; high speedup
Shen, 2015 and Shen, 2017: GPU; very high speedup
ParaDRo: net partitioning + scheduling
Net partitioning is used in almost all approaches to ensure determinism.
Algorithm overview:
Initialization
Recursive bi-partitioning: pre-processing and finding the optimal cut-line
Multi-sink net decomposition and virtual net generation
Overlap graph coloring and task graph generation
Routing the virtual nets in the task graph associated with each node of the bi-partitioning tree
Recursive bi-partitioning. Example over FPGA columns 1-8 with six nets and their column spans: A (start=0, end=3), B (start=0, end=2), C (start=3, end=5), D (start=4, end=6), E (start=5, end=7), F (start=5, end=8). The cut produces three partition nodes: P0 with nets C and D (they straddle the cut-line), P1 with nets A and B (left of the cut), and P2 with nets E and F (right of the cut).
Recursive bi-partitioning – Pre-processing: build nets_start, which records the nets (A-F) starting at each FPGA column.
Recursive bi-partitioning – Pre-processing: build nets_end, which records the nets ending at each FPGA column.
Recursive bi-partitioning – Pre-processing: compute total_workload_after, a prefix sum taken from the last element, giving for each column the total workload of the nets that start at or after that column.
Recursive bi-partitioning – Pre-processing: compute total_workload_before, a prefix sum taken from the first element, giving for each column the total workload of the nets that end at or before that column.
Recursive bi-partitioning - Finding the optimal cut-line: for every candidate cut-line, take the absolute difference between total_workload_before (the workload that would fall to the left of the cut) and total_workload_after (the workload that would fall to the right), and choose the cut-line with the minimum difference. For the running example this yields P0 with nets C and D, P1 with nets A and B, and P2 with nets E and F.
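A compact sketch of the pre-processing and cut-line search from the last few slides, assuming each net is summarized by a column span and a workload estimate; the exact workload metric and column indexing used in ParaDRo may differ.

```python
# Sketch of the cut-line search: two prefix sums over per-column workloads,
# then pick the cut that best balances the two halves. Unit workloads and
# 0-based column indexing are illustrative assumptions.
def find_cutline(nets, num_columns):
    # nets: list of (start_col, end_col, workload); columns run 0..num_columns.
    start_w = [0] * (num_columns + 2)
    end_w = [0] * (num_columns + 2)
    for start, end, w in nets:
        start_w[start] += w
        end_w[end] += w

    # total_workload_before[c]: workload of nets ending at or before column c
    # (prefix sum from the first element).
    before = [0] * (num_columns + 2)
    running = 0
    for c in range(num_columns + 1):
        running += end_w[c]
        before[c] = running

    # total_workload_after[c]: workload of nets starting at or after column c
    # (prefix sum from the last element).
    after = [0] * (num_columns + 2)
    running = 0
    for c in range(num_columns, -1, -1):
        running += start_w[c]
        after[c] = running

    # Choose the cut between columns c and c+1 with the minimum workload imbalance.
    best_cut, best_diff = None, float("inf")
    for c in range(num_columns):
        diff = abs(before[c] - after[c + 1])
        if diff < best_diff:
            best_cut, best_diff = c, diff
    return best_cut

# The six spans from the example with unit workloads: A, B, C, D, E, F.
nets = [(0, 3, 1), (0, 2, 1), (3, 5, 1), (4, 6, 1), (5, 7, 1), (5, 8, 1)]
print(find_cutline(nets, 8))  # -> 4: A, B end left of the cut, E, F start right of it,
                              #    and C, D straddle it, matching P1, P2, and P0
```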
Problem when routing: nets C and D are overlapping! With two threads, Thread 0 routes C and then D sequentially (they overlap, so they cannot be routed concurrently) while Thread 1 sits idle; only afterwards can P1 (nets A, B) and P2 (nets E, F) be routed in parallel.
Multi-sink net decomposition and virtual net generation: breaking a multi-sink net into virtual nets (e.g. C into C1 and C2, or additionally D into D1) gives each virtual net a smaller bounding box, so there is less overlap between virtual nets.
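A small sketch of the decomposition, assuming one virtual net per sink and a simple source-to-sink rectangle as the bounding box; ParaDRo's actual decomposition and bounding-box policy may differ.

```python
# Split a multi-sink net into per-sink virtual nets with tighter bounding boxes.
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) position on the FPGA grid

@dataclass
class VirtualNet:
    name: str
    parent: str                        # name of the original multi-sink net
    source: Point
    sink: Point
    bbox: Tuple[int, int, int, int]    # (xmin, ymin, xmax, ymax)

def decompose(net_name: str, source: Point, sinks: List[Point]) -> List[VirtualNet]:
    vnets = []
    for i, sink in enumerate(sinks, start=1):
        xmin, xmax = sorted((source[0], sink[0]))
        ymin, ymax = sorted((source[1], sink[1]))
        vnets.append(VirtualNet(f"{net_name}{i}", net_name, source, sink,
                                (xmin, ymin, xmax, ymax)))
    return vnets

# Net C with two sinks becomes C1 and C2; their boxes are smaller than C's full
# bounding box and therefore overlap fewer other nets (such as D). The
# coordinates here are made up for illustration.
c_vnets = decompose("C", source=(2, 5), sinks=[(6, 2), (3, 8)])
```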
Overlap graph coloring and task graph generation: build an overlap graph of the virtual nets, apply graph coloring to identify parallelism, and then convert the coloring into a directed task graph to minimize idle time.
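A minimal sketch of the coloring step, using bounding-box overlap to build the graph and a simple greedy coloring; the ordering heuristic and overlap test in ParaDRo may be more elaborate.

```python
# Build the overlap graph from virtual-net bounding boxes and color it greedily.
# Virtual nets that receive the same color have disjoint boxes and can be
# routed concurrently.
def boxes_overlap(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

def color_overlap_graph(bboxes):
    # bboxes: dict mapping virtual-net name -> (xmin, ymin, xmax, ymax)
    names = list(bboxes)
    adj = {n: set() for n in names}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            if boxes_overlap(bboxes[u], bboxes[v]):
                adj[u].add(v)
                adj[v].add(u)
    colors = {}
    for n in names:
        used = {colors[m] for m in adj[n] if m in colors}
        c = 0
        while c in used:                 # smallest color not used by a neighbor
            c += 1
        colors[n] = c
    return colors
```

Rather than routing color by color with a barrier in between, the coloring is converted into a directed task graph so a thread can start the next virtual net as soon as its predecessors are done, which is what minimizes idle time.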
Routing the nets with scheduling. With P0 decomposed into virtual nets C1, C2, C3, and D, the task-graph schedule spreads these virtual nets across both threads so that neither sits idle, versus the unscheduled case where one thread routes all of P0 while the other waits; P1 (nets A, B) and P2 (nets E, F) are then routed in parallel in both cases.
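A sketch of executing such a task graph with a fixed pool of worker threads: a virtual net is handed to a thread as soon as all of its predecessors in the task graph have been routed. The task names, the dependency dictionary, and route_one_net are placeholders, not ParaDRo's API.

```python
# Route the virtual nets of a partition node according to a directed task graph.
from queue import Queue
from threading import Thread, Lock

def route_task_graph(tasks, deps, route_one_net, num_threads=2):
    # tasks: list of virtual-net names; deps: dict net -> set of prerequisite nets.
    # Assumes the dependencies form a DAG over exactly the nets in `tasks`.
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    dependents = {t: [] for t in tasks}
    for t, prereqs in remaining.items():
        for p in prereqs:
            dependents[p].append(t)

    ready = Queue()
    for t in tasks:
        if not remaining[t]:
            ready.put(t)                  # nets with no predecessors start immediately

    lock = Lock()
    done = [0]

    def worker():
        while True:
            task = ready.get()
            if task is None:              # sentinel: all nets routed
                return
            route_one_net(task)           # placeholder for the actual routing call
            with lock:
                done[0] += 1
                for d in dependents[task]:
                    remaining[d].discard(task)
                    if not remaining[d]:
                        ready.put(d)      # all predecessors finished: release it
                if done[0] == len(tasks):
                    for _ in range(num_threads):
                        ready.put(None)

    threads = [Thread(target=worker) for _ in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
```

Because virtual nets that may be routed concurrently do not overlap, the routing result does not depend on which thread finishes first, which is what keeps the parallel schedule deterministic.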
Ensuring convergence: when the number of congested nets drops below a predefined threshold, ParaDRo reroutes only the congested nets, and while rerouting only congested nets the bounding boxes of the virtual nets (e.g. C1 and C2) are expanded back to their original size.
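A tiny sketch of that rule, where is_congested, the threshold, and the original_bbox field are placeholders for ParaDRo's actual bookkeeping.

```python
# Below the threshold, rip up and reroute only congested nets, and give their
# virtual nets back the original (unrestricted) bounding boxes so the router
# can still find a legal route.
def select_nets_to_reroute(virtual_nets, is_congested, threshold):
    congested = [v for v in virtual_nets if is_congested(v)]
    if len(congested) >= threshold:
        return virtual_nets               # early iterations: reroute everything
    for v in congested:                   # late iterations: relax the bounding box
        v.bbox = v.original_bbox          # original_bbox is an assumed field
    return congested
```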
Experimental Setup
Dual Intel® Xeon® E5-2680 v3, 128 GB RAM; RHEL 6.8 with Linux kernel 2.6.32; gcc 7 with the -O3 flag
Benchmarks (total nets, minimum channel width):
stereovision1: 10,797 nets, channel width 104
LU8PEEng: 15,990 nets, channel width 114
stereovision2: 34,501 nets, channel width 154
LU32PEEng: 53,199 nets, channel width 174
neuron: 54,056 nets, channel width 206
stereo_vision: 61,883 nets, channel width 228
segmentation: 125,592 nets, channel width 292
denoise: 257,425 nets, channel width 310
Speedup of different ParaDRo variants. Seq: no parallelism within a partition node. Par: also parallelize within a partition node (e.g. route C and D as described earlier). Extra: additionally bi-partition each internal node once more in the orthogonal direction, for every level except the last level of the tree, since the number of leaves already equals the number of threads.
Critical path delay
Impact on total wirelength. Two factors: the restrictive bounding boxes and the ParaDRo partitioning.
Impact of enforcing sequential equivalence on speedup
Speedup comparison with existing deterministic works: 18.7x for the Par Extra variant. Gort (Enhanced) routes all nets only in the first iteration; from the second iteration onward, only congested nets are rerouted. Its largest benchmark was stereo_vision with about 40K LUTs, and larger benchmarks may give convergence issues. Shen uses Bellman-Ford, which is easier to parallelize than Dijkstra's algorithm but is not work-optimal.
Impact of ‘Determinism’ ParaFRo: FPL 2016 ParaDiMe: FCCM 2017
Conclusions: Presented ParaDRo, a deterministic parallel router. It uses recursive bi-partitioning to group nets into nodes that can be routed in parallel. Graph coloring and task scheduling are employed within each node to exploit further parallelism. Determinism affects speedup significantly.
Thank you!
Where are FPGAs used? Aerospace, Prototyping, Automotive, Broadcast, Consumer Electronics, Data Center, HPC, Industrial, Medical, Test & Measurement, Wired Comms, Wireless Comms.
Design tools are the bridge between FPGAs and their applications (aerospace, prototyping, automotive, broadcast, consumer electronics, data center, HPC, industrial, medical, test & measurement, wired comms, wireless comms). Verilog-to-Routing (VTR) is such a design tool: it is used to implement applications on FPGAs and to evaluate new FPGA architectures.