Mapping Task Graphs to Processors in Large Multiprocessor Systems

Presentation transcript:

Mapping Task Graphs to Processors in Large Multiprocessor Systems
Kurt Keutzer and the MESCAL Team, especially Yujia Jin, Kaushik Ravindran, and N. R. Satish

2 6/23/2015 Design Space Exploration Flow
[Figure: design space exploration flow. Recoverable labels: Application description (Click element graph with FromDevice, IPVerify, DecIPTTL, Lookup IPRoute, ToDevice, and Discard elements for ports 0-15); Multiprocessor platform (MicroBlaze soft processors, FSL, OPB, PLB, hardware acceleration, Ethernet, off-chip SDRAM, on-chip BRAM; PE, Co-PE, MEM, and PERIPHERAL blocks); Performance Analysis; Performance Numbers; Task graph + profiles; Platform Constraints; Scheduling Constraints; Allocation/Scheduling; HW/SW generation; Implementation.]

3 Investigative Approach
- Demonstrate network applications on FPGA-based soft multiprocessors
  - Tomahawk exploration framework
  - Automated task allocation and scheduling
- Extend framework to large multiprocessor systems
  - 1,000s to 10,000s of tasks
  - 100s to 1,000s of PEs
  - RAMP

4 What Is an FPGA-based Soft Multiprocessor System?
- A network of architecture building blocks on an FPGA
- Multiprocessor architecture customized for the target application
  - Number of processors
  - Interconnection network
  - Memory hierarchy
  - Custom co-processors
- Cost reduction by avoiding custom silicon
- Productivity gains due to software abstraction
[Figure: architecture building blocks (processing element, co-processor, memory, bus, queue) on the Xilinx Virtex-II Pro and Virtex-IV families of FPGAs. Labels: PowerPC (hard), MicroBlaze (soft), hash engine, crypto engine, BRAM (on-chip), SDRAM (off-chip), FSL, OPB, PLB, hardware acceleration, Ethernet. An example multiprocessor configuration is built from PE, Co-PE, MEM, and PERIPHERAL blocks.]

5 Obstacles to Their Adoption: Hard to Design
- Complex micro-architecture design space
  - Processor choices
  - Memory hierarchy
  - Communication topology
- Difficult mapping decisions
  - Assigning computation to processing elements
  - Assigning data to exposed heterogeneous memories
- To unlock the potential of these systems, tools that enable efficiency and productivity are needed

6 Example: Design Difficulty
[Figure: an application task graph made of R, L, and T tasks (R1, L1, T1 and R2, L2, T2) is mapped onto an architecture model with two processors, P1 and P2, and a queue, using a profile of execution times in cycles. Three mappings are compared:
- Design A: total time = 80; makespan = 80
- Design B: per-processor total times = 70 and 40; makespan = 70
- Optimal design: per-processor total times = 50 and 60; makespan = 60]
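In each design of this example the makespan equals the largest per-processor total time. A minimal Python sketch of that calculation follows. The per-task cycle counts are hypothetical, because the profile table from the slide is not recoverable here, and the model ignores queue communication and idle time.

```python
from collections import defaultdict


def makespan(mapping, exec_time):
    """Return (makespan, per-processor totals) for a task-to-processor mapping.

    mapping:   task name -> processor name
    exec_time: (task name, processor name) -> cycles
    Simplification: tasks on a processor run back to back, and
    communication and idle time are ignored.
    """
    totals = defaultdict(int)
    for task, proc in mapping.items():
        totals[proc] += exec_time[(task, proc)]
    return max(totals.values()), dict(totals)


# Hypothetical profile: each task costs 10 cycles on P1 and 20 cycles on P2.
tasks = ["R1", "L1", "T1", "R2", "L2", "T2"]
exec_time = {(t, "P1"): 10 for t in tasks}
exec_time.update({(t, "P2"): 20 for t in tasks})

all_on_p1 = {t: "P1" for t in tasks}
split = {"R1": "P1", "L1": "P1", "T1": "P1", "R2": "P1", "L2": "P2", "T2": "P2"}

print(makespan(all_on_p1, exec_time))
print(makespan(split, exec_time))
```

Even with this toy profile, moving two tasks to the slower processor shortens the makespan, which is why the mapping step is worth exploring rather than fixing by hand.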

7 Tomahawk: Network Applications onto Soft MPs
[Figure: Tomahawk flow. Recoverable labels: Click application description (FromDevice, IPVerify, DecIPTTL, Lookup IPRoute, ToDevice, and Discard elements for ports 0-15); Task graph; Automated Mapping; Automated micro-architecture configuration; C programs and micro-architecture specification; Xilinx 2VP50 FPGA (MicroBlaze soft processors, FSL, OPB, PLB, hardware acceleration, Ethernet, off-chip SDRAM, on-chip BRAM; PE, Co-PE, MEM, and PERIPHERAL blocks).]

8 Possible Approaches for Automated Exploration
- Randomized algorithms
  - Probabilistic bounds, simulated annealing
- Heuristic methods
  - List scheduling, force-directed scheduling
- Exact methods
  - Enumeration and tabu search, branch-and-bound
- Limitations of these approaches
  - Specific implementation constraints are hard to enforce
  - Most approaches require per-instance tuning and are hard to generalize, which makes them a poor fit for design space exploration
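As an illustration of the heuristic family named above, here is a minimal list-scheduling sketch in Python. It is a generic textbook variant, not the Tomahawk implementation: tasks are visited in a given priority order (assumed to respect dependencies) and each is placed on the processor where it finishes earliest. All task names and costs are hypothetical, and communication cost is ignored.

```python
def list_schedule(tasks, deps, cost, procs):
    """Greedy list scheduling.

    tasks: task names in priority order; must also be a topological order
    deps:  task -> list of predecessor tasks
    cost:  (task, processor) -> execution time in cycles
    procs: processor names
    Returns (schedule, makespan), where schedule maps
    task -> (processor, start, finish).
    """
    proc_free = {p: 0 for p in procs}  # time at which each processor is idle
    sched = {}
    for t in tasks:
        # A task is ready once all of its predecessors have finished.
        ready = max((sched[d][2] for d in deps.get(t, [])), default=0)
        best = None
        for p in procs:
            start = max(ready, proc_free[p])
            finish = start + cost[(t, p)]
            if best is None or finish < best[2]:
                best = (p, start, finish)
        sched[t] = best
        proc_free[best[0]] = best[2]
    return sched, max(finish for _, _, finish in sched.values())


# Two independent three-stage chains on two identical processors.
tasks = ["R1", "R2", "L1", "L2", "T1", "T2"]
deps = {"L1": ["R1"], "T1": ["L1"], "L2": ["R2"], "T2": ["L2"]}
procs = ["P1", "P2"]
cost = {(t, p): 10 for t in tasks for p in procs}

schedule, span = list_schedule(tasks, deps, cost, procs)
print(span, schedule)
```

The greedy rule is fast, but it has no natural place for constraints such as "these two tasks must share a processor", which is the limitation the slide points at.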

9 Constraint Optimization Techniques for Automated Exploration
- Constraint solver technologies
  - Integer linear programming (ILP) solvers
  - 0-1 Boolean reasoning solvers (SAT, PB-SAT)
- Advantages
  - Constraint formulations are a formal, yet natural way to capture a mathematical optimization problem
  - Implementation constraints specific to a problem can be incorporated easily
  - Constraint solvers can exhaustively cover a search space without enumerating all solutions
- Key strategies to improve solver performance
  - Decomposition methods
  - Variable ordering
  - Improved lower and upper bounds
  - Symmetry representation
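The claim that a solver can cover the whole search space without enumerating every solution can be illustrated with a small branch-and-bound search. The Python sketch below is a toy under strong assumptions (independent tasks, identical processors, integer durations) and is not the authors' solver. It shows three of the listed strategies in miniature: variable ordering (largest task first), lower and upper bounds for pruning, and symmetry handling (processors with equal load are interchangeable, so only one of them is tried).

```python
import math


def min_makespan(durations, n_procs):
    """Exact minimum makespan for independent tasks on identical processors."""
    order = sorted(durations, reverse=True)  # variable ordering: big tasks first
    # No schedule can beat the average load per processor.
    floor_bound = math.ceil(sum(order) / n_procs) if order else 0
    # Upper bound: strictly worse than putting everything on one processor.
    best = {"makespan": sum(order) + 1, "loads": None}
    loads = [0] * n_procs

    def search(i):
        # Lower bound on any completion of this partial assignment.
        lower = max(max(loads), floor_bound)
        if lower >= best["makespan"]:
            return  # prune: this subtree cannot improve on the incumbent
        if i == len(order):
            best["makespan"] = max(loads)
            best["loads"] = loads[:]
            return
        tried = set()
        for p in range(n_procs):
            if loads[p] in tried:
                continue  # symmetry: an equally loaded processor was tried
            tried.add(loads[p])
            loads[p] += order[i]
            search(i + 1)
            loads[p] -= order[i]

    search(0)
    return best["makespan"], best["loads"]


print(min_makespan([7, 5, 4, 3, 1], 2))
```

The result is provably optimal for this toy problem because every assignment is either explored or excluded by a bound or a symmetry argument.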

10 ILP Formulation
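The formulation shown on this slide does not survive in the transcript. As a rough illustration only, a textbook-style ILP for allocating and scheduling a task graph to minimize makespan could look like the following. All notation is an assumption and is not taken from the slides: $T$ is the set of tasks, $E$ the set of dependence edges, $P$ the set of PEs, $w_{t,p}$ the profiled execution time of task $t$ on PE $p$, $x_{t,p}$ a binary allocation variable, $s_t$ the start time of $t$, $d_t$ its duration, $y_{u,v}$ a binary ordering variable, $M$ the makespan, and $B$ a sufficiently large constant.

```latex
\begin{align*}
\min\ & M \\
\text{s.t.}\ & \sum_{p \in P} x_{t,p} = 1 && \forall t \in T
  && \text{(each task on exactly one PE)} \\
& d_t = \sum_{p \in P} w_{t,p}\, x_{t,p} && \forall t \in T
  && \text{(duration follows allocation)} \\
& s_v \ge s_u + d_u && \forall (u,v) \in E
  && \text{(dependences)} \\
& s_t + d_t \le M && \forall t \in T
  && \text{(makespan)} \\
& s_v \ge s_u + d_u - B\,(3 - x_{u,p} - x_{v,p} - y_{u,v}) && \forall u < v,\ p \in P
  && \text{($u$ before $v$ on $p$)} \\
& s_u \ge s_v + d_v - B\,(2 - x_{u,p} - x_{v,p} + y_{u,v}) && \forall u < v,\ p \in P
  && \text{($v$ before $u$ on $p$)} \\
& x_{t,p},\ y_{u,v} \in \{0,1\}, \quad s_t \ge 0
\end{align*}
```

The last two constraint families only bind when both tasks are allocated to the same PE; $y_{u,v}$ then selects which of the two runs first, so their executions cannot overlap.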

11 Example Application: IPv4 Packet Forwarding
- Data plane of IPv4 packet forwarding (RFC 1812)
  - Campus network router, home router
  - Medium-sized route table (5,000 entries or fewer)
  - Route table small enough to fit in on-chip memory
- Target platform
  - Xilinx Virtex-II Pro 2VP50 FPGA
- Architecture library
  - MicroBlazes, PowerPC, on-chip Block RAM, IBM CoreConnect buses, queues
- Packet path from ingress to egress (header and payload in, header out)
  - Receive IPv4 packet
  - Verify version, checksum, and TTL
  - Look up next hop (prefix match) in the route table
  - Update checksum and TTL
  - Transmit IPv4 packet
- Route table lookup: inspect destination address and find next hop
  - Longest prefix match
  - Implementation determined by route distribution, memory, and performance constraints
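The header-processing steps above (verify, look up, update) can be sketched in a few lines of Python. This is an illustrative model only, not the C code that runs on the MicroBlaze processors: the function names are hypothetical, the route table is a plain list scanned linearly, and a real design would pick a lookup structure to suit the route distribution and memory budget.

```python
import ipaddress
import struct


def checksum(header: bytes) -> int:
    """Internet checksum: one's-complement sum of 16-bit words."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def lookup(route_table, addr):
    """Longest prefix match by linear scan over (prefix, next_hop) pairs."""
    best = None
    for prefix, next_hop in route_table:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, next_hop)
    return None if best is None else best[1]


def forward(header: bytes, route_table):
    """Return (updated_header, next_hop), or None if the packet is discarded."""
    if len(header) < 20 or header[0] >> 4 != 4:
        return None  # too short or not IPv4
    ihl = (header[0] & 0x0F) * 4
    if ihl < 20 or len(header) < ihl:
        return None  # malformed header length
    hdr = bytearray(header[:ihl])
    if checksum(bytes(hdr)) != 0:
        return None  # a valid header sums to zero
    if hdr[8] <= 1:
        return None  # TTL would reach zero
    next_hop = lookup(route_table, ipaddress.IPv4Address(bytes(hdr[16:20])))
    if next_hop is None:
        return None  # no matching route
    hdr[8] -= 1  # decrement TTL
    hdr[10:12] = b"\x00\x00"  # clear, then recompute the checksum
    hdr[10:12] = struct.pack("!H", checksum(bytes(hdr)))
    return bytes(hdr), next_hop


# Hypothetical packet header and route table.
raw = bytearray(struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20, 0, 0, 64, 6, 0,
                            bytes([10, 0, 0, 1]), bytes([192, 168, 1, 7])))
raw[10:12] = struct.pack("!H", checksum(bytes(raw)))
table = [("0.0.0.0/0", "default"), ("192.168.0.0/16", "A"), ("192.168.1.0/24", "B")]
print(forward(bytes(raw), table))
```

Each of the three stages is a natural candidate task for the task graph, which is how the verify and lookup stages end up on separate processors in the designs that follow.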

12 Hand-tuned Multiprocessor Design for IPv4 Forwarding
- Achieved 1.8 Gbps throughput for header processing
  - Using 12 MicroBlaze processors
[Figure: hand-tuned design. Recoverable labels: Verify ver & ttl, checksum; Lookup1; Lookup2; Route Table; From source MicroBlaze 1 and 2; To source MicroBlaze 1 and 2. Key: MicroBlaze, Block RAM, Bus, Queue.]

13 Improved Design after Automated Exploration
- Resulting design achieved 2.0 Gbps throughput
  - Surpassing the performance of the 1.8 Gbps hand-tuned design
  - Using one fewer MicroBlaze processor
- The improvement was due to a less regular configuration and a balanced workload of tasks across the processors
[Figure: design found by automated exploration. Recoverable labels: Lookup1, Verify ver & ttl, and Lookup2 (four instances each); Lookup3 and Verify checksum (three instances each); Route Table (three instances); From source MicroBlaze 1 and 2; To source MicroBlaze 1 and 2. Key: MicroBlaze, Block RAM, Bus, Queue.]
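A quick worked comparison of the two designs, using only the figures quoted on the slides (1.8 Gbps on 12 MicroBlazes for the hand-tuned design, 2.0 Gbps on one fewer for the explored design):

```latex
\begin{align*}
\text{throughput gain} &= \frac{2.0}{1.8} \approx 1.11 \\
\text{hand-tuned, per processor} &= \frac{1.8\ \text{Gbps}}{12} = 0.15\ \text{Gbps} \\
\text{explored, per processor} &= \frac{2.0\ \text{Gbps}}{11} \approx 0.18\ \text{Gbps} \\
\text{per-processor gain} &= \frac{2.0/11}{1.8/12} \approx 1.21
\end{align*}
```

The explored design therefore delivers roughly 11% more throughput overall and roughly 21% more per processor.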

14 Justifying Constraint Optimization Techniques
- Our constraint optimization method can handle instances of the representative allocation and scheduling problem that map up to 100s of tasks onto 10s of PEs
- Implementation constraints can be easily incorporated
  - Task groupings
  - Multiprocessor topology restrictions
  - Preferred allocations
  - Memory assignments
  - Mutual exclusion
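To make the "easily incorporated" point concrete, here is how some of these constraints might be written as linear constraints over a binary allocation variable. The notation is assumed for illustration and is not taken from the slides: $x_{t,p} = 1$ if task $t$ is allocated to PE $p$, and $P$ is the set of PEs. Reading mutual exclusion as "two tasks must not share a PE" is likewise one possible interpretation.

```latex
\begin{align*}
\text{Task grouping ($a$ and $b$ share a PE):}\quad
  & x_{a,p} = x_{b,p} && \forall p \in P \\
\text{Mutual exclusion ($a$ and $b$ on different PEs):}\quad
  & x_{a,p} + x_{b,p} \le 1 && \forall p \in P \\
\text{Preferred allocation ($t$ pinned to PE $q$):}\quad
  & x_{t,q} = 1 \\
\text{Topology restriction (edge $(u,v)$, no link from $p$ to $q$):}\quad
  & x_{u,p} + x_{v,q} \le 1
\end{align*}
```

Each requirement adds a handful of linear rows to the model and leaves the solver untouched, whereas a hand-written heuristic would typically need new special-case code.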

15 Following Moore's Law
- Extend to more complex applications
  - 1,000s to 10,000s of tasks
- Extend to bigger multiprocessor systems
  - 100s to 1,000s of PEs
[Figure: on-chip network connecting eight tiles, each with a PE, a memory (M), and a network interface (NI).]

16 What Can We Do for RAMP?
- Challenges in deploying concurrent applications on a RAMP system
  - Task allocation and scheduling across 100s to 1,000s of PEs
  - Fast mapping step to enable efficient design space exploration
- Our optimization techniques for static task allocation and scheduling are a first step toward addressing these challenges
  - A "compile-time" tool that guides the designer in exploring efficient mappings
  - Flexible formulation to target diverse multiprocessors
  - Research in progress to extend our techniques to problems at the scale of RAMP systems

17 Backup Slides

18 Example
- Optimal design found in less than 6 seconds on a 400 MHz Sparc II
[Figure: exploration from application and architecture to an optimal design on the 2VP50. Recoverable labels: Application; Architecture (MicroBlazes, PowerPC, BRAMs); Communication (FSLs, Bus); explore; Optimal design; P11, P21, P12, M1.]

19 Following Moore's Law
- Extend to more complex applications
  - 1,000s to 10,000s of tasks
  - DSLAM
- Extend to bigger multiprocessor systems
  - 100s to 1,000s of PEs
  - RAMP

20 Challenges in Automated Exploration
- Higher exploration complexity
  - Increases by 2 orders of magnitude
- More emphasis on communication
  - Arbitration modeling
  - Routing constraints due to network topology
- Statistical cost model for dynamic behavior

21 Potential Approaches to Address These Challenges
- Additional constraints can be easily added to incorporate new features
- Constraint solver performance will degrade and become the bottleneck
- Some strategies to improve constraint solver performance
  - Task-graph-based structural decompositions
  - Relaxation heuristics
  - Symmetry representation
  - Cutting planes and valid inequalities

22 [Figure: on-chip network connecting eight tiles, each with a PE, a memory (M), and a network interface (NI). Key labels: PE, Processor, interconnect, Memory, Network Interface.]