Mapping Task Graphs to Processors in Large Multiprocessor Systems


1 Mapping Task Graphs to Processors in Large Multiprocessor Systems
Kurt Keutzer and the MESCAL Team, especially Yujia Jin, Kaushik Ravindran, and N. R. Satish

2 6/23/2015 Design Space Exploration Flow
[Figure: design space exploration flow. Labels: application description (an element graph of FromDevice(0) to FromDevice(15), IPVerify, DecIPTTL, Lookup IPRoute, ToDevice(0) to ToDevice(15) and Discard), task graph (S1, R1, L1, T1, S2, R2, L2, T2), task graph + profiles, platform constraints, scheduling constraints, allocation/scheduling, HW/SW generation, implementation, performance analysis, performance numbers, and a multiprocessor platform of PEs, co-PEs, memories and a peripheral (MicroBlaze (soft), FSL, OPB, PLB, hardware acceleration, Ethernet, off-chip SDRAM, on-chip BRAM).]

3 Investigative Approach
- Demonstrate network applications on FPGA-based soft multiprocessors
  - Tomahawk exploration framework
  - Automated task allocation and scheduling
- Extend framework to large multiprocessor systems
  - 1000s to 10,000s of tasks
  - 100s to 1000s of PEs
  - RAMP

4 What Is an FPGA-based Soft Multiprocessor System?
- A network of architecture building blocks on an FPGA
- Multiprocessor architecture customized for the target application
  - Number of processors
  - Interconnection network
  - Memory hierarchy
  - Custom co-processors
- Cost reduction by avoiding custom silicon
- Productivity gains due to software abstraction
[Figure: architecture building blocks (processing element, co-processor, memory, bus, queue) on the Xilinx Virtex-II Pro and Virtex-4 families of FPGAs: PowerPC (hard), MicroBlaze (soft), hash engine, crypto engine, BRAM (on-chip), SDRAM (off-chip), FSL, OPB, PLB, hardware acceleration, Ethernet. An example multiprocessor configuration combines PEs, co-PEs, memories and a peripheral.]

5 Obstacles to Their Adoption: Hard to Design
- Complex micro-architecture design space
  - Processor choices
  - Memory hierarchy
  - Communication topology
- Difficult mapping decisions
  - Assigning computation to processing elements
  - Assigning data to exposed heterogeneous memories
- Unlocking the potential of these systems requires tools that enable efficiency and productivity

6 Example: Design Difficulty
[Figure: an application task graph with tasks R1, L1, T1 and R2, L2, T2; an execution-time profile in cycles for R, L and T on processors P1 and P2 (values as extracted: 20 10 20 20 30 10); and an architecture model with P1 and P2 connected by a queue (a further value of 10 appears next to it). Exploration compares three mappings: Design A with makespan = 80 (total time 80), Design B with makespan = 70 (total times 70 and 40), and the optimal design with makespan = 60 (total times 50 and 60).]
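To make the comparison concrete, the sketch below evaluates the makespan of a fixed mapping. It rests on several assumptions that the slide does not state: that the task graph is the two chains R1 → L1 → T1 and R2 → L2 → T2, that the extracted profile reads as P1 = (20, 10, 20) and P2 = (20, 30, 10) cycles for (R, L, T), and that the extra 10 is a queue delay paid when data crosses between processors. The names `EXEC`, `QUEUE_DELAY`, `EDGES` and `makespan` are illustrative only.

```python
from typing import Dict, List

# Assumed reading of the slide's profile: cycles per task type on each PE.
EXEC = {"R": {"P1": 20, "P2": 20},
        "L": {"P1": 10, "P2": 30},
        "T": {"P1": 20, "P2": 10}}
QUEUE_DELAY = 10  # assumed cost of passing data between P1 and P2
EDGES = [("R1", "L1"), ("L1", "T1"), ("R2", "L2"), ("L2", "T2")]  # assumed chains


def makespan(mapping: Dict[str, str], order: List[str]) -> int:
    """Finish time of the last task. `order` must be a topological order;
    tasks mapped to the same PE run in the order they appear in `order`."""
    preds: Dict[str, List[str]] = {t: [] for t in order}
    for u, v in EDGES:
        preds[v].append(u)
    finish: Dict[str, int] = {}
    pe_free: Dict[str, int] = {}
    for task in order:
        pe = mapping[task]
        ready = 0
        for u in preds[task]:
            # Data from a predecessor on another PE arrives after the queue delay.
            delay = 0 if mapping[u] == pe else QUEUE_DELAY
            ready = max(ready, finish[u] + delay)
        start = max(ready, pe_free.get(pe, 0))
        finish[task] = start + EXEC[task[0]][pe]
        pe_free[pe] = finish[task]
    return max(finish.values())


pipeline = {"R1": "P1", "R2": "P1", "L1": "P1", "L2": "P1", "T1": "P2", "T2": "P2"}
split = {"R1": "P1", "L1": "P1", "T1": "P1", "R2": "P1", "L2": "P2", "T2": "P2"}
per_chain = {"R1": "P1", "L1": "P1", "T1": "P1", "R2": "P2", "L2": "P2", "T2": "P2"}

print(makespan(pipeline, ["R1", "R2", "L1", "L2", "T1", "T2"]))   # 80
print(makespan(split, ["R2", "R1", "L1", "L2", "T1", "T2"]))      # 70
print(makespan(per_chain, ["R1", "R2", "L1", "L2", "T1", "T2"]))  # 60
```

Under these assumptions the three mappings give 80, 70 and 60 cycles, the same makespans the slide reports, although whether they correspond exactly to the slide's Design A, Design B and optimal design is a guess.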

7 Tomahawk: Network Applications onto Soft MPs
[Figure: Tomahawk flow. Labels: a Click description (FromDevice(0) to FromDevice(15), IPVerify, DecIPTTL, Lookup IPRoute, ToDevice(0) to ToDevice(15), Discard), task graph (S1, R1, L1, T1, S2, R2, L2, T2), automated mapping onto P1, P2 and M1, automated micro-architecture configuration, C programs and micro-architecture specification, and a Xilinx 2VP50 FPGA target (MicroBlaze (soft), FSL, OPB, PLB, hardware acceleration, Ethernet, off-chip SDRAM, on-chip BRAM; PEs, co-PEs, memories and a peripheral).]

8 Possible Approaches for Automated Exploration
- Randomized algorithms
  - Probabilistic bounds, simulated annealing
- Heuristic methods
  - List scheduling, force-directed scheduling
- Exact methods
  - Enumeration and tabu search, branch-and-bound
- Limitations of these approaches
  - Specific implementation constraints are hard to enforce
  - Most approaches require per-instance tuning and are hard to generalize, so they are a poor fit for design space exploration
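As an illustration of the heuristic family, here is a minimal list-scheduling sketch: tasks are visited in a fixed priority order and each is placed on the PE that lets it finish earliest. The task set, cycle counts, communication delay and the function name `list_schedule` are assumptions made for the example, not data from the talk.

```python
from typing import Dict, List, Tuple


def list_schedule(order: List[str],
                  exec_time: Dict[str, Dict[str, int]],
                  edges: List[Tuple[str, str]],
                  pes: List[str],
                  comm_delay: int) -> Tuple[Dict[str, str], int]:
    """Greedy earliest-finish-time placement. `order` must be topological."""
    preds: Dict[str, List[str]] = {t: [] for t in order}
    for u, v in edges:
        preds[v].append(u)
    finish: Dict[str, int] = {}
    place: Dict[str, str] = {}
    pe_free = {pe: 0 for pe in pes}
    for task in order:
        best_end, best_pe = None, None
        for pe in pes:
            # Inputs from another PE arrive comm_delay cycles after they finish.
            ready = max((finish[u] + (0 if place[u] == pe else comm_delay)
                         for u in preds[task]), default=0)
            end = max(ready, pe_free[pe]) + exec_time[task][pe]
            if best_end is None or end < best_end:  # ties keep the earlier PE
                best_end, best_pe = end, pe
        finish[task], place[task] = best_end, best_pe
        pe_free[best_pe] = best_end
    return place, max(finish.values())


# Hypothetical instance: two receive/lookup/transmit chains on two PEs.
per_type = {"R": {"P1": 20, "P2": 20},
            "L": {"P1": 10, "P2": 30},
            "T": {"P1": 20, "P2": 10}}
order = ["R1", "R2", "L1", "L2", "T1", "T2"]
exec_time = {t: per_type[t[0]] for t in order}
edges = [("R1", "L1"), ("L1", "T1"), ("R2", "L2"), ("L2", "T2")]

placement, span = list_schedule(order, exec_time, edges, ["P1", "P2"], comm_delay=10)
print(placement, span)
```

The greedy rule is fast, but adding a restriction such as "these two tasks must share a PE" means rewriting the placement loop, which is the kind of limitation the slide points to.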

9 Constraint Optimization Techniques for Automated Exploration
- Constraint solver technologies
  - Integer linear programming (ILP) solvers
  - 0-1 Boolean reasoning solvers (SAT, PB-SAT)
- Advantages
  - Constraint formulations are a formal, yet natural way to capture a mathematical optimization problem
  - Implementation constraints specific to a problem can be incorporated easily
  - Constraint solvers can exhaustively cover a search space without enumerating all solutions
- Key strategies to improve solver performance
  - Decomposition methods
  - Variable ordering
  - Improved lower and upper bounds
  - Symmetry representation
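The claim that a search space can be covered exhaustively without enumerating every solution can be illustrated with a small hand-written branch-and-bound. This is a sketch of the pruning idea only, not an ILP or SAT solver and not the authors' method; it ignores precedence and communication and simply minimizes the largest per-PE load. The cost table and the name `min_max_load` are assumptions.

```python
import math
from typing import Dict, List, Tuple


def min_max_load(tasks: List[str], pes: List[str],
                 cost: Dict[str, Dict[str, int]]) -> Tuple[float, int]:
    """Return (best maximum load, number of search nodes expanded)."""
    best = math.inf
    visited = 0
    loads = {pe: 0 for pe in pes}

    def branch(i: int) -> None:
        nonlocal best, visited
        # Loads only grow, so the current maximum is a valid lower bound:
        # a partial assignment that cannot beat the incumbent is discarded.
        if max(loads.values()) >= best:
            return
        visited += 1
        if i == len(tasks):
            best = max(loads.values())
            return
        for pe in pes:
            loads[pe] += cost[tasks[i]][pe]
            branch(i + 1)
            loads[pe] -= cost[tasks[i]][pe]

    branch(0)
    return best, visited


# Hypothetical cycle counts for six tasks on two heterogeneous PEs.
per_type = {"R": {"P1": 20, "P2": 20},
            "L": {"P1": 10, "P2": 30},
            "T": {"P1": 20, "P2": 10}}
tasks = ["R1", "R2", "L1", "L2", "T1", "T2"]
cost = {t: per_type[t[0]] for t in tasks}

best, visited = min_max_load(tasks, ["P1", "P2"], cost)
print(best, visited)
```

The full assignment tree for six tasks on two PEs has 127 nodes; the bound check lets the search skip whole subtrees while still guaranteeing that the returned value is optimal.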

10 ILP Formulation
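The equations on this slide are not part of the transcript. As a stand-in, the following is a generic textbook-style allocation and scheduling ILP, offered only as a sketch of what such a formulation can look like; it is not necessarily the authors' formulation. All symbols are assumptions: $T$ is the task set, $E$ the precedence edges, $P$ the PEs, $w_{t,p}$ the execution time of task $t$ on PE $p$, $x_{t,p}$ a 0-1 allocation variable, $s_t$ the start time, $y_{t,t'}$ a 0-1 ordering variable, $C$ the makespan, and $M$ a sufficiently large constant. Communication delays are omitted.

```latex
\begin{align}
\min\ & C \\
\text{s.t.}\ & \sum_{p \in P} x_{t,p} = 1 && \forall t \in T \\
& s_v \ge s_u + \sum_{p \in P} w_{u,p}\, x_{u,p} && \forall (u,v) \in E \\
& C \ge s_t + \sum_{p \in P} w_{t,p}\, x_{t,p} && \forall t \in T \\
& s_{t'} \ge s_t + w_{t,p} - M\,(3 - x_{t,p} - x_{t',p} - y_{t,t'}) && \forall t < t',\ p \in P \\
& s_t \ge s_{t'} + w_{t',p} - M\,(2 - x_{t,p} - x_{t',p} + y_{t,t'}) && \forall t < t',\ p \in P \\
& x_{t,p},\ y_{t,t'} \in \{0,1\}, \quad s_t \ge 0
\end{align}
```

The first constraint assigns each task to exactly one PE, the second enforces precedence, the third defines the makespan, and the two big-$M$ constraints keep tasks that share a PE from overlapping: when both $t$ and $t'$ are on $p$, $y_{t,t'} = 1$ forces $t$ to finish before $t'$ starts and $y_{t,t'} = 0$ forces the opposite order.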

11 Example Application: IPv4 Packet Forwarding
- Data plane of IPv4 packet forwarding (RFC 1812)
  - Campus network router, home router
  - Medium-sized route table (5,000 entries or fewer)
  - Route table small enough to fit in on-chip memory
- Target platform
  - Xilinx Virtex-II Pro 2VP50 FPGA
- Architecture library
  - MicroBlazes, PowerPC, on-chip Block RAM, IBM CoreConnect buses, queues
- Processing steps from ingress to egress: receive IPv4 packet (header and payload); verify version, checksum and TTL; look up next hop (prefix match); update checksum and TTL; transmit IPv4 packet
- Route table lookup: inspect the destination address and find the next hop
  - Longest prefix match
  - Implementation determined by route distribution, memory and performance constraints
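The lookup step can be sketched as a longest-prefix match over a small route table. This version uses a linear scan with Python's standard `ipaddress` module for clarity; the deck's implementation runs in C on MicroBlaze processors and is not shown in the transcript. The table contents, the next-hop labels and the name `next_hop` are made up for illustration.

```python
import ipaddress
from typing import List, Optional, Tuple

# Hypothetical route table: (prefix, next hop).
ROUTES: List[Tuple[str, str]] = [
    ("0.0.0.0/0", "port0"),    # default route
    ("10.0.0.0/8", "port1"),
    ("10.1.0.0/16", "port2"),
]


def next_hop(dst: str, routes: List[Tuple[str, str]]) -> Optional[str]:
    """Return the next hop of the most specific prefix containing `dst`."""
    addr = ipaddress.ip_address(dst)
    best_len, best_hop = -1, None
    for prefix, hop in routes:
        net = ipaddress.ip_network(prefix)
        # Among all matching prefixes, the longest one wins.
        if addr in net and net.prefixlen > best_len:
            best_len, best_hop = net.prefixlen, hop
    return best_hop


print(next_hop("10.1.2.3", ROUTES))     # port2
print(next_hop("10.200.0.1", ROUTES))   # port1
print(next_hop("192.168.1.1", ROUTES))  # port0
```

A linear scan is fine for a handful of routes; for a table of several thousand entries held in on-chip memory, a trie-based structure would be the more usual choice, with the exact design depending on the route distribution and memory budget the slide mentions.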

12 Hand-tuned Multiprocessor Design for IPv4 Forwarding
- Achieved 1.8 Gbps throughput for header processing
  - Using 12 MicroBlaze processors
[Figure: hand-tuned design. Four identical stages labeled "Verify ver & ttl, checksum, Lookup1" feed Lookup2 stages backed by route tables; packets arrive from source MicroBlazes 1 and 2 and return to source MicroBlazes 1 and 2. Key: MicroBlaze, Block RAM, bus, queue.]

13 Improved Design after Automated Exploration
- Resulting design achieved 2.0 Gbps throughput
  - Surpassing the 1.8 Gbps hand-tuned design
  - Using one fewer MicroBlaze processor
- The improvement came from a less regular configuration and a balanced workload of tasks across the processors
[Figure: automatically explored design. Stages are labeled "Lookup1, Verify ver & ttl", "Lookup2" and "Lookup3, Verify checksum", with three route tables; packets arrive from source MicroBlazes 1 and 2 and return to source MicroBlazes 1 and 2. Key: MicroBlaze, Block RAM, bus, queue.]

14 Justifying Constraint Optimization Techniques
- Our constraint optimization method can handle instances of the representative allocation and scheduling problem that map up to hundreds of tasks onto tens of PEs
- Implementation constraints can be easily incorporated
  - Task groupings
  - Multiprocessor topology restrictions
  - Preferred allocations
  - Memory assignments
  - Mutual exclusion
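To show why such restrictions are easy to add in a constraint formulation, here is a hedged sketch of how some of them could be written as linear constraints. The notation is an assumption, not taken from the slides: $x_{t,p} \in \{0,1\}$ is 1 when task $t$ is allocated to PE $p$, $P$ is the set of PEs, $E$ the set of task-graph edges, and $L$ the set of PE pairs joined by a communication link. The reading of "mutual exclusion" as "two tasks must not share a PE" is also an assumption.

```latex
\begin{align}
x_{a,p} &= x_{b,p} && \forall p \in P && \text{(task grouping: } a \text{ and } b \text{ share a PE)} \\
x_{a,p} + x_{b,p} &\le 1 && \forall p \in P && \text{(mutual exclusion: } a \text{ and } b \text{ on different PEs)} \\
x_{t,p^\ast} &= 1 && && \text{(preferred allocation of } t \text{ to } p^\ast\text{)} \\
x_{u,p} + x_{v,q} &\le 1 && \forall (u,v) \in E,\ p \ne q,\ (p,q) \notin L && \text{(topology: no link from } p \text{ to } q\text{)}
\end{align}
```

Each restriction is one extra family of rows in the model and leaves the solver and the rest of the formulation untouched. Memory assignments would need additional variables for data placement and are left out of this sketch.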

15 Following Moore’s Law
- Extend to more complex applications
  - 1000s to 10,000s of tasks
- Extend to bigger multiprocessor systems
  - 100s to 1000s of PEs
[Figure: exploration targeting an on-chip network of eight tiles, each with a PE, a memory (M) and a network interface (NI).]

16 What Can We Do for RAMP?
- Challenges in deploying concurrent applications on a RAMP system
  - Task allocation and scheduling across 100s to 1000s of PEs
  - A fast mapping step to enable efficient design space exploration
- Our optimization techniques for static task allocation and scheduling are a first step toward addressing these challenges
  - A “compile-time” tool that guides the designer in exploring efficient mappings
  - A flexible formulation that can target diverse multiprocessors
  - Research is in progress to extend our techniques to problems at the scale of RAMP systems

17 Backup Slides

18 Example
- Optimal design found in less than 6 seconds on a 400 MHz Sparc II
[Figure: exploration from an application and a 2VP50 architecture library (MicroBlazes, PowerPC, BRAMs; communication through FSLs and a bus) to an optimal design with elements labeled P11, P21, P12 and M1.]

19 Following Moore’s Law
- Extend to more complex applications
  - 1000s to 10,000s of tasks
  - DSLAM
- Extend to bigger multiprocessor systems
  - 100s to 1000s of PEs
  - RAMP

20 Challenges in Automated Exploration
- Higher exploration complexity
  - Increases by 2 orders of magnitude
- More emphasis on communication
  - Arbitration modeling
  - Routing constraints due to network topology
- Statistical cost model for dynamic behavior

21 Potential Approaches to Address These Challenges
- Additional constraints can be easily added to incorporate new features
- Constraint solvers will slow down and become the bottleneck
- Some strategies to improve constraint solver performance
  - Task-graph-based structural decompositions
  - Relaxation heuristics
  - Symmetry representation
  - Cutting planes and valid inequalities
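One way to see why symmetry matters: when PEs are identical, relabeling them turns one mapping into another with exactly the same cost, so a solver that explores both wastes effort. The sketch below uses a standard symmetry-breaking rule, which may differ from the "symmetry representation" the authors have in mind: a task may go on an already used PE or on the lowest-numbered unused one. The assumption of fully identical PEs and the name `canonical_assignments` are illustrative.

```python
from typing import Iterator, List, Tuple


def canonical_assignments(n_tasks: int, n_pes: int) -> Iterator[Tuple[int, ...]]:
    """Yield one task-to-PE assignment per class of PE relabelings."""

    def extend(prefix: List[int], used: int) -> Iterator[Tuple[int, ...]]:
        if len(prefix) == n_tasks:
            yield tuple(prefix)
            return
        # Allowed choices: any PE already in use, plus at most one fresh PE.
        for pe in range(min(used + 1, n_pes)):
            prefix.append(pe)
            yield from extend(prefix, max(used, pe + 1))
            prefix.pop()

    yield from extend([], 0)


canonical = list(canonical_assignments(n_tasks=4, n_pes=3))
print(len(canonical), "canonical assignments out of", 3 ** 4)
```

For 4 tasks on 3 identical PEs this leaves 14 assignments instead of 81, and the gap widens quickly as the number of PEs grows. Heterogeneous PEs, or PEs that differ in their position in the network, are only partly interchangeable, so the rule would apply within each group of equivalent PEs.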

22 [Figure: on-chip network of eight tiles, each with a PE, a memory (M) and a network interface (NI). Key: processor (PE), interconnect, memory (M), network interface (NI).]

