CDSC CHP Prototyping
Yu-Ting Chen, Jason Cong, Mohammad Ali Ghodrat, Muhuan Huang, Chunyue Liu, Bingjun Xiao, Yi Zou
Accelerator-Rich Architectures: ARC, CHARM, BiN
Goals
- Implement the architecture features and support into the prototype system
  - Architecture proposals: accelerator-rich CMPs, CHARM, hybrid cache, buffer-in-NUCA, etc.
- Bridge different thrusts in CDSC
Server-Class Platform: HC-1ex Architecture
- Xeon quad-core LV5408: 40 W TDP
- Tesla C1060: 100 GB/s off-chip bandwidth, 200 W TDP
- 4x XC6VLX760 FPGAs: 80 GB/s off-chip bandwidth, 90 W design power
Drawbacks of Commodity Systems
- Limited ability to customize from the architecture point of view
- Board-level integration rather than chip-level integration
- Commodity systems can only reach a certain level; further innovation is needed
CHP Prototyping Plan
- Create the working hardware and software
  - Use the FPGA Extensible Processing Platform (EPP) as the platform
  - Reuse existing FPGA IPs as much as possible
- Work in multiple phases
Target Platforms: Xilinx ML605 and Zynq
- Zynq: dual-core Cortex-A9 with programmable logic
- ML605: Virtex-6 based board
CHP Prototyping Phases
- ARC implementation
  - Phase 1: basic platform (accelerators and software GAM)
  - Phase 2: adding modularity using available IP (e.g., the Xilinx DMAC IP)
  - Phase 3: first step toward BiN (shared buffer; customized modules such as a DMA controller and plug-and-play accelerators)
  - Phase 4: system enhancement (crossbar; AXI implementation)
- CHARM implementation
ARC Phase 1 Goals
- Set up a basic environment
  - Multi-core + simple accelerators + OS
  - Understand the system interactions in more detail
- Simple controller as the GAM (global accelerator manager)
  - Supports system-level sharing of multiple accelerators of the same type
ARC Phase 1 Example System Diagram
- MicroBlaze-0 (Linux with MMU) and MicroBlaze-1 acting as the GAM (bare-metal, no MMU)
- AXI4 crossbar and AXI4-Lite bus connecting DDR3, timer, UART, and mutex peripherals
- vecadd and vecsub accelerators, each reached through a mailbox and FSL links
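To make the Phase-1 flow concrete, here is a minimal sketch of how the Linux core might hand a vecadd request to the GAM through a mailbox. The message format and the mbox_write/mbox_read helpers are assumptions for illustration; the real system would go through the Xilinx mailbox driver and whatever protocol the GAM firmware defines.

#include <stdint.h>

/* Assumed mailbox helpers; not the actual Xilinx driver calls. */
extern void     mbox_write(int mbox_id, uint32_t word);
extern uint32_t mbox_read(int mbox_id);

#define MBOX_VECADD 0   /* mailbox on the path toward the vecadd accelerator */

/* Ask the GAM to run vecadd on arrays already resident in DDR3. */
uint32_t vecadd_request(uint32_t a_addr, uint32_t b_addr, uint32_t c_addr, uint32_t n)
{
    mbox_write(MBOX_VECADD, a_addr);  /* operand A */
    mbox_write(MBOX_VECADD, b_addr);  /* operand B */
    mbox_write(MBOX_VECADD, c_addr);  /* result buffer */
    mbox_write(MBOX_VECADD, n);       /* element count */
    return mbox_read(MBOX_VECADD);    /* block until the GAM reports completion */
}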
ARC Phase-2 Goals
- Implement a system similar to the original ARC design
  - GAM, accelerators, DMA controller, SPM (scratchpad memory)
- Add modularity using available IP
  - E.g., the Xilinx DMAC IP (a simplified usage sketch follows)
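As a rough illustration of the Phase-2 data path (DMA staging data from DDR into an accelerator's scratchpad before the accelerator starts), here is a hedged sketch. The register layout and names are invented for illustration; the actual Xilinx DMAC IP has its own, richer programming model.

#include <stdint.h>

/* Assumed, simplified DMAC register view. */
typedef struct {
    volatile uint32_t src;   /* source address (DDR) */
    volatile uint32_t dst;   /* destination address (accelerator SPM) */
    volatile uint32_t len;   /* bytes to move */
    volatile uint32_t go;    /* write 1 to start */
    volatile uint32_t done;  /* reads 1 when the transfer has finished */
} dmac_regs_t;

/* Stage input data into the accelerator's scratchpad. */
static void dma_to_spm(dmac_regs_t *dmac, uint32_t ddr_src, uint32_t spm_dst, uint32_t nbytes)
{
    dmac->src = ddr_src;
    dmac->dst = spm_dst;
    dmac->len = nbytes;
    dmac->go  = 1;
    while (!dmac->done) ;   /* simple blocking wait */
}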
ARC Phase-2 Architecture
ARC Phase-2 Performance and Power Results
- Benchmarking kernel
- Results:

  Platform                                                        | Runtime (us) | Power (W) | Gain
  CHP prototype on Xilinx ML605 FPGA @ 100 MHz                    | 1,746        | 2         | 17,570X
  2x quad-core Intel Xeon E5405 (x64) @ 2.00 GHz, 1 FPU per core  | 562          | 80        | 1,365X
  Dual-core Intel Xeon 5150 (x32) @ 2.66 GHz, 1 FPU per core      | 10,061       | 65        | 94X
  16-core UltraSPARC T1 @ 1.2 GHz, 1 shared FPU                   | 852,163      | 72        | 1X
ARC Phase-2 Runtime Breakdown
ARC Phase-2 Area Breakdown
- Slice logic utilization
  - Slice registers: 45,283 out of 301,440 (15%)
  - Slice LUTs: 40,749 out of 150,720 (27%)
    - Used as logic: 32,505 out of 150,720 (21%)
    - Used as memory: 5,248 out of 58,400 (8%)
- Slice logic distribution
  - Occupied slices: 17,621 out of 37,680 (46%)
  - LUT-flip-flop pairs used: 54,323
    - With an unused flip-flop: 14,617 out of 54,323 (26%)
    - With an unused LUT: 13,574 out of 54,323 (24%)
    - Fully used LUT-FF pairs: 26,132 out of 54,323 (48%)
ARC Phase-3 Goals
- First step toward BiN: shared buffer
- Design our own customized modules
  - Customized DMA controller that handles batched TLB misses
  - Plug-and-play accelerator design: make the interface general enough, at least for a class of accelerators (a sketch of such an interface follows)
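To illustrate what a "general enough" interface could look like, here is a minimal sketch of a memory-mapped register map shared by a class of accelerators. The layout, field names, and the acc_run helper are assumptions for illustration, not the actual ARC Phase-3 interface.

#include <stdint.h>

/* Hypothetical register map common to a class of accelerators:
 * same control/status/argument layout regardless of the kernel inside. */
typedef struct {
    volatile uint32_t ctrl;     /* bit 0: start */
    volatile uint32_t status;   /* bit 0: done  */
    volatile uint32_t src_addr; /* input buffer (SPM offset or virtual address) */
    volatile uint32_t dst_addr; /* output buffer */
    volatile uint32_t length;   /* number of elements */
} acc_regs_t;

/* Start an accelerator and busy-wait for completion. */
static void acc_run(acc_regs_t *acc, uint32_t src, uint32_t dst, uint32_t len)
{
    acc->src_addr = src;
    acc->dst_addr = dst;
    acc->length   = len;
    acc->ctrl     = 1;                 /* start */
    while ((acc->status & 1) == 0) ;   /* wait for done */
}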
ARC Phase-3 Architecture
- A partial realization of the proposed accelerator-rich CMP on the Xilinx ML605 (Virtex-6)
  - Global accelerator manager (GAM) for accelerator sharing
  - Shared on-chip buffers: many more accelerators than buffer bank resources
  - Virtual addressing in the accelerators; accelerator virtualization
  - Virtual-address DMA, with on-demand TLB filling from the core (sketched below)
  - Not included: network-on-chip, buffer sharing with the cache, customized instructions in the core
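One way to picture the on-demand TLB filling is the following core-side handler: the DMA engine reports the virtual page that missed, and the core writes back the translation. The register names and the lookup_pte helper are assumptions for illustration, not the recorded Phase-3 design.

#include <stdint.h>

/* Hypothetical IOMMU/TLB-miss interface: the DMA engine raises an interrupt
 * with the faulting virtual page, and the core supplies the translation. */
typedef struct {
    volatile uint32_t miss_vpn;   /* virtual page number that missed */
    volatile uint32_t fill_ppn;   /* physical page number written by the core */
    volatile uint32_t fill_valid; /* core sets to 1 after writing fill_ppn */
} iommu_regs_t;

extern uint32_t lookup_pte(uint32_t vpn);  /* walk the process page table (assumed helper) */

/* Interrupt handler on the core: service one TLB miss from the DMA engine. */
void iommu_tlb_miss_handler(iommu_regs_t *iommu)
{
    uint32_t vpn = iommu->miss_vpn;
    iommu->fill_ppn   = lookup_pte(vpn); /* translate on demand */
    iommu->fill_valid = 1;               /* DMA engine resumes the transfer */
}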
ARC Phase-3 Performance and Power Results
- Benchmarking kernel
- Results:

  Platform                                                        | Runtime (us) | Power (W) | EDP gain
  CHP prototype on Xilinx ML605 FPGA @ 100 MHz                    | 1,802        | 2         | 8,050,786X
  2x quad-core Intel Xeon E5405 (x64) @ 2.00 GHz, 1 FPU per core  | 562          | 80        | 2,069,261X
  Dual-core Intel Xeon 5150 (x32) @ 2.66 GHz, 1 FPU per core      | 10,061       | 65        | 7,947X
  16-core UltraSPARC T1 @ 1.2 GHz, 1 shared FPU                   | 852,163      | 72        | 1X
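Reading the UltraSPARC T1 row as the 1X baseline (implied by the table), the EDP gains follow from EDP = power x runtime^2; for example, for the dual-core Xeon 5150:

\mathrm{EDP} = P \cdot t^{2}, \qquad
\text{gain}_{5150} = \frac{72\ \mathrm{W} \times (852{,}163\ \mu\mathrm{s})^{2}}{65\ \mathrm{W} \times (10{,}061\ \mu\mathrm{s})^{2}} \approx 7{,}947\times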
Impact of Communication and Computation Overlapping
- Chart comparing pipelined communication & computation against no pipelining; the pipelined version comes out 19% ahead
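The pipelined schedule can be sketched as a classic double-buffering (ping-pong) loop: while the accelerator computes on one SPM bank, the DMA prefetches the next tile into the other. The dma_start/dma_wait/acc_start/acc_wait helpers are assumptions, not the project's actual API.

/* Assumed helpers: start/wait a DMA fill of a buffer, start/wait the accelerator. */
extern void dma_start(int buf, int tile);
extern void dma_wait(int buf);
extern void acc_start(int buf);
extern void acc_wait(int buf);

void run_pipelined(int num_tiles)
{
    dma_start(0, 0);                     /* prefetch the first tile */
    for (int i = 0; i < num_tiles; i++) {
        int cur = i & 1, nxt = cur ^ 1;
        dma_wait(cur);                   /* tile i is now in SPM bank `cur` */
        if (i + 1 < num_tiles)
            dma_start(nxt, i + 1);       /* overlap: fetch tile i+1 */
        acc_start(cur);
        acc_wait(cur);                   /* compute on tile i */
    }
}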
Overhead of Buffer Sharing: Bank Access Contention (1)
- Comparison: the 4 logical buffers allocated to 4 separate buffer banks vs. all 4 allocated to 1 buffer bank
- Overhead of sharing a single bank: 3.2%
- Reason: the AXI bus allows masters to issue transactions simultaneously, and the AXI transaction time dominates the buffer access time
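A back-of-the-envelope way to see why the overhead stays small (a simple model assumed here, not taken from the slides): if every buffer access is carried by an AXI transaction of latency t_axi and only the bank access time t_bank must serialize, then with n masters sharing one bank the worst-case relative slowdown is roughly

\frac{(n-1)\, t_{\mathrm{bank}}}{t_{\mathrm{axi}} + t_{\mathrm{bank}}}

which remains a few percent whenever t_axi dominates t_bank.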
Overhead of Buffer Sharing: Bank Access Contention (2)
- Comparison: the 4 logical buffers allocated to 4 separate buffer banks vs. all 4 allocated to 1 buffer bank
- Overhead of sharing a single bank: 2.7%
Area Breakdown
- Slice logic utilization
  - Slice registers: 105,969 out of 301,440 (35%)
  - Slice LUTs: 93,755 out of 150,720 (62%)
    - Used as logic: 80,410 out of 150,720 (53%)
    - Used as memory: 7,406 out of 58,400 (12%)
- Slice logic distribution
  - Occupied slices: 32,779 out of 37,680 (86%)
  - LUT-flip-flop pairs used: 112,772
    - With an unused flip-flop: 25,037 out of 112,772 (22%)
    - With an unused LUT: 19,017 out of 112,772 (16%)
    - Fully used LUT-FF pairs: 68,718 out of 112,772 (60%)
ARC Phase-4 Goals
- Find bottlenecks and enhance the system
- Communication bottleneck
  - Crossbar design instead of the AXI bus
  - Speed up the non-burst AXI implementation
Accelerator Memory System Design
- Crossbar
  - In addition to the previously proposed features, it now supports partial reconfiguration that does not affect working LCAs
  - Passed the on-board test
- Hierarchical DMACs
  - Handle data transfer between main memory and the shared buffer banks
  - The number of buffer banks can be large, while we want to keep the AXI bus size down, so the DMACs and buses are organized hierarchically (a routing sketch follows)
- Diagram: LCA1-LCA4, the GAM, and the core sit on the main AXI bus to DDR; an IOMMU and a select-bit receiver steer transfers through DMAC1-DMAC3 and secondary AXI buses to buffer banks 1 through 9
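A hedged sketch of how a transfer might be steered through the hierarchical DMACs: the bank-to-DMAC grouping, the register-free helper, and the names are assumptions for illustration, not the recorded design.

#include <stdint.h>

#define BANKS_PER_DMAC 3   /* assumed grouping of the 9 buffer banks across DMAC1..DMAC3 */

/* Assumed helper that programs one DMAC for a main-memory <-> buffer-bank transfer. */
extern void dmac_transfer(int dmac_id, uint32_t ddr_addr, uint32_t bank_offset, uint32_t nbytes);

/* Route the transfer to the DMAC that owns the target bank, so each
 * secondary AXI bus only spans a few banks and stays small. */
void copy_to_bank(int bank_id, uint32_t ddr_addr, uint32_t bank_offset, uint32_t nbytes)
{
    int dmac_id = bank_id / BANKS_PER_DMAC;   /* select-bit style steering (assumed) */
    dmac_transfer(dmac_id, ddr_addr, bank_offset, nbytes);
}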
Crossbar Results