CDSC CHP Prototyping
Yu-Ting Chen, Jason Cong, Mohammad Ali Ghodrat, Muhuan Huang, Chunyue Liu, Bingjun Xiao, Yi Zou
Accelerator-Rich Architectures: ARC, CHARM, BiN
Goals
- Implement the architecture features and support into the prototype system
  - Architecture proposals: accelerator-rich CMPs, CHARM, hybrid cache, buffer-in-NUCA, etc.
- Bridge different thrusts in CDSC
Server-Class Platform: HC-1ex Architecture
- Xeon quad-core LV5408: 40 W TDP
- Tesla C1060: 100 GB/s off-chip bandwidth, 200 W TDP
- 4x XC6VLX760 FPGAs: 80 GB/s off-chip bandwidth, 90 W design power
Drawbacks of Commodity Systems
- Limited ability to customize from the architecture point of view
- Board-level integration rather than chip-level integration
- Commodity systems can only reach a certain level; further innovation is needed
CHP Prototyping Plan
- Create the working hardware and software
  - Use the FPGA Extensible Processing Platform (EPP) as the platform
  - Reuse existing FPGA IPs as much as possible
- Work in multiple phases
Target Platforms: Xilinx ML605 and Zynq
- Zynq: dual-core Cortex-A9 with programmable logic
- ML605: Virtex-6 based board
CHP Prototyping Phases
- ARC implementation
  - Phase 1: basic platform (accelerators and software GAM)
  - Phase 2: adding modularity using available IP (e.g., the Xilinx DMAC IP)
  - Phase 3: first step toward BiN (shared buffer; customized modules such as a DMA controller and plug-and-play accelerators)
  - Phase 4: system enhancement (crossbar; AXI implementation)
- CHARM implementation
ARC Phase 1 Goals
- Set up a basic environment
  - Multi-core + simple accelerators + OS
  - Understand the system interactions in more detail
- Simple controller as the GAM (global accelerator manager)
  - Supports system-level sharing of multiple accelerators of the same type
ARC Phase 1 Example System Diagram
- MicroBlaze-0 (Linux with MMU) and MicroBlaze-1 acting as the GAM (bare-metal, no MMU)
- AXI4 crossbar and AXI4-Lite bus connecting DDR3, timer, UART, and mutex peripherals
- vecadd and vecsub accelerators, each reached through a mailbox and FSL links
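To make the Phase-1 flow concrete, here is a minimal sketch of how the Linux core might hand a vecadd request to the GAM through a mailbox. The message format and the mbox_write/mbox_read helpers are assumptions for illustration; the real system would go through the Xilinx mailbox driver and whatever protocol the GAM firmware defines.

#include <stdint.h>

/* Assumed mailbox helpers; not the actual Xilinx driver calls. */
extern void     mbox_write(int mbox_id, uint32_t word);
extern uint32_t mbox_read(int mbox_id);

#define MBOX_VECADD 0   /* mailbox on the path toward the vecadd accelerator */

/* Ask the GAM to run vecadd on arrays already resident in DDR3. */
uint32_t vecadd_request(uint32_t a_addr, uint32_t b_addr, uint32_t c_addr, uint32_t n)
{
    mbox_write(MBOX_VECADD, a_addr);  /* operand A */
    mbox_write(MBOX_VECADD, b_addr);  /* operand B */
    mbox_write(MBOX_VECADD, c_addr);  /* result buffer */
    mbox_write(MBOX_VECADD, n);       /* element count */
    return mbox_read(MBOX_VECADD);    /* block until the GAM reports completion */
}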
ARC Phase-2 Goals
- Implement a system similar to the original ARC design
  - GAM, accelerators, DMA controller, SPM (scratchpad memory)
- Add modularity using available IP
  - E.g., the Xilinx DMAC IP (a simplified usage sketch follows)
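As a rough illustration of the Phase-2 data path (DMA staging data from DDR into an accelerator's scratchpad before the accelerator starts), here is a hedged sketch. The register layout and names are invented for illustration; the actual Xilinx DMAC IP has its own, richer programming model.

#include <stdint.h>

/* Assumed, simplified DMAC register view. */
typedef struct {
    volatile uint32_t src;   /* source address (DDR) */
    volatile uint32_t dst;   /* destination address (accelerator SPM) */
    volatile uint32_t len;   /* bytes to move */
    volatile uint32_t go;    /* write 1 to start */
    volatile uint32_t done;  /* reads 1 when the transfer has finished */
} dmac_regs_t;

/* Stage input data into the accelerator's scratchpad. */
static void dma_to_spm(dmac_regs_t *dmac, uint32_t ddr_src, uint32_t spm_dst, uint32_t nbytes)
{
    dmac->src = ddr_src;
    dmac->dst = spm_dst;
    dmac->len = nbytes;
    dmac->go  = 1;
    while (!dmac->done) ;   /* simple blocking wait */
}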
ARC Phase-2 Architecture
ARC Phase-2 Performance and Power Results
- Benchmarking kernel
- Results:

  Platform                                                        | Runtime (us) | Power (W) | Gain
  CHP prototype on Xilinx ML605 FPGA @ 100 MHz                    | 1,746        | 2         | 17,570X
  2x quad-core Intel Xeon E5405 (x64) @ 2.00 GHz, 1 FPU per core  | 562          | 80        | 1,365X
  Dual-core Intel Xeon 5150 (x32) @ 2.66 GHz, 1 FPU per core      | 10,061       | 65        | 94X
  16-core UltraSPARC T1 @ 1.2 GHz, 1 shared FPU                   | 852,163      | 72        | 1X
ARC Phase-2 Runtime Breakdown
ARC Phase-2 Area Breakdown
- Slice logic utilization
  - Slice registers: 45,283 out of 301,440 (15%)
  - Slice LUTs: 40,749 out of 150,720 (27%)
    - Used as logic: 32,505 out of 150,720 (21%)
    - Used as memory: 5,248 out of 58,400 (8%)
- Slice logic distribution
  - Occupied slices: 17,621 out of 37,680 (46%)
  - LUT-flip-flop pairs used: 54,323
    - With an unused flip-flop: 14,617 out of 54,323 (26%)
    - With an unused LUT: 13,574 out of 54,323 (24%)
    - Fully used LUT-FF pairs: 26,132 out of 54,323 (48%)
ARC Phase-3 Goals
- First step toward BiN: shared buffer
- Design our own customized modules
  - Customized DMA controller that handles batched TLB misses
  - Plug-and-play accelerator design: make the interface general enough, at least for a class of accelerators (a sketch of such an interface follows)
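To illustrate what a "general enough" interface could look like, here is a minimal sketch of a memory-mapped register map shared by a class of accelerators. The layout, field names, and the acc_run helper are assumptions for illustration, not the actual ARC Phase-3 interface.

#include <stdint.h>

/* Hypothetical register map common to a class of accelerators:
 * same control/status/argument layout regardless of the kernel inside. */
typedef struct {
    volatile uint32_t ctrl;     /* bit 0: start */
    volatile uint32_t status;   /* bit 0: done  */
    volatile uint32_t src_addr; /* input buffer (SPM offset or virtual address) */
    volatile uint32_t dst_addr; /* output buffer */
    volatile uint32_t length;   /* number of elements */
} acc_regs_t;

/* Start an accelerator and busy-wait for completion. */
static void acc_run(acc_regs_t *acc, uint32_t src, uint32_t dst, uint32_t len)
{
    acc->src_addr = src;
    acc->dst_addr = dst;
    acc->length   = len;
    acc->ctrl     = 1;                 /* start */
    while ((acc->status & 1) == 0) ;   /* wait for done */
}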
ARC Phase-3 Architecture
- A partial realization of the proposed accelerator-rich CMP on the Xilinx ML605 (Virtex-6)
  - Global accelerator manager (GAM) for accelerator sharing
  - Shared on-chip buffers: many more accelerators than buffer bank resources
  - Virtual addressing in the accelerators; accelerator virtualization
  - Virtual-address DMA, with on-demand TLB filling from the core (sketched below)
  - Not included: network-on-chip, buffer sharing with the cache, customized instructions in the core
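One way to picture the on-demand TLB filling is the following core-side handler: the DMA engine reports the virtual page that missed, and the core writes back the translation. The register names and the lookup_pte helper are assumptions for illustration, not the recorded Phase-3 design.

#include <stdint.h>

/* Hypothetical IOMMU/TLB-miss interface: the DMA engine raises an interrupt
 * with the faulting virtual page, and the core supplies the translation. */
typedef struct {
    volatile uint32_t miss_vpn;   /* virtual page number that missed */
    volatile uint32_t fill_ppn;   /* physical page number written by the core */
    volatile uint32_t fill_valid; /* core sets to 1 after writing fill_ppn */
} iommu_regs_t;

extern uint32_t lookup_pte(uint32_t vpn);  /* walk the process page table (assumed helper) */

/* Interrupt handler on the core: service one TLB miss from the DMA engine. */
void iommu_tlb_miss_handler(iommu_regs_t *iommu)
{
    uint32_t vpn = iommu->miss_vpn;
    iommu->fill_ppn   = lookup_pte(vpn); /* translate on demand */
    iommu->fill_valid = 1;               /* DMA engine resumes the transfer */
}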
ARC Phase-3 Performance and Power Results
- Benchmarking kernel
- Results:

  Platform                                                        | Runtime (us) | Power (W) | EDP gain
  CHP prototype on Xilinx ML605 FPGA @ 100 MHz                    | 1,802        | 2         | 8,050,786X
  2x quad-core Intel Xeon E5405 (x64) @ 2.00 GHz, 1 FPU per core  | 562          | 80        | 2,069,261X
  Dual-core Intel Xeon 5150 (x32) @ 2.66 GHz, 1 FPU per core      | 10,061       | 65        | 7,947X
  16-core UltraSPARC T1 @ 1.2 GHz, 1 shared FPU                   | 852,163      | 72        | 1X
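Reading the UltraSPARC T1 row as the 1X baseline (implied by the table), the EDP gains follow from EDP = power x runtime^2; for example, for the dual-core Xeon 5150:

\mathrm{EDP} = P \cdot t^{2}, \qquad
\text{gain}_{5150} = \frac{72\ \mathrm{W} \times (852{,}163\ \mu\mathrm{s})^{2}}{65\ \mathrm{W} \times (10{,}061\ \mu\mathrm{s})^{2}} \approx 7{,}947\times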
Impact of Communication and Computation Overlapping
- Chart comparing pipelined communication & computation against no pipelining; the pipelined version comes out 19% ahead
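The pipelined schedule can be sketched as a classic double-buffering (ping-pong) loop: while the accelerator computes on one SPM bank, the DMA prefetches the next tile into the other. The dma_start/dma_wait/acc_start/acc_wait helpers are assumptions, not the project's actual API.

/* Assumed helpers: start/wait a DMA fill of a buffer, start/wait the accelerator. */
extern void dma_start(int buf, int tile);
extern void dma_wait(int buf);
extern void acc_start(int buf);
extern void acc_wait(int buf);

void run_pipelined(int num_tiles)
{
    dma_start(0, 0);                     /* prefetch the first tile */
    for (int i = 0; i < num_tiles; i++) {
        int cur = i & 1, nxt = cur ^ 1;
        dma_wait(cur);                   /* tile i is now in SPM bank `cur` */
        if (i + 1 < num_tiles)
            dma_start(nxt, i + 1);       /* overlap: fetch tile i+1 */
        acc_start(cur);
        acc_wait(cur);                   /* compute on tile i */
    }
}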
Overhead of Buffer Sharing: Bank Access Contention (1)
- Comparison: the 4 logical buffers allocated to 4 separate buffer banks vs. all 4 allocated to 1 buffer bank
- Overhead of sharing a single bank: 3.2%
- Reason: the AXI bus allows masters to issue transactions simultaneously, and the AXI transaction time dominates the buffer access time
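A back-of-the-envelope way to see why the overhead stays small (a simple model assumed here, not taken from the slides): if every buffer access is carried by an AXI transaction of latency t_axi and only the bank access time t_bank must serialize, then with n masters sharing one bank the worst-case relative slowdown is roughly

\frac{(n-1)\, t_{\mathrm{bank}}}{t_{\mathrm{axi}} + t_{\mathrm{bank}}}

which remains a few percent whenever t_axi dominates t_bank.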
Overhead of Buffer Sharing: Bank Access Contention (2)
- Comparison: the 4 logical buffers allocated to 4 separate buffer banks vs. all 4 allocated to 1 buffer bank
- Overhead of sharing a single bank: 2.7%
Area Breakdown
- Slice logic utilization
  - Slice registers: 105,969 out of 301,440 (35%)
  - Slice LUTs: 93,755 out of 150,720 (62%)
    - Used as logic: 80,410 out of 150,720 (53%)
    - Used as memory: 7,406 out of 58,400 (12%)
- Slice logic distribution
  - Occupied slices: 32,779 out of 37,680 (86%)
  - LUT-flip-flop pairs used: 112,772
    - With an unused flip-flop: 25,037 out of 112,772 (22%)
    - With an unused LUT: 19,017 out of 112,772 (16%)
    - Fully used LUT-FF pairs: 68,718 out of 112,772 (60%)
ARC Phase-4 Goals
- Find bottlenecks and enhance the system
- Communication bottleneck
  - Crossbar design instead of the AXI bus
  - Speed up the non-burst AXI implementation
Accelerator Memory System Design
- Crossbar
  - In addition to the previously proposed features, it now supports partial reconfiguration that does not affect working LCAs
  - Passed the on-board test
- Hierarchical DMACs
  - Handle data transfer between main memory and the shared buffer banks
  - The number of buffer banks can be large, while we want to keep the AXI bus size down, so the DMACs and buses are organized hierarchically (a routing sketch follows)
- Diagram: LCA1-LCA4, the GAM, and the core sit on the main AXI bus to DDR; an IOMMU and a select-bit receiver steer transfers through DMAC1-DMAC3 and secondary AXI buses to buffer banks 1 through 9
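A hedged sketch of how a transfer might be steered through the hierarchical DMACs: the bank-to-DMAC grouping, the register-free helper, and the names are assumptions for illustration, not the recorded design.

#include <stdint.h>

#define BANKS_PER_DMAC 3   /* assumed grouping of the 9 buffer banks across DMAC1..DMAC3 */

/* Assumed helper that programs one DMAC for a main-memory <-> buffer-bank transfer. */
extern void dmac_transfer(int dmac_id, uint32_t ddr_addr, uint32_t bank_offset, uint32_t nbytes);

/* Route the transfer to the DMAC that owns the target bank, so each
 * secondary AXI bus only spans a few banks and stays small. */
void copy_to_bank(int bank_id, uint32_t ddr_addr, uint32_t bank_offset, uint32_t nbytes)
{
    int dmac_id = bank_id / BANKS_PER_DMAC;   /* select-bit style steering (assumed) */
    dmac_transfer(dmac_id, ddr_addr, bank_offset, nbytes);
}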
Crossbar Results