Slide 1: Scaling Soft Processor Systems
Martin Labrecque, Peter Yiannacouras, and Gregory Steffan
University of Toronto
FCCM, April 14, 2008
Slide 2: Soft Processors in FPGAs
FPGAs increasingly implement SoCs with CPUs.
Soft processors: processors built in the FPGA fabric.
Commercial soft processors: Nios II and MicroBlaze.
- Easier to program than HDL
- Customizable
[Figure: an FPGA SoC containing a soft processor, DDR controllers, and an Ethernet MAC]
Slide 3: Our Focus
Throughput-oriented applications:
- composed of many threads
- examples: packet processing, device control, event monitoring
Goal: exploit FPGA resources to maximize throughput.
Slide 4: Rough Expected Results
[Plot: throughput vs. FPGA area, up to the full FPGA capacity; a single-threaded CPU sits at the low end; up and to the left is better]
Tradeoffs in combining more cache/threads/CPUs:
- add cache capacity
- add threads (multithreading)
- add CPUs (multiprocessing)
Slide 5: Overview
[Roadmap figure: a single-threaded processor with a cache, a multithreaded processor with a cache, and a multiprocessor built from such cores]
Slide 6: Evaluation Infrastructure
20 EEMBC benchmarks:
- run copies of the same application to create threads
- similar to packet processing or streaming workloads
- future work: parallel applications with synchronization
Compilation: modified GCC 4.0.2 for the MIPS-I ISA.
Platform:
- Transmogrifier 4 (U. of Toronto)
- Stratix I S80
- 133 MHz DDR SDRAM
Measured:
- performance (in cycles)
- area
- performance per area
Slide 7: Overview
[Roadmap figure; next up: the single-threaded processor with a cache]
Slide 8: Single-Threaded Processor
[Pipeline diagram: PC, +4, instruction memory/cache, register file, ALU, data memory/cache, hazard detection logic]
A 3-stage pipeline is the most area-efficient [CASES'05].
Slide 9: Caches in Single-Threaded Processors
Cache-miss latency is not very large: 8 cycles, roughly 20 times smaller than on a desktop.
Scaling cache capacity therefore has limited impact: a 4KB direct-mapped cache reaches 89% of ideal-cache performance.
[Figure: miss latency in cycles]
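A back-of-the-envelope sketch of why a small cache gets so close to ideal here: the miss penalty is only 8 cycles (the number from slide 9), so even a modest miss rate adds little to the average memory-access time. The 2% miss rate below is an assumed, illustrative figure, not a number from the paper.

```python
# The 8-cycle penalty is from the slide; the miss rate is made up.
MISS_PENALTY = 8  # cycles

def avg_access_cycles(miss_rate, hit_cycles=1):
    """Average memory-access time: hit time plus miss-rate-weighted penalty."""
    return hit_cycles + miss_rate * MISS_PENALTY

ideal = avg_access_cycles(0.0)    # an ideal cache never misses
small = avg_access_cycles(0.02)   # assumed 2% miss rate for a small cache
print(ideal / small)              # ~0.86: memory accesses run close to ideal speed
```

With a desktop-scale penalty of 160 cycles instead of 8, the same miss rate would cut the ratio to roughly 0.24, which is why cache capacity matters so much more there.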
Slide 10: Overview
[Roadmap figure; next up: the multithreaded processor with a cache]
Slide 11: Avoiding Processor Stall Cycles
Before (single-threaded, traditional execution): data and control hazards stall the 5-stage pipeline.
Multithreading: execute streams of independent instructions.
After: interleaving 4 threads eliminates the hazards in a 5-stage pipeline, ideally removing all stalls.
[Pipeline diagrams: F/D/E/M/W stages over time, before with stalls and after with threads 1-4 interleaved]
Slide 12: Handling Cache Misses with Multithreading
Techniques to avoid stalling the other threads:
- a separate stall pipeline [Fort'05]
- elaborate thread scheduling [Moussali'07]
Complex stall-avoidance is costly.
[Pipeline diagram: a load miss blocks the 5-stage pipeline over time]
Slide 13: Instruction Replay
Instruction replay is cheap in hardware: a load that misses is squashed and re-issued later, while the other threads keep executing.
A multithreaded processor can thus hide some memory-miss latency.
[Pipeline diagram: the missed load is replayed and hits on the second attempt]
Slide 14: Multithreaded Processor with Only On-Chip Memory [FPL'07]
[Pipeline diagram: per-thread PCs (PC0-PC3) and register files (RF0-RF3), +4, ALU, hazard detection logic, forwarding lines, instruction memory, data memory and its controller]
A 5-stage pipeline is the most area-efficient [FPL'07].
Slide 15: Multithreaded Soft Processor with Off-Chip Memory
[Pipeline diagram: per-thread PCs (PC0-PC3) and register files (RF0-RF3), ALU, instruction cache, data cache, data memory controller, off-chip data memory]
How the data cache is organized is very important.
Slide 16: Data Cache Organization in Multithreaded Processors
- Shared: one data cache for all threads; threads' cache lines interfere, and write hits cause resource conflicts.
- Partitioned: one cache divided among the threads; no interference, but each thread gets less capacity.
- Private: a separate cache per thread; no interference, but the largest area overhead.
[Figure: the three organizations for threads T0-T3]
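One way to picture the shared vs. partitioned organizations is how a data address maps to a cache set. A sketch (not the paper's RTL; the set counts are assumed for illustration): a partitioned cache uses the thread ID as the upper index bits, reserving a disjoint region of sets per thread so threads can no longer evict each other's lines.

```python
NUM_SETS = 64                    # assumed total sets in the data cache
NUM_THREADS = 4
SETS_PER_THREAD = NUM_SETS // NUM_THREADS

def shared_set(addr):
    """Shared: every thread indexes the whole cache; lines can interfere."""
    return addr % NUM_SETS

def partitioned_set(addr, thread_id):
    """Partitioned: each thread indexes only its own region of sets."""
    return thread_id * SETS_PER_THREAD + addr % SETS_PER_THREAD

# Two threads touching the same address collide in a shared cache
# but land in disjoint regions of a partitioned one.
print(shared_set(0x40))                 # 0, regardless of thread
print(partitioned_set(0x40, 0))         # 0
print(partitioned_set(0x40, 3))         # 48
```

The private organization takes this one step further: each thread gets its own full-size tag and data arrays, removing interference entirely at the cost of replicated area, which is the trade-off the results slides quantify.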
Slide 17: Single-Processor Results
[Plot: throughput vs. area for the shared, partitioned, and private organizations and the single-threaded baseline; up and to the left is better]
All organizations contribute points to the Pareto frontier.
Slide 18: Single-Processor Area Efficiency
[Plot: performance per area; higher is better]
There is a sweet-spot point for each processor.
Slide 19: Overview
[Roadmap figure; next up: the multiprocessor]
Slide 20: Scaling to a Multiprocessor
[Figure: processors 0..N, each with its own instruction and data cache, on a shared bus to off-chip DDR]
We replicated the most area-efficient soft processors; each processor has its own instruction and data cache.
Trade-off space to understand:
- cache capacity
- threads per processor
- processor count
Slide 21: Multiprocessor Comparison
[Plot: throughput vs. area for the maximal designs; up and to the left is better; the largest design runs 68 threads]
Slide 22: Harnessing the Remaining Area for Added Throughput
The memory blocks used for caches and register files are exhausted, but another kind of block memory (MRAM) is still available.
Resource diversification requires us to:
- recuperate a few memory blocks
- create processors with different memory requirements
Maximizing on-chip memory creates heterogeneous multiprocessors: they can accommodate many more threads, but yield only a marginal throughput improvement.
Slide 23: Conclusions
- Scaling techniques that support more threads and off-chip storage help span the throughput/area design space.
- Off-chip memory latency is small (8 cycles); a small direct-mapped cache performs close (89%) to an ideal cache.
- Private caches perform best, but are area-hungry.
- Single-threaded processors perform best for a given total area; multithreaded processors lose their latency-hiding advantage as the processor count increases.
- Heterogeneity extends the design space, with opportunities to adapt applications for heterogeneous cores.
Slide 24: Current Work
- Add support for dependent threads: real workloads, with real sharing and synchronization between threads.
- Measure performance using NetFPGA, the Stanford/Xilinx network card with 4 x Gigabit Ethernet, to perform real high-bandwidth experiments.
Slide 25: Thank You
Martin Labrecque (martinl@eecg.utoronto.ca)
Peter Yiannacouras and Gregory Steffan
ECE Dept., University of Toronto
Slide 26: Related Work
- Without a memory hierarchy, multithreading is more area-efficient [FPL'07].
- Many multiprocessors exist, but this is the first study encompassing:
  - single- and multi-threading
  - data-cache organization and sizing
  - large-scale multiprocessing
  - maximizing the memory channel and usage of FPGA resources
Area efficiency is a combined metric:
  Performance / Area = (Instr. Count x Frequency) / (Cycle Count x Area)
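The combined metric on this slide unpacks as: performance is instructions per second (instruction count divided by execution time, where time is cycle count over frequency), and area efficiency divides that by area. A sketch with made-up numbers:

```python
def area_efficiency(instr_count, cycle_count, frequency_hz, area):
    """Performance per area: (instr_count x frequency) / (cycle_count x area)."""
    performance = instr_count * frequency_hz / cycle_count  # instructions per second
    return performance / area

# Illustrative only: 1e6 instructions retired in 2e6 cycles (CPI = 2)
# at 50 MHz on a design occupying 1000 area units.
print(area_efficiency(1e6, 2e6, 50e6, 1000))  # 25000.0 instructions/s per area unit
```

Since the boards run at a fixed frequency, comparing designs by this metric reduces to comparing CPI times area, which is why the backup slides report CPI directly.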
Slide 27: Memory Write Operation
[Figure: a load/store looks up the tag array and valid/dirty bits]
- Cycle 1: look up the tag array.
- Cycle 2, on a hit: access the data.
- Cycles 2..N, on a miss: access off-chip memory.
Write-back policy; non-allocating writes.
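The write policy on this slide can be sketched in software (this is an illustrative model, not the paper's hardware; line and set counts are assumed): a write hit updates the line and marks it dirty, a write miss goes straight to memory without allocating a line, and a read miss that evicts a dirty line writes it back first.

```python
LINE_WORDS = 4   # assumed words per cache line
NUM_LINES = 8    # assumed lines in a direct-mapped cache

class DirectMappedCache:
    """Direct-mapped, write-back, write-no-allocate data cache."""

    def __init__(self, memory):
        self.memory = memory             # backing store: dict of addr -> value
        self.lines = [None] * NUM_LINES  # each line: {"tag", "data", "dirty"}

    def _index_tag(self, addr):
        line_addr = addr // LINE_WORDS
        return line_addr % NUM_LINES, line_addr // NUM_LINES

    def _fill(self, index, tag):
        """Read-miss path: write back a dirty victim, then fetch the line."""
        victim = self.lines[index]
        if victim is not None and victim["dirty"]:
            base = (victim["tag"] * NUM_LINES + index) * LINE_WORDS
            for i, value in enumerate(victim["data"]):
                self.memory[base + i] = value
        base = (tag * NUM_LINES + index) * LINE_WORDS
        data = [self.memory.get(base + i, 0) for i in range(LINE_WORDS)]
        self.lines[index] = {"tag": tag, "data": data, "dirty": False}

    def read(self, addr):
        index, tag = self._index_tag(addr)
        line = self.lines[index]
        if line is None or line["tag"] != tag:
            self._fill(index, tag)       # reads allocate on a miss
        return self.lines[index]["data"][addr % LINE_WORDS]

    def write(self, addr, value):
        index, tag = self._index_tag(addr)
        line = self.lines[index]
        if line is not None and line["tag"] == tag:
            line["data"][addr % LINE_WORDS] = value
            line["dirty"] = True         # write-back: memory updated at eviction
        else:
            self.memory[addr] = value    # write miss: no allocation

mem = {}
cache = DirectMappedCache(mem)
cache.write(3, 42)    # write miss: value goes straight to memory
print(cache.read(3))  # 42 (the read allocates the line)
```

Non-allocating writes keep a missing store from displacing a line another access may still need, which matters when a small cache is shared among threads.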
Slide 28: Adding Threads to a Multithreaded Core
[Plot]
Slide 29: Multiprocessor Comparison
[Plot: throughput vs. area; up and to the left is better]
Slide 30: Multiprocessor Comparison
[Plot: throughput vs. area; up and to the left is better]
Slide 31: CPI as a Measure of Throughput
- The system has a fixed frequency: 50 MHz, dictated by the Transmogrifier board.
- Compilation differs for single-threaded and multithreaded processors; CPI normalizes across different instruction counts.
- Measure cycles from the start of the first thread to the end of the last, over the total instructions committed by all threads.
- CPI = mean number of cycles to commit one instruction.
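The measurement above can be sketched directly (the per-thread numbers below are made up for illustration): count cycles from the earliest thread start to the latest thread end, and divide by all instructions committed.

```python
def cpi(threads):
    """threads: list of (start_cycle, end_cycle, instructions_committed)."""
    start = min(t[0] for t in threads)
    end = max(t[1] for t in threads)
    total_instructions = sum(t[2] for t in threads)
    return (end - start) / total_instructions

# Four threads, each committing 400 instructions; the slowest ends at cycle 1600.
threads = [(0, 1000, 400), (0, 1200, 400), (0, 1200, 400), (0, 1600, 400)]
print(cpi(threads))  # 1.0: 1600 cycles for 1600 committed instructions
```

Because the wall-clock window runs to the end of the last thread, a single straggler raises the system's CPI even if the other threads finish early, which makes CPI a genuinely system-level throughput measure here.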
Slide 32: d) Heterogeneous Multiprocessors
[Plot: throughput vs. area; up and to the left is better]
Slide 33: Scale Hardware to Maximize Throughput
- Software: scale the number of threads.
- Hardware: scale memory storage (on-chip RAM, on-chip cache, off-chip RAM).
Maximize throughput given a single off-chip memory channel.