Slide 1: Scaling Soft Processor Systems
Martin Labrecque, Peter Yiannacouras, and Gregory Steffan
University of Toronto
FCCM, April 14, 2008
Slide 2: Soft Processors in FPGAs
FPGAs increasingly implement SoCs with CPUs.
Soft processors: processors built in the FPGA fabric.
Commercial soft processors: Nios II and MicroBlaze.
- Easier to program than HDL
- Customizable
[Figure: an FPGA SoC containing a soft processor, DDR controllers, and an Ethernet MAC]
Slide 3: Our Focus
Throughput-oriented applications:
- composed of many threads
- examples: packet processing, device control, event monitoring
Goal: exploit FPGA resources to maximize throughput.
Slide 4: Rough Expected Results
[Plot: throughput vs. FPGA area, up to the full FPGA capacity; a single-threaded CPU sits at the low end; up and to the left is better]
Tradeoffs in combining more cache/threads/CPUs:
- add cache capacity
- add threads (multithreading)
- add CPUs (multiprocessing)
Slide 5: Overview
[Roadmap figure: a single-threaded processor with a cache, a multithreaded processor with a cache, and a multiprocessor built from such cores]
Slide 6: Evaluation Infrastructure
20 EEMBC benchmarks:
- run copies of the same application to create threads
- similar to packet processing or streaming workloads
- future work: parallel applications with synchronization
Compilation: modified GCC 4.0.2 for the MIPS-I ISA.
Platform:
- Transmogrifier 4 (U. of Toronto)
- Stratix I S80
- 133 MHz DDR SDRAM
Measured:
- performance (in cycles)
- area
- performance per area
Slide 7: Overview
[Roadmap figure; next up: the single-threaded processor with a cache]
Slide 8: Single-Threaded Processor
[Pipeline diagram: PC, +4, instruction memory/cache, register file, ALU, data memory/cache, hazard detection logic]
A 3-stage pipeline is the most area-efficient [CASES'05].
Slide 9: Caches in Single-Threaded Processors
Cache-miss latency is not very large: 8 cycles, roughly 20 times smaller than on a desktop.
Scaling cache capacity therefore has limited impact: a 4KB direct-mapped cache reaches 89% of ideal-cache performance.
[Figure: miss latency in cycles]
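A back-of-the-envelope sketch of why a small cache gets so close to ideal here: the miss penalty is only 8 cycles (the number from slide 9), so even a modest miss rate adds little to the average memory-access time. The 2% miss rate below is an assumed, illustrative figure, not a number from the paper.

```python
# The 8-cycle penalty is from the slide; the miss rate is made up.
MISS_PENALTY = 8  # cycles

def avg_access_cycles(miss_rate, hit_cycles=1):
    """Average memory-access time: hit time plus miss-rate-weighted penalty."""
    return hit_cycles + miss_rate * MISS_PENALTY

ideal = avg_access_cycles(0.0)    # an ideal cache never misses
small = avg_access_cycles(0.02)   # assumed 2% miss rate for a small cache
print(ideal / small)              # ~0.86: memory accesses run close to ideal speed
```

With a desktop-scale penalty of 160 cycles instead of 8, the same miss rate would cut the ratio to roughly 0.24, which is why cache capacity matters so much more there.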
Slide 10: Overview
[Roadmap figure; next up: the multithreaded processor with a cache]
Slide 11: Avoiding Processor Stall Cycles
Before (single-threaded, traditional execution): data and control hazards stall the 5-stage pipeline.
Multithreading: execute streams of independent instructions.
After: interleaving 4 threads eliminates the hazards in a 5-stage pipeline, ideally removing all stalls.
[Pipeline diagrams: F/D/E/M/W stages over time, before with stalls and after with threads 1-4 interleaved]
Slide 12: Handling Cache Misses with Multithreading
Techniques to avoid stalling the other threads:
- a separate stall pipeline [Fort'05]
- elaborate thread scheduling [Moussali'07]
Complex stall-avoidance is costly.
[Pipeline diagram: a load miss blocks the 5-stage pipeline over time]
Slide 13: Instruction Replay
Instruction replay is cheap in hardware: a load that misses is squashed and re-issued later, while the other threads keep executing.
A multithreaded processor can thus hide some memory-miss latency.
[Pipeline diagram: the missed load is replayed and hits on the second attempt]
Slide 14: Multithreaded Processor with Only On-Chip Memory [FPL'07]
[Pipeline diagram: per-thread PCs (PC0-PC3) and register files (RF0-RF3), +4, ALU, hazard detection logic, forwarding lines, instruction memory, data memory and its controller]
A 5-stage pipeline is the most area-efficient [FPL'07].
Slide 15: Multithreaded Soft Processor with Off-Chip Memory
[Pipeline diagram: per-thread PCs (PC0-PC3) and register files (RF0-RF3), ALU, instruction cache, data cache, data memory controller, off-chip data memory]
How the data cache is organized is very important.
Slide 16: Data Cache Organization in Multithreaded Processors
- Shared: one data cache for all threads; threads' cache lines interfere, and write hits cause resource conflicts.
- Partitioned: one cache divided among the threads; no interference, but each thread gets less capacity.
- Private: a separate cache per thread; no interference, but the largest area overhead.
[Figure: the three organizations for threads T0-T3]
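One way to picture the shared vs. partitioned organizations is how a data address maps to a cache set. A sketch (not the paper's RTL; the set counts are assumed for illustration): a partitioned cache uses the thread ID as the upper index bits, reserving a disjoint region of sets per thread so threads can no longer evict each other's lines.

```python
NUM_SETS = 64                    # assumed total sets in the data cache
NUM_THREADS = 4
SETS_PER_THREAD = NUM_SETS // NUM_THREADS

def shared_set(addr):
    """Shared: every thread indexes the whole cache; lines can interfere."""
    return addr % NUM_SETS

def partitioned_set(addr, thread_id):
    """Partitioned: each thread indexes only its own region of sets."""
    return thread_id * SETS_PER_THREAD + addr % SETS_PER_THREAD

# Two threads touching the same address collide in a shared cache
# but land in disjoint regions of a partitioned one.
print(shared_set(0x40))                 # 0, regardless of thread
print(partitioned_set(0x40, 0))         # 0
print(partitioned_set(0x40, 3))         # 48
```

The private organization takes this one step further: each thread gets its own full-size tag and data arrays, removing interference entirely at the cost of replicated area, which is the trade-off the results slides quantify.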
Slide 17: Single-Processor Results
[Plot: throughput vs. area for the shared, partitioned, and private organizations and the single-threaded baseline; up and to the left is better]
All organizations contribute points to the Pareto frontier.
Slide 18: Single-Processor Area Efficiency
[Plot: performance per area; higher is better]
There is a sweet-spot point for each processor.
Slide 19: Overview
[Roadmap figure; next up: the multiprocessor]
Slide 20: Scaling to a Multiprocessor
[Figure: processors 0..N, each with its own instruction and data cache, on a shared bus to off-chip DDR]
We replicated the most area-efficient soft processors; each processor has its own instruction and data cache.
Trade-off space to understand:
- cache capacity
- threads per processor
- processor count
Slide 21: Multiprocessor Comparison
[Plot: throughput vs. area for the maximal designs; up and to the left is better; the largest design runs 68 threads]
Slide 22: Harnessing the Remaining Area for Added Throughput
The memory blocks used for caches and register files are exhausted, but another kind of block memory (MRAM) is still available.
Resource diversification requires us to:
- recuperate a few memory blocks
- create processors with different memory requirements
Maximizing on-chip memory creates heterogeneous multiprocessors: they can accommodate many more threads, but yield only a marginal throughput improvement.
Slide 23: Conclusions
- Scaling techniques that support more threads and off-chip storage help span the throughput/area design space.
- Off-chip memory latency is small (8 cycles); a small direct-mapped cache performs close (89%) to an ideal cache.
- Private caches perform best, but are area-hungry.
- Single-threaded processors perform best for a given total area; multithreaded processors lose their latency-hiding advantage as the processor count increases.
- Heterogeneity extends the design space, with opportunities to adapt applications for heterogeneous cores.
Slide 24: Current Work
- Add support for dependent threads: real workloads, with real sharing and synchronization between threads.
- Measure performance using NetFPGA, the Stanford/Xilinx network card with 4 x Gigabit Ethernet, to perform real high-bandwidth experiments.
Slide 25: Thank You
Martin Labrecque (martinl@eecg.utoronto.ca)
Peter Yiannacouras and Gregory Steffan
ECE Dept., University of Toronto
Slide 26: Related Work
- Without a memory hierarchy, multithreading is more area-efficient [FPL'07].
- Many multiprocessors exist, but this is the first study encompassing:
  - single- and multi-threading
  - data-cache organization and sizing
  - large-scale multiprocessing
  - maximizing the memory channel and usage of FPGA resources
Area efficiency is a combined metric:
  Performance / Area = (Instr. Count x Frequency) / (Cycle Count x Area)
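The combined metric on this slide unpacks as: performance is instructions per second (instruction count divided by execution time, where time is cycle count over frequency), and area efficiency divides that by area. A sketch with made-up numbers:

```python
def area_efficiency(instr_count, cycle_count, frequency_hz, area):
    """Performance per area: (instr_count x frequency) / (cycle_count x area)."""
    performance = instr_count * frequency_hz / cycle_count  # instructions per second
    return performance / area

# Illustrative only: 1e6 instructions retired in 2e6 cycles (CPI = 2)
# at 50 MHz on a design occupying 1000 area units.
print(area_efficiency(1e6, 2e6, 50e6, 1000))  # 25000.0 instructions/s per area unit
```

Since the boards run at a fixed frequency, comparing designs by this metric reduces to comparing CPI times area, which is why the backup slides report CPI directly.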
Slide 27: Memory Write Operation
[Figure: a load/store looks up the tag array and valid/dirty bits]
- Cycle 1: look up the tag array.
- Cycle 2, on a hit: access the data.
- Cycles 2..N, on a miss: access off-chip memory.
Write-back policy; non-allocating writes.
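The write policy on this slide can be sketched in software (this is an illustrative model, not the paper's hardware; line and set counts are assumed): a write hit updates the line and marks it dirty, a write miss goes straight to memory without allocating a line, and a read miss that evicts a dirty line writes it back first.

```python
LINE_WORDS = 4   # assumed words per cache line
NUM_LINES = 8    # assumed lines in a direct-mapped cache

class DirectMappedCache:
    """Direct-mapped, write-back, write-no-allocate data cache."""

    def __init__(self, memory):
        self.memory = memory             # backing store: dict of addr -> value
        self.lines = [None] * NUM_LINES  # each line: {"tag", "data", "dirty"}

    def _index_tag(self, addr):
        line_addr = addr // LINE_WORDS
        return line_addr % NUM_LINES, line_addr // NUM_LINES

    def _fill(self, index, tag):
        """Read-miss path: write back a dirty victim, then fetch the line."""
        victim = self.lines[index]
        if victim is not None and victim["dirty"]:
            base = (victim["tag"] * NUM_LINES + index) * LINE_WORDS
            for i, value in enumerate(victim["data"]):
                self.memory[base + i] = value
        base = (tag * NUM_LINES + index) * LINE_WORDS
        data = [self.memory.get(base + i, 0) for i in range(LINE_WORDS)]
        self.lines[index] = {"tag": tag, "data": data, "dirty": False}

    def read(self, addr):
        index, tag = self._index_tag(addr)
        line = self.lines[index]
        if line is None or line["tag"] != tag:
            self._fill(index, tag)       # reads allocate on a miss
        return self.lines[index]["data"][addr % LINE_WORDS]

    def write(self, addr, value):
        index, tag = self._index_tag(addr)
        line = self.lines[index]
        if line is not None and line["tag"] == tag:
            line["data"][addr % LINE_WORDS] = value
            line["dirty"] = True         # write-back: memory updated at eviction
        else:
            self.memory[addr] = value    # write miss: no allocation

mem = {}
cache = DirectMappedCache(mem)
cache.write(3, 42)    # write miss: value goes straight to memory
print(cache.read(3))  # 42 (the read allocates the line)
```

Non-allocating writes keep a missing store from displacing a line another access may still need, which matters when a small cache is shared among threads.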
Slide 28: Adding Threads to a Multithreaded Core
[Plot]
Slide 29: Multiprocessor Comparison
[Plot: throughput vs. area; up and to the left is better]
Slide 30: Multiprocessor Comparison
[Plot: throughput vs. area; up and to the left is better]
Slide 31: CPI as a Measure of Throughput
- The system has a fixed frequency: 50 MHz, dictated by the Transmogrifier board.
- Compilation differs for single-threaded and multithreaded processors; CPI normalizes across different instruction counts.
- Measure cycles from the start of the first thread to the end of the last, over the total instructions committed by all threads.
- CPI = mean number of cycles to commit one instruction.
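The measurement above can be sketched directly (the per-thread numbers below are made up for illustration): count cycles from the earliest thread start to the latest thread end, and divide by all instructions committed.

```python
def cpi(threads):
    """threads: list of (start_cycle, end_cycle, instructions_committed)."""
    start = min(t[0] for t in threads)
    end = max(t[1] for t in threads)
    total_instructions = sum(t[2] for t in threads)
    return (end - start) / total_instructions

# Four threads, each committing 400 instructions; the slowest ends at cycle 1600.
threads = [(0, 1000, 400), (0, 1200, 400), (0, 1200, 400), (0, 1600, 400)]
print(cpi(threads))  # 1.0: 1600 cycles for 1600 committed instructions
```

Because the wall-clock window runs to the end of the last thread, a single straggler raises the system's CPI even if the other threads finish early, which makes CPI a genuinely system-level throughput measure here.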
Slide 32: d) Heterogeneous Multiprocessors
[Plot: throughput vs. area; up and to the left is better]
Slide 33: Scale Hardware to Maximize Throughput
- Software: scale the number of threads.
- Hardware: scale memory storage (on-chip RAM, on-chip cache, off-chip RAM).
Maximize throughput given a single off-chip memory channel.