NetThreads: Programming NetFPGA with Threaded Software
Martin Labrecque, Gregory Steffan (ECE Dept.)
Geoff Salmon, Monia Ghobadi, Yashar Ganjali (CS Dept.)
University of Toronto
Real-Life Customers
● Hardware: NetFPGA board, 4 GigE ports, Virtex II Pro FPGA
● Collaboration with CS researchers:
  – interested in performing network experiments
  – not interested in coding Verilog
  – want to use the GigE links at maximum capacity
● Requirements: a system that is easy to program and efficient
What would the ideal solution look like?
Envisioned System (Someday)
● Many compute engines: a processor for control-flow parallelism plus hardware accelerators for data-level parallelism
● Delivers the expected performance
● Hardware handles communication and synchronization
[Diagram: a processor surrounded by multiple hardware accelerators]
Processors inside an FPGA?
Soft Processors in FPGAs
● Soft processors: processors implemented in the FPGA fabric
● FPGAs increasingly implement SoCs with CPUs
● Commercial soft processors: Nios II and MicroBlaze
● Easier to program than HDL, and customizable
[Diagram: a soft processor in the FPGA connected to DDR controllers and an Ethernet MAC]
What is the performance requirement?
Performance in Packet Processing
● The application defines the required throughput:
  – home networking (~100 Mbps/link)
  – edge routing (≥ 1 Gbps/link)
  – scientific instruments (< 100 Mbps/link)
● Our measure of throughput (sketched below):
  – bisection search for the minimum packet inter-arrival time
  – must not drop any packet
Are soft processors fast enough?
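The bisection search above can be pictured as follows. This is a minimal sketch of the measurement harness, not NetThreads code: any_drops() is a hypothetical helper that replays the test traffic at a given inter-arrival time and reports whether the system lost a packet.

/* Bisection search for the minimum packet inter-arrival time
   sustained with zero drops. any_drops() is a hypothetical
   test-harness helper, not part of NetThreads. */
extern int any_drops(double interval_ns);   /* 1 if a packet was lost */

double min_interarrival(double lo_ns, double hi_ns, double tol_ns)
{
    /* invariant: drops occur at lo_ns, none occur at hi_ns */
    while (hi_ns - lo_ns > tol_ns) {
        double mid = (lo_ns + hi_ns) / 2.0;
        if (any_drops(mid))
            lo_ns = mid;    /* too fast: raise the lower bound  */
        else
            hi_ns = mid;    /* sustained: lower the upper bound */
    }
    return hi_ns;           /* smallest gap with zero drops     */
}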
Realistic Goals
● A 10^9 bps stream with the normal inter-frame gap of 12 bytes
● 2 processors running at 125 MHz
● Cycle budget per packet:
  – 152 cycles for minimally-sized 64 B packets
  – 3060 cycles for maximally-sized 1518 B packets
Soft processors can do non-trivial processing at line rate! How can they be organized efficiently?
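The budget follows directly from the numbers above: at 1 Gbps a byte occupies 8 ns on the wire, which is exactly one cycle at 125 MHz, and with two processors each packet's budget doubles. A quick check of the slide's figures:

/* Cycle budget per packet: (frame + 12 B gap) bytes on the wire,
   one 125 MHz cycle per byte at 1 Gbps, times 2 processors. */
#include <stdio.h>

static unsigned cycle_budget(unsigned frame_bytes)
{
    const unsigned gap_bytes  = 12;   /* inter-frame gap */
    const unsigned processors = 2;
    return (frame_bytes + gap_bytes) * processors;
}

int main(void)
{
    printf("64 B packet:   %u cycles\n", cycle_budget(64));    /* 152  */
    printf("1518 B packet: %u cycles\n", cycle_budget(1518));  /* 3060 */
    return 0;
}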
Key Design Features
Efficient Network Processing
● Memory system with specialized memories
● Multithreaded soft processors
● Support for multiple processors
Multiprocessor System Diagram
[Diagram: two 4-threaded processors with private instruction caches, a shared data cache backed by off-chip DDR, packet input/output buffers with dedicated input/output memories, and a synchronization unit]
● Overcomes the 2-port limitation of block RAMs
● The shared data cache is not the main bottleneck in our experiments
Performance of Single-Threaded Processors
● Single-issue, in-order pipeline
● Should commit 1 instruction every cycle, but:
  – stalls on instruction dependences
  – stalls on memory, I/O, and accelerator accesses
● Throughput depends on sequential execution of packet processing, device control, and event monitoring
Solution to avoid stalls: multithreading, with many concurrent threads
Avoiding Processor Stall Cycles
● Multithreading: execute streams of independent instructions
● Before (single thread): each data or control hazard stalls the traditional 5-stage pipeline
● After (threads 1-4 interleaved round-robin): by the time a thread's next instruction is fetched, its previous one has left the pipeline, so ideally all stalls are eliminated
[Pipeline diagrams: single-threaded execution with stall bubbles vs. 4 interleaved threads with none]
● 4 threads eliminate hazards in a 5-stage pipeline, which is 77% more area-efficient [FPL'07]
Multithreading Evaluation
Infrastructure
● Compilation: modified versions of GCC 4.0.2 and Binutils 2.16 for the MIPS-I ISA
● Timing: with no PLL free, the processors run at the speed of the Ethernet MACs, 125 MHz
● Platform:
  – 2 processors, 4 MAC + 1 DMA ports, 64 MB of 200 MHz DDR2 SDRAM
  – Virtex II Pro 50 (speed grade 7)
  – 16 KB private instruction caches and a shared write-back data cache
  – cache capacity would be increased on a more modern FPGA
● Validation: reference traces from a MIPS simulator, ModelSim, and online instruction trace collection
● A PC server can send ~0.7 Gbps of maximally-sized packets, and a simple packet-echo application keeps up: complex applications are the bottleneck, not the architecture
Our Benchmarks

Benchmark    Description                                 Dynamic instr.      Variance of instr.
                                                         per packet (×1000)  per packet (×1000)
UDHCP        DHCP server                                 35                  36
Classifier   Regular expression + QoS                    13                  35
NAT          Network address translation + accounting     6                   7

Realistic, non-trivial applications, dominated by control flow
What Is Limiting Performance?
[Chart: packet backlog due to synchronization, caused by serializing tasks]
Let's focus on the underlying problem: synchronization
Addressing Synchronization Overhead
Real Threads Synchronize
● All threads execute the same code
● Concurrent threads may access shared data
● Critical sections ensure correctness (expanded below):
  Lock(); shared_var = f(); Unlock();
What is the impact on round-robin scheduled threads?
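Expanded into compilable C, the critical-section idiom above might look like this. The pthread calls stand in for the NetThreads synchronization unit, whose exact API the slides do not give:

/* A critical section protecting shared data. pthread locks stand
   in for the NetThreads synchronization unit; the real API differs. */
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_var;

static int f(void) { return shared_var + 1; }   /* example update */

void update_shared(void)
{
    pthread_mutex_lock(&lock);     /* Lock(): one thread at a time  */
    shared_var = f();              /* safe access to shared data    */
    pthread_mutex_unlock(&lock);   /* Unlock(): admit other threads */
}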
Multithreaded Processor with Synchronization
[Pipeline diagram: 4 round-robin threads in the 5-stage pipeline, annotated with the points where a lock is acquired and released]
Synchronization Wrecks Round-Robin Multithreading
[Pipeline diagram: threads idle between the lock acquire and release]
With round-robin thread scheduling and contention on locks:
● fewer than 4 threads execute concurrently
● 18% of cycles are wasted while blocked on synchronization
Better Handling of Synchronization
[Pipeline diagrams, before and after: before, threads blocked on the lock keep occupying pipeline slots; after, threads 3 and 4 are descheduled while they wait, leaving the pipeline to the runnable threads]
Thread Scheduler
● Suspend any thread waiting for a lock
● Round-robin among the remaining threads
● An unlock operation resumes waiting threads across processors (modeled below)
● A multithreaded processor hides hazards across active threads, but running fewer than N threads again requires hazard detection
● Hazard detection was on the critical path of the single-threaded processor: is there a low-cost solution?
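A behavioral C model of the policy just described; the per-thread blocked flags and the resume-all-on-unlock behavior are modeling simplifications, since the real scheduler is hardware:

/* Behavioral model of the thread scheduler: round-robin over the
   threads, skipping any thread blocked on a lock. Illustrative
   only; the real scheduler is implemented in hardware. */
#define NUM_THREADS 4

static int blocked_on_lock[NUM_THREADS];  /* set on a failed acquire */
static int last = NUM_THREADS - 1;

/* Next runnable thread id, or -1 if every thread is blocked. */
int next_thread(void)
{
    for (int i = 1; i <= NUM_THREADS; i++) {
        int t = (last + i) % NUM_THREADS;
        if (!blocked_on_lock[t]) {
            last = t;
            return t;
        }
    }
    return -1;
}

/* An unlock resumes the threads waiting on that lock. */
void on_unlock(void)
{
    for (int t = 0; t < NUM_THREADS; t++)
        blocked_on_lock[t] = 0;
}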
Static Hazard Detection
● Hazards can be determined at compile time
● The hazard distance (maximum 2) is encoded as part of each instruction:

  Instruction      Hazard distance   Min. issue cycle
  addi r1,r1,r4    0                 0
  addi r2,r2,r5    1                 1
  or   r1,r1,r8    0                 3
  or   r2,r2,r9    0                 4

● Static hazard detection allows scheduling without an extra pipeline stage
● Very low area overhead (5%), no frequency penalty
Thread Scheduler Evaluation
Results on 3 Benchmark Applications
● Thread scheduling improves throughput by 63%, 31%, and 41%
● Why isn't the 2nd processor always improving throughput?
Cycle Breakdown in Simulation
[Charts: cycle breakdown for UDHCP, Classifier, and NAT]
● Descheduling removed the cycles stalled waiting for a lock
● What is the bottleneck?
Impact of Allowing Packet Drops
● The system is still under-utilized
● Throughput is still dominated by serialization
Future Work
● Adding custom hardware accelerators:
  – same interconnect as the processors
  – same synchronization interface
● Evaluating speculative threading:
  – alleviates the need for fine-grained synchronization
  – reduces conservative synchronization overhead
Conclusions
● Efficient multithreaded design:
  – parallel threads hide stalls in any one thread
  – the thread scheduler mitigates synchronization costs
● System features:
  – the system is easy to program in C
  – performance from parallelism is easy to get
● We are on the lookout for relevant applications suitable for benchmarking
NetThreads is available with its compiler at: http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads
Martin Labrecque, Gregory Steffan (ECE Dept.)
Geoff Salmon, Monia Ghobadi, Yashar Ganjali (CS Dept.)
University of Toronto
NetThreads is available with its compiler at: http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads
Backup
Software Network Processing
● Not meant for straightforward tasks accomplished at line speed in hardware, e.g., basic switching and routing
● Advantages compared to hardware:
  – complex applications are best described in high-level software
  – easier to design, faster time-to-market
  – can interface with custom accelerators and controllers
  – can be easily updated
● Our focus: stateful applications
  – data structures are modified by most packets
  – the code is difficult to pipeline into balanced stages
● Run-to-completion/pool-of-threads model for parallelism (see the sketch below):
  – each thread processes a packet from beginning to end
  – no thread-specific behavior
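The run-to-completion model amounts to every thread running the same loop. A minimal sketch; packet_t, get_packet(), process(), and send_packet() are hypothetical names, since the slides do not give the NetThreads packet API:

/* Run-to-completion / pool-of-threads: every thread runs this same
   loop and handles each packet from beginning to end. All names
   here are illustrative placeholders. */
typedef struct packet packet_t;

extern packet_t *get_packet(void);    /* blocks until a packet arrives */
extern void process(packet_t *p);     /* application logic             */
extern void send_packet(packet_t *p); /* hand the packet to output     */

void thread_main(void)
{
    for (;;) {
        packet_t *p = get_packet();   /* claim the next packet      */
        process(p);                   /* run to completion, then... */
        send_packet(p);               /* ...pass it to the output   */
    }
}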
Impact of Allowing Packet Drops
[Chart: throughput of the NAT benchmark as packet drops are allowed]
Cycle Breakdown in Simulation
[Charts: cycle breakdown for UDHCP, Classifier, and NAT]
● Descheduling removed the cycles stalled waiting for a lock
● Throughput is still dominated by serialization
More Sophisticated Thread Scheduling
● Add a pipeline stage to pick a hazard-free instruction:
  Fetch → Thread Selection → Register Read → Execute → Memory → Writeback
● Result:
  – increased instruction latency
  – increased hazard window
  – increased branch misprediction cost
● Can we add hazard detection without an extra pipeline stage?
Implementation
● Where to store the hazard-distance bits?
  – block RAMs are a multiple of 9 bits wide
  – a 36-bit word leaves 4 bits free beside the 32-bit instruction
● The spare bits also encode the lock and unlock flags (see the sketch below)
[Diagram: 36-bit words in the instruction caches; 32-bit words in off-chip DDR]
How do we convert instructions from 36 bits to 32 bits?
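One way to picture the 36-bit cache word is the unpacking below. The exact layout of the 4 sideband bits is an assumption; the slides only say that the lock flag, unlock flag, and hazard distance all fit:

/* Unpack a 36-bit instruction-cache word into the 32-bit MIPS
   instruction and the 4 sideband bits. The bit assignment within
   the sideband is assumed, not taken from the slides. */
#include <stdint.h>

typedef struct {
    uint32_t instr;    /* 32-bit MIPS instruction                */
    uint8_t  hazard;   /* assumed: 2-bit hazard distance (max 2) */
    uint8_t  lock;     /* assumed: 1-bit lock flag               */
    uint8_t  unlock;   /* assumed: 1-bit unlock flag             */
} icache_word_t;

icache_word_t unpack(uint64_t word36)
{
    icache_word_t w;
    uint8_t side = (uint8_t)((word36 >> 32) & 0xF);  /* top 4 bits  */
    w.instr  = (uint32_t)(word36 & 0xFFFFFFFFull);   /* low 32 bits */
    w.hazard = side & 0x3;
    w.lock   = (side >> 2) & 0x1;
    w.unlock = (side >> 3) & 0x1;
    return w;
}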
Instruction Compaction: 36 → 32 bits
● R-type: opcode (6) | rs (5) | rt (5) | rd (5) | sa (5) | function (6); example: add rd, rs, rt
● I-type: opcode (6) | rs (5) | rt (5) | immediate (16); example: addi rt, rs, immediate
● J-type: opcode (6) | target (26); example: j label
● De-compaction: 2 block RAMs plus some logic between the DDR and the cache
● Not on a critical path of the pipeline