NetThreads: Programming NetFPGA with Threaded Software Martin Labrecque Gregory Steffan ECE Dept. Geoff Salmon Monia Ghobadi Yashar Ganjali University of Toronto CS Dept.


NetThreads: Programming NetFPGA with Threaded Software Martin Labrecque Gregory Steffan ECE Dept. Geoff Salmon Monia Ghobadi Yashar Ganjali University of Toronto CS Dept.

2 Real-Life Customers
● Hardware:
– NetFPGA board, 4 GigE ports, Virtex 2 Pro FPGA
● Collaboration with CS researchers:
– Interested in performing network experiments
– Not in coding Verilog
– Want to use the GigE links at maximum capacity
● Requirements:
– An easy-to-program system
– An efficient system
What would the ideal solution look like?

3 Envisioned System (Someday)
● Many compute engines: processors and hardware accelerators
● Delivers the expected performance
● Hardware handles communication and synchronization
● Exploits both control-flow and data-level parallelism
Processors inside an FPGA?

4 Customizable Soft Processors in FPGAs
● Soft processors: processors implemented in the FPGA fabric
● FPGAs increasingly implement SoCs with CPUs, DDR controllers, and Ethernet MACs
● Commercial soft processors: NIOS-II and MicroBlaze
● Easier to program than HDL, and customizable
What is the performance requirement?

5 Performance in Packet Processing
● The application defines the required throughput:
– Home networking (~100 Mbps/link)
– Scientific instruments (< 100 Mbps/link)
– Edge routing (≥ 1 Gbps/link)
● Our measure of throughput:
– Bisection search for the minimum packet inter-arrival time
– Must not drop any packet
Are soft processors fast enough?

6 Realistic Goals
● A 10^9 bps stream with the normal inter-frame gap of 12 bytes
● 2 processors running at 125 MHz
● Cycle budget:
– 152 cycles for minimally-sized 64B packets
– 3060 cycles for maximally-sized 1518B packets
Soft processors: non-trivial processing at line rate!
How can they be organized efficiently?

Key Design Features

8 Efficient Network Processing
● Memory system with specialized memories
● Multithreaded soft processors
● Support for multiple processors

9 Multiprocessor System Diagram
[Diagram: two 4-threaded processors with private instruction caches, a shared data cache, input/output buffers and memories, a synchronization unit, and off-chip DDR; packets flow from input buffer to output buffer]
– Overcomes the 2-port limitation of block RAMs
– The shared data cache is not the main bottleneck in our experiments

10 Performance of Single-Threaded Processors
● Single-issue, in-order pipeline
● Should commit 1 instruction every cycle, but:
– stalls on instruction dependences
– stalls on memory, I/O, and accelerator accesses
● Throughput depends on sequential execution:
– packet processing
– device control
– event monitoring
Solution to avoid stalls: multithreading, with many concurrent threads

11 Avoiding Processor Stall Cycles
● Multithreading: execute streams of independent instructions
[Pipeline diagrams, before and after: a single thread stalls on data and control hazards; with 4 threads interleaved, adjacent instructions in the 5-stage pipeline are independent]
● Ideally eliminates all stalls: 4 threads eliminate hazards in a 5-stage pipeline
● The multithreaded 5-stage pipeline is 77% more area-efficient [FPL '07]

Multithreading Evaluation

13 Infrastructure
● Compilation: modified versions of GCC and Binutils 2.16 for the MIPS-I ISA
● Timing: no free PLL, so the processors run at the speed of the Ethernet MACs, 125 MHz
● Platform:
– 2 processors, 4 MAC + 1 DMA ports, 64 MB of 200 MHz DDR2 SDRAM
– Virtex II Pro 50 (speed grade 7ns)
– 16KB private instruction caches and a shared write-back data cache
– Capacity would be increased on a more modern FPGA
● Validation:
– Reference traces from a MIPS simulator
– ModelSim and online instruction trace collection
- A PC server can send ~0.7 Gbps of maximally-sized packets
- A simple packet-echo application can keep up
- Complex applications are the bottleneck, not the architecture

14 Our Benchmarks

Benchmark   | Description                               | Dynamic instr. per packet (x1000) | Variance of instr. per packet (x1000)
UDHCP       | DHCP server                               | 35                                | 36
Classifier  | Regular expression + QoS                  | 13                                | 35
NAT         | Network address translation + accounting  | 6                                 | 7

Realistic, non-trivial applications, dominated by control flow

15 What is limiting performance?
[Charts: packet backlog due to synchronization; serializing tasks]
Let's focus on the underlying problem: synchronization

Addressing Synchronization Overhead

17 Real Threads Synchronize
● All threads execute the same code
● Concurrent threads may access shared data
● Critical sections ensure correctness:
Lock();
shared_var = f();
Unlock();
What is the impact on round-robin scheduled threads?

18 Multithreaded Processor with Synchronization
[Pipeline diagram: instruction timelines of the four threads through the 5-stage pipeline, marking where a lock is acquired and released]

19 Synchronization Wrecks Round-Robin Multithreading
[Pipeline diagram: issue slots go unused while threads are blocked between lock acquire and release]
With round-robin thread scheduling and contention on locks:
– fewer than 4 threads execute concurrently
– more than 18% of cycles are wasted while blocked on synchronization

20 Better Handling of Synchronization
[Pipeline diagrams, before and after: descheduling a thread that is blocked on a lock lets the remaining threads use its issue slots]

21 Thread Scheduler
● Suspend any thread waiting for a lock
● Round-robin among the remaining threads
● An unlock operation resumes threads across processors
– The multithreaded processor hides hazards across active threads
– Fewer than N active threads requires hazard detection
● But hazard detection was on the critical path of the single-threaded processor
Is there a low-cost solution?

22 Static Hazard Detection
● Hazards can be determined at compile time
● Hazard distances (at most 2) are encoded as part of the instructions:

Instruction     Hazard distance   Min. issue cycle
addi r1,r1,r4   0                 0
addi r2,r2,r5   1                 1
or r1,r1,r8     0                 3
or r2,r2,r9     0                 4

● Static hazard detection allows scheduling without an extra pipeline stage
● Very low area overhead (5%), no frequency penalty

Thread Scheduler Evaluation

24 Results on 3 Benchmark Applications
[Chart: packet throughput for the three benchmarks, with and without thread scheduling]
– Thread scheduling improves throughput by 63%, 31%, and 41%
– Why isn't the 2nd processor always improving throughput?

25 Cycle Breakdown in Simulation
[Charts: per-cycle breakdown for UDHCP, Classifier, and NAT]
– Removed the cycles stalled waiting for a lock
– What is the bottleneck?

26 Impact of Allowing Packet Drops
– The system is still under-utilized
– Throughput is still dominated by serialization

27 Future Work
● Adding custom hardware accelerators:
– Same interconnect as the processors
– Same synchronization interface
● Evaluating speculative threading:
– Alleviates the need for fine-grained synchronization
– Reduces conservative synchronization overhead

28 Conclusions
● Efficient multithreaded design:
– Parallel threads hide stalls in any one thread
– The thread scheduler mitigates synchronization costs
● System features:
– The system is easy to program in C
– Performance from parallelism is easy to get
We are on the lookout for relevant applications suitable for benchmarking.
NetThreads is available with its compiler at:

29 Martin Labrecque, Gregory Steffan (ECE Dept.), Geoff Salmon, Monia Ghobadi, Yashar Ganjali (CS Dept.), University of Toronto
NetThreads is available with its compiler at:

30 Backup

31 Software Network Processing
● Not meant for:
– Straightforward tasks accomplished at line speed in hardware
– e.g. basic switching and routing
● Advantages compared to hardware:
– Complex applications are best described in high-level software
– Easier to design, with fast time-to-market
– Can interface with custom accelerators and controllers
– Can be easily updated
● Our focus: stateful applications
– Data structures are modified by most packets
– Difficult to pipeline the code into balanced stages
● Run-to-completion / pool-of-threads model for parallelism:
– Each thread processes a packet from beginning to end
– No thread-specific behavior

32 Impact of Allowing Packet Drops (NAT benchmark)
[Chart: throughput vs. allowed packet drops for the NAT benchmark]

33 Cycle Breakdown in Simulation
[Charts: per-cycle breakdown for UDHCP, Classifier, and NAT]
– Removed the cycles stalled waiting for a lock
– Throughput is still dominated by serialization

34 More Sophisticated Thread Scheduling
● Add a pipeline stage to pick a hazard-free instruction:
Fetch → Thread Selection → Register Read → Execute → Memory → Writeback
● Result:
– Increased instruction latency
– Increased hazard window
– Increased branch misprediction cost
Can we add hazard detection without an extra pipeline stage?

35 Implementation
● Where to store the hazard-distance bits?
– Block RAMs are a multiple of 9 bits wide
– A 36-bit word leaves 4 bits free alongside the 32-bit instruction
● The spare bits also encode the lock and unlock flags
[Diagram: 36-bit word = 4 bits (lock/unlock + hazard distance) + 32-bit instruction; the instruction caches store 36-bit words while off-chip DDR stores 32-bit words]
How to convert instructions from 36 bits to 32 bits?

36 Instruction Compaction: 36 → 32 bits
● MIPS-I instruction formats:
– R-type: opcode (6), rs (5), rt (5), rd (5), sa (5), function (6); example: add rd, rs, rt
– I-type: opcode (6), rs (5), rt (5), immediate (16); example: addi rt, rs, immediate
– J-type: opcode (6), target (26); example: j label
– De-compaction: 2 block RAMs + some logic between DDR and cache
– Not on a critical path of the pipeline