Frame Shared Memory: Line-Rate Networking on Commodity Hardware

Presentation transcript:

Frame Shared Memory: Line-Rate Networking on Commodity Hardware
John Giacomoni, John K. Bennett, Douglas C. Sicker, and Manish Vachharajani - University of Colorado at Boulder
Alexander L. Wolf - Imperial College London
Antonio Carzaniga - University of Lugano
2007.12.03

Problem Description
How do we route? How do we protect? How do we correlate?

Link      Mbps       fps          ns/frame
T-1            1.5        2,941     340,000
T-3           45.0       90,909      11,000
OC-3         155.0      333,333       3,000
OC-12        622.0    1,219,512         820
GigE       1,000.0    1,488,095         672
OC-48      2,500.0    5,000,000         200
10 GigE   10,000.0   14,925,373          67
OC-192     9,500.0   19,697,843          51

In the past things were simpler because frame rates were low. The focus here is GigE, the most common LAN link type. With IEEE 802.3ba bringing 40 and 100 GigE, we are all going to hurt.
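As a sanity check on the GigE row above, here is a minimal sketch of the arithmetic behind the fps and ns/frame columns, assuming minimum-size 64B Ethernet frames plus the 8B preamble/SFD and 12B inter-frame gap that also occupy the wire:

/* Sketch: per-frame time budget for minimum-size Ethernet frames at GigE. */
#include <stdio.h>

int main(void)
{
    const double link_bps   = 1e9;             /* GigE                          */
    const double wire_bytes = 64 + 8 + 12;     /* frame + preamble/SFD + IFG    */
    const double wire_bits  = wire_bytes * 8;  /* 672 bits on the wire          */

    double fps      = link_bps / wire_bits;    /* ~1,488,095 frames per second  */
    double ns_frame = 1e9 / fps;               /* ~672 ns per frame             */

    printf("GigE: %.0f fps, %.0f ns/frame\n", fps, ns_frame);
    return 0;
}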

ASIC Solutions
The same questions and link-rate table apply as on the previous slide. ASICs are expensive in terms of design but cheap to produce in bulk. Yay MIT RAW tiles :)

Programmable Network Processors
Lower design cost than ASICs (example: the Intel® IXP2855). Notice that the design is less complicated than a general-purpose processor, but every component is exposed and must be managed by the programmer. It is still a powerful technique: CloudShield built Carnivore for OC-48 using a large array of Intel IXP1200s. Still not future proof, though, and unit costs don't scale as well as ASICs for large-volume products.

:( I’ve painted a rather bleak picture, huh? ;)

Multicore Systems
General-purpose (GPP) multicore systems: individual cores are less powerful than a uniprocessor, but there will be 10s, 100s, even 1000s of cores, with full OS and library support. Examples: Intel (2x2-core), MIT RAW (16-core), and 100-core and 400-core designs. This is a convergence of general-purpose processors and embedded processors. Special-purpose instruction sets have a long history; asymmetric processors have been considered in the past (Alpha), and heterogeneous processors are presently being considered (AMD, Intel).

Moore's Corollary vs. Moore's Law
This is not a flash in the pan; we have fallen off Moore's corollary. [Graph: SPEC benchmark suite performance, predicted vs. actual. Courtesy of Tipp Moseley.]

Soft Network Processing (Soft-NP)
The idea is to replace a switch with a commodity machine. Note that replacing full switches is likely to remain the domain of very specialized systems. The question is how to get the necessary performance.

Soft-NP Technique: Frame Generation
Consider a frame generation pipeline: Generate (Gen), Application (App), Output (OP). Other configurations are obvious. With full pipeline overlap we get 3x the processing time per frame with no loss of throughput. Communication is critical: if the handoff cost is too high, we lose everything.
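A minimal sketch of that three-stage structure, with one thread per stage and a queue between adjacent stages. The names ffq_t, ffq_enqueue(), ffq_dequeue(), make_frame(), process_frame(), and transmit_frame() are hypothetical stand-ins for illustration; the queues stand in for the FastForward queues introduced later in the talk.

/* Hypothetical single-producer/single-consumer queue interface. */
typedef struct ffq ffq_t;
extern int ffq_enqueue(ffq_t *q, void *frame);   /* 0 on success, else EWOULDBLOCK */
extern int ffq_dequeue(ffq_t *q, void **frame);  /* 0 on success, else EWOULDBLOCK */

/* Assumed helpers, for illustration only. */
extern void *make_frame(void);
extern void  process_frame(void *frame);
extern void  transmit_frame(void *frame);

static ffq_t *gen_to_app, *app_to_op;   /* one queue per stage boundary */

/* Each stage would typically be started with pthread_create() and pinned
 * to its own core. */
static void *gen_stage(void *arg)
{
    (void)arg;
    for (;;) {
        void *frame = make_frame();
        while (ffq_enqueue(gen_to_app, frame) != 0)
            ;                            /* spin until a slot frees up */
    }
    return NULL;
}

static void *app_stage(void *arg)
{
    (void)arg;
    void *frame;
    for (;;) {
        while (ffq_dequeue(gen_to_app, &frame) != 0)
            ;
        process_frame(frame);            /* the per-frame application work */
        while (ffq_enqueue(app_to_op, frame) != 0)
            ;
    }
    return NULL;
}

static void *op_stage(void *arg)
{
    (void)arg;
    void *frame;
    for (;;) {
        while (ffq_dequeue(app_to_op, &frame) != 0)
            ;
        transmit_frame(frame);           /* hand the frame to the output NIC */
    }
    return NULL;
}

With the stages overlapped like this, each frame can spend up to three stage-times in the pipeline while throughput stays at one frame per stage-time, which is the 3x effect mentioned above.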

AMD Opteron System Overview
What does a modern commodity platform look like? Multicore, multiprocessor, lots of processing power per core, NUMA, and simple network interface devices.

Data Flow: Frame Generation
Gen, App, and OP stages, with the OS handling the NIC. The assumption is that communication happens via shared memory instead of specialized hardware. Looking at how data moves through the system clarifies the communication problem: all communication goes through the memory subsystem, and in particular through the caches.

Communication Overhead

Communication Overhead
Locks: 200 ns. Standard lock-based handoffs are expensive: at GigE, <=40% of the frame processing time is available per stage for 64B frames.
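A rough sketch of where the <=40% figure can come from, assuming each stage pays for one dequeue and one enqueue (two lock-based handoffs) per frame; the two-handoffs-per-stage assumption is mine, not from the slide:

#include <stdio.h>

int main(void)
{
    double frame_ns   = 672.0;                       /* 64B frame at GigE  */
    double handoff_ns = 200.0;                       /* lock-based handoff */
    double work_ns    = frame_ns - 2.0 * handoff_ns; /* 272 ns remaining   */

    printf("work budget: %.0f ns = %.0f%% of %.0f ns\n",
           work_ns, 100.0 * work_ns / frame_ns, frame_ns);
    return 0;
}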

Communication Overhead
Hardware: 10 ns. Locks: 200 ns. Hardware queues can give us fantastic performance, but they aren't prevalent and are expensive to implement. Multicore systems may be problematic for hardware queues.

Communication Overhead
Hardware: 10 ns. Lamport: 160 ns. Locks: 200 ns. Lamport's queue gives us better performance, but it is not really good enough, and we can do better :)

Communication Overhead
Hardware: 10 ns. FastForward: 28 ns. Lamport: 160 ns. Locks: 200 ns. FastForward is much better.

FastForward

enqueue(data) {
    lock(queue);
    if (NEXT(head) == tail) {
        unlock(queue);
        return EWOULDBLOCK;
    }
    buffer[head] = data;
    head = NEXT(head);
    unlock(queue);
    return 0;
}

enqueue_fastforward(data) {
    if (NULL != buffer[head]) {
        return EWOULDBLOCK;
    }
    buffer[head] = data;
    head = NEXT(head);
    return 0;
}

Lamport's queue does not work on many modern machines with weak memory models; it is proven correct only under sequential consistency. Critical details are discussed in this paper. For more details, see our upcoming PPoPP paper, which includes detailed performance characteristics of the queues and a proof of correctness. FastForward is a cache-optimized concurrent lock-free (CLF) queue that works with strong to weak consistency models and hides die-to-die communication.

Giacomoni, Moseley, and Vachharajani. "FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue." To appear: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), February 2008.
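For completeness, here is a sketch of the matching consumer-side dequeue in the same pseudocode style as the enqueue above (it is not shown on the slide): the consumer tests the slot for NULL instead of comparing head and tail indices, and clears the slot after reading it.

dequeue_fastforward(data_ptr) {
    if (NULL == buffer[tail]) {
        return EWOULDBLOCK;            /* queue is empty */
    }
    *data_ptr = buffer[tail];
    buffer[tail] = NULL;               /* mark the slot empty for the producer */
    tail = NEXT(tail);
    return 0;
}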

Frame Shared Memory (FShm)
A pure software stack communicating via shared memory, abstracted at the driver/NIC boundary, with cross-domain modules (Kernel/Process, T/T, P/P, K/K). It is compatible with existing OS/library/language services and can communicate with any device on the memory interconnect. Notice how this makes upgrading between platform revisions relatively painless. FastForward hides the core-to-core communication.

FShm Driver API

struct ifdirect {
    void         (*if_direct_tick)          (void *softc);
    void         (*if_direct_attach)        (struct ifnet *, void *);
    void         (*if_direct_detach)        (struct ifnet *, void *);
    int          (*if_direct_tx)            (void *softc, struct mbuf *txbuf);
    void         (*if_direct_tx_post)       (void *softc);
    void         (*if_direct_tx_clean_pre)  (void *softc);
    struct mbuf *(*if_direct_tx_clean)      (void *softc);
    void         (*if_direct_tx_clean_post) (void *softc);
    void         (*if_direct_rx_pre)        (void *softc);
    struct mbuf *(*if_direct_rx)            (void *, struct mbuf *new_rxbuf);
    void         (*if_direct_rx_post)       (void *softc);
};

This is a straightforward driver abstraction, similar to the one used in Linux.
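As a hypothetical illustration of how these hooks might be driven, here is a minimal receive-polling sketch. The buffer-swap semantics assumed for if_direct_rx (hand the driver a fresh mbuf, get back a filled one or NULL) and the helpers get_fresh_mbuf() and consume_frame() are assumptions for illustration, not part of the FShm API as presented.

struct mbuf;                                  /* opaque here               */
extern struct mbuf *get_fresh_mbuf(void);     /* assumed helper            */
extern void consume_frame(struct mbuf *m);    /* e.g. hand to the App stage */

void rx_poll(struct ifdirect *ifd, void *softc)
{
    for (;;) {
        ifd->if_direct_rx_pre(softc);
        struct mbuf *m = ifd->if_direct_rx(softc, get_fresh_mbuf());
        ifd->if_direct_rx_post(softc);

        if (m != NULL)
            consume_frame(m);                 /* a frame was received */
    }
}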

FShm Evaluation Methodology
AMD Opteron, 2.0 GHz, dual-processor and dual-core. The average time per call is computed using the TSC (time stamp counter).
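A sketch of that measurement style, averaging many calls over a TSC interval; the rdtsc wrapper and the harness below are illustrative assumptions, not the exact code used in the evaluation.

#include <stdint.h>

/* Read the x86 time stamp counter. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Average cycles per call over many iterations to amortize timer overhead. */
uint64_t avg_cycles(void (*fn)(void), unsigned iters)
{
    uint64_t start = rdtsc();
    for (unsigned i = 0; i < iters; i++)
        fn();
    return (rdtsc() - start) / iters;
}

Dividing the result by the 2.0 GHz clock rate converts cycles to nanoseconds.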

Frame Generation Data Flow
Gen, App, and OP stages with the OS handling the NIC, as a reminder of what the frame generation setup looks like.

FShm Generate (Linux pktgen)
The send stage is marked as zero time because no application code should be executed once the destination NIC has been targeted. A limitation in the evaluation hardware platform, probably PCI-X, prevents frame sizes below 74B from reaching the theoretical maximum. 64B*: 1.36 Mfps.

FShm Capture (IDS)
64B*: 1.36 Mfps.

FShm Forward (Bridge)
64B*: 1.36 Mfps.

FShm's Future
Hardware: 10 ns. FastForward: 28 ns. Lamport: 160 ns. Locks: 200 ns. How will FShm scale to faster networks? OC-48 allows 200 ns per stage including communication time, leaving ~120 ns, which is sufficient for many fast-path applications. Improved performance can be expected, since the 120 ns figure assumes processors, memory, and interconnects remain the same speed as today. Finally, two points: 1) these numbers are for pipelines; data-parallel techniques plus additional processors will let FShm scale stage length as is done today; 2) we are currently investigating improving performance with cache forwarding techniques for payload data and shared application state.

Questions? john.giacomoni@colorado.edu