Scalable Networking for Next-Generation Computing Platforms. Yoshio Turner, Tim Brecht, Greg Regnier, Vikram Saletore, John Janakiraman, Brian Lynn.


Scalable Networking for Next-Generation Computing Platforms
Yoshio Turner*, Tim Brecht*‡, Greg Regnier§, Vikram Saletore§, John Janakiraman*, Brian Lynn*
* Hewlett Packard Laboratories   § Intel Corporation   ‡ University of Waterloo

14 Feb 2004, SAN-3 workshop – HPCA-10

Outline
Motivation: enable applications to scale to next-generation network and I/O performance on standard computing platforms
Proposed technology strategy:
– Embedded Transport Acceleration (ETA)
– Asynchronous I/O (AIO) programming model
Web server application as evaluation vehicle
Evaluation plan
Conclusions

Motivation: Next-Generation Platform Requirements
Low-overhead packet and protocol processing for next-generation commodity interconnects (e.g., 10 GigE)
– Current systems: performance is impeded by interrupts, context switches, and data copies
– Existing proposals include:
  TCP Offload Engines (TOE): special hardware; cost and time-to-market issues
  RDMA: new protocol; requires support at both endpoints
Increased I/O concurrency for high link utilization
– I/O bandwidth is increasing
– I/O latency is fixed or slowly decreasing toward a limit
→ Need a larger number of in-flight operations to fill the pipe

Proposed Technology Strategy
Embedded Transport Acceleration (ETA) architecture
– Intel Labs project: the prototype architecture dedicates one or more processors, called "Packet Processing Engines" (PPEs), to perform all network packet processing
– Low-overhead processing: the PPE interacts with network interfaces and applications directly via cache-coherent shared memory, bypassing the OS kernel
– Application interface: VIA-style user-level communication
Asynchronous I/O (AIO) programming model
– Split, two-phase file/socket operations:
  Post an I/O operation request (non-blocking call)
  Asynchronously receive completion event information
– High I/O concurrency even for a single-threaded application
– Initial focus: ETA socket AIO (future extensions to file AIO)

Key Advantages
Potentially enables Ethernet and TCP to approach the latency and throughput of System Area Networks
Uses standard system processor/memory resources:
– Automatically tracks semiconductor cost-performance trends
– Leverages microarchitecture trends: multiple cores, hardware multithreading
– Leverages standard software development environments → rapid development
Extensibility: a fully programmable PPE to support evolving data center functionality
– Unified IP-based fabric for all I/O
– RDMA
AIO increases network-centric application scalability

Overview of the ETA Architecture
Partitioned server architecture:
– Host: application execution
– Packet Processing Engine (PPE)
Host-PPE Direct Transport Interface (DTI)
– VIA/InfiniBand-like queuing structures in cache-coherent shared host memory (OS bypass)
– Optimized for sockets/TCP
Direct User Socket Interface (DUSI)
– Thin software layer to support user-level applications

[Figure: ETA Overview: Partitioned Architecture. Host CPU(s) run kernel and user applications (legacy sockets, direct access, iSCSI, file system) over the ETA host interface; the PPE runs TCP/IP and the driver; host and PPE communicate through shared memory; the PPE connects to the LAN, storage, and IPC network fabric.]

ETA Overview: Direct Transport Interface (DTI) Queuing Structure
Asynchronous socket operations: connect, accept, listen, etc.
TCP buffering semantics
– An anonymous buffer pool supports non-pre-posted or out-of-order receive packets
[Figure: DTI queuing structure. Shared host memory between the host and the Packet Processing Engine holds, per DTI: a Tx queue, an Rx queue, an event queue, data buffers, the anonymous buffer pool, and DTI doorbells.]

API for Asynchronous I/O (AIO)
Layer a socket AIO API above the ETA architecture
– Investigate the impact of AIO API features on application structure and performance
Initial focus: the ETA Direct User Socket Interface (DUSI) API, which provides asynchronous socket operations: connect, listen, accept, send, receive
AIO examples:
– File/socket: Windows AIO with completion ports, POSIX AIO
– File I/O: Linux AIO (recently introduced)
– Socket I/O with OS bypass: ETA DUSI, Open Group Sockets API Extensions

ETA Direct User Socket Interface (DUSI) AIO API
Queuing structure setup for sockets:
– One Direct Transport Interface (DTI) per socket
– Event queues are created separately from DTIs
Memory registration:
– Pin user-space memory regions and provide address translation information to ETA for zero-copy transfers
– Provide access keys (protection tags)
The application posts socket I/O operation requests to the DTI Tx and Rx work queues
The PPE delivers operation completion events to the DTI event queues
Both operation posting and event delivery are lightweight (no OS involvement)

AIO Event Queue Binding
AIO API design issue: assignment of events to event queues
– Flexible binding enables applications to separate or group events to facilitate operation scheduling
DUSI: each DTI work queue can be bound, at socket creation, to any event queue
– Allows separating or grouping events from different sockets
– Allows separating events by type (transmit, receive)
Alternatives for event queue binding:
– Windows: per socket
– Linux and POSIX AIO: per operation
– Open Group Sockets API Extensions: per operation type

Retrieving AIO Completion Events
AIO API design issue: the application interface for retrieving events
DUSI: lightweight mechanisms that bypass the OS
– Event queues in shared memory
– Callbacks (similar to Windows)
– Event tags
Application monitoring of multiple event queues:
– Poll for events (acceptable for a small number of queues)
– No events → block in the OS on multiple queues
  This is the uncommon case in a busy server, so using the OS signaling mechanism is acceptable here
  Also useful for simultaneous use of different AIO APIs
– Race conditions are a user-level responsibility

AIO for Files and Sockets
File AIO support
– In the OS (e.g., Linux AIO, POSIX AIO)
– Future: ETA support for file I/O (e.g., via iSCSI or DAFS)
Unified application processing of file and socket events
– The ETA PPE and the OS kernel may both supply event queues
– Blocking on event queues of different types is facilitated by using the OS signal mechanism (as in DUSI)
– Unified event queues may be desirable; they require efficient coordination of ETA and OS access to the event queues
– Support for zero-copy sendfile(): integration of ETA with OS management of the shared file buffer in system memory

Initial Demonstration Vehicle: Web Server Application
Plan: demonstrate the value of ETA/AIO for network-centric applications
Initial target: a web server application
– A single request may require multiple I/Os
– Stresses system resources (especially OS resources)
– Must multiplex thousands to tens of thousands of concurrent connections
Web server architecture alternatives:
– SPED (single-process event-driven)
– MP (multi-process) or MT (multi-threaded)
– Hybrid approach: AMPED (asymmetric multi-process event-driven)
→ The AIO model favors SPED for raw performance

The userver
An open-source micro web server
Extensive tracing and statistics facilities
SPED model: run one process per host CPU
Previous support for Unix non-blocking socket I/O and event notification via Linux epoll()
Modified to support socket AIO (eventually file AIO)
– Generic AIO interface: can be mapped to a variety of underlying AIO APIs (DUSI, Linux AIO, etc.)
Comparison: web server performance with and without the ETA engine
– With standard Linux: processes share the file buffer cache, using sendfile() for zero-copy file transfer
– With ETA: mmap() files into a shared address space

Web Server Event Scheduling
Balance accepting new connections against processing existing connections
Scheduling:
– Separate queues for accept(), read(), and write()/close() completion events
– Process based on current queue lengths
Early results with non-blocking I/O:
– The frequency of accept processing, i.e., how often new connections are accepted, has a significant impact on throughput

Evaluation Plans
Goal: evaluate the approach and compare it to design alternatives
Construct a functional prototype of the proposed stack (Linux)
– Extend the existing ETA prototype's kernel-level interface to user level with OS bypass (DUSI)
– Extend the userver to use socket AIO, with a mapping layer to DUSI
– Evaluate on a 10 GigE client/server setup using a SPECweb-type workload
Current ETA prototype: promising kernel-level micro-benchmark performance
Expectation: ETA + AIO will show significantly higher scalability than the existing Linux network implementation

page 1814 Feb 2004 SAN-3 workshop – HPCA-10 UDP TCP RAW IP Linux Kernel DTI Data Path User Kernel ETA Direct User Sockets Interface (DUSI) Packet Driver Linux Sockets Library uServer - AIO ETA Packet Processing Engine Control Path AIO Mapping Network Interfaces ETA Kernel Agent uServer - sockets Proposed Stack/Comparison

Kernel-Level ETA Prototype

Evaluation Plans: Analyses and Comparisons
Compare the proposed stack to a well-tuned conventional system: checksum offload, TCP segmentation offload, interrupt moderation (NAPI)
Examine micro-architectural impacts: use VTune/oprofile to measure CPU, memory, and cache usage, interrupts, data copies, and context switches
Compare to a TOE
Extend the analysis to application domains beyond the web server: e.g., storage, transaction processing
Port a highly scalable user-level threading package (the UC Berkeley Capriccio project) to ETA
– Benefit: a familiar threaded programming model with efficient "under the hood" AIO and OS bypass

Summary
Proposed a technology strategy combining ETA and AIO to enable industry-standard platforms to scale to next-generation network performance
Cost-performance, time-to-market, and flexibility advantages over alternative approaches
Ethernet/TCP can approach the performance levels of today's SANs, moving toward a unified data center I/O fabric based on commodity hardware
Status:
– Promising initial experimental results for kernel-level ETA
– Prototype implementation of the proposed stack nearly complete
– Testing environment setup based on 10 GigE

Backup Slides

[Figure: Kernel-Level ETA Prototype. Off-the-shelf Linux servers: CPU 0 (host) runs a kernel test program over a kernel abstraction layer and the ETA host interface; CPU 1 (PPE) runs the ETA Packet Processing Engine software driving five Gigabit NICs; the two share host memory; test clients generate the workload.]

[Figure: Full stack detail. User level: user-level sockets applications and services sit on a user sockets provider switch that selects between the OSV user sockets provider and the ETA Direct User Sockets Provider (over the ETA user adaptation layer and user-level DTI). Kernel level: kernel applications use a kernel sockets provider switch that selects between the OS kernel sockets provider (TCP/UDP/RAW/IP and packet driver) and the ETA kernel sockets provider (over the ETA kernel adaptation layer, kernel-level DTI, and ETA kernel agent). Both ETA paths terminate at the ETA Packet Processing Engine.]