Network Server Performance and Scalability
June 9, 2005
Scott Rixner – Rice Computer Architecture Group

Rice Computer Architecture
• Faculty
– Scott Rixner
• Students
– Mike Calhoun
– Hyong-youb Kim
– Jeff Shafer
– Paul Willmann
• Research Focus
– System architecture
– Embedded systems

Network Servers Today
• Content types
– Mostly text, small images
– Low-quality video (Kbps-range bit rates)
• [Diagram: the network server connects to the Internet at 1 Gbps; clients connect at 3 Mbps]

Network Servers in the Future
• Content types
– Diverse multimedia content
– DVD-quality video (10 Mbps)
• [Diagram: the network server connects to the Internet at 100 Gbps; clients connect at 100 Mbps]

TCP Performance Issues
• Network interfaces
– Limited flexibility
– Serialized access
• Computation
– Only about 3,000 instructions per packet
– However, very low IPC and parallelization difficulties
• Memory
– Large connection data structures (about 1 KB each)
– Low locality, high DRAM latency
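To make the memory point concrete, here is an illustrative sketch of the kind of per-connection state a TCP stack keeps; the fields and sizes are loose assumptions in the style of a BSD control block plus socket bookkeeping, not the structures measured in this work.

    /* Illustrative per-connection state; the fields are assumptions for
     * illustration, not any particular stack's data structures. */
    #include <stdint.h>

    struct pkt_queue { void *head, *tail; uint32_t count; };

    struct conn_state {
        /* endpoint identification */
        uint32_t local_ip, remote_ip;
        uint16_t local_port, remote_port;

        /* send and receive sequence state */
        uint32_t snd_una, snd_nxt, snd_wnd, snd_cwnd, snd_ssthresh;
        uint32_t rcv_nxt, rcv_wnd, rcv_adv;

        /* round-trip timing and timers */
        uint32_t srtt, rttvar, rto;
        uint32_t timers[4];            /* retransmit, persist, keepalive, 2MSL */

        /* queues, socket linkage, negotiated options */
        struct pkt_queue snd_q, rcv_q; /* queue headers, not the frame data */
        void *socket, *route;
        uint8_t options[40];
        /* ...plus socket-buffer bookkeeping, locks, and hash/list pointers,
         * which is how real stacks approach ~1 KB per connection. */
    };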

Selected Research
• Network interfaces
– Programmable NIC design
– Firmware parallelization
– Network interface data caching
• Operating systems
– Connection handoff to the network interface
– Parallelizing network stack processing
• System architecture
– Memory controller design

Designing a 10 Gigabit NIC
• Programmability for performance
– Computation offloading improves performance
• NICs have power and area concerns
– Architectural solutions should be efficient
• Above all, the NIC must support 10 Gbps links
– What are the computation and memory requirements?
– What architecture meets them efficiently?
– What firmware organization should be used?

Aggregate Requirements (10 Gbps, maximum-sized frames)

               Instruction Throughput   Control Data Bandwidth   Frame Data Bandwidth
    TX Frame   229 MIPS                 2.6 Gbps                 19.75 Gbps
    RX Frame   206 MIPS                 2.2 Gbps                 19.75 Gbps
    Total      435 MIPS                 4.8 Gbps                 39.5 Gbps

1514-byte frames at 10 Gbps: 812,744 frames/s
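As a back-of-the-envelope check on these figures (assuming standard Ethernet per-frame overheads of an 8-byte preamble, a 4-byte FCS, and a 12-byte inter-frame gap, and assuming the 19.75 Gbps frame data number reflects each frame crossing NIC memory twice, once on the PCI side and once on the Ethernet side), the arithmetic can be reproduced as follows; this is a sanity check, not part of the original analysis.

    #include <stdio.h>

    int main(void) {
        const double link_bps = 10e9;                 /* 10 Gbps Ethernet          */
        const double frame_B  = 1514;                 /* max frame without FCS     */
        const double wire_B   = frame_B + 4 + 8 + 12; /* + FCS, preamble, IFG      */

        double frames_per_s = link_bps / (wire_B * 8.0);        /* ~812,744        */
        /* Assumed: each stored frame (with FCS) crosses NIC memory twice.         */
        double frame_gbps   = 2.0 * frames_per_s * (frame_B + 4) * 8.0 / 1e9;

        printf("frames/s at 10 Gbps: %.0f\n", frames_per_s);
        printf("frame data bandwidth, one direction: %.2f Gbps\n", frame_gbps);
        return 0;
    }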

Meeting 10 Gbps Requirements
• Processor architecture
– At least 435 MIPS within an embedded device
– Limited instruction-level parallelism
– Abundant task-level parallelism
• Memory architecture
– Control data needs low latency and small capacity
– Frame data needs high bandwidth and large capacity
– Must partition storage

Processor Architecture
• [Chart: performance of in-order vs. out-of-order cores with no, 1-bit, and perfect branch prediction]
• A 2x performance gain is costly
– Branch prediction, reorder buffer, renaming logic, wakeup logic
– These overheads translate to more than 2x core power and area
– Fine for a general-purpose processor; not for an embedded device
• Are there other opportunities for parallelism?
– Many steps to process a frame – run them simultaneously
– Many frames need processing – process them simultaneously
• Solution: use parallel single-issue cores

Control Data Caching
• [Chart: SMPCache trace analysis of a 6-processor NIC architecture]

A Programmable 10 Gbps NIC
• [Block diagram: P CPUs (CPU 0 through CPU P-1), each with its own I-cache fed from a shared instruction memory; a (P+4)-by-S 32-bit crossbar connects the CPUs, the PCI interface (attached to the PCI bus), the Ethernet interface, and an external memory interface to off-chip DRAM with S scratchpad memories (Scratchpad 0 through Scratchpad S-1)]

Network Interface Firmware
• NIC processing steps are well defined
• Must provide high latency tolerance
– DMA to the host
– Transfers to and from the network
• An event mechanism is the obvious choice
– How do you process and distribute events?

Task Assignment with an Event Register
• [Diagram: an event register with a PCI read bit, a SW event bit, and other bits. The PCI interface sets the PCI read bit when it finishes work and processors inspect the completed transactions; processors that need to enqueue TX data set the SW event bit, and processors then pass the data to the Ethernet interface]
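A minimal sketch of how firmware might dispatch on such an event register; the register address, bit positions, and helper routines are hypothetical stand-ins, not the actual NIC interface.

    #include <stdint.h>

    #define EVENT_PCI_READ  (1u << 0)   /* PCI interface completed read DMAs     */
    #define EVENT_SW_TX     (1u << 1)   /* software needs to enqueue TX data     */

    /* Hypothetical memory-mapped event register shared by all cores. */
    static volatile uint32_t *const event_reg = (uint32_t *)0xB0000000u;

    static void process_pci_reads(void)      { /* inspect completed transactions */ }
    static void enqueue_tx_to_ethernet(void) { /* pass data to the Ethernet side */ }

    void firmware_loop(void)
    {
        for (;;) {
            uint32_t events = *event_reg;    /* poll the event register          */

            if (events & EVENT_PCI_READ)
                process_pci_reads();
            if (events & EVENT_SW_TX)
                enqueue_tx_to_ethernet();

            *event_reg = events;             /* assume write-one-to-clear bits   */
        }
    }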

Task-level Parallel Firmware
• [Timeline: work is divided by task. Proc 0 transfers DMAs 0-4 and then the next batch; Proc 1 idles until the PCI read bit reports the hardware status, then processes DMAs 0-4, idling again between batches]

Frame-level Parallel Firmware
• [Timeline: work is divided by frame. Proc 0 transfers and then processes DMAs 0-4 and builds the event, while Proc 1 does the same for DMAs 5-9; idle time is much smaller than in the task-level organization]
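The two organizations can be contrasted in a short sketch; frame_t, the inter-processor queue, and the per-step helpers below are hypothetical placeholders rather than the real firmware, but they show why the frame-level version leaves less idle time.

    typedef struct frame frame_t;
    typedef struct queue queue_t;
    extern queue_t q01;                       /* queue between Proc 0 and Proc 1  */

    frame_t *next_rx_frame(void);
    frame_t *claim_next_frame(void);          /* defined in the RMW sketch below  */
    frame_t *dequeue(queue_t *q);
    void enqueue(queue_t *q, frame_t *f);
    void transfer_dma(frame_t *f);
    void process_dma(frame_t *f);
    void build_event(frame_t *f);

    /* Task-level: each processor owns one step; frames flow through a queue.    */
    void task_level_proc0(void)
    {
        for (;;) {
            frame_t *f = next_rx_frame();
            transfer_dma(f);                  /* step owned by Proc 0             */
            enqueue(&q01, f);                 /* hand the frame to Proc 1         */
        }
    }

    void task_level_proc1(void)
    {
        for (;;) {
            frame_t *f = dequeue(&q01);       /* idles while the queue is empty   */
            process_dma(f);
            build_event(f);
        }
    }

    /* Frame-level: every processor runs all steps on the frames it claims next. */
    void frame_level_proc(void)
    {
        for (;;) {
            frame_t *f = claim_next_frame();
            transfer_dma(f);
            process_dma(f);
            build_event(f);
        }
    }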

Scaling in Two Dimensions
• [Chart: achieved throughput (Gbps) as the NIC design is scaled along two dimensions]

A Programmable 10 Gbps NIC
• This NIC architecture relies on:
– Data memory system: a partitioned organization, not coherent caches
– Processor architecture: parallel scalar processors
– Firmware: a frame-level parallel organization
– RMW instructions: reduce ordering overheads (sketched below)
• A programmable NIC is a substrate for offload services
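The slides do not spell out how the RMW instructions are used; one common pattern, shown here as an assumption with C11 atomics standing in for the NIC's RMW support, is to claim work from a shared ring with a single fetch-and-add instead of a lock, removing lock-ordering overhead between cores.

    #include <stdatomic.h>
    #include <stdint.h>

    #define RING_SLOTS 256

    typedef struct frame frame_t;
    extern frame_t *rx_ring[RING_SLOTS];      /* frames awaiting processing       */

    static atomic_uint next_rx_index;         /* shared by all firmware cores     */

    frame_t *claim_next_frame(void)
    {
        /* Each core atomically claims a unique slot: one RMW, no lock, no retry. */
        unsigned idx = atomic_fetch_add(&next_rx_index, 1u);
        return rx_ring[idx % RING_SLOTS];
    }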

NIC Offload Services
• Network interface data caching
• Connection handoff
• Virtual network interfaces
• …

Network Interface Data Caching
• Cache data in the network interface
• Reduces interconnect traffic
• Software-controlled cache (sketched below)
• Minimal changes to the operating system
• Prototype web server
– Up to 57% reduction in PCI traffic
– Up to 31% increase in server performance
– Peak of 1571 Mbps of content throughput – breaks the PCI bottleneck
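A hedged sketch of what the software-controlled cache could look like on the send path, assuming the operating system keeps a host-side table of which file blocks are resident in NIC memory; the structures and helper names are invented for illustration and are not the prototype's driver interface.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint64_t file_id; uint64_t block_no; } block_key_t;

    bool nic_cache_lookup(block_key_t key, uint32_t *handle);   /* host-side table */
    void nic_cache_insert(block_key_t key, const void *data, size_t len);
    void nic_send_from_cache(uint32_t handle, size_t len,
                             const void *hdrs, size_t hdr_len);
    void nic_send_with_data(const void *hdrs, size_t hdr_len,
                            const void *data, size_t len);

    void send_block(block_key_t key, const void *data, size_t len,
                    const void *hdrs, size_t hdr_len)
    {
        uint32_t handle;
        if (nic_cache_lookup(key, &handle)) {
            /* Hit: only packet headers cross the PCI bus; the payload is read
             * from NIC memory, cutting interconnect traffic.                   */
            nic_send_from_cache(handle, len, hdrs, hdr_len);
        } else {
            /* Miss: DMA the data as usual and install it in the NIC cache.     */
            nic_send_with_data(hdrs, hdr_len, data, len);
            nic_cache_insert(key, data, len);
        }
    }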

Results: PCI Traffic
• [Chart annotations: ~1260 Mb/s is the limit; PCI saturated; ~60% content traffic; 60% utilization; 1198 Mb/s of HTTP content; 30% overhead]

Content Locality
• Block cache with a 4 KB block size
• 8-16 MB caches capture the locality

Results: PCI Traffic Reduction
• [Chart annotations: low temporal reuse and low PCI utilization for some traces; good temporal reuse but a CPU bottleneck for others; PCI traffic reduction shown for four traces]
• Up to 31% performance improvement

Connection Handoff to the NIC
• No magic processor on the NIC
– The OS must control the division of work between itself and the NIC
• Move established connections between the OS and the NIC
– The connection is the unit of control
– The OS decides when and what to hand off
• Benefits
– Sockets are intact – no need to change applications
– Zero-copy
– No port allocation or routing on the NIC
– Can adapt to route changes
• [Diagram: the OS retains sockets, TCP, IP, Ethernet, and the driver; the NIC runs TCP, IP, Ethernet, and a lookup layer for handed-off connections. Handoff interface: 1. Handoff 2. Send 3. Receive 4. Ack 5. …]
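The handoff interface listed in the diagram suggests a small message vocabulary between the OS and the NIC; the encoding below is an assumption made for illustration, not the prototype's actual interface.

    #include <stdint.h>

    enum handoff_msg_type {
        MSG_HANDOFF,   /* OS -> NIC: take over this established connection       */
        MSG_SEND,      /* OS -> NIC: transmit application data                   */
        MSG_RECEIVE,   /* NIC -> OS: received data is ready for the socket       */
        MSG_ACK,       /* NIC -> OS: previously sent data has been acknowledged  */
        MSG_DEALLOC    /* either direction: release connection state             */
    };

    struct handoff_msg {
        enum handoff_msg_type type;
        uint32_t conn_id;          /* connection chosen by the OS for handoff    */
        uint64_t buf_addr;         /* host address of the data buffer, if any    */
        uint32_t len;
        uint32_t seq;              /* sequence number the message refers to      */
    };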

Connection Handoff
• Traditional offload
– The NIC replicates the entire network stack
– The NIC can limit the number of connections due to resource limitations
• Connection handoff
– The OS decides which subset of connections the NIC should handle
– NIC resource limitations limit the amount of offload, not the number of connections

Establishment and Handoff
• The OS establishes connections
• The OS decides whether or not to hand off each connection
• [Diagram: 1. Establish a connection in the OS; 2. Hand it off to the NIC]

Data Transfer
• Offloaded connections require minimal OS support for data transfers
– The socket layer as the interface to applications
– The driver layer for interrupts and buffer management
• [Diagram: 3. Send, Receive, Ack, … messages carry data between the OS connection and the NIC connection]

Connection Teardown
• Teardown requires both the NIC and the OS to deallocate their connection data structures
• [Diagram: 4. De-alloc; 5. De-alloc – both sides release the connection state]
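Pulling the establishment, handoff, data-transfer, and teardown steps together, here is a hedged sketch of the lifecycle as the host driver might drive it, reusing the assumed handoff_msg encoding from the earlier sketch; drv_send_msg() is hypothetical glue (a doorbell write or descriptor enqueue), not the prototype's code.

    /* Reuses enum handoff_msg_type and struct handoff_msg from the earlier
     * sketch; everything here is illustrative.                               */
    #include <stdint.h>

    void drv_send_msg(const struct handoff_msg *m);   /* hypothetical driver hook */

    void connection_lifecycle(uint32_t conn_id, uint64_t buf, uint32_t len)
    {
        /* 1-2. The OS established the connection in its own stack and now
         *      decides to hand it off to the NIC.                               */
        struct handoff_msg m = { .type = MSG_HANDOFF, .conn_id = conn_id };
        drv_send_msg(&m);

        /* 3. Data transfer: sends go down as messages; receives and acks come
         *    back up through the driver to the unchanged socket layer.          */
        m = (struct handoff_msg){ .type = MSG_SEND, .conn_id = conn_id,
                                  .buf_addr = buf, .len = len };
        drv_send_msg(&m);

        /* 4-5. Teardown: both the NIC and the OS deallocate their state.        */
        m = (struct handoff_msg){ .type = MSG_DEALLOC, .conn_id = conn_id };
        drv_send_msg(&m);
    }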

Connection Handoff Status
• Working prototype built on FreeBSD
• Initial results for web workloads
– Reductions in cycles and cache misses on the host
– Transparently handles multiple NICs
– Fewer messages on PCI: from 1.4 per packet to 0.6 per packet, with socket-level instead of packet-level communication
– ~17% throughput increase (simulations)
• To do
– A framework for offload policies
– Test zero-copy and more workloads
– Port to Linux

Virtual Network Interfaces
• Traditionally used for user-level network access
– Each process has its own "virtual NIC"
– Provides protection among processes
• Can we use this concept to improve network stack performance within the OS?
– Possibly, but first we need to understand the behavior of the OS on networking workloads

Networking Workloads
• Performance is influenced by
– The operating system's network stack
– The increasing number of connections
– Microprocessor architecture trends

Networking Performance
• Bound by TCP/IP processing
• 2.4 GHz Intel Xeon: 2.5 Gbps for one nttcp stream (Hurwitz and Feng, IEEE Micro 2004)

Throughput vs. Connections
• Faster links → more connections
• More connections → worse performance
• [Chart: throughput versus number of connections for the CS, IBM, NASA, and WC traces]

The End of the Uniprocessor?
• Uniprocessors have become too complicated
– Clock speed increases have slowed down
– Increasingly complicated architectures are needed for performance
• Multi-core processors are becoming the norm
– IBM POWER4 – 2 cores (2001)
– Intel Pentium 4 – 2 hyperthreads (2002)
– Sun UltraSPARC IV – 2 cores (2004)
– AMD Opteron – 2 cores (2005)
– Sun Niagara – 8 cores, 4 threads each (est. 2006)
• How do we use these cores for networking?

Parallelism with Data-Synchronized Stacks
• Examples: Linux, FreeBSD 5+

Parallelism with Control-Synchronized Stacks
• Examples: DragonflyBSD, Solaris 10

Parallelization Challenges
• Data-synchronous
– Lots of thread parallelism
– Significant locking overheads
• Control-synchronous
– Reduces locking
– Load-balancing issues
• Which approach is better?
– Throughput? Scalability?
– We are optimizing both schemes in FreeBSD 5 to find out
• Network interface
– A serialization point
– Can virtualization help?
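The trade-off between the two schemes can be sketched in code; conn_t, the packet classification, and the queueing helpers are placeholders rather than any particular kernel's API, but the two input paths show where the per-packet locking overhead and the load-balancing problem come from.

    #include <pthread.h>

    typedef struct conn conn_t;
    conn_t *lookup_conn(const void *pkt);           /* classify packet -> connection */
    void    tcp_input(conn_t *c, const void *pkt);  /* protocol processing           */

    /* Data-synchronized (Linux / FreeBSD 5+ style): any thread may process any
     * packet, so shared connection state is protected by locks.                 */
    pthread_mutex_t *conn_lock(conn_t *c);

    void data_sync_input(const void *pkt)
    {
        conn_t *c = lookup_conn(pkt);
        pthread_mutex_lock(conn_lock(c));           /* locking cost on every packet  */
        tcp_input(c, pkt);
        pthread_mutex_unlock(conn_lock(c));
    }

    /* Control-synchronized (DragonflyBSD / Solaris 10 style): each connection is
     * bound to one protocol thread, so its state needs no lock; the cost is the
     * dispatch step and possible load imbalance across threads.                 */
    int  conn_thread(conn_t *c);                    /* connection -> thread binding  */
    void enqueue_to_thread(int tid, const void *pkt);

    void control_sync_input(const void *pkt)
    {
        conn_t *c = lookup_conn(pkt);
        enqueue_to_thread(conn_thread(c), pkt);     /* message passing, no locks     */
    }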

Memory Controller Architecture
• Improve DRAM efficiency
– Memory access scheduling
– Virtual channels
• Improve copy performance
– 45-61% of kernel execution time can be copies
– The best copy algorithm depends on copy size, cache residency, and cache state (see the sketch below)
– Probe copy
– Hardware copy acceleration
• Improve I/O performance…
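The observation that no single copy routine wins everywhere can be illustrated with a simple dispatch sketch; the 4 KB threshold and the two strategies below are illustrative assumptions and are not the probe-copy or hardware copy-acceleration mechanisms referred to on the slide.

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical streaming copy, e.g., built on non-temporal stores. */
    void copy_nontemporal(void *dst, const void *src, size_t n);

    void kernel_copy(void *dst, const void *src, size_t n, bool src_likely_cached)
    {
        if (n <= 4096 && src_likely_cached) {
            /* Small, cache-resident copies: an ordinary cached copy wins. */
            memcpy(dst, src, n);
        } else {
            /* Large or cold copies: avoid displacing useful cache contents. */
            copy_nontemporal(dst, src, n);
        }
    }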

Summary
• Our focus is on system-level architectures for networking
• Network interfaces must evolve
– No longer just a PCI-to-Ethernet bridge
– They need to provide capabilities that help the operating system
• Operating systems must evolve
– Future systems will have 10s to 100s of processors
– Networking must be parallelized – many bottlenecks remain
• The synergy between the NIC and the OS cannot be ignored
• Memory performance is also an increasingly critical factor