Making the “Box” Transparent: System Call Performance as a First-class Result Yaoping Ruan, Vivek Pai Princeton University.

Slides:



Advertisements
Similar presentations
Lecture 12: MapReduce: Simplified Data Processing on Large Clusters Xiaowei Yang (Duke University)
Advertisements

Diagnosing Performance Overheads in the Xen Virtual Machine Environment Aravind Menon Willy Zwaenepoel EPFL, Lausanne Jose Renato Santos Yoshio Turner.
Starfish: A Self-tuning System for Big Data Analytics.
Lecture 101 Lecture 10: Kernel Modules and Device Drivers ECE 412: Microcomputer Laboratory.
Assessment of Data Path Implementations for Download and Streaming Pål Halvorsen 1,2, Tom Anders Dalseng 1 and Carsten Griwodz 1,2 1 Department of Informatics,
MINJAE HWANG THAWAN KOOBURAT CS758 CLASS PROJECT FALL 2009 Extending Task-based Programming Model beyond Shared-memory Systems.
Jaewoong Sim Alaa R. Alameldeen Zeshan Chishti Chris Wilkerson Hyesoon Kim MICRO-47 | December 2014.
R2: An application-level kernel for record and replay Z. Guo, X. Wang, J. Tang, X. Liu, Z. Xu, M. Wu, M. F. Kaashoek, Z. Zhang, (MSR Asia, Tsinghua, MIT),
Efficient Virtual Memory for Big Memory Servers U Wisc and HP Labs ISCA’13 Architecture Reading Club Summer'131.
IO-Lite: A Unified Buffering and Caching System By Pai, Druschel, and Zwaenepoel (1999) Presented by Justin Kliger for CS780: Advanced Techniques in Caching.
Virtual Memory and I/O Mingsheng Hong. I/O Systems Major I/O Hardware Hard disks, network adaptors … Problems related with I/O Systems Various types of.
Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
Flash: An efficient and portable Web server Authors: Vivek S. Pai, Peter Druschel, Willy Zwaenepoel Presented at the Usenix Technical Conference, June.
Instant Profiling: Instrumentation Sampling for Profiling Datacenter Applications Hyoun Kyu Cho 1, Tipp Moseley 2, Richard Hank 2, Derek Bruening 2, Scott.
1 Web Server Performance in a WAN Environment Vincent W. Freeh Computer Science North Carolina State Vsevolod V. Panteleenko Computer Science & Engineering.
G Robert Grimm New York University Disco.
Capriccio: Scalable Threads for Internet Services Rob von Behren, Jeremy Condit, Feng Zhou, Geroge Necula and Eric Brewer University of California at Berkeley.
Operating Systems Final Exam Review. Topics F Virtual Memory F File Systems F I/O Devices F Project 3: Macro Shell.
Federated DAFS: Scalable Cluster-based Direct Access File Servers Murali Rangarajan, Suresh Gopalakrishnan Ashok Arumugam, Rabita Sarker Rutgers University.
RDMA ENABLED WEB SERVER Rajat Sharma. Objective  To implement a Web Server serving HTTP client requests through RDMA replacing the traditional TCP/IP.
Adaptive Content Delivery for Scalable Web Servers Authors: Rahul Pradhan and Mark Claypool Presented by: David Finkel Computer Science Department Worcester.
Xen and the Art of Virtualization. Introduction  Challenges to build virtual machines Performance isolation  Scheduling priority  Memory demand  Network.
Computer Science 1 Resource Overbooking and Application Profiling in Shared Hosting Platforms Bhuvan Urgaonkar Prashant Shenoy Timothy Roscoe † UMASS Amherst.
Performance Tradeoffs for Static Allocation of Zero-Copy Buffers Pål Halvorsen, Espen Jorde, Karl-André Skevik, Vera Goebel, and Thomas Plagemann Institute.
Microkernels, virtualization, exokernels Tutorial 1 – CSC469.
Networked File System CS Introduction to Operating Systems.
SEDA: An Architecture for Well-Conditioned, Scalable Internet Services
Livio Soares, Michael Stumm University of Toronto
1 Design and Performance of a Web Server Accelerator Eric Levy-Abegnoli, Arun Iyengar, Junehwa Song, and Daniel Dias INFOCOM ‘99.
Disco : Running commodity operating system on scalable multiprocessor Edouard et al. Presented by Jonathan Walpole (based on a slide set from Vidhya Sivasankaran)
Database Replication Policies for Dynamic Content Applications Gokul Soundararajan, Cristiana Amza, Ashvin Goel University of Toronto EuroSys 2006: Leuven,
LiNK: An Operating System Architecture for Network Processors Steve Muir, Jonathan Smith Princeton University, University of Pennsylvania
Profiling Grid Data Transfer Protocols and Servers George Kola, Tevfik Kosar and Miron Livny University of Wisconsin-Madison USA.
Dynamic Reconfiguration Dynamic selection of handler functionality: currently through use of parameterizable handlers or by selecting from a set of existing.
MapReduce How to painlessly process terabytes of data.
Our work on virtualization Chen Haogang, Wang Xiaolin {hchen, Institute of Network and Information Systems School of Electrical Engineering.
1 Specification and Implementation of Dynamic Web Site Benchmarks Sameh Elnikety Department of Computer Science Rice University.
A Measurement Based Memory Performance Evaluation of High Throughput Servers Garba Isa Yau Department of Computer Engineering King Fahd University of Petroleum.
Background: I/O Concurrency Brad Karp UCL Computer Science CS GZ03 / M030 2 nd October, 2008.
Srihari Makineni & Ravi Iyer Communications Technology Lab
Background: Operating Systems Brad Karp UCL Computer Science CS GZ03 / M th November, 2008.
Difference of Degradation Schemes among Operating Systems -Experimental analysis for web application servers- Hideaki Hibino*(Tokyo Tech) Kenichi Kourai.
Increasing Web Server Throughput with Network Interface Data Caching October 9, 2002 Hyong-youb Kim, Vijay S. Pai, and Scott Rixner Rice Computer Architecture.
Instrumentation of Xen VMs for efficient VM scheduling and capacity planning in hybrid clouds. Kurt Vermeersch Coordinator: Sam Verboven.
Disco : Running commodity operating system on scalable multiprocessor Edouard et al. Presented by Vidhya Sivasankaran.
Providing Differentiated Levels of Service in Web Content Hosting Jussara Almeida, etc... First Workshop on Internet Server Performance, 1998 Computer.
1 Virtual Machine Memory Access Tracing With Hypervisor Exclusive Cache USENIX ‘07 Pin Lu & Kai Shen Department of Computer Science University of Rochester.
1 Admission Control and Request Scheduling in E-Commerce Web Sites Sameh Elnikety, EPFL Erich Nahum, IBM Watson John Tracey, IBM Watson Willy Zwaenepoel,
Files & File system. A Possible File System Layout Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved
ND The research group on Networks & Distributed systems.
Kernel mode linux. Why? ● The pros – There is a C like interface inside the Linux kernel already. – The speed. – The size of AP. ● The cons – Missing.
Solutions for the First Quiz COSC 6360 Spring 2014.
CISC Machine Learning for Solving Systems Problems Presented by: Suman Chander B Dept of Computer & Information Sciences University of Delaware Automatic.
GLOBAL EDGE SOFTWERE LTD1 R EMOTE F ILE S HARING - Ardhanareesh Aradhyamath.
Sep. 17, 2002BESIII Review Meeting BESIII DAQ System BESIII Review Meeting IHEP · Beijing · China Sep , 2002.
Processes and Virtual Memory
LAIO: Lazy Asynchronous I/O For Event Driven Servers Khaled Elmeleegy Alan L. Cox.
An Efficient Threading Model to Boost Server Performance Anupam Chanda.
Improving the Reliability of Commodity Operating Systems Michael M. Swift, Brian N. Bershad, Henry M. Levy Presented by Ya-Yun Lo EECS 582 – W161.
CSE598c - Virtual Machines - Spring Diagnosing Performance Overheads in the Xen Virtual Machine EnvironmentPage 1 CSE 598c Virtual Machines “Diagnosing.
Introduction Contain two or more CPU share common memory and peripherals. Provide greater system throughput. Multiple processor executing simultaneous.
Exploiting Task-level Concurrency in a Programmable Network Interface June 11, 2003 Hyong-youb Kim, Vijay S. Pai, and Scott Rixner Rice Computer Architecture.
LIOProf: Exposing Lustre File System Behavior for I/O Middleware
Taeho Kgil, Trevor Mudge Advanced Computer Architecture Laboratory The University of Michigan Ann Arbor, USA CASES’06.
Ran Liu (Fudan Univ. Shanghai Jiaotong Univ.)
Effective Data-Race Detection for the Kernel
CS 258 Reading Assignment 4 Discussion Exploiting Two-Case Delivery for Fast Protected Messages Bill Kramer February 13, 2002 #
Data Path through host/ANP.
Xen Network I/O Performance Analysis and Opportunities for Improvement
Admission Control and Request Scheduling in E-Commerce Web Sites
Presentation transcript:

Making the “Box” Transparent: System Call Performance as a First-class Result Yaoping Ruan, Vivek Pai Princeton University

2 Outline Motivation Design & implementation Case study More results

3 Beginning of the Story Flash Web Server on the SPECWeb99 benchmark yield very low score – : Standard Apache 400: Customized Apache 575: Best - Tux on comparable hardware However, Flash Web Server is known to be fast Motivation

4 Performance Debugging of OS-Intensive Applications Web Servers (+ full SpecWeb99) High throughput Large working sets Dynamic content QoS requirements Workload scaling How do we debug such a system? CPU/network disk activity multiple programs latency-sensitive overhead-sensitive Motivation

5  Guesswork, inaccuracy Online  Measurement calls  getrusage()  gettimeofday( )  High overhead > 40% Detailed  Call graph profiling  gprof  Completeness Fast  Statistical sampling  DCPI, Vtune, Oprofile  Guesswork, inaccuracy Online  Measurement calls  getrusage()  gettimeofday( )  High overhead > 40% Detailed  Call graph profiling  gprof  Completeness Fast  Statistical sampling  DCPI, Vtune, Oprofile Current Tradeoffs Motivation

6 In-kernel measurement Example of Current Tradeoffs gettimeofday( ) Wall-clock time of an non-blocking system call Motivation

7 Design Goals Correlate kernel information with application-level information Low overhead on “useless” data High detail on useful data Allow application to control profiling and programmatically react to information Design

8 DeBox: Splitting Measurement Policy & Mechanism Add profiling primitives into kernel Return feedback with each syscall First-class result, like errno Allow programs to interactively profile Profile by call site and invocation Pick which processes to profile Selectively store/discard/utilize data Design

9 DeBox Architecture … res = read (fd, buffer, count) … errnobuffer read ( ) { Start profiling; … End profiling Return; } App Kernel DeBox Info or online offline Design

10 DeBox Data Structure DeBoxControl ( DeBoxInfo *resultBuf, int maxSleeps, int maxTrace ) Trace Info Sleep Info Basic DeBox Info 0 1 … 0 1 … 2 Performance summary Blocking information of the call Full call trace & timing Implementation

11 Sample Output: Copying a 10MB Mapped File Basic DeBox Info PerSleepInfo … CallTrace # of CallTrace used0 # of PerSleepInfo used2 # of page faults989 Call time (usec) System call # ( write( ) )4 Basic DeBox Info # processes on exit0 # processes on entry1 Line where blocked2727 File where blockedkern/vfs_bio.c Resource labelbiowr Time blocked (usec) # occurrences1270 PerSleepInfo[0] fo_write( ) dofilewrite( ) fhold( )5 holdfp( )25 write( ) CallTrace …… … (depth) Implementation

12 In-kernel Implementation 600 lines of code in FreeBSD Monitor trap handler System call entry, exit, page fault, etc. Instrument scheduler Time, location, and reason for blocking Identify dynamic resource contention Full call path tracing + timings More detail than gprof’s call arc counts Implementation

13 General DeBox Overheads Implementation

14 Case Study: Using DeBox on the Flash Web server Experimental setup Mostly 933MHz PIII 1GB memory Netgear GA621 gigabit NIC Ten PII 300MHz clients SPECWeb99 Results orig VM Patch sendfile Dataset size (GB): Case Study

15 Example 1: Finding Blocking System Calls open()/190 readlink()/84 read()/12 stat()/6 unlink()/1 … Blocking system calls, resource blocked and their counts inode - 28biord - 162inode - 84 biord - 3inode - 9inode - 6biord - 1 Case Study

16 Example 1 Results & Solution Mostly metadata locking Direct all metadata calls to name convert helpers Pass open FDs using sendmsg( ) SPECWeb99 Results orig VM Patch sendfile FD Passing Dataset size (GB): Case Study

17 Example 2: Capturing Rare Anomaly Paths FunctionX( ) { … open( ); … } FunctionY( ) { open( ); … open( ); } When blocking happens: abort( ); (or fork + abort) Record call path 1.Which call caused the problem 2.Why this path is executed (App + kernel call trace) Case Study

18 Example 2 Results & Solution Cold cache miss path Allow read helpers to open cache missed files More benefit in latency SPECWeb99 Results orig VM Patch sendfile FD Passing Dataset size (GB): Case Study

19 Example 3: Tracking Call History & Call Trace Call time of fork( ) as a function of invocation Call trace indicates: File descriptors copy - fd_copy( ) VM map entries copy - vm_copy( ) Case Study

20 Example 3 Results & Solution Similar phenomenon on mmap( ) Fork helper eliminate mmap( ) SPECWeb99 Results orig VM Patch sendfile FD Passing Fork helper No mmap() Dataset size (GB): Case Study

21 Sendfile Modifications (Now Adopted by FreeBSD) Cache pmap/sfbuf entries Return special error for cache misses Pack header + data into one packet Case Study

22 SPECWeb99 Scores orig sendfile FD Passing Fork helper No mmap() VM Patch New CGI Interface New sendfile( ) Dataset size (GB) Case Study

23 New Flash Summary Application-level changes FD passing helpers Move fork into helper process Eliminate mmap cache New CGI interface Kernel sendfile changes Reduce pmap/TLB operation New flag to return if data missing Send fewer network packets for small files Case Study

24 Throughput on SPECWeb Static Workload MB1.5 GB3.3 GB Dataset Size Throughput (Mb/s) ApacheFlashNew-Flash Other Results

25 Latencies on 3.3GB Static Workload 44X 47X Mean latency improvement: 12X Other Results

26 Throughput Portability (on Linux) (Server: 3.0GHz P4, 1GB memory) MB1.5 GB3.3 GB Dataset Size Throughput (Mb/s) HaboobApacheFlashNew-Flash Other Results

27 Latency Portability (3.3GB Static Workload) 3.7X 112X Mean latency improvement: 3.6X Other Results

28 Summary DeBox is effective on OS-intensive application and complex workloads Low overhead on real workloads Fine detail on real bottleneck Flexibility for application programmers Case study SPECWeb99 score quadrupled Even with dataset 3x of physical memory Up to 36% throughput gain on static workload Up to 112x latency improvement Results are portable

Thank you

30 SpecWeb99 Scores Standard Flash200 Standard Apache220 Apache + special module420 Highest 1GB/1GHz score575 Improved Flash820 Flash + dynamic request module 1050

31 SpecWeb99 on Linux Standard Flash600 Improved Flash1000 Flash + dynamic request module GHz P4 with 1GB memory

32 New Flash Architecture