Making the “Box” Transparent: System Call Performance as a First-class Result Yaoping Ruan, Vivek Pai Princeton University
2 Outline Motivation Design & implementation Case study More results
3 Beginning of the Story Flash Web Server on the SPECWeb99 benchmark yield very low score – : Standard Apache 400: Customized Apache 575: Best - Tux on comparable hardware However, Flash Web Server is known to be fast Motivation
4 Performance Debugging of OS-Intensive Applications Web Servers (+ full SpecWeb99) High throughput Large working sets Dynamic content QoS requirements Workload scaling How do we debug such a system? CPU/network disk activity multiple programs latency-sensitive overhead-sensitive Motivation
5 Guesswork, inaccuracy Online Measurement calls getrusage() gettimeofday( ) High overhead > 40% Detailed Call graph profiling gprof Completeness Fast Statistical sampling DCPI, Vtune, Oprofile Guesswork, inaccuracy Online Measurement calls getrusage() gettimeofday( ) High overhead > 40% Detailed Call graph profiling gprof Completeness Fast Statistical sampling DCPI, Vtune, Oprofile Current Tradeoffs Motivation
6 In-kernel measurement Example of Current Tradeoffs gettimeofday( ) Wall-clock time of an non-blocking system call Motivation
7 Design Goals Correlate kernel information with application-level information Low overhead on “useless” data High detail on useful data Allow application to control profiling and programmatically react to information Design
8 DeBox: Splitting Measurement Policy & Mechanism Add profiling primitives into kernel Return feedback with each syscall First-class result, like errno Allow programs to interactively profile Profile by call site and invocation Pick which processes to profile Selectively store/discard/utilize data Design
9 DeBox Architecture … res = read (fd, buffer, count) … errnobuffer read ( ) { Start profiling; … End profiling Return; } App Kernel DeBox Info or online offline Design
10 DeBox Data Structure DeBoxControl ( DeBoxInfo *resultBuf, int maxSleeps, int maxTrace ) Trace Info Sleep Info Basic DeBox Info 0 1 … 0 1 … 2 Performance summary Blocking information of the call Full call trace & timing Implementation
11 Sample Output: Copying a 10MB Mapped File Basic DeBox Info PerSleepInfo … CallTrace # of CallTrace used0 # of PerSleepInfo used2 # of page faults989 Call time (usec) System call # ( write( ) )4 Basic DeBox Info # processes on exit0 # processes on entry1 Line where blocked2727 File where blockedkern/vfs_bio.c Resource labelbiowr Time blocked (usec) # occurrences1270 PerSleepInfo[0] fo_write( ) dofilewrite( ) fhold( )5 holdfp( )25 write( ) CallTrace …… … (depth) Implementation
12 In-kernel Implementation 600 lines of code in FreeBSD Monitor trap handler System call entry, exit, page fault, etc. Instrument scheduler Time, location, and reason for blocking Identify dynamic resource contention Full call path tracing + timings More detail than gprof’s call arc counts Implementation
13 General DeBox Overheads Implementation
14 Case Study: Using DeBox on the Flash Web server Experimental setup Mostly 933MHz PIII 1GB memory Netgear GA621 gigabit NIC Ten PII 300MHz clients SPECWeb99 Results orig VM Patch sendfile Dataset size (GB): Case Study
15 Example 1: Finding Blocking System Calls open()/190 readlink()/84 read()/12 stat()/6 unlink()/1 … Blocking system calls, resource blocked and their counts inode - 28biord - 162inode - 84 biord - 3inode - 9inode - 6biord - 1 Case Study
16 Example 1 Results & Solution Mostly metadata locking Direct all metadata calls to name convert helpers Pass open FDs using sendmsg( ) SPECWeb99 Results orig VM Patch sendfile FD Passing Dataset size (GB): Case Study
17 Example 2: Capturing Rare Anomaly Paths FunctionX( ) { … open( ); … } FunctionY( ) { open( ); … open( ); } When blocking happens: abort( ); (or fork + abort) Record call path 1.Which call caused the problem 2.Why this path is executed (App + kernel call trace) Case Study
18 Example 2 Results & Solution Cold cache miss path Allow read helpers to open cache missed files More benefit in latency SPECWeb99 Results orig VM Patch sendfile FD Passing Dataset size (GB): Case Study
19 Example 3: Tracking Call History & Call Trace Call time of fork( ) as a function of invocation Call trace indicates: File descriptors copy - fd_copy( ) VM map entries copy - vm_copy( ) Case Study
20 Example 3 Results & Solution Similar phenomenon on mmap( ) Fork helper eliminate mmap( ) SPECWeb99 Results orig VM Patch sendfile FD Passing Fork helper No mmap() Dataset size (GB): Case Study
21 Sendfile Modifications (Now Adopted by FreeBSD) Cache pmap/sfbuf entries Return special error for cache misses Pack header + data into one packet Case Study
22 SPECWeb99 Scores orig sendfile FD Passing Fork helper No mmap() VM Patch New CGI Interface New sendfile( ) Dataset size (GB) Case Study
23 New Flash Summary Application-level changes FD passing helpers Move fork into helper process Eliminate mmap cache New CGI interface Kernel sendfile changes Reduce pmap/TLB operation New flag to return if data missing Send fewer network packets for small files Case Study
24 Throughput on SPECWeb Static Workload MB1.5 GB3.3 GB Dataset Size Throughput (Mb/s) ApacheFlashNew-Flash Other Results
25 Latencies on 3.3GB Static Workload 44X 47X Mean latency improvement: 12X Other Results
26 Throughput Portability (on Linux) (Server: 3.0GHz P4, 1GB memory) MB1.5 GB3.3 GB Dataset Size Throughput (Mb/s) HaboobApacheFlashNew-Flash Other Results
27 Latency Portability (3.3GB Static Workload) 3.7X 112X Mean latency improvement: 3.6X Other Results
28 Summary DeBox is effective on OS-intensive application and complex workloads Low overhead on real workloads Fine detail on real bottleneck Flexibility for application programmers Case study SPECWeb99 score quadrupled Even with dataset 3x of physical memory Up to 36% throughput gain on static workload Up to 112x latency improvement Results are portable
Thank you
30 SpecWeb99 Scores Standard Flash200 Standard Apache220 Apache + special module420 Highest 1GB/1GHz score575 Improved Flash820 Flash + dynamic request module 1050
31 SpecWeb99 on Linux Standard Flash600 Improved Flash1000 Flash + dynamic request module GHz P4 with 1GB memory
32 New Flash Architecture