Capriccio: Scalable Threads for Internet Services Rob von Behren, Jeremy Condit, Feng Zhou, George Necula, and Eric Brewer Presented by Guoyang Chen
Overview Motivation Threads vs. Events User-Level Threads Capriccio Implementation Linked Stack Management Resource-Aware Scheduling Evaluation
Motivation High demand for web content Web services are getting more complex, requiring maximum server performance What is a well-conditioned service?
Well-conditioned Service A well-conditioned service behaves like a simple pipeline As offered load increases, throughput increases proportionally When saturated, throughput does not degrade substantially [Figure: throughput vs. load (concurrent tasks); ideal peak when some resource is at max; performance overload when some resource thrashes]
Threads vs. Events Thread-Based Concurrency There are two common ways to build a concurrent server: thread-based programming and event-based programming. Starting with the thread-based model: each incoming request is dispatched to a separate thread, which processes the request and returns a result to the client. Other I/O operations, such as disk access, are not shown here, but would be incorporated into each thread's request processing. However, too many threads lead to high resource usage, context-switch overhead, and contended locks Traditional solution: bound the total number of threads But how do you determine the ideal number of threads?
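The thread-per-request dispatch described above can be sketched in C with POSIX threads. This is a minimal, hypothetical simulation: `handle_request` just squares a request id, whereas a real server would parse the request and perform network and disk I/O.

```c
#include <pthread.h>

/* Hypothetical stand-in for request processing: each "request"
 * just squares its id. A real server would read from a socket,
 * do disk I/O, and write a response here. */
static void *handle_request(void *arg) {
    long id = (long)arg;
    return (void *)(id * id);
}

/* Dispatch n "requests", one thread per request, then gather
 * the results -- the basic thread-per-request model. */
int run_requests(long n, long *results) {
    pthread_t tid[64];
    if (n > 64)
        return -1;  /* keep the sketch bounded */
    for (long i = 0; i < n; i++)
        if (pthread_create(&tid[i], NULL, handle_request, (void *)i))
            return -1;
    for (long i = 0; i < n; i++) {
        void *res;
        pthread_join(tid[i], &res);
        results[i] = (long)res;
    }
    return 0;
}
```

With kernel threads, each of these threads costs stack memory and kernel scheduling state, which is exactly why bounding the thread count becomes necessary at scale.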
Threads vs. Events Event-Based Concurrency Small number of event-processing threads with many FSMs Yields efficient and scalable concurrency Many examples: Click router, Flash web server, TP Monitors, etc.
SEDA Staged Event-Driven Architecture (SEDA) Decompose service into stages separated by queues Each stage performs a subset of request processing Stages internally event-driven Each stage contains a thread pool to drive stage execution However, threads are not exposed to applications Dynamic control grows/shrinks thread pools with demand
Drawbacks of Events Event systems hide the control flow Difficult to understand and debug Eventually evolve into call-and-return event pairs Programmers need to match related events Need to save/restore state Events require manual state management Capriccio: instead of the event-based model, fix the thread-based model
Threads vs. Events Why Threads? More natural programming model Control flow is more apparent Exception handling is easier State management is automatic Better fit with current tools & hardware Better existing infrastructure
Capriccio Goals Simplify the programming model: thread per concurrent activity Scalability (100K+ threads) Support existing APIs and tools Automate application-specific customization Mechanisms User-level threads Plumbing: avoid O(n) operations Compile-time analysis Run-time analysis
Thread Design Principles Decouple programming model and OS Kernel threads: abstract hardware, expose device concurrency User-level threads: provide a clean programming model, expose logical concurrency [Figure: App / User-Level Threads / OS layering]
Thread Design and Scalability User-Level Threads Flexibility Capriccio can use new asynchronous I/O mechanisms without changing application code The user-level thread scheduler can be built along with the application Extremely lightweight Performance Reduced thread synchronization overhead No kernel crossings for mutex acquisition or release More efficient memory management at user level
Drawbacks of User-Level Threads An increased number of kernel crossings Each blocking I/O call is replaced by a non-blocking mechanism (epoll) A wrapper layer translates blocking calls to non-blocking ones Difficult to use multiple processors Synchronization is no longer "for free"
Capriccio Implementation A user-level threading library. All thread operations are O(1) Linked stacks Address the problem of stack allocation for large numbers of threads Combination of compile-time and run-time analysis Resource-aware scheduler
Context Switches Built on top of Edgar Toernig’s coroutine library Fast context switches when threads yield
I/O Capriccio intercepts blocking I/O calls Uses epoll for non-blocking I/O
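The interception idea can be sketched as a wrapper around `read()` (Linux-only, since it uses epoll). This is a simplified illustration, not Capriccio's actual code: it sets the fd non-blocking, and where a real scheduler would switch to another user thread on would-block, this sketch simply parks on `epoll_wait`.

```c
#include <sys/epoll.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>

/* Sketch of blocking-call interception: the fd is made
 * non-blocking; when a read would block, we wait on epoll.
 * In Capriccio, this wait point is where the user-level
 * scheduler runs another thread instead of blocking. */
ssize_t wrapped_read(int fd, void *buf, size_t len) {
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
    for (;;) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0 || errno != EAGAIN)
            return n;  /* data, EOF, or a real error */
        /* Would block: wait until the fd becomes readable.
         * (A real scheduler yields to another thread here.) */
        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
        epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
        epoll_wait(ep, &ev, 1, -1);
        close(ep);
    }
}
```

A production wrapper would keep one long-lived epoll instance shared by the scheduler rather than creating one per call.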
Scheduling Very much like an event-driven application Events are hidden from programmers
Synchronization Supports cooperative threading on single-CPU machines Requires only Boolean checks
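Because the threads are cooperative on a single CPU, a thread can only lose the processor at a yield point, so a mutex really can be a plain Boolean check. A minimal sketch, with `thread_yield` as a hypothetical scheduler hook (stubbed out here):

```c
/* Cooperative mutex: with user-level threads on one CPU, no
 * other thread can run between the check and the set, so a
 * plain Boolean flag suffices -- no atomic instructions and
 * no kernel crossing. */
typedef struct { int held; } coop_mutex;

/* Hypothetical scheduler hook; a real one switches threads. */
static void thread_yield(void) { /* no-op stub */ }

void coop_lock(coop_mutex *m) {
    while (m->held)      /* safe: we can't be preempted here */
        thread_yield();  /* let the holder run and release it */
    m->held = 1;
}

void coop_unlock(coop_mutex *m) {
    m->held = 0;
}
```

This is why the slide can claim uncontended locking costs only a few instructions, versus the atomic operations or syscalls kernel-thread mutexes need.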
Threading Microbenchmarks SMP, two 2.4 GHz Xeon processors 1 GB memory two 10 K RPM SCSI Ultra II hard drives Linux 2.5.70 Compared Capriccio, LinuxThreads, and Native POSIX Threads for Linux
Latencies of Thread Primitives (NPTL: Native POSIX Thread Library; times in microseconds)
Thread creation: Capriccio 21.5, LinuxThreads (not shown), NPTL 17.7
Thread context switch: Capriccio 0.24, LinuxThreads 0.71, NPTL 0.65
Uncontended mutex lock: Capriccio 0.04, LinuxThreads 0.14, NPTL 0.15
Thread Scalability Producers put empty messages into a shared buffer, and consumers “process” each message by looping for a random amount of time.
I/O Performance Network performance measured by passing a number of tokens among pipes Simulates the effect of slow client links 10% overhead compared to epoll Twice as fast as both LinuxThreads and NPTL with more than 1000 threads Disk I/O is comparable to kernel threads
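The token-passing benchmark can be sketched in miniature. This single-process simplification keeps the structure (a token hopping through a ring of pipes, each hop modeling one thread blocking on a read while another writes) without the thread scheduling the real benchmark measures.

```c
#include <unistd.h>

/* Minimal sketch of the token-passing network benchmark:
 * a token travels around a ring of pipes. Each hop stands in
 * for one user thread blocking on read() while another thread
 * performs the matching write(). Returns total hops, or -1. */
int pass_token(int npipes, int rounds) {
    int fds[16][2];
    if (npipes > 16)
        return -1;
    for (int i = 0; i < npipes; i++)
        if (pipe(fds[i]))
            return -1;
    char tok = 'T';
    int hops = 0;
    for (int r = 0; r < rounds; r++) {
        for (int i = 0; i < npipes; i++) {
            if (write(fds[i][1], &tok, 1) != 1) return -1;
            if (read(fds[i][0], &tok, 1) != 1) return -1;
            hops++;
        }
    }
    for (int i = 0; i < npipes; i++) {
        close(fds[i][0]);
        close(fds[i][1]);
    }
    return hops;
}
```

In the real benchmark, timing many such hops with thousands of threads exposes the scheduler and I/O-dispatch overhead being compared across Capriccio, LinuxThreads, and NPTL.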
I/O Performance [Figure: network I/O throughput vs. concurrency, with epoll_wait() as the baseline]
I/O Performance Benefit from kernel’s disk head scheduling algorithm since Capriccio uses asynchronous I/O primitives
Disk I/O with Buffer Cache At low miss rates, Capriccio's throughput is 50% of NPTL's The source of the overhead is the asynchronous I/O interface
Linked Stack Management LinuxThreads allocates 2MB per stack 1 GB of VM holds only 500 threads Fixed Stacks
Safety: Linked Stacks The problem: fixed stacks Overflow vs. wasted space The solution: linked stacks Allocate space as needed Compiler analysis Add runtime checkpoints Guarantee enough space until next check [Figure: fixed stack showing overflow and waste vs. linked stack]
Linked Stacks: Algorithm Build a weighted call graph Insert checkpoints What is a checkpoint? A point that determines whether there is enough stack space left to reach the next checkpoint [Figure: call graph with node weights 3, 3, 2, 5, 2, 4, 3, 6]
Placing Checkpoints One checkpoint on every cycle's back edge in the call graph Bound the stack growth between checkpoints by the deepest call path
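The run-time side of these checkpoints can be sketched as follows. The chunk bookkeeping here is a simplification (a real implementation switches the stack pointer into the new chunk and unlinks it on return); `needed` is the largest stack use on any path to the next checkpoint, bounded by MaxPath.

```c
#include <stdlib.h>

/* Simplified linked-stack chunk: remaining bytes plus a link
 * back to the previous chunk. */
typedef struct chunk {
    size_t free_bytes;
    struct chunk *prev;
} chunk;

#define MIN_CHUNK 4096  /* MinChunk: smallest chunk worth allocating */

/* The compiler inserts a call like this at each checkpoint.
 * If the current chunk can cover the worst-case path to the
 * next checkpoint, execution continues on it; otherwise a new
 * chunk is allocated and linked to the old stack. */
chunk *checkpoint(chunk *cur, size_t needed) {
    if (cur->free_bytes >= needed)
        return cur;  /* common case: enough space left */
    size_t sz = needed > MIN_CHUNK ? needed : MIN_CHUNK;
    chunk *c = malloc(sizeof *c);
    c->free_bytes = sz;
    c->prev = cur;   /* link new chunk onto the existing stack */
    return c;
}
```

The common case is a single comparison, which is why the per-checkpoint cost stays low even though the check runs on every cycle through the call graph.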
Linked Stacks: Algorithm Parameters: MaxPath, MinChunk Steps: break cycles, trace back Special cases: function pointers, external calls (use a large stack) [Figure, shown over several animation steps: call graph with node weights 3, 3, 2, 5, 2, 4, 3, 6, annotated with checkpoints c1-c5; MaxPath = 8]
Dealing with Special Cases Function pointers Don’t know what procedure to call at compile time Can find a potential set of procedures
Dealing with Special Cases External functions Allow programmers to annotate external library functions with trusted stack bounds Allow larger stack chunks to be linked for external functions
Tuning the Algorithm Some stack space will be wasted Tradeoffs Internal vs. external wasted space Number of stack links (MaxPath) vs. external fragmentation (MinChunk)
Memory Benefits No preallocation of large stacks Reduces the memory required to run large numbers of threads Better paging behavior, since stacks are used LIFO
Case Study: Apache 2.0.44 MaxPath: 2KB MinChunk: 4KB Apache under SPECweb99 Overall slowdown is about 3% Dynamic allocation 0.1% Link to large chunks for external functions 0.5% Stack removal 10%
Scheduling: The Blocking Graph Lessons from event systems; Capriccio does this for threads Each node is a location in the program where a thread blocked Nodes are identified by the stack trace at the blocking point Record information about thread behavior [Blocking graph for a web server: Accept, Read, Open, Read, Close, Write, Close]
Scheduling: The Blocking Graph Annotate each edge with its average running time (measured with the cycle counter) Annotate each node with how long its next edge will take on average Annotate the changes in resource usage
Resource-Aware Scheduling Keep track of resource usage levels and decide dynamically whether each resource is at its limit Annotate each node with the resources used on its outgoing edges, to predict the impact on each resource of scheduling threads from that node Dynamically prioritize nodes (and thus threads) for scheduling based on this information Increase use when a resource is underutilized; decrease use near saturation Advantages Operate close to the limit without thrashing Automatic admission control
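One plausible scoring rule for this prioritization can be sketched as follows. This is an illustrative formula, not Capriccio's actual policy: nodes whose outgoing edges release a nearly saturated resource are boosted, and nodes that consume it are penalized, with the penalty growing sharply as utilization approaches 1.

```c
/* Hypothetical priority score for a blocking-graph node.
 * usage[r] is the node's annotated per-edge change in
 * resource r (positive = consumes, negative = releases);
 * util[r] is current utilization of resource r in [0,1). */
#define NRES 3  /* e.g. memory, file descriptors, CPU */

double node_priority(const double usage[NRES], const double util[NRES]) {
    double score = 0.0;
    for (int r = 0; r < NRES; r++) {
        /* weight grows sharply as the resource nears saturation */
        double pressure = util[r] / (1.0 - util[r] + 1e-9);
        score -= usage[r] * pressure;  /* consuming a scarce resource hurts */
    }
    return score;
}
```

With a rule like this, threads at nodes that free memory get scheduled first when memory is nearly exhausted, which is the behavior the slide describes as automatic admission control.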
Track Resources Memory usage: via the malloc() family Memory resource limit: by watching page-fault activity File descriptors: by tracking open() and close() calls File-descriptor limit: by estimating the number of open connections at which response time jumps up
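Tracking memory through the malloc() family amounts to interposing a counting wrapper. A minimal sketch, with the allocation size stashed in a hidden header so that free can subtract it (a simplification of what a real interposer does):

```c
#include <stdlib.h>

/* Running total of live allocations, visible to the scheduler
 * as the current memory-usage level. */
static size_t bytes_in_use = 0;

/* Allocate n usable bytes, recording n in a hidden header. */
void *tracked_malloc(size_t n) {
    size_t *p = malloc(n + sizeof(size_t));
    if (!p)
        return NULL;
    *p = n;              /* remember the size for tracked_free */
    bytes_in_use += n;
    return p + 1;        /* hand back the region past the header */
}

void tracked_free(void *ptr) {
    if (!ptr)
        return;
    size_t *p = (size_t *)ptr - 1;  /* step back to the header */
    bytes_in_use -= *p;
    free(p);
}

size_t memory_usage(void) { return bytes_in_use; }
```

File descriptors can be tracked the same way with thin wrappers around open() and close() that increment and decrement a counter.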
Pitfalls Tricky to determine the maximum capacity of a resource Thrashing depends on the workload The disk can handle more sequential requests than random ones Resources interact (e.g., VM vs. disk) Applications may manage memory themselves
Yield Profiling User-level threads are problematic if a thread fails to yield Such threads are easy to detect, since their running times are orders of magnitude longer than the normal interval between yields Yield profiling identifies places where programs fail to yield sufficiently often
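The detection step reduces to recording a timestamp at each yield and flagging gaps far above the norm. A minimal sketch; timestamps are passed in explicitly (e.g. from a cycle counter) to keep it deterministic:

```c
/* Yield-profiling sketch: given the timestamps recorded at
 * successive yield points, count how many inter-yield gaps
 * exceed `threshold` -- each long gap marks a stretch of code
 * that ran too long without yielding. */
int count_long_gaps(const long stamps[], int n, long threshold) {
    int bad = 0;
    for (int i = 1; i < n; i++)
        if (stamps[i] - stamps[i - 1] > threshold)
            bad++;
    return bad;
}
```

A real profiler would also capture a stack trace when a long gap is detected, pinpointing the code location that failed to yield.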
Web Server Performance 4x500 MHz Pentium server 2GB memory Intel e1000 Gigabit Ethernet card Linux 2.4.20 Workload: requests for 3.2 GB of static file data
Web Server Performance Request frequencies match those of SPECweb99 Each client connects to the server repeatedly and issues a series of five requests, separated by 20 ms pauses Apache's performance improved by 15% with Capriccio
Runtime Overhead Tested with Apache 2.0.44 Stack linking: 78% slowdown for a null call, 3-4% overall Resource statistics: 2% (always on), 0.1% (with sampling) Stack traces: 8% overhead
Resource-Aware Admission Control Touching pages too quickly will cause thrashing Producer threads loop, adding memory to a global pool and randomly touching pages to force them to stay in memory Consumer threads loop, removing memory from the global pool and freeing it Capriccio quickly detects the overload condition and limits the number of producers
Related Work Programming models for high concurrency User-level threads (Capriccio is unique: blocking graph, resource-aware scheduling, targets large numbers of blocking threads, POSIX compliant) Application-specific optimization Stack management Resource-aware scheduling
Future Work Multi-CPU machines Improve resource-aware scheduler and stack analysis Produce profiling tools to help tune Capriccio’s stack parameters to the application’s needs.
Thanks! Questions?